Chapter 29 String manipulation
29.1 String concatenation
The dot operator (.) concatenates two strings.
# Concatenate two strings and assign the result to $z
$z = $x . $y;
# Append $y to $x (long form)
$x = $x . $y;
# Append $y to $x (shorthand form)
$x .= $y;
A more convenient way to append one variable to another is the .= operator.
As with any other assignment in Perl, a statement of the form $x = $x op expr, where op stands for an operator and expr for the rest of the statement, can be shortened by moving op in front of the equals sign: $x op= expr. The string concatenation operator . is just one possible op among many others, such as +, -, * and /.
my $x = 5;
my $y = 6;
# Add $y to $x
$x = $x + $y;
# Add $y to $x
$x += $y;
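The same shorthand can be sketched for a few more operators. This is a minimal illustration only; the variable names $count and $dna are made up for the example and do not appear elsewhere in the chapter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each statement uses the op= form; the comment shows the equivalent long form.
my $count = 10;
$count -= 4;      # same as $count = $count - 4;  $count is now 6
$count *= 3;      # same as $count = $count * 3;  $count is now 18
$count /= 2;      # same as $count = $count / 2;  $count is now 9

my $dna = 'ATG';
$dna .= 'TAA';    # same as $dna = $dna . 'TAA';  $dna is now 'ATGTAA'

print "$count $dna\n";
## 9 ATGTAA
```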
29.2 Substring extraction
29.3 Substring search
Index
29.4 Split String
split
join
29.5 Regular expression
A regular expression is a string of characters that defines the pattern or patterns you are searching for. Such a pattern is typically used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation.
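Regular expressions are applied to a string with the pattern binding operators =~ and !~. The first, =~, tests whether the string matches the pattern and, for a substitution, stores the modified text back in the string. The second, !~, is its negation: it is true when the string does not match.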
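A minimal sketch of both binding operators follows. The variable names ($seq, $has_start, $no_unknown, $rna) are chosen for this illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = 'ATGCGTTAA';

# =~ as a test: true if $seq begins with the start codon ATG
my $has_start = ($seq =~ /^ATG/) ? 1 : 0;

# !~ as a negated test: true if $seq contains no N character
my $no_unknown = ($seq !~ /N/) ? 1 : 0;

# =~ with a substitution: copy $seq into $rna, then change the copy in place
(my $rna = $seq) =~ s/T/U/g;

print "$has_start $no_unknown $rna\n";
## 1 1 AUGCGUUAA
```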
In Perl, there are three scenarios in which you may use regular expressions:
Pattern matching:
m//
Pattern substitution:
s///
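Transliteration (character-by-character replacement; it uses the same binding operators, although its arguments are character lists rather than regular expressions):
tr///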
In each case above, forward slashes are used as delimiters for the regular expression you specify.
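The three operators can be sketched side by side as follows. The strings and variable names are invented for this example. The last statement relies on the fact that Perl also accepts other delimiter pairs, such as braces, which is convenient when the pattern itself contains forward slashes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = 'ATGCATGC';

# m// : does $text contain GCA?
my $found = ($text =~ m/GCA/) ? 1 : 0;

# s/// : replace every ATG with NNN in a copy of $text
(my $masked = $text) =~ s/ATG/NNN/g;

# tr/// : map each character to its complementary base in a copy of $text
(my $complement = $text) =~ tr/ATGC/TACG/;

# m{} : the same match operator with braces as delimiters
my $path   = '/usr/bin/perl';
my $in_bin = ($path =~ m{^/usr/bin/}) ? 1 : 0;

print "$found $masked $complement $in_bin\n";
## 1 NNNCNNNC TACGTACG 1
```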
29.5.1 Pattern matching
In Perl, the match operator m// is used to match a string (which could be a sequence in bioinformatics) against a regular expression. For example, to test whether the scalar $bar contains the character sequence "foo", you might use a statement like this:
#!/usr/bin/perl
use strict;
use warnings;
my $bar = "ATG";    # a short nucleotide sequence; it does not contain "foo"
if ($bar =~ /foo/) {
    print "First time is matching\n";
} else {
    print "First time is not matching\n";
}
## First time is not matching
Several special variables also refer back to portions of the previous match.
Several special variables
Special variable | Description |
---|---|
$+ | Whatever the last bracket match matched |
$& | The entire matched string |
$` | Everything before the matched string |
$' | Everything after the matched string |
$^N | Whatever was matched by the most-recently closed group (submatch) |
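The variables in the table can be inspected right after a successful match, as in the following sketch. The sequence and the pattern are invented for this illustration; the nested groups are there only so that $+ and $^N give different answers. The values are copied into ordinary variables immediately, because the next successful match overwrites them.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $seq = 'CCATGGCTTAAGG';
my ($before, $matched, $after, $last_bracket, $last_closed);

# Group 1 is (ATG plus one codon); group 2, nested inside it, is the codon.
if ($seq =~ /(ATG([ACGT]{3}))(?:TAA|TAG|TGA)/) {
    $before       = $`;    # text before the match
    $matched      = $&;    # the whole match
    $after        = $';    # text after the match
    $last_bracket = $+;    # highest-numbered group that matched (group 2)
    $last_closed  = $^N;   # group whose closing parenthesis came last (group 1)
}

print "$before $matched $after $last_bracket $last_closed\n";
## CC ATGGCTTAA GG GCT ATGGCT
```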
29.5.2 Pattern substitution
29.5.3 Modifiers to pattern matching and substitution
29.5.4 Greedy or non-greedy
29.5.5 Practical Perl for regular expressions (Advanced)
#!/usr/bin/perl -l
# http://perlmonks.org/?node_id=1146191
use strict;
use warnings;
my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
while ($sequence =~ /ATG/g) {
    ## Post match: $'
    ## The string following whatever was matched by the last successful pattern match
    my $rest = $';
    while ($rest =~ /(TAG|TAA|TGA)/g) {
        my $output = 'ATG' . $` . $1;
        print $output;
    }
}
## ATGGTTTCTCCCATCTCTCCATCGGCATAA
## ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA
## ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA
## ATGATCTAA
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1146191
use strict;
use warnings;
my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
$sequence =~ /(ATG.*?(?:TAG|TAA|TGA))(??{print "$1\n" if (length $1)%3 == 0})/;
## ATGGTTTCTCCCATCTCTCCATCGGCATAA
#!/usr/bin/perl
use strict;
use warnings;
my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';
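while ($sequence =~ /ATG(([ATGC]{3})+?)(TAG|TAA|TGA)/g) {
    print "ATG$1\n";    # start codon plus every codon before the stop codon
    print "$2\n";       # last codon captured by the repeated inner group
}
## ATGGTTTCTCCCATCTCTCCATCGGCA
## GCA
## ATGATC
## ATC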
The pattern starts by matching the start codon ATG. The part we want to capture is inside parentheses: at least one group of three arbitrary nucleotides, ([ATGC]{3}). Finally, the match must end with one of the three stop codons. Looking by eye, we can see two such matches in the string: GGAATACGCGGA and CACCACGACGCCATAATT. With a plain + quantifier, however, the script finds only one sequence, starting after the first start codon and ending at the last stop codon: GGAATACGCGGATAAGGCGAAATGCACCACGACGCCATAATT. This is because + matches as many repetitions as possible; it is greedy. To stop it from being greedy, add a ? after it, which gives +? as in the script above.
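The difference between the two quantifiers can be sketched on the $sequence string used in the scripts of this section, rather than on the sequences quoted in the paragraph above. The variable names $greedy and $non_greedy are chosen for this illustration, and the patterns use non-capturing groups (?:...) so that only the whole open reading frame is captured.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

# Greedy: + takes as many codons as it can, so the match runs to the
# last in-frame stop codon.
my ($greedy) = $sequence =~ /(ATG(?:[ATGC]{3})+(?:TAG|TAA|TGA))/;

# Non-greedy: +? takes as few codons as it can, so the match ends at the
# first in-frame stop codon.
my ($non_greedy) = $sequence =~ /(ATG(?:[ATGC]{3})+?(?:TAG|TAA|TGA))/;

print "greedy:     $greedy\n";
print "non-greedy: $non_greedy\n";
## greedy:     ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA
## non-greedy: ATGGTTTCTCCCATCTCTCCATCGGCATAA
```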