Chapter 29 String manipulation

29.1 String concatenation

Dot (.) can be used to concatenate two strings together.

# concatenate two strings together and assing to $z
$z = $x . $y;
# Append $y to $x
$x = $x . $y;
# Append $y to $x
$x .= $y;

A more convenient way is to use operator .= to append one variable to another.

As is any other assignments in Perl, if you see an assignment written this way $x = $x op expr, where op stands for any operator and expr stands for the rest of the statement, you can make a shorter version by moving the op to the front of the assignment, e.g., $x op= expr. The string concatenation operator . is just one possible op among many others such as +, -, * and /.

my $x = 5;
my $y = 6;
# Add $y to $x
$x = $x + $y;
# Add $y to $x
$x += $y;

29.2 Substring extraction

29.4 Split String

split

join

29.5 Regular expression

A regular expresion is a string of characters that defines the pattern or patterns you are searching. Usually this pattern is used by string searching algorithms for ‘find’ or ‘find and replace’ operations on strings or for input validation.

You can use pattern binding operators =~ and !~. The first operator is a test and assignment operator.

In Perl, there are three three scenarios you may use regular expression:

  • Pattern matching: m//

  • Pattern substitution: s///

  • Modifiers to pattern matching and substitution: tr///

In each case above, the forward slashes is used as delimiters for regular expression specified by you.

29.5.1 Pattern matching

In Perl, m// is used to match a string (could be sequence in Bioinformatics) to a regular expression. For example, to match a mRNA sequence $mRNA against the mRNA sequence ‘$mRNA’

The match operator, m//, is used to match a string or statement to a regular expression. For example, to match the character sequence “foo” against the scalar $bar, you might use a statement like this −

#!/usr/bin/perl

$mRNA = "ATG";
if ($bar =~ /foo/) {
   print "First time is matching\n";
} else {
   print "First time is not matching\n";
}
## First time is not matching

Several special variables also refer back to portions of the previous match.

(ref:regexp_perl) Several special variables

(#tab:regexp_perl)(ref:regexp_perl)
Special.variables Description
$+ Whatever the last bracket match matched
$& The entire matched string
$` Everything before the matched string
$’ Everything after the marched string
$^N Whatever was matched by the most-recently closed group (submatch)

29.5.2 Pattern substitution

29.5.3 Modifiers to pattern matching and substitution

29.5.4 Greedy or non-greedy

29.5.5 Practical Perl for regular expresssion (Advanced)

#!/usr/bin/perl -l

# http://perlmonks.org/?node_id=1146191

use strict;
use warnings;

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

while( $sequence =~ /ATG/g ){
    ## Post match: $':
    ## The string following whatever was matched by the last successful pattern match 
    my $rest = ;
    while($rest =~ /(TAG|TAA|TGA)/g){
        my $output = 'ATG' .  . ;
        print $output;
    }
}
## ATGGTTTCTCCCATCTCTCCATCGGCATAA
## ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGA
## ATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAA
## ATGATCTAA
#!/usr/bin/perl

# http://perlmonks.org/?node_id=1146191

use strict;
use warnings;

my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

$sequence =~ /(ATG.*?(?:TAG|TAA|TGA))(??{print "$1\n" if (length $1)%3 == 0})/;
## ATGGTTTCTCCCATCTCTCCATCGGCATAA
#!/usr/bin/perl

use strict;
use warnings;
my $sequence = 'AATGGTTTCTCCCATCTCTCCATCGGCATAAAAATACAGAATGATCTAACGAA';

while($sequence =~ /ATG(([ATGC]{3})+?)(TAG|TAA|TGA)/g){
    print "ATG\n";
    print "";
}
## ATGGTTTCTCCCATCTCTCCATCGGCA
## GCAATGATC
## ATC

We start with matching the start codon ATG. Then the thing we want to match is inside a parenthesis and it is at least one group of three arbitrary nucleotides ([ATGC]{3}). Finally we want to end with either of the three stop codons. Now if we look by eye, we can see that there are two such matches in the string: GGAATACGCGGA and CACCACGACGCCATAATT, but if we run the script it will only find one sequence starting after the first start codon and ending with the last stop codon: GGAATACGCGGATAAGGCGAAATGCACCACGACGCCATAATT. This is because the + character matches as many as possible. It is greedy. To stop it from being greedy we need to add a ? sign after it: