Chapter 30 Input and output in Perl
30.1 Input
30.1.1 Standard input
30.1.2 Input from a file on the disk
#!/usr/bin/perl
use warnings;
use strict;
my $fasta = "./data/test_ref.fa";
open FASTA, $fasta or die "$!: $fasta";
while(my $line = <FASTA>){
chomp $line;
print "$line\n";
}close FASTA;
perl code_perl/open_file.pl
## >chr1
## ACGCTAGCTAGTCAGTCGATCGT
## CGTAGCTAGCTAG
## >chr2
## CTGCGGGCTAAATCGATCGATCG
## GTACGTACGAGCTAGCTA
## >chr3
## CTGCGGGCTAAATCGATCGATCG
## GTACGTACGAGCTAGCTA
30.1.3 Special variable $_
#!/usr/bin/perl
use warnings;
use strict;
my $fasta = "./data/test_ref.fa";
open FASTA, $fasta or die "$!: $fasta";
while(my $line = <FASTA>){
chomp $line;
print "$line\n";
}close FASTA;
When you are using while
and foreach
, the content will be assigned to $_
if you don’t assign it explicitly to any variable.
For some build-in functions (chomp
, length
, split
and print
),
I do NOT suggest you to use $_
because $_
will reduce the readerbility of the code.
In Perl, $_ is a powerful In Perl, several functions and operators use this variable as a default, in case no parameter is explicitly used. In general, I’d say you should NOT see $_ in real code. I think the whole point of $_ is that you don’t have to write it explicitly.
30.1.4 The input record separator: $/
$/
is the input record separator. By default, $/
equals newline (\n
).
You could actually change the value of $/
. For example, by assigning the greate-than ‘>’ like this: $/ = ">"
;. Then every call to the read-line operator $chunk = <FILEHANDLE>
will read in all the characters up-to and including the first ‘>.’
We could also assign longer strings to $/
and then that would be the input record separator.
30.1.4.1 How to process fasta
file by changing $/
FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes (A, T, G, C, etc.).
In fasta
format, each sequence begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column.
Here is an example of fasta file (Figure 30.1).
#!/usr/bin/perl
use warnings;
use strict;
my $fasta = "./data/test_ref.fa";
open FASTA, $fasta or die "$!: $fasta";
# Set the input record separator
$/ = "\n>";
# Print header
print "Chrom\tLength\n";
while(my $chunk = <FASTA>){
# Remove > in the first chunk
$chunk =~ s/^>//;
# Remove \n> from the end of the chunk
chomp $chunk;
# Assign ID and seq to two variables
my ($chrom, $seq) = split(/\n/, $chunk, 2);
# Remove \n in the string
$seq =~ s/\n//g;
# Calculate the length of sequence.
my $seq_length = length $seq;
# Print ID and length of the sequence.
print "$chrom\t$seq_length\n";
}close FASTA;
You can consider the content in the fasta file as a string (Figure 30.2) sperarated by \n>
.
We set the input record separator $/
as \n>
.
In the first cycle of the while
loop, the first chunk is >chr1\nACGCTAGCTAGTCAGTCGATCGT\nCGTAGCTAGCTAG\n>
.
You should notice that there is a leading >
in the first chunk. So you need to use s/^>//
to remove the leading >
.
To remove the trailing \n>
, you can use chomp $chunk;
. Then the string is:
chr1\nACGCTAGCTAGTCAGTCGATCGT\nCGTAGCTAGCTAG
The string before first \n
is the sequence ID. The string after the first \n
is the sequence. To get the sequnce ID and sequnce and assign to two variables, you can use split(/\n/, $chunk, 2)
. Here LIMIT is set to 2, it represents the maximum number (here 2) of fields into which the EXPR may be split.
The goal of this code is to output the sequence IDs and their lengths. Currently the string of the variable ($seq
) is ACGCTAGCTAGTCAGTCGATCGT\nCGTAGCTAGCTAG
. To do this, you first need to remove \n
in the sequence by using $seq =~ s/\n//g;
. Now the string of the variable ($seq
) is ACGCTAGCTAGTCAGTCGATCGTCGTAGCTAGCTAG
. Then built-in function length
can be used to calculate the length of the sequence. The function print
is used to output the sequnce ID and the length of the sequence.
In the second cycles of the while
loop, the chunk is
chr2\nCTGCGGGCTAAATCGATCGATCG\nGTACGTACGAGCTAGCTA\n>
.
Unlike the first chunk, there is no leading >
in the second chunk. So the code s/^>//
will do nothing for the second chunk.
The following steps is the same as you can do for the first cycle.
In the third cycle of the while
loop, you can do the same thing on the third sequence.
perl code_perl/fa_seq_len.pl
## Chrom Length
## chr1 36
## chr2 41
## chr3 41
30.2 Output to a file on the disk
#!/usr/bin/perl
use warnings;
use strict;
my $fasta = "./data/test_ref.fa";
my $out_len = "./data/test_ref_len.txt";
open FASTA, $fasta or die "Can't open file for reading: $! $fasta";
open OUT, "+>$out_len" or die "Can't open file for writing: $! $out_len";
# Set the input record separator
$/ = "\n>";
# Print header
print OUT "Chrom\tLength\n";
while(my $chunk = <FASTA>){
# Remove > in the first chunk
$chunk =~ s/^>//;
# Remove \n> from the end of the chunk
chomp $chunk;
# Assign ID and seq to two variables
my ($chrom, $seq) = split(/\n/, $chunk, 2);
# Remove \n in the string
$seq =~ s/\n//g;
# Calculate the length of sequence.
my $seq_length = length $seq;
# Print ID and length of the sequence.
print OUT "$chrom\t$seq_length\n";
}close FASTA;
close OUT;
perl code_perl/fa_seq_len_out.pl
cat ./data/test_ref_len.txt
## Chrom Length
## chr1 36
## chr2 41
## chr3 41
30.3 Arguments from command line
Imagine you have 5 fasta files and you want to calculate sequence lengths for all the 5 files, if you modify the file each time you run the code, it will be very tedious.
Perls provides an array called @ARGV
. @ARGV
holds all the arguments from the command line. The first one will be $ARGV[0]
.
@ARGV
will automatically hold all the arguments. If no arguments are provided, @ARGV
will be empty.
The following example shows you how it looks like in real code.
#!/usr/bin/perl
use warnings;
use strict;
my ($fasta, $out_len) = @ARGV;
print "First arguments \$ARGV[0] : $ARGV[0]\n";
print "Second arguments \$ARGV[0]: $ARGV[1]\n";
print "The variable \$fasta: $fasta\n";
print "The variable \$out_len: $out_len\n";
perl code_perl/test_argv.pl test_ref.fa test_ref_len.txt
## First arguments $ARGV[0] : test_ref.fa
## Second arguments $ARGV[0]: test_ref_len.txt
## The variable $fasta: test_ref.fa
## The variable $out_len: test_ref_len.txt
In the above example, the first argument is $ARGV[0]
which stores the file name: test_ref.fa. The second one stores the file name: test_ref_len.txt. It’s highly recommended to assign the values in @ARGV
to some variables that the readers (including yourself) can understand the variables based on the names.
Here I’ll show you how to use ARGV using an example. The code below will show you what to do if you want to filter a fasta file based on the sequence length.
#!/usr/bin/perl
use warnings;
use strict;
my ($fasta, $out_len) = @ARGV;
open FASTA, $fasta or die "Can't open file for reading: $! $fasta";
open OUT, "+>$out_len" or die "Can't open file for writing: $! $out_len";
# Set the input record separator
$/ = "\n>";
# Print header
print OUT "The input file name is: $fasta\n";
print OUT "Chrom\tLength\n";
while(my $chunk = <FASTA>){
# Remove > in the first chunk
$chunk =~ s/^>//;
# Remove \n> from the end of the chunk
chomp $chunk;
# Assign ID and seq to two variables
my ($chrom, $seq) = split(/\n/, $chunk, 2);
# Remove \n in the string
$seq =~ s/\n//g;
# Calculate the length of sequence.
my $seq_length = length $seq;
# Print ID and length of the sequence.
print OUT "$chrom\t$seq_length\n";
}close FASTA;
close OUT;
For example if you want to remove the sequences that is shorter than 30 bps, you can use the following example.
printf "Before filtering:\n"
cat ./data/test_ref2.fa
perl code_perl/fa_seq_len_out_argv_fil.pl ./data/test_ref2.fa 30 ./data/test_ref2_30.fa
printf "\nAfter filtering: \n"
cat ./data/test_ref2_30.fa
## Before filtering:
## >chr1
## ACGCTAGCTAGTCAGTCGATCGT
## CGTAGCTAGCTAG
## >chr2
## CTGCGGGCTAAATCGATCGATCG
## GTACGTACGAGCTAGCTAA
## >chr3
## CTGCGGGCTAAATCGATCGATCG
## GTACGTACGAG
## >chr4
## CTGCGGGCTAAATCAGCTAA
## >chr5
## CTGCTCGATCGATCGACGAGCTA
## GCTA
## After filtering:
## The input file name is: ./data/test_ref2.fa
## Chrom Length
## >chr1 36
## ACGCTAGCTAGTCAGTCGATCGTCGTAGCTAGCTAG
## >chr2 42
## CTGCGGGCTAAATCGATCGATCGGTACGTACGAGCTAGCTAA
## >chr3 34
## CTGCGGGCTAAATCGATCGATCGGTACGTACGAG
If you have multiple files, you can just change the file names in the command line. It’s very handy. You can also change the length cutoff in the command line.