Chapter 30 Input and output in Perl

30.1 Input

30.1.1 Standard input

30.1.2 Input from a file on the disk

#!/usr/bin/perl
use warnings;
use strict;

my $fasta = "./data/test_ref.fa";

open FASTA, $fasta or die "$!: $fasta";
while(my $line = <FASTA>){
    chomp $line;
    print "$line\n";
}
close FASTA;

perl code_perl/open_file.pl

## >chr1
## ACGCTAGCTAGTCAGTCGATCGT
## CGTAGCTAGCTAG
## >chr2
## CTGCGGGCTAAATCGATCGATCG
## GTACGTACGAGCTAGCTA
## >chr3
## CTGCGGGCTAAATCGATCGATCG
## GTACGTACGAGCTAGCTA

30.1.3 Special variable `$_`

#!/usr/bin/perl
use warnings;
use strict;

my $fasta = "./data/test_ref.fa";

open FASTA, $fasta or die "$!: $fasta";
while(my $line = <FASTA>){
    chomp $line;
    print "$line\n";
}
close FASTA;

When you are using while and foreach, the content will be assigned to $_ if you don’t assign it explicitly to any variable.

For some build-in functions (chomp, length, split and print),

I do NOT suggest you to use $_ because $_ will reduce the readerbility of the code.

In Perl, $_ is a powerful In Perl, several functions and operators use this variable as a default, in case no parameter is explicitly used. In general, I’d say you should NOT see $_ in real code. I think the whole point of $_ is that you don’t have to write it explicitly.

30.1.4 The input record separator: `$/`

$/ is the input record separator. By default, $/ equals newline (\n).

You could actually change the value of $/. For example, by assigning the greate-than ‘>’ like this: $/ = ">";. Then every call to the read-line operator $chunk = <FILEHANDLE> will read in all the characters up-to and including the first ‘>.’

We could also assign longer strings to $/ and then that would be the input record separator.

30.1.4.1 How to process `fasta` file by changing `$/`

FASTA format is a text-based format for representing either nucleotide sequences or peptide sequences, in which base pairs or amino acids are represented using single-letter codes (A, T, G, C, etc.).

In fasta format, each sequence begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (“>”) symbol in the first column.

Here is an example of fasta file (Figure 30.1).

Figure 30.1: Protein sequence of PYL10 in Arabidopsis.

#!/usr/bin/perl
use warnings;
use strict;

my $fasta = "./data/test_ref.fa";

open FASTA, $fasta or die "$!: $fasta";
# Set the input record separator
$/ = "\n>";
# Print header
print "Chrom\tLength\n";
while(my $chunk = <FASTA>){
    # Remove > in the first chunk
    $chunk =~ s/^>//;
    # Remove \n> from the end of the chunk
    chomp $chunk;
    # Assign ID and seq to two variables
    my ($chrom, $seq) = split(/\n/, $chunk, 2);
    # Remove \n in the string
    $seq =~ s/\n//g;
    # Calculate the length of sequence.
    my $seq_length = length $seq;
    # Print ID and length of the sequence.
    print "$chrom\t$seq_length\n";
}
close FASTA;

You can consider the content in the fasta file as a string (Figure 30.2) sperarated by \n>.

We set the input record separator $/ as \n>.

In the first cycle of the while loop, the first chunk is >chr1\nACGCTAGCTAGTCAGTCGATCGT\nCGTAGCTAGCTAG\n>.

You should notice that there is a leading > in the first chunk. So you need to use s/^>// to remove the leading >.

To remove the trailing \n>, you can use chomp $chunk;. Then the string is:

chr1\nACGCTAGCTAGTCAGTCGATCGT\nCGTAGCTAGCTAG

The string before first \n is the sequence ID. The string after the first \n is the sequence. To get the sequnce ID and sequnce and assign to two variables, you can use split(/\n/, $chunk, 2). Here LIMIT is set to 2, it represents the maximum number (here 2) of fields into which the EXPR may be split.

The goal of this code is to output the sequence IDs and their lengths. Currently the string of the variable ($seq) is ACGCTAGCTAGTCAGTCGATCGT\nCGTAGCTAGCTAG. To do this, you first need to remove \n in the sequence by using $seq =~ s/\n//g;. Now the string of the variable ($seq) is ACGCTAGCTAGTCAGTCGATCGTCGTAGCTAGCTAG. Then built-in function length can be used to calculate the length of the sequence. The function print is used to output the sequnce ID and the length of the sequence.

In the second cycles of the while loop, the chunk is chr2\nCTGCGGGCTAAATCGATCGATCG\nGTACGTACGAGCTAGCTA\n>.

Unlike the first chunk, there is no leading > in the second chunk. So the code s/^>// will do nothing for the second chunk.

The following steps is the same as you can do for the first cycle.

In the third cycle of the while loop, you can do the same thing on the third sequence.

Figure 30.2: Fasta string.

perl code_perl/fa_seq_len.pl

## Chrom    Length
## chr1 36
## chr2 41
## chr3 41

30.2 Output to a file on the disk

#!/usr/bin/perl
use warnings;
use strict;

my $fasta   = "./data/test_ref.fa";
my $out_len = "./data/test_ref_len.txt";

open FASTA, $fasta or die "Can't open file for reading: $! $fasta";
open OUT, "+>$out_len" or die "Can't open file for writing: $! $out_len";
# Set the input record separator
$/ = "\n>";
# Print header
print OUT "Chrom\tLength\n";
while(my $chunk = <FASTA>){
    # Remove > in the first chunk
    $chunk =~ s/^>//;
    # Remove \n> from the end of the chunk
    chomp $chunk;
    # Assign ID and seq to two variables
    my ($chrom, $seq) = split(/\n/, $chunk, 2);
    # Remove \n in the string
    $seq =~ s/\n//g;
    # Calculate the length of sequence.
    my $seq_length = length $seq;
    # Print ID and length of the sequence.
    print OUT "$chrom\t$seq_length\n";
}
close FASTA;
close OUT;

perl code_perl/fa_seq_len_out.pl
cat ./data/test_ref_len.txt

## Chrom    Length
## chr1 36
## chr2 41
## chr3 41

30.3 Arguments from command line

Imagine you have 5 fasta files and you want to calculate sequence lengths for all the 5 files, if you modify the file each time you run the code, it will be very tedious.

Perls provides an array called @ARGV. @ARGV holds all the arguments from the command line. The first one will be $ARGV[0].

@ARGV will automatically hold all the arguments. If no arguments are provided, @ARGV will be empty.

The following example shows you how it looks like in real code.

#!/usr/bin/perl
use warnings;
use strict;

my ($fasta, $out_len) = @ARGV; 
print "First arguments \$ARGV[0] : $ARGV[0]\n";
print "Second arguments \$ARGV[0]: $ARGV[1]\n";

print "The variable \$fasta:   $fasta\n";
print "The variable \$out_len: $out_len\n";

perl code_perl/test_argv.pl test_ref.fa test_ref_len.txt

## First arguments $ARGV[0] : test_ref.fa
## Second arguments $ARGV[0]: test_ref_len.txt
## The variable $fasta:   test_ref.fa
## The variable $out_len: test_ref_len.txt

In the above example, the first argument is $ARGV[0] which stores the file name: test_ref.fa. The second one stores the file name: test_ref_len.txt. It’s highly recommended to assign the values in @ARGV to some variables that the readers (including yourself) can understand the variables based on the names.

Here I’ll show you how to use ARGV using an example. The code below will show you what to do if you want to filter a fasta file based on the sequence length.

#!/usr/bin/perl
use warnings;
use strict;

my ($fasta, $out_len) = @ARGV; 

open FASTA, $fasta or die "Can't open file for reading: $! $fasta";
open OUT, "+>$out_len" or die "Can't open file for writing: $! $out_len";
# Set the input record separator
$/ = "\n>";
# Print header
print OUT "The input file name is: $fasta\n";
print OUT "Chrom\tLength\n";
while(my $chunk = <FASTA>){
    # Remove > in the first chunk
    $chunk =~ s/^>//;
    # Remove \n> from the end of the chunk
    chomp $chunk;
    # Assign ID and seq to two variables
    my ($chrom, $seq) = split(/\n/, $chunk, 2);
    # Remove \n in the string
    $seq =~ s/\n//g;
    # Calculate the length of sequence.
    my $seq_length = length $seq;
    # Print ID and length of the sequence.
    print OUT "$chrom\t$seq_length\n";
}
close FASTA;
close OUT;

For example if you want to remove the sequences that is shorter than 30 bps, you can use the following example.

printf "Before filtering:\n"
cat ./data/test_ref2.fa
perl code_perl/fa_seq_len_out_argv_fil.pl ./data/test_ref2.fa 30 ./data/test_ref2_30.fa

printf "\nAfter filtering: \n"
cat ./data/test_ref2_30.fa

## Before filtering:
## >chr1
## ACGCTAGCTAGTCAGTCGATCGT
## CGTAGCTAGCTAG
## >chr2
## CTGCGGGCTAAATCGATCGATCG
## GTACGTACGAGCTAGCTAA
## >chr3
## CTGCGGGCTAAATCGATCGATCG
## GTACGTACGAG
## >chr4
## CTGCGGGCTAAATCAGCTAA
## >chr5
## CTGCTCGATCGATCGACGAGCTA
## GCTA
## After filtering: 
## The input file name is: ./data/test_ref2.fa
## Chrom    Length
## >chr1 36
## ACGCTAGCTAGTCAGTCGATCGTCGTAGCTAGCTAG
## >chr2 42
## CTGCGGGCTAAATCGATCGATCGGTACGTACGAGCTAGCTAA
## >chr3 34
## CTGCGGGCTAAATCGATCGATCGGTACGTACGAG

If you have multiple files, you can just change the file names in the command line. It’s very handy. You can also change the length cutoff in the command line.