Chapter 4 File content filtering

4.1 File Filtering

4.1.1 Column filtering

4.1.2 cut

cut can be used to print selected parts of lines from each FILE to standard output.
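
As a minimal sketch of column filtering with cut, the example below builds a small tab-delimited table and selects fields from it. The file name demo_deg.tsv and its contents are hypothetical and exist only for illustration.

```shell
# Build a small tab-delimited table (hypothetical data, for illustration only)
tmpdir=$(mktemp -d)
printf 'gene_id\tlog2fc\tp-val\n'  > "$tmpdir/demo_deg.tsv"
printf 'gene1\t2\t0.01\n'         >> "$tmpdir/demo_deg.tsv"
printf 'gene2\t3\t0.04\n'         >> "$tmpdir/demo_deg.tsv"

# -f selects fields; the default field delimiter is a tab
first_col=$(cut -f 1 "$tmpdir/demo_deg.tsv")
# Field lists accept comma-separated values and ranges such as 1-3
cols_1_3=$(cut -f 1,3 "$tmpdir/demo_deg.tsv")
# -d sets a different delimiter, here a comma
csv_second=$(printf 'a,b,c\n' | cut -d ',' -f 2)

printf '%s\n' "$first_col" "$cols_1_3" "$csv_second"
rm -r "$tmpdir"
```

Because cut splits on a single delimiter character, it suits tab- or comma-separated tables better than text aligned with a variable number of spaces.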

cut sort uniq wc grep

https://www.youtube.com/playlist?list=PLtK75qxsQaMLZSo7KL-PmiRarU7hrpnwK

4.1.3 Row filtering

4.1.3.1 grep


The grep command, which stands for “global regular expression print”, processes text line by line and prints every line that matches a specified pattern. The pattern can be a fixed string, a word, or a regular expression, and grep searches either the given files or its standard input. By default, grep displays the matching lines.

grep 'WRKY' data/gene_annotation.txt
## gene2    WRKY
## gene4    WRKY1
## gene5    WRKY2
grep 'WRKY' data/gene_annotation.txt  |wc -l 
## 3
grep -i 'WRKY' data/gene_annotation.txt 
## gene2    WRKY
## gene4    WRKY1
## gene5    WRKY2
## gene6    wrky

To search for a whole word and avoid matching substrings, use the -w option.

grep 'gene1' data/gene_annotation.txt 
## gene1    ROS
## gene10   MCU
grep -w 'gene1' data/gene_annotation.txt 
## gene1    ROS
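
The options above can be tried on a small self-contained file. This sketch recreates a few annotation lines (the file content is an assumption modeled on the output shown above) and also uses -c, which counts matching lines directly instead of piping to wc -l.

```shell
# Hypothetical annotation table, written to a temporary file
tmpfile=$(mktemp)
printf 'gene1\tROS\ngene2\tWRKY\ngene6\twrky\ngene10\tMCU\n' > "$tmpfile"

# -c prints the number of matching lines (same count as grep ... | wc -l)
n_wrky=$(grep -c 'WRKY' "$tmpfile")
# -i ignores case, so 'wrky' is counted as well
n_wrky_any_case=$(grep -c -i 'WRKY' "$tmpfile")
# Without -w, 'gene1' also matches 'gene10'
n_gene1_substring=$(grep -c 'gene1' "$tmpfile")
# With -w, only the whole word 'gene1' matches
n_gene1_word=$(grep -c -w 'gene1' "$tmpfile")

echo "$n_wrky $n_wrky_any_case $n_gene1_substring $n_gene1_word"
rm "$tmpfile"
```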

4.1.3.2 awk
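
The source gives no example for this section, so the following is only a minimal sketch of row filtering with awk. It assumes a tab-delimited table whose second column is a log2 fold change and whose third column is a p-value; the data and the 0.05 cutoff are hypothetical.

```shell
# Hypothetical differential-expression table, written to a temporary file
tmpfile=$(mktemp)
printf '#gene_id\tlog2fc\tp-val\n'  > "$tmpfile"
printf 'gene1\t2\t0.01\n'          >> "$tmpfile"
printf 'gene2\t3\t0.04\n'          >> "$tmpfile"
printf 'gene3\t-2\t0.06\n'         >> "$tmpfile"

# Keep the header (NR == 1) and every row whose third field is below 0.05;
# a pattern without an action prints the whole line
filtered=$(awk -F '\t' 'NR == 1 || $3 < 0.05' "$tmpfile")
# Skip the header and print only the gene IDs with a negative log2 fold change
down=$(awk -F '\t' 'NR > 1 && $2 < 0 { print $1 }' "$tmpfile")

printf '%s\n' "$filtered" "$down"
rm "$tmpfile"
```

Unlike grep, which matches text anywhere on the line, awk can test a condition on a specific column, which is why it is a common choice for numeric row filters.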

4.2 paste

paste merges the corresponding lines of its input files side by side, separated by tabs. When - is given as a file name, paste reads standard input, so paste - - - - joins every four input lines into one tab-separated line.

paste data/DEG_list.txt data/gene_annotation.txt |head -5
## #gene_id log2fc  p-val   gene_id gene_shortname
## gene1    2   0.01    gene1   ROS
## gene2    3   0.04    gene2   WRKY
## gene3    -2  0.06    gene3   ZmCCT
## gene4    -8  0.001   gene4   WRKY1
paste data/DEG_list.txt data/gene_annotation.txt |head -5 |cut -f 1-3,5
## #gene_id log2fc  p-val   gene_shortname
## gene1    2   0.01    ROS
## gene2    3   0.04    WRKY
## gene3    -2  0.06    ZmCCT
## gene4    -8  0.001   WRKY1
head -2 data/WGBS_example_data/EV1.fastq
## @EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51
## TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA
paste - - - - <data/WGBS_example_data/EV1.fastq |head -2 
## @EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51 TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA +EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51    BCBDFFFFHHFHHIJJJJJJJJHHIICEFGHIIJJC?FDFHIIJJJHHFHG
## @EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51 TTGATGGGTATTTTAATTGGTATTTAATTTATTGTTGAGGGTTTTATTATT +EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51    BCCDFFFFCFHHHJIIJJJIFHHGJIGJJJGIJJIIJGIJIDGIJEHJGIJ
paste - - - - <data/WGBS_example_data/EV1.fastq |cut -f1-2 |head -2
## @EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51 TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA
## @EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51 TTGATGGGTATTTTAATTGGTATTTAATTTATTGTTGAGGGTTTTATTATT

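Convert a FASTQ file to a FASTA file using paste: keep only the header and sequence fields, replace the leading @ with >, and turn the tab back into a newline: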

paste - - - - <data/WGBS_example_data/EV1.fastq |cut -f1-2 | sed 's/^@/>/' | tr "\t" "\n"  |head -4
## >EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51
## TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA
## >EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51
## TTGATGGGTATTTTAATTGGTATTTAATTTATTGTTGAGGGTTTTATTATT

For FASTQ files, another example is to combine two paired-end read files into one interleaved read file, and vice versa.

## https://biowize.wordpress.com/2015/03/26/the-fastest-darn-fastq-decoupling-procedure-i-ever-done-seen/
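## Interleave: put each read on one line, join mates side by side, then restore newlines
paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) | \
      tr '\t' '\n' > reads-int.fastq
## Deinterleave: fields 1-4 are read 1, fields 5-8 are read 2
paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq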
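
The commands above rely on process substitution, which is a bash feature. As a hedged sketch of the same idea for shells without it, the example below uses intermediate files and checks that interleaving followed by deinterleaving reproduces the input. The two tiny FASTQ files and all file names are hypothetical.

```shell
tmpdir=$(mktemp -d)
# Two hypothetical paired-end files with two reads each (4 lines per read)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n' > "$tmpdir/reads-1.fastq"
printf '@r1/2\nTGCA\n+\nHHHH\n@r2/2\nCCGG\n+\nHHHH\n' > "$tmpdir/reads-2.fastq"

# Interleave: one read per line (4 tab-separated fields), mates side by side
paste - - - - < "$tmpdir/reads-1.fastq" > "$tmpdir/r1.tsv"
paste - - - - < "$tmpdir/reads-2.fastq" > "$tmpdir/r2.tsv"
paste "$tmpdir/r1.tsv" "$tmpdir/r2.tsv" | tr '\t' '\n' > "$tmpdir/reads-int.fastq"
n_int_lines=$(wc -l < "$tmpdir/reads-int.fastq")

# Deinterleave: 8 lines per pair, fields 1-4 are read 1 and fields 5-8 are read 2
paste - - - - - - - - < "$tmpdir/reads-int.fastq" > "$tmpdir/int.tsv"
cut -f 1-4 "$tmpdir/int.tsv" | tr '\t' '\n' > "$tmpdir/out-1.fastq"
cut -f 5-8 "$tmpdir/int.tsv" | tr '\t' '\n' > "$tmpdir/out-2.fastq"

# cmp -s is silent and returns 0 when the two files are identical
if cmp -s "$tmpdir/reads-1.fastq" "$tmpdir/out-1.fastq" &&
   cmp -s "$tmpdir/reads-2.fastq" "$tmpdir/out-2.fastq"; then
    roundtrip=ok
else
    roundtrip=failed
fi
echo "round trip: $roundtrip"
rm -r "$tmpdir"
```

This approach assumes that every record spans exactly four lines and that both input files list the mates in the same order; multi-line FASTQ records would need a real parser.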

4.2.1 Advanced topic:

## https://gist.github.com/nathanhaigh/3521724
sh code_shell/deinterleave_fastq.sh < 

4.3 Finding Things

4.3.1 Find files with pattern matching

## Find any files with "Linux" and ".Rmd" in the file names
find . -type f -name "*Linux*.Rmd"
## ./01_WhyLinux.Rmd
## ./04_Linux_FilteringOutputandFindingThings.Rmd
## ./slides/slides01_Linux.Rmd
## ./08_InstallationOfSoftwareInLinux.Rmd
## ./02_Connect2Linux.Rmd
## ./09_TextEditorInLinux.Rmd
## ./03_FileSystemLinux.Rmd
## ./06_procManageLinux.Rmd

4.3.2 Count the number of files in a folder and its subdirectories

find . -type f | wc -l
## 2380

4.3.3 List files larger than a specified size

# To find files larger than 10 MiB:
find . -type f -size +10M
## ./download
## ./data/ESP6500-African_American.vcf.gz
## ./.git/objects/pack/pack-b6b6179d7bb7190bc99389e158b2a8e4cbd98d2f.pack
# To search the current directory only (here for files larger than 1 MiB):
find . -maxdepth 1 -type f -size +1M
## ./download

4.3.4 Find files and do something

find . -type f -name "*fa*.pl" -exec ls -l {} +
## -rw-rw-rw- 1 rstudio rstudio 879 Apr 15  2019 ./code_perl/fa_seq_len_out_argv_fil.pl
## -rw-rw-rw- 1 rstudio rstudio 810 Apr 15  2019 ./code_perl/fa_seq_len_out_argv.pl
## -rw-rw-rw- 1 rstudio rstudio 809 Apr 15  2019 ./code_perl/fa_seq_len_out.pl
## -rw-rw-rw- 1 rstudio rstudio 642 Apr 15  2019 ./code_perl/fa_seq_len.pl
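
The -exec action can end with either + or an escaped semicolon. As a small sketch of the difference, the example below runs echo both ways on two hypothetical files in a temporary directory and counts the output lines.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/a.pl" "$tmpdir/b.pl" "$tmpdir/c.txt"

# '{} +' appends many file names to a single command invocation,
# so echo runs once and prints both names on one line
n_batched=$(find "$tmpdir" -type f -name "*.pl" -exec echo {} + | wc -l)
# '{} \;' runs the command once per file; the semicolon is escaped
# so that the shell passes it to find instead of ending the command
n_per_file=$(find "$tmpdir" -type f -name "*.pl" -exec echo {} \; | wc -l)

echo "batched: $n_batched line(s), per file: $n_per_file line(s)"
rm -r "$tmpdir"
```

The + form usually starts fewer processes and is faster for large file sets, while the \; form fits commands that accept only one file at a time.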

4.4 Check your job status

$ ps aux  
USER       PID  %CPU %MEM  VSZ RSS     TTY   STAT START   TIME COMMAND
timothy  29217  0.0  0.0 11916 4560 pts/21   S+   08:15   0:00 pine  
root     29505  0.0  0.0 38196 2728 ?        Ss   Mar07   0:00 sshd: can [priv]   
can      29529  0.0  0.0 38332 1904 ?        S    Mar07   0:00 sshd: can@notty  

USER = user owning the process
PID = process ID of the process
%CPU = CPU time used divided by the time the process has been running
%MEM = ratio of the process’s resident set size to the physical memory on the machine
VSZ = virtual memory usage of the entire process (in KiB)
RSS = resident set size, the non-swapped physical memory that a task has used (in KiB)
TTY = controlling tty (terminal)
STAT = multi-character process state
START = starting time or date of the process
TIME = cumulative CPU time
COMMAND = command with all its arguments

References:

https://superuser.com/questions/117913/ps-aux-output-meaning

https://biowize.wordpress.com/2015/03/26/the-fastest-darn-fastq-decoupling-procedure-i-ever-done-seen/