Chapter 4 File content filtering

4.1 File Filtering

4.1.1 Column filtering

4.1.2 `cut`

cut prints selected parts of each line (bytes, characters, or delimiter-separated fields) from each input file to standard output.
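As a minimal sketch of column selection with cut, the block below builds a small tab-separated file and extracts fields from it. The file name `demo.tsv` and its contents are made up for illustration and are not part of the course data.

```shell
# Hypothetical demo data: three tab-separated columns
tmpdir=$(mktemp -d)
printf 'gene1\t2\t0.01\ngene2\t3\t0.04\n' > "$tmpdir/demo.tsv"

# -f picks fields; TAB is the default delimiter
cols_1_3=$(cut -f 1,3 "$tmpdir/demo.tsv")
printf '%s\n' "$cols_1_3"

# -d sets a different delimiter, here a comma
csv_field=$(printf 'a,b,c\n' | cut -d ',' -f 2)
printf '%s\n' "$csv_field"

rm -r "$tmpdir"
```

Field lists accept single numbers, comma-separated lists, and ranges such as `1-3`.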

Related commands: cut, sort, uniq, wc, grep

https://www.youtube.com/playlist?list=PLtK75qxsQaMLZSo7KL-PmiRarU7hrpnwK

4.1.3 Row filtering

4.1.3.1 `grep`

The grep command, whose name stands for “global regular expression print,” processes text line by line and prints every line that matches a specified pattern. It searches the given files (or standard input) for lines containing a match to the given string or regular expression. By default, grep prints the matching lines.

grep 'WRKY' data/gene_annotation.txt

## gene2    WRKY
## gene4    WRKY1
## gene5    WRKY2

grep 'WRKY' data/gene_annotation.txt  |wc -l

## 3
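grep can also count matching lines itself with `-c`, which saves the extra `wc -l` process. The sketch below uses a small made-up annotation file (hypothetical contents modeled on the example above) so that it runs on its own.

```shell
# Hypothetical stand-in for data/gene_annotation.txt
tmpfile=$(mktemp)
printf 'gene1\tROS\ngene2\tWRKY\ngene4\tWRKY1\ngene5\tWRKY2\ngene6\twrky\n' > "$tmpfile"

# Count matching lines by piping to wc -l ...
n_pipe=$(grep 'WRKY' "$tmpfile" | wc -l)
# ... or let grep count them with -c
n_count=$(grep -c 'WRKY' "$tmpfile")

echo "$n_pipe $n_count"
rm "$tmpfile"
```

Note that `-c` counts matching lines, not individual occurrences of the pattern within a line.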

grep -i 'WRKY' data/gene_annotation.txt

## gene2    WRKY
## gene4    WRKY1
## gene5    WRKY2
## gene6    wrky

To match a whole word and avoid matching substrings, use the `-w` option.

grep 'gene1' data/gene_annotation.txt

## gene1    ROS
## gene10   MCU

grep -w 'gene1' data/gene_annotation.txt

## gene1    ROS
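One caveat with `-w`: grep treats only letters, digits, and the underscore as word characters, so an ID followed by a dot still counts as a whole-word match. The sketch below illustrates this with made-up IDs (hypothetical, chosen only to show the boundary cases).

```shell
# Hypothetical IDs chosen to show where -w does and does not match
tmpfile=$(mktemp)
printf 'gene1\tROS\ngene10\tMCU\ngene1.2\tisoform\ngene1_a\tvariant\n' > "$tmpfile"

# "." and TAB end a word; "0" and "_" continue it
w_hits=$(grep -w 'gene1' "$tmpfile" | cut -f 1)
printf '%s\n' "$w_hits"

rm "$tmpfile"
```

Here `gene1` and `gene1.2` are reported, while `gene10` and `gene1_a` are not. When the first column must equal the ID exactly, a field comparison such as `awk '$1 == "gene1"'` is one alternative.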

4.1.3.2 `awk`

4.2 `paste`

paste data/DEG_list.txt data/gene_annotation.txt |head -5

## #gene_id log2fc  p-val   gene_id gene_shortname
## gene1    2   0.01    gene1   ROS
## gene2    3   0.04    gene2   WRKY
## gene3    -2  0.06    gene3   ZmCCT
## gene4    -8  0.001   gene4   WRKY1

paste data/DEG_list.txt data/gene_annotation.txt |head -5 |cut -f 1-3,5

## #gene_id log2fc  p-val   gene_shortname
## gene1    2   0.01    ROS
## gene2    3   0.04    WRKY
## gene3    -2  0.06    ZmCCT
## gene4    -8  0.001   WRKY1
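Note that paste joins files purely by line position and never compares the gene IDs, so both files must list the same genes in the same order. A minimal sketch with two made-up files (names and contents are hypothetical) shows what happens when the order differs:

```shell
tmpdir=$(mktemp -d)
printf 'gene1\t2\ngene2\t3\n' > "$tmpdir/deg.txt"         # hypothetical DEG table
printf 'gene2\tWRKY\ngene1\tROS\n' > "$tmpdir/annot.txt"  # same genes, different order

# paste glues line 1 to line 1 and line 2 to line 2, regardless of content
pasted=$(paste "$tmpdir/deg.txt" "$tmpdir/annot.txt")
printf '%s\n' "$pasted"

# Count rows where the two ID columns (fields 1 and 3) disagree
mismatches=$(printf '%s\n' "$pasted" | awk -F '\t' '$1 != $3' | wc -l)
echo "$mismatches"

rm -r "$tmpdir"
```

A check like this on the two ID columns is a cheap safeguard before dropping one of them with cut. If the order may differ, sorting both files by ID first, or using `join`, is a safer route.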

head -2 data/WGBS_example_data/EV1.fastq

## @EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51
## TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA

paste - - - - <data/WGBS_example_data/EV1.fastq |head -2

## @EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51 TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA +EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51    BCBDFFFFHHFHHIJJJJJJJJHHIICEFGHIIJJC?FDFHIIJJJHHFHG
## @EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51 TTGATGGGTATTTTAATTGGTATTTAATTTATTGTTGAGGGTTTTATTATT +EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51    BCCDFFFFCFHHHJIIJJJIFHHGJIGJJJGIJJIIJGIJIDGIJEHJGIJ

paste - - - - <data/WGBS_example_data/EV1.fastq |cut -f1-2 |head -2

## @EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51 TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA
## @EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51 TTGATGGGTATTTTAATTGGTATTTAATTTATTGTTGAGGGTTTTATTATT

Convert a fastq file to a fasta file using paste:

paste - - - - <data/WGBS_example_data/EV1.fastq |cut -f1-2 | sed 's/^@/>/' | tr "\t" "\n"  |head -4

## >EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51
## TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA
## >EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51
## TTGATGGGTATTTTAATTGGTATTTAATTTATTGTTGAGGGTTTTATTATT
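The same pipeline can be tried on a tiny made-up FASTQ file to see what each stage does. The two reads below are hypothetical, and the sketch assumes strict four-line records with no tab characters in the header lines.

```shell
# Two made-up reads (hypothetical data, not from EV1.fastq)
tmpfq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nHHHH\n' > "$tmpfq"

# 1) paste - - - - folds each 4-line record into one tab-separated line
# 2) cut keeps the header and the sequence
# 3) sed turns the leading @ of the header into >
# 4) tr puts header and sequence back on separate lines
fasta=$(paste - - - - < "$tmpfq" | cut -f 1-2 | sed 's/^@/>/' | tr '\t' '\n')
printf '%s\n' "$fasta"

rm "$tmpfq"
```

Because sed only sees the folded lines, the `^@` anchor can only hit the header, even when a quality string happens to start with `@`.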

For fastq files, another example is combining two paired-end read files into one interleaved read file, and vice versa.

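## https://biowize.wordpress.com/2015/03/26/the-fastest-darn-fastq-decoupling-procedure-i-ever-done-seen/
## Interleave: fold each 4-line record onto one line, paste the two files
## side by side, then unfold (needs bash for the <( ) process substitution)
paste <(paste - - - - < reads-1.fastq) \
      <(paste - - - - < reads-2.fastq) | \
      tr '\t' '\n' > reads-int.fastq
## De-interleave: fold each 8-line pair onto one line, then split columns
## 1-4 and 5-8 (note: this overwrites reads-1.fastq and reads-2.fastq)
paste - - - - - - - - < reads-int.fastq \
    | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
    | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq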
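The following sketch runs the same interleave and de-interleave idea as a round trip on made-up data. It is a variant, not the procedure from the linked post: it writes intermediate files instead of using process substitution and `tee`, so it also works in a plain POSIX shell. All file names and reads are hypothetical.

```shell
d=$(mktemp -d)
# Hypothetical paired-end data: two reads per file
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nHHHH\n' > "$d/R1.fastq"
printf '@r1/2\nTTAA\n+\nJJJJ\n@r2/2\nCCGG\n+\nGGGG\n' > "$d/R2.fastq"

# Interleave: one record per line, files side by side, then unfold
paste - - - - < "$d/R1.fastq" > "$d/R1.tab"
paste - - - - < "$d/R2.fastq" > "$d/R2.tab"
paste "$d/R1.tab" "$d/R2.tab" | tr '\t' '\n' > "$d/int.fastq"

# De-interleave: one read pair (8 lines) per row, then split the columns
paste - - - - - - - - < "$d/int.fastq" > "$d/pairs.tab"
cut -f 1-4 "$d/pairs.tab" | tr '\t' '\n' > "$d/out1.fastq"
cut -f 5-8 "$d/pairs.tab" | tr '\t' '\n' > "$d/out2.fastq"

# The round trip should reproduce both input files byte for byte
roundtrip=failed
cmp -s "$d/R1.fastq" "$d/out1.fastq" && cmp -s "$d/R2.fastq" "$d/out2.fastq" && roundtrip=ok
fifth_line=$(sed -n '5p' "$d/int.fastq")   # header of the first mate-2 read
echo "$roundtrip $fifth_line"

rm -r "$d"
```

Writing the de-interleaved reads to new file names, as done here, avoids overwriting the original inputs.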

4.2.1 Advanced topic:

## https://gist.github.com/nathanhaigh/3521724
sh code_shell/deinterleave_fastq.sh <

4.3 Finding Things

4.3.1 Find files with pattern matching

## Find any files with "Linux" and ".Rmd" in the file names
find . -type f -name "*Linux*.Rmd"

## ./01_WhyLinux.Rmd
## ./04_Linux_FilteringOutputandFindingThings.Rmd
## ./slides/slides01_Linux.Rmd
## ./08_InstallationOfSoftwareInLinux.Rmd
## ./02_Connect2Linux.Rmd
## ./09_TextEditorInLinux.Rmd
## ./03_FileSystemLinux.Rmd
## ./06_procManageLinux.Rmd

4.3.2 Count the number of files in a folder and its subdirectories

find . -type f | wc -l

## 2380

4.3.3 List files larger than a specified size

# To find files larger than 10 MiB (the M suffix means units of 1,048,576 bytes):
find . -type f -size +10M

## ./download
## ./data/ESP6500-African_American.vcf.gz
## ./.git/objects/pack/pack-b6b6179d7bb7190bc99389e158b2a8e4cbd98d2f.pack

# If you want the current directory only (here with a 1 MiB threshold):
find . -maxdepth 1 -type f -size +1M

## ./download
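To see how `-size` and `-maxdepth` interact without touching real data, the sketch below creates a throwaway directory with two 2 MiB files (one at the top level, one in a subdirectory) and a tiny text file. All names are hypothetical.

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"
# Hypothetical test files: two of 2 MiB each, one of a few bytes
dd if=/dev/zero of="$tmpdir/top.bin" bs=1024 count=2048 2>/dev/null
dd if=/dev/zero of="$tmpdir/sub/deep.bin" bs=1024 count=2048 2>/dev/null
printf 'tiny\n' > "$tmpdir/small.txt"

# Recursive search finds both large files
all_big=$(find "$tmpdir" -type f -size +1M | wc -l)
# -maxdepth 1 stops find from descending into sub/
top_big=$(find "$tmpdir" -maxdepth 1 -type f -size +1M | wc -l)

echo "$all_big $top_big"
rm -r "$tmpdir"
```

The recursive search reports two files, while the `-maxdepth 1` search reports only the top-level one.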

4.3.4 Find files and do something

find . -type f -name "*fa*.pl" -exec ls -l {} +

## -rw-rw-rw- 1 rstudio rstudio 879 Apr 15  2019 ./code_perl/fa_seq_len_out_argv_fil.pl
## -rw-rw-rw- 1 rstudio rstudio 810 Apr 15  2019 ./code_perl/fa_seq_len_out_argv.pl
## -rw-rw-rw- 1 rstudio rstudio 809 Apr 15  2019 ./code_perl/fa_seq_len_out.pl
## -rw-rw-rw- 1 rstudio rstudio 642 Apr 15  2019 ./code_perl/fa_seq_len.pl
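`-exec` can be terminated in two ways, and the difference is easy to check on a throwaway directory. The sketch below uses two empty, hypothetical `.pl` files and counts how many times the command is actually run.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/a.pl" "$tmpdir/b.pl"   # hypothetical empty files

# {} + appends as many file names as fit to a single command invocation
calls_plus=$(find "$tmpdir" -type f -name '*.pl' -exec echo run {} + | wc -l)

# {} \; runs the command once for every file found
calls_each=$(find "$tmpdir" -type f -name '*.pl' -exec echo run {} \; | wc -l)

echo "$calls_plus $calls_each"
rm -r "$tmpdir"
```

With `+` the command runs once for both files; with `\;` it runs twice. The `+` form is usually faster for many files, while `\;` is needed when the command accepts only one file at a time.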

4.4 Check your job status

$ ps aux  
USER       PID  %CPU %MEM  VSZ RSS     TTY   STAT START   TIME COMMAND
timothy  29217  0.0  0.0 11916 4560 pts/21   S+   08:15   0:00 pine  
root     29505  0.0  0.0 38196 2728 ?        Ss   Mar07   0:00 sshd: can [priv]   
can      29529  0.0  0.0 38332 1904 ?        S    Mar07   0:00 sshd: can@notty

USER = user owning the process
PID = process ID of the process
%CPU = CPU time used divided by the time the process has been running
%MEM = ratio of the process’s resident set size to the physical memory on the machine
VSZ = virtual memory usage of the entire process (in KiB)
RSS = resident set size, the non-swapped physical memory that the process has used (in KiB)
TTY = controlling tty (terminal)
STAT = multi-character process state
START = starting time or date of the process
TIME = cumulative CPU time
COMMAND = command with all its arguments
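When only one job is of interest, ps can be asked for just that process and just the columns needed. The sketch below is one way to do this, using the POSIX-style options `-p` and `-o` rather than the BSD-style `aux`; the background `sleep` is a hypothetical stand-in for a real job.

```shell
# Start a harmless background job to inspect (stand-in for a real job)
sleep 30 &
job_pid=$!

# -p selects one process; each -o adds a column; a trailing = drops the header
job_info=$(ps -p "$job_pid" -o pid= -o stat= -o comm=)
echo "$job_info"

# Clean up the demo job
kill "$job_pid"
```

The output holds the PID, the process state (the STAT column described above), and the command name. To list only your own processes, `ps -u "$USER"` is a common choice.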

References:

https://superuser.com/questions/117913/ps-aux-output-meaning

https://biowize.wordpress.com/2015/03/26/the-fastest-darn-fastq-decoupling-procedure-i-ever-done-seen/