Chapter 4 File content filtering
4.1 File Filtering
4.1.1 Column filtering
4.1.2 cut
The cut command prints selected parts of each line of the given files to standard output.
Video playlist covering cut, sort, uniq, wc, and grep:
https://www.youtube.com/playlist?list=PLtK75qxsQaMLZSo7KL-PmiRarU7hrpnwK
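As a minimal sketch of column filtering with cut, the example below builds a small tab-separated table in a temporary directory. The file name deg_demo.txt and its contents are hypothetical stand-ins, not files from this chapter.

```shell
# Hypothetical demo table with three tab-separated columns.
tmpdir=$(mktemp -d)
printf 'gene1\t2\t0.01\ngene2\t3\t0.04\ngene3\t-2\t0.06\n' > "$tmpdir/deg_demo.txt"

# -f selects fields; the default field delimiter is a tab.
first_col=$(cut -f 1 "$tmpdir/deg_demo.txt")
printf '%s\n' "$first_col"

# Fields can be given as a list (1,3) or as a range (2-3).
cols_1_3=$(cut -f 1,3 "$tmpdir/deg_demo.txt")
printf '%s\n' "$cols_1_3"

# -d sets a different delimiter, here a comma.
csv_second=$(printf 'a,b,c\n' | cut -d ',' -f 2)
printf '%s\n' "$csv_second"

rm -r "$tmpdir"
```

Capturing each result in a variable is only for the demonstration; in everyday use the cut command is usually run directly or placed in a pipeline.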
4.1.3 Row filtering
4.1.3.1 grep
The grep command, whose name stands for "global regular expression print", processes text line by line and prints every line that matches a specified pattern. It can search standard input or one or more files, and by default it displays the matching lines.
grep 'WRKY' data/gene_annotation.txt
## gene2 WRKY
## gene4 WRKY1
## gene5 WRKY2
# Count the matching lines
grep 'WRKY' data/gene_annotation.txt | wc -l
## 3
# -i ignores case, so the lowercase wrky also matches
grep -i 'WRKY' data/gene_annotation.txt
## gene2 WRKY
## gene4 WRKY1
## gene5 WRKY2
## gene6 wrky
To match whole words only and avoid matching substrings, use the -w option.
grep 'gene1' data/gene_annotation.txt
## gene1 ROS
## gene10 MCU
grep -w 'gene1' data/gene_annotation.txt
## gene1 ROS
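The options shown so far combine with a few other standard grep flags. The sketch below uses a hypothetical four-line annotation file (annot_demo.txt, made up for illustration) to show -c for counting and -v for inverting a match, alongside -i and -w.

```shell
# Hypothetical annotation table: gene ID, tab, short name.
tmpdir=$(mktemp -d)
printf 'gene1\tROS\ngene2\tWRKY\ngene6\twrky\ngene10\tMCU\n' > "$tmpdir/annot_demo.txt"

# -c prints the number of matching lines, like piping into wc -l.
n_wrky=$(grep -c 'WRKY' "$tmpdir/annot_demo.txt")

# -c and -i together count matches regardless of case.
n_wrky_any_case=$(grep -ci 'wrky' "$tmpdir/annot_demo.txt")

# -v inverts the match and keeps only the lines that do NOT match.
non_wrky=$(grep -vi 'wrky' "$tmpdir/annot_demo.txt")

# -w requires a whole-word match, so gene10 is not counted for gene1.
n_gene1=$(grep -cw 'gene1' "$tmpdir/annot_demo.txt")

printf '%s %s %s\n' "$n_wrky" "$n_wrky_any_case" "$n_gene1"
printf '%s\n' "$non_wrky"

rm -r "$tmpdir"
```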
4.1.3.2 awk
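A minimal sketch of row filtering with awk, assuming a tab-separated table shaped like data/DEG_list.txt (gene ID, log2 fold change, p-value). The demo file, its values, and the 0.05 threshold are hypothetical choices made for illustration.

```shell
# Hypothetical table with a header line and four genes.
tmpdir=$(mktemp -d)
printf '#gene_id\tlog2fc\tp-val\ngene1\t2\t0.01\ngene2\t3\t0.04\ngene3\t-2\t0.06\ngene4\t-8\t0.001\n' \
  > "$tmpdir/deg_demo.txt"

# A bare condition prints every line for which it is true.
# Keep the header (NR == 1) plus rows whose third field is below 0.05.
sig=$(awk -F '\t' 'NR == 1 || $3 < 0.05' "$tmpdir/deg_demo.txt")
printf '%s\n' "$sig"

# Conditions can be combined, and an action block can print chosen fields:
# IDs of genes that pass the threshold and have a negative fold change.
down=$(awk -F '\t' 'NR > 1 && $3 < 0.05 && $2 < 0 {print $1}' "$tmpdir/deg_demo.txt")
printf '%s\n' "$down"

rm -r "$tmpdir"
```

Unlike grep, which matches text patterns, awk can compare fields numerically, which makes it a natural fit for threshold filters on tabular data.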
4.2 paste
# Join the two files line by line, separated by a tab
paste data/DEG_list.txt data/gene_annotation.txt | head -5
## #gene_id log2fc p-val gene_id gene_shortname
## gene1 2 0.01 gene1 ROS
## gene2 3 0.04 gene2 WRKY
## gene3 -2 0.06 gene3 ZmCCT
## gene4 -8 0.001 gene4 WRKY1
# Drop the duplicated gene_id column (field 4) with cut
paste data/DEG_list.txt data/gene_annotation.txt | head -5 | cut -f 1-3,5
## #gene_id log2fc p-val gene_shortname
## gene1 2 0.01 ROS
## gene2 3 0.04 WRKY
## gene3 -2 0.06 ZmCCT
## gene4 -8 0.001 WRKY1
head -2 data/WGBS_example_data/EV1.fastq
## @EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51
## TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA
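# Each "-" reads one line from standard input, so four of them join
# every 4-line FASTQ record into a single tab-separated row
paste - - - - < data/WGBS_example_data/EV1.fastq | head -2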
## @EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51 TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA +EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51 BCBDFFFFHHFHHIJJJJJJJJHHIICEFGHIIJJC?FDFHIIJJJHHFHG
## @EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51 TTGATGGGTATTTTAATTGGTATTTAATTTATTGTTGAGGGTTTTATTATT +EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51 BCCDFFFFCFHHHJIIJJJIFHHGJIGJJJGIJJIIJGIJIDGIJEHJGIJ
paste - - - - <data/WGBS_example_data/EV1.fastq |cut -f1-2 |head -2
## @EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51 TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA
## @EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51 TTGATGGGTATTTTAATTGGTATTTAATTTATTGTTGAGGGTTTTATTATT
Convert a FASTQ file to a FASTA file using paste:
paste - - - - <data/WGBS_example_data/EV1.fastq |cut -f1-2 | sed 's/^@/>/' | tr "\t" "\n" |head -4
## >EV1.1 SN603_WB082:5:1101:49.70:115.70 length=51
## TGTTTTTGGGTTAATTAATATTAATTAAATATTTTAATATATTTTTATATA
## >EV1.2 SN603_WB082:5:1101:30.70:119.80 length=51
## TTGATGGGTATTTTAATTGGTATTTAATTTATTGTTGAGGGTTTTATTATT
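The paste - - - - idiom used above can be tried without any data files. The sketch below first folds the numbers 1 to 8 into rows of four, then repeats the FASTQ-to-FASTA conversion on two made-up reads (the read names r1 and r2 and their sequences are hypothetical).

```shell
# Each "-" consumes the next line of standard input, so four of them
# fold every four consecutive lines into one tab-separated row.
folded=$(printf '1\n2\n3\n4\n5\n6\n7\n8\n' | paste - - - -)
printf '%s\n' "$folded"

# Turning the tabs back into newlines restores one value per line.
restored=$(printf '%s\n' "$folded" | tr '\t' '\n')

# Same pipeline as above on a tiny made-up FASTQ stream:
# keep header and sequence (fields 1-2), swap the leading @ for >,
# then unfold each row into two lines.
fasta=$(printf '%s\n' '@r1' 'ACGT' '+' 'IIII' '@r2' 'TTGA' '+' 'HHHH' \
  | paste - - - - | cut -f 1-2 | sed 's/^@/>/' | tr '\t' '\n')
printf '%s\n' "$fasta"
```

This approach assumes every record spans exactly four lines; FASTQ files with wrapped sequence lines would need a real parser instead.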
For FASTQ files, another example is combining two paired-end read files into one interleaved read file, and vice versa.
## https://biowize.wordpress.com/2015/03/26/the-fastest-darn-fastq-decoupling-procedure-i-ever-done-seen/
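# Interleave two paired-end files (process substitution requires bash)
paste <(paste - - - - < reads-1.fastq) \
  <(paste - - - - < reads-2.fastq) | \
  tr '\t' '\n' > reads-interleave.fastq
# Deinterleave: every 8 lines form one read pair (fields 1-4 and 5-8)
paste - - - - - - - - < reads-int.fastq \
  | tee >(cut -f 1-4 | tr '\t' '\n' > reads-1.fastq) \
  | cut -f 5-8 | tr '\t' '\n' > reads-2.fastq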
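To see the round trip end to end, here is a sketch on two made-up read pairs. It writes intermediate files instead of using process substitution, so it also runs under a plain POSIX sh; all file names and reads are hypothetical.

```shell
tmpdir=$(mktemp -d)
# Two hypothetical reads per mate file.
printf '%s\n' '@r1/1' 'ACGT' '+' 'IIII' '@r2/1' 'GGCC' '+' 'HHHH' > "$tmpdir/demo-1.fastq"
printf '%s\n' '@r1/2' 'TTAA' '+' 'JJJJ' '@r2/2' 'CCGG' '+' 'FFFF' > "$tmpdir/demo-2.fastq"

# Interleave: fold each file to one record per row, put the rows
# side by side, then unfold the tabs back into lines.
paste - - - - < "$tmpdir/demo-1.fastq" > "$tmpdir/rows-1.tsv"
paste - - - - < "$tmpdir/demo-2.fastq" > "$tmpdir/rows-2.tsv"
paste "$tmpdir/rows-1.tsv" "$tmpdir/rows-2.tsv" | tr '\t' '\n' > "$tmpdir/demo-interleaved.fastq"

# Read names in the interleaved file (every 4th line, starting at line 1).
interleaved_ids=$(awk 'NR % 4 == 1' "$tmpdir/demo-interleaved.fastq" | tr '\n' ' ')

# Deinterleave: eight lines make one pair; fields 1-4 are mate 1, 5-8 mate 2.
paste - - - - - - - - < "$tmpdir/demo-interleaved.fastq" > "$tmpdir/pairs.tsv"
cut -f 1-4 "$tmpdir/pairs.tsv" | tr '\t' '\n' > "$tmpdir/back-1.fastq"
cut -f 5-8 "$tmpdir/pairs.tsv" | tr '\t' '\n' > "$tmpdir/back-2.fastq"

# The round trip should reproduce both input files byte for byte.
roundtrip=failed
if cmp -s "$tmpdir/demo-1.fastq" "$tmpdir/back-1.fastq" &&
   cmp -s "$tmpdir/demo-2.fastq" "$tmpdir/back-2.fastq"; then
  roundtrip=ok
fi
printf '%s\n' "$interleaved_ids" "$roundtrip"

rm -r "$tmpdir"
```

The intermediate files cost extra disk I/O compared with the process-substitution version, which matters for large read sets but not for a demonstration.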
4.2.1 Advanced topic:
## https://gist.github.com/nathanhaigh/3521724
sh code_shell/deinterleave_fastq.sh <
4.3 Finding Things
4.3.1 Find files with pattern matching
## Find any files with "Linux" and ".Rmd" in the file names
find . -type f -name "*Linux*.Rmd"
## ./01_WhyLinux.Rmd
## ./04_Linux_FilteringOutputandFindingThings.Rmd
## ./slides/slides01_Linux.Rmd
## ./08_InstallationOfSoftwareInLinux.Rmd
## ./02_Connect2Linux.Rmd
## ./09_TextEditorInLinux.Rmd
## ./03_FileSystemLinux.Rmd
## ./06_procManageLinux.Rmd
4.3.2 Count file numbers in a folder and its subdirectories
find . -type f | wc -l
## 2380
4.3.3 List files bigger than filesize specified
# To find files larger than 10 MB:
find . -type f -size +10M
## ./download
## ./data/ESP6500-African_American.vcf.gz
## ./.git/objects/pack/pack-b6b6179d7bb7190bc99389e158b2a8e4cbd98d2f.pack
# If you want the current dir only:
find . -maxdepth 1 -type f -size +1M
## ./download
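The effect of -size and -maxdepth can be checked on a throwaway directory. The sketch below creates two 2 MiB files and one tiny file (all names hypothetical) and assumes a find that accepts the M size suffix and -maxdepth, as GNU and BSD find do.

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/sub"

# One 2 MiB file at the top level, one in a subdirectory, and a tiny file.
dd if=/dev/zero of="$tmpdir/big_top.bin" bs=1024 count=2048 2>/dev/null
dd if=/dev/zero of="$tmpdir/sub/big_nested.bin" bs=1024 count=2048 2>/dev/null
printf 'tiny\n' > "$tmpdir/small.txt"

# Without -maxdepth, find descends into every subdirectory.
all_big=$(find "$tmpdir" -type f -size +1M | sort)
printf '%s\n' "$all_big"

# -maxdepth 1 limits the search to the starting directory itself.
top_big=$(find "$tmpdir" -maxdepth 1 -type f -size +1M)
printf '%s\n' "$top_big"

rm -r "$tmpdir"
```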
4.3.4 Find files and do something
find . -type f -name "*fa*.pl" -exec ls -l {} +
## -rw-rw-rw- 1 rstudio rstudio 879 Apr 15 2019 ./code_perl/fa_seq_len_out_argv_fil.pl
## -rw-rw-rw- 1 rstudio rstudio 810 Apr 15 2019 ./code_perl/fa_seq_len_out_argv.pl
## -rw-rw-rw- 1 rstudio rstudio 809 Apr 15 2019 ./code_perl/fa_seq_len_out.pl
## -rw-rw-rw- 1 rstudio rstudio 642 Apr 15 2019 ./code_perl/fa_seq_len.pl
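The -exec action has two terminators, and the difference is easiest to see by counting how often the command runs. In this sketch the files a.pl, b.pl, and c.txt are hypothetical, and echo stands in for whatever command should be applied.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/a.pl" "$tmpdir/b.pl" "$tmpdir/c.txt"

# "{} +" appends as many file names as fit onto one command line,
# so echo runs once here and prints a single line.
batched_runs=$(find "$tmpdir" -type f -name '*.pl' -exec echo run: {} + | wc -l)

# "{} \;" runs the command once per file, so echo prints two lines.
per_file_runs=$(find "$tmpdir" -type f -name '*.pl' -exec echo run: {} \; | wc -l)

printf 'batched: %s, per file: %s\n' "$batched_runs" "$per_file_runs"

rm -r "$tmpdir"
```

The batched form starts fewer processes and is usually faster; the per-file form is needed when the command accepts only one file argument at a time.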
4.4 Check your job status
$ ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
timothy 29217 0.0 0.0 11916 4560 pts/21 S+ 08:15 0:00 pine
root 29505 0.0 0.0 38196 2728 ? Ss Mar07 0:00 sshd: can [priv]
can 29529 0.0 0.0 38332 1904 ? S Mar07 0:00 sshd: can@notty
USER = user owning the process
PID = process ID of the process
%CPU = CPU time used divided by the time the process has been running
%MEM = ratio of the process's resident set size to the physical memory on the machine
VSZ = virtual memory usage of the entire process (in KiB)
RSS = resident set size, the non-swapped physical memory that a task has used (in KiB)
TTY = controlling tty (terminal)
STAT = multi-character process state
START = starting time or date of the process
TIME = cumulative CPU time
COMMAND = command with all its arguments
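Because ps aux prints a plain table, it combines well with the filtering tools from earlier in this chapter. The following sketch assumes a ps that supports the standard -p and -o options (procps on Linux does); the variable names are made up for illustration.

```shell
# $$ expands to the process ID of the current shell.
shell_pid=$$

# -p selects a single process; -o chooses the columns to print.
# A trailing "=" after a column name suppresses its header.
self=$(ps -o pid= -o comm= -p "$shell_pid")
printf '%s\n' "$self"

# The first line of ps aux is the column header described above.
header=$(ps aux | head -n 1)
printf '%s\n' "$header"

# Skip the header, sort numerically on the %MEM column (field 4),
# largest first, and show the three biggest memory users.
ps aux | awk 'NR > 1' | sort -k 4,4 -nr | head -n 3
```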
References:
https://superuser.com/questions/117913/ps-aux-output-meaning