• From Novice to Expert in Bioinformatics

All things considered for Bioinformatics

Chapter 3 Navigating in Linux file system

You are in your home directory after you log into the system and are directed to the shell command prompt. This section will show you how to explore the Linux file system using shell commands.

To start, you need to take a tour of what the Linux filesystem looks like so you know where you are going.

3.1 Path

To understand the Linux file system, you can imagine it as a tree structure (Figure 3.1).

Figure 3.1: Tree structure of Linux system.

In Linux, a path is a unique location of a file or a directory in the file system.

For convenience, the Linux file system is usually thought of as a tree structure. On a standard Linux system, the layout generally follows the scheme shown in Figure 3.1.

The tree of the file system starts at the trunk or slash, indicated by a forward slash (/). This directory, containing all underlying directories and files, is also called the root directory or “the root” of the file system.

3.1.1 Relative and absolute path

  • Absolute path

An absolute path is the location of a file or directory described from the root directory (/), so it always starts with /.

Here are some examples:

/home/xie186
/home/xie186/perl5

  • Relative path

A relative path is defined relative to the present working directory and does not start with /. Examples: data/sample1/ and ../doc/.

To get the absolute path that corresponds to a relative path, you can use readlink with the -f option:

pwd
readlink -f ../
## /data/projects/bix_book
## /data/projects
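
As a small sketch of the difference between the two kinds of path, the commands below build a throwaway directory tree and reach the same directory once through a relative path and once through an absolute path. The directory names (demo, data, sample1) and the variable names are made up for illustration.

```shell
# Create a scratch directory so nothing in your real files is touched
base=$(mktemp -d)
mkdir -p "$base/demo/data/sample1"

# Relative path: interpreted from the present working directory
cd "$base/demo"
cd data/sample1
rel_result=$(pwd -P)

# Absolute path: starts at the root (/) and works from anywhere
cd /
cd "$base/demo/data/sample1"
abs_result=$(pwd -P)

# Both routes end in the same directory
echo "$rel_result"
echo "$abs_result"
```

The relative path only works when the present working directory is the one it was written for, while the absolute path works from any location.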

3.2 Surfing in Linux file system

Once we are in a Linux file system, we need to know 1) where we are; 2) how to get to where we want to go; and 3) what files or directories a particular path contains.

3.2.1 Check where you are using command pwd

To find out where we are, we use the pwd command. The name pwd is short for “print name of current/working directory.” It returns the full path of the current directory.

Command pwd is almost always used by itself. This means you only need to type pwd and press ENTER (Figure 3.2).

Figure 3.2: Using the pwd command.

3.2.2 Listing the contents using command ls

Once you know where you are, you will want to know what that directory contains. The command ls lists directory contents (Figure 3.3). Its syntax is:

ls [option]... [file]...

Figure 3.3: Using the ls command.

ls with no option lists files and directories in bare format, which means that detailed information (type, size, modification date and time, permissions, links, etc.) is not shown. When you use ls by itself (Figure 3.3), it lists the files and directories in the current directory.

cd tables
ls

echo "ls -a"
ls -a 

echo "ls -t"
ls -t
## 10_PerlInputOutput_bak.Rmd
## linuxPathShortcuts.tsv
## regexp_perl.tsv
## textEditorLinuxVi3modes.csv
## ls -a
## .
## ..
## 10_PerlInputOutput_bak.Rmd
## linuxPathShortcuts.tsv
## regexp_perl.tsv
## textEditorLinuxVi3modes.csv
## ls -t
## regexp_perl.tsv
## 10_PerlInputOutput_bak.Rmd
## linuxPathShortcuts.tsv
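## textEditorLinuxVi3modes.csv

The -l option prints the long format (permissions, owner, size, and modification time), and -a also shows hidden entries whose names start with a dot:

ls -l -a tables/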
## total 28
## drwxrwxrwx  2 rstudio rstudio 4096 May 19  2019 .
## drwxrwxrwx 20 rstudio rstudio 4096 Dec 28 03:00 ..
## -rw-rw-rw-  1 rstudio rstudio 4139 Apr 15  2019 10_PerlInputOutput_bak.Rmd
## -rw-rw-rw-  1 rstudio rstudio  223 Apr 15  2019 linuxPathShortcuts.tsv
## -rw-rw-rw-  1 rstudio rstudio  259 May 19  2019 regexp_perl.tsv
## -rw-rw-rw-  1 rstudio rstudio  766 Apr 15  2019 textEditorLinuxVi3modes.csv

Single-letter command options can usually be combined behind a single - (dash), with no spaces between them.

The following command is a shorter way to give the l and a options and produces the same kind of listing as the command shown above.

ls -la 
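
The equivalence can be checked directly. This is a minimal sketch run in a scratch directory; the file and directory names and the two variable names are hypothetical.

```shell
# Work in a scratch directory with one file and one subdirectory
cd "$(mktemp -d)"
echo "gene1" > a.txt
mkdir subdir

# -l (long format) and -a (include hidden entries) given separately...
separate=$(ls -l -a)
# ...and combined behind a single dash
combined=$(ls -la)

# Show one listing and report whether the two are identical
echo "$combined"
if [ "$separate" = "$combined" ]; then
    echo "Both forms give the same listing"
fi
```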

3.2.3 Change directory using command ‘cd’

Command cd is used to change the current directory. Its syntax is:

cd [option] [directory]

Unlike pwd, cd usually needs an argument: the path (either absolute or relative) of the directory we want to enter.

If we do not provide a path, cd takes us to the home directory by default.

3.3 Path shortcuts

In Linux, there are four commonly used path shortcuts (Table 3.1).

Table 3.1: Shortcuts of path.

Path             Shortcuts  Description
Single dot       .          The current folder
Double dots      ..         The folder above the current folder
Tilde character  ~          Home directory (normally the directory /home/my_login_name)
Dash             -          Your last working directory

Here are some examples:

cd ~
pwd
ls
ls ../
## 
pwd
cd ../
pwd
cd ./
pwd

Each directory has two entries in it at the start, with names . (a link to itself) and .. (a link to its parent directory). The exception, of course, is the root directory, where the .. directory also refers to the root directory.

Sometimes you go to a new directory to do something and then need to return to the previous working directory. To get back instantly, use a dash.

# This is our current directory
pwd

# Let us go to our home directory
cd ~

# Check where we are
pwd

# Let us go to your previous working directory
cd -
# Check where we are now
pwd
## /data/projects/bix_book
## /home/rstudio
## /data/projects/bix_book
## /data/projects/bix_book

3.4 Manipulations of files and directories

Manipulating files and directories is among the most frequent tasks in Linux. In this section, you will learn how to copy, move, rename, remove, and create files and directories.

3.4.1 Command cp

In Linux, command cp can help you copy files and directories into a target directory.
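
A minimal sketch of typical cp usage, run in a scratch directory. The file and directory names (notes.txt, backup, results) are hypothetical. The -r option, which is needed for directories, copies a directory together with its contents.

```shell
# Work in a scratch directory
cd "$(mktemp -d)"
echo "gene1" > notes.txt
mkdir backup results
echo "sample1" > results/summary.txt

# Copy a file to a new name in the same directory
cp notes.txt notes_copy.txt

# Copy a file into an existing directory
cp notes.txt backup/

# Copy a directory and everything in it (-r means recursive)
cp -r results results_copy

# Check what was created
ls
ls backup results_copy
```

Unlike mv, cp leaves the original file or directory in place.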

3.4.2 Command mv

The command mv moves (or renames) files and directories.

3.4.2.1 Move one file

Here is one common example of mv.

mv file1 directory1/

3.4.2.2 Move multiple files into a directory

mv file1 file2 file3 target_directory/

3.4.2.3 Move a directory

mv dir1 target_directory/

3.4.2.4 Rename a file or a directory
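
Renaming uses the same mv command: when the last argument is not an existing directory, mv gives the source that new name instead of moving it into a directory. A minimal sketch with made-up names, run in a scratch directory:

```shell
# Work in a scratch directory
cd "$(mktemp -d)"
touch old_name.txt
mkdir old_dir

# Rename a file: new_name.txt does not exist yet, so it becomes the new name
mv old_name.txt new_name.txt

# Rename a directory in the same way
mv old_dir new_dir

ls
```

If the target name already exists as a regular file, mv overwrites it, so check with ls first when in doubt.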

3.4.3 Command mkdir

Command mkdir is short for make directory.

The syntax is shown as below:

mkdir [OPTION ...] DIRECTORY ...
mkdir directory

Multiple directories can be specified when calling mkdir.

mkdir directory1 directory2

3.4.3.1 How to create a directory

mkdir -p foo/bar/baz

How to define a complex directory tree with one command

mkdir -p project/{software,results,doc/{html,info,pdf},scripts}

This will create a directory tree as shown below:

$ tree project/
project/
├── doc
│   ├── html
│   ├── info
│   └── pdf
├── results
├── scripts
└── software

7 directories, 0 files

Likewise, the earlier command mkdir -p foo/bar/baz creates the directories foo, foo/bar, and foo/bar/baz if they do not already exist; the -p option creates any missing parent directories.

3.4.4 Command ‘rm’

You can use rm to remove both files and directories.

3.4.4.1 How to remove a file or multiple files

## You can remove one file. 
rm file1 
## `rm` can remove multiple files simultaneously
rm file2 file3 

3.4.4.2 How to remove a folder

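rm with the -r (recursive) option removes a folder together with everything inside it, whether or not the folder is empty.

rm -r FOLDER

Adding -f (force) suppresses confirmation prompts, for example for write-protected files. rm does not move files to a trash folder, so double-check the path before you run it.

mkdir test_folder
touch test_folder/file1
rm -rf test_folder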

3.5 Viewing text files in Linux

3.5.1 Command cat

The command cat (short for concatenate) concatenates files and prints them on the standard output.

The syntax is shown as below:

cat [OPTION]... [FILE]...

For a small text file, cat can be used to view its contents on the standard output.

cat data/testdata4linux_cmd.txt
## gene1
## gene2
## gene3
## gene4
## gene5
## gene6
## gene7
## gene8
## gene9
## gene10
## gene11
## gene12
## gene13
## gene14
## gene15
## gene16

You can also use cat to merge two text files.

cat file1 file2 > merged_file
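
A small runnable sketch of the merge, using two made-up two-line files in a scratch directory. The files are joined in the order in which they are listed.

```shell
# Work in a scratch directory with two small input files
cd "$(mktemp -d)"
printf 'gene1\ngene2\n' > file1
printf 'gene3\ngene4\n' > file2

# Concatenate in the order given and write the result to a new file
cat file1 file2 > merged_file
cat merged_file
## gene1
## gene2
## gene3
## gene4
```

Always write the result to a new file. Redirecting into one of the input files (for example cat file1 file2 > file1) empties that file before cat can read it.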

3.5.2 Command more and less

The command more is an old utility. When the text passed to it is too large to fit on one screen, it shows the text one page at a time. You can scroll down but not up.

The syntax of more is shown below:

more [options] file [...]

The command less was written by a man who was fed up with more’s inability to scroll backwards through a file. He turned less into an open source project and over time, various individuals added new features to it. less is massive now. That’s why some small embedded systems have more but not less. For comparison, less’s source is over 27000 lines long. more implementations are generally only a little over 2000 lines long.

The syntax of less is shown below:

less [options] file [...]

3.5.3 Command head and tail

The command head is used to output the first part of files. By default, it outputs the first 10 lines of the file.

head [OPTION]... [FILE]...

Here is an example of printing the first 5 lines of a file:

head -n 5 code_perl/variable_assign.pl
## #!/usr/bin/perl
## use warnings;
## use strict;
## 
## #assign two strings to two variables

In fact, the letter n is not needed at all. The hyphen and the integer (with no intervening space) are sufficient to tell head how many lines to return. Thus, the following command prints the first 5 lines of the test data file:

head -5 data/testdata4linux_cmd.txt
## gene1
## gene2
## gene3
## gene4
## gene5

The command tail is used to output the last part of files. By default, it prints the last 10 lines of the file to standard output.

The syntax is shown below:

tail [OPTION]... [FILE]...

Here is an example of printing the last 5 lines of the file:

tail -5 data/testdata4linux_cmd.txt
## gene12
## gene13
## gene14
## gene15
## gene16

To view a file from a specific line onward, use -n +NUMBER with the tail command. For example, the following prints the file starting from its 2nd line.

tail -n +2 data/testdata4linux_cmd.txt
## gene2
## gene3
## gene4
## gene5
## gene6
## gene7
## gene8
## gene9
## gene10
## gene11
## gene12
## gene13
## gene14
## gene15
## gene16
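
head and tail can also be combined to pull out a range of lines from the middle of a file. The sketch below is one possible way to do it, not the only one: it rebuilds a 16-line file like the test data above under the hypothetical name genes.txt, and uses the | symbol, which passes the output of one command to the next command as its input.

```shell
# Build a 16-line test file in a scratch directory
cd "$(mktemp -d)"
i=1
while [ "$i" -le 16 ]; do
    echo "gene$i"
    i=$((i + 1))
done > genes.txt

# Lines 6 to 8: keep the first 8 lines, then the last 3 of those
middle=$(head -n 8 genes.txt | tail -n 3)
echo "$middle"
## gene6
## gene7
## gene8
```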

3.5.4 Auto-completion

In most shell environments, the programmable completion feature will also improve your typing speed. It lets you type part of a command name, file name, or directory name and then press the TAB key to auto-complete it (Figure 3.4). If there is more than one possible completion, pressing TAB lists all of them (Figure 3.4).

Figure 3.4: Demonstration of programmable completion feature.

3.6 Understand standard input and standard output

In the Linux environment, input and output are distributed across three streams: standard input (STDIN), standard output (STDOUT), and standard error (STDERR). These three streams are also numbered: STDIN (0), STDOUT (1), STDERR (2).

3.6.1 STDIN

… The standard input stream typically carries data from a user to a program. Programs that expect standard input usually receive input from a device, such as a keyboard. Standard input is terminated by reaching EOF (end-of-file). As described by its name, EOF indicates that there is no more data to be read.

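To see standard input in action, run the cat program. Cat stands for concatenate, which means to link or combine something. It is commonly used to combine the contents of two files. When run on its own, cat opens a looping prompt. …

The example below uses tail in the same way: run without a file name, it reads the lines typed at the keyboard and prints them once CTRL+D signals EOF.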

tail
1
2
3
`CTRL+D`
1
2
3
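
Standard input does not have to come from the keyboard. As a small sketch, the < operator connects a file to a program's standard input (stream 0). The file name numbers.txt and the variable name are made up for illustration.

```shell
# Create a three-line file in a scratch directory
cd "$(mktemp -d)"
printf '1\n2\n3\n' > numbers.txt

# tail reads from standard input because no file argument is given;
# '<' feeds the file to that stream instead of the keyboard
from_stdin=$(tail < numbers.txt)
echo "$from_stdin"
## 1
## 2
## 3
```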

3.6.2 STDOUT

Data generated by a program is written to STDOUT. If STDOUT is not redirected, the data is printed on the terminal.

stdout="Hello world"
echo $stdout
## Hello world

The STDOUT can be redirected to a file. See the example below:

stdout="Hello world"
echo $stdout > data/test_output.txt
# cat the data
cat data/test_output.txt
## Hello world

3.6.3 STDERR

During a program’s execution, errors may be generated when some part of the program fails. Such error messages are written to STDERR. By default, STDERR is printed on the terminal.

Here is an example of STDERR:

ls NOTAFILE
## ls: cannot access 'NOTAFILE': No such file or directory
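
Because the streams are numbered, each one can be redirected separately. The following is a minimal sketch run in a scratch directory; the output file names are hypothetical, and the trailing || true only keeps the failing ls command from stopping a script that is set to exit on errors.

```shell
cd "$(mktemp -d)"

# 2> sends STDERR (stream 2) to a file; nothing is shown on the terminal
ls NOTAFILE 2> errors.txt || true
cat errors.txt

# > (the same as 1>) captures only STDOUT, so listing.txt stays empty
# and the error message still reaches the terminal
ls NOTAFILE > listing.txt || true

# 2>&1 sends STDERR to wherever STDOUT points, here the file all.txt
ls . NOTAFILE > all.txt 2>&1 || true
cat all.txt
```

Keeping the two streams apart is useful when the normal output of a program is data that should not be mixed with its error messages.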

3.7 Find Disk Usage of Files and Directories

The du command (short for disk usage) is a standard Unix/Linux command that reports the disk space used by files and directories. It has many options that control the format of its output. When given a directory, du also reports the sizes of its subdirectories recursively.

du data/ESP6500-African_American.vcf.gz
du -h data/ESP6500-African_American.vcf.gz
## 27388    data/ESP6500-African_American.vcf.gz
## 27M  data/ESP6500-African_American.vcf.gz

To get a single grand total of the disk usage of a directory, use the option “-s” as follows.

du -sh data/
## 38M  data/

Using the “-a” flag with du displays the disk usage of every file as well as every directory.

du -ah data/
## 4.0K data/test_ref2_30.fa
## 4.0K data/sgRNA_count_norm4gini_index.txt
## 9.3M data/Arabidopsis_thaliana.TAIR10.37.gff3.gz
## 4.0K data/stcp-Rdataset-Diet.csv
## 4.0K data/WGBS_sample_information.txt
## 4.0K data/nhanes_2015_2016.csv.README
## 4.0K data/gene_annotation.txt
## 4.0K data/DMR_region.txt
## 4.0K data/DEG_list.txt
## 4.0K data/test_ref_len.txt
## 532K data/ESP6500-African_American.vcf.gz.tbi
## 4.0K data/README
## 4.0K data/test_ref.fa
## 4.0K data/PYL10_ARATH.fasta
## 0    data/regexp_perl.txt
## 88K  data/maize_embryo_specific_gene_Sheet1.tsv
## 8.0K data/WGBS_example_data/EV1.fastq
## 12K  data/WGBS_example_data
## 4.0K data/famous_people.txt
## 4.0K data/test_ref2.fa
## 4.0K data/test_output.txt
## 748K data/nhanes_2015_2016.csv
## 27M  data/ESP6500-African_American.vcf.gz
## 4.0K data/DMR_region_merged.txt
## 4.0K data/testdata4linux_cmd.txt
## 8.0K data/Pneumonia_china_2020.RDS
## 38M  data/
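
A common follow-up is to find the entries that take up the most space. The sketch below is one way to do this, assuming GNU du, which reports sizes in 1 KB blocks by default. It creates three made-up files of different sizes in a scratch directory and sorts the plain (non -h) numbers with sort -n.

```shell
# Create a scratch directory with files of roughly 1 KB, 100 KB and 1000 KB
cd "$(mktemp -d)"
mkdir data
head -c 1024 /dev/urandom > data/small.txt
head -c 102400 /dev/urandom > data/medium.txt
head -c 1024000 /dev/urandom > data/large.txt

# Without -h the sizes are plain numbers, so sort -n can order them;
# tail then keeps the entries with the largest sizes
du -a data/ | sort -n | tail -n 3

# The single largest file
largest=$(du -a data/*.txt | sort -n | tail -n 1)
echo "$largest"
```

The -h option is left out on purpose: human-readable values such as 4.0K and 27M do not sort correctly as plain numbers.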

3.8 Advanced topic

3.8.1 Linux md5sum Command

md5sum is used to verify the integrity of files, as virtually any change to a file will cause its MD5 hash to change. Most commonly, md5sum is used to verify that a file has not changed as a result of a faulty file transfer, a disk error or non-malicious meddling. The md5sum program is included in most Unix-like operating systems.

echo "The MD5 value of index.Rmd is: "
md5sum index.Rmd
cp index.Rmd index.Rmd_bak
echo "The MD5 value of index.Rmd_bak is: "
md5sum index.Rmd_bak 
echo "The MD5 value of new index.Rmd_bak is: "
head index.Rmd > index.Rmd_bak
md5sum index.Rmd_bak 
## The MD5 value of index.Rmd is: 
## 8a08026382b5fe49d9f764534976b0fc  index.Rmd
## The MD5 value of index.Rmd_bak is: 
## 8a08026382b5fe49d9f764534976b0fc  index.Rmd_bak
## The MD5 value of new index.Rmd_bak is: 
## 6c6d75e8839891bf7ba1ab152c8f267c  index.Rmd_bak
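
Instead of comparing hashes by eye, md5sum can do the comparison itself with its -c option. This is a minimal sketch using a made-up file, reads.txt, in a scratch directory: the checksum is saved to a file first and verified later, for example after a transfer.

```shell
cd "$(mktemp -d)"
echo "ACGTACGT" > reads.txt

# Save the checksum next to the file (for example, before a transfer)
md5sum reads.txt > reads.txt.md5

# -c re-computes the hash and compares it with the saved one
md5sum -c reads.txt.md5
## reads.txt: OK

# After any change to the file, the check fails
echo "ACGTACGA" > reads.txt
md5sum -c reads.txt.md5 || echo "checksum mismatch"
```

One checksum file can hold the lines for many files, so a whole data set can be verified with a single md5sum -c command.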