login@hostname:~/path$
1 “bash” ?
Many/most bioinformatic tools do not have graphical interfaces. To use them, you need some basics on the command line and bash is the name of one language to “talk” to a computer through a terminal.
2 Learn the basics of using the command line
The goal is not to turn you into bioinformatics specialists but to give you basic skills in running commands. From there you will be able to find, follow and understand tutorials and help available on the ‘jungle of the Internet’
2.1 Connecting to your Virtual Machine (VM)
To start, you need a linux environment where you can train yourself. For that, go back to the IFB website (There is a link on the top of this page) and go to the deployment frame in the “myVM” and connect you to the VM you started at the beginning of the session.
To connect to an instance you need 3 pieces of information: a login, a password and the IP address of your VM:
- Click on deployment “Params” link (right), then copy your Rstudio_PASWORD
- Click on deployment “https” link (right), which will send you to the correct IP.
- fill the login (“rstudio”), and your password in the popup window, et type “enter”. ”
- You must accept the security warning to access the machine.
- If you are using firefox, it is possible that firefox does not want to override the security warning, you must then use another browser e.g. chrome.
2.2 Discovering the Rstudio interface and access to its terminal
By default, the “R” console is open on the left-hand side. This is because R is the expected language in Rstudio, but we are going to use another Rstudio function to talk directly to the computer. To do this, we are going to use the terminal.
You can also see a file browser in the bottom right-hand corner.
2.3 Access to the terminal
Open the “Terminal” tab of your Rstudio interface to access the terminal on your VM.
2.4 What is a terminal ?
A terminal is an interface used to issue commands to a computer by typing only text. You already use a lot of commands indirectly through the graphical interface of your computer (like creating or opening folders, for example).
2.5 Why use a terminal ?
There are plenty or reasons. Here are 4 :
- Automation/scalability. With one command, it is possible to create 100 folders instantly, which is much more tedious through a graphical interface.
- Flexibility. Commands can have a wide array of inputs and options.
- Availability. A graphical interface is a luxury! Some programs (even some operating sytems) just. . . don’t have any. Thus, if you want to use a lot of programs out there, you need the terminal.
- Reproducibility. A set of commands can be stored in a script, which makes them easy to reproduce.
2.6 What does a terminal look like ?
The terminal’s input line has the following layout
We can split this in two parts:
• who is executing the commands: name of the user ( login ), @ , and name of the machine ( hostname ).
• where on the system is the command being executed is indicated the spot “from where” the command is executed ( ~/path ).
A $ sign ends the input prompt.
Whereas the R console will look like :
>
2.7 Launch your first bash commands
You have to imagine that you are a cursor in a tree (or hierarchy) of directories and files on your computer. You can move from one directory to another, and launch commands from your position in the file/folder hierarchy you are in, but also on the computer as a whole.
2.7.1 What is a command ?
A command is a text you pass to your terminal. It starts a system process on the computer, resulting in an action. Commands consist of the command name, followed by arguments: a series of space-delimited strings that modify how the process will behave.
An example will be easier to understand. cowsay
is a (fun and useless) program that takes a message as argument.
Type this command in your terminal and type enter to execute it.
cowsay Hello
Commands can also take a special kind of argument called an option. An option starts with -
and can be followed, or not, by another argument. Below the -f
option is used:
cowsay -f dragon Hi
What’s the point of options? Non-optional arguments (above, “Hi”) are related to the command’s goal, or default behaviour (here, outputting a message). Options fine-tune the command’s behaviour (above, who is speaking).
Warning, each space is interpreted as a new argument, so you have to use quotes to protect an argument containing spaces.
cowsay -e "@@" "What! A talking dragon!"
The following command will list all available animals that can be made to speak.
cowsay -l
Many commands provide help on their usage through the option “-h” or “–help”. You can also find more detailed help on the Internet.
cowsay -h
There are plenty of (more useful) programs. We will cover the basics.
2.7.3 Absolute/Relative paths
To know where your files are, the machine uses paths. There are 2 kinds of paths, relative and absolute.
- A relative path is “how to get somewhere from my current position”. Relative paths use the “.” and “..” notations to look under or above your current position. “../../” means the parent directory of your parent directory (remember the file tree analogy!)
cd tuto_bash/folderA/folderA_1
pwd
cd ../
pwd
cd ../../
pwd
cd
- Absolute paths all start with “/” , which is the root of the file system (remember the file tree analogy! :)). Absolute paths are useful because (unlike relative paths) they are correct no matter where you are on the machine.
cd /home/rstudio/mydatalocal/tuto_bash
- * the wildcard
a star is a “wildcard” matching any number of any characters. e.g. folder*
matches all files/folders starting by folder
ls folder*
Using the commands we’ve just seen:
- Make sure you are inside tuto_bash . What is its absolute path ?
Show the code
# Answer: check the prompt, and
pwd
#or
cd /home/rstudio/mydatalocal/tuto_bash
- Make a new directory inside tuto_bash named folderB/ and create a subdirectory folderB/folderB_1
Show the code
cd /home/rstudio/mydatalocal/tuto_bash
mkdir -p folderB/folderB_1
#or
mkdir folderB
cd folderB
mkdir folderB_1
- Go in the last sub directory you created (/home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1)
Show the code
#with an absolute path
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
pwd
#or a relative path from tuto_bash
cd /home/rstudio/mydatalocal/tuto_bash
cd folderB/folderB_1
pwd
#or a relative path from folderB
cd /home/rstudio/mydatalocal/tuto_bash/folderB
cd folderB_1
pwd
- From here, create a subdirectory folderA_2 in folderA
Show the code
#with an absolute path
mkdir -p /home/rstudio/mydatalocal/tuto_bash/folderA/folderA_2
#or a relative path from folderB_1
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
mkdir -p ../../folderA/folderA_2
- List all the directories of folderA
Show the code
#with an absolute path
ls /home/rstudio/mydatalocal/tuto_bash/folderA
#or a relative path from folderB_1
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
ls ../../folderA
2.7.4 Writing and Reading Text Files
Go back to the /home/rstudio/mydatalocal/tuto_bash/ directory
cd /home/rstudio/mydatalocal/tuto_bash/
touch
-> to create empty file
touch empty_file.txt
You can also easily create new file using the right bottom panel or using file > New File > Text File
Create a text file and name it toto.txt,
Write something inside.
Save it.
echo
-> display text in the terminal
echo "hello-world"
- redirection
>
and>>
->>
redirect the output of a command in a text file,>>
add output of a command at the end of a file
echo "hello-world" > tata.txt
echo "hello-world 2" >> tata.txt
You can also redirect output of other commands.
cowsay -f dragon "hello-world" > dragon.txt
cowsay "hi" >> dragon.txt
cat
-> concatenate files and display output
cat dragon.txt tata.txt
It is often use to display the content of one file
cat dragon.txt
The cat command is useful for viewing the contents of smaller files, but if the file contains hundreds of lines of text, it is overwhelming to have everything printed to the Terminal at once. To see only a part of the file, we can use the commands head and tail to see the first few or the last few lines of the file, respectively. Using myFile.txt as an example, typing
head
andtail
-> to display first or last lines of a file
head dragon.txt
tail dragon.txt
you can specify the number of lines you want to display
head -n 5 dragon.txt
- Create 2 new files named turtle.txt and sheep.txt, containing animal saying something. You will use
cowsay -f turtle message
and redirection
- Create 2 new files named turtle.txt and sheep.txt, containing animal saying something. You will use
Show the code
cowsay -f turtle hello > turtle.txt
cowsay -f sheep hello > sheep.txt
- Concatenate them and redirect the result in a file turtle_and_sheep.txt
Show the code
cat turtle.txt sheep.txt > turtle_and_sheep.txt
- Display the entire file, only the first 8 lines, the last 8 lines.
Show the code
cat turtle_and_sheep.txt
head -n 8 turtle_and_sheep.txt
tail -n 8 turtle_and_sheep.txt
- list all files of your current directory and save it in list_files.txt (Try to write in the first line: “All the files of this directory:”)
Show the code
echo "All the files of this directory:" > list_files.txt
pwd >> list_files.txt
echo >> list_files.txt
ls >> list_files.txt
cat list_files.txt
2.7.5 Copying and Removing Files
cp
-> copy a file in another path. cp<file|folder>
If you want to copy a file in the same directory: (you have to change the name)
cp dragon.txt dragons_cp.txt
If you want to copy a file in another directory: (you can keep the same name or change the name)
cp dragon.txt folderA/
ls folderA/
cp dragon.txt folderA/dragon_cp.txt
ls folderA/
cp -r
-> copy a directory
cp -r folderA folderAprime
mv
-> move a file/folder in another path. mv<file|folder>
Move mv
can be used to rename a file/folder or move it in another paths
ls
mv dragon.txt dragons_new.txt
ls
mv folderAprime folderC
If you want to move a file in another directory: (you can keep the same name or change the name)
mv dragons_new.txt folderA/
ls folderA/
mv dragon_cp.txt folderA/dragon_cp_NEW.txt
ls folderA/
rm
-> remove file(s). Deleting is permanent!
rm empty_file.txt
rm -r
-> remove recursively file(s) and directories. Warning: deleting is permanent!
mkdir -p folderC/folderC_1
cp dragon.txt folderC/folderC_1
ls folderC/folderC_1
rm -r folderC
ls folderC/folderC_1 #raises an error
- Copy turtle.txt in folderB/ and copy sheep.txt in folderB/folderB_1
Show the code
cp turtle.txt folderB/
cp sheep.txt folderB/folderB_1/
- Move folderB/turtle.txt in folderB/folderB_1/
Show the code
mv folderB/turtle.txt folderB/folderB_1/
- Move all files of folderB/folderB_1/ in folderA/folderA_1/. You will use the “*” (wildcard).
Show the code
ls folderB/folderB_1/
mv folderB/folderB_1/* folderA/folderA_1/
ls folderB/folderB_1/
ls folderA/folderA_1/
- List files of your current directory and remove files starting by sheep
Show the code
ls
rm sheep*
ls
- Copy the subdirectory folderB_1 in folderD
Show the code
cp -r folderB/folderB_1/ folderD
- 6 - use the command
tree
to get the file tree
Show the code
cd /home/rstudio/mydatalocal/tuto_bash
tree
Show the code
└── tuto_bash
├── dragons_cp.txt
├── empty_file.txt
├── folderA
│ ├── dragon_cp.txt
│ ├── dragons_new.txt
│ ├── dragon.txt
│ ├── folderA_1
│ │ ├── sheep.txt
│ │ └── turtle.txt
│ └── folderA_2
├── folderB
│ └── folderB_1
├── folderD
├── list_files.txt
├── tata.txt
├── turtle_and_sheep.txt
└── turtle.txt
Most commands have multiple options specified using a single or double dash (e.g. -v , –verbose ). All options should be detailed on the help page of a command, which you can usually print in the terminal with the –help option.
cp --help
To get more information, you can also use the man
command:
man cp
2.8 Combining Commands in scripts
So far you have learned how to use commands. You’ll soon find, however, that large and complex blocks of code are tedious to write out by hand every time you want to run them. It is also difficult to debug a long string of code that you wrote in the Terminal.
Instead, we can put everything into a script: a file containing code. This allows you to make your code compact and easy to move between directories if you need to. It also makes debugging much easier.
A bash (the language you use to communicate) script is a simple text file, usually starting with #!/bin/bash
and named with a ‘.sh’ extension, in which you list commands to run.
Here is an example of a simple script:
tuto_bash/my_script.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/tuto_bash
echo "Run in:"
pwd
mkdir -p folder_1
echo "Made a folder named folder_1"
touch folder_1/text_file.txt
echo "I want to eat strawberries." > folder_1/text_file.txt
echo "Created a file in folder_1 with some text"
Copy it in a text file named my_script.sh (saved in the tuto_bash folder) and run it with the bash command:
cd /home/rstudio/mydatalocal/tuto_bash
bash my_script.sh
You can check that the script worked using the commands previously shown.
Comments
Always add comments to your scripts to keep track of what they do. In Bash, lines starting with a # symbol are ignored by the computer.
folderA/my_script_comments.sh
#!/bin/bash
# This is a comment
#Display the working directory
echo "Run in:"
pwd
#Made a folder named folder_1
mkdir -p folder_1
# Created a file in folder_1 with some text
echo "I want to eat strawberries." > folder_1/text_file.txt
cowsay "me too" >> folder_1/text_file.txt
Add a couple of comments to the script we created above to specify what the commands do.
In longer and more complex scripts, it’s important (necessary, even) to document each section properly.
You’ll be thankful when reading properly commented code from others (or from past you), rather than spending time figuring out what each part does.
A script can be run from another folder, you just have to adjust the path:
cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1/
bash /home/rstudio/mydatalocal/tuto_bash/folderA/my_script_comments.sh
You can use tree
to search for the “folder1” you created with the previous script
cd /home/rstudio/mydatalocal/tuto_bash/
tree
(Usually, we prefer to use relative path.)
If you run a script from another directory, you’ll have to pay attention to the path used in your script. Indeed, the working directory will be the directory where you launch the script and not the directory containing the script. A good habit is to fix at the beginning of your script the working directory, for example :
sandbox/my_template_script.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/tuto_bash
[...]
You can redirect the output of your script to a file to keep log of its execution (if you have print message)
cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh > script.log
2.9 Helpful resources
- A detailed bash script cheatsheet: https://devhints.io/bash
- Another great bash cheatsheet
- A full course introduction to Shell/Bash
- Another one : https://andysbrainbook.readthedocs.io/en/stable/unix/
- Google (and by extension, Stackoverflow) is the programmer’s best friend :)
3 Get a toy dataset for the training
For the following of the training, we will work in a new directory, /home/rstudio/mydatalocal/training_project/
.
Create this directory and go inside.
mkdir -p /home/rstudio/mydatalocal/training_project
cd /home/rstudio/mydatalocal/training_project
You can also remove the tuto_bash directory if you want
rm -r /home/rstudio/mydatalocal/tuto_bash
The storage area of your VM (the 200Go or 400Go) is connected to the directory /home/rstudio/mydatalocal/ so in the following of the training you must only work in this folder if you do not want to have space issues.
The dataset we are using for this training is part of a larger study described in Ryu et al., G3 2022. The authors assembled and annotated the genome of the clown fish (Amphiprion ocellaris). Here, we will just use a very small subset of two paired-end RNA-seq samples, one from White skin and the other from Orange Skin.
3.1 Get the raw fastq files
Copy the script below in a file training_project/01_download_toydata.sh to automatically download the training dataset.
We will use wget
, a command (like ls
or cat
) that downloads file(s) from an URL.
training_project/01_download_toydata.sh
#!/bin/bash
mkdir -p /home/rstudio/mydatalocal/training_project
cd /home/rstudio/mydatalocal/training_project
wget https://ue.gitbiopages.ens-lyon.fr/ue-ngs/presentations/intro_bash/toy_data.tar.gz
tar xzvf toy_data.tar.gz
echo "Files:"
ls
Terminal
cd /home/rstudio/mydatalocal/training_project
bash 01_download_toydata.sh
3.2 Run the quality checking on the training dataset
Now we get some data, you can run your first bioinformatic analysis: perform quality control on your raw sequencing data.
FastQC is a commonly used software that provides a simple way to do this.
The template command to run fastqc is :
fastqc -o output_dir sample.fastq.gz
Fisrt get the path of a fastq file of the training dataset
cd /home/rstudio/mydatalocal/training_project
ls toy_data/*
toy_data/White_R1.fastq.gz
toy_data/White_R2.fastq.gz
toy_data/Orange_R1.fastq.gz
toy_data/Orange_R2.fastq.gz
Note the difference between the command with the wildcard “*”
ls toy_data/*
toy_data/White_R1.fastq.gz
toy_data/White_R2.fastq.gz
toy_data/Orange_R1.fastq.gz
toy_data/Orange_R2.fastq.gz
and without the “*”
ls toy_data/
White_R1.fastq.gz
White_R2.fastq.gz
Orange_R1.fastq.gz
Orange_R2.fastq.gz
Now we “just” have to adapt the command to run fastqc on a file of the training dataset.
You can save the commands in a script : “02_toy_check_quality.sh”
training_project/02_toy_check_quality.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/training_project
mkdir -p 01_quality_checking_fastqc/
fastqc -o 01_quality_checking_fastqc/ toy_data/White_R1.fastq.gz
and run it
Terminal
cd /home/rstudio/mydatalocal/training_project
bash 02_check_quality_fastqc.sh
We can now look at the results; we will study them tomorrow.
The result will be in the 01_quality_checking_fastqc folder in the results folder
└── results
└── 01_quality_checking_fastqc
├── White_R1_fastqc.html <-----
└── White_R1_fastqc.zip
You can search it in the right bottom panel of rstudio and open it with “View in a web Browser”
Fastqc can also take multiple files as arguments, to analyse them together. We will use the “*” wildcard to avoid writing the names of all the files
Modify the script 02_toy_check_quality.sh accordingly:
training_project/02_toy_check_quality.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/training_project
mkdir -p 01_quality_checking_fastqc/
fastqc -o 01_quality_checking_fastqc/ toy_data/*.fastq.gz
and run the script again
Terminal
cd /home/rstudio/mydatalocal/training_project
bash 02_toy_check_quality.sh
In Rstudio you can open several terminals at once to avoid waiting for the completion of one command.
By making a list of files corresponding to the pattern passed to fastqc, we can check we didn’t make any mistakes:
cd /home/rstudio/mydatalocal/training_project
ls toy_data/*.fastq.gz
3.3 Aggregate reports with multiqc
MultiQC aggregates results from bioinformatics analyses across many samples into a single report It searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarizing the output from numerous bioinformatics tools.
Write this script and execute it.
training_project/03_toy_aggregate_quality_report_multiqc.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/training_project
mkdir -p 01_quality_checking_aggregate_multiqc/
multiqc 01_quality_checking_fastqc/ -o 01_quality_checking_aggregate_multiqc/
Terminal
cd /home/rstudio/mydatalocal/training_project
bash 03_toy_aggregate_quality_report_multiqc.sh
4 How to organize your project directory ?
For the moment we placed all scripts at the root of our project directory. In fact this is a very bad idea.
It is advisable to put all files related to the same project in the same folder, whether they are:
- the raw data,
- the scripts,
- the results,
- the project documentation,
- …
But to separate the different types of data into sub directories.
This way, you will more easily find your way around and avoiding mixing up files or accidentally deleting them.
For example, your project directory might look like this:
project_name/
├── README.md # overview of the project
├── data/ # data files used in the project
├── results/ # results of the analysis (data, tables, figures)
├── scripts/ # contains all code in the project
│ ├── 01_...
│ ├── 02_...
│ └── ...
└── doc/ # documentation for your project
└── ...
Moreover, to be able to find your way around and to ensure reproducibility, you need to add a file, often named README.md, at the root of your folder that should contain all the useful information to take the project in hand. This is also the file that will be visible on the home page of your project on gitlab. So when someone wants to (re)work on the project, they will be able to open the folder, and they will know where to go to see and understand what has been done. This person can be a collaborator, your manager or simply yourself 6 months later.
- In the training project folder, create a “scripts” folder and move scripts inside.
Show the code
cd /home/rstudio/mydatalocal/training_project
mkdir -p scripts
mv 0*sh scripts
- Do the same thing for results files
Show the code
cd /home/rstudio/mydatalocal/training_project
mkdir -p results
mv *quality* results/
4.1 Scripts and data naming
Another part of staying organized is making sure that all the directories and filenames for an analysis are as consistent as possible. You want to avoid names like alignment1.bam, and rather have names like 20170823_kd_rep1_gmap-1.4.bam that provide a basic level of information about the file.
Good name must be :
- machine readable
- avoid spaces, punctuation, accented
- regular expression and globbing friendly (ex: ls CondA* )
- easy to compute on (ex: deliberate use of delimiters, CondA_rep1.tsv)
- human readable
- name contains info on content (ex 01_download_training_dataset.sh)
- plays well with default ordering (ex: put something numeric first for script for example, or use date for results)
This link and this slideshow have some good guidelines for file naming dos and don’ts.
5 More useful bash commands
In the last part, you have seen some basic bash commands. In this session we will see some new to complete your bash knowledge. As previously, the idea is only to give you useful command for this training but not a complete view of bash.
5.1 New bash commands
Create a new “tuto_bash” directory for the exercises in mydatalocal
cd /home/rstudio/mydatalocal
mkdir -p tuto_bash
cd tuto_bash
grep
-> to get only lines of a file with a pattern
cowsay -f dragon "Hello Hello!" > dragon.txt
cat dragon.txt
grep "-" dragon.txt
sed
-> to replace all the occurrences of a pattern by another
sed "s/Hello/TOTO/g" dragon.txt
sed
can do a lot of other things but we will use only this one.
If you want to replace only the first occurrence of a line, you have to remove the “g” character
sed "s/Hello/TOTO/" dragon.txt
wc -l
-> to count the number of lines of a file
wc -l dragon.txt
wc
can do other things but we will use only this one.
basename
-> to get only the name of a file from a path (the file can not exist)
basename toy_data/sample1.fastq.gz
dirname
-> to get only the name of the directory containing the file from a path (the file can not exist)
dirname toy_data/sample1.fastq.gz
- command ‘piping’, using
|
The pipe allows forwarding the output of one command as the input of the next command. This way you can execute several consecutive commands. For example, if you want to count the number of line containing a given pattern in a file:
grep "-" dragon.txt > occurence.txt
wc -l occurence.txt
You can instead “pipe” the command using a pipe |
(AltGr+6 for PC and Alt + Maj + L for Mac ) between two commands to link the two commands
grep "-" dragon.txt | wc -l
(By the way, grep
already has an option to do this directly (grep -c
), but it was for the example.)
gunzip
andzcat
to decompress files
As previously seen, fastq files are large files and you should store them in a compressed form whenever useful. In unix we used preferentially .gz compression (gunzip) than .zip, but it is the same idea.
To check the contents of a compressed file, you can unzip it with gunzip
but this takes time and disk space. To do it quickly, there is a command called zcat
that allows you to decompress a gz file on the fly and display its contents. So, zcat
piped (|
) with head
will allow you to display the first few lines of a gz file without taking up time or disc space.
zcat /home/rstudio/mydatalocal/training_project/toy_data/White_R1.fastq.gz | head -n 8
Using the commands we’ve just seen:
- Display the first 100 sequences of a training fastq file
Show the code
zcat /home/rstudio/mydatalocal/training_project/toy_data/White_R1.fastq.gz | head -n 400
- Count the number of sequences of a training fastq file
Show the code
zcat /home/rstudio/mydatalocal/training_project/toy_data/White_R1.fastq.gz | grep "@SRR1" | wc -l
5.2 Store strings in variables
In scripts, we frequently re-use bits of text: for example, a path to a file. It is more convenient to keep these in a named “variable” and then re-use that variable in the script. This way, if you need to modify the file path, you only need to modify the variable once, and not all the occurrences of the file path.
To define a variable you will declare it with an =
without space.
variable1="toto"
Quotes are not strictly mandatory, for example if there are no spaces in your string:
variable1=toto
and then you can refer to the variable’s content by using a $
before its name. To display the command of a variable you can use echo
.
echo $variable1
echo variable1
echo "variable1: $variable1"
You can also combine variables to define a new one
variable1="TOTO"
variable2="TATA"
echo variable1: $variable1
echo variable2: $variable2
variable3="$variable1 $variable2"
echo variable3: $variable3
It is better to use explicit variable names, rather than “variable1”, “variable2”
To use a variable, you sometimes need to add curly brackets ({}) around the variable name, to distinguish the variable name from the string that follows.
For example, from a variable, $sample
you want to create a filename using the contant of sample
as prefix of your file using $sample_rep1.fastq.gz
. You need to add curly brackets around sample (${sample}_rep1.fastq.gz
), else the terminal will search for the variable $sample_rep1
.
sample="sampleA"
filenameX=$sample_rep1.txt #variable sample_rep1 don't exist so filenameX=.txt
filenameY=${sample}_rep1.txt
echo filenameX: $filenameX
echo filenameY: $filenameY
Variables can also capture the output of a command, using $(command)
list_files=$(ls)
working_directory=$(pwd)
echo "In $working_directory, there are :
$list_files"
You can use also pipe inside this construct
nb_dash=$(grep "-" dragon.txt | wc -l)
echo $nb_dash
Using variables will simplify writing scripts but also reading them.
filename=dragon.txt
new_filename=dragon_bis.txt
echo "$filename will be cp to $new_filename"
cp $filename $new_filename
You can also do pattern substitutions in a variable, using ${variable/pattern/replace}
to replace the first occurrence.
filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/R1/R2}
If you want to remove a pattern, you can use the same syntax but without replacement string. It is a good way to remove the extension of a file name.
filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/.R1.fastq.gz/}
You can save the output of this pattern substitution in a new variable
filenameR1=sample1.R1.fastq.gz
filenameR2=${filenameR1/R1/R2}
echo R1: $filenameR1 R2: $filenameR2
You will work in the /home/rstudio/mydatalocal/tuto_bash directory to do the following exercice. But data will be in /home/rstudio/mydatalocal/training_project/toy_data/
cd /home/rstudio/mydatalocal/tuto_bash
fastq_data_dir="/home/rstudio/mydatalocal/training_project/toy_data/"
Using the commands we’ve just seen:
- Complete the commands below to define a variable
prefix
containing the prefix of the sample name (“White”) from one file of a pair of PE fastq files (White_R1.fastq.gz
)
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
...
echo $prefix # display White
Show the code
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
filenameR1=$(basename $full_filenameR1)
prefix=${filenameR1/_R1.fastq.gz/}
echo $prefix # display White
- From this prefix variable, write codes to save in a file (results/seq_numbers/$prefix.nb_sequences.txt) the number of sequences of the R1 fastq file. Before create a directory to save the results (“results/seq_numbers”).
Show the code
output_dir="results/seq_numbers"
mkdir -p $output_dir
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
filenameR1=$(basename $full_filenameR1)
prefix=${filenameR1/_R1.fastq.gz/}
echo $prefix # display White
zcat $full_filenameR1 | grep "@SRR1" | wc -l > $output_dir/$prefix.nb_sequences.txt
Starting from the full filename:
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
, write codes to save automatically:- in a varibale named filenameR2 the name of the R2 file (=White_R2.fastq.gz)
- and in a variable full_filenameR2 the path to the file R2 (=/home/rstudio/mydatalocal/training_project/toy_data//White_R2.fastq.gz)
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
...
echo filenameR2: $filenameR2 # display White_R2.fastq.gz
echo full_filenameR2: $full_filenameR2 # display /home/rstudio/mydatalocal/training_project/toy_data//White_R2.fastq.gz
Show the code
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
fastq_dirname=$(dirname $full_filenameR1)
filenameR1=$(basename $full_filenameR1)
echo filenameR1: $filenameR1
echo full_filenameR1: $full_filenameR1
filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"
echo filenameR2: $filenameR2
echo full_filenameR2: $full_filenameR2
- Complete the previous codes to get the number of sequences of each files of the sample (R1 and R2)
the output file must contain:
White_R1.fastq.gz: 1000
White_R2.fastq.gz: 1000
Show the code
cd /home/rstudio/mydatalocal/tuto_bash/
output_dir="results/seq_numbers"
mkdir -p $output_dir
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
fastq_dirname=$(dirname $full_filenameR1)
filenameR1=$(basename $full_filenameR1)
filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"
prefix=${filenameR1/_R1.fastq.gz/}
output_file="$output_dir/$prefix.nb_sequences.txt"
echo $prefix # display sampleA
nb_lines_R1=$(zcat $full_filenameR1 | grep "@SRR1" | wc -l)
echo "$filenameR1: $nb_lines_R1"> $output_file
nb_lines_R2=$(zcat $full_filenameR2 | grep "@SRR1" | wc -l)
echo "$filenameR2: $nb_lines_R2">> $output_file
cat $output_dir/$prefix.nb_sequences.txt
5.3 “For” loops to iterate on a list
Bioinformatic analysis often involves running many commands, but with only a small alteration each time you run a new command.
For example, you might have to analyze sample1, then sample2, sample3, and so on. To save time, we will use something called a Loop - also known as a for-loop.
Let’s illustrate what this is with a simple example. Let’s say that you need to print the letters a, b, c. You could do this by typing echo a
and hitting return; then echo b
, and echo c
. This gets the job done, but you can see that this would quickly become tedious if your goal was to print dozens or hundreds of numbers.
How can we make this easier? You probably noticed that each time we ran the command we changed the number after the echo command. A for-loop will automatically do this for you.
Here is an example of a for-loop to print the letters a, b, c, d:
for X in a b c d
do
echo $X
done
The for-loop has three sections:
- The first section is the Declaration: it begins by assigning the first item after in to the variable *X; in this case, it would assign the value “a” to “X”. The letters after in are called the List**.
- The next section is the Body, which runs the commands written after do, replacing the variable with whichever value is currently assigned to the variable - for the first loop, this will be the letter a. Since items remain in the list, the loop goes back to the declaration and assigns the next number in the list to the variable X; in this case, the number b. Then the body is run, and the process repeats until the end of the list is reached.
- The last section, called the End, contains only the word done, meaning to exit the loop after all of the items in the list have been run through the Body of the loop.
You can add more commands to the Body section. For example, we could change the loop to:
for X in a b c d
do
echo Do something
echo Do something with X=$X
echo Do another thing
done
You can use a variable as seen previously to store the list of things to iterate. And use a more explicit than X to iterate on (for example letter)
list="a b c d"
for letter in $list
do
echo Do something
echo Do something with $letter
echo Do another thing
done
Finally, the list you iterate through can be created automatically. For example, if you want to iterate through a list of files you can do:
cd /home/rstudio/mydatalocal/training_project/
list_files=$(ls toy_data/*gz)
echo $list_files
echo "Start:"
for file in $list_files
do
echo Do something
echo Do something with $file
echo Do another thing
done
echo "End"
#OR
for file in toy_data/*gz
do
echo Do something
echo Do something with $file
echo Do another thing
done
Side note: Indentations are not important for the machine but help readability for us humans.
Using the commands we’ve just seen:
- Iterate on each fastq files of the toy_data directory and print its path
Show the code
for full_filename in toy_data/*gz
do
echo Path : $full_filename
done
- Using the same basis, get from its name the prefix of the sample
Path : White_R1.fastq.gz
Prefix : White_R1
Path : White_R2.fastq.gz
Prefix : White_R2
Path : Orange_R1.fastq.gz
Prefix : Orange_R1
Path : Orange_R2.fastq.gz
Prefix : Orange_R2
Show the code
cd /home/rstudio/mydatalocal/training_project/
for full_filename in toy_data/*gz
do
echo Path : $full_filename
filename=$(basename $full_filename)
prefix=${filename/.fastq.gz/}
echo Prefix : $prefix
done
- As the training dataset is Paired-end each sample is printed 2 times. Search for a way to display one time the prefix of each sample only once.
Show the code
cd /home/rstudio/mydatalocal/training_project/
for full_filenameR1 in toy_data/*R1*gz
do
echo Path R1: $full_filenameR1
filename=$(basename $full_filenameR1)
prefix=${filename/_R1.fastq.gz/}
echo Prefix : $prefix
done
- Take back the code you write to get the number of sequences of one fastq file and iterate over each sample of the PE training dataset to get the number of sequences of each fastq files.
cd /home/rstudio/mydatalocal/training_project
output_dir="/home/rstudio/mydatalocal/tuto_bash/results/seq_numbers"
mkdir -p $output_dir
for full_filenameR1 in toy_data/*R1*gz
do
...
done
Show the code
cd /home/rstudio/mydatalocal/training_project
output_dir="/home/rstudio/mydatalocal/tuto_bash/results/seq_numbers"
mkdir -p $output_dir
for full_filenameR1 in toy_data/*R1*gz
do
echo Filename : $full_filenameR1
fastq_dirname=$(dirname $full_filenameR1)
filenameR1=$(basename $full_filenameR1)
filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"
prefix=${filenameR1/.R1.fastq.gz/}
output_file="$output_dir/$prefix.nb_sequences.txt"
echo Prefix: $prefix # display White
nb_lines_R1=$(zcat $full_filenameR1 | grep "@SRR1" | wc -l)
echo "$filenameR1: $nb_lines_R1"> $output_file
nb_lines_R2=$(zcat $full_filenameR2 | grep "@SRR1" | wc -l)
echo "$filenameR2: $nb_lines_R2">> $output_file
#to check
cat $output_file
done
#to check the output dir
ls $output_dir
Now it’s time to get to work on your project!
Remember, if you’re looking for a command, have an incomprehensible bug or don’t know where to start :
- ask other members of your group
- the internet is also your friend (it’s not cheating!!)
- ask one of the trainers
These materials have been developed by members of UE NGE teaching team of the ENS de Lyon. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.