1 What is “bash” ?
Most bioinformatic tools do not have a graphical interface. To use them, you need some command-line basics; bash is the name of one language used to “talk” to a computer through a terminal.
2 Learn the basics of using the command line
The goal is not to turn you into bioinformatics specialists but to give you basic skills in running commands. From there you will be able to find, follow and understand tutorials and help available in the ‘jungle of the Internet’.
2.1 Connecting to your Virtual Machine (VM)
To start, you need a Linux environment where you can train yourself. For that, go back to the IFB website (https://biosphere.france-bioinformatique.fr/cloud/), open the deployment frame in “myVM” and connect to the VM you started at the beginning of the session.
To connect to an instance you need 3 pieces of information: a login, a password and the IP address of your VM:
- Click on the deployment “Params” link (right), then copy your Rstudio_PASSWORD
- Click on deployment “https” link (right), which will send you to the correct IP.
- Fill in the login (“rstudio”) and your password in the popup window, and press “Enter”.
- You must accept the security warning to access the machine.
- If you are using Firefox, it may refuse to bypass the security warning; in that case use another browser, e.g. Chrome.
2.2 Discovering the Rstudio interface and access to its terminal
By default, the “R” console is open on the left-hand side, because R is the expected language in RStudio. Here, however, we are going to use another RStudio feature to talk directly to the computer: the terminal.
You can also see a file browser in the bottom right-hand corner.
2.3 Access to the terminal
Open the “Terminal” tab of your Rstudio interface to access the terminal on your VM.
2.4 What is a terminal ?
A terminal is an interface used to issue commands to a computer by typing only text. You already use a lot of commands indirectly through the graphical interface of your computer (like creating or opening folders, for example).
2.5 Why use a terminal ?
There are plenty of reasons. Here are 4:
- Automation/scalability. With one command, it is possible to create 100 folders instantly (see the sketch after this list), which is much more tedious through a graphical interface.
- Flexibility. Commands can have a wide array of inputs and options.
- Availability. A graphical interface is a luxury! Some programs (even some operating systems) just… don’t have any. Thus, if you want to use a lot of the programs out there, you need the terminal.
- Reproducibility. A set of commands can be stored in a script, which makes them easy to reproduce.
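As an illustration of the first point, here is a minimal sketch (the commands themselves are introduced later in this tutorial; the folder names are arbitrary):
mkdir folder_{1..100}   # create folder_1 ... folder_100 with a single command
ls | wc -l              # count how many entries the current directory now contains
rm -r folder_{1..100}   # remove them all again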
2.6 What does a terminal look like ?
The terminal’s input line has the following layout:
login@hostname:~/path$
We can split this into two parts:
• who is executing the commands: the name of the user ( login ), @ , and the name of the machine ( hostname );
• where on the system the command is being executed: the current directory, i.e. the spot “from where” the command is run ( ~/path ).
A $ sign ends the input prompt.
The R console prompt, in contrast, looks like this:
>
2.7 Launch your first bash commands
You have to imagine that you are a cursor in a tree (or hierarchy) of directories and files on your computer. You can move from one directory to another, and launch commands from your position in the file/folder hierarchy you are in, but also on the computer as a whole.
2.7.1 What is a command ?
A command is a text you pass to your terminal. It starts a system process on the computer, resulting in an action. Commands consist of the command name, followed by arguments: a series of space-delimited strings that modify how the process will behave.
An example will make this easier to understand. cowsay is a (fun and useless) program that takes a message as argument.
Type this command in your terminal and press Enter to execute it.
cowsay Hello
Commands can also take a special kind of argument called an option. An option starts with - and can be followed, or not, by another argument. Below, the -f option is used:
cowsay -f dragon Hi
What’s the point of options? Non-optional arguments (above, “Hi”) are related to the command’s goal, or default behaviour (here, outputting a message). Options fine-tune the command’s behaviour (above, who is speaking).
Warning: each space is interpreted as a new argument, so you have to use quotes to protect an argument containing spaces.
cowsay -e "@@" "What! A talking dragon!"
The following command will list all available animals that can be made to speak.
cowsay -l
Many commands provide help on their usage through the option “-h” or “--help”. You can also find more detailed help on the Internet.
cowsay -h
There are plenty of (more useful) programs. We will cover the basics.
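The exercises below assume a small folder tree already exists under /home/rstudio/mydatalocal/tuto_bash. In case it is not there yet, here is a minimal sketch to create it and to look around (pwd prints your current position, ls lists the contents of a directory, mkdir -p creates nested folders):
cd /home/rstudio/mydatalocal          # go to the ephemeral storage area of the VM
mkdir -p tuto_bash/folderA/folderA_1  # create the nested folders used in the next exercises
cd tuto_bash                          # move into the tutorial folder
pwd                                   # print the absolute path of the current directory
ls                                    # list what it contains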
2.7.3 Absolute/Relative paths
To know where your files are, the machine uses paths. There are 2 kinds of paths, relative and absolute.
- A relative path is “how to get somewhere from my current position”. Relative paths use the “.” and “..” notations: “.” refers to your current directory and “..” to the directory above it. “../../” means the parent directory of your parent directory (remember the file tree analogy!)
cd tuto_bash/folderA/folderA_1
pwd
cd ../
pwd
cd ../../
pwd
cd # In the absence of an argument, returns to your starting directory (your "home").
- Absolute paths all start with “/” , which is the root of the file system (remember the file tree analogy! :)). Absolute paths are useful because (unlike relative paths) they are correct no matter where you are on the machine.
cd /home/rstudio/mydatalocal/tuto_bash
*
-> the wildcard: a star matches any number of any characters, e.g. folder* matches all files/folders whose names start with “folder”.
ls folder*
Using the commands we’ve just seen:
- Make sure you are inside tuto_bash. What is its absolute path?
Show the code
# Answer: check the prompt, and
pwd
#or
cd /home/rstudio/mydatalocal/tuto_bash
- Make a new directory inside tuto_bash named folderB/ and create a subdirectory folderB/folderB_1
Show the code
cd /home/rstudio/mydatalocal/tuto_bash
mkdir -p folderB/folderB_1
#or
mkdir folderB
cd folderB
mkdir folderB_1
- Go into the last subdirectory you created (/home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1)
Show the code
#with an absolute path
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
pwd
#or a relative path from tuto_bash
cd /home/rstudio/mydatalocal/tuto_bash
cd folderB/folderB_1
pwd
#or a relative path from folderB
cd /home/rstudio/mydatalocal/tuto_bash/folderB
cd folderB_1
pwd
- From here, create a subdirectory folderA_2 in folderA
Show the code
#with an absolute path
mkdir -p /home/rstudio/mydatalocal/tuto_bash/folderA/folderA_2
#or a relative path from folderB_1
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
mkdir -p ../../folderA/folderA_2
- List all the directories of folderA
Show the code
#with an absolute path
ls /home/rstudio/mydatalocal/tuto_bash/folderA
#or a relative path from folderB_1
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
ls ../../folderA
2.7.4 Writing and Reading Text Files
Go back to the /home/rstudio/mydatalocal/tuto_bash/ directory
cd /home/rstudio/mydatalocal/tuto_bash/
touch
-> create an empty file
touch empty_file.txt
You can also easily create a new file using the bottom-right panel or File > New File > Text File.
- Create a text file and name it toto.txt,
- Write something inside.
- Save it.
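(Alternatively, you can create the same kind of file from the terminal; a minimal sketch using the echo command and the redirection operator introduced just below:)
echo "my first text file" > toto.txt   # create toto.txt containing one line of text
cat toto.txt                           # display its content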
echo
-> display text in the terminal
echo "hello-world"
- redirection with > and >>
-> > redirects the output of a command to a text file; >> appends the output of a command to the end of an existing file.
echo "hello-world" > tata.txt
echo "hello-world 2" >> tata.txt
You can also redirect output of other commands.
cowsay -f dragon "hello-world" > dragon.txt
cowsay "hi" >> dragon.txt
cat
-> concatenate files and display output
cat dragon.txt tata.txt
It is often used to display the content of a single file:
cat dragon.txt
The cat command is useful for viewing the contents of smaller files, but if a file contains hundreds of lines of text, it is overwhelming to have everything printed to the terminal at once. To see only part of a file, we can use the commands head and tail to display its first few or last few lines, respectively.
head and tail
-> display the first or last lines of a file
head dragon.txt
tail dragon.txt
You can specify the number of lines you want to display:
head -n 5 dragon.txt
- Create 2 new files named turtle.txt and sheep.txt, each containing an animal saying something. You will use cowsay -f turtle message and redirection.
Show the code
cowsay -f turtle hello > turtle.txt
cowsay -f sheep hello > sheep.txt
- Concatenate them and redirect the result in a file turtle_and_sheep.txt
Show the code
cat turtle.txt sheep.txt > turtle_and_sheep.txt
- Display the entire file, only the first 8 lines, the last 8 lines.
Show the code
cat turtle_and_sheep.txt
head -n 8 turtle_and_sheep.txt
tail -n 8 turtle_and_sheep.txt
- List all files of your current directory and save the list in list_files.txt (try to write in the first line: “All the files of this directory:”)
Show the code
echo "All the files of this directory:" > list_files.txt
pwd >> list_files.txt
echo >> list_files.txt
ls >> list_files.txt
cat list_files.txt
2.7.5 Copying and Removing Files
cp
-> copy a file to another path: cp <file> <file|folder>
If you want to copy a file within the same directory (you have to change the name):
cp dragon.txt dragons_cp.txt
If you want to copy a file into another directory (you can keep the same name or change it):
cp dragon.txt folderA/
ls folderA/
cp dragon.txt folderA/dragon_cp.txt
ls folderA/
cp -r
-> copy a directory
cp -r folderA folderAprime
mv
-> move a file/folder to another path: mv <file> <file|folder>
mv can be used to rename a file/folder or to move it to another path.
ls
mv dragon.txt dragons_new.txt
ls
mv folderAprime folderC
If you want to move a file into another directory (you can keep the same name or change it):
mv dragons_new.txt folderA/
ls folderA/
mv dragon_cp.txt folderA/dragon_cp_NEW.txt
ls folderA/
rm
-> remove file(s). Deleting is permanent!
rm empty_file.txt
rm -r
-> remove file(s) and directories recursively. Warning: deleting is permanent!
mkdir -p folderC/folderC_1
cp dragon.txt folderC/folderC_1
ls folderC/folderC_1
rm -r folderC
ls folderC/folderC_1 #raises an error
- Copy turtle.txt in folderB/ and copy sheep.txt in folderB/folderB_1
Show the code
cp turtle.txt folderB/
cp sheep.txt folderB/folderB_1/
- Move folderB/turtle.txt in folderB/folderB_1/
Show the code
mv folderB/turtle.txt folderB/folderB_1/
- Move all files of folderB/folderB_1/ to folderA/folderA_1/. You will use the * (wildcard).
Show the code
ls folderB/folderB_1/
mv folderB/folderB_1/* folderA/folderA_1/
ls folderB/folderB_1/
ls folderA/folderA_1/
- List the files of your current directory and remove the files starting with sheep
Show the code
ls
rm sheep*
ls
- Copy the subdirectory folderB_1 to folderD
Show the code
cp -r folderB/folderB_1/ folderD
- Use the command tree to get the file tree
Show the code
cd /home/rstudio/mydatalocal/tuto_bash
tree
Show the code
└── tuto_bash
├── dragons_cp.txt
├── empty_file.txt
├── folderA
│ ├── dragon_cp.txt
│ ├── dragons_new.txt
│ ├── dragon.txt
│ ├── folderA_1
│ │ ├── sheep.txt
│ │ └── turtle.txt
│ └── folderA_2
├── folderB
│ └── folderB_1
├── folderD
├── list_files.txt
├── tata.txt
├── turtle_and_sheep.txt
└── turtle.txt
Most commands have multiple options specified using a single or double dash (e.g. -v, --verbose). All options should be detailed on the help page of a command, which you can usually print in the terminal with the --help option.
cp --help
To get more information, you can also use the man command:
man cp
2.8 Combining Commands in scripts
So far you have learned how to use commands. You’ll soon find, however, that large and complex blocks of code are tedious to write out by hand every time you want to run them. It is also difficult to debug a long string of code that you wrote in the Terminal.
Instead, we can put everything into a script: a file containing code. This allows you to make your code compact and easy to move between directories if you need to. It also makes debugging much easier.
A bash script (bash being the language you use to communicate) is a simple text file, usually starting with #!/bin/bash and named with a ‘.sh’ extension, in which you list commands to run.
Here is an example of a simple script:
tuto_bash/my_script.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/tuto_bash
echo "Run in:"
pwd
mkdir -p folder_1
echo "Made a folder named folder_1"
touch folder_1/text_file.txt
echo "I want to eat strawberries." > folder_1/text_file.txt
echo "Created a file in folder_1 with some text"
Open a text file in the RStudio text editor using File -> New File -> Shell Script, copy the script text into the new file, and save it as “my_script.sh” in the tuto_bash folder.
Run it with the bash command:
cd /home/rstudio/mydatalocal/tuto_bash
bash my_script.sh
You can check that the script worked using the commands previously shown.
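For example, a minimal check (assuming you are still in tuto_bash, where the script created folder_1):
ls folder_1                  # the new folder should contain text_file.txt
cat folder_1/text_file.txt   # display the text written by the script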
Comments
Always add comments to your scripts to keep track of what they do. In Bash, lines starting with a # symbol are ignored by the computer.
folderA/my_script_comments.sh
#!/bin/bash
# This is a comment
#Display the working directory
echo "Run in:"
pwd
#Made a folder named folder_1
mkdir -p folder_1
# Created a file in folder_1 with some text
echo "I want to eat strawberries." > folder_1/text_file.txt
cowsay "me too" >> folder_1/text_file.txt
Add a couple of comments to the script we created above to specify what the commands do.
In longer and more complex scripts, it’s important (necessary, even) to document each section properly.
You’ll be thankful when reading properly commented code from others (or from past you), rather than spending time figuring out what each part does.
A script can be run from another folder; you just have to adjust the path:
cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1/
bash /home/rstudio/mydatalocal/tuto_bash/folderA/my_script_comments.sh
You can use tree to search for the “folder_1” directories you created with the previous script:
cd /home/rstudio/mydatalocal/tuto_bash/
tree
(Usually, we prefer to use relative paths.)
If you run a script from another directory, you’ll have to pay attention to the paths used in your script. Indeed, the working directory will be the directory from which you launch the script, not the directory containing the script. A good habit is to set the working directory at the beginning of your script, for example:
sandbox/my_template_script.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/tuto_bash
[...]
You can redirect the output of your script to a file to keep a log of its execution (if it prints messages):
cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh > script.log
2.9 Helpful resources
- A detailed bash script cheatsheet: https://devhints.io/bash
- Another great bash cheatsheet
- A full course introduction to Shell/Bash
- Another one : https://andysbrainbook.readthedocs.io/en/stable/unix/
- Google (and by extension, Stackoverflow) is the programmer’s best friend :)
3 Get a toy dataset for the training
As we want to keep this data between all the sessions, you will use the permanent storage shared between all the students.
For the rest of the training, we will work in a new personal directory: /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project.
Warning: replace “YourName” with your actual name.
Create this directory and go inside.
mkdir -p /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
You can also remove the tuto_bash directory:
rm -r /home/rstudio/mydatalocal/tuto_bash
The ephemeral storage area of your VM is mounted on the directory /home/rstudio/mydatalocal/, so for the rest of the training you must work only in the folder /home/rstudio/ifbdata/ueens_lyon_ngs24/, which is permanent.
The dataset we are using for this training is part of a larger study described in Ryu et al., G3 2022. The authors assembled and annotated the genome of the clown fish (Amphiprion ocellaris). Here, we will just use a very small subset of two paired-end RNA-seq samples, one from White skin and the other from Orange Skin.
3.1 Get the raw fastq files
Toy input data are stored in this directory: /home/rstudio/ifbdata/ueens_lyon_ngs24/projects_raw_data/data_bash_training/
As they are very small, you can copy the input data into your personal directory using the script below, which can be saved as training_project/01_copy_toydata.sh.
training_project/01_copy_toydata.sh
#!/bin/bash
mkdir -p /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
mkdir -p 01_toy_data/
cd 01_toy_data/
cp -r /home/rstudio/ifbdata/ueens_lyon_ngs24/projects_raw_data/data_bash_training/toy_data/* .
echo "Files:"
ls
Terminal
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
bash 01_copy_toydata.sh
For your real project, you will directly use the input data from one of these folders, to avoid replicating big files:
projects_raw_data/
├── data_bash_training
├── data_chipseq_petunia
├── data_envgenomics
├── data_meso_sexchromevol
├── data_scrnaseq
└── data_spatial_transcriptomcs
3.2 Run the quality cleaning on the training dataset
Now that we have some data, you can run your first bioinformatic analysis (as an example): quality cleaning of your raw sequencing data.
fastp is a commonly used piece of software that provides a simple way to do this. (You may need to use a different tool depending on the specific features of your data.)
The template command to run fastp is :
#Not run
fastp --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz --out1 sample_trim_R1.fastq.gz --out2 sample_trim_R2.fastq.gz
CONDA
Bioinformatics tools are not installed on computers by default; you have to install them. If you are working on several projects, you may need tools whose dependencies are incompatible. For greater convenience, we will be using conda, a package and environment management system that allows you to install several versions of software packages and their dependencies and to switch easily from one to another. We have prepared several environments for you with the tools you will need for the various projects.
To activate an environment, you have to type:
conda activate uengs
You can see the name of the active environment at the beginning of your prompt.
(uengs) rstudio@351e742a1b2f:~$
To deactivate it:
conda deactivate
And, you come back to the “base” environment.
(base) rstudio@351e742a1b2f:~$
You can list the available environments by typing:
conda env list
# conda environments:
#
base /home/rstudio/conda
meso /home/rstudio/conda/envs/meso
scrnaseq /home/rstudio/conda/envs/scrnaseq
spatial /home/rstudio/conda/envs/spatial
uengs * /home/rstudio/conda/envs/uengs
If you need to install a “common” tool during the practical, you have to activate an environment and type something like this, following the tool documentation:
conda activate ENV_NAMES
conda install -c bioconda TOOLS
You can also list all the tools of an environment and their dependencies:
conda activate uengs
conda list
It can be useful to get the version of a tool:
conda list | grep fastp
Environments are automatically created when your VM starts up, which may take some time, so don’t worry if you don’t see them immediately.
You will have to activate your conda environment at the beginning of your script to be able to “see” and execute tools.
SCRIPT.sh
#!/bin/bash
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
conda activate ENV_NAMES
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/project_dir/...
...
Back to your quality control analysis:
First, get the path of a fastq file of the training dataset
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
ls 01_toy_data/*
01_toy_data/White_R1.fastq.gz
01_toy_data/White_R2.fastq.gz
01_toy_data/Orange_R1.fastq.gz
01_toy_data/Orange_R2.fastq.gz
Note the difference between the command with the wildcard “*”
ls 01_toy_data/*
01_toy_data/White_R1.fastq.gz
01_toy_data/White_R2.fastq.gz
01_toy_data/Orange_R1.fastq.gz
01_toy_data/Orange_R2.fastq.gz
and without the “*”
ls 01_toy_data/
White_R1.fastq.gz
White_R2.fastq.gz
Orange_R1.fastq.gz
Orange_R2.fastq.gz
Now we “just” have to adapt the command to run fastp on a file of the training dataset.
By default, fastp writes 2 output files, fastp.html and fastp.json, to the current directory along with the execution statistics. We will use the --json and --html arguments to give them unique names and thus avoid the files being overwritten when the command is run on other samples.
You can save the commands in a script : “02_toy_data_cleaning.sh”
training_project/02_toy_data_cleaning.sh
#!/bin/bash
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
conda activate uengs
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
mkdir -p 02_toy_data_cleaning_fastp/
fastp --in1 01_toy_data/White_R1.fastq.gz \
--in2 01_toy_data/White_R2.fastq.gz \
--out1 02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz \
--out2 02_toy_data_cleaning_fastp/White_trim_R2.fastq.gz \
--json 02_toy_data_cleaning_fastp/White_fastp_report.json \
--html 02_toy_data_cleaning_fastp/White_fastp_report.html
When a command is too looooong and becomes hard to read, you can use \ to split it over several lines (with no space after the backslash!).
fastp --in1 01_toy_data/White_R1.fastq.gz \
--in2 01_toy_data/White_R2.fastq.gz \
--out1 02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz \
--out2 02_toy_data_cleaning_fastp/White_trim_R2.fastq.gz \
--json 02_toy_data_cleaning_fastp/White_fastp_report.json \
--html 02_toy_data_cleaning_fastp/White_fastp_report.html
is equivalent to and easier to read than
fastp --in1 01_toy_data/White_R1.fastq.gz --in2 01_toy_data/White_R2.fastq.gz --out1 02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz --out2 02_toy_data_cleaning_fastp/White_trim_R2.fastq.gz --json 02_toy_data_cleaning_fastp/White_fastp_report.json --html 02_toy_data_cleaning_fastp/White_fastp_report.html
Now, you can run your cleaning script.
Terminal
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
bash 02_toy_data_cleaning.sh
We can now look at the results, using tree or ls 02_toy_data_cleaning_fastp, or using the file browser of RStudio (bottom-right panel).
The result will be in the 02_toy_data_cleaning_fastp folder
02_toy_data_cleaning_fastp/
├── White_fastp_report.html <------- trimming report
├── White_fastp_report.json
├── White_trim_R1.fastq.gz
└── White_trim_R2.fastq.gz
If you want to look at the trimming report, you can find it in the bottom-right panel of RStudio and open it with “View in Web Browser”.
- Modify the script to clean raw sequencing files for both samples.
training_project/02_toy_data_cleaning.sh
Show the code
#!/bin/bash
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
conda activate uengs
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
mkdir -p 02_toy_data_cleaning_fastp/
fastp --in1 01_toy_data/White_R1.fastq.gz \
--in2 01_toy_data/White_R2.fastq.gz \
--out1 02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz \
--out2 02_toy_data_cleaning_fastp/White_trim_R2.fastq.gz \
--json 02_toy_data_cleaning_fastp/White_fastp_report.json \
--html 02_toy_data_cleaning_fastp/White_fastp_report.html
fastp --in1 01_toy_data/Orange_R1.fastq.gz \
--in2 01_toy_data/Orange_R2.fastq.gz \
--out1 02_toy_data_cleaning_fastp/Orange_trim_R1.fastq.gz \
--out2 02_toy_data_cleaning_fastp/Orange_trim_R2.fastq.gz \
--json 02_toy_data_cleaning_fastp/Orange_fastp_report.json \
--html 02_toy_data_cleaning_fastp/Orange_fastp_report.html
In Rstudio you can open several terminals at once to avoid waiting for the completion of one command.
4 How to organize your project directory ?
For the moment we placed all scripts at the root of our project directory. In fact this is a very bad idea.
It is advisable to put all files related to the same project in the same folder, whether they are:
- the raw data,
- the scripts,
- the results,
- the project documentation,
- …
but to separate the different types of data into subdirectories.
This way, you will more easily find your way around and avoid mixing up files or accidentally deleting them.
For example, your project directory might look like this:
project_name/
├── README.md # overview of the project
├── data/ # data files used in the project
├── results/ # results of the analysis (data, tables, figures)
├── scripts/ # contains all code in the project
│ ├── 01_...
│ ├── 02_...
│ └── ...
└── doc/ # documentation for your project
└── ...
Moreover, to find your way around and to ensure reproducibility, you need to add a file, often named README.md, at the root of your folder; it should contain all the information needed to pick up the project. This is also the file that will be visible on the home page of your project on GitLab. So when someone wants to (re)work on the project, they will be able to open the folder and know where to go to see and understand what has been done. This person can be a collaborator, your manager, or simply yourself 6 months later.
- In the training project folder, create a “scripts” folder and move scripts inside.
Show the code
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
mkdir -p scripts
mv 0*sh scripts
- Do the same thing for results files
Show the code
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
mkdir -p results
mv *cleaning* results/
4.1 Scripts and data naming
Another part of staying organized is making sure that all the directories and filenames for an analysis are as consistent as possible. You want to avoid names like alignment1.bam, and rather have names like 20170823_kd_rep1_gmap-1.4.bam that provide a basic level of information about the file.
A good name must be:
- machine readable:
  - avoid spaces, punctuation and accented characters
  - regular-expression and globbing friendly (ex: ls CondA*)
  - easy to compute on (ex: deliberate use of delimiters, CondA_rep1.tsv)
- human readable:
  - the name contains info on the content (ex: 01_download_training_dataset.sh)
  - plays well with default ordering (ex: put something numeric first for scripts, or use a date for results)
This link and this slideshow have some good guidelines for file naming dos and don’ts.
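For example, assuming you moved your numbered scripts into scripts/ as in the previous exercise, the default alphabetical ordering of ls should already follow the analysis order:
ls scripts/
01_copy_toydata.sh
02_toy_data_cleaning.sh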
5 More useful bash commands
In the last part, you saw some basic bash commands. In this session we will see some new ones to complete your bash knowledge. As previously, the idea is only to give you the commands useful for this training, not a complete view of bash.
5.1 New bash commands
Create a new “tuto_bash” directory for the exercises in /home/rstudio/mydatalocal
cd /home/rstudio/mydatalocal
mkdir -p tuto_bash
cd tuto_bash
grep
-> print only the lines of a file that contain a given pattern
cowsay -f dragon "Hello Hello!" > dragon.txt
cat dragon.txt
grep "-" dragon.txt
wc -l
-> to count the number of lines of a file
wc -l dragon.txt
wc
can do other things but we will use only this one.
basename
-> get only the file name from a path (the file does not need to exist)
basename toy_data/sample1.fastq.gz
dirname
-> get only the name of the directory containing the file from a path (the file does not need to exist)
dirname toy_data/sample1.fastq.gz
- pipe commands with |
The pipe forwards the output of one command as the input of the next command. This way you can execute several consecutive commands. For example, if you want to count the number of lines containing a given pattern in a file:
grep "-" dragon.txt > occurence.txt
wc -l occurence.txt
You can instead “pipe” the commands using a pipe | (AltGr+6 on PC, Alt + Shift + L on Mac) between the two commands to link them:
grep "-" dragon.txt | wc -l
(By the way, grep already has an option to do this directly, grep -c, but this was for the example.)
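For completeness, a minimal sketch of that shortcut:
grep -c "-" dragon.txt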
gunzip and zcat
-> decompress files
As previously seen, fastq files are large files and you should store them in a compressed form whenever possible. On Unix, .gz compression (gzip) is generally preferred over .zip, but the idea is the same.
To check the contents of a compressed file, you can unzip it with gunzip, but this takes time and disk space. To do it quickly, there is a command called zcat that decompresses a gz file on the fly and displays its contents. So zcat piped (|) with head will display the first few lines of a gz file without taking up time or disk space.
zcat /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/01_toy_data/White_R1.fastq.gz | head -n 8
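If you actually need to (de)compress a file on disk, a minimal sketch (using dragon.txt from the earlier tuto_bash exercises as an assumed example):
gzip dragon.txt        # compresses the file, producing dragon.txt.gz and removing dragon.txt
gunzip dragon.txt.gz   # decompresses it, restoring dragon.txt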
Using the commands we’ve just seen:
- Display the first 100 sequences of a training fastq file
Show the code
zcat /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/01_toy_data/White_R1.fastq.gz | head -n 400
- Count the number of sequences of a training fastq file and of the corresponding cleaned files
Show the code
zcat /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/01_toy_data/White_R1.fastq.gz | grep "@SRR1" | wc -l
zcat /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz | grep "@SRR1" | wc -l
5.2 Store strings in variables
In scripts, we frequently re-use bits of text: for example, a path to a file. It is more convenient to keep these in a named “variable” and then re-use that variable in the script. This way, if you need to modify the file path, you only need to modify the variable once, and not all the occurrences of the file path.
To define a variable, you declare it with an =, without spaces around it.
variable1="toto"
Quotes are not mandatory unless the string contains spaces.
variable1=toto
Then, you can refer to the variable’s content by using a $ before its name. To display the content of a variable you can use echo.
echo $variable1
echo variable1
echo "variable1: $variable1"
You can also combine variables to define a new one
variable1="TOTO"
variable2="TATA"
echo variable1: $variable1
echo variable2: $variable2
variable3="$variable1 $variable2"
echo variable3: $variable3
It is advisable to use descriptive variable names rather than generic ones like “variable1”, “variable2”.
Sometimes, you may need to enclose the variable name in curly brackets ({}) to distinguish it from the following string.
For example, if you want to create a filename using the content of a variable called $sample as a prefix, such as $sample_rep1.fastq.gz, you need to add curly brackets around sample (${sample}_rep1.fastq.gz). Otherwise, the terminal will attempt to interpret $sample_rep1 as a single variable.
sample="sampleA"
filenameX=$sample_rep1.txt #the variable sample_rep1 doesn't exist, so filenameX=.txt
filenameY=${sample}_rep1.txt
echo filenameX: $filenameX
echo filenameY: $filenameY
Variables can also capture the output of a command, using $(command). (Note the use of parentheses () and not curly brackets {}.)
list_files=$(ls)
working_directory=$(pwd)
echo "$working_directory contains the following files:
$list_files"
You can also use a pipe inside:
nb_dash=$(grep "-" dragon.txt | wc -l)
echo $nb_dash
Using variables will make your scripts easier both to write and to read.
filename=dragon.txt
new_filename=dragon_bis.txt
echo "$filename will be cp to $new_filename"
cp $filename $new_filename
You can also do pattern substitutions in a variable, using ${variable/pattern/replace} to replace the first occurrence of “pattern”.
filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/R1/R2}
If you want to remove a pattern, you can use the same syntax but without a replacement string. This is a useful method to remove the extension of a file name.
filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/.R1.fastq.gz/}
You can save the output of this pattern substitution in a new variable.
filenameR1=sample1.R1.fastq.gz
filenameR2=${filenameR1/R1/R2}
echo R1: $filenameR1 R2: $filenameR2
For the following exercise, we need Paired-End (PE) data. PE fastq file names differ only by “R1” or “R2”: the file containing reads 1 has “R1” in its name and the file containing reads 2 has “R2”. Therefore, from one you can deduce the other.
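For example, combining this with the pattern substitution seen above (a minimal sketch with one of the training sample names):
filenameR1=White_R1.fastq.gz
filenameR2=${filenameR1/R1/R2}
echo $filenameR2   # White_R2.fastq.gz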
You will work in the /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/ directory to do the following exercise.
The objectives are to rewrite the training_project/02_toy_data_cleaning.sh script using variables, to simplify it and also to re-use it more easily on other samples.
Using the commands we’ve just seen:
- Complete the commands to build the complete path of the R1 (White_R1.fastq.gz) and R2 (White_R2.fastq.gz) files, using a variable prefix containing the prefix of the sample name (“White”).
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/
fastq_data_dir="01_toy_data/"
prefix="White"
filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=...
filenameR2=...
path_filenameR2=...
echo $path_filenameR1 # display 01_toy_data/White_R1.fastq.gz
echo $path_filenameR2 # display 01_toy_data/White_R2.fastq.gz
ls $path_filenameR1 # to check that the path exists
ls $path_filenameR2 # to check that the path exists
Show the code
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/
fastq_data_dir="01_toy_data/"
prefix="White"
filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=$fastq_data_dir/$filenameR1
filenameR2=${prefix}_R2.fastq.gz
path_filenameR2=$fastq_data_dir/$filenameR2
echo $path_filenameR1 # display 01_toy_data/White_R1.fastq.gz
echo $path_filenameR2 # display 01_toy_data/White_R2.fastq.gz
ls $path_filenameR1 # to check that the path exists
ls $path_filenameR2 # to check that the path exists
- Using these variables, rewrite training_project/02_toy_data_cleaning.sh. You may start with one sample, and also create a variable for the output directory: trim_data_dir="02_toy_data_cleaning_fastp2/"
training_project/02_toy_data_cleaning2.sh
Show the code
#!/bin/bash
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
conda activate uengs
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
fastq_data_dir="01_toy_data/"
trim_data_dir="02_toy_data_cleaning_fastp2/"
mkdir -p $trim_data_dir/
prefix="White"
filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=$fastq_data_dir/$filenameR1
out_filenameR1=$trim_data_dir/$filenameR1
filenameR2=${prefix}_R2.fastq.gz
path_filenameR2=$fastq_data_dir/$filenameR2
out_filenameR2=$trim_data_dir/$filenameR2
path_report=$trim_data_dir/${prefix}_fastp_report
fastp --in1 $path_filenameR1 \
--in2 $path_filenameR2 \
--out1 $out_filenameR1 \
--out2 $out_filenameR2 \
--json $path_report.json \
--html $path_report.html
prefix="Orange"
filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=$fastq_data_dir/$filenameR1
out_filenameR1=$trim_data_dir/$filenameR1
filenameR2=${prefix}_R2.fastq.gz
path_filenameR2=$fastq_data_dir/$filenameR2
out_filenameR2=$trim_data_dir/$filenameR2
path_report=$trim_data_dir/${prefix}_fastp_report
fastp --in1 $path_filenameR1 \
--in2 $path_filenameR2 \
--out1 $out_filenameR1 \
--out2 $out_filenameR2 \
--json $path_report.json \
--html $path_report.html
5.3 “For” loops to iterate on a list
Bioinformatic analysis often involves executing numerous commands, but with only slight variations each time you run a new command.
For instance, you might have to analyze sample1, then sample2, sample3, and so forth. To save time, we employ something known as a loop, specifically a for-loop.
Let’s clarify this concept with a straightforward example. Suppose you wish to print the letters a, b and c. You could achieve this by typing echo a and hitting return, then echo b, and echo c. While this accomplishes the task, it is evident that this method would become cumbersome if you had to print dozens or hundreds of letters.
How can we simplify this process? You probably noticed that each time we ran the command, we changed the letter after the echo command. A for-loop automates this process for you.
Here is an example of a for-loop to print the letters a, b, c, d:
for X in a b c d
do
echo $X
done
The for-loop comprises three sections:
- The first section is the Declaration: it begins by assigning the first item after in to the variable X; in this case, it assigns the value "a" to X. The items after in are called the List.
- The next section is the Body, which runs the commands written after do, replacing the variable with whichever value is currently assigned to it - for the first iteration, this will be the letter a. Since items remain in the List, the loop goes back to the Declaration and assigns the next item in the List to the variable X; in this case, the letter b. Then the Body is run again, and the process repeats until the end of the List is reached.
- The final section, called the End, contains only the word done, meaning the loop exits after all of the items in the List have been run through the Body.
You can include additional commands to the Body section. For instance, we could modify the loop to:
for X in sheep elephant fox duck
do
# Display the value in X
echo $X
#Do something independent of the variable X
cowsay -f turtle "Hello"
#Do something dependent of the content of X ($X)
cowsay -f $X "Hello, I'm a $X"
#Do another thing, if you want
cowsay -f turtle "Welcome, $X"
done
You can use a variable, as seen previously, to store the list of items to iterate over. Additionally, it is preferable to use a more descriptive variable name than “X” for iteration (for example “letter”).
list="a b c d"
for letter in $list
do
echo Do something
echo Do something with $letter
echo Do another thing
done
Furthermore, the list to iterate over can be automatically generated. For instance, if you wish to iterate over a list of files you can do:
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/
list_files=$(ls 01_toy_data/*gz)
echo $list_files
echo "Start:"
for file in $list_files
do
echo $file
echo Do something
echo Do something with $file
echo Do another thing
done
echo "End"
#OR
for file in 01_toy_data/*gz
do
echo $file
echo Do something
echo Do something with $file
echo Do another thing
done
Note: indentation is not important for the machine, but it helps humans read the code.
Using the commands we’ve just seen:
- Iterate over each sample and print the paths of the R1 and R2 fastq files contained in the 01_toy_data directory.
Show the code
fastq_data_dir="01_toy_data/"
prefix_list="Orange White"
#or
prefix_list="Orange
White"
for prefix in $prefix_list
do
echo prefix: $prefix
filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=$fastq_data_dir/$filenameR1
filenameR2=${prefix}_R2.fastq.gz
path_filenameR2=$fastq_data_dir/$filenameR2
echo R1: $path_filenameR1
echo R2: $path_filenameR2
done
prefix: Orange
R1: 01_toy_data//Orange_R1.fastq.gz
R2: 01_toy_data//Orange_R2.fastq.gz
prefix: White
R1: 01_toy_data//White_R1.fastq.gz
R2: 01_toy_data//White_R2.fastq.gz
- On the same basis, rewrite training_project/02_toy_data_cleaning2.sh using a for-loop. Use a different output directory (02_toy_data_cleaning_fastp3/) to check that your script is OK.
training_project/02_toy_data_cleaning3.sh
Show the code
#!/bin/bash
CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh
conda activate uengs
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
fastq_data_dir="01_toy_data/"
trim_data_dir="02_toy_data_cleaning_fastp3/"
mkdir -p $trim_data_dir/
prefix_list="Orange
White"
for prefix in $prefix_list
do
filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=$fastq_data_dir/$filenameR1
out_filenameR1=$trim_data_dir/$filenameR1
filenameR2=${prefix}_R2.fastq.gz
path_filenameR2=$fastq_data_dir/$filenameR2
out_filenameR2=$trim_data_dir/$filenameR2
path_report=$trim_data_dir/${prefix}_fastp_report
echo R1: $path_filenameR1
echo R2: $path_filenameR2
echo out_R1: $out_filenameR1
echo out_R2: $out_filenameR2
echo report: $path_report
fastp --in1 $path_filenameR1 \
--in2 $path_filenameR2 \
--out1 $out_filenameR1 \
--out2 $out_filenameR2 \
--json $path_report.json \
--html $path_report.html
done
R1: 01_toy_data//Orange_R1.fastq.gz
R2: 01_toy_data//Orange_R2.fastq.gz
out_R1: 02_toy_data_cleaning_fastp3//Orange_R1.fastq.gz
out_R2: 02_toy_data_cleaning_fastp3//Orange_R2.fastq.gz
report: 02_toy_data_cleaning_fastp3//Orange_fastp_report
[...]
- Bonus: in the case of a long list of samples, try to get the list of prefixes automatically from the list of input files. As the training dataset is Paired-End, each sample has 2 input files; you can use only the R1 files to get each sample name only once.
You may use the sed command.
sed
-> replace all instances of a pattern with another, in a file or in terminal output. (sed can do a lot of other things but we will use only this one.)
# in the content of a file
sed "s/PATTERN/SUBSTITUTE/g" file.txt
# pipe after a bash command
echo $string | sed "s/PATTERN/SUBSTITUTE/g"
# To replace only the first occurrence per line of a pattern, omit the g option
echo $string | sed "s/PATTERN/SUBSTITUTE/"
#If the pattern contains "/", you can replace "/" by "|"
echo $string | sed "s|PATTERN|SUBSTITUTE|g"
Show the code
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/
samples_list=$(ls 01_toy_data/*R1*gz | sed "s/_R1.fastq.gz//g" | sed "s|01_toy_data/||g")
echo $samples_list
Orange White
Now it’s time to get to work on your project!
Remember, if you’re looking for a command, have an incomprehensible bug or don’t know where to start :
- ask other members of your group
- the internet is also your friend (it’s not cheating!!)
- ask one of the trainers
These materials have been developed by members of UE NGE teaching team of the ENS de Lyon. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
6 Readme example
RNA sequencing HTLV project
6.1 Scripts storage
All the scripts used for this project are stored in the folder “scripts”, available at this path: /home/rstudio/ifbdata/rna_seq_htlv_xxx/scripts
6.2 Introduction
The aim of the study is to understand whether the transcriptome of T lymphocytes infected by HTLV-1 is modified when they are in contact with dendritic cells. For that, biologists put C91 cells, which are already infected with HTLV-1, in contact with dendritic cells or with control cells. HTLV-1 is an oncogenic human retrovirus that can be involved in the development of leukemia and neuromyelopathy. The aim of the experiment is to understand whether one or more genes allow infected cells to escape the immune system. For that, we analysed the RNA sequencing of the lymphocytes.
6.3 Tools used for the analysis
FastQC : 0.11.8
MultiQC : 1.14
fastp : 0.23.2
salmon : 1.10.1
tidyverse : 2.0.0
tximport : 1.26.1
biomaRt : 2.54.1
DESeq : 1.38.3
EnhancedVolcano : 1.16.0
ggpubr : 0.6.0
DOSE : 3.24.2
clusterProfiler : 4.6.2
dbplyr : 2.4.0
6.4 Download data
The data from Hélène Dutarte’s lab at CIRI were downloaded via the script : 01.RNA_seq_data_download.sh .
The different files are called: C91CT1, C91CT2, C91CT3, C91DC1, C91DC2, C91DC3. There are three replicates for each condition.
6.5 Quality checking of the data
6.5.1 Creation of all the FastQC data for the different experiments
The FastQC command is used to analyze sample quality from high-throughput sequencing data. The quality control results are stored in the folder results/02_quality_analyse_fastq. The following script is used for data quality control: 02.checking_quality_fastq.sh
For a more precise explanation of the results, you can go to the Migale bioinformatics facility.
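(For illustration only: a hypothetical sketch of the kind of command such a script could contain; the exact file names and options used in 02.checking_quality_fastq.sh may differ.)
mkdir -p results/02_quality_analyse_fastq                 # output folder must exist
fastqc C91CT1.fastq.gz -o results/02_quality_analyse_fastq  # quality report for one sample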
6.5.2 Gather all FastQC reports in a single file
MultiQC allows information from multiple FastQC reports to be grouped together on a single page. It is used to compare different samples more easily. The results are stored in the folder results/03_quality_checking_aggregate_multiqc. The following script is used to gather the FastQC reports: 03.reunite_data_multiqc.sh
6.6 Data cleaning
The fastp command is used to clean the raw data checked with FastQC and to keep only the reads that can be analysed. A wide selection of cleaning parameters is available:
- Phred nucleotide quality of at least 28.
- cutting of sequences with a Phred average of less than 28
- minimum read size
- removal of adapters
The results are stored in html and json files. The following script is used to clean the data: 04.fastp_cleanning.sh
6.7 Quantification of all our reads using Salmon tool
6.7.1 Download the transcriptome of interest to prepare the analysis by salmon
To analyze all our transcripts we need to index them to find out which gene they relate to. To do this, we need to download a database from ENSEMBL for the human genome and from NCBI for the viral genome. We download:
- all the cdna of the human genome : complementary DNA, representing the human transcriptome.
- all the ncrna of the human genome : non coding RNA.
- all the cds of the viral genome : coding DNA sequence.
The following script is used to download all these data : 05.dowload_transcriptome.sh
6.7.2 Creation of the Index for salmon
All transcripts are indexed using the salmon index command. It uses a library of short nucleotide words (k-mers), allowing it to move through the previously downloaded data and assign reads to the transcripts. This technique is called quasi-mapping. The following script is used to index all our transcripts of interest: 06.salmon_index.sh
6.7.3 Quantification of our reads
All the RNA-seq data are quantified by the salmon quant command using the indexed transcriptome as reference. The following script is used to quantify all our transcripts of interest: 06.salmon_quantification.sh
To have more information you can go on the web site Salmon explanations
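(Again for illustration only: a hypothetical sketch of salmon index and salmon quant calls; the actual file names, options and paths used in 06.salmon_index.sh and 06.salmon_quantification.sh may differ.)
salmon index -t transcriptome.fa -i salmon_index          # build the index from the downloaded transcriptome
salmon quant -i salmon_index -l A \
  -1 C91CT1_R1.fastq.gz -2 C91CT1_R2.fastq.gz \
  -o results/salmon_quant/C91CT1                          # quantify one (hypothetical) paired-end sample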
6.8 Differential expression analysis
6.8.1 Import all our quantification transcript files
The Rmarkdown R1.analyses_DESeq2.Rmd loads all the files from the salmon quantification so they can be used by DESeq2.
In this script, we are creating a mapping between the transcript and the gene. We also create a directory of gene identifiers with their name and function.
We end by creating a dss object to store the results of the DESeq2 analysis. We filter the data: we keep only genes that have more than 10 reads in more than 3 replicates.
6.8.2 PCA analysis
We use the plotPCA command, which can create a plot for the replicates, the batch effect or the conditions. In our study we can see that experiment 2 is very far from the other two due to a batch effect. We can see a replicate effect with a right shift and a condition effect with an up shift. For our analysis, and to counteract the batch effect, we will only focus on the condition effect.
For this analysis we use the script R2.PCA.Rmd, the figures can be seen on the knit.
6.8.3 Running the DESeq2 analysis
DESeq, standing for differential expression analysis, compares the quantification of transcripts in two conditions to express the differential expression of the genes. To do that we use the DESeq function. All the results are filtered on the logFoldChange with the lfcShrink command, which reduces the LFC. Then we create a table with all the results filtered on the LFC, with the gene name and the gene function. To do that we use the script R3.DESeq2.Rmd.
6.8.4 Plots for the analysis of the DESeq results
We create a volcano plot with EnhancedVolcano command, which places genes in a diagram using p-value as a function of logFoldChange.
We create a MA plot with ggmaplot command, which places genes in a diagram using LogFoldChange as a function of the mean expression.
We use the function pheatmap to create a heat map showing the differentially expressed genes in the control or in the condition tested. The blue is for the downregulated genes and the red for the upregulated genes.
The code that displays the heat map is available from Mohamed Chour.
To create these different plots showing the results of the DESeq we run the script R4.DESeq_plots.Rmd, the figures can be seen on the knit.
To have more informations you can go on the web site DESeq explanations
6.9 Analysis of the results
6.9.1 Gene set over-representation analysis
This technique tests whether the differentially expressed genes are significantly over-represented relative to a background set of genes, using a chi2 test. For the background we use all the genes tested in our experiments. We use a gene ontology that classifies all the genes into biological processes. This technique is applied first to all the genes, then to the up-regulated genes and finally to the down-regulated genes. We can show the results in a data frame to see if some genes are interesting.
Then, with the enrichGO command, we perform the GO enrichment to know which types of genes are up-regulated. We can make a dot plot to show the results, where the GeneRatio is the proportion of genes in a specific gene set among all up-regulated genes.
To have more information you can go on the web site Over-representation analysis
6.9.2 Gene Set Enrichment Analysis (GSEA)
The Gene Set Enrichment Analysis (GSEA) is a method that determines whether a defined set of genes shows statistically significant, concordant differences between two biological states. We use the KEGG pathway to determine which genes of the different pathways are up or down regulated.
This part of the code has been done by Mohamed Chour.
6.9.3 Creating a pathview plot
The pathview command allows you to color the components of a signaling pathway according to their expression regulation. The signaling pathway needs to be determined beforehand on the KEGG pathway site.
This part of the code has been done by Yevhenii Kostenko.
To run the analyses of the results the scripts R5.GO_analyses.Rmd is used.
To have more information you can go on the web site Gene set enrichment analysis
6.10 Analysis of the proteins potentially secreted
We are using the ENSEMBL data to annotate each gene with SignalP (signal peptide prediction). A signal peptide is a peptide inside the protein that allows it to be secreted. We select all the genes that are up-regulated, significantly differentially expressed, have a signal peptide, and are not expressed in the control.
The results are put in a table, sorted in decreasing order of logFoldChange (the most differentially expressed genes first).
To do that the script R6.secretome.Rmd is used.
6.11 Conclusion
With our data, we have seen that CAV1 is down-regulated in infected lymphocytes in contact with DC, whereas it is up-regulated in DC cells in contact with infected lymphocytes (mirror experiment already done).
We have also highlighted that interferon gamma (IFNG) is up-regulated in lymphocytes in contact with DC but not expressed in the control, and that it can be secreted. The pathway involved in the expression of IFNG is also affected, with up-regulated genes.