USEFUL INFORMATION: Short introduction to bash

UE NGS - ENS LYON

Author

NGS-team - 2024

1 “bash” ?

Many/most bioinformatic tools do not have graphical interfaces. To use them, you need some basics on the command line and bash is the name of one language to “talk” to a computer through a terminal.

2 Learn the basics of using the command line

Note

The goal is not to turn you into bioinformatics specialists but to give you basic skills in running commands. From there you will be able to find, follow and understand tutorials and help available on the ‘jungle of the Internet’

2.1 Connecting to your Virtual Machine (VM)

To start, you need a linux environment where you can train yourself. For that, go back to the IFB website (https://biosphere.france-bioinformatique.fr/cloud/) and go to the deployment frame in the “myVM” and connect you to the VM you started at the beginning of the session.

To connect to an instance you need 3 pieces of information: a login, a password and the IP address of your VM:

    1. Click on deployment “Params” link (right), then copy your Rstudio_PASWORD
    1. Click on deployment “https” link (right), which will send you to the correct IP.
    1. fill the login (“rstudio”), and your password in the popup window, et type “enter”.
Warning
  • You must accept the security warning to access the machine.
  • If you are using firefox, it is possible that firefox does not want to override the security warning, you must then use another browser e.g. chrome.

2.2 Discovering the Rstudio interface and access to its terminal

By default, the “R” console is open on the left-hand side. This is because R is the expected language in Rstudio, but we are going to use another Rstudio function to talk directly to the computer. To do this, we are going to use the terminal.

You can also see a file browser in the bottom right-hand corner.

2.3 Access to the terminal

Open the “Terminal” tab of your Rstudio interface to access the terminal on your VM.

2.4 What is a terminal ?

A terminal is an interface used to issue commands to a computer by typing only text. You already use a lot of commands indirectly through the graphical interface of your computer (like creating or opening folders, for example).

2.5 Why use a terminal ?

There are plenty or reasons. Here are 4 :

  1. Automation/scalability. With one command, it is possible to create 100 folders instantly, which is much more tedious through a graphical interface.
  2. Flexibility. Commands can have a wide array of inputs and options.
  3. Availability. A graphical interface is a luxury! Some programs (even some operating sytems) just. . . don’t have any. Thus, if you want to use a lot of programs out there, you need the terminal.
  4. Reproducibility. A set of commands can be stored in a script, which makes them easy to reproduce.

2.6 What does a terminal look like ?

The terminal’s input line has the following layout

login@hostname:~/path$

We can split this in two parts:

• who is executing the commands: name of the user ( login ), @ , and name of the machine ( hostname ).

• where on the system is the command being executed is indicated the spot “from where” the command is executed ( ~/path ).

A $ sign ends the input prompt.

Whereas the R console will look like :

>

2.7 Launch your first bash commands

You have to imagine that you are a cursor in a tree (or hierarchy) of directories and files on your computer. You can move from one directory to another, and launch commands from your position in the file/folder hierarchy you are in, but also on the computer as a whole.

2.7.1 What is a command ?

A command is a text you pass to your terminal. It starts a system process on the computer, resulting in an action. Commands consist of the command name, followed by arguments: a series of space-delimited strings that modify how the process will behave.

An example will be easier to understand. cowsay is a (fun and useless) program that takes a message as argument.

Type this command in your terminal and type enter to execute it.

cowsay Hello

Commands can also take a special kind of argument called an option. An option starts with - and can be followed, or not, by another argument. Below the -f option is used:

cowsay -f dragon Hi

What’s the point of options? Non-optional arguments (above, “Hi”) are related to the command’s goal, or default behaviour (here, outputting a message). Options fine-tune the command’s behaviour (above, who is speaking).

Warning, each space is interpreted as a new argument, so you have to use quotes to protect an argument containing spaces.

cowsay -e "@@" "What! A talking dragon!"

The following command will list all available animals that can be made to speak.

cowsay -l

Many commands provide help on their usage through the option “-h” or “–help”. You can also find more detailed help on the Internet.

cowsay -h

There are plenty of (more useful) programs. We will cover the basics.

2.7.2 Commands to navigate around your file tree.

  • pwd -> print working directory

To know where you are in the file tree

pwd
  • ls -> list files and directories of your current directory
ls
  • cd -> change directory
cd mydatalocal
  • mkdir -> make new directory(ies)
mkdir tuto_bash

With the -p option, any intermediary directories that don’t exist will also be created

mkdir tuto_bash/folderA/folderA_1 # raise an error
mkdir -p tuto_bash/folderA/folderA_1

The -p option also allows avoiding an error if directory exists

mkdir tuto_bash # raises an error if tuto_bash exists
mkdir -p tuto_bash

2.7.3 Absolute/Relative paths

To know where your files are, the machine uses paths. There are 2 kinds of paths, relative and absolute.

  • A relative path is “how to get somewhere from my current position”. Relative paths use the “.” and “..” notations to look under or above your current position. “../../” means the parent directory of your parent directory (remember the file tree analogy!)
cd tuto_bash/folderA/folderA_1
pwd
cd ../
pwd

cd ../../
pwd

cd  # In the absence of an argument, returns to your starting directory (your "home").
  • Absolute paths all start with “/” , which is the root of the file system (remember the file tree analogy! :)). Absolute paths are useful because (unlike relative paths) they are correct no matter where you are on the machine.
cd /home/rstudio/mydatalocal/tuto_bash
  • * the wildcard

a star is a “wildcard” matching any number of any characters. e.g. folder* matches all files/folders starting by folder

ls folder*
Exercices

Using the commands we’ve just seen:

    1. Make sure you are inside tuto_bash . What is its absolute path ?
Show the code
# Answer: check the prompt, and
pwd

#or
cd /home/rstudio/mydatalocal/tuto_bash
    1. Make a new directory inside tuto_bash named folderB/ and create a subdirectory folderB/folderB_1
Show the code
cd /home/rstudio/mydatalocal/tuto_bash
mkdir -p folderB/folderB_1

#or

mkdir folderB
cd folderB
mkdir folderB_1
    1. Go in the last sub directory you created (/home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1)
Show the code
#with an absolute path
cd  /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
pwd

#or a relative path from  tuto_bash
cd  /home/rstudio/mydatalocal/tuto_bash
cd folderB/folderB_1
pwd

#or a relative path from folderB
cd  /home/rstudio/mydatalocal/tuto_bash/folderB
cd folderB_1
pwd
    1. From here, create a subdirectory folderA_2 in folderA
Show the code
#with an absolute path
mkdir -p /home/rstudio/mydatalocal/tuto_bash/folderA/folderA_2

#or a relative path from  folderB_1
cd  /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
mkdir -p ../../folderA/folderA_2
    1. List all the directories of folderA
Show the code
#with an absolute path
ls /home/rstudio/mydatalocal/tuto_bash/folderA

#or a relative path from  folderB_1
cd  /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
ls ../../folderA

2.7.4 Writing and Reading Text Files

Go back to the /home/rstudio/mydatalocal/tuto_bash/ directory

cd  /home/rstudio/mydatalocal/tuto_bash/
  • touch -> to create empty file
touch empty_file.txt

But you can easily create new file using the right bottom panel or using file > New File > Text File

Exercices
  • Create a text file and name it toto.txt,
  • Write something inside.
  • Save it.
  • echo -> display text in the terminal
echo "hello-world"
  • redirection > and >> -> > redirect the output of a command in a text file, >> add output of a command at the end of a file
echo "hello-world" > tata.txt
echo "hello-world 2" >> tata.txt

You can also redirect output of other commands.

cowsay -f dragon "hello-world" > dragon.txt
cowsay  "hi" >> dragon.txt
  • cat -> concatenate files and display output
cat dragon.txt  tata.txt

It is often use to display the content of one file

cat dragon.txt

The cat command is useful for viewing the contents of smaller files, but if the file contains hundreds of lines of text, it is overwhelming to have everything printed to the Terminal at once. To see only a part of the file, we can use the commands head and tail to see the first few or the last few lines of the file, respectively. Using myFile.txt as an example, typing

  • head and tail -> to display first or last lines of a file
head  dragon.txt
tail  dragon.txt

You can specify the number of lines you want to display

head -n 5  dragon.txt
Exercices
    1. Create 2 new files named turtle.txt and sheep.txt, containing animal saying something. You will use cowsay -f turtle message and redirection
Show the code
cowsay -f turtle hello > turtle.txt
cowsay -f sheep hello > sheep.txt
    1. Concatenate them and redirect the result in a file turtle_and_sheep.txt
Show the code

cat  turtle.txt sheep.txt  > turtle_and_sheep.txt
    1. Display the entire file, only the first 8 lines, the last 8 lines.
Show the code
cat turtle_and_sheep.txt

head -n 8  turtle_and_sheep.txt


tail -n 8 turtle_and_sheep.txt
    1. list all files of your current directory and save it in list_files.txt (Try to write in the first line: “All the files of this directory:”)
Show the code
echo "All the files of this directory:" > list_files.txt
pwd >> list_files.txt

echo >> list_files.txt

ls >>  list_files.txt

cat list_files.txt

2.7.5 Copying and Removing Files

  • cp -> copy a file in another path. cp <file> <file|folder>

If you want to copy a file in the same directory: (you have to change the name)

cp dragon.txt dragons_cp.txt

If you want to copy a file in another directory: (you can keep the same name or change the name)

cp dragon.txt folderA/
ls folderA/
cp dragon.txt folderA/dragon_cp.txt
ls folderA/
  • cp -r -> copy a directory
cp -r folderA folderAprime
  • mv -> move a file/folder in another path. mv <file> <file|folder>

Move mv can be used to rename a file/folder or move it in another paths

ls
mv dragon.txt dragons_new.txt
ls
mv folderAprime folderC

If you want to move a file in another directory: (you can keep the same name or change the name)

mv  dragons_new.txt folderA/
ls folderA/
mv dragon_cp.txt folderA/dragon_cp_NEW.txt
ls folderA/
  • rm -> remove file(s). Deleting is permanent!
rm empty_file.txt
  • rm -r -> remove recursively file(s) and directories. Warning: deleting is permanent!
mkdir -p folderC/folderC_1
cp dragon.txt folderC/folderC_1
ls folderC/folderC_1

rm -r folderC 
ls folderC/folderC_1  #raises an error
Exercices
    1. Copy turtle.txt in folderB/ and copy sheep.txt in folderB/folderB_1
Show the code
cp turtle.txt folderB/
cp sheep.txt folderB/folderB_1/
    1. Move folderB/turtle.txt in folderB/folderB_1/
Show the code

mv folderB/turtle.txt folderB/folderB_1/
    1. Move all files of folderB/folderB_1/ in folderA/folderA_1/. You will use the * (wildcard).
Show the code

ls folderB/folderB_1/

mv folderB/folderB_1/* folderA/folderA_1/

ls folderB/folderB_1/
ls folderA/folderA_1/
    1. List files of your current directory and remove files starting by sheep
Show the code
ls 

rm sheep*

ls
    1. Copy the subdirectory folderB_1 in folderD
Show the code

cp -r folderB/folderB_1/ folderD
  • 6 - use the command tree to get the file tree
Show the code

cd /home/rstudio/mydatalocal/tuto_bash
tree
Show the code


└── tuto_bash
    ├── dragons_cp.txt
    ├── empty_file.txt
    ├── folderA
       ├── dragon_cp.txt
       ├── dragons_new.txt
       ├── dragon.txt
       ├── folderA_1
       │   ├── sheep.txt
       │   └── turtle.txt
       └── folderA_2
    ├── folderB
       └── folderB_1
    ├── folderD
    ├── list_files.txt
    ├── tata.txt
    ├── turtle_and_sheep.txt
    └── turtle.txt

Most commands have multiple options specified using a single or double dash (e.g. -v , --verbose ). All options should be detailed on the help page of a command, which you can usually print in the terminal with the --help option.

cp --help

To get more information, you can also use the man command:

man cp

2.8 Combining Commands in scripts

So far you have learned how to use commands. You’ll soon find, however, that large and complex blocks of code are tedious to write out by hand every time you want to run them. It is also difficult to debug a long string of code that you wrote in the Terminal.

Instead, we can put everything into a script: a file containing code. This allows you to make your code compact and easy to move between directories if you need to. It also makes debugging much easier.

A bash (the language you use to communicate) script is a simple text file, usually starting with #!/bin/bash and named with a ‘.sh’ extension, in which you list commands to run.

Here is an example of a simple script:

tuto_bash/my_script.sh
#!/bin/bash

cd /home/rstudio/mydatalocal/tuto_bash

echo "Run in:"
pwd

mkdir -p folder_1
echo "Made a folder named folder_1"
touch folder_1/text_file.txt
echo "I want to eat strawberries." > folder_1/text_file.txt
echo "Created a file in folder_1 with some text"

Open a text file in the RStudio text editor using File -> New File -> Shell Script, copy the script text in the new text file and save it as “my_script.sh” (saved in the tuto_bash folder).

Run it with the bash command:

cd /home/rstudio/mydatalocal/tuto_bash
bash my_script.sh

You can check that the script worked using the commands previously shown.

Comments

Always add comments to your scripts to keep track of what they do. In Bash, lines starting with a # symbol are ignored by the computer.

folderA/my_script_comments.sh
#!/bin/bash

# This is a comment

#Display the working directory
echo "Run in:"
pwd

#Made a folder named folder_1
mkdir -p folder_1

# Created a file in folder_1 with some text
echo "I want to eat strawberries." > folder_1/text_file.txt

cowsay "me too"  >> folder_1/text_file.txt

Add a couple of comments to the script we created above to specify what the commands do.

Warning

In longer and more complex scripts, it’s important (necessary, even) to document each section properly.

You’ll be thankful when reading properly commented code from others (or from past you), rather than spending time figuring out what each part does.

A script can be run from another folder, you just have to adjust the path:

cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1/
bash /home/rstudio/mydatalocal/tuto_bash/folderA/my_script_comments.sh

You can use tree to search for the “folder1” you created with the previous script

cd /home/rstudio/mydatalocal/tuto_bash/
tree

(Usually, we prefer to use relative path.)

If you run a script from another directory, you’ll have to pay attention to the path used in your script. Indeed, the working directory will be the directory where you launch the script and not the directory containing the script. A good habit is to fix at the beginning of your script the working directory, for example :

sandbox/my_template_script.sh
#!/bin/bash

cd /home/rstudio/mydatalocal/tuto_bash

[...]

You can redirect the output of your script to a file to keep log of its execution (if you have print message)

cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh > script.log

2.9 Helpful resources

Note

3 Get a toy dataset for the training

As we want to keep this data between all the sessions, you will use the permanent storage shared between all the students.

For the following of the training, we will work in a new personal directory: /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project.

Warning: Change “YourName” by your actual name.

Create this directory and go inside.

mkdir -p /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project

You can also remove the tuto_bash directory

rm -r /home/rstudio/mydatalocal/tuto_bash
Warning

The ephemeral storage area of your VM is connected to the directory /home/rstudio/mydatalocal/ so in the following of the training you must only work in the folder /home/rstudio/ifbdata/ueens_lyon_ngs24/ which will be permanent.

The dataset we are using for this training is part of a larger study described in Ryu et al., G3 2022. The authors assembled and annotated the genome of the clown fish (Amphiprion ocellaris). Here, we will just use a very small subset of two paired-end RNA-seq samples, one from White skin and the other from Orange Skin.

3.1 Get the raw fastq files

Toy input data are stored in this directory: /home/rstudio/ifbdata/ueens_lyon_ngs24/projects_raw_data/data_bash_training/

As they are very small, you can copy the input data in your personal repository using the script below which can be saved in training_project/01_copy_toydata.sh.

training_project/01_copy_toydata.sh
#!/bin/bash

mkdir -p /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project

mkdir -p 01_toy_data/
cd 01_toy_data/

cp -r /home/rstudio/ifbdata/ueens_lyon_ngs24/projects_raw_data/data_bash_training/toy_data/* .

echo "Files:"
ls
Terminal
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
bash 01_copy_toydata.sh
Note

For your real project, you will directly use the input data from one of these folders to avoid to replicate big files :

projects_raw_data/
 ├── data_bash_training
 ├── data_chipseq_petunia
 ├── data_envgenomics
 ├── data_meso_sexchromevol
 ├── data_scrnaseq
 └── data_spatial_transcriptomcs

3.2 Run the quality cleaning on the training dataset

Now we get some data, you can run your first bioinformatic analysis (as an example): perform a quality cleaning on your raw sequencing data.

fastp is a commonly used software that provides a simple way to do this. (You may need to use a different one depending on the specific features of the tool).

The template command to run fastp is :

#Not run
fastp --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz --out1 sample_trim_R1.fastq.gz --out2 sample_trim_R2.fastq.gz
Important

CONDA

Bioinformatics tools are not installed on computers by default; you have to install them. If you are working on several projects, you may need tools whose dependencies are incompatible. For greater convenience, we will be using conda, a package and environment management system that allows you to install several versions of software packages and their dependencies and to switch easily from one to another. We have prepared several environments for you with the tools you will need for the various projects.

To activate an environment, you have to type:

conda activate uengs

You can see the name of the active environment at the beginning of your prompt.

(uengs) rstudio@351e742a1b2f:~$ 

To deactivate it:

conda deactivate

And, you come back to the “base” environment.

(base) rstudio@351e742a1b2f:~$ 

You can list available environment by typing:

conda env list
# conda environments:
#
base                     /home/rstudio/conda
meso                     /home/rstudio/conda/envs/meso
scrnaseq                 /home/rstudio/conda/envs/scrnaseq
spatial                  /home/rstudio/conda/envs/spatial
uengs                 *  /home/rstudio/conda/envs/uengs

If you need to install a “common” tool during the practical, you have to activate an environment and type something like that according to the tool documentation:

conda activate ENV_NAMES
conda install -c bioconda TOOLS

You can also list all tools and their dependencies of an environment:

conda activate uengs

conda list

It can be useful to get version of a tool:

conda list | grep fastp

Environments are automatically created when your VM starts up, which may take some time, so don’t worry if you don’t see them immediately.

You will have to activate your conda environment at the beginning of your script to be able to “see” and execute tools.

SCRIPT.sh
#!/bin/bash

CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh

conda activate ENV_NAMES

cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/project_dir/...

...

Come back to your quality control analysis :

First, get the path of a fastq file of the training dataset

cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
ls 01_toy_data/*
01_toy_data/White_R1.fastq.gz
01_toy_data/White_R2.fastq.gz
01_toy_data/Orange_R1.fastq.gz
01_toy_data/Orange_R2.fastq.gz

Note the difference between the command with the wildcard “*”

ls 01_toy_data/*
01_toy_data/White_R1.fastq.gz
01_toy_data/White_R2.fastq.gz
01_toy_data/Orange_R1.fastq.gz
01_toy_data/Orange_R2.fastq.gz

and without the “*”

ls 01_toy_data/
White_R1.fastq.gz
White_R2.fastq.gz
Orange_R1.fastq.gz
Orange_R2.fastq.gz

Now we “just” have to adapt the command to run fastp on a file of the training dataset.

By default, fastp writes 2 output files fastp.html and fastp.json to the current directory along with the execution statistics. We will use the --json and --html arguments to give them a unique name and thus avoid the files being overwritten by execution of the command on other samples.

You can save the commands in a script : “02_toy_data_cleaning.sh”

training_project/02_toy_data_cleaning.sh
#!/bin/bash

CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh

conda activate uengs 

cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
mkdir -p 02_toy_data_cleaning_fastp/


fastp --in1 01_toy_data/White_R1.fastq.gz \
      --in2 01_toy_data/White_R2.fastq.gz \
      --out1  02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz \
      --out2  02_toy_data_cleaning_fastp/White_trim_R2.fastq.gz \
      --json  02_toy_data_cleaning_fastp/White_fastp_report.json \
      --html  02_toy_data_cleaning_fastp/White_fastp_report.html
Important

When a command is too looooong and becomes hard to read you can use \ for splitting a line (! without space after !)

fastp --in1 01_toy_data/White_R1.fastq.gz \
      --in2 01_toy_data/White_R2.fastq.gz \
      --out1  02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz \
      --out2  02_toy_data_cleaning_fastp/White_trim_R2.fastq.gz \
      --json  02_toy_data_cleaning_fastp/White_fastp_report.json \
      --html  02_toy_data_cleaning_fastp/White_fastp_report.html

is equivalent to and easier to read than

fastp --in1 01_toy_data/White_R1.fastq.gz --in2 01_toy_data/White_R2.fastq.gz   --out1  02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz       --out2  02_toy_data_cleaning_fastp/White_trim_R2.fastq.gz     --json  02_toy_data_cleaning_fastp/White_fastp_report.json --html  02_toy_data_cleaning_fastp/White_fastp_report.html

Now, you can run your cleaning script.

Terminal
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
bash 02_toy_data_cleaning.sh

We can now look at the results, using tree or ls 02_toy_data_cleaning_fastp or using the files browser of rstudio (right bottom panel of rstudio)

The result will be in the 02_toy_data_cleaning_fastp folder

02_toy_data_cleaning_fastp/
├── White_fastp_report.html <------- trimming report
├── White_fastp_report.json
├── White_trim_R1.fastq.gz
└── White_trim_R2.fastq.gz

If you want to look at the trimming report you can search it in the right bottom panel of rstudio and open it with “View in a web Browser”

Exercices
  • Modify the script to clean raw sequencing files for both samples.
training_project/02_toy_data_cleaning.sh
Show the code
#!/bin/bash

CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh

conda activate uengs 

cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project
mkdir -p 02_toy_data_cleaning_fastp/


fastp --in1 01_toy_data/White_R1.fastq.gz \
      --in2 01_toy_data/White_R2.fastq.gz \
      --out1  02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz \
      --out2  02_toy_data_cleaning_fastp/White_trim_R2.fastq.gz \
      --json  02_toy_data_cleaning_fastp/White_fastp_report.json \
      --html  02_toy_data_cleaning_fastp/White_fastp_report.html



fastp --in1 01_toy_data/Orange_R1.fastq.gz \
      --in2 01_toy_data/Orange_R2.fastq.gz \
      --out1  02_toy_data_cleaning_fastp/Orange_trim_R1.fastq.gz \
      --out2  02_toy_data_cleaning_fastp/Orange_trim_R2.fastq.gz \
      --json  02_toy_data_cleaning_fastp/Orange_fastp_report.json \
      --html  02_toy_data_cleaning_fastp/Orange_fastp_report.html
      
      

In Rstudio you can open several terminals at once to avoid waiting for the completion of one command.

4 How to organize your project directory ?

For the moment we placed all scripts at the root of our project directory. In fact this is a very bad idea.

It is advisable to put all files related to the same project in the same folder, whether they are:

  • the raw data,
  • the scripts,
  • the results,
  • the project documentation,

But to separate the different types of data into sub directories.

This way, you will more easily find your way around and avoiding mixing up files or accidentally deleting them.

For example, your project directory might look like this:

project_name/
├── README.md             # overview of the project
├── data/                 # data files used in the project
├── results/              # results of the analysis (data, tables, figures)
├── scripts/              # contains all code in the project
│   ├── 01_...
│   ├── 02_...
│   └── ...
└── doc/                  # documentation for your project
    └── ...

Moreover, to be able to find your way around and to ensure reproducibility, you need to add a file, often named README.md, at the root of your folder that should contain all the useful information to take the project in hand. This is also the file that will be visible on the home page of your project on gitlab. So when someone wants to (re)work on the project, they will be able to open the folder, and they will know where to go to see and understand what has been done. This person can be a collaborator, your manager or simply yourself 6 months later.

Exercices
    1. In the training project folder, create a “scripts” folder and move scripts inside.
Show the code
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project

mkdir -p scripts

mv 0*sh scripts
    1. Do the same thing for results files
Show the code
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project

mkdir -p results

mv *cleanning* results/

4.1 Scripts and data naming

Another part of staying organized is making sure that all the directories and filenames for an analysis are as consistent as possible. You want to avoid names like alignment1.bam, and rather have names like 20170823_kd_rep1_gmap-1.4.bam that provide a basic level of information about the file.

Good name must be :

  • machine readable
    • avoid spaces, punctuation, accented
    • regular expression and globbing friendly (ex: ls CondA* )
    • easy to compute on (ex: deliberate use of delimiters, CondA_rep1.tsv)
  • human readable
    • name contains info on content (ex 01_download_training_dataset.sh)
  • plays well with default ordering (ex: put something numeric first for script for example, or use date for results)

This link and this slideshow have some good guidelines for file naming dos and don’ts.

5 More useful bash commands

In the last part, you have seen some basic bash commands. In this session we will see some new to complete your bash knowledge. As previously, the idea is only to give you useful command for this training but not a complete view of bash.

5.1 New bash commands

Create a new “tuto_bash” directory for the exercises in /home/rstudio/mydatalocal

cd /home/rstudio/mydatalocal
mkdir -p tuto_bash
cd tuto_bash
  • grep -> to get only lines of a file with a pattern
cowsay -f dragon "Hello Hello!" > dragon.txt
cat dragon.txt
grep "-" dragon.txt
  • wc -l -> to count the number of lines of a file
wc -l  dragon.txt

wc can do other things but we will use only this one.

  • basename -> to get only the name of a file from a path (the file can not exist)
basename toy_data/sample1.fastq.gz
  • dirname -> to get only the name of the directory containing the file from a path (the file can not exist)
dirname toy_data/sample1.fastq.gz
  • pipe command with |

The pipe allows forwarding the output of one command as the input of the next command. This way you can execute several consecutive commands. For example, if you want to count the number of line containing a given pattern in a file:

grep "-" dragon.txt > occurence.txt
wc -l occurence.txt

You can instead “pipe” the command using a pipe | (AltGr+6 for PC and Alt + Maj + L for Mac ) between two commands to link the two commands

grep "-" dragon.txt | wc -l  

(By the way, grep already has an option to do this directly (grep -c), but it was for the example.)

  • gunzip and zcat to decompress files

As previously seen, fastq files are large files and you should store them in a compressed form whenever useful. In unix we used preferentially .gz compression (gunzip) than .zip, but it is the same idea.

To check the contents of a compressed file, you can unzip it with gunzip but this takes time and disk space. To do it quickly, there is a command called zcat that allows you to decompress a gz file on the fly and display its contents. So, zcat piped (|) with head will allow you to display the first few lines of a gz file without taking up time or disc space.

zcat /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/toy_data/White_R1.fastq.gz | head -n 8
Exercices

Using the commands we’ve just seen:

    1. Display the first 100 sequences of a training fastq file
Show the code
zcat /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/toy_data/White_R1.fastq.gz | head -n 400
    1. Count the number of sequences of a training fastq file and of the corresponding cleanned files
Show the code
zcat /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/01_toy_data/White_R1.fastq.gz | grep "@SRR1" | wc -l

zcat /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/02_toy_data_cleaning_fastp/White_trim_R1.fastq.gz | grep "@SRR1" | wc -l

5.2 Store strings in variables

In scripts, we frequently re-use bits of text: for example, a path to a file. It is more convenient to keep these in a named “variable” and then re-use that variable in the script. This way, if you need to modify the file path, you only need to modify the variable once, and not all the occurrences of the file path.

To define a variable you will declare it with an = without space.

variable1="toto"

Quote are not mandatory unless the string contains spaces.

variable1=toto

Then, you can refer to the variable’s content by using a $ before its name. To display the command of a variable you can use echo.

echo $variable1
echo variable1
echo "variable1: $variable1"

You can also combine variables to define a new one

variable1="TOTO"
variable2="TATA"

echo variable1: $variable1
echo variable2: $variable2

variable3="$variable1 $variable2"
echo variable3: $variable3

It is advisable to use descriptive variable names rather than generic ones like “variable1”, “variable2”.

Sometimes, you may need to enclose the variable name in curly brackets ({}) to distinguish it from the following string.

For example, if you want to create a filename using the content of a variable called $sample as a prefix, such as $sample_rep1.fastq.gz, you need to add curly brackets around sample (${sample}_rep1.fastq.gz). Otherwise, the terminal will attempt to interpret $sample_rep1 as a single variable.

sample="sampleA"

filenameX=$sample_rep1.txt #variable sample_rep1 don't exist so filenameX=.txt
filenameY=${sample}_rep1.txt

echo filenameX: $filenameX
echo filenameY: $filenameY

Variables can also capture the output of a command, using $(command). (Note the use of parenthesis () and not brackets {})

list_files=$(ls)
working_directory=$(pwd)

echo "$working_directory contains the following files:
$list_files"

You can use also pipe inside

nb_dash=$(grep "-" dragon.txt | wc -l)
echo $nb_dash

Using variables will simplify both your script writing and reading processes.

filename=dragon.txt
new_filename=dragon_bis.txt

echo "$filename will be cp to $new_filename"
cp  $filename  $new_filename

You can also do pattern substitutions in a variable, using ${variable/pattern/replace} to replace the first occurrence of “pattern”.

filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/R1/R2}

If you want to remove a pattern, you can use the same syntax but without a replacement string. This is a useful method to remove the extension of a file name.

filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/.R1.fastq.gz/}

You can save the output of this pattern substitution in a new variable.

filenameR1=sample1.R1.fastq.gz
filenameR2=${filenameR1/R1/R2}

echo R1: $filenameR1 R2: $filenameR2

For the following exercise, we need Paired-End (PE) data. PE fastq file names differ only by “R1” or “R2”. The file containing reads 1, contains “R1” in its file name and the other file containing reads 2, contains “R2”. Therefore, from one, you can deduce the other.

Exercices

You will work in the /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/ directory to do the following exercise.

The objectives are to rewrite the training_project/02_toy_data_cleaning.sh script using variables to simplify the script and also to re-use it more easily on other samples.

Using the commands we’ve just seen:

  1. Complete the commands to build the complete path of the R1 (White_R1.fastq.gz) and R2 (White_R2.fastq.gz) files using a variable prefix containing the prefix of the sample name (“White”)
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/
fastq_data_dir="01_toy_data/"

prefix="White"

filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=...

filenameR2=...
path_filenameR2=...


echo $path_filenameR1 # display 01_toy_data/White_R1.fastq.gz
echo $path_filenameR2 # display 01_toy_data/White_R2.fastq.gz

ls $path_filenameR1 # to check the path exist
ls $path_filenameR2 # to check the path exist
Show the code
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/
fastq_data_dir="01_toy_data/"


prefix="White"

filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=$fastq_data_dir/$filenameR1

filenameR2=${prefix}_R2.fastq.gz
path_filenameR2=$fastq_data_dir/$filenameR2


echo $path_filenameR1 # display 01_toy_data/White_R1.fastq.gz
echo $path_filenameR2 # display 01_toy_data/White_R2.fastq.gz

ls $path_filenameR1 # to check the path exist
ls $path_filenameR2 # to check the path exist
  1. Using these variables, rewrite the training_project/02_toy_data_cleaning.sh. You may start by one sample and also create a variable for the output directory: trim_data_dir="02_toy_data_cleaning_fastp2/"
training_project/02_toy_data_cleaning2.sh
Show the code
#!/bin/bash

CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh


conda activate uengs 

cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project

fastq_data_dir="01_toy_data/"
trim_data_dir="02_toy_data_cleaning_fastp2/"

mkdir -p $trim_data_dir/


prefix="White"

filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=$fastq_data_dir/$filenameR1
out_filenameR1=$trim_data_dir/$filenameR1

filenameR2=${prefix}_R2.fastq.gz
path_filenameR2=$fastq_data_dir/$filenameR2
out_filenameR2=$trim_data_dir/$filenameR2

path_report=$trim_data_dir/${prefix}_fastp_report


fastp --in1 $path_filenameR1 \
      --in2 $path_filenameR2 \
      --out1  $out_filenameR1 \
      --out2  $out_filenameR2 \
      --json  $path_report.json \
      --html  $path_report.html



prefix="Orange"

filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=$fastq_data_dir/$filenameR1
out_filenameR1=$trim_data_dir/$filenameR1

filenameR2=${prefix}_R2.fastq.gz
path_filenameR2=$fastq_data_dir/$filenameR2
out_filenameR2=$trim_data_dir/$filenameR2

path_report=$trim_data_dir/${prefix}_fastp_report


fastp --in1 $path_filenameR1 \
      --in2 $path_filenameR2 \
      --out1  $out_filenameR1 \
      --out2  $out_filenameR2 \
      --json  $path_report.json \
      --html  $path_report.html
      

5.3 “For” loops to iterate on a list

Bioinformatic analysis often involves executing numerous commands, but with only slight variations each time you run a new command.

For instance, you might have to analyze sample1, then sample2, sample3, and so forth. To save time, we employ something known as a loop, specifically a for-loop.

Let’s clarify this concept with a straightforward example. Suppose you wish to print the letters a, b and c. You could achieve this by typing echo a, hitting return; then echo b, and echo c. While this accomplishes the task, it is evident that this method would become cumbersome if you had to to print dozens or hundreds of numbers.

How can we simplify this process? You probably noticed that each time we ran the command, we changed the letter after the echo command. A for-loop automates this process for you.

Here is an example of a for-loop to print the letters a, b, c, d:

for X in a b c d
do
  echo $X
done

The for-loop comprises three sections:

  • The first section is the Declaration: it begins by assigning the first item after in to the variable **X**; in this case, it would assign the value "a" to "X". The letters afterin` are called the List.

  • The next section is the Body, which runs the commands written after do, replacing the variable with whichever value is currently assigned to the variable - for the first loop, this will be the letter a. Since items remain in the list, the loop goes back to the declaration and assigns the next number in the list to the variable X; in this case, the number b. Then the body is run, and the process repeats until the end of the list is reached.

  • The final section, called the End, contains only the word done, meaning to exit the loop after all of the items in the list have been run through the Body of the loop.

You can include additional commands to the Body section. For instance, we could modify the loop to:

for X in sheep elephant fox duck
do
  # Display the value in X
  echo $X
  #Do something independent of the variable X
  cowsay -f turtle "Hello"
  #Do something dependent of the content of X ($X)
  cowsay -f $X "Hello, I'm a $X"
  #Do another thing, if you want
  cowsay -f turtle "Welcome, $X"
done

You can use a variable, as seen previously, to store the list of items to iterate over. Additionally, it is preferable to use a more descriptive variable name than “X” for iteration (for example “letter”).

list="a b c d"
for letter in $list
do
  echo Do something
  echo Do something with $letter
  echo Do another thing
done

Furthermore, the list to iterate over can be automatically generated. For instance, if you wish to iterate over a list of files you can do:

cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/

list_files=$(ls 01_toy_data/*gz)
echo $list_files

echo "Start:"

for file in $list_files
do
  echo $file
  echo Do something
  echo Do something with $file
  echo Do another thing
done

echo "End"

#OR

for file in toy_data/*gz
do
  echo $file
  echo Do something
  echo Do something with $file
  echo Do another thing
done

Note: Indentations are not important for the machine, but they helps in code readability for humans.

Exercices

Using the commands we’ve just seen:

  1. Iterate on each sample and print the path of the R1 and R2 fastq files contained in 01_toy_data directory.
Show the code

fastq_data_dir="01_toy_data/"

prefix_list="Orange White"
#or
prefix_list="Orange
White"


for prefix in $prefix_list
do
  echo prefix: $prefix
  
  filenameR1=${prefix}_R1.fastq.gz
  path_filenameR1=$fastq_data_dir/$filenameR1
  
  filenameR2=${prefix}_R2.fastq.gz
  path_filenameR2=$fastq_data_dir/$filenameR2
  
  echo R1: $path_filenameR1
  echo R2: $path_filenameR2
done
prefix: Orange
R1: 01_toy_data//Orange_R1.fastq.gz
R2: 01_toy_data//Orange_R2.fastq.gz
prefix: White
R1: 01_toy_data//White_R1.fastq.gz
R2: 01_toy_data//White_R2.fastq.gz
  1. Using the same basis, rewrite the training_project/02_toy_data_cleaning2.sh using a for-loop. Use a different output directory (02_toy_data_cleaning_fastp3/) to check that your script is ok.
training_project/02_toy_data_cleaning3.sh
Show the code
#!/bin/bash

CONDA_BASE=$(conda info --base)
source $CONDA_BASE/etc/profile.d/conda.sh

conda activate uengs 

cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project

fastq_data_dir="01_toy_data/"
trim_data_dir="02_toy_data_cleaning_fastp3/"

mkdir -p $trim_data_dir/

prefix_list="Orange
White"

for prefix in $prefix_list
do



filenameR1=${prefix}_R1.fastq.gz
path_filenameR1=$fastq_data_dir/$filenameR1
out_filenameR1=$trim_data_dir/$filenameR1

filenameR2=${prefix}_R2.fastq.gz
path_filenameR2=$fastq_data_dir/$filenameR2
out_filenameR2=$trim_data_dir/$filenameR2

path_report=$trim_data_dir/${prefix}_fastp_report

echo R1: $path_filenameR1
echo R2: $path_filenameR2

echo out_R1: $out_filenameR1
echo out_R2: $out_filenameR2

echo report: $path_report
  
fastp --in1 $path_filenameR1 \
      --in2 $path_filenameR2 \
      --out1  $out_filenameR1 \
      --out2  $out_filenameR2 \
      --json  $path_report.json \
      --html  $path_report.html


done
R1: 01_toy_data//Orange_R1.fastq.gz
R2: 01_toy_data//Orange_R2.fastq.gz
out_R1: 02_toy_data_cleaning_fastp3//Orange_R1.fastq.gz
out_R2: 02_toy_data_cleaning_fastp3//Orange_R2.fastq.gz
report: 02_toy_data_cleaning_fastp3//Orange_fastp_report
[...]
  1. Bonus: In the case of a long list of samples, try to get the list of the prefix automatically from the list of input files.
  • As the training dataset is Paired-End each sample has 2 input files. You can use only R1 files to get only once each sample name.)

  • You may use the sed command.

  • sed -> Utilized for replacing all instances of a pattern with another in a file or in a terminal output. (sed can do a lot of other things but we will use only this one.)

# in the content of a file
echo sed "s/PATTERN/SUBSTITUTE/g" file.txt

# pipe after a bash command
echo $string | sed  "s/PATTERN/SUBSTITUTE/g" 

# To replace only the first occurrence per line of a pattern, omit the **g** option
echo $string | sed  "s/PATTERN/SUBSTITUTE/" 

#If the pattern contains "/", you can replace "/" by "|"
echo $string | sed  "s|PATTERN|SUBSTITUTE|g" 
Show the code
cd /home/rstudio/ifbdata/ueens_lyon_ngs24/students/YourName/training_project/

samples_list=$(ls 01_toy_data/*R1*gz | sed "s/_R1.fastq.gz//g" | sed "s|01_toy_data/||g")

echo $samples_list 
Orange White
To finish

Now it’s time to get to work on your project!

Remember, if you’re looking for a command, have an incomprehensible bug or don’t know where to start :

  • ask other members of your group
  • the internet is also your friend (it’s not cheating!!)
  • ask one of the trainers

These materials have been developed by members of UE NGE teaching team of the ENS de Lyon. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.


6 Readme example

RNA sequencing HTLV project

6.1 Scripts storage

All the scripts used for this project are stored in the file “scripts”, available with this path : Path: /home/rstudio/ifbdata/rna_seq_htlv_xxx/scripts

6.2 Introduction

The aim of the study is to understand if there is a modification inside the transcriptome of T-Lymphocytes infected by the HTLV-1 when they are in contact with dendritic cells. For that, biologists put C91 cells, which are already infected with the HTLV-1, in contact with dendritic cells or with control cells. HTLV-1 is an oncogenic human retrovirus that can be involved in the development of leukemia and neuromyelopathy. The aim of the manipulation is to understand whether one or more genes allow infected cells to escape the immune system. And for that, we analysed the Lymphocyte RNA sequencing.

6.3 Tools used for the manipulation

FastQC : 0.11.8
MultiQC : 1.14
fastp : 0.23.2
salmon : 1.10.1

tidyverse : 2.0.0
tximport : 1.26.1
biomaRt : 2.54.1
DESeq : 1.38.3
EnhanceVolcano : 1.16.0
ggpubr : 0.6.0
DOSE : 3.24.2
clusterProfiler : 4.6.2
dbplyr : 2.4.0

6.4 Download data

The data from Hélène Dutarte’s lab at CIRI were downloaded via the script : 01.RNA_seq_data_download.sh .
The different files are called : C91CT1, C91CT2, C91CT3, C91DC1, C91DC2, C91DC3. There are three replicates for each conditions.

6.5 Quality checking of the data

6.5.1 Creation of all the FastQC data for the different experiments

The fastQC command is used to analyze sample quality from a high throughput sequence data. The quality control results are stored in a file results/02_quality_analyse_fastq. The following script is used for data quality control : 02.checking_quality_fastq.sh

For a more precise explanation on the results you can go on Migale bioinformatics faciliy

6.5.2 Reunite all Fastq data in a single file

MultiQC allows information from multiple FastQC reports to be grouped together on a single page. It is used to compare different samples more easily. The results are stored in the file results/03_quality_checking_aggregate_multiqc. The following script is used to reunite the fastq data : 03.reunite_data_multiqc.sh

6.6 Data cleaning

FastP command is used to clean all the data obtained by the fastQC command and keep only the reads that can be analysed. A wide selection of cleaning parameters is available:
- Phred nucleotide quality of at least 28.
- cutting of sequences with a Phred average of less than 28
- minimum read size
- removal of adapters
The results are stored in htlm and json file. The following script is used to cleaned the fastQC data : 04.fastp_cleanning.sh

6.7 Quantification of all our reads using Salmon tool

6.7.1 Download the transcriptome of interest to prepare the analysis by salmon

To analyze all our transcripts we need to index them to find out which gene they relate to. To do this, we need to download a database on ENSEMBL for the human genome and NCBI for the viral genome. We download :
- all the cdna of the human genome : complementary DNA, representing the human transcriptome.
- all the ncrna of the human genome : non coding RNA.
- all the cds of the viral genome : coding DNA sequence.
The following script is used to download all these data : 05.dowload_transcriptome.sh

6.7.2 Creation of the Index for salmon

All transcripts are indexed using the salmon index command. It uses a library of all 5-nucleotide compositions, allowing it to move through the previously downloaded data and annotate all the transcripts. This technique is called quasi-mapping. The following script is used to index all our transcripts of interest : 06.salmon_index.sh

6.7.3 Quantifictaion of our reads

All the RNA seq data are quantify by the salmon quant command using the indexed transcriptome as reference. The following script is used to quantify all our transcripts of interest : 06.salmon_quantification.sh

To have more information you can go on the web site Salmon explanations

6.8 Differential expression analysis

6.8.1 Import all our quantification transcript files

The Rmarkdown R1.analyses_DESeq2.Rmd allows to download all the files from salmon quantification to be used by DESeq.
In this script, we are creating a mapping between the transcript and the gene. We also create a directory of gene identifiers with their name and function.
We end by creating a dss object to store the results of the DESeq2 analysis. We filter the data: only genes that have more than 10 reads and in more than 3 replicates.

6.8.2 PCA analysis

We use the plotPCA command that can create a plot for the replicates, the batch effect or the conditions. In our study we can see that the experiment 2 is very far from the other 2 conditions due to a batch effect. We can see a replicate effect with a right shift and a condition effect with an up shift. For our analyse, and to counteract the batch effect, we will only focus on the condition effect.

For this analysis we use the script R2.PCA.Rmd, the figures can be seen on the knit.

6.8.3 Running the DESeq2 analysis

The DESeq, standing for Differential expression analysis, is an analyse to compare the quantification of transcript in two conditions and to express the differential expression of the genes. To do that we use the DESeq function. All the results are filtered with the logFoldChage with the command lfcShrink which allows to reduce the LFC. Then we create a table with all the results filtered for the LFC, with the gene name and the gene function. To do that we use the script R3.DESeq2.Rmd.

6.8.4 Showing plot for the analyse of the DESeq results

We create a volcano plot with EnhancedVolcano command, which places genes in a diagram using p-value as a function of logFoldChange.

We create a MA plot with ggmaplot command, which places genes in a diagram using LogFoldChange as a function of the mean expression.

We use the function pheatmap to create a heat map showing the differentially expressed genes in the control or in the condition tested. The blue is for the downregulated genes and the red for the upregulated genes.
The code that displays the heat map is available from Mohamed Chour.

To create these different plots showing the results of the DESeq we run the script R4.DESeq_plots.Rmd, the figures can be seen on the knit.

To have more informations you can go on the web site DESeq explanations

6.9 Analysis of the results

6.9.1 Gene set over-representation analysis

This technique allows to test between background genes if the genes that are differentially expressed are significant. This technique uses a chi2 test. For the background we use all the genes tested in our experiments. We use a gene ontholgy that classify all the genes inside biological process. This technique is first used for all the gene then for the up-regulated genes and finally the down regulated genes. We can show the results with a data frame to see if some genes can be interesting.

Then, with the command enrichGO, we do the GO enrichissement to know which different type of genes are up regulated. We can do a dot plot to show the results where the GeneRatio is the proportion of genes in a specific gene set among all up-regulated genes.

To have more information you can go on the web site Over-representation analysis

6.9.2 Gene Set Enrichment Analysis (GSEA)

The Gene Set Enrichment Analysis (GSEA) is a method that determines whether a defined set of genes shows statistically significant, concordant differences between two biological states. We use the KEGG pathway to determine which genes of the different pathways are up or down regulated.
This part of the code has been done by Mohamed Chour.

6.9.3 Creating a pathview plot

The pathview command allows you to color the speakers of a signaling pathway according to its expression regulation. The signaling pathway need to be determined beforehand on the KEGG pathway site.

This part of the code has been done by Yevhenii Kostenko.

To run the analyses of the results the scripts R5.GO_analyses.Rmd is used.

To have more information you can go on the web site Gene set enrichment analysis

6.10 Analysis of the proteins potentially secreted

We are using the ENSEMBL data to annotate each gene with signalP. SignalP refers to a peptide inside the protein that allows it to be secreted. We sort all the genes that are upregulated, which are significantly differentially expressed,with a signalP and which are not expressed in the control.

The results are put in table, sort in a descreasing way on the logFoldChange (the more differentially expressed genes)

To do that the script R6.secretome.Rmd is used.

6.11 Conclusion

With our data, we have seen that CAV1 is down regulated in infected lymphocytes in contact with DC whereas it is up-regulated in DC cells in contact with infected lymphocytes (mirror experiment already done)

We have also highlighted that the InterferonG is upregulated in Lymphocytes in contact with DC but not expressed in the control and it can be secreted.The implied pathway in the expression of INFG is also affected with up regulated genes.