USEFUL INFORMATION: Short introduction to bash

UE NGS - ENS LYON

Author

NGS-team - 2023

1 “bash” ?

Many/most bioinformatic tools do not have graphical interfaces. To use them, you need some basics on the command line and bash is the name of one language to “talk” to a computer through a terminal.

2 Learn the basics of using the command line

Note

The goal is not to turn you into bioinformatics specialists but to give you basic skills in running commands. From there you will be able to find, follow and understand tutorials and help available on the ‘jungle of the Internet’

2.1 Connecting to your Virtual Machine (VM)

To start, you need a linux environment where you can train yourself. For that, go back to the IFB website (There is a link on the top of this page) and go to the deployment frame in the “myVM” and connect you to the VM you started at the beginning of the session.

To connect to an instance you need 3 pieces of information: a login, a password and the IP address of your VM:

    1. Click on deployment “Params” link (right), then copy your Rstudio_PASWORD
    1. Click on deployment “https” link (right), which will send you to the correct IP.
    1. fill the login (“rstudio”), and your password in the popup window, et type “enter”. ”
Warning
  • You must accept the security warning to access the machine.
  • If you are using firefox, it is possible that firefox does not want to override the security warning, you must then use another browser e.g. chrome.

2.2 Discovering the Rstudio interface and access to its terminal

By default, the “R” console is open on the left-hand side. This is because R is the expected language in Rstudio, but we are going to use another Rstudio function to talk directly to the computer. To do this, we are going to use the terminal.

You can also see a file browser in the bottom right-hand corner.

2.3 Access to the terminal

Open the “Terminal” tab of your Rstudio interface to access the terminal on your VM.

2.4 What is a terminal ?

A terminal is an interface used to issue commands to a computer by typing only text. You already use a lot of commands indirectly through the graphical interface of your computer (like creating or opening folders, for example).

2.5 Why use a terminal ?

There are plenty or reasons. Here are 4 :

  1. Automation/scalability. With one command, it is possible to create 100 folders instantly, which is much more tedious through a graphical interface.
  2. Flexibility. Commands can have a wide array of inputs and options.
  3. Availability. A graphical interface is a luxury! Some programs (even some operating sytems) just. . . don’t have any. Thus, if you want to use a lot of programs out there, you need the terminal.
  4. Reproducibility. A set of commands can be stored in a script, which makes them easy to reproduce.

2.6 What does a terminal look like ?

The terminal’s input line has the following layout

login@hostname:~/path$

We can split this in two parts:

• who is executing the commands: name of the user ( login ), @ , and name of the machine ( hostname ).

• where on the system is the command being executed is indicated the spot “from where” the command is executed ( ~/path ).

A $ sign ends the input prompt.

Whereas the R console will look like :

>

2.7 Launch your first bash commands

You have to imagine that you are a cursor in a tree (or hierarchy) of directories and files on your computer. You can move from one directory to another, and launch commands from your position in the file/folder hierarchy you are in, but also on the computer as a whole.

2.7.1 What is a command ?

A command is a text you pass to your terminal. It starts a system process on the computer, resulting in an action. Commands consist of the command name, followed by arguments: a series of space-delimited strings that modify how the process will behave.

An example will be easier to understand. cowsay is a (fun and useless) program that takes a message as argument.

Type this command in your terminal and type enter to execute it.

cowsay Hello

Commands can also take a special kind of argument called an option. An option starts with - and can be followed, or not, by another argument. Below the -f option is used:

cowsay -f dragon Hi

What’s the point of options? Non-optional arguments (above, “Hi”) are related to the command’s goal, or default behaviour (here, outputting a message). Options fine-tune the command’s behaviour (above, who is speaking).

Warning, each space is interpreted as a new argument, so you have to use quotes to protect an argument containing spaces.

cowsay -e "@@" "What! A talking dragon!"

The following command will list all available animals that can be made to speak.

cowsay -l

Many commands provide help on their usage through the option “-h” or “–help”. You can also find more detailed help on the Internet.

cowsay -h

There are plenty of (more useful) programs. We will cover the basics.

2.7.2 Commands to navigate around your file tree.

  • pwd -> print working directory

To know where you are in the file tree

pwd
  • ls -> list files and directories of your current directory
ls
  • cd -> change directory
cd mydatalocal
  • mkdir -> make new directory(ies)
mkdir tuto_bash

With the -p option, any intermediary directories that don’t exist will also be created

mkdir tuto_bash/folderA/folderA_1 # raise an error
mkdir -p tuto_bash/folderA/folderA_1

The -p option also allows avoiding an error if directory exists

mkdir tuto_bash # raises an error if tuto_bash exists
mkdir -p tuto_bash

2.7.3 Absolute/Relative paths

To know where your files are, the machine uses paths. There are 2 kinds of paths, relative and absolute.

  • A relative path is “how to get somewhere from my current position”. Relative paths use the “.” and “..” notations to look under or above your current position. “../../” means the parent directory of your parent directory (remember the file tree analogy!)
cd tuto_bash/folderA/folderA_1
pwd
cd ../
pwd

cd ../../
pwd

cd 
  • Absolute paths all start with “/” , which is the root of the file system (remember the file tree analogy! :)). Absolute paths are useful because (unlike relative paths) they are correct no matter where you are on the machine.
cd /home/rstudio/mydatalocal/tuto_bash
  • * the wildcard

a star is a “wildcard” matching any number of any characters. e.g. folder* matches all files/folders starting by folder

ls folder*
Exercices

Using the commands we’ve just seen:

    1. Make sure you are inside tuto_bash . What is its absolute path ?
Show the code
# Answer: check the prompt, and
pwd

#or
cd /home/rstudio/mydatalocal/tuto_bash
    1. Make a new directory inside tuto_bash named folderB/ and create a subdirectory folderB/folderB_1
Show the code
cd /home/rstudio/mydatalocal/tuto_bash
mkdir -p folderB/folderB_1

#or

mkdir folderB
cd folderB
mkdir folderB_1
    1. Go in the last sub directory you created (/home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1)
Show the code
#with an absolute path
cd  /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
pwd

#or a relative path from  tuto_bash
cd  /home/rstudio/mydatalocal/tuto_bash
cd folderB/folderB_1
pwd

#or a relative path from folderB
cd  /home/rstudio/mydatalocal/tuto_bash/folderB
cd folderB_1
pwd
    1. From here, create a subdirectory folderA_2 in folderA
Show the code
#with an absolute path
mkdir -p /home/rstudio/mydatalocal/tuto_bash/folderA/folderA_2

#or a relative path from  folderB_1
cd  /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
mkdir -p ../../folderA/folderA_2
    1. List all the directories of folderA
Show the code
#with an absolute path
ls /home/rstudio/mydatalocal/tuto_bash/folderA

#or a relative path from  folderB_1
cd  /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1
ls ../../folderA

2.7.4 Writing and Reading Text Files

Go back to the /home/rstudio/mydatalocal/tuto_bash/ directory

cd  /home/rstudio/mydatalocal/tuto_bash/
  • touch -> to create empty file
touch empty_file.txt

You can also easily create new file using the right bottom panel or using file > New File > Text File

  • Create a text file and name it toto.txt,

  • Write something inside.

  • Save it.

  • echo -> display text in the terminal

echo "hello-world"
  • redirection > and >> -> > redirect the output of a command in a text file, >> add output of a command at the end of a file
echo "hello-world" > tata.txt
echo "hello-world 2" >> tata.txt

You can also redirect output of other commands.

cowsay -f dragon "hello-world" > dragon.txt
cowsay  "hi" >> dragon.txt
  • cat -> concatenate files and display output
cat dragon.txt  tata.txt

It is often use to display the content of one file

cat dragon.txt

The cat command is useful for viewing the contents of smaller files, but if the file contains hundreds of lines of text, it is overwhelming to have everything printed to the Terminal at once. To see only a part of the file, we can use the commands head and tail to see the first few or the last few lines of the file, respectively. Using myFile.txt as an example, typing

  • head and tail -> to display first or last lines of a file
head  dragon.txt
tail  dragon.txt

you can specify the number of lines you want to display

head -n 5  dragon.txt
Exercices
    1. Create 2 new files named turtle.txt and sheep.txt, containing animal saying something. You will use cowsay -f turtle message and redirection
Show the code
cowsay -f turtle hello > turtle.txt
cowsay -f sheep hello > sheep.txt
    1. Concatenate them and redirect the result in a file turtle_and_sheep.txt
Show the code

cat  turtle.txt sheep.txt  > turtle_and_sheep.txt
    1. Display the entire file, only the first 8 lines, the last 8 lines.
Show the code
cat turtle_and_sheep.txt

head -n 8  turtle_and_sheep.txt


tail -n 8 turtle_and_sheep.txt
    1. list all files of your current directory and save it in list_files.txt (Try to write in the first line: “All the files of this directory:”)
Show the code
echo "All the files of this directory:" > list_files.txt
pwd >> list_files.txt

echo >> list_files.txt

ls >>  list_files.txt

cat list_files.txt

2.7.5 Copying and Removing Files

  • cp -> copy a file in another path. cp <file|folder>

If you want to copy a file in the same directory: (you have to change the name)

cp dragon.txt dragons_cp.txt

If you want to copy a file in another directory: (you can keep the same name or change the name)

cp dragon.txt folderA/
ls folderA/
cp dragon.txt folderA/dragon_cp.txt
ls folderA/
  • cp -r -> copy a directory
cp -r folderA folderAprime
  • mv -> move a file/folder in another path. mv <file|folder>

Move mv can be used to rename a file/folder or move it in another paths

ls
mv dragon.txt dragons_new.txt
ls
mv folderAprime folderC

If you want to move a file in another directory: (you can keep the same name or change the name)

mv  dragons_new.txt folderA/
ls folderA/
mv dragon_cp.txt folderA/dragon_cp_NEW.txt
ls folderA/
  • rm -> remove file(s). Deleting is permanent!
rm empty_file.txt
  • rm -r -> remove recursively file(s) and directories. Warning: deleting is permanent!
mkdir -p folderC/folderC_1
cp dragon.txt folderC/folderC_1
ls folderC/folderC_1

rm -r folderC 
ls folderC/folderC_1  #raises an error
Exercices
    1. Copy turtle.txt in folderB/ and copy sheep.txt in folderB/folderB_1
Show the code
cp turtle.txt folderB/
cp sheep.txt folderB/folderB_1/
    1. Move folderB/turtle.txt in folderB/folderB_1/
Show the code

mv folderB/turtle.txt folderB/folderB_1/
    1. Move all files of folderB/folderB_1/ in folderA/folderA_1/. You will use the “*” (wildcard).
Show the code

ls folderB/folderB_1/

mv folderB/folderB_1/* folderA/folderA_1/

ls folderB/folderB_1/
ls folderA/folderA_1/
    1. List files of your current directory and remove files starting by sheep
Show the code
ls 

rm sheep*

ls
    1. Copy the subdirectory folderB_1 in folderD
Show the code

cp -r folderB/folderB_1/ folderD
  • 6 - use the command tree to get the file tree
Show the code

cd /home/rstudio/mydatalocal/tuto_bash
tree
Show the code


└── tuto_bash
    ├── dragons_cp.txt
    ├── empty_file.txt
    ├── folderA
       ├── dragon_cp.txt
       ├── dragons_new.txt
       ├── dragon.txt
       ├── folderA_1
       │   ├── sheep.txt
       │   └── turtle.txt
       └── folderA_2
    ├── folderB
       └── folderB_1
    ├── folderD
    ├── list_files.txt
    ├── tata.txt
    ├── turtle_and_sheep.txt
    └── turtle.txt

Most commands have multiple options specified using a single or double dash (e.g. -v , –verbose ). All options should be detailed on the help page of a command, which you can usually print in the terminal with the –help option.

cp --help

To get more information, you can also use the man command:

man cp

2.8 Combining Commands in scripts

So far you have learned how to use commands. You’ll soon find, however, that large and complex blocks of code are tedious to write out by hand every time you want to run them. It is also difficult to debug a long string of code that you wrote in the Terminal.

Instead, we can put everything into a script: a file containing code. This allows you to make your code compact and easy to move between directories if you need to. It also makes debugging much easier.

A bash (the language you use to communicate) script is a simple text file, usually starting with #!/bin/bash and named with a ‘.sh’ extension, in which you list commands to run.

Here is an example of a simple script:

tuto_bash/my_script.sh
#!/bin/bash

cd /home/rstudio/mydatalocal/tuto_bash

echo "Run in:"
pwd

mkdir -p folder_1
echo "Made a folder named folder_1"
touch folder_1/text_file.txt
echo "I want to eat strawberries." > folder_1/text_file.txt
echo "Created a file in folder_1 with some text"

Copy it in a text file named my_script.sh (saved in the tuto_bash folder) and run it with the bash command:

cd /home/rstudio/mydatalocal/tuto_bash
bash my_script.sh

You can check that the script worked using the commands previously shown.

Comments

Always add comments to your scripts to keep track of what they do. In Bash, lines starting with a # symbol are ignored by the computer.

folderA/my_script_comments.sh
#!/bin/bash

# This is a comment

#Display the working directory
echo "Run in:"
pwd

#Made a folder named folder_1
mkdir -p folder_1

# Created a file in folder_1 with some text
echo "I want to eat strawberries." > folder_1/text_file.txt

cowsay "me too"  >> folder_1/text_file.txt

Add a couple of comments to the script we created above to specify what the commands do.

Warning

In longer and more complex scripts, it’s important (necessary, even) to document each section properly.

You’ll be thankful when reading properly commented code from others (or from past you), rather than spending time figuring out what each part does.

A script can be run from another folder, you just have to adjust the path:

cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh
cd /home/rstudio/mydatalocal/tuto_bash/folderB/folderB_1/
bash /home/rstudio/mydatalocal/tuto_bash/folderA/my_script_comments.sh

You can use tree to search for the “folder1” you created with the previous script

cd /home/rstudio/mydatalocal/tuto_bash/
tree

(Usually, we prefer to use relative path.)

If you run a script from another directory, you’ll have to pay attention to the path used in your script. Indeed, the working directory will be the directory where you launch the script and not the directory containing the script. A good habit is to fix at the beginning of your script the working directory, for example :

sandbox/my_template_script.sh
#!/bin/bash

cd /home/rstudio/mydatalocal/tuto_bash

[...]

You can redirect the output of your script to a file to keep log of its execution (if you have print message)

cd /home/rstudio/mydatalocal/tuto_bash/
bash folderA/my_script_comments.sh > script.log

2.9 Helpful resources

Note

3 Get a toy dataset for the training

For the following of the training, we will work in a new directory, /home/rstudio/mydatalocal/training_project/.

Create this directory and go inside.

mkdir -p /home/rstudio/mydatalocal/training_project
cd /home/rstudio/mydatalocal/training_project

You can also remove the tuto_bash directory if you want

rm -r /home/rstudio/mydatalocal/tuto_bash
Warning

The storage area of your VM (the 200Go or 400Go) is connected to the directory /home/rstudio/mydatalocal/ so in the following of the training you must only work in this folder if you do not want to have space issues.

The dataset we are using for this training is part of a larger study described in Ryu et al., G3 2022. The authors assembled and annotated the genome of the clown fish (Amphiprion ocellaris). Here, we will just use a very small subset of two paired-end RNA-seq samples, one from White skin and the other from Orange Skin.

3.1 Get the raw fastq files

Copy the script below in a file training_project/01_download_toydata.sh to automatically download the training dataset.

We will use wget, a command (like ls or cat) that downloads file(s) from an URL.

training_project/01_download_toydata.sh
#!/bin/bash

mkdir -p /home/rstudio/mydatalocal/training_project
cd /home/rstudio/mydatalocal/training_project


wget https://ue.gitbiopages.ens-lyon.fr/ue-ngs/presentations/intro_bash/toy_data.tar.gz
tar xzvf toy_data.tar.gz

echo "Files:"
ls
Terminal
cd /home/rstudio/mydatalocal/training_project
bash 01_download_toydata.sh

3.2 Run the quality checking on the training dataset

Now we get some data, you can run your first bioinformatic analysis: perform quality control on your raw sequencing data.

FastQC is a commonly used software that provides a simple way to do this.

The template command to run fastqc is :

fastqc -o output_dir sample.fastq.gz

Fisrt get the path of a fastq file of the training dataset

cd /home/rstudio/mydatalocal/training_project
ls toy_data/*
toy_data/White_R1.fastq.gz
toy_data/White_R2.fastq.gz
toy_data/Orange_R1.fastq.gz
toy_data/Orange_R2.fastq.gz

Note the difference between the command with the wildcard “*”

ls toy_data/*
toy_data/White_R1.fastq.gz
toy_data/White_R2.fastq.gz
toy_data/Orange_R1.fastq.gz
toy_data/Orange_R2.fastq.gz

and without the “*”

ls toy_data/
White_R1.fastq.gz
White_R2.fastq.gz
Orange_R1.fastq.gz
Orange_R2.fastq.gz

Now we “just” have to adapt the command to run fastqc on a file of the training dataset.

You can save the commands in a script : “02_toy_check_quality.sh”

training_project/02_toy_check_quality.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/training_project
mkdir -p 01_quality_checking_fastqc/

fastqc -o 01_quality_checking_fastqc/ toy_data/White_R1.fastq.gz

and run it

Terminal
cd /home/rstudio/mydatalocal/training_project
bash 02_check_quality_fastqc.sh

We can now look at the results; we will study them tomorrow.

The result will be in the 01_quality_checking_fastqc folder in the results folder

└── results
    └── 01_quality_checking_fastqc
        ├── White_R1_fastqc.html   <-----
        └── White_R1_fastqc.zip

You can search it in the right bottom panel of rstudio and open it with “View in a web Browser

Fastqc can also take multiple files as arguments, to analyse them together. We will use the “*” wildcard to avoid writing the names of all the files

Modify the script 02_toy_check_quality.sh accordingly:

training_project/02_toy_check_quality.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/training_project
mkdir -p 01_quality_checking_fastqc/

fastqc -o 01_quality_checking_fastqc/ toy_data/*.fastq.gz

and run the script again

Terminal
cd /home/rstudio/mydatalocal/training_project
bash 02_toy_check_quality.sh

In Rstudio you can open several terminals at once to avoid waiting for the completion of one command.

By making a list of files corresponding to the pattern passed to fastqc, we can check we didn’t make any mistakes:

cd /home/rstudio/mydatalocal/training_project
ls toy_data/*.fastq.gz

3.3 Aggregate reports with multiqc

MultiQC aggregates results from bioinformatics analyses across many samples into a single report It searches a given directory for analysis logs and compiles a HTML report. It’s a general use tool, perfect for summarizing the output from numerous bioinformatics tools.

Write this script and execute it.

training_project/03_toy_aggregate_quality_report_multiqc.sh
#!/bin/bash
cd /home/rstudio/mydatalocal/training_project
mkdir -p 01_quality_checking_aggregate_multiqc/

multiqc 01_quality_checking_fastqc/ -o 01_quality_checking_aggregate_multiqc/
Terminal
cd /home/rstudio/mydatalocal/training_project
bash 03_toy_aggregate_quality_report_multiqc.sh

4 How to organize your project directory ?

For the moment we placed all scripts at the root of our project directory. In fact this is a very bad idea.

It is advisable to put all files related to the same project in the same folder, whether they are:

  • the raw data,
  • the scripts,
  • the results,
  • the project documentation,

But to separate the different types of data into sub directories.

This way, you will more easily find your way around and avoiding mixing up files or accidentally deleting them.

For example, your project directory might look like this:

project_name/
├── README.md             # overview of the project
├── data/                 # data files used in the project
├── results/              # results of the analysis (data, tables, figures)
├── scripts/              # contains all code in the project
│   ├── 01_...
│   ├── 02_...
│   └── ...
└── doc/                  # documentation for your project
    └── ...

Moreover, to be able to find your way around and to ensure reproducibility, you need to add a file, often named README.md, at the root of your folder that should contain all the useful information to take the project in hand. This is also the file that will be visible on the home page of your project on gitlab. So when someone wants to (re)work on the project, they will be able to open the folder, and they will know where to go to see and understand what has been done. This person can be a collaborator, your manager or simply yourself 6 months later.

Exercices
    1. In the training project folder, create a “scripts” folder and move scripts inside.
Show the code
cd /home/rstudio/mydatalocal/training_project

mkdir -p scripts

mv 0*sh scripts
    1. Do the same thing for results files
Show the code
cd /home/rstudio/mydatalocal/training_project

mkdir -p results

mv *quality* results/

4.1 Scripts and data naming

Another part of staying organized is making sure that all the directories and filenames for an analysis are as consistent as possible. You want to avoid names like alignment1.bam, and rather have names like 20170823_kd_rep1_gmap-1.4.bam that provide a basic level of information about the file.

Good name must be :

  • machine readable
    • avoid spaces, punctuation, accented
    • regular expression and globbing friendly (ex: ls CondA* )
    • easy to compute on (ex: deliberate use of delimiters, CondA_rep1.tsv)
  • human readable
    • name contains info on content (ex 01_download_training_dataset.sh)
  • plays well with default ordering (ex: put something numeric first for script for example, or use date for results)

This link and this slideshow have some good guidelines for file naming dos and don’ts.

5 More useful bash commands

In the last part, you have seen some basic bash commands. In this session we will see some new to complete your bash knowledge. As previously, the idea is only to give you useful command for this training but not a complete view of bash.

5.1 New bash commands

Create a new “tuto_bash” directory for the exercises in mydatalocal

cd /home/rstudio/mydatalocal
mkdir -p tuto_bash
cd tuto_bash
  • grep -> to get only lines of a file with a pattern
cowsay -f dragon "Hello Hello!" > dragon.txt
cat dragon.txt
grep "-" dragon.txt
  • sed -> to replace all the occurrences of a pattern by another
sed "s/Hello/TOTO/g"  dragon.txt

sed can do a lot of other things but we will use only this one.

If you want to replace only the first occurrence of a line, you have to remove the “g” character

sed "s/Hello/TOTO/"  dragon.txt
  • wc -l -> to count the number of lines of a file
wc -l  dragon.txt

wc can do other things but we will use only this one.

  • basename -> to get only the name of a file from a path (the file can not exist)
basename toy_data/sample1.fastq.gz
  • dirname -> to get only the name of the directory containing the file from a path (the file can not exist)
dirname toy_data/sample1.fastq.gz
  • command ‘piping’, using |

The pipe allows forwarding the output of one command as the input of the next command. This way you can execute several consecutive commands. For example, if you want to count the number of line containing a given pattern in a file:

grep "-" dragon.txt > occurence.txt
wc -l occurence.txt

You can instead “pipe” the command using a pipe | (AltGr+6 for PC and Alt + Maj + L for Mac ) between two commands to link the two commands

grep "-" dragon.txt | wc -l  

(By the way, grep already has an option to do this directly (grep -c), but it was for the example.)

  • gunzip and zcat to decompress files

As previously seen, fastq files are large files and you should store them in a compressed form whenever useful. In unix we used preferentially .gz compression (gunzip) than .zip, but it is the same idea.

To check the contents of a compressed file, you can unzip it with gunzip but this takes time and disk space. To do it quickly, there is a command called zcat that allows you to decompress a gz file on the fly and display its contents. So, zcat piped (|) with head will allow you to display the first few lines of a gz file without taking up time or disc space.

zcat /home/rstudio/mydatalocal/training_project/toy_data/White_R1.fastq.gz | head -n 8
Exercices

Using the commands we’ve just seen:

    1. Display the first 100 sequences of a training fastq file
Show the code
zcat /home/rstudio/mydatalocal/training_project/toy_data/White_R1.fastq.gz | head -n 400
    1. Count the number of sequences of a training fastq file
Show the code
zcat /home/rstudio/mydatalocal/training_project/toy_data/White_R1.fastq.gz | grep "@SRR1" | wc -l

5.2 Store strings in variables

In scripts, we frequently re-use bits of text: for example, a path to a file. It is more convenient to keep these in a named “variable” and then re-use that variable in the script. This way, if you need to modify the file path, you only need to modify the variable once, and not all the occurrences of the file path.

To define a variable you will declare it with an = without space.

variable1="toto"

Quotes are not strictly mandatory, for example if there are no spaces in your string:

variable1=toto

and then you can refer to the variable’s content by using a $ before its name. To display the command of a variable you can use echo.

echo $variable1
echo variable1
echo "variable1: $variable1"

You can also combine variables to define a new one

variable1="TOTO"
variable2="TATA"

echo variable1: $variable1
echo variable2: $variable2

variable3="$variable1 $variable2"
echo variable3: $variable3

It is better to use explicit variable names, rather than “variable1”, “variable2”

To use a variable, you sometimes need to add curly brackets ({}) around the variable name, to distinguish the variable name from the string that follows.

For example, from a variable, $sample you want to create a filename using the contant of sample as prefix of your file using $sample_rep1.fastq.gz. You need to add curly brackets around sample (${sample}_rep1.fastq.gz), else the terminal will search for the variable $sample_rep1.

sample="sampleA"

filenameX=$sample_rep1.txt #variable sample_rep1 don't exist so filenameX=.txt
filenameY=${sample}_rep1.txt

echo filenameX: $filenameX
echo filenameY: $filenameY

Variables can also capture the output of a command, using $(command)

list_files=$(ls)
working_directory=$(pwd)

echo "In  $working_directory, there are :
$list_files"

You can use also pipe inside this construct

nb_dash=$(grep "-" dragon.txt | wc -l)
echo $nb_dash

Using variables will simplify writing scripts but also reading them.

filename=dragon.txt
new_filename=dragon_bis.txt

echo "$filename will be cp to $new_filename"
cp  $filename  $new_filename

You can also do pattern substitutions in a variable, using ${variable/pattern/replace} to replace the first occurrence.

filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/R1/R2}

If you want to remove a pattern, you can use the same syntax but without replacement string. It is a good way to remove the extension of a file name.

filenameR1=sample1.R1.fastq.gz
echo ${filenameR1/.R1.fastq.gz/}

You can save the output of this pattern substitution in a new variable

filenameR1=sample1.R1.fastq.gz
filenameR2=${filenameR1/R1/R2}

echo R1: $filenameR1 R2: $filenameR2
Exercices

You will work in the /home/rstudio/mydatalocal/tuto_bash directory to do the following exercice. But data will be in /home/rstudio/mydatalocal/training_project/toy_data/

cd /home/rstudio/mydatalocal/tuto_bash
fastq_data_dir="/home/rstudio/mydatalocal/training_project/toy_data/"

Using the commands we’ve just seen:

  1. Complete the commands below to define a variable prefix containing the prefix of the sample name (“White”) from one file of a pair of PE fastq files (White_R1.fastq.gz)
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz

...

echo $prefix # display White
Show the code
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
filenameR1=$(basename $full_filenameR1)

prefix=${filenameR1/_R1.fastq.gz/}

echo $prefix # display White
  1. From this prefix variable, write codes to save in a file (results/seq_numbers/$prefix.nb_sequences.txt) the number of sequences of the R1 fastq file. Before create a directory to save the results (“results/seq_numbers”).
Show the code
output_dir="results/seq_numbers"
mkdir -p $output_dir


full_filenameR1=$fastq_data_dir/White_R1.fastq.gz

filenameR1=$(basename $full_filenameR1)

prefix=${filenameR1/_R1.fastq.gz/}

echo $prefix # display White

zcat $full_filenameR1 | grep "@SRR1" | wc -l > $output_dir/$prefix.nb_sequences.txt
  1. Starting from the full filename: full_filenameR1=$fastq_data_dir/White_R1.fastq.gz, write codes to save automatically:

    • in a varibale named filenameR2 the name of the R2 file (=White_R2.fastq.gz)
    • and in a variable full_filenameR2 the path to the file R2 (=/home/rstudio/mydatalocal/training_project/toy_data//White_R2.fastq.gz)

full_filenameR1=$fastq_data_dir/White_R1.fastq.gz

...

echo filenameR2: $filenameR2  # display White_R2.fastq.gz
echo full_filenameR2: $full_filenameR2 # display /home/rstudio/mydatalocal/training_project/toy_data//White_R2.fastq.gz
Show the code
full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
fastq_dirname=$(dirname $full_filenameR1)

filenameR1=$(basename $full_filenameR1)

echo filenameR1: $filenameR1
echo full_filenameR1: $full_filenameR1

filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"

echo filenameR2: $filenameR2
echo full_filenameR2: $full_filenameR2
  1. Complete the previous codes to get the number of sequences of each files of the sample (R1 and R2)

the output file must contain:

White_R1.fastq.gz: 1000
White_R2.fastq.gz: 1000
Show the code
cd /home/rstudio/mydatalocal/tuto_bash/

output_dir="results/seq_numbers"
mkdir -p $output_dir


full_filenameR1=$fastq_data_dir/White_R1.fastq.gz
fastq_dirname=$(dirname $full_filenameR1)

filenameR1=$(basename $full_filenameR1)

filenameR2=${filenameR1/R1/R2}
full_filenameR2="$fastq_dirname/$filenameR2"


prefix=${filenameR1/_R1.fastq.gz/}
output_file="$output_dir/$prefix.nb_sequences.txt"


echo $prefix # display sampleA

nb_lines_R1=$(zcat $full_filenameR1 | grep "@SRR1" | wc -l)

echo "$filenameR1: $nb_lines_R1"> $output_file

nb_lines_R2=$(zcat $full_filenameR2  | grep "@SRR1" | wc -l)

echo "$filenameR2: $nb_lines_R2">> $output_file


cat $output_dir/$prefix.nb_sequences.txt

5.3 “For” loops to iterate on a list

Bioinformatic analysis often involves running many commands, but with only a small alteration each time you run a new command.

For example, you might have to analyze sample1, then sample2, sample3, and so on. To save time, we will use something called a Loop - also known as a for-loop.

Let’s illustrate what this is with a simple example. Let’s say that you need to print the letters a, b, c. You could do this by typing echo a and hitting return; then echo b, and echo c. This gets the job done, but you can see that this would quickly become tedious if your goal was to print dozens or hundreds of numbers.

How can we make this easier? You probably noticed that each time we ran the command we changed the number after the echo command. A for-loop will automatically do this for you.

Here is an example of a for-loop to print the letters a, b, c, d:

for X in a b c d
do
  echo $X
done

The for-loop has three sections:

  • The first section is the Declaration: it begins by assigning the first item after in to the variable *X; in this case, it would assign the value “a” to “X”. The letters after in are called the List**.
  • The next section is the Body, which runs the commands written after do, replacing the variable with whichever value is currently assigned to the variable - for the first loop, this will be the letter a. Since items remain in the list, the loop goes back to the declaration and assigns the next number in the list to the variable X; in this case, the number b. Then the body is run, and the process repeats until the end of the list is reached.
  • The last section, called the End, contains only the word done, meaning to exit the loop after all of the items in the list have been run through the Body of the loop.

You can add more commands to the Body section. For example, we could change the loop to:

for X in a b c d
do
  echo Do something
  echo Do something with X=$X
  echo Do another thing
done

You can use a variable as seen previously to store the list of things to iterate. And use a more explicit than X to iterate on (for example letter)

list="a b c d"
for letter in $list
do
  echo Do something
  echo Do something with $letter
  echo Do another thing
done

Finally, the list you iterate through can be created automatically. For example, if you want to iterate through a list of files you can do:

cd /home/rstudio/mydatalocal/training_project/

list_files=$(ls toy_data/*gz)
echo $list_files

echo "Start:"

for file in $list_files
do
  echo Do something
  echo Do something with $file
  echo Do another thing
done

echo "End"

#OR

for file in toy_data/*gz
do
  echo Do something
  echo Do something with $file
  echo Do another thing
done

Side note: Indentations are not important for the machine but help readability for us humans.

Exercices

Using the commands we’ve just seen:

  1. Iterate on each fastq files of the toy_data directory and print its path
Show the code
for full_filename in toy_data/*gz
do
  echo Path : $full_filename
done
  1. Using the same basis, get from its name the prefix of the sample
Path : White_R1.fastq.gz
Prefix : White_R1
Path : White_R2.fastq.gz
Prefix : White_R2
Path : Orange_R1.fastq.gz
Prefix : Orange_R1
Path : Orange_R2.fastq.gz
Prefix : Orange_R2
Show the code
cd /home/rstudio/mydatalocal/training_project/

for full_filename in toy_data/*gz
do
  echo Path : $full_filename
  
  filename=$(basename $full_filename)

  prefix=${filename/.fastq.gz/}

  echo Prefix : $prefix
done
  1. As the training dataset is Paired-end each sample is printed 2 times. Search for a way to display one time the prefix of each sample only once.
Show the code

cd /home/rstudio/mydatalocal/training_project/

for full_filenameR1 in toy_data/*R1*gz
do
  echo Path R1: $full_filenameR1
  
  filename=$(basename $full_filenameR1)

  prefix=${filename/_R1.fastq.gz/}

  echo Prefix : $prefix
done
  1. Take back the code you write to get the number of sequences of one fastq file and iterate over each sample of the PE training dataset to get the number of sequences of each fastq files.
cd /home/rstudio/mydatalocal/training_project

output_dir="/home/rstudio/mydatalocal/tuto_bash/results/seq_numbers"
mkdir -p $output_dir

for full_filenameR1 in toy_data/*R1*gz
do


...


done
Show the code
cd /home/rstudio/mydatalocal/training_project

output_dir="/home/rstudio/mydatalocal/tuto_bash/results/seq_numbers"
mkdir -p $output_dir

for full_filenameR1 in toy_data/*R1*gz
do
  echo Filename : $full_filenameR1

  fastq_dirname=$(dirname $full_filenameR1)
  filenameR1=$(basename $full_filenameR1)
  
  filenameR2=${filenameR1/R1/R2}
  full_filenameR2="$fastq_dirname/$filenameR2"
  
  prefix=${filenameR1/.R1.fastq.gz/}
  
  output_file="$output_dir/$prefix.nb_sequences.txt"
  
  echo Prefix: $prefix # display White
  
  nb_lines_R1=$(zcat $full_filenameR1 | grep "@SRR1" | wc -l)
  echo "$filenameR1: $nb_lines_R1"> $output_file
  
  nb_lines_R2=$(zcat $full_filenameR2 | grep "@SRR1" | wc -l)
  echo "$filenameR2: $nb_lines_R2">> $output_file
  
  #to check
  cat $output_file

done

#to check the output dir
ls $output_dir
To finish

Now it’s time to get to work on your project!

Remember, if you’re looking for a command, have an incomprehensible bug or don’t know where to start :

  • ask other members of your group
  • the internet is also your friend (it’s not cheating!!)
  • ask one of the trainers

These materials have been developed by members of UE NGE teaching team of the ENS de Lyon. These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.