USEFUL INFORMATION: Short introduction to script versioning (git) through RStudio

UE NGS - ENS LYON

Author

NGS-team - 2023

This intro is a compilation of various tutorials available on the web:

The aim is to teach you the basic concepts and commands so that you can work independently in :

1 Why should you use Git?

1.1 Without git

  • Chaotic file management, with copies scattered across:
    • different versions of a file (V1,V2,final, final_ok ….)
    • different system backups (PC, backup disks, …)
    • different users …
  • no trace of why changes were made
  • impossible to return to an earlier version

This comic strip should bring back memories for everyone:

1.2 With git

  • Preservation and archiving of your project

  • Clear history of changes

  • Efficient collaborative working:

    • Work in parallel and easily merge files
    • Keep track of who did what

2 Git can be seen as a time machine

Git lets you write the history of your project, alone or with others, using snapshots: pictures of a folder, and of the files in it, that you wish to track. This folder is the repository. Each time you want to freeze the state of the repository (take a snapshot), you make what’s called a commit.

Each commit records a certain amount of information:

  • the modifications made
  • who made the modifications
  • when the modifications were made
  • why the modifications were made (described in the commit message).
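
As a minimal sketch, the information stored in a commit can be displayed from the command line. The throwaway folder, the file analysis.R and the identity "Demo User" are placeholders invented for this illustration; they are not part of the course material.

```shell
set -e
# Throwaway repository (hypothetical; nothing here touches your real projects)
repo=$(mktemp -d)
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git config user.name "Demo User"          # placeholder identity, local to this repository
git config user.email "demo@example.org"

# Make one commit
echo "counts <- read.csv('counts.csv')" > analysis.R
git add analysis.R
git commit -q -m "Add script that loads the count table"

# Show who (Author), when (Date), why (message) and what (files changed)
details=$(git show --stat HEAD)
echo "$details"
```

The Author and Date lines, the commit message and the list of changed files correspond to the four items above.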

Over the course of commits, you’ll build up a history that can be consulted. The main history, which contains the “clean” version of your repository, is located on the master branch.

You can also create parallel branches from a commit. For example, you could create an additional branch to do something “just to see” and abandon your idea, or keep your modifications and merge them with the master branch via a merge. But either way, you’ll have kept track of them.

In this course, we won’t be using any additional branches, but you should know that they do exist, and that they make this tool, git, so powerful and indispensable for collaborative projects.
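
For the curious, here is a minimal sketch of what a "just to see" branch could look like on the command line. The branch name try-idea, the file notes.txt and the identity are hypothetical, and the name of the main branch is read from git because it may be master or main depending on your configuration.

```shell
set -e
repo=$(mktemp -d)                          # throwaway repository
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.org"

echo "step 1" > notes.txt
git add notes.txt
git commit -q -m "Start notes"
main_branch=$(git rev-parse --abbrev-ref HEAD)   # "master" or "main"

# Create a parallel branch and commit an experiment on it
git checkout -q -b try-idea
echo "experimental step" >> notes.txt
git commit -q -a -m "Try an alternative approach"

# Back on the main branch, the experiment is not there yet
git checkout -q "$main_branch"
before_merge=$(cat notes.txt)

# Keep the idea: merge the branch into the main history
git merge -q try-idea
after_merge=$(cat notes.txt)
echo "$after_merge"
```

To abandon the idea instead, you would simply not run the merge; the commits made on try-idea stay recorded on that branch.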

2.1 Collaborate or archive your code via a remote repository (e.g. gitlab/github)

Git lets you back up your versioned project on a remote server; this copy is called the remote. Your remote can be on GitHub (the most famous) or on a self-hosted GitLab (as here at the ENS).

To retrieve a project from a remote for the first time, you clone it: you make a copy of the project on your local machine. When you make commits in your local project, you can send them to the remote with a push. Other people connected to the remote then perform a pull to retrieve your commits.

In this way, the local version (on your computer) and the remote version of your project stay synchronized, as long as you push and pull regularly.
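
The clone / push / pull cycle can be tried without any server. In the sketch below, a local bare repository stands in for the remote, and the two working copies "alice" and "bob" are invented collaborators; on the course GitLab the remote would be your project URL instead.

```shell
set -e
sandbox=$(mktemp -d)                       # throwaway folder
trap 'rm -rf "$sandbox"' EXIT
cd "$sandbox"

# A local bare repository plays the role of the remote
git init -q --bare remote.git

# clone: first retrieval of the (still empty) project
git clone -q remote.git alice
cd alice
git config user.name "Alice"               # placeholder identity
git config user.email "alice@example.org"
echo "# Demo project" > README.md
git add README.md
git commit -q -m "Add README"
git push -q origin HEAD                    # push: send the commit to the remote
cd ..

# A collaborator clones the project and receives that commit
git clone -q remote.git bob

# Alice commits again and pushes
cd alice
echo "More details" >> README.md
git commit -q -a -m "Describe the project"
git push -q origin HEAD
cd ..

# pull: Bob retrieves the new commit
cd bob
git pull -q --ff-only
cd ..
```

After the pull, both working copies point to the same latest commit.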

2.2 How do you write your story?

The three most common manipulations are shown in the diagram below:

  • pull: retrieve the latest version of the files from the remote repository
  • commit: validate my changes with a message explaining them
  • push: transmit validated changes to the remote repository

To be more precise, there is an additional step before validating your modifications (i.e. making a commit): indexing them. Git lets you choose precisely which modifications to record, rather than taking everything in your workspace (working directory). Only indexed modifications, those you have added to your staging area via the stage command (git add on the command line), will be saved in your commit.

To summarize:

1 - First you make changes to your files, but these changes will not be saved in the repository.

2 - Use the stage command to select the modifications you’re going to include in the next commit and place them in the staging area.

3 - Then use the commit command to save the selected changes in the staging area.

These steps can be carried out via the command line, but there are also graphical tools, and most editors (IDEs), such as RStudio or Visual Studio Code, have built-in support or plug-ins to make life easier.
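
On the command line, the three steps could look like the sketch below. The file names and the identity are hypothetical; the point is that only the staged file ends up in the commit.

```shell
set -e
repo=$(mktemp -d)                          # throwaway repository
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.org"
echo "x <- 1" > script_a.R
echo "y <- 2" > script_b.R
git add script_a.R script_b.R
git commit -q -m "Add two scripts"

# 1 - Modify both files in the working directory
echo "x <- 10" > script_a.R
echo "y <- 20" > script_b.R

# 2 - Stage only the modification you want in the next commit
git add script_a.R

# 3 - Commit what is in the staging area
git commit -q -m "Change x in script_a.R"

# script_b.R is still reported as modified: its change was not committed
remaining=$(git status --short)
echo "$remaining"
```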

2.3 Summary of key commands

  • clone: retrieve the repository from the remote for the first time
  • stage: select the changes that will go into the next commit.
  • commit: a frozen moment in the life of your project
  • push: send new commits to the remote.
  • pull: retrieve new commits locally from the remote.
  • checkout: jump back in time to a commit.
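
As an illustration of the time-machine idea, the sketch below jumps back to an earlier commit with checkout and then returns to the present. The file report.txt and the identity are invented for the example.

```shell
set -e
repo=$(mktemp -d)                          # throwaway repository
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.org"

echo "version 1" > report.txt
git add report.txt
git commit -q -m "First version"
first_commit=$(git rev-parse HEAD)               # identifier of the first snapshot
main_branch=$(git rev-parse --abbrev-ref HEAD)   # "master" or "main"

echo "version 2" > report.txt
git commit -q -a -m "Second version"

# Jump back in time: the files return to their state at the first commit
git checkout -q "$first_commit"
past=$(cat report.txt)

# Come back to the latest state of the main branch
git checkout -q "$main_branch"
present=$(cat report.txt)
echo "past: $past / present: $present"
```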

You can get a more global view of your environment with this diagram:

Now that we’ve covered the basics, it’s time to give it a try!

3 It’s time to give it a try: Initialize your Git project on Gitlab and use Rstudio to manage it locally

3.1 Linking Gitlab and your machine using RStudio

3.1.1 Create an account on ENS’s Gitlab: https://gitbio.ens-lyon.fr/

If you don’t have an account yet:

  • Go to the site and try to log in via SSO Ens de Lyon; this will redirect you to the CAS, where you log in with your ENS credentials.
  • You will then be blocked, which is normal. Carine will receive an account request and will be able to validate it.
  • Send an e-mail to Carine (carine.rey@ens-lyon.fr) specifying your group name.

3.1.2 Create a new repository on the ENS’s Gitlab

  • Click on the UE group (Menu -> groups -> your groups) (https://gitbio.ens-lyon.fr/ue/ue-ngs/students_2023)

  • Then create a project by clicking on (Create new project)

    • Select Create from blank project (Create blank project)
    • Give your project a name containing your group name and your name (Ideally, your project name should be in lower case, without periods, spaces or underscores, and should not begin with a number, e.g. scRNAseq_arabido_carine).
    • Leave Initialize repository with a README selected.
    • Click on Create project

3.1.4 Configuring git in Rstudio

Finally, you will need to declare your identity: in the RStudio terminal (not the R console), type in your name so that each of your commits is linked to you:

git config --global user.name "your_pseudo"
git config --global user.email "your_mail@mail.com"
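
To check that the identity was stored, you can read the values back. This is only a sketch: it redirects HOME to a throwaway folder (the fake_home variable is an assumption of the example) so that running it does not overwrite your real settings. On your own machine, only the git config lines are needed.

```shell
set -e
# Throwaway home directory, used only so this sketch leaves your real settings alone
fake_home=$(mktemp -d)
trap 'rm -rf "$fake_home"' EXIT
export HOME="$fake_home"
unset XDG_CONFIG_HOME

# Declare the identity (same commands as above, placeholder values)
git config --global user.name "your_pseudo"
git config --global user.email "your_mail@mail.com"

# Read the values back to confirm they were saved
saved_name=$(git config --global --get user.name)
saved_email=$(git config --global --get user.email)
echo "$saved_name <$saved_email>"
```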

3.1.5 Clone your empty repository and create an R project in Rstudio

To associate this Git repository with an R project via RStudio, you need to make a clone:

  • On Gitlab: click on Code and copy the URL (SSH protocol)

  • In RStudio now click on: File > New Project… > Version Control > Git,
    • enter the URL/SSH address of the repository you’ve copied, the name of the R project (ideally the same as Git)
    • enter the folder in which to place it (~/mydatalocal),
    • click on Create Project and finally enter your passphrase.

In this newly created RStudio project, you’ll see the git tab in the top right-hand corner.

3.2 Using git commands in Rstudio

RStudio’s Git panel shows the status of your project’s files and folders in real time:

  • A new file will be associated with an orange icon containing a ?
  • This new file will be associated with a green icon containing an A once you’ve checked it (in the ‘staged’ column).
  • A modified file will be associated with a blue icon containing an M
  • A deleted file will be associated with a red icon containing a D
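
The same information is available on the command line: git status --short prints the letters that the icons display. The sketch below recreates the four situations in a throwaway repository; all file names and the identity are hypothetical.

```shell
set -e
repo=$(mktemp -d)                          # throwaway repository
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q
git config user.name "Demo User"           # placeholder identity
git config user.email "demo@example.org"
echo "a" > modified.R
echo "b" > deleted.R
git add modified.R deleted.R
git commit -q -m "Initial files"

echo "new" > untracked.R                   # new file, not staged   -> ??
echo "new" > added.R
git add added.R                            # new file, staged       -> A
echo "change" >> modified.R                # modified file          -> M
rm deleted.R                               # deleted file           -> D

status=$(git status --short)
echo "$status"
```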

3.3 Configuring files to be synchronized or not using a .gitignore file

You don’t need to synchronize all the files in your project: only those you stage will be included in commits. You can also explicitly ask Git never to track particular files: this is the role of the .gitignore file at the root of your project. This is a text file in which each line is a pattern; patterns accept wildcards (glob patterns such as *, not full regular expressions), so a single rule can match several files.

By default, when creating the Rstudio project, a .gitignore file is added containing the following lines:

.Rproj.user
.Rhistory
.RData
.Ruserdata

This means that the Rstudio project configuration files are not tracked.

For the practical sessions (TP), and in general, we don’t want to track changes to raw data or results.

  1. Add the following lines to the .gitignore file:
data/
results/
*.Rproj
  2. Stage the change by ticking the box in the Staged column next to the .gitignore file.

  3. Then commit with an explicit message.

  4. Then push to synchronize your local modifications with the remote.

  5. View the changes on Gitlab
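
If you want to check that your rules behave as expected, git check-ignore reports which paths are ignored and by which rule. The sketch below uses the .gitignore content from this section in a throwaway repository; the example files (raw_counts.csv, figure1.pdf, analysis.R, demo.Rproj) are invented for the test.

```shell
set -e
repo=$(mktemp -d)                          # throwaway repository
trap 'rm -rf "$repo"' EXIT
cd "$repo"
git init -q

# .gitignore as described above
cat > .gitignore <<'EOF'
.Rproj.user
.Rhistory
.RData
.Ruserdata
data/
results/
*.Rproj
EOF

# Hypothetical project content
mkdir -p data results src
touch data/raw_counts.csv results/figure1.pdf src/analysis.R demo.Rproj

# -v prints, for each ignored path, the rule that matches it
git check-ignore -v data/raw_counts.csv results/figure1.pdf demo.Rproj

# Only what git still sees is listed here
visible=$(git status --short)
echo "$visible"
```

In this sketch, only .gitignore itself and the src/ folder remain visible to git.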

3.4 Organizing your working directory

It’s a good idea to put all your project-related files in the same folder:

  • raw data
  • scripts
  • results
  • project documentation

To help you find your way around and avoid mixing up files or accidentally deleting them, we recommend that you separate the different types of data into sub-folders.

For example, your working directory might look like this:

project_name/
├── README.md             # overview of the project
├── data/                 # data files used in the project
├── results/              # results of the analysis (data, tables, figures)
├── src/                  # contains all code in the project
│   └── ...
└── doc/                  # documentation for your project
    └── ...
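
A layout like this one can be created in a few commands. This is a sketch run in a throwaway location; project_name is the placeholder used in the tree above, and in practice you would run the mkdir and touch lines inside your own RStudio project folder.

```shell
set -e
workdir=$(mktemp -d)                       # throwaway location for the demo
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

mkdir project_name
cd project_name
mkdir -p data results src doc              # one sub-folder per type of file
touch README.md                            # empty README, to be filled in

listing=$(ls -1A)
echo "$listing"
```

Keep in mind that git tracks files, not folders: an empty folder will not show up on Gitlab until it contains at least one tracked file.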

In addition, for ease of use and reproducibility, you need to add a file, often called README.md, to the root of your folder, which will contain all the information you need to get started with the project.

This is also the file that will be visible on your project’s home page on Gitlab. This way, when someone wants to (re)work on the project, they can open the file, and they’ll know where to go to see and understand what’s been done. This person could be a collaborator, your manager or simply yourself 6 months later.

3.4.1 The README.md

In concrete terms, the README.md file is a text file written in markdown (hence the .md extension). Markdown is a language that allows you to encode the formatting of plain text simply and easily.

For example, # marks the following line as a title, ## as a subtitle, and ### as a sub-subtitle. You can browse the various tags here: https://www.markdownguide.org/basic-syntax/

This makes it possible to write text without wasting time on formatting, keeping the file “light” and, above all, readable for everyone. On your project’s Gitlab page, you’ll find your formatted README.md file.
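
As a starting point, the sketch below writes a small README.md from the terminal and lists its headings. The section names and the text are only suggestions invented for the example, not a required template.

```shell
set -e
workdir=$(mktemp -d)                       # throwaway location for the demo
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Write a small README.md; the quoted 'EOF' keeps the text exactly as typed
cat > README.md <<'EOF'
# scRNAseq project

## Aim

One or two sentences on the biological question.

## Folder organisation

- data/: raw data (not tracked by git)
- results/: figures and tables (not tracked by git)
- src/: analysis scripts

### Notes

Day-by-day remarks, as in a lab notebook.
EOF

# List the headings found in the file
headings=$(grep '^#' README.md)
echo "$headings"
```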

Don’t forget to add to your README.md as you go along, so you don’t forget anything. You can think of it as your laboratory notebook or laboratory report.

At the end of the course, the quality of your README.md will be particularly important in the evaluation.

3.4.2 Creating your project architecture

  1. Create your README.md
  2. Start completing it
  3. Index, commit, push…
  4. Create data, results and scripts folders
  5. Index, commit, push …
  6. Update .gitignore file
  7. Index, commit, push …