---
title: "Course NGS analysis appetizer 2024"
author: "Franziska Turck"
format: html
editor: visual
---

# Day 1

The aim of the practical course is to give "absolute beginners" in Next Generation Sequencing (NGS) data analysis a small appetizer on how to do a bioinformatics analysis using command line tools.

Most Next Generation Sequencing (NGS) data sets are too large to be analyzed on a normal personal computer. Analysis requires a high performance compute cluster, which usually runs some version of Linux as its operating system. The usual interface to Linux is the terminal, which accepts commands provided in plain text.

A registered user can connect to the compute cluster through a console/shell/terminal program running on the local computer. After login, the user enters the cluster in their personal account folder, which is placed within the /home folder. In our case, we are part of the group "kurse" and our home folders are placed within this group folder.

### The typical structure of a Linux server

Unix/Linux computers have a few "fixed" folders which serve different purposes.

| Folder   | Description |
|----------|-------------|
| /        | The directory called "root". It is the starting point for the file system hierarchy. *Note that this is not related to the root, or superuser, account.* |
| /bin     | Binary files, i.e. executable programs |
| /etc     | System configuration files |
| /home    | Home directories |
| /opt     | Optional or third party software |
| /tmp     | Temporary space, typically cleared on reboot |
| /usr     | User related programs. *Note that you will not be able to install anything there unless you have a root/superuser account.* |
| /var     | Variable data, most notably log files |
| /scratch | Data depository that is not backed up and usually temporary. Ideally, the working directory and temporary data are in `/scratch`, while scripts, raw data that cannot be reproduced, and final results are stored somewhere in `/home`. |

### Where are the programs installed?

A normal user (not a superuser) usually does not have the right to install programs in the system folders `/bin`, `/opt` or even `/usr`. Programs installed in these folders can usually be executed by every user, because their location is part of the default system search path, but not every user can modify the content of the corresponding folders.

The directories containing "binaries" (compiled executable programs) are listed in the variable `$PATH`, which the system searches after receiving the command to run a program.

Sometimes a user needs a program that is not installed, or a version of a program that differs from the default version installed in the system path. This could be the case if the statistical program R is installed in the system, but not the version of R we would like to use.

On a Linux server, any user can install executable programs in the personal `home` folder, a folder underneath it, or a shared folder in which the user has read/write/execute rights. In order to use these programs, the user has to add the location of the executable binaries to the `$PATH` variable.

### Defining custom paths to "personal" programs

The simplest way to execute the command `dosomething` installed in `/home/user/bin/dosomething` is to type the full path to the binary.

Since it is tedious to always type the full path to the executable program, all possible paths to program binaries can be specified in particular "hidden" text files that are placed in each user's home directory. These files are called `.bashrc` and `.bash_profile`.

For this course, we have test accounts on the CHEOPS compute cluster of the University of Cologne. During the first activation of the account, these hidden files (and several others) are automatically created and already contain some important instructions. The first exercise will be to connect to CHEOPS, try out some very basic Linux commands, install a small program and add the path to this program to the \$PATH variable.

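As a preview, adding a personal directory to the search path takes only two lines; a minimal sketch, using the hypothetical `dosomething` program from above:

``` bash
# append the personal bin folder to the search path and reload the file
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
source ~/.bashrc
dosomething   # the shell now finds /home/user/bin/dosomething without the full path
```
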
## Connect to server

To connect to a server, a local terminal/console program needs to run on the local computer. Some servers can be accessed from any computer, but other servers require more security. In the case of CHEOPS, access is only possible from the Cologne university network, either on campus or after entering the network via the student virtual private network (VPN).

Here is a [link](https://rrzk.uni-koeln.de/fileadmin/sites/rrzk/HPC_Projekte/CHEOPS_Brief_Instructions.pdf) to some basic information on connecting to and using CHEOPS.

The program used to connect is called `ssh`. There are many terminal programs; these are my current recommendations:

- MobaXterm

- vs-code

- RStudio

## Step 1: Connect with RStudio (option 1)

Start the program. The graphic interface should look like below; the important tab is "Terminal" in the lower half of the screen. If you do not see "Terminal", you can open one with the menu "Tools" \> "Terminal" \> "Open New Terminal".

Open the file `Readme.qmd` that contains this script.

You can send the content of a code block line by line to the terminal. To do this, enter the code block, set the cursor to the beginning of the line and press Ctrl+Alt+Enter. Otherwise, use copy-paste.

```{bash}

ssh user@cheops.rrz.uni-koeln.de

```

Once the local computer connects to the server, it will ask for the password. Type the password that was sent to you via email. While you are typing, the cursor will not move; the typed text is nevertheless transmitted. When finished typing, press Enter.

We are now in the "shell" that allows us to send commands to the Linux server. There are different shell types; we will use the shell `bash`, which is an acronym for Bourne Again Shell.

> *I could not test this for RStudio (I do not have a VPN for the University), but you should be able to use "Session" \> "Set Working Directory" to navigate to the server folder if you connect via VPN. In this case the bottom right panel will show the content of the folder on the server and you may be able to open the files on your local computer using the graphic interface of RStudio.*

## Step 1: Connect with vs-code (option 2)

Start the program. Navigate to the extensions panel and search for "remote explorer". Install the remote explorer and all other extensions that vs-code suggests around it. Afterwards, the tab for the remote explorer should be visible on the left bar.

Go to menu "View" \> "Command Palette" and select "Remote-SSH: Connect to Host". A drop-down menu will offer to add a new host. Select it and then type:

ssh user\@cheops.rrz.uni-koeln.de

in the box. After establishing the connection, the window will ask for the password - type and enter.

Below, a Terminal will open on the Linux server and the right panel will open the content of your home folder on the server.

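If you connect often, a shortcut can be defined in a file called `~/.ssh/config` on the local computer. A small sketch; replace `ngsly001` with your own course login:

``` bash
# ~/.ssh/config on your local computer
Host cheops
    HostName cheops.rrz.uni-koeln.de
    User ngsly001
```

Afterwards, typing `ssh cheops` is enough to start the connection.
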
## Step 2: Try the most important bash shell commands

Assignments using the commands below:

- go to the root

- list all folders in the root directory

- list the folders with more detailed information and note the write, read and execute rights that you have

- go back to your home folder

- list files and hidden files

- create an empty file named "copy_of_bash_profile"

- create an empty directory called bin in your home directory

- copy the content of .bash_profile to copy_of_bash_profile

- view copy_of_bash_profile

- edit copy_of_bash_profile in a way that if a program is installed in the local folder, it would be chosen over a version that is installed in the system binaries of the computer.

#### Moving around directories

``` bash
cd / #(Change Directory): moves you right up to the root
cd $HOME #back to your home directory
cd ~ #back to your home directory
cd .. #one directory up

pwd #(Print Working Directory): shows the full path of the current directory
```

#### Information on directories

``` bash
ls #(List): lists all files in the current directory
ls -l #same but with more details
ls -a #also shows hidden files
ls --help #all options of ls
```

#### Make files and delete files

``` bash
touch filename #creates a new file

mkdir foldername #(Make Directory): creates a new folder

mv file1 file2 #(Move): moves one file to another; if file2 doesn't yet exist it is automatically created, but file1 is gone after the command

cp file1 file2 #(Copy): almost as above but duplicates the file (or directory)

rm file #(Remove): removes a file

rmdir foldername #removes a directory, but only if empty

rm -r foldername #removes everything behind the name, be careful with the recursive option!!!!
```

#### Viewing and editing text files

``` bash
less filename #displays text file content, interactive movement with cursor
more filename #displays text file content, interactive scrolling
head filename #shows first lines of text file
tail filename #shows last lines of text file
#all have many options that can be seen with the --help option

nano filename #opens a terminal editor that allows to edit the content of text files
pico filename #opens a terminal editor that allows to edit the content of text files
```

#### Selecting specific lines from a file

``` bash
grep word filename #shows only lines containing the word in the file
```

#### Combining files

``` bash
cat file1 file2 #appends file2 after file1

paste file1 file2 #line by line adds content of file2 after content of file1
```

Here is a [link](https://fossbytes.com/a-z-list-linux-command-line-reference/) to a more complete glossary of Linux commands.

### Two very important hidden files in your home directory

The files `.bash_profile` and `.bashrc` were created for you when you first logged on and contain information that is loaded into the system each time you reconnect to the server. Note that this automatic creation is set up by the CHEOPS system administrator; it may not happen on another compute cluster, in which case you would need to use `touch .bashrc` to create an empty file. The shell program that reads them is the "[B]{.underline}ourne [a]{.underline}gain [sh]{.underline}ell" (bash).

```{bash}

less .bash_profile

```

The command above should display on the terminal something resembling this:

```
# .bash_profile
# Get the aliases and functions
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi
# User specific environment and startup programs
PATH=$PATH:$HOME/bin:. # program search path
export PATH
```

The `.bash_profile` is read automatically when the shell opens. Between `if` and `fi`, the `.bash_profile` tells the shell that it should read the local `.bashrc` if there is one.

Text after a \# sign is normally ignored by the system. The comment written by the system administrator indicates that personalized functions and aliases are found in the `.bashrc`.

`PATH` is a fixed variable name and lists the paths that the computer should search in order to find executable binaries. The system administrator who wrote the script already specified some paths. The user can add new paths to binaries, libraries and manual pages and then export them to the system. Here, `PATH` tells the computer to look first for binaries in the already defined `PATH`, then in your personal home folder `$HOME` within a folder called `bin` (which you just created) and then in the current folder "`.`" (a dot means current folder). The different entries are separated by '**`:`**', which basically means AND.

If you call a command and it is not specified in any of these paths, the shell will output "command not found"; otherwise it will run the version of the command it finds first, based on the hierarchy specified in the exported `PATH` of the `.bash_profile`.

#### Exercise:

What would be the difference between:

`PATH=$PATH:$HOME/bin:.`

and

`PATH=.:$HOME/bin:$PATH`

## Step 3: Install your first program and add its location to the path

We will install a program called sratoolkit that can automatically download and unpack NGS read data from the NCBI Sequence Read Archive. We install it mainly as an exercise (it is very simple to install); in practice it is often simpler to download short read data from the European Nucleotide Archive.

### A bit of information on repositories of short read data:

Published articles must contain a permanent link to the NGS data. Usually these data are stored in public short read archives.

Several archives accept short read sequence data; examples are:

1. NCBI SRA, the National Center for Biotechnology Information Sequence Read Archive (<https://www.ncbi.nlm.nih.gov/sra> )

2. European Nucleotide Archive (ENA) at the European Bioinformatics Institute (EBI) (<https://www.ebi.ac.uk/ena> )

3. DDBJ Sequence Read Archive at the DNA Data Bank of Japan (<https://www.ddbj.nig.ac.jp/dra/index-e.html> )

4. China National GeneBank (<https://db.cngb.org/> )

Manuscripts that contain NGS data should indicate an accession number under which short reads and analyses are deposited.

NGS data files are very large and therefore highly compressed in the archives. NCBI provides tools to download, decompress and reformat data in one step. We will download this set of tools, called sratoolkit, from NCBI and unpack the downloaded folder, which contains executable binary files that can be directly run on Linux servers (unfortunately many programs are not so easily installed and need to be compiled before running).

We will also add a line to our personal `.bash_profile` to direct commands typed in the shell to the place where you personally installed the binaries. A small test download will indicate that all is o.k. with the installed toolkit.

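The kind of line we will add could look like this; a sketch, assuming the toolkit binaries end up in `$HOME/bin/sratoolkit/bin` (adapt the path to your own layout):

``` bash
# in ~/.bash_profile: put the sratoolkit binaries on the search path
PATH=$HOME/bin/sratoolkit/bin:$PATH
export PATH
```
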
### Download a compressed folder with the software from NCBI

You would have found the following info by an internet search for "*install sratools NCBI*":

```{bash}

wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.2/sratoolkit.3.0.2-centos_linux64.tar.gz

```

`wget` (web get) is a Linux command that downloads files from a web or FTP server.

***Note:** Sometimes a simple wget does not work because the server requires more security settings. If you would like to know all options of the program, type `wget --help`.*

### Unpack the "tar ball"

A tarball is Linux slang for a compressed dataset containing folders and files. The compressed files end with the extension `.tar.gz`. The following command decompresses tarballs:

```{bash}

tar -xzf sratoolkit.3.0.2-centos_linux64.tar.gz #x = extract, z = decompress gzip, f = the next argument is the file name

```

In addition to the compressed archive, there is now also a folder with the same name. In this folder, there is another folder called bin, which contains the actual programs.

How you organize the following is your decision. Here are several options:

- add the path to `sratoolkit.3.0.2-centos_linux64/bin` to your `PATH`

- copy the binary files from `sratoolkit.3.0.2-centos_linux64/bin` to `$HOME/bin`

- copy the folder `sratoolkit.3.0.2-centos_linux64/bin` to `$HOME/bin/sratoolkit` and add the path to these binaries to the `PATH` in `.bash_profile`.

Add the commands that you used here for documentation:

```{bash}
#where are your binaries now, how does your .bash_profile look?

```

### Test if the program works

If you did everything right, the following command should print the first two reads of the run SRR5298630 in `fastq` format to the screen, no matter where you start from.

```{bash}

fastq-dump --stdout -X 2 SRR5298630

```

#### Exercise:

It didn't work?

- Does the program work when you navigate directly to the folder with the executable binary files?

- Is the program in the list of binary files?

- ... group discussion

# Day 1 afternoon

## Download fastq files for further analysis

Our assignment for this week is to re-analyze the raw data from this paper:

[Xi, Y., Park, S.-R., Kim, D.-H., Kim, E.-D. and Sung, S. (2020), Transcriptome and epigenome analyses of vernalization in Arabidopsis thaliana. Plant J, 103: 1490-1502. https://doi.org/10.1111/tpj.14817](https://onlinelibrary.wiley.com/doi/full/10.1111/tpj.14817#open-research-section)

The manuscript compares epigenetic histone modifications of plants before, during and after vernalization. For one important epigenetic modification, the authors find a general increase of signal during and after vernalization. We want to have a closer look at the data to make sure that this is a trustworthy result.

The manuscript contains the accession number for the genomic data under the header Open research:

The data used in this study are found at GSE130291 (<https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE130291>).

Although the data were originally submitted to NCBI, download from ENA (where they can be found under the same accession number) is more convenient.

Open the ENA search page:

<https://www.ebi.ac.uk/ena/browser/text-search>

Search for the accession GSE130291

This will lead us to

<https://www.ebi.ac.uk/ena/browser/view/SRR8955899>

The repository contains files for RNAseq and ChIPseq - at the moment we stick to the ChIPseq files.

There is the option to select the files of interest and generate a download script.

We can launch the commands in the terminal, but to structure this a bit, let's split the tasks between groups.

| user | files to download |
|----------|--------------------------------------------------------|
| ngsly001 | input for all conditions |
| ngsly002 | H3K27me3 before vernalization (both replicates) |
| ngsly003 | H3K27me3 at the end of vernalization (both replicates) |
| ngsly004 | H3K27me3 10 days after vernalization (both replicates) |
| ngsly005 | H3K36me3 before vernalization (both replicates) |
| ngsly006 | H3K36me3 at the end of vernalization (both replicates) |
| ngsly007 | H3K36me3 10 days after vernalization (both replicates) |
| ngsly008 | H3K4me3 before vernalization (both replicates) |

Everyone should make a subfolder in the \$HOME folder to store their seq data and then start the personal download script generated by the ENA website.

```{bash}
mkdir fastq
cd fastq

#this is an example; -nc (no clobber) skips files that have already been downloaded
wget -nc ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR895/008/SRR8955908/SRR8955908_2.fastq.gz

```

## Introduction Fastq format

The reads we downloaded are in fastq format and contain sequence and sequence quality information. Here is a wiki page that explains the format very nicely (<https://en.wikipedia.org/wiki/FASTQ_format> ).

A FASTQ file normally uses four lines per sequence.

- Line 1 begins with a '\@' character and is followed by a sequence identifier and an *optional* description (like a [FASTA](https://en.wikipedia.org/wiki/FASTA_format "FASTA format") title line).

- Line 2 is the raw sequence letters.

- Line 3 begins with a '+' character and is *optionally* followed by the same sequence identifier (and any description) again.

- Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.

Example from our previous download of two reads:

```
@SRR5298630.1 1 length=75
GCTGGNTAACCTCTCTCTTGCTCGTTGGTGGACTCATCTTGATTGTCCGCATGGGCAAGGACGTCGTGGTTCACG
+SRR5298630.1 1 length=75
AAAAA#EAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEAEEEEEEEEEEAEEEEEE/EAEAE//EE
@SRR5298630.2 2 length=75
CAGTTNCTGAGAATGCTTCTGTCTAGATTTTATATGAAGATATTCCCGTTTCCAACGAAATCTTCACAGCTATCC
+SRR5298630.2 2 length=75
AAAAA#EEEEEEEAEEEEEEAEEEEEEEEEAEEEAEEEEAEEEEEE<EEAE/A<EEAAEEAEEA</EEEAEEAAA
```

The byte representing quality runs from 0x21 (lowest quality; '!' in ASCII) to 0x7e (highest quality; '\~' in ASCII).

Here are the quality value characters in increasing order of quality ([ASCII](https://en.wikipedia.org/wiki/ASCII)):

```
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
```

These codes are a little hard to read for humans. The program `FastQC` (<https://www.bioinformatics.babraham.ac.uk/projects/fastqc/> ) from the Babraham Institute of Bioinformatics will analyze the overall quality of `fastq` datasets and output a report with figures that can be opened with any web browser. `FastQC` is a Java-based program, which means it is platform independent and could also run on the local Windows computer. It is also easy to install on Linux; we will do this installation and then each of you will run the program on the downloaded files.

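Before turning to FastQC, it helps to decode one quality character by hand: the Phred score is the ASCII code of the character minus 33 (assuming the now-standard Phred+33 encoding). A quick bash sketch:

``` bash
# decode a single quality character to its Phred score
q="E"
ascii=$(printf '%d' "'$q")  # ASCII code of the character, E -> 69
echo $((ascii - 33))        # Phred score: 69 - 33 = 36
```
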
### Install the FastQC program on the server

Download command:

```{bash}

wget --no-check-certificate https://www.bioinformatics.babraham.ac.uk/projects/fastqc/fastqc_v0.11.8.zip

```

Since this isn't a tarball but a zipped directory, we unpack by typing:

```{bash}

unzip fastqc_v0.11.8.zip

```

This creates a folder FastQC; the program is called with the command `fastqc`. For some idiotic reason, the file is not directly executable, but since you installed it locally into a folder that you control, you can change this:

```{bash}

#first cd into the FastQC folder
cd FastQC
#now check the rights of the files
ls -l
#the line for fastqc indicates -rw-r--r--
chmod 755 fastqc #755 grants read/write/execute to you, read/execute to everyone else
ls -l
#now -rwxr-xr-x; x means executable

```

Now, add the path to the `fastqc` executable to the `.bash_profile`. Do not forget to source the file after the change (`source ~/.bash_profile`).

### Launch FastQC on the downloaded files

The command to launch the fastqc analysis is very simple: `fastqc path_to_file/filename`. Since each of you has downloaded more than one file, we will write a little script to automate the analysis.

Change the text below so that everything corresponds to your set-up:

```{bash}
path_to_reads=$HOME/fastq
path_to_QC=$HOME/QC

mkdir -p $path_to_QC

for i in SRR8955916_1 SRR8955916_2
do
fastqc $path_to_reads/$i.fastq.gz -o $path_to_QC
done

```

or alternatively:

```{bash}
path_to_reads=$HOME/fastq
path_to_QC=$HOME/QC

mkdir -p $path_to_QC

cd $path_to_reads
for i in *fastq.gz
do
fastqc $path_to_reads/$i -o $path_to_QC
done

```

The second option is faster to write when there are many files, but it will keep the extension fastq.gz in the output names.

#### Assignment:

Open the FastQC reports in the browser on your local computer and look at the results. Are all experiments of high quality?

# Day 2

## Introduction: a typical ChIP-seq analysis pipeline

- A good ChIP-seq experiment setup should have reads of at least two biological replicates.

- Ideally there should be a control sample; this could be a library prepared from a "mock" purification or an aliquot of the input material before precipitation (or both of these controls). A mock purification could be a ChIP with preimmune/no antibodies, or a purification with the antibodies but from a control plant that does not produce the protein (mutant or absence of the transgenic epitope).

- In the Turck group, we prefer to use a control plant that doesn't express the transgene, at least for transcription factors, as the antibodies are likely to purify some background signal. In the case of histone modification ChIP, some researchers like to use a control prepared from antibodies that precipitate all nucleosomes irrespective of the modification. We usually use the input as control for histone modification ChIP; the input also shows nucleosome density dynamics to a certain degree.

- A perfect ChIPseq experiment would contain a spike-in consisting of an equal amount of cells from a different organism that is added to all samples before ChIP. This is hard to do for plant extracts, as it is difficult to determine the exact number of nuclei in a purification.

Outline of a ChIP-seq pipeline

### Summary of the steps in a ChIP-seq pipeline

1. A first step would be to analyze the overall quality of the fastq raw data ("QC step") based on the quality scores included in the fastq file. If the quality of the reads is low
(e.g. too many shorter reads than expected, too many ambiguous nucleotides, a lot of repeat and adaptor sequence), the raw data should be filtered for minimal quality criteria before the alignment.

2. The second step is to align the read data to a **reference genome** using a short read alignment program such as "bwa-mem", "bowtie2" or "star". These programs generate alignment files in various formats; the most common ones are called "bam", "aln" or "sam" files and carry these names as file extensions. There are some alignment options that can be adjusted to the sample. For example, if the reference does not really fit the sequenced reads or may contain many incorrectly mapped transposons, it could be good to dig deeper into the aligner settings.

3. The third step is to run another QC step on the aligned files. For example, although we expect reads to be enriched within the regions that were purified by the antibodies during ChIP, their start positions should not all be the same. A large number of reads with exactly the same start position is usually an indication that too many PCR cycles were run during the library preparation or that there simply was too little starting material to do a decent amplification. In this case, there is the risk that the peaks represent PCR artefacts rather than actual chromatin enrichment. One option to possibly improve the analysis of over-amplified data could be to filter out all identical reads but one.

4. Step 4, visualizing the aligned data, is part of the quality control but useful beyond that. Bam formatted files can be directly visualized with the Integrative Genomics Viewer (IGV). The program suite deepTools, for example, includes many features for quality testing.

5. Step 5 is to identify enriched regions and filter them against the control samples. Possibly the most commonly used program to identify peaks is "MACS", but "SICER/EPIC" is an interesting option, in particular for chromatin data. These programs use different statistical models to identify peaks and to filter them against controls.

6. Step 6 should be another quality control step, this time to compare the results from two samples. The simple way is to overlap the peaks identified in two replicates and estimate how well they agree. A majority should be shared, but one should be careful, as artefacts are also often shared. A better way is to use a statistical approach that does not only estimate the overlap but also its significance. One approach is the Irreproducible Discovery Rate (IDR) framework [(2012) ENCODE: TF ChIP-seq peak calling using the Irreproducibility Discovery Rate (IDR) framework - Anshul Kundaje (google.com)](https://sites.google.com/site/anshulkundaje/projects/idr). A very good test is also to use pseudo-replication. For this, mapped reads from two biological replicates are pooled and then randomly split again. The entire pipeline is repeated and compared to the results from true replicates; the numbers should be fairly similar. Last, data from each replicate can be split in half and the analysis is performed as before, pretending that there are 4 replicates. In this case the numbers identified within and between the actual replicates should also be fairly similar.

7. Step 7 could be to compare two biological samples for differential binding.
MACS and SICER could be used for this, but there are also more specialized programs such as "diffBind", which have a more sophisticated way of dealing with replicates and normalizing different libraries.

8. Step 8 is to annotate the peaks for genomic features, e.g. are they located within genes, close to genes, or in intergenic regions? The Homer suite of programs is probably the best tool to perform this part, but there are also tools that do this as part of R pipelines.

Once peaks, or differential peaks, have been identified and annotated to genes, there are almost **endless possibilities** of performing meta-analyses, such as overlapping peaks between samples, looking for enriched Gene Ontology (GO) terms, probing for enriched motifs that could correspond to transcription factor binding sites, comparing peaks between different transcription factor or chromatin modification data sets, etc.

## Organize the read data and perform a QC analysis

We did that yesterday.

## Download the TAIR10 reference sequence of *Arabidopsis thaliana*

We want to align sequencing data from *Arabidopsis thaliana*; thus we need the genome files.

For *A. thaliana*, TAIR is the first source of information (<https://www.arabidopsis.org/> ), although most of the info can also be found at NCBI or EBI.

Within TAIR there is an ftp archive link to sequence and annotation; let's navigate there and check out the provided information.

It is recommended to create a folder "reference" and navigate to that folder before downloading.

This command downloads all fasta files that start with the letters TAIR10_chr; the "\*" wildcard matches anything that follows.

```{bash}

path_to_reference=$HOME/reference

mkdir -p $path_to_reference

wget -P $path_to_reference -b ftp://ftp.arabidopsis.org/home/tair/Sequences/whole_chromosomes/TAIR10_chr*

```

We used `wget` before, but without the `-b` option. With this option, the download moves to the background and the shell accepts the next command right away. `-P` specifies the directory to which the files are downloaded.

### Create a file containing the fasta sequences of all nuclear chromosomes

```{bash}

cat TAIR10_chr1.fas TAIR10_chr2.fas TAIR10_chr3.fas TAIR10_chr4.fas TAIR10_chr5.fas >TAIR10_allnu.fas

```

### Check the file and look at the fasta headers

```{bash}

wc -l TAIR10_allnu.fas

grep \> TAIR10_allnu.fas

```

The command `wc` counts things in files; it can indicate the number of bytes, words, lines, characters etc. With the option `-l` we should have counted 10 lines in theory. What went wrong?

The command `grep` grabs all lines of a text file that contain a "`>`" symbol. Since the symbol has a special meaning in the shell (see above, it directs an output into a file), a "`\`" symbol must be placed right before it to escape from the special meaning.

Thus, we have the right number of sequences, but the sequences have been formatted with line breaks. Luckily, the formatting of the fasta file does not matter for the next steps, since the aligner `bowtie2` that we will use knows to ignore line breaks within the sequences.

## Aligning fastq data to a reference

We need a short read aligner to align the downloaded sequence files to the reference genome. In addition, we need to prepare the reference genome by creating binary "index" files that the program can read and search much faster than the fasta text file.

On CHEOPS, the short read aligner bowtie2 is installed as a loadable module, which means that we do not have to install a local version in our home folder but can use this module.

***Note**: For more information on bowtie, follow the link: (<https://en.wikipedia.org/wiki/Bowtie_%28sequence_analysis%29> ).*

***Note:** There are CHEOPS commands that tell you which modules are available, for example* `module avail`*. If you want to know more, download the CHEOPS brief instructions (<https://rrzk.uni-koeln.de/sites/rrzk/HPC_Projekte/CHEOPS_Brief_Instructions.pdf> ).*

### Load the module Bowtie2

```{bash}

bowtie2 #at this point: command not found

module add /opt/rrzk/modules/software/bowtie2/2.4.1

bowtie2 #now the program answers

```

What happened?

There should be a long list with command options for bowtie2. In principle, it is much nicer to read the manual online, for example here <https://wikis.utexas.edu/display/bioiteam/Mapping+with+bowtie2+Tutorial> or here <http://bowtie-bio.sourceforge.net/bowtie2/index.shtml>. If you have a look at the second page in particular, you realize that the current version of Bowtie2 is 2.5.1 from January 17, 2023, while the installed version is 2.4.1, which probably contains fewer features and more bugs. For the moment that is o.k., but it illustrates how installing the program with full control in your home directory could also be a good idea.

### Index the Arabidopsis genome files

Below is the command that tells bowtie2 to create an index from the fasta reference file. Either run this command from the folder with the reference fasta file or add the path to the command below before running.

```{bash}

bowtie2-build TAIR10_allnu.fas TAIR10_allnu

```

### Align the reads to the indexed reference genome

The command to launch `bowtie2` is relatively simple, but the default alignment is not sorted by position within the genome, and it is a plain text file in `.sam` format that takes a terrible amount of space. A tool called samtools can sort the alignments by position and produce binary files. We will combine the commands in a submission script using pipes to automate everything.

There is also the complication that some datasets consist of paired-end reads while others are single-end reads. The command to submit paired-end and single-end reads is not quite the same.

The following describes the steps. Each output is saved as a file. This is very clunky, and many of these files are not needed in the end.

```
#this aligns the reads
bowtie2 -x TAIR10_allnu -U name.fastq > output.sam

#this makes binary files
samtools view -b -o output.bam output.sam

#this sorts the binary files
samtools sort output.bam >output.sorted.bam

#this creates an index for the binary files
samtools index output.sorted.bam
```

The `pipe` sign `|` can directly feed the output of one command into the next command without the need to save the file in between. The commands above would end up like this:

```
bowtie2 -x TAIR10_allnu -U name.fastq | \
samtools view -b - | \
samtools sort - -o name.bam

samtools index name.bam
```

Of course this script would never run as written, since none of the paths is properly indicated.

Now let's write a correct submission script that corresponds to the actual situation.

```
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --account=kurse

module load bowtie2/2.4.1
module load samtools/1.12

#modules have to be loaded again since the batch script does not run in your shell but is delegated to another shell by the scheduler "SLURM"

workdir=/scratch/ngscourse/${USER}/${SLURM_JOB_ID}
mkdir -p $workdir

cd $workdir #change into the work directory

for i in BASENAME1 BASENAME2 #loop
do
bowtie2 -x /home/group.kurse/$USER/reference/TAIR10_allnu -U /home/group.kurse/$USER/fastq/${i}.fastq.gz | \
samtools view -b - | \
samtools sort - -o /scratch/ngscourse/${i}.bam
samtools index /scratch/ngscourse/${i}.bam
done

cd - #changes back to the folder from which the script was started
```

Usually, text behind a `#` is ignored, but the lines above are an exception. The first line tells the computer that this is a Bourne again shell (bash) script. The next lines set the compute power we request. We have to calculate in the account kurse, which is pretty limited. If at a later stage of your studies you need to work seriously with the cluster, ask for access to other accounts.

It is a good idea to have all the temporary files go to a work directory on scratch, since this unclogs the home folder. The \$ sign indicates that this is a variable; the text in braces indicates what to put into the variable. USER has a fixed meaning: it is your login identity. SLURM_JOB_ID also has a fixed meaning: it is the identity of the job.

For paired-end reads the script is slightly different:

```
#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --mem=1gb
#SBATCH --account=kurse

module load bowtie2/2.4.1
module load samtools/1.12

#modules have to be loaded again since the batch script does not run in your shell but is delegated to another shell by the scheduler "SLURM"

workdir=/scratch/ngscourse/${USER}/${SLURM_JOB_ID}
mkdir -p $workdir

cd $workdir #change into the work directory

for i in BASENAME1 BASENAME2 #loop
do
bowtie2 -x /home/group.kurse/$USER/reference/TAIR10_allnu -1 /home/group.kurse/$USER/fastq/${i}_1.fastq.gz -2 /home/group.kurse/$USER/fastq/${i}_2.fastq.gz | \
samtools view -b - | \
samtools sort - -o /scratch/ngscourse/${i}.bam
samtools index /scratch/ngscourse/${i}.bam
done

cd - #changes back to the folder from which the script was started
```

### Save the script as a text file

Large jobs, such as alignments, should not be run directly on a cluster but are submitted through a queuing system that allocates time and space to each job. In the top lines of the script above, we reserve the number of nodes and tasks, request the calculation memory and indicate the account. We are calculating under the course account, and thus have very limited rights to request compute power.

To be able to submit the script to SLURM, it needs to be saved as a script file. There are several options to do this.

1. Use `touch` to make a new file named `alignment.sh`, then open the file with `nano` or `pico`, then copy and paste the script text above and save the file
2. With vs-code, you can open a new file, copy-paste the script and save the file with the extension `.sh` or `.bash`
3. Use a heredoc redirection

```{bash}

cat << 'EOF' > alignment.sh
copy text here
can be many lines
EOF
#the quotes around the first EOF prevent the shell from expanding $variables while writing the file

```

### Submit the alignment script to SLURM

```{bash}

sbatch alignment.sh
#sbatch returns immediately; check the queue with: squeue -u $USER

```

### Understanding the SAM/BAM format

NGS alignments are most commonly saved as SAM formatted files, or the binary form BAM.

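Although BAM files are binary, `samtools view` can always display them as text; a quick sketch, assuming the samtools module is loaded and `name.bam` stands for one of your sorted alignments:

``` bash
# print only the header of a sorted alignment file
samtools view -H name.bam
# print the first two alignment records as plain text
samtools view name.bam | head -n 2
```
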
The file starts with a header:

After the header, each alignment takes up one row with the following fields (on the screen a row usually wraps over several lines):

> 1. Read name
>
> 2. FLAG: a bitwise set of information describing the alignment. It provides the following information:
>
> ```
> 1. are there multiple fragments?
>
> 2. are all fragments properly aligned?
>
> 3. is this fragment unmapped?
>
> 4. is the next fragment unmapped?
>
> 5. is this query the reverse strand?
>
> 6. is the next fragment the reverse strand?
>
> 7. is this the 1st fragment?
>
> 8. is this the last fragment?
>
> 9. is this a secondary alignment?
>
> 10. did this read fail quality controls?
>
> 11. is this read a PCR or optical duplicate?
> ```
>
> 3. RNAME: Reference sequence NAME of the alignment.
>
> 4. POS: 1-based leftmost mapping POSition of the first counted mapping operation for a read. The reference sequence starts with "1". If a "0" is indicated, the read has normally not been mapped.
>
> 5. MAPQ: MAPping Quality as a numeric value whose scale depends on the alignment program that was used. The value 255 means that mapping quality is not available.
>
> 6. CIGAR: CIGAR string. Again, complicated to decode for a human brain, but it indicates how much of the query sequence actually matches and whether there are mismatches.
>
> 7. RNEXT: Reference sequence name of the primary alignment of the NEXT read in the template.
>
> 8. PNEXT: Position of the primary alignment of the NEXT read in the template.
>
> 9. TLEN: signed observed Template LENgth.
>
> 10. SEQ: Sequence
>
> 11. QUAL: Quality string of the original fastq sequence
>
> 12. Optional information on the alignment

The bitwise information results in a flag number that is rather hard to decode if you are not a computer, but there is a nice tool that easily decodes it for a human brain:

<https://broadinstitute.github.io/picard/explain-flags.html>

Here is a quick overview of the most common flags:

### Filtering and counting alignments based on SAM flags

The program samtools allows "filtering" and "counting" of alignments based on flags and quality scores.

Usage: `samtools view [options] <in.bam>|<in.sam>|<in.cram> [region ...]`

```
Output options:
  -b, --bam                  Output BAM
  -C, --cram                 Output CRAM (requires -T)
  -h, --with-header          Include header in SAM output
  -H, --header-only          Print SAM header only (no alignments)
      --no-header            Print SAM alignment records only [default]
  -c, --count                Print only the count of matching records
  -o, --output FILE          Write output to FILE [standard output]
  -U, --unoutput FILE, --output-unselected FILE
                             Output reads not selected by filters to FILE
Filtering options (Only include in output reads that...):
  -L, --target[s]-file FILE  ...overlap (BED) regions in FILE
  -q, --min-MQ INT           ...have mapping quality >= INT
  -l, --library STR          ...are in library STR
  -m, --min-qlen INT         ...cover >= INT query bases (as measured via CIGAR)
  -e, --expr STR             ...match the filter expression STR
  -f, --require-flags FLAG   ...have all of the FLAGs present
  -F, --excl[ude]-flags FLAG ...have none of the FLAGs present
  -G FLAG                    EXCLUDE reads with all of the FLAGs present
      --subsample FLOAT      Keep only FLOAT fraction of templates/read pairs
      --subsample-seed INT   Influence WHICH reads are kept in subsampling [0]
  -s INT.FRAC                Same as --subsample 0.FRAC --subsample-seed INT
```

### Assignment

Using your alignment data or an alignment file that will be provided for download, answer the following questions:

How many reads were mapped, how many were unmapped?

How many alignments were of a quality higher than 30 or 50?

How many reads were mapped to the forward, how many to the reverse strand of the genome?

How many paired reads were mapped as pairs?

How many reads were likely PCR duplicates?

Document your commands in the script box below and save the results in a file.

```{bash}



```

# Day 3

Since working directly with the terminal may have too steep a learning curve if only a few analyses have to be carried out, a group of bioinformaticians has developed a web server called GALAXY (<https://usegalaxy.eu/>), which can serve as a frontend for performing NGS analyses. GALAXY is directly connected to the DataPlant GitLab, which contains the script files (<https://git.nfdi4plants.org/plantchromatin/ngsappetizercourse/>). The alignments of all files have been added to the repository under data/BAM-files.

## Upload the alignment files to a personal space in Galaxy

In GALAXY, use the *Upload Data* function on the left panel to navigate to *Choose Remote Files* -\> *DataPlant DataHub* -\> *PlantChromatin: NGS appetizer*. Select a subset of files, for example all H3K27me3 data and all inputs, or all H3K36me3 data and all inputs, or all H3K4me3 data and all inputs.

| Sample code | SAMPLE alias |
|--------------|--------------------|
| SRR8955896 | NV_input |
| SRR8955897 | 40V_input |
| SRR8955898 | T10_input |
| SRR8955899 | NV_H3K27me3_rep1 |
| SRR8955908 | NV_H3K27me3_rep2 |
| SRR8955900 | 40V_H3K27me3_rep1 |
| SRR8955909 | 40V_H3K27me3_rep2 |
| SRR8955901 | T10_H3K27me3_rep1 |
| SRR8955910 | T10_H3K27me3_rep2 |
| SRR8955902 | NV_H3K4me3_rep1 |
| SRR8955911 | NV_H3K4me3_rep2 |
| SRR8955903 | 40V_H3K4me3_rep1 |
| SRR8955912 | 40V_H3K4me3_rep2 |
| SRR8955904 | T10_H3K4me3_rep1 |
| SRR8955913 | T10_H3K4me3_rep2 |
| SRR8955905 | NV_H3K36me3_rep1 |
| SRR8955914 | NV_H3K36me3_rep2 |
| SRR8955906 | 40V_H3K36me3_rep1 |
| SRR8955915 | 40V_H3K36me3_rep2 |
| SRR8955907 | T10_H3K36me3_rep1 |
| SRR8955916 | T10_H3K36me3_rep2 |

## Visualizing the alignments in IGV

Aligned and sorted BAM files can be visualized in a genome browser, e.g. gbrowse, jbrowse or the IGV browser. The IGV web browser is o.k. for a quick visualization.

IGV (<https://software.broadinstitute.org/software/igv/> ) can either be downloaded and installed on a computer (it runs on a Java platform, thus any computer will do) or the web service can be used:

<https://igv.org/>

Download it to your computer and start the application. We will need to activate the Arabidopsis genome TAIR10 (by default another genome is loaded).

Once the application is launched with the right genome, navigate to a random smaller region (around 5 kb). IGV cannot visualize BAM files over larger regions; too much data.

To display the files uploaded to GALAXY, navigate to a single file in the right panel and click on the analysis button. Follow the menu *Display in IGV (local)*.

### Assignment

- How do the input samples look?

- Do they reflect nucleosome profiles?

- Are there big spikes in the data?

- Are there large differences between the input controls?

Cross-linked chromatin that shows a slight pattern resembling nucleosome distribution is well suited as control. There should be no major differences between the input samples. If that is the case, it is possible to use one input sample as control for all data or to pool the reads of all controls to make a common control.
If the input samples show spikes, it is better to do the next analysis without using the input as control.

## Convert an alignment to a normalized coverage track using DeepTools BamCoverage

To visualize a larger part of the alignments, it is better to calculate a coverage track, in which the number of reads overlapping a specific position in the genome is summed up. It is often better to bin the positions and to calculate coverage for every 5, 10 or 50 positions. The coverage can be normalized in different ways; the most straightforward solution is to calculate counts per million mapped reads (CPM).

*Deeptools* coverage (<https://deeptools.readthedocs.io/en/develop/content/tools/bamCoverage.html>) can calculate coverage with many options. Deeptools contains a suite of many functions and is integrated in Galaxy.

### Assignment

- Navigate the left panel until the header *GENOMICS TOOLKITS* -\> *DeepTools* -\> *bamCoverage*.

- After the files are generated, add them as tracks to IGV.

- Download a few figures and add them to your Markdown document.

## Identify enriched regions

Under the menu *Peak calling*, GALAXY offers two tools that are useful for identifying enriched target regions: MACS (Model-based Analysis of ChIP-Seq) and SICER. MACS is designed to consider the peak shape; it expects narrow peaks with a shape resembling a normal distribution. Chromatin modification peaks do not comply with this assumption. Thus, it is necessary to feed the output of MACS into another tool, *MACS2 bdgbroadcall*, to merge the peaks into broader peaks. SICER is less used but works well with plant chromatin data. It directly calls larger regions.

### Assignment

- Select one tool and launch the analysis to identify peaks (we should distribute who uses which tool)
- For SICER, the alignments in .bam format have to be converted to .bed format first. This can be done using *Operate on and transform BAM* under the menu *SAM/BAM*
- Generate a table with peaks per condition and replicate (we will try to generate a common table)
- Visualize enriched regions in IGV

## Intersect enriched regions between replicates

It makes sense to combine the data from biological replicates into a high confidence list of targets. For operations with genomic regions, the [*bedtools* suite](https://bedtools.readthedocs.io/en/latest/index.html) is most commonly used. To combine two different sets of genomic regions, *bedtools intersect* is required.

The command-line call would look like this:

```
bedtools intersect [OPTIONS] -a <FILE> \
                             -b <FILE1, FILE2, ..., FILEN>
```

In Galaxy, bedtools *Intersect intervals* is found under the menu *BED*.

Bedtools operates on files in .bed format (**Browser Extensible Data**). There are several .bed formats: the minimal bed file has three columns, but bed6 and bed12 formatted files are also defined:

### Assignment

- Overlap replicates and keep only regions present in both replicates

- Concatenate the intersected peaks from the three different conditions

- Use *bedtools merge* to generate an "all_peaks" file

- Convert the output to a .bed formatted file

## Correlate coverage across enriched regions

The last analysis in this course will be to compare the coverage at the enriched regions between the samples. This requires a merged, sorted bed file of all regions that are enriched for a certain histone modification, and the .bam files corresponding to the samples in question.

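On the command line, such a merged file could be produced roughly like this with bedtools (a sketch; the peak file names are placeholders for the intersected peak files from the previous assignment):

``` bash
# concatenate the per-condition peak files, sort by chromosome and start, merge overlaps
cat peaks_NV.bed peaks_40V.bed peaks_T10.bed | \
sort -k1,1 -k2,2n | \
bedtools merge -i stdin > all_peaks.bed
```
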
The analysis can be performed with several tools; here we will use Deeptools.

First, we have to generate a matrix file that contains the coverage of all .bam files at the regions of interest. Second, we can generate a correlation heatmap between the samples.

- Use ComputeMatrix to generate a matrix file for replicates and experimental samples

- Use PlotCorrelation to plot a heatmap and pairwise correlation plots

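For orientation, the command-line equivalent of these two steps would look roughly like this (a sketch with placeholder file names; note that on the command line, the matrix that *plotCorrelation* reads is produced by *multiBamSummary*):

``` bash
# step 1: count reads from every sample in each merged peak region
multiBamSummary BED-file --BED all_peaks.bed \
    --bamfiles NV_rep1.bam NV_rep2.bam 40V_rep1.bam 40V_rep2.bam \
    --outFileName counts.npz

# step 2: plot a clustered heatmap of pairwise Spearman correlations
plotCorrelation --corData counts.npz --corMethod spearman \
    --whatToPlot heatmap --plotNumbers --plotFile correlation_heatmap.png
```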