## Sources
Detailed data preprocessing documentation: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/data_preprocessing.md
Detailed documentation of the h5 file creation and architecture: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/h5_files.md
# Data preprocessing documentation
This documentation covers the preprocessing of ATAC- and ChIP-seq data. The pipeline
is used to create bam files from public next-generation sequencing (NGS) data
(fastq files) downloaded from NCBI's (National Center for Biotechnology Information)
SRA (Sequence Read Archive).
## 1. Required tools
The tool version numbers are the ones used during this project.
- SRA-Toolkit 3.0.0 (SRA Toolkit - GitHub)
- Java 1.8.0 (Arnold et al., 2005)
- Trimmomatic 0.36 (Bolger et al., 2014)
- FastQC 0.11.9 (Babraham Bioinformatics - FastQC A Quality Control Tool
for High Throughput Sequence Data)
- MultiQC 1.14 (Ewels et al., 2016)
- BWA-MEM2 2.1 (Md et al., 2019)
- SamTools 1.6 (Danecek et al., 2021)
- Picard (Picard Tools - By Broad Institute, n.d.)
- deepTools 3.5.1 (Ramírez et al., 2016)
- Helixer 0.3.1 (Stiehler et al., 2021; Weberlab: Institute of Plant Biochemistry,
HHU: GitHub, n.d.)
- Helixer Post (https://github.com/TonyBolger/HelixerPost)
- gffread (Pertea & Pertea, 2020)
- GeenuFF 0.3.0 (Weberlab: Institute of Plant Biochemistry, HHU: GitHub, n.d.)
## 2. Pipeline
| Category | Download data | Trimming | Quality control | Mapping | Quality control | Deduplication/Data cleaning | Quality control |
|:--------:|:-------------:|:-------------------:|:---------------:|:---------------:|:---------------:|:---------------------------:|:-------------------------------:|
| Tools | SRA Toolkit | Trimmomatic<br>Java | FastQC | BWA<br>SamTools | SamTools | Picard<br>Java<br>SamTools | deepTools<br>Helixer<br>gffread |
### 2.1. Download data
**Tools:** SRA toolkit
**Setup:** The toolkit needs to be on your ``PATH``, e.g. by adding
``export PATH=$PATH:<path_to>/sratoolkit.3.0.0-ubuntu64/bin`` to your ``.bashrc``.
When first configuring the SRA toolkit, be sure to set the local
file caching to a folder of sufficient size, so **not** your home directory
(see https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration).
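A minimal setup sketch (the install path is an example; adjust it to your system):
```bash
# make the toolkit binaries available in every shell
echo 'export PATH=$PATH:$HOME/tools/sratoolkit.3.0.0-ubuntu64/bin' >> ~/.bashrc
source ~/.bashrc
# interactive configuration; set the cache location (CACHE tab) to a large volume
vdb-config --interactive
```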
There are two options for downloading the runs from SRA: fastq-dump and
fasterq-dump. The program calls are similar; fastq-dump downloads the files
in gzip format, while fasterq-dump downloads faster but doesn't zip the
fastq files.
#### 2.1.1. Fastq-dump
```bash
fastq-dump <SRR_accession> --gzip --split-3
```
#### 2.1.2. Fasterq-dump
```bash
fasterq-dump <SRR_accession> --split-3
gzip <SRR_accession>_1.fastq <SRR_accession>_2.fastq # for paired end data
```
#### 2.1.3. Special case: single cell data
```bash
prefetch <SRR_accession>
fasterq-dump <SRR_accession> -S --include-technical
gzip <SRR_accession>_3.fastq <SRR_accession>_4.fastq
```
Unzipped fastq files also work, as Trimmomatic accepts them as well.
It is, however, necessary to rename the forward reads from <SRR_accession>_3.fastq to
<SRR_accession>_1.fastq and the reverse reads from <SRR_accession>_4.fastq to
<SRR_accession>_2.fastq, for example as shown below.
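For example (file names follow the gzip step above):
```bash
# rename the biological reads to match the standard paired-end naming scheme
mv <SRR_accession>_3.fastq.gz <SRR_accession>_1.fastq.gz
mv <SRR_accession>_4.fastq.gz <SRR_accession>_2.fastq.gz
```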
### 2.2. Trimming
**Tools:** Java 1.8.0, Trimmomatic 0.36
**Hint:** ATAC-seq data usually uses Nextera adapters, so use NexteraPE-PE.fa for
paired ATAC-seq and TruSeq3-PE-2.fa for paired ChIP-seq data.
```bash
# paired end data
java -jar -Xmx2048m <path_to_trimmomatic>/trimmomatic-0.36.jar PE -threads 2 \
-basein <sample_ID>_1.fastq.gz -baseout <out_path>/<sample_ID>.fastq.gz \
ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-PE-2.fa:3:30:10:1:true \
MAXINFO:36:0.7
# single end data
java -jar -Xmx2048m <path_to_trimmomatic>/trimmomatic-0.36.jar SE -threads 2 \
<sample_ID>.fastq.gz <out_path>/<sample_ID>.fastq.gz \
ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-SE.fa:2:30:10 MAXINFO:36:0.7
```
### 2.3. Quality control
**Tools:** FastQC
```bash
# paired end data
fastqc <sample_ID>_1P.fastq.gz
fastqc <sample_ID>_2P.fastq.gz
# single end data
fastqc <sample_ID>.fastq.gz
```
**Hint:** [MultiQC](https://multiqc.info) was used to generate summary html reports.
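A minimal aggregation call, assuming the FastQC reports live below the current
directory:
```bash
# scan the working directory for FastQC (and other tool) reports
# and write a combined multiqc_report.html
multiqc .
```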
### 2.4. Mapping
**Tools:** BWA, SamTools
#### 2.4.1. Genome indexing
```bash
bwa-mem2 index -p <path_to_genome>/<species> <path_to_genome>/<species>.fa
```
#### 2.4.2. Mapping
```bash
# paired end data
bwa-mem2 mem <path_to_genome>/<species> <(zcat <sample_ID>_1P.fastq.gz) \
<(zcat <sample_ID>_2P.fastq.gz) > <sample_ID>.sam
# single end data
bwa-mem2 mem <path_to_genome>/<species> <(zcat <sample_ID>.fastq.gz) > \
<sample_ID>.sam
# sort and index
samtools view <sample_ID>.sam -b | samtools sort -T tmp<sample_ID> -@1 \
-o <sample_ID>.bam
samtools index <sample_ID>.bam
```
### 2.5. Quality control
**Tools:** SamTools
Check the percentage of reads mapped, mates paired, etc.
```bash
samtools flagstat <sample_ID>.bam > <sample_ID>.txt
```
### 2.6. Deduplication/Data cleaning
**Tools:** Picard, Java, SamTools
Common errors are:
- the picard output just ends without signaling that picard finished and
there is no output bam file
- java.lang.OutOfMemoryError: Java heap space
- java.lang.OutOfMemoryError: GC overhead limit exceeded
The **solution** is to extend the Java heap space by replacing
``-Xmx1G`` with ``-Xmx2G`` or even ``-Xmx3G``.
```bash
# picard
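# if the heap space errors described above occur, raise -Xmx1G to -Xmx2G or -Xmx3G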
java -Xmx1G -jar <path_to_picard>/picard.jar MarkDuplicates I=<sample_ID>.bam \
O=<sample_ID>_marked.bam M=<sample_ID>.stats
# filter with samtools
samtools view -F 1796 <sample_ID>_marked.bam -b > <sample_ID>.bam
# -F 1796 filters out duplicates, unmapped reads, non-primary alignments
# and reads not passing platform quality checks
samtools index <sample_ID>.bam
```
### 2.7. Quality control
**Tools:** deepTools, Helixer, gffread
Only the predictions/annotation from Helixer were used for consistency, but if the
genome already has an annotation you can also use that and skip the fourth step.
You can also blacklist regions (bed file) in the fifth step if the previous plot
was too noisy. An easy way is to blacklist the chloroplast, mitochondrion and unplaced
scaffolds, as they have been observed to frequently cause noise in ATAC-seq data
(see the sketch below).
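A minimal blacklist sketch (the seqids and lengths are placeholders; take the real
names from your assembly's fasta headers):
```bash
# bed format: seqid <tab> start <tab> end (0-based, end exclusive), one region per line
printf '%s\t%s\t%s\n' \
    <chloroplast_seqid> 0 <chloroplast_length> \
    <mitochondrion_seqid> 0 <mitochondrion_length> > <species>_blacklist.bed
# pass it to computeMatrix (step 2.7.5) via --blackListFileName <species>_blacklist.bed
```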
#### 2.7.1. Plot coverage
```bash
files=`ls <folder_of_deduplicated_files>/*.bam`
<path_to_deeptools>/deepTools/bin/plotCoverage -b $files \
--plotFile <species>Coverage.png --smartLabels --outRawCounts <folder_of_deduplicated_files>/raw.txt
```
#### 2.7.2. Plot fingerprint
```bash
files=`ls <folder_of_deduplicated_files>/*.bam`
<path_to_deeptools>/deepTools/bin/plotFingerprint -b $files \
--plotFile <species>Fingerprint.png --smartLabels --binSize 500 \
--numberOfSamples 500_000
# if the genome is 250 Mbp or smaller, please adjust the defaults of
# binSize (500) and numberOfSamples (500_000) of plotFingerprint, so that
# binSize x numberOfSamples < genome size
```
#### 2.7.3. Bam to bigwig
```bash
files=`ls <folder_of_deduplicated_files>/*.bam`
for file in $files; do
<path_to_deeptools>/deepTools/bin/bamCoverage -b $file -o $(basename $file .bam).bw -of bigwig
done
```
#### 2.7.4. Helixer annotation
```bash
<path_to_helixer>/Helixer/Helixer.py --lineage land_plant --fasta-path <species>.fa \
--species <full_species_name> --gff-output-path <species>_land_plant_v0.3_a_0080_helixer.gff3 \
--model-filepath <path_to_best_model>/land_plant_v0.3_a_0080.h5 --batch-size 150 \
--subsequence-length 21384 --temporary-dir <temp_dir>
# adjust batch size/subsequence length to what your GPU can handle and what you need
# full species name example: Arabidopsis_thaliana
```
>**Warning:** To create the annotation, you will need an Nvidia GPU, their CUDA
> toolkit and cuDNN. (see https://github.com/weberlab-hhu/Helixer)
#### 2.7.5. Compute matrix
```bash
# convert predictions or reference annotation to bed
gffread helixer_pred/<species>_land_plant_v0.3_a_0080_helixer.gff3 \
--bed -o <species>_helixer.bed
# compute matrix
files=`ls <path_to_bigwig_files>*.bw`
<path_to_deeptools>/deepTools/bin/computeMatrix reference-point \
-S $files -R <species>_helixer.bed --smartLabels \
-o <species>_matrix.mat.gz --upstream 1500 --downstream 1500
```
#### 2.7.6. Plot heatmap
```bash
<path_to_deeptools>/deepTools/bin/plotHeatmap -m <species>_matrix.mat.gz \
-o <species>_tss_heatmap.png
```
The heatmap should show a peak in front of/around the transcription start site (TSS)
for ATAC-seq and behind the TSS for H3K4me3 ChIP-seq.
**ATAC-seq example of _Brachypodium distachyon_:**
![](../images/Brachypodium_distachyon_ATACseq.png)
**ChIP-seq example of _Brachypodium distachyon_:**
![](../images/Brachypodium_distachyon_ChIPseq.png)
### References
- Arnold, K., Gosling, J., & Holmes, D. (2005). The Java programming language.
Addison Wesley Professional.
- Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput
Sequence Data. (n.d.). Retrieved May 23, 2022, from
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer
for Illumina sequence data. Bioinformatics, 30(15), 2114.
https://doi.org/10.1093/BIOINFORMATICS/BTU170
- Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O.,
Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve
years of SAMtools and BCFtools. GigaScience, 10(2).
https://doi.org/10.1093/GIGASCIENCE/GIAB008
- Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize
analysis results for multiple tools and samples in a single report. Bioinformatics,
32(19), 3047–3048. https://doi.org/10.1093/BIOINFORMATICS/BTW354
- Md, V., Misra, S., Li, H., & Aluru, S. (2019). Efficient architecture-aware
acceleration of BWA-MEM for multicore systems. Proceedings - 2019 IEEE 33rd
International Parallel and Distributed Processing Symposium, IPDPS 2019, 314–324.
https://doi.org/10.1109/IPDPS.2019.00041
- Pertea, G., & Pertea, M. (2020). GFF Utilities: GffRead and GffCompare.
F1000Research, 9, 304. https://doi.org/10.12688/F1000RESEARCH.23297.2
- Picard Tools - By Broad Institute. (n.d.). Retrieved May 23, 2022,
from https://broadinstitute.github.io/picard/
- Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter,
A. S., Heyne, S., Dündar, F., & Manke, T. (2016). deepTools2: a next generation
web server for deep-sequencing data analysis. Nucleic Acids Research, 44(W1),
W160–W165. https://doi.org/10.1093/NAR/GKW257
- Stiehler, F., Steinborn, M., Scholz, S., Dey, D., Weber, A. P. M., & Denton,
A. K. (2021). Helixer: cross-species gene annotation of large eukaryotic genomes
using deep learning. Bioinformatics, 36(22–23), 5291–5298.
https://doi.org/10.1093/BIOINFORMATICS/BTAA1044
- SRA Toolkit - GitHub. (n.d.). Retrieved September 19, 2022,
from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
- Weberlab: Institute of Plant Biochemistry, HHU: GitHub. (n.d.). Retrieved
May 23, 2022, from https://github.com/weberlab-hhu
# H5 files
Predmoter uses h5 files as input and main output data. H5 files (Hierarchical
Data Format 5, HDF5) are designed to store and organize large
amounts of data. They consist of two major components:
- Datasets (multidimensional arrays)
- Groups (container structures; can hold datasets and other groups)
## 1. H5 file architecture
### 1. Input files
The architecture of the h5 files used as input data for Predmoter looks like this:
```raw
<file>.h5
|
|
├── data
| ├── X (encoded DNA sequence)
| ├── seqids (chromosome/scaffold names)
| ├── species
| └── start_ends (start and end base per chunk)
|
└── evaluation
├── <dataset>_meta
| └── bam_files
├── <dataset>_coverage (in reads per base pair)
└── <dataset>_spliced_coverage
```
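To inspect this layout on an actual file, the HDF5 command line tools (an
assumption here; available e.g. via the hdf5-tools package) can list all groups
and datasets including their shapes:
```bash
# recursively list all groups and datasets with their shapes
h5ls -r <file>.h5
```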
#### 1.1. The data group
The base h5 file and all arrays of the data group are created using ``fasta2h5.py``.
##### 1.1.1. X
The encoded DNA sequence. The original fasta file is cut into chunks of the same
size (default: 21384 base pairs). The shape of this array is (N, C, 4). The C
represents the chunk/sequence length and N the number of chunks. The four represents
the nucleotides/bases (example shape: (11202, 21384, 4)). The nucleotide encoding
is as follows:
- [1, 0, 0, 0] = C
- [0, 1, 0, 0] = A
- [0, 0, 1, 0] = T
- [0, 0, 0, 1] = G
- [0.25, 0.25, 0.25, 0.25] = N
- [0, 0, 0, 0] = padding
Padding is used to fill up sequences that are too short and chromosome ends, since
chromosomes are rarely exactly divisible by the sequence length. Arrays always need
to have compatible shapes to be concatenated, so the sequence length needs to be
identical between all chunks; shorter sequences are therefore padded. Predmoter
considers Ns padding as well (so [0, 0, 0, 0]), as most Ns are gaps in the genome
assembly and as such not informative.
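A quick way to look at the encoding in a real file, again assuming the HDF5
command line tools are installed:
```bash
# dump the one-hot encoding of the first 5 bases of chunk 0
h5dump -d /data/X --start="0,0,0" --count="1,5,4" <file>.h5
```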
##### 1.1.2. seqids
The ``seqids`` represent the fasta file headers/identifiers of the sequence
(chromosome, scaffold, contig, etc.) that a chunk comes from. The shape is (N,), i.e.
the number of chunks.
##### 1.1.3. species
The species ID given by the user. The shape is (N,), i.e. the number of chunks.
Technically, this is useful if multiple species are in one file. Predmoter assumes
that there is just one species per file, which is especially important for testing
and predicting. It should be possible to use multiple species in one file during
training, but this hasn't been tested.
##### 1.1.4. start_ends
The start and end coordinates of a given chunk (in that order). When the end
coordinate is smaller than the start coordinate (e.g. [21384, 0]), then the
chunk is from the negative strand. So [0, 21384] is the first 21384 bp of a given
sequence of the positive strand. On the positive strand start is inclusive and
end exclusive (it's the other way around on the negative strand). If
``|start - end|`` is less than the chunk length then there is padding. The shape is
(N, 2), i.e. two values (start and end) per chunk.
#### Examples
##### 1. Chunk order
The DNA sequence order inside the h5 files is: chromosome 1-positive strand,
chromosome 1-negative strand, chromosome 2-positive strand,
chromosome 2-negative strand, etc.
Example of one chromosome:
- sequence length: 21384 bp
- length of chromosome 1: 51321 bp
| | | | | | | |
|:----------|:-----------|:---------------|:---------------|:---------------|:---------------|:-----------|
| chunk | 0 | 1 | 2 | 3 | 4 | 5 |
| strand | + | + | + | - | - | - |
| start/end | [0, 21384] | [21384, 42768] | [42768, 51321] | [51321, 42768] | [42768, 21384] | [21384, 0] |
All strands are represented from 5' to 3', so the negative strand is in descending
order. Chunks 2 and 3 contain padding and chunk 3 is the reverse complement of chunk 2.
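Assuming the HDF5 command line tools, the chunk order of a real file can be checked
directly:
```bash
# dump the start/end coordinates of the first 6 chunks
h5dump -d /data/start_ends --start="0,0" --count="6,2" <file>.h5
```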
##### 2. Padding
bases = A, T, C, G, N
padding = P
(both encoded inside the file)
positive strand:
| | | | |
|:-----------|:----------------|:----------------|:----------------|
| chunk | 0 | 1 | 2 |
| mini_array | [A, A, T, G, A] | [C, A, T, N, C] | [G, T, T, P, P] |
| start/end | [0, 5] | [5, 10] | [10, 13] |
negative strand:
| | | | |
|:-----------|:----------------|:----------------|:----------------|
| chunk | 3 | 4 | 5 |
| mini_array | [A, A, C, P, P] | [G, N, A, T, G] | [T, C, A, T, T] |
| start/end | [13, 10] | [10, 5] | [5, 0] |
On both the positive and the negative strand the padding is always at the end of the
array, so chunks 2 and 3 are **not** exact mirrors/reverse complements of each other,
while for example chunks 0 and 5 are.
#### 1.2. The evaluation group
All arrays of the evaluation group are added to the base h5 file using
``add_ngs_coverage.py``. Multiple datasets can be added to the h5 file.
##### 1.2.1. dataset_meta
The meta of a dataset contains the array ``bam_files``. This is a list of the bam file
names (including the path) that were used to add the dataset. The shape is (B,), i.e.
the number of bam files. Multiple bam files can be added per dataset, but it is not
possible to add additional bam files to an existing dataset. In that case,
``add_ngs_coverage.py`` needs to be rerun on an h5 file that does not yet contain the
dataset, adding all bam files (old and new) together.
##### 1.2.2. dataset_coverage
Coverage is the number of reads mapping at a given position (cigar M or X). The
shape is (N, C, B). The C represents the chunk/sequence length, N the number of
chunks, and B the number of bam files (example shape: (11202, 21384, 5)). Predmoter
uses the average of all bam files per dataset to train.
##### 1.2.3. dataset_spliced_coverage
Spliced coverage is the number of reads mapping with a gap, i.e. a read deletion
or spliced-out section, at a given position (cigar N or D). The shape is the same
as the shape of dataset_coverage.
### 2. Prediction file
The prediction h5 file contains:
```raw
<input_file_basename>_prediction.h5
|
├── data
| ├── seqids
| ├── start_ends
| └── species
|
└── prediction
├── predictions
├── model_name
└── datasets
```
The ``seqids``, ``start_ends`` and ``species`` arrays are taken from the input file.
The predictions array contains the model's predictions. The shape is (N, C, D);
the C represents the chunk/sequence length, N the number of chunks and D the number
of datasets predicted. The metadata contains the model's filename (including the path)
and the datasets the model predicted in the correct order (e.g., "atacseq", "h3k4me3").
The order of datasets is very important, as it is used by the Converter to convert each
dataset into its own bigwig or bedgraph file. The Converter also needs ``seqids`` and
``start_ends`` to calculate chromosome order and chromosome lengths.
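To check which datasets a prediction file contains, and in which order, the
metadata can be dumped directly (assuming the HDF5 command line tools):
```bash
# the order of entries matches the last dimension of /prediction/predictions
h5dump -d /prediction/datasets <input_file_basename>_prediction.h5
```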
## 2. H5 file creation
**Tools:** Helixer, GeenuFF (optional), Predmoter
(Stiehler et al., 2021; Weberlab: Institute of Plant Biochemistry, HHU: GitHub, n.d.)
ATAC- and ChIP-seq data is PCR-amplified. Therefore, you cannot determine which
strand a read originated from unless barcodes were used to generate strand-specific
data. Hence, the coverage information of a read is always added to both strands. The
advantages are:
- Open chromatin and closed chromatin regions always apply to both strands anyway.
- The addition to both strands allows for built-in data augmentation.
### 2.1. Create h5 file from fasta file
```bash
fasta2h5.py --species <full_species_name> --h5-output-path <species/filename>.h5 \
--fasta-path <path_to_genome>/<species>.fa
# full species name example: Arabidopsis_thaliana
```
> **H5 base file:** To add annotations or datasets, the fasta h5 file needs to be
> created **first**. The additional data will then be added to the fasta h5 file.
### 2.2. Add ATAC- and/or ChIP-seq data
Multiple datasets should be added separately. The shift argument is a special case
for ATAC-seq data, shifting the reads +4 (+ strand) or -5 (- strand) base pairs, as
is typically done.
```bash
python3 <path_to_helixer>/helixer/evaluation/add_ngs_coverage.py \
-s <full_species_name> -d <species>.h5 -b <path_to_bam_files>/*.bam \
--dataset-prefix <ngs_prefix> --unstranded --threads <number_of_bam_files_to_add> \
(--shift)
# multiple threads are faster, but --threads 0 also works
# ngs prefix example: atacseq
# --unstranded only for unstranded data
```
### 2.3. Add annotation(s) (optional)
To add an annotation to the h5 file, it needs to be converted from gff/gff3 to a
sqlite file using GeenuFF. If both the Helixer and the reference annotation are
desired, one of them needs to be added using ``--add-additional <annotation_name>``
(Helixer is the better option; suggested annotation name: helixer_post).
[More details about the annotation data](https://github.com/weberlab-hhu/Helixer/blob/main/docs/h5_data.md)
```bash
# to sqlite
import2geenuff.py --gff3 <helixer_annotation>.gff3 --fasta <species>.fa \
--species <full_species_name> --db-path <helixer_geenuff>.sqlite3 \
--log-file output.log
# --gff3 <reference>.gff to convert the reference annotation
# add annotation
geenuff2h5.py --h5-output-path <species>.h5 --input-db-path <helixer_geenuff>.sqlite3 \
--no-multiprocess --modes y,anno_meta,transitions
# modes: don't add X again when using add_additional
```
Resulting new file structure (here the reference annotation was added to the main
data group and the Helixer annotation to the alternative group):
```raw
<file>.h5
|
|
├── data
| ├── X (encoded DNA sequence)
| ├── seqids (chromosome/scaffold names)
| ├── species
| ├── start_ends (start and end base per chunk)
| ├── err_samples
| ├── fully_intergenic_samples
| ├── gene_lengths
| ├── is_annotated
| ├── phases
| ├── sample_weights
| ├── transitions
| └── y
|
├── evaluation
| ├── <dataset>_meta
| | └── bam_files
| ├── <dataset>_coverage (in reads per base pair)
| └── <dataset>_spliced_coverage
|
└── alternative
└── helixer_post
├── err_samples
├── fully_intergenic_samples
├── gene_lengths
├── is_annotated
├── phases
├── sample_weights
├── transitions
└── y
```
**Common errors:**
- GeenuFF is very strict, so converting the annotation to sqlite frequently fails and
the reference gff needs to be edited slightly before it can be converted.
- The full species names don't match up between script calls.
## References
- Stiehler, F., Steinborn, M., Scholz, S., Dey, D., Weber, A. P. M., & Denton,
A. K. (2021). Helixer: cross-species gene annotation of large eukaryotic genomes
using deep learning. Bioinformatics, 36(22–23), 5291–5298.
https://doi.org/10.1093/BIOINFORMATICS/BTAA1044
- SRA Toolkit - GitHub. (n.d.). Retrieved September 19, 2022,
from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
- Weberlab: Institute of Plant Biochemistry, HHU: GitHub. (n.d.). Retrieved
May 23, 2022, from https://github.com/weberlab-hhu