Skip to content
GitLab
Explore
Sign in
Primary navigation
Search or go to…
Project
Kindel-2024
Manage
Activity
Members
Labels
Plan
Issues
Issue boards
Milestones
Wiki
Code
Merge requests
Repository
Branches
Commits
Tags
Repository graph
Compare revisions
Snippets
Build
Pipelines
Jobs
Pipeline schedules
Artifacts
Deploy
Releases
Package Registry
Container Registry
Model registry
Operate
Environments
Terraform modules
Monitor
Incidents
Analyze
Value stream analytics
Contributor analytics
CI/CD analytics
Repository analytics
Model experiments
Help
Help
Support
GitLab documentation
Compare GitLab plans
Community forum
Contribute to GitLab
Provide feedback
Keyboard shortcuts
?
Snippets
Groups
Projects
Show more breadcrumbs
CEPLAS
Kindel-2024
Merge requests
!2
Ceplas dm viktoria
Code
Review changes
Check out branch
Download
Patches
Plain diff
Merged
Ceplas dm viktoria
ceplas-dm-viktoria
into
main
Overview
0
Commits
16
Pipelines
2
Changes
2
Merged
Viktoria Petrova
requested to merge
ceplas-dm-viktoria
into
main
3 months ago
Overview
0
Commits
16
Pipelines
2
Changes
2
Expand
check issues for a summary
0
0
Merge request reports
Viewing commit
9c7bd1b5
Prev
Next
Show latest version
2 files
+
535
−
0
Side-by-side
Compare changes
Side-by-side
Inline
Show whitespace changes
Show one file at a time
Files
2
Search (e.g. *.vue) (Ctrl+P)
9c7bd1b5
add detailed data preprocessing documentation and h5 generation documentation to workflows
· 9c7bd1b5
Viktoria Petrova
authored
3 months ago
workflows/data_preprocessing.md
0 → 100644
+
265
−
0
Options
# Data preprocessing documentataion
This documentation is about preprocessing ATAC- and ChIP-seq data. The pipeline
is used to create bam files from public new generation sequencing (NGS) data
(fastq files) downloaded from NCBI's (National Center for Biotechnology Information)
SRA (Sequence Read Archive).
## 1. Required tools
The tool version numbers are the ones used during this project.
-
SRA-Toolkit 3.0.0 (SRA Toolkit - GitHub)
-
Java 1.8.0 (Arnold et al., 2005)
-
Trimmomatic 0.36 (Bolger et al., 2014)
-
FastQC 0.11.9 (Babraham Bioinformatics - FastQC A Quality Control Tool
for High Throughput Sequence Data)
-
MultiQC 1.14 (Ewels et al., 2016)
-
BWA 2.1 (Md et al., 2019)
-
SamTools 1.6 (Danecek et al., 2021)
-
Picard (Picard Tools - By Broad Institute, n.d.)
-
deepTools 3.5.1 (Ramírez et al., 2016)
-
Helixer 0.3.1 (Stiehler et al., 2021; Weberlab: Institute of Plant Biochemistry,
HHU: GitHub, n.d.)
-
Helixer Post (https://github.com/TonyBolger/HelixerPost)
-
gffread (Pertea & Pertea, 2020)
-
GeenuFF 0.3.0 (Weberlab: Institute of Plant Biochemistry, HHU: GitHub, n.d.)
## 2. Pipeline
| Category | Download data | Trimming | Quality control | Mapping | Quality control | Deduplication/Data cleaning | Quality control |
|:--------:|:-------------:|:-------------------:|:---------------:|:---------------:|:---------------:|:---------------------------:|:-------------------------------:|
| Tools | SRA Toolkit | Trimmomatic
<br>
Java | FastQC | BWA
<br>
SamTools | SamTools | Picard
<br>
Java
<br>
SamTools | deepTools
<br>
Helixer
<br>
gffread |
### 2.1. Download data
**Tools:**
SRA toolkit
**Setup:**
The path to the toolkit
``export PATH=$PATH:<path_to>/sratoolkit.3.0.0-ubuntu64/bin``
needs to be in
your
``.bashrc``
. When first configuring the SRA toolkit, be sure to set the local
file caching to a folder of sufficient size, so
**not**
your home directory
(see https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration).
There are two options for downloading the runs from SRA: fastq-dump and
fasterq-dump. The program calls are similar, with fastq-dump downloading the
files in qzip format while fasterq-dump downloads faster, but doesn't zip the
fastq files.
#### 2.1.1. Fastq-dump
```
bash
fastq-dump <SRR_accsession>
--gzip
--split-3
```
#### 2.1.2. Fasterq-dump
```
bash
fasterq-dump <SRR_accsession>
--split-3
gzip
<SRR_accsession>_1.fastq <SRR_accsession>_2.fastq
# for paired end data
```
#### 2.1.3. Special case: single cell data
```
bash
prefetch <SRR_accsession>
fasterq-dump <SRR_accsession>
-S
--include-technical
gzip
<SRR_accsession>_3.fastq <SRR_accsession>_4.fastq
```
Unzipped fastq files also work, as trimmomatic accepts them as well.
It is however necessary to rename the forward read from
<SRR_accsession>
_3.fastq to
<SRR_accsession>
_1.fastq and the reverse reads from <SRR_
accsession>_4.fastq to
<SRR_accsession>
_2.fastq.
### 2.2. Trimming
**Tools:**
Java 1.8.0, Trimmomatic 0.36
**Hint:**
ATAC-seq data usually uses Nextera adapters, so use NexteraPE-PE.fa for
paired ATAC-seq and TruSeq3-PE-2.fa for paired ChIP-seq data.
```
bash
# paired end data
java
-jar
-Xmx2048m
<path_to_trimmomatic>/trimmomatic-0.36.jar PE
-threads
2
\
-basein
<sample_ID>_1.fastq.gz
-baseout
<out_path>/<sample_ID>.fastq.gz
\
ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-PE-2.fa:3:30:10:1:true
\
MAXINFO:36:0.7
# single end data
java
-jar
-Xmx2048m
<path_to_trimmomatic>/trimmomatic-0.36.jar SE
-threads
2
\
<sample_ID>.fastq.gz <out_path>/<sample_ID>.fastq.gz
\
ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-SE.fa:2:30:10 MAXINFO:36:0.7
```
### 2.3. Quality control
**Tools:**
FastQC
```
bash
# paired end data
fastqc <sample_ID>_1P.fastq.gz
fastqc <sample_ID>_2P.fastq.gz
# single end data
fastqc <sample_ID>.fastq.gz
```
**Hint:**
[
MultiQC
](
https://multiqc.info
)
was used to generate summary html reports.
### 2.4. Mapping
**Tools:**
BWA, SamTools
#### 2.4.1. Genome indexing
```
bash
bwa-mem2 index
-p
<path_to_genome>/<species> <path_to_genome>/<species>.fa
```
#### 2.4.2. Mapping
```
bash
# paired end data
bwa-mem2 mem <path_to_genome>/<species> <
(
zcat <sample_ID>_1P.fastq.gz
)
\
<
(
zcat <sample_ID>_2P.fastq.gz
)
>
<sample_ID>.sam
# single end data
bwa-mem2 mem <path_to_genome>/<species> <
(
zcat <sample_ID>.fastq.gz
)
>
\
<sample_ID>.sam
# sort and index
samtools view <sample_ID>.sam
-b
|samtools
sort
-T
tmp<sample_ID> -@1
\
-o
<sample_ID>.bam
samtools index <sample_ID>.bam
```
### 2.5. Quality control
**Tools:**
SamTools
Check for percentage of reads mapped, mates paired, etc.
```
bash
samtools flagstat <sample_ID>.bam
>
<sample_ID>.txt
```
### 2.6. Deduplication/Data cleaning
**Tools:**
Picard, Java, SamTools
Common errors are:
-
the picard output just ends without signaling that picard finished and
there is no output bam file
-
java.lang.OutOfMemoryError: Java heap space
-
java.lang.OutOfMemoryError: GC overhead limit exceeded
The
**solution**
is extending the java heap space, so replacing
-Xmx1G with -Xmx2G or even -Xmx3G.
```
bash
# picard
java
-Xmx1G
-jar
<path_to_picard>/picard.jar MarkDuplicates
I
=
<sample_ID>.bam
\
O
=
<sample_ID>_marked.bam
M
=
<sample_ID>.stats
# filter with samtools
samtools view
-F
1796 <sample_ID>_marked.bam
-b
>
<sample_ID>.bam
# -F 1796 filters out duplicates, unmapped reads, non-primary alignments
# and reads not passing platform quality checks
samtools index <sample_ID>.bam
```
### 2.7. Quality control
**Tools:**
deepTools, Helixer, gffread
Only the predictions/annotation from Helixer were used for consistency, but if the
genome already has an annotation you can also use that and skip the fourth step.
You can also blacklist regions (bed file) in the fifth step, if the previous plot
was too noisy. An easy way is to blacklist the chloroplast, mitochondrion and unplaced
scaffolds, as they have been observed to frequently cause noise in ATAC-seq data.
#### 2.7.1. Plot coverage
```
bash
files
=
`
ls
<folder_of_deduplicated_files>/
*
.bam
`
<path_to_deeptools>/deepTools/bin/plotCoverage
-b
$files
\
--plotFile
<species>Coverage.png
--smartLabels
--outRawCounts
deduplicated/raw.txt
```
#### 2.7.2. Plot fingerprint
```
bash
files
=
`
ls
<folder_of_deduplicated_files>/
*
.bam
`
<path_to_deeptools>/deepTools/bin/plotFingerprint
-b
$files
\
--plotFile
<species>Fingerprint.png
--smartLabels
--binSize
500
\
--numberOfSamples
500_000
# if the genome is equal or smaller than 250 Mbp, please adjust the defaults of
# binsize (500) and numberOfSamples (500_000) oof plotFingerprint, so that
# binsize x numberOfSamples < genome size
```
#### 2.7.3. Bam to bigwig
```
bash
files
=
`
ls
deduplicated/
*
2.bam
`
for
file
in
$files
;
do
<path_to_deeptools>/deepTools/bin/bamCoverage
-b
$file
-o
$(
basename
$file
.bam
)
.bw
-of
bigwig
done
```
#### 2.7.4. Helixer annotation
```
bash
<path_to_helixer>/Helixer/Helixer.py
--lineage
land_plant
--fasta-path
<species>.fa
\
--species
<full_species_name>
--gff-output-path
<species>_land_plant_v0.3_a_0080_helixer.gff3
\
--model-filepath
<path_to_best_model>/land_plant_v0.3_a_0080.h5
--batch-size
150
\
--subsequence-length
21384
--temporary-dir
<temp_dir>
# change batch size/subsequence length as your GPU can handle/as you need
# full species name example: Arabidopsis_thaliana
```
>**Warning:** To create the annotation, you will need an Nvidia GPU, their CUDA
> toolkit and cuDNN. (see https://github.com/weberlab-hhu/Helixer)
#### 2.7.5. Compute matrix
```
bash
# convert predictions or reference annotation to bed
gffread helixer_pred/<species>_land_plant_v0.3_a_0080_helixer.gff3
\
--bed
-o
<species>_helixer.bed
# compute matrix
files
=
`
ls
<path_to_bigwig_files>
*
.bw
`
<path_to_deeptools>/deepTools/bin/computeMatrix reference-point
\
-S
$files
-R
<species>_helixer.bed
--smartLabels
\
-o
<species>_matrix.mat.gz
--upstream
1500
--downstream
1500
```
#### 2.7.6. Plot heatmap
```
bash
<path_to_deeptools>/deepTools/bin/plotHeatmap
-m
deepqc/<species>_matrix.mat.gz
\
-o
<species>_tss_heatmap.png
```
The heatmap should show a peak in front of/around the transcription start site (TSS)
for ATAC-seq and behind the TSS for H3K4me3 ChIP-seq.
**ATAC-seq example of _Brachypodium distachyon_:**

**ChIP-seq example of _Brachypodium distachyon_:**

### References
-
Arnold, K., Gosling, J., & Holmes, D. (2005). The Java programming language.
Addison Wesley Professional.
-
Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput
Sequence Data. (n.d.). Retrieved May 23, 2022, from
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
-
Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer
for Illumina sequence data. Bioinformatics, 30(15), 2114.
https://doi.org/10.1093/BIOINFORMATICS/BTU170
-
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O.,
Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve
years of SAMtools and BCFtools. GigaScience, 10(2).
https://doi.org/10.1093/GIGASCIENCE/GIAB008
-
Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize
analysis results for multiple tools and samples in a single report. Bioinformatics,
32(19), 3047–3048. https://doi.org/10.1093/BIOINFORMATICS/BTW354
-
Md, V., Misra, S., Li, H., & Aluru, S. (2019). Efficient architecture-aware
acceleration of BWA-MEM for multicore systems. Proceedings - 2019 IEEE 33rd
International Parallel and Distributed Processing Symposium, IPDPS 2019, 314–324.
https://doi.org/10.1109/IPDPS.2019.00041
-
Pertea, G., & Pertea, M. (2020). GFF Utilities: GffRead and GffCompare.
F1000Research, 9, 304. https://doi.org/10.12688/F1000RESEARCH.23297.2
-
Picard Tools - By Broad Institute. (n.d.). Retrieved May 23, 2022,
from https://broadinstitute.github.io/picard/
-
Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter,
A. S., Heyne, S., Dündar, F., & Manke, T. (2016). deepTools2: a next generation
web server for deep-sequencing data analysis. Nucleic Acids Research, 44(W1),
W160–W165. https://doi.org/10.1093/NAR/GKW257
-
Stiehler, F., Steinborn, M., Scholz, S., Dey, D., Weber, A. P. M., & Denton,
A. K. (2021). Helixer: cross-species gene annotation of large eukaryotic genomes
using deep learning. Bioinformatics, 36(22–23), 5291–5298.
https://doi.org/10.1093/BIOINFORMATICS/BTAA1044
-
SRA Toolkit - GitHub. (n.d.). Retrieved September 19, 2022,
from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
-
Weberlab: Institute of Plant Biochemistry, HHU: GitHub. (n.d.). Retrieved
May 23, 2022, from https://github.com/weberlab-hhu
Loading