From 9c7bd1b576c6a8121a404f7ff9f9b76eebb76c7c Mon Sep 17 00:00:00 2001
From: Viktoria Petrova <vipet103@hhu.de>
Date: Wed, 22 Jan 2025 11:42:07 +0100
Subject: [PATCH] add detailed data preprocessing documentation and h5
 generation documentation to workflows

---
 workflows/data_preprocessing.md | 265 +++++++++++++++++++++++++++++++
 workflows/h5_files.md           | 270 ++++++++++++++++++++++++++++++++
 2 files changed, 535 insertions(+)
 create mode 100644 workflows/data_preprocessing.md
 create mode 100644 workflows/h5_files.md

diff --git a/workflows/data_preprocessing.md b/workflows/data_preprocessing.md
new file mode 100644
index 0000000..a5fb843
--- /dev/null
+++ b/workflows/data_preprocessing.md
@@ -0,0 +1,265 @@
+# Data preprocessing documentation
+This documentation covers the preprocessing of ATAC- and ChIP-seq data. The
+pipeline creates bam files from public next-generation sequencing (NGS) data
+(fastq files) downloaded from the NCBI (National Center for Biotechnology
+Information) SRA (Sequence Read Archive).
+
+## 1. Required tools
+The tool version numbers are the ones used during this project.
+
+- SRA Toolkit 3.0.0 (SRA Toolkit - GitHub)
+- Java 1.8.0 (Arnold et al., 2005)
+- Trimmomatic 0.36 (Bolger et al., 2014)
+- FastQC 0.11.9 (Babraham Bioinformatics - FastQC A Quality Control Tool
+for High Throughput Sequence Data)
+- MultiQC 1.14 (Ewels et al., 2016)
+- BWA-MEM2 2.1 (Md et al., 2019)
+- SamTools 1.6 (Danecek et al., 2021)
+- Picard (Picard Tools - By Broad Institute, n.d.)
+- deepTools 3.5.1 (Ramírez et al., 2016)
+- Helixer 0.3.1 (Stiehler et al., 2021; Weberlab: Institute of Plant Biochemistry,
+HHU: GitHub, n.d.)
+- HelixerPost (https://github.com/TonyBolger/HelixerPost)
+- gffread (Pertea & Pertea, 2020)
+- GeenuFF 0.3.0 (Weberlab: Institute of Plant Biochemistry, HHU: GitHub, n.d.)
+
+## 2. Pipeline
+| Category | Download data | Trimming | Quality control | Mapping | Quality control | Deduplication/Data cleaning | Quality control |
+|:--------:|:-------------:|:-------------------:|:---------------:|:--------------------:|:---------------:|:---------------------------:|:-------------------------------:|
+| Tools | SRA Toolkit | Trimmomatic<br>Java | FastQC | BWA-MEM2<br>SamTools | SamTools | Picard<br>Java<br>SamTools | deepTools<br>Helixer<br>gffread |
+
+
+### 2.1. Download data
+**Tools:** SRA Toolkit
+**Setup:** The path to the toolkit,
+``export PATH=$PATH:<path_to>/sratoolkit.3.0.0-ubuntu64/bin``, needs to be in
+your ``.bashrc``. When first configuring the SRA Toolkit, be sure to set the local
+file caching to a folder of sufficient size, so **not** your home directory
+(see https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration).
+
+There are two options for downloading runs from SRA: fastq-dump and fasterq-dump.
+The program calls are similar; fastq-dump downloads the files in gzip format,
+while fasterq-dump downloads faster but doesn't zip the fastq files.
+
+#### 2.1.1. Fastq-dump
+```bash
+fastq-dump <SRR_accession> --gzip --split-3
+```
+#### 2.1.2. Fasterq-dump
+```bash
+fasterq-dump <SRR_accession> --split-3
+
+gzip <SRR_accession>_1.fastq <SRR_accession>_2.fastq # for paired end data
+```
+
+#### 2.1.3. Special case: single cell data
+```bash
+prefetch <SRR_accession>
+fasterq-dump <SRR_accession> -S --include-technical
+
+gzip <SRR_accession>_3.fastq <SRR_accession>_4.fastq
+```
+Unzipped fastq files also work, as Trimmomatic accepts them as well. It is,
+however, necessary to rename the forward reads from <SRR_accession>_3.fastq to
+<SRR_accession>_1.fastq and the reverse reads from <SRR_accession>_4.fastq to
+<SRR_accession>_2.fastq.
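+For example, a minimal rename sketch (assuming the files were gzipped as above):
+```bash
+# rename the reads so downstream tools see a standard forward/reverse pair
+mv <SRR_accession>_3.fastq.gz <SRR_accession>_1.fastq.gz
+mv <SRR_accession>_4.fastq.gz <SRR_accession>_2.fastq.gz
+```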
+
+### 2.2. Trimming
+**Tools:** Java 1.8.0, Trimmomatic 0.36
+**Hint:** ATAC-seq data usually contains Nextera adapters, so use NexteraPE-PE.fa
+for paired ATAC-seq and TruSeq3-PE-2.fa for paired ChIP-seq data.
+
+```bash
+# paired end data
+java -jar -Xmx2048m <path_to_trimmomatic>/trimmomatic-0.36.jar PE -threads 2 \
+-basein <sample_ID>_1.fastq.gz -baseout <out_path>/<sample_ID>.fastq.gz \
+ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-PE-2.fa:3:30:10:1:true \
+MAXINFO:36:0.7
+
+# single end data
+java -jar -Xmx2048m <path_to_trimmomatic>/trimmomatic-0.36.jar SE -threads 2 \
+<sample_ID>.fastq.gz <out_path>/<sample_ID>.fastq.gz \
+ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-SE.fa:2:30:10 MAXINFO:36:0.7
+```
+
+### 2.3. Quality control
+**Tools:** FastQC
+```bash
+# paired end data
+fastqc <sample_ID>_1P.fastq.gz
+fastqc <sample_ID>_2P.fastq.gz
+
+# single end data
+fastqc <sample_ID>.fastq.gz
+```
+**Hint:** [MultiQC](https://multiqc.info) was used to generate summary html
+reports, e.g. by running ``multiqc <fastqc_output_folder>``.
+
+### 2.4. Mapping
+**Tools:** BWA-MEM2, SamTools
+
+#### 2.4.1. Genome indexing
+```bash
+bwa-mem2 index -p <path_to_genome>/<species> <path_to_genome>/<species>.fa
+```
+
+#### 2.4.2. Mapping
+```bash
+# paired end data
+bwa-mem2 mem <path_to_genome>/<species> <(zcat <sample_ID>_1P.fastq.gz) \
+<(zcat <sample_ID>_2P.fastq.gz) > <sample_ID>.sam
+
+# single end data
+bwa-mem2 mem <path_to_genome>/<species> <(zcat <sample_ID>.fastq.gz) > \
+<sample_ID>.sam
+
+# sort and index
+samtools view <sample_ID>.sam -b | samtools sort -T tmp<sample_ID> -@1 \
+-o <sample_ID>.bam
+samtools index <sample_ID>.bam
+```
+
+### 2.5. Quality control
+**Tools:** SamTools
+
+Check the percentage of reads mapped, mates paired, etc.
+
+```bash
+samtools flagstat <sample_ID>.bam > <sample_ID>.txt
+```
+
+### 2.6. Deduplication/Data cleaning
+**Tools:** Picard, Java, SamTools
+
+Common errors are:
+- the Picard output just ends without signaling that Picard finished, and
+there is no output bam file
+- java.lang.OutOfMemoryError: Java heap space
+- java.lang.OutOfMemoryError: GC overhead limit exceeded
+
+The **solution** is extending the Java heap space, i.e. replacing
+``-Xmx1G`` with ``-Xmx2G`` or even ``-Xmx3G``.
+
+```bash
+# picard
+java -Xmx1G -jar <path_to_picard>/picard.jar MarkDuplicates I=<sample_ID>.bam \
+O=<sample_ID>_marked.bam M=<sample_ID>.stats
+
+# filter with samtools
+samtools view -F 1796 <sample_ID>_marked.bam -b > <sample_ID>.bam
+# -F 1796 filters out duplicates, unmapped reads, non-primary alignments
+# and reads not passing platform quality checks
+samtools index <sample_ID>.bam
+```
+
+### 2.7. Quality control
+**Tools:** deepTools, Helixer, gffread
+
+Only the predictions/annotation from Helixer were used for consistency, but if the
+genome already has an annotation, you can also use that and skip step 2.7.4. You
+can also blacklist regions (bed file) in step 2.7.5, if the previous plot was too
+noisy. An easy way is to blacklist the chloroplast, mitochondrion and unplaced
+scaffolds, as they have been observed to frequently cause noise in ATAC-seq data
+(a sketch for building such a bed file follows 2.7.1).
+
+#### 2.7.1. Plot coverage
+```bash
+files=`ls <folder_of_deduplicated_files>/*.bam`
+<path_to_deeptools>/deepTools/bin/plotCoverage -b $files \
+--plotFile <species>Coverage.png --smartLabels \
+--outRawCounts <folder_of_deduplicated_files>/raw.txt
+```
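+A minimal sketch for building the blacklist bed file from the genome index; the
+``Pt``/``Mt`` sequence names are assumptions, so adjust the pattern to your
+assembly's chloroplast/mitochondrion/scaffold names:
+```bash
+# index the genome to get sequence names (column 1) and lengths (column 2)
+samtools faidx <species>.fa
+# write one bed interval spanning each unwanted sequence
+awk 'BEGIN{OFS="\t"} $1=="Pt" || $1=="Mt" {print $1, 0, $2}' \
+<species>.fa.fai > <species>_blacklist.bed
+```
+The resulting bed file can be passed to deepTools via ``--blackListFileName``
+(see the sketch at the end of this section).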
+#### 2.7.2. Plot fingerprint
+```bash
+files=`ls <folder_of_deduplicated_files>/*.bam`
+<path_to_deeptools>/deepTools/bin/plotFingerprint -b $files \
+--plotFile <species>Fingerprint.png --smartLabels --binSize 500 \
+--numberOfSamples 500000
+# if the genome is equal to or smaller than 250 Mbp, please adjust the defaults
+# of binSize (500) and numberOfSamples (500000) of plotFingerprint, so that
+# binSize x numberOfSamples < genome size
+```
+#### 2.7.3. Bam to bigwig
+```bash
+files=`ls <folder_of_deduplicated_files>/*.bam`
+
+for file in $files; do
+    <path_to_deeptools>/deepTools/bin/bamCoverage -b $file \
+    -o $(basename $file .bam).bw -of bigwig
+done
+```
+#### 2.7.4. Helixer annotation
+```bash
+<path_to_helixer>/Helixer/Helixer.py --lineage land_plant --fasta-path <species>.fa \
+--species <full_species_name> --gff-output-path <species>_land_plant_v0.3_a_0080_helixer.gff3 \
+--model-filepath <path_to_best_model>/land_plant_v0.3_a_0080.h5 --batch-size 150 \
+--subsequence-length 21384 --temporary-dir <temp_dir>
+# adjust the batch size/subsequence length to what your GPU can handle/what you need
+# full species name example: Arabidopsis_thaliana
+```
+>**Warning:** To create the annotation, you will need an Nvidia GPU, their CUDA
+> toolkit and cuDNN (see https://github.com/weberlab-hhu/Helixer).
+
+#### 2.7.5. Compute matrix
+```bash
+# convert predictions or reference annotation to bed
+gffread <species>_land_plant_v0.3_a_0080_helixer.gff3 \
+--bed -o <species>_helixer.bed
+
+# compute matrix
+files=`ls <path_to_bigwig_files>/*.bw`
+<path_to_deeptools>/deepTools/bin/computeMatrix reference-point \
+-S $files -R <species>_helixer.bed --smartLabels \
+-o <species>_matrix.mat.gz --upstream 1500 --downstream 1500
+```
+
+#### 2.7.6. Plot heatmap
+```bash
+<path_to_deeptools>/deepTools/bin/plotHeatmap -m <species>_matrix.mat.gz \
+-o <species>_tss_heatmap.png
+```
+
+The heatmap should show a peak in front of/around the transcription start site (TSS)
+for ATAC-seq and behind the TSS for H3K4me3 ChIP-seq.
+
+**ATAC-seq example of _Brachypodium distachyon_:**
+
+**ChIP-seq example of _Brachypodium distachyon_:**
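+
+If organelles or unplaced scaffolds dominate the heatmap, the matrix can be
+recomputed with the blacklist bed file built after 2.7.1 (a sketch using
+deepTools' ``--blackListFileName`` option):
+```bash
+files=`ls <path_to_bigwig_files>/*.bw`
+<path_to_deeptools>/deepTools/bin/computeMatrix reference-point \
+-S $files -R <species>_helixer.bed --smartLabels \
+--blackListFileName <species>_blacklist.bed \
+-o <species>_matrix.mat.gz --upstream 1500 --downstream 1500
+```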
+
+### References
+- Arnold, K., Gosling, J., & Holmes, D. (2005). The Java programming language.
+Addison Wesley Professional.
+- Babraham Bioinformatics - FastQC A Quality Control Tool for High Throughput
+Sequence Data. (n.d.). Retrieved May 23, 2022, from
+https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
+- Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer
+for Illumina sequence data. Bioinformatics, 30(15), 2114.
+https://doi.org/10.1093/BIOINFORMATICS/BTU170
+- Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O.,
+Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve
+years of SAMtools and BCFtools. GigaScience, 10(2).
+https://doi.org/10.1093/GIGASCIENCE/GIAB008
+- Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize
+analysis results for multiple tools and samples in a single report. Bioinformatics,
+32(19), 3047–3048. https://doi.org/10.1093/BIOINFORMATICS/BTW354
+- Md, V., Misra, S., Li, H., & Aluru, S. (2019). Efficient architecture-aware
+acceleration of BWA-MEM for multicore systems. Proceedings - 2019 IEEE 33rd
+International Parallel and Distributed Processing Symposium, IPDPS 2019, 314–324.
+https://doi.org/10.1109/IPDPS.2019.00041
+- Pertea, G., & Pertea, M. (2020). GFF Utilities: GffRead and GffCompare.
+F1000Research, 9, 304. https://doi.org/10.12688/F1000RESEARCH.23297.2
+- Picard Tools - By Broad Institute. (n.d.). Retrieved May 23, 2022,
+from https://broadinstitute.github.io/picard/
+- Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter,
+A. S., Heyne, S., Dündar, F., & Manke, T. (2016). deepTools2: a next generation
+web server for deep-sequencing data analysis. Nucleic Acids Research, 44(W1),
+W160–W165. https://doi.org/10.1093/NAR/GKW257
+- Stiehler, F., Steinborn, M., Scholz, S., Dey, D., Weber, A. P. M., & Denton,
+A. K. (2021). Helixer: cross-species gene annotation of large eukaryotic genomes
+using deep learning. Bioinformatics, 36(22–23), 5291–5298.
+https://doi.org/10.1093/BIOINFORMATICS/BTAA1044
+- SRA Toolkit - GitHub. (n.d.). Retrieved September 19, 2022,
+from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
+- Weberlab: Institute of Plant Biochemistry, HHU: GitHub. (n.d.). Retrieved
+May 23, 2022, from https://github.com/weberlab-hhu

diff --git a/workflows/h5_files.md b/workflows/h5_files.md
new file mode 100644
index 0000000..076ca60
--- /dev/null
+++ b/workflows/h5_files.md
@@ -0,0 +1,270 @@
+# H5 files
+Predmoter uses h5 files as its main input and output data format. H5 files,
+Hierarchical Data Format 5 (HDF5), are designed to store and organize large
+amounts of data. They consist of two major components:
+
+- Datasets (multidimensional arrays)
+- Groups (container structures; can hold datasets and other groups)
+
+## 1. H5 file architecture
+### 1. Input files
+The architecture of the h5 files used as input data for Predmoter looks like this:
+```raw
+<file>.h5
+|
+|
+├── data
+|   ├── X (encoded DNA sequence)
+|   ├── seqids (chromosome/scaffold names)
+|   ├── species
+|   └── start_ends (start and end base per chunk)
+|
+└── evaluation
+    ├── <dataset>_meta
+    |   └── bam_files
+    ├── <dataset>_coverage (in reads per base pair)
+    └── <dataset>_spliced_coverage
+```
+
+#### 1.1. The data group
+The base h5 file and all arrays of the data group are created using ``fasta2h5.py``.
+
+##### 1.1.1. X
+The encoded DNA sequence. The original fasta file is cut into chunks of the same
+size (default: 21384 base pairs). The shape of this array is (N, C, 4), where N
+is the number of chunks, C the chunk/sequence length, and 4 the number of
+nucleotides/bases (example shape: (11202, 21384, 4)). The nucleotide encoding
+is as follows:
+- [1, 0, 0, 0] = C
+- [0, 1, 0, 0] = A
+- [0, 0, 1, 0] = T
+- [0, 0, 0, 1] = G
+- [0.25, 0.25, 0.25, 0.25] = N
+- [0, 0, 0, 0] = padding
+
+Padding is used to pad out sequences that are too short, as well as chromosome
+ends, since chromosomes are rarely exactly divisible by the sequence length.
+Arrays always need to have compatible shapes to be concatenated, so the sequence
+length needs to be identical between all chunks; shorter sequences are therefore
+padded. Predmoter treats Ns like padding (i.e., [0, 0, 0, 0]), as most Ns are
+gaps in the genome assembly and as such not informative.
+
+##### 1.1.2. seqids
+The ``seqids`` represent the fasta file headers/identifiers of the sequence
+(chromosome, scaffold, contig, etc.) that a chunk comes from. The shape is (N,),
+i.e. one entry per chunk.
+
+##### 1.1.3. species
+The species ID given by the user. The shape is (N,), i.e. one entry per chunk.
+Technically this is only useful if multiple species are in one file. Predmoter
+assumes that there is just one species per file, which is especially important
+for testing and predicting. It should be possible to use multiple species in one
+file during training, but this wasn't tested.
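+
+For a quick sanity check of these arrays, the file can be listed with the HDF5
+command line tools (a sketch; assumes ``h5ls`` is installed, shapes from the
+example above):
+```bash
+h5ls -r <species>.h5
+# /data/X             Dataset {11202, 21384, 4}
+# /data/seqids        Dataset {11202}
+# /data/species       Dataset {11202}
+# /data/start_ends    Dataset {11202, 2}
+```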
+
+##### 1.1.4. start_ends
+The start and end coordinates of a given chunk (in that order). When the end
+coordinate is smaller than the start coordinate (e.g. [21384, 0]), the chunk is
+from the negative strand; [0, 21384] would be the first 21384 bp of a given
+sequence on the positive strand. On the positive strand, start is inclusive and
+end exclusive (it's the other way around on the negative strand). If
+``|start - end|`` is less than the chunk length, the chunk contains padding. The
+shape is (N, 2), i.e. two values (start and end) per chunk.
+
+#### Examples
+##### 1. Chunk order
+The DNA sequence order inside the h5 files is: chromosome 1-positive strand,
+chromosome 1-negative strand, chromosome 2-positive strand,
+chromosome 2-negative strand, etc.
+
+Example of one chromosome:
+- sequence length: 21384 bp
+- length of chromosome 1: 51321 bp
+
+| | | | | | | |
+|:----------|:-----------|:---------------|:---------------|:---------------|:---------------|:-----------|
+| chunk | 0 | 1 | 2 | 3 | 4 | 5 |
+| strand | + | + | + | - | - | - |
+| start/end | [0, 21384] | [21384, 42768] | [42768, 51321] | [51321, 42768] | [42768, 21384] | [21384, 0] |
+
+All strands are represented from 5' to 3', so the negative strand is in descending
+order. Chunks 2 and 3 contain padding; chunk 3 covers the same region as chunk 2
+on the opposite strand (see the padding example below for why the two are not
+exact reverse complements).
+
+##### 2. Padding
+bases = A, T, C, G, N
+padding = P
+(both encoded inside the file)
+
+positive strand:
+
+| | | | |
+|:-----------|:----------------|:----------------|:----------------|
+| chunk | 0 | 1 | 2 |
+| mini_array | [A, A, T, G, A] | [C, A, T, N, C] | [G, T, T, P, P] |
+| start/end | [0, 5] | [5, 10] | [10, 13] |
+
+negative strand:
+
+| | | | |
+|:-----------|:----------------|:----------------|:----------------|
+| chunk | 3 | 4 | 5 |
+| mini_array | [A, A, C, P, P] | [G, N, A, T, G] | [T, C, A, T, T] |
+| start/end | [13, 10] | [10, 5] | [5, 0] |
+
+On both the positive and the negative strand, the padding is always at the end of
+the array, so chunks 2 and 3 are **not** exact mirrors/reverse complements of each
+other, while for example chunks 0 and 5 are.
+
+#### 1.2. The evaluation group
+All arrays of the evaluation group are added to the base h5 file using
+``add_ngs_coverage.py``. Multiple datasets can be added to the h5 file.
+
+##### 1.2.1. dataset_meta
+The meta of a dataset contains the array ``bam_files``. This is a list of the bam
+file names (including the path) that were used to add the dataset. The shape is
+(B,), i.e. the number of bam files. Multiple bam files can be added per dataset,
+but it is not possible to add additional bam files to an existing dataset. To
+extend a dataset, ``add_ngs_coverage`` needs to be rerun on an h5 file that does
+not yet contain the dataset, adding all bam files (old and new) together.
+
+##### 1.2.2. dataset_coverage
+Coverage is the number of reads mapping at a given position (cigar M or X). The
+shape is (N, C, B), where N is the number of chunks, C the chunk/sequence length,
+and B the number of bam files (example shape: (11202, 21384, 5)). Predmoter
+trains on the average of all bam files per dataset.
+
+##### 1.2.3. dataset_spliced_coverage
+Spliced coverage is the number of reads mapping with a gap, i.e. a read deletion
+or spliced-out section, at a given position (cigar N or D). The shape is the same
+as the shape of dataset_coverage.
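+
+To check which bam files a dataset was built from, the ``bam_files`` meta array
+can be dumped directly (a sketch; assumes ``h5dump`` is installed and a dataset
+prefix of ``atacseq``):
+```bash
+h5dump -d "evaluation/atacseq_meta/bam_files" <species>.h5
+```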
+
+### 2. Prediction file
+The prediction h5 file contains:
+```raw
+<input_file_basename>_prediction.h5
+|
+├── data
+|   ├── seqids
+|   ├── start_ends
+|   └── species
+|
+└── prediction
+    ├── predictions
+    ├── model_name
+    └── datasets
+```
+
+The ``seqids``, ``start_ends`` and ``species`` arrays are taken from the input
+file. The predictions array contains the model's predictions. The shape is
+(N, C, D), where N is the number of chunks, C the chunk/sequence length, and D
+the number of datasets predicted. The metadata contains the model's filename
+(including the path) and the datasets the model predicted, in the correct order
+(e.g., "atacseq", "h3k4me3"). The order of datasets is very important, as it is
+used by the Converter to convert each dataset into its own bigwig or bedgraph
+file. The Converter also needs ``seqids`` and ``start_ends`` to calculate the
+chromosome order and chromosome lengths.
+
+## 2. H5 file creation
+**Tools:** Helixer, GeenuFF (optional), Predmoter
+(Stiehler et al., 2021; Weberlab: Institute of Plant Biochemistry, HHU: GitHub, n.d.)
+
+ATAC- and ChIP-seq data is PCR amplified. Therefore, you cannot determine which
+strand a read originated from unless you used barcodes to get strand-specific
+data. Hence, the coverage information of a read is always added to both strands.
+The advantages are:
+- Open chromatin and closed chromatin regions always apply to both strands anyway.
+- The addition to both strands allows for built-in data augmentation.
+
+### 2.1. Create h5 file from fasta file
+```bash
+fasta2h5.py --species <full_species_name> --h5-output-path <species/filename>.h5 \
+--fasta-path <path_to_genome>/<species>.fa
+# full species name example: Arabidopsis_thaliana
+```
+
+> **H5 base file:** To add annotations or datasets, the fasta h5 file needs to be
+> created **first**. The additional data will then be added to the fasta h5 file.
+
+### 2.2. Add ATAC- and/or ChIP-seq data
+Multiple datasets should be added separately. The shift argument is a special case
+for ATAC-seq data: it shifts the reads by +4 (+ strand) or -5 (- strand) base
+pairs, as is typically done.
+```bash
+python3 <path_to_helixer>/helixer/evaluation/add_ngs_coverage.py \
+-s <full_species_name> -d <species>.h5 -b <path_to_bam_files>/*.bam \
+--dataset-prefix <ngs_prefix> --unstranded --threads <number_of_bam_files_to_add> \
+(--shift)
+# multiple threads are faster, but --threads 0 also works
+# ngs prefix example: atacseq
+# --unstranded only for unstranded data
+```
+
+### 2.3. Add annotation(s) (optional)
+To add an annotation to the h5 file, it needs to be converted from gff/gff3 to a
+sqlite file using GeenuFF. If both the Helixer and the reference annotation are
+desired, one of them needs to be added using ``--add-additional <annotation_name>``
+(Helixer is the better option; suggested annotation name: helixer_post).
+[More details about the annotation data](https://github.com/weberlab-hhu/Helixer/blob/main/docs/h5_data.md)
+```bash
+# to sqlite
+import2geenuff.py --gff3 <helixer_annotation>.gff3 --fasta <species>.fa \
+--species <full_species_name> --db-path <helixer_geenuff>.sqlite3 \
+--log-file output.log
+# --gff3 <reference>.gff to convert the reference annotation instead
+
+# add annotation
+geenuff2h5.py --h5-output-path <species>.h5 --input-db-path <helixer_geenuff>.sqlite3 \
+--no-multiprocess --modes y,anno_meta,transitions
+# modes: don't add X again when using --add-additional
+```
+
+Resulting new file structure (when adding the reference annotation directly, i.e.
+not to alternative, and adding the Helixer annotation to alternative):
+```raw
+<file>.h5
+|
+|
+├── data
+|   ├── X (encoded DNA sequence)
+|   ├── seqids (chromosome/scaffold names)
+|   ├── species
+|   ├── start_ends (start and end base per chunk)
+|   ├── err_samples
+|   ├── fully_intergenic_samples
+|   ├── gene_lengths
+|   ├── is_annotated
+|   ├── phases
+|   ├── sample_weights
+|   ├── transitions
+|   └── y
+|
+├── evaluation
+|   ├── <dataset>_meta
+|   |   └── bam_files
+|   ├── <dataset>_coverage (in reads per base pair)
+|   └── <dataset>_spliced_coverage
+|
+└── alternative
+    └── helixer_post
+        ├── err_samples
+        ├── fully_intergenic_samples
+        ├── gene_lengths
+        ├── is_annotated
+        ├── phases
+        ├── sample_weights
+        ├── transitions
+        └── y
+```
+
+**Common errors:**
+- GeenuFF is very strict, so converting the annotation to sqlite frequently fails,
+and the reference gff needs to be edited slightly for it to be converted.
+- The full species names don't match up between script calls.
+
+## References
+- Stiehler, F., Steinborn, M., Scholz, S., Dey, D., Weber, A. P. M., & Denton,
+A. K. (2021). Helixer: cross-species gene annotation of large eukaryotic genomes
+using deep learning. Bioinformatics, 36(22–23), 5291–5298.
+https://doi.org/10.1093/BIOINFORMATICS/BTAA1044
+- SRA Toolkit - GitHub. (n.d.). Retrieved September 19, 2022,
+from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
+- Weberlab: Institute of Plant Biochemistry, HHU: GitHub. (n.d.). Retrieved
+May 23, 2022, from https://github.com/weberlab-hhu
\ No newline at end of file
--
GitLab