From 9c7bd1b576c6a8121a404f7ff9f9b76eebb76c7c Mon Sep 17 00:00:00 2001
From: Viktoria Petrova <vipet103@hhu.de>
Date: Wed, 22 Jan 2025 11:42:07 +0100
Subject: [PATCH] add detailed data preprocessing documentation and h5
 generation documentation to workflows

---
 workflows/data_preprocessing.md | 265 +++++++++++++++++++++++++++++++
 workflows/h5_files.md           | 270 ++++++++++++++++++++++++++++++++
 2 files changed, 535 insertions(+)
 create mode 100644 workflows/data_preprocessing.md
 create mode 100644 workflows/h5_files.md

diff --git a/workflows/data_preprocessing.md b/workflows/data_preprocessing.md
new file mode 100644
index 0000000..a5fb843
--- /dev/null
+++ b/workflows/data_preprocessing.md
@@ -0,0 +1,265 @@
+# Data preprocessing documentation
+This documentation covers the preprocessing of ATAC- and ChIP-seq data. The
+pipeline creates bam files from public next-generation sequencing (NGS) data
+(fastq files) downloaded from the NCBI (National Center for Biotechnology
+Information) SRA (Sequence Read Archive).
+
+## 1. Required tools
+The tool version numbers are the ones used during this project.
+
+- SRA Toolkit 3.0.0 (SRA Toolkit - GitHub)
+- Java 1.8.0 (Arnold et al., 2005)
+- Trimmomatic 0.36 (Bolger et al., 2014)
+- FastQC 0.11.9 (Babraham Bioinformatics - FastQC A Quality Control Tool
+for High Throughput Sequence Data)
+- MultiQC 1.14 (Ewels et al., 2016)
+- BWA-MEM2 2.1 (Md et al., 2019)
+- SamTools 1.6 (Danecek et al., 2021)
+- Picard (Picard Tools - By Broad Institute, n.d.)
+- deepTools 3.5.1 (Ramírez et al., 2016)
+- Helixer 0.3.1 (Stiehler et al., 2021; Weberlab: Institute of Plant Biochemistry,
+HHU: GitHub, n.d.)
+- HelixerPost (https://github.com/TonyBolger/HelixerPost)
+- gffread (Pertea & Pertea, 2020)
+- GeenuFF 0.3.0 (Weberlab: Institute of Plant Biochemistry, HHU: GitHub, n.d.)
+
+## 2. Pipeline
+| Category | Download data | Trimming | Quality control | Mapping | Quality control | Deduplication/Data cleaning | Quality control |
+|:--------:|:-------------:|:-------------------:|:---------------:|:--------------------:|:---------------:|:---------------------------:|:-------------------------------:|
+| Tools | SRA Toolkit | Trimmomatic<br>Java | FastQC | BWA-MEM2<br>SamTools | SamTools | Picard<br>Java<br>SamTools | deepTools<br>Helixer<br>gffread |
+
+
+### 2.1. Download data
+**Tools:** SRA Toolkit
+**Setup:** The path to the toolkit,
+``export PATH=$PATH:<path_to>/sratoolkit.3.0.0-ubuntu64/bin``, needs to be in
+your ``.bashrc``. When first configuring the SRA Toolkit, be sure to set the local
+file caching to a folder of sufficient size, so **not** your home directory
+(see https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration).
+
+There are two options for downloading runs from SRA: fastq-dump and fasterq-dump.
+The program calls are similar; fastq-dump downloads the files in gzip format,
+while fasterq-dump downloads faster but doesn't zip the fastq files.
+
+#### 2.1.1. Fastq-dump
+```bash
+fastq-dump <SRR_accession> --gzip --split-3
+```
+#### 2.1.2. Fasterq-dump
+```bash
+fasterq-dump <SRR_accession> --split-3
+
+gzip <SRR_accession>_1.fastq <SRR_accession>_2.fastq # for paired end data
+```
+
+#### 2.1.3. Special case: single cell data
+```bash
+prefetch <SRR_accession>
+fasterq-dump <SRR_accession> -S --include-technical
+
+gzip <SRR_accession>_3.fastq <SRR_accession>_4.fastq
+```
+Unzipped fastq files also work, as Trimmomatic accepts them as well. It is,
+however, necessary to rename the forward reads from <SRR_accession>_3.fastq to
+<SRR_accession>_1.fastq and the reverse reads from <SRR_accession>_4.fastq to
+<SRR_accession>_2.fastq.
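+For example, a minimal rename sketch (assuming the files were gzipped as above):
+```bash
+# rename the reads so downstream tools see a standard forward/reverse pair
+mv <SRR_accession>_3.fastq.gz <SRR_accession>_1.fastq.gz
+mv <SRR_accession>_4.fastq.gz <SRR_accession>_2.fastq.gz
+```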
+
+### 2.2. Trimming
+**Tools:** Java 1.8.0, Trimmomatic 0.36
+**Hint:** ATAC-seq data usually contains Nextera adapters, so use NexteraPE-PE.fa
+for paired ATAC-seq and TruSeq3-PE-2.fa for paired ChIP-seq data.
+
+```bash
+# paired end data
+java -jar -Xmx2048m <path_to_trimmomatic>/trimmomatic-0.36.jar PE -threads 2 \
+-basein <sample_ID>_1.fastq.gz -baseout <out_path>/<sample_ID>.fastq.gz \
+ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-PE-2.fa:3:30:10:1:true \
+MAXINFO:36:0.7
+
+# single end data
+java -jar -Xmx2048m <path_to_trimmomatic>/trimmomatic-0.36.jar SE -threads 2 \
+<sample_ID>.fastq.gz <out_path>/<sample_ID>.fastq.gz \
+ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-SE.fa:2:30:10 MAXINFO:36:0.7
+```
+
+### 2.3. Quality control
+**Tools:** FastQC
+```bash
+# paired end data
+fastqc <sample_ID>_1P.fastq.gz
+fastqc <sample_ID>_2P.fastq.gz
+
+# single end data
+fastqc <sample_ID>.fastq.gz
+```
+**Hint:** [MultiQC](https://multiqc.info) was used to generate summary html
+reports, e.g. by running ``multiqc <fastqc_output_folder>``.
+
+### 2.4. Mapping
+**Tools:** BWA-MEM2, SamTools
+
+#### 2.4.1. Genome indexing
+```bash
+bwa-mem2 index -p <path_to_genome>/<species> <path_to_genome>/<species>.fa
+```
+
+#### 2.4.2. Mapping
+```bash
+# paired end data
+bwa-mem2 mem <path_to_genome>/<species> <(zcat <sample_ID>_1P.fastq.gz) \
+<(zcat <sample_ID>_2P.fastq.gz) > <sample_ID>.sam
+
+# single end data
+bwa-mem2 mem <path_to_genome>/<species> <(zcat <sample_ID>.fastq.gz) > \
+<sample_ID>.sam
+
+# sort and index
+samtools view <sample_ID>.sam -b | samtools sort -T tmp<sample_ID> -@1 \
+-o <sample_ID>.bam
+samtools index <sample_ID>.bam
+```
+
+### 2.5. Quality control
+**Tools:** SamTools
+
+Check the percentage of reads mapped, mates paired, etc.
+
+```bash
+samtools flagstat <sample_ID>.bam > <sample_ID>.txt
+```
+
+### 2.6. Deduplication/Data cleaning
+**Tools:** Picard, Java, SamTools
+
+Common errors are:
+- the Picard output just ends without signaling that Picard finished, and
+there is no output bam file
+- java.lang.OutOfMemoryError: Java heap space
+- java.lang.OutOfMemoryError: GC overhead limit exceeded
+
+The **solution** is extending the Java heap space, i.e. replacing
+``-Xmx1G`` with ``-Xmx2G`` or even ``-Xmx3G``.
+
+```bash
+# picard
+java -Xmx1G -jar <path_to_picard>/picard.jar MarkDuplicates I=<sample_ID>.bam \
+O=<sample_ID>_marked.bam M=<sample_ID>.stats
+
+# filter with samtools
+samtools view -F 1796 <sample_ID>_marked.bam -b > <sample_ID>.bam
+# -F 1796 filters out duplicates, unmapped reads, non-primary alignments
+# and reads not passing platform quality checks
+samtools index <sample_ID>.bam
+```
+
+### 2.7. Quality control
+**Tools:** deepTools, Helixer, gffread
+
+Only the predictions/annotation from Helixer were used for consistency, but if the
+genome already has an annotation, you can also use that and skip step 2.7.4. You
+can also blacklist regions (bed file) in step 2.7.5, if the previous plot was too
+noisy. An easy way is to blacklist the chloroplast, mitochondrion and unplaced
+scaffolds, as they have been observed to frequently cause noise in ATAC-seq data
+(a sketch for building such a bed file follows 2.7.1).
+
+#### 2.7.1. Plot coverage
+```bash
+files=`ls <folder_of_deduplicated_files>/*.bam`
+<path_to_deeptools>/deepTools/bin/plotCoverage -b $files \
+--plotFile <species>Coverage.png --smartLabels \
+--outRawCounts <folder_of_deduplicated_files>/raw.txt
+```
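+A minimal sketch for building the blacklist bed file from the genome index; the
+``Pt``/``Mt`` sequence names are assumptions, so adjust the pattern to your
+assembly's chloroplast/mitochondrion/scaffold names:
+```bash
+# index the genome to get sequence names (column 1) and lengths (column 2)
+samtools faidx <species>.fa
+# write one bed interval spanning each unwanted sequence
+awk 'BEGIN{OFS="\t"} $1=="Pt" || $1=="Mt" {print $1, 0, $2}' \
+<species>.fa.fai > <species>_blacklist.bed
+```
+The resulting bed file can be passed to deepTools via ``--blackListFileName``
+(see the sketch at the end of this section).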
+#### 2.7.2. Plot fingerprint
+```bash
+files=`ls <folder_of_deduplicated_files>/*.bam`
+<path_to_deeptools>/deepTools/bin/plotFingerprint -b $files \
+--plotFile <species>Fingerprint.png --smartLabels --binSize 500 \
+--numberOfSamples 500000
+# if the genome is equal to or smaller than 250 Mbp, please adjust the defaults
+# of binSize (500) and numberOfSamples (500000) of plotFingerprint, so that
+# binSize x numberOfSamples < genome size
+```
+#### 2.7.3. Bam to bigwig
+```bash
+files=`ls <folder_of_deduplicated_files>/*.bam`
+
+for file in $files; do
+    <path_to_deeptools>/deepTools/bin/bamCoverage -b $file \
+    -o $(basename $file .bam).bw -of bigwig
+done
+```
+#### 2.7.4. Helixer annotation
+```bash
+<path_to_helixer>/Helixer/Helixer.py --lineage land_plant --fasta-path <species>.fa \
+--species <full_species_name> --gff-output-path <species>_land_plant_v0.3_a_0080_helixer.gff3 \
+--model-filepath <path_to_best_model>/land_plant_v0.3_a_0080.h5 --batch-size 150 \
+--subsequence-length 21384 --temporary-dir <temp_dir>
+# adjust the batch size/subsequence length to what your GPU can handle/what you need
+# full species name example: Arabidopsis_thaliana
+```
+>**Warning:** To create the annotation, you will need an Nvidia GPU, their CUDA
+> toolkit and cuDNN (see https://github.com/weberlab-hhu/Helixer).
+
+#### 2.7.5. Compute matrix
+```bash
+# convert predictions or reference annotation to bed
+gffread <species>_land_plant_v0.3_a_0080_helixer.gff3 \
+--bed -o <species>_helixer.bed
+
+# compute matrix
+files=`ls <path_to_bigwig_files>/*.bw`
+<path_to_deeptools>/deepTools/bin/computeMatrix reference-point \
+-S $files -R <species>_helixer.bed --smartLabels \
+-o <species>_matrix.mat.gz --upstream 1500 --downstream 1500
+```
+
+#### 2.7.6. Plot heatmap
+```bash
+<path_to_deeptools>/deepTools/bin/plotHeatmap -m <species>_matrix.mat.gz \
+-o <species>_tss_heatmap.png
+```
+
+The heatmap should show a peak in front of/around the transcription start site (TSS)
+for ATAC-seq and behind the TSS for H3K4me3 ChIP-seq.
+
+**ATAC-seq example of _Brachypodium distachyon_:**
+
+**ChIP-seq example of _Brachypodium distachyon_:**
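+
+If organelles or unplaced scaffolds dominate the heatmap, the matrix can be
+recomputed with the blacklist bed file built after 2.7.1 (a sketch using
+deepTools' ``--blackListFileName`` option):
+```bash
+files=`ls <path_to_bigwig_files>/*.bw`
+<path_to_deeptools>/deepTools/bin/computeMatrix reference-point \
+-S $files -R <species>_helixer.bed --smartLabels \
+--blackListFileName <species>_blacklist.bed \
+-o <species>_matrix.mat.gz --upstream 1500 --downstream 1500
+```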
+
+### References
+- Arnold, K., Gosling, J., & Holmes, D. (2005). The Java programming language.
+Addison Wesley Professional.
+- Babraham Bioinformatics - FastQC A Quality Control Tool for High Throughput
+Sequence Data. (n.d.). Retrieved May 23, 2022, from
+https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
+- Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer
+for Illumina sequence data. Bioinformatics, 30(15), 2114.
+https://doi.org/10.1093/BIOINFORMATICS/BTU170
+- Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O.,
+Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve
+years of SAMtools and BCFtools. GigaScience, 10(2).
+https://doi.org/10.1093/GIGASCIENCE/GIAB008
+- Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize
+analysis results for multiple tools and samples in a single report. Bioinformatics,
+32(19), 3047–3048. https://doi.org/10.1093/BIOINFORMATICS/BTW354
+- Md, V., Misra, S., Li, H., & Aluru, S. (2019). Efficient architecture-aware
+acceleration of BWA-MEM for multicore systems. Proceedings - 2019 IEEE 33rd
+International Parallel and Distributed Processing Symposium, IPDPS 2019, 314–324.
+https://doi.org/10.1109/IPDPS.2019.00041
+- Pertea, G., & Pertea, M. (2020). GFF Utilities: GffRead and GffCompare.
+F1000Research, 9, 304. https://doi.org/10.12688/F1000RESEARCH.23297.2
+- Picard Tools - By Broad Institute. (n.d.). Retrieved May 23, 2022,
+from https://broadinstitute.github.io/picard/
+- Ramírez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter,
+A. S., Heyne, S., Dündar, F., & Manke, T. (2016). deepTools2: a next generation
+web server for deep-sequencing data analysis. Nucleic Acids Research, 44(W1),
+W160–W165. https://doi.org/10.1093/NAR/GKW257
+- Stiehler, F., Steinborn, M., Scholz, S., Dey, D., Weber, A. P. M., & Denton,
+A. K. (2021). Helixer: cross-species gene annotation of large eukaryotic genomes
+using deep learning. Bioinformatics, 36(22–23), 5291–5298.
+https://doi.org/10.1093/BIOINFORMATICS/BTAA1044
+- SRA Toolkit - GitHub. (n.d.). Retrieved September 19, 2022,
+from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
+- Weberlab: Institute of Plant Biochemistry, HHU: GitHub. (n.d.). Retrieved
+May 23, 2022, from https://github.com/weberlab-hhu

diff --git a/workflows/h5_files.md b/workflows/h5_files.md
new file mode 100644
index 0000000..076ca60
--- /dev/null
+++ b/workflows/h5_files.md
@@ -0,0 +1,270 @@
+# H5 files
+Predmoter uses h5 files as its main input and output data format. H5 files,
+Hierarchical Data Format 5 (HDF5), are designed to store and organize large
+amounts of data. They consist of two major components:
+
+- Datasets (multidimensional arrays)
+- Groups (container structures; can hold datasets and other groups)
+
+## 1. H5 file architecture
+### 1. Input files
+The architecture of the h5 files used as input data for Predmoter looks like this:
+```raw
+<file>.h5
+|
+|
+├── data
+|   ├── X (encoded DNA sequence)
+|   ├── seqids (chromosome/scaffold names)
+|   ├── species
+|   └── start_ends (start and end base per chunk)
+|
+└── evaluation
+    ├── <dataset>_meta
+    |   └── bam_files
+    ├── <dataset>_coverage (in reads per base pair)
+    └── <dataset>_spliced_coverage
+```
+
+#### 1.1. The data group
+The base h5 file and all arrays of the data group are created using ``fasta2h5.py``.
+
+##### 1.1.1. X
+The encoded DNA sequence. The original fasta file is cut into chunks of the same
+size (default: 21384 base pairs). The shape of this array is (N, C, 4), where N
+is the number of chunks, C the chunk/sequence length, and 4 the number of
+nucleotides/bases (example shape: (11202, 21384, 4)). The nucleotide encoding
+is as follows:
+- [1, 0, 0, 0] = C
+- [0, 1, 0, 0] = A
+- [0, 0, 1, 0] = T
+- [0, 0, 0, 1] = G
+- [0.25, 0.25, 0.25, 0.25] = N
+- [0, 0, 0, 0] = padding
+
+Padding is used to pad out sequences that are too short, as well as chromosome
+ends, since chromosomes are rarely exactly divisible by the sequence length.
+Arrays always need to have compatible shapes to be concatenated, so the sequence
+length needs to be identical between all chunks; shorter sequences are therefore
+padded. Predmoter treats Ns like padding (i.e., [0, 0, 0, 0]), as most Ns are
+gaps in the genome assembly and as such not informative.
+
+##### 1.1.2. seqids
+The ``seqids`` represent the fasta file headers/identifiers of the sequence
+(chromosome, scaffold, contig, etc.) that a chunk comes from. The shape is (N,),
+i.e. one entry per chunk.
+
+##### 1.1.3. species
+The species ID given by the user. The shape is (N,), i.e. one entry per chunk.
+Technically this is only useful if multiple species are in one file. Predmoter
+assumes that there is just one species per file, which is especially important
+for testing and predicting. It should be possible to use multiple species in one
+file during training, but this wasn't tested.
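+
+For a quick sanity check of these arrays, the file can be listed with the HDF5
+command line tools (a sketch; assumes ``h5ls`` is installed, shapes from the
+example above):
+```bash
+h5ls -r <species>.h5
+# /data/X             Dataset {11202, 21384, 4}
+# /data/seqids        Dataset {11202}
+# /data/species       Dataset {11202}
+# /data/start_ends    Dataset {11202, 2}
+```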
+
+##### 1.1.4. start_ends
+The start and end coordinates of a given chunk (in that order). When the end
+coordinate is smaller than the start coordinate (e.g. [21384, 0]), the chunk is
+from the negative strand; [0, 21384] would be the first 21384 bp of a given
+sequence on the positive strand. On the positive strand, start is inclusive and
+end exclusive (it's the other way around on the negative strand). If
+``|start - end|`` is less than the chunk length, the chunk contains padding. The
+shape is (N, 2), i.e. two values (start and end) per chunk.
+
+#### Examples
+##### 1. Chunk order
+The DNA sequence order inside the h5 files is: chromosome 1-positive strand,
+chromosome 1-negative strand, chromosome 2-positive strand,
+chromosome 2-negative strand, etc.
+
+Example of one chromosome:
+- sequence length: 21384 bp
+- length of chromosome 1: 51321 bp
+
+| | | | | | | |
+|:----------|:-----------|:---------------|:---------------|:---------------|:---------------|:-----------|
+| chunk | 0 | 1 | 2 | 3 | 4 | 5 |
+| strand | + | + | + | - | - | - |
+| start/end | [0, 21384] | [21384, 42768] | [42768, 51321] | [51321, 42768] | [42768, 21384] | [21384, 0] |
+
+All strands are represented from 5' to 3', so the negative strand is in descending
+order. Chunks 2 and 3 contain padding; chunk 3 covers the same region as chunk 2
+on the opposite strand (see the padding example below for why the two are not
+exact reverse complements).
+
+##### 2. Padding
+bases = A, T, C, G, N
+padding = P
+(both encoded inside the file)
+
+positive strand:
+
+| | | | |
+|:-----------|:----------------|:----------------|:----------------|
+| chunk | 0 | 1 | 2 |
+| mini_array | [A, A, T, G, A] | [C, A, T, N, C] | [G, T, T, P, P] |
+| start/end | [0, 5] | [5, 10] | [10, 13] |
+
+negative strand:
+
+| | | | |
+|:-----------|:----------------|:----------------|:----------------|
+| chunk | 3 | 4 | 5 |
+| mini_array | [A, A, C, P, P] | [G, N, A, T, G] | [T, C, A, T, T] |
+| start/end | [13, 10] | [10, 5] | [5, 0] |
+
+On both the positive and the negative strand, the padding is always at the end of
+the array, so chunks 2 and 3 are **not** exact mirrors/reverse complements of each
+other, while for example chunks 0 and 5 are.
+
+#### 1.2. The evaluation group
+All arrays of the evaluation group are added to the base h5 file using
+``add_ngs_coverage.py``. Multiple datasets can be added to the h5 file.
+
+##### 1.2.1. dataset_meta
+The meta of a dataset contains the array ``bam_files``. This is a list of the bam
+file names (including the path) that were used to add the dataset. The shape is
+(B,), i.e. the number of bam files. Multiple bam files can be added per dataset,
+but it is not possible to add additional bam files to an existing dataset. To
+extend a dataset, ``add_ngs_coverage`` needs to be rerun on an h5 file that does
+not yet contain the dataset, adding all bam files (old and new) together.
+
+##### 1.2.2. dataset_coverage
+Coverage is the number of reads mapping at a given position (cigar M or X). The
+shape is (N, C, B), where N is the number of chunks, C the chunk/sequence length,
+and B the number of bam files (example shape: (11202, 21384, 5)). Predmoter
+trains on the average of all bam files per dataset.
+
+##### 1.2.3. dataset_spliced_coverage
+Spliced coverage is the number of reads mapping with a gap, i.e. a read deletion
+or spliced-out section, at a given position (cigar N or D). The shape is the same
+as the shape of dataset_coverage.
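+
+To check which bam files a dataset was built from, the ``bam_files`` meta array
+can be dumped directly (a sketch; assumes ``h5dump`` is installed and a dataset
+prefix of ``atacseq``):
+```bash
+h5dump -d "evaluation/atacseq_meta/bam_files" <species>.h5
+```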
+
+### 2. Prediction file
+The prediction h5 file contains:
+```raw
+<input_file_basename>_prediction.h5
+|
+├── data
+|   ├── seqids
+|   ├── start_ends
+|   └── species
+|
+└── prediction
+    ├── predictions
+    ├── model_name
+    └── datasets
+```
+
+The ``seqids``, ``start_ends`` and ``species`` arrays are taken from the input
+file. The predictions array contains the model's predictions. The shape is
+(N, C, D), where N is the number of chunks, C the chunk/sequence length, and D
+the number of datasets predicted. The metadata contains the model's filename
+(including the path) and the datasets the model predicted, in the correct order
+(e.g., "atacseq", "h3k4me3"). The order of datasets is very important, as it is
+used by the Converter to convert each dataset into its own bigwig or bedgraph
+file. The Converter also needs ``seqids`` and ``start_ends`` to calculate the
+chromosome order and chromosome lengths.
+
+## 2. H5 file creation
+**Tools:** Helixer, GeenuFF (optional), Predmoter
+(Stiehler et al., 2021; Weberlab: Institute of Plant Biochemistry, HHU: GitHub, n.d.)
+
+ATAC- and ChIP-seq data is PCR amplified. Therefore, you cannot determine which
+strand a read originated from unless you used barcodes to get strand-specific
+data. Hence, the coverage information of a read is always added to both strands.
+The advantages are:
+- Open chromatin and closed chromatin regions always apply to both strands anyway.
+- The addition to both strands allows for built-in data augmentation.
+
+### 2.1. Create h5 file from fasta file
+```bash
+fasta2h5.py --species <full_species_name> --h5-output-path <species/filename>.h5 \
+--fasta-path <path_to_genome>/<species>.fa
+# full species name example: Arabidopsis_thaliana
+```
+
+> **H5 base file:** To add annotations or datasets, the fasta h5 file needs to be
+> created **first**. The additional data will then be added to the fasta h5 file.
+
+### 2.2. Add ATAC- and/or ChIP-seq data
+Multiple datasets should be added separately. The shift argument is a special case
+for ATAC-seq data: it shifts the reads by +4 (+ strand) or -5 (- strand) base
+pairs, as is typically done.
+```bash
+python3 <path_to_helixer>/helixer/evaluation/add_ngs_coverage.py \
+-s <full_species_name> -d <species>.h5 -b <path_to_bam_files>/*.bam \
+--dataset-prefix <ngs_prefix> --unstranded --threads <number_of_bam_files_to_add> \
+(--shift)
+# multiple threads are faster, but --threads 0 also works
+# ngs prefix example: atacseq
+# --unstranded only for unstranded data
+```
+
+### 2.3. Add annotation(s) (optional)
+To add an annotation to the h5 file, it needs to be converted from gff/gff3 to a
+sqlite file using GeenuFF. If both the Helixer and the reference annotation are
+desired, one of them needs to be added using ``--add-additional <annotation_name>``
+(Helixer is the better option; suggested annotation name: helixer_post).
+[More details about the annotation data](https://github.com/weberlab-hhu/Helixer/blob/main/docs/h5_data.md)
+```bash
+# to sqlite
+import2geenuff.py --gff3 <helixer_annotation>.gff3 --fasta <species>.fa \
+--species <full_species_name> --db-path <helixer_geenuff>.sqlite3 \
+--log-file output.log
+# --gff3 <reference>.gff to convert the reference annotation instead
+
+# add annotation
+geenuff2h5.py --h5-output-path <species>.h5 --input-db-path <helixer_geenuff>.sqlite3 \
+--no-multiprocess --modes y,anno_meta,transitions
+# modes: don't add X again when using --add-additional
+```
+
+Resulting new file structure (when adding the reference annotation directly, i.e.
+not to alternative, and adding the Helixer annotation to alternative):
+```raw
+<file>.h5
+|
+|
+├── data
+|   ├── X (encoded DNA sequence)
+|   ├── seqids (chromosome/scaffold names)
+|   ├── species
+|   ├── start_ends (start and end base per chunk)
+|   ├── err_samples
+|   ├── fully_intergenic_samples
+|   ├── gene_lengths
+|   ├── is_annotated
+|   ├── phases
+|   ├── sample_weights
+|   ├── transitions
+|   └── y
+|
+├── evaluation
+|   ├── <dataset>_meta
+|   |   └── bam_files
+|   ├── <dataset>_coverage (in reads per base pair)
+|   └── <dataset>_spliced_coverage
+|
+└── alternative
+    └── helixer_post
+        ├── err_samples
+        ├── fully_intergenic_samples
+        ├── gene_lengths
+        ├── is_annotated
+        ├── phases
+        ├── sample_weights
+        ├── transitions
+        └── y
+```
+
+**Common errors:**
+- GeenuFF is very strict, so converting the annotation to sqlite frequently fails,
+and the reference gff needs to be edited slightly for it to be converted.
+- The full species names don't match up between script calls.
+
+## References
+- Stiehler, F., Steinborn, M., Scholz, S., Dey, D., Weber, A. P. M., & Denton,
+A. K. (2021). Helixer: cross-species gene annotation of large eukaryotic genomes
+using deep learning. Bioinformatics, 36(22–23), 5291–5298.
+https://doi.org/10.1093/BIOINFORMATICS/BTAA1044
+- SRA Toolkit - GitHub. (n.d.). Retrieved September 19, 2022,
+from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
+- Weberlab: Institute of Plant Biochemistry, HHU: GitHub. (n.d.). Retrieved
+May 23, 2022, from https://github.com/weberlab-hhu
\ No newline at end of file
--
GitLab