diff --git a/assays/Metrics/README.md b/assays/Metrics/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assays/Metrics/dataset/.gitkeep b/assays/Metrics/dataset/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assays/Metrics/dataset/README.md b/assays/Metrics/dataset/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..afd302cf62f1923764466929a83999a0679c1b32
--- /dev/null
+++ b/assays/Metrics/dataset/README.md
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:a04385c9094cb5337c706fb90f474163b01f50f189ad396780abfdf952ba06bb
+size 3110
diff --git a/assays/Metrics/dataset/vbae074f4.jpeg b/assays/Metrics/dataset/vbae074f4.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..52aa059e09c29f7bfe60637e6d32f684ea154687
--- /dev/null
+++ b/assays/Metrics/dataset/vbae074f4.jpeg
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:e579680293d3936defa04375a20c3d136a368b4184b57e656d39a8fc02690326
+size 366838
diff --git a/assays/Metrics/dataset/vbae074f5.jpeg b/assays/Metrics/dataset/vbae074f5.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..3b393cdb118b446cc360f3f4ed40f0e4e64452fa
--- /dev/null
+++ b/assays/Metrics/dataset/vbae074f5.jpeg
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b450c6447bc0a414336ed2fa240202d121f06eeae1dcc905f5e5d90ef4afe92a
+size 376869
diff --git a/assays/Metrics/dataset/vbae074f6.jpeg b/assays/Metrics/dataset/vbae074f6.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..b1051e61f1d2c8b44357e1085f7d5bc1dc2e5379
--- /dev/null
+++ b/assays/Metrics/dataset/vbae074f6.jpeg
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:2dce5c51da05f36478dfd98396dc9911256ec5da2e0696873a377396fe020753
+size 433285
diff --git a/assays/Metrics/isa.assay.xlsx b/assays/Metrics/isa.assay.xlsx
new file mode 100644
index 0000000000000000000000000000000000000000..a59c41f7d5d36e1bed1a54df427b7f84b8fc0267
Binary files /dev/null and b/assays/Metrics/isa.assay.xlsx differ
diff --git a/assays/Metrics/protocols/.gitkeep b/assays/Metrics/protocols/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assays/Metrics/protocols/MetricsProtocol.md b/assays/Metrics/protocols/MetricsProtocol.md
new file mode 100644
index 0000000000000000000000000000000000000000..c09714a2f19afcff40d0c8fb83023a4d04c2e47a
--- /dev/null
+++ b/assays/Metrics/protocols/MetricsProtocol.md
@@ -0,0 +1,24 @@
+## Metrics
+
+Five metrics were used to evaluate model performance: the Poisson loss, the Pearson correlation coefficient (Pearson’s r), precision, recall, and F1.
+
+The most prominent peak caller for ChIP-seq data, MACS (Zhang et al. 2008), which was also frequently used for ATAC-seq data (Hiranuma et al. 2017, Thibodeau et al. 2018, Hentges et al. 2022), assumes that the ChIP-seq coverage data is Poisson distributed. Therefore, PyTorch’s Poisson negative log likelihood loss function (Poisson loss) was used as the loss function for all models (Equation 1).
+
+(1) $$loss=\frac{1}{n} \sum_{i=1}^n e^{x_i} - y_i \ast x_i$$
+
+The individual samples of the predictions $x$ and the targets $y$ are indexed with $i$.
The sample size is denoted with $n$
+(https://pytorch.org/docs/stable/generated/torch.nn.PoissonNLLLoss.html). This version of the Poisson loss caused the network to output logarithmic predictions. The desired, actual predictions were thus the exponential of the network’s output. Taking the exponential guarantees positive real numbers, matching the ATAC- and ChIP-seq read coverage.
+
+To measure the “accuracy” of the model’s predictions, i.e. to translate the Poisson loss into a human-readable number, Pearson’s r was chosen (Equation 2), which measures the linear correlation between two variables.
+
+(2) $$r=\frac{\sum_{i=1}^n (x_i - \overline{x}) (y_i - \overline{y})}{\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2 \sum_{i=1}^n (y_i - \overline{y})^2} + \epsilon}$$
+
+The sample size is denoted with $n$; the individual samples of the predictions $x$ and the targets $y$ are indexed with $i$. The additional epsilon ($\epsilon$) equals 1e-8 and is used to avoid division by zero. A value of 1 represents a perfect positive linear relationship, so Predmoter’s predictions and the experimental NGS coverage data would be identical. A value of 0 means no linear relationship between the predictions and targets. Finally, a value of −1 represents a perfect negative linear relationship.
+
+Precision, recall, and F1 were used to compare predicted peaks to experimental peaks for both test species (Equations 3–5). An F1 score of 1 indicates that the predicted peaks are at the same position as the experimental peaks. The lowest score possible is 0. Precision, recall, and F1 were calculated base-wise. Called peaks were denoted with 1, all other base pairs with 0. A confusion matrix containing the sum of True Positives (TP), False Positives (FP), and False Negatives (FN) for the two classes, peak and no peak, was computed for the average predicted coverage of both strands. Precision and recall were also utilized to plot precision-recall curves (PRC). The area under the precision-recall curve (AUPRC) was calculated using scikit-learn (Pedregosa et al. 2011). Flagged sequences were excluded from the calculations (see Section 2.1.2). The baseline AUPRC is equal to the fraction of positives, i.e. the percentage of peaks in the training set (Saito and Rehmsmeier 2015). The peak percentages were calculated using Predmoter’s compute_peak_f1.py script in “side_scripts.” The percentages are listed in Supplementary Table S8.
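As a purely illustrative aid, the following minimal PyTorch sketch implements Equations (1) and (2); the tensor names (`log_preds`, `targets`) are placeholders and this is not Predmoter’s actual code.

```python
import torch

def poisson_loss(log_preds: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Equation (1): mean(exp(x) - y * x); corresponds to torch.nn.PoissonNLLLoss
    # with log_input=True and full=False, i.e. the network outputs log predictions
    return (torch.exp(log_preds) - targets * log_preds).mean()

def pearson_r(preds: torch.Tensor, targets: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Equation (2): linear correlation with a small epsilon against division by zero
    x = preds - preds.mean()
    y = targets - targets.mean()
    return (x * y).sum() / (torch.sqrt((x ** 2).sum() * (y ** 2).sum()) + eps)

# example usage with random stand-ins for network output and coverage targets
log_preds = torch.randn(2, 21384)                     # logarithmic network output
targets = torch.poisson(torch.full((2, 21384), 3.0))  # experimental coverage stand-in
loss = poisson_loss(log_preds, targets)
r = pearson_r(torch.exp(log_preds), targets)          # actual predictions = exp(output)
```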
+
+(3) $$precision = \frac{TP}{TP+FP}$$
+
+(4) $$recall = \frac{TP}{TP+FN}$$
+
+(5) $$F_1 = 2 \ast \frac{precision \ast recall}{precision+recall}$$
\ No newline at end of file
diff --git a/assays/PeakCalling/README.md b/assays/PeakCalling/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assays/PeakCalling/dataset/.gitkeep b/assays/PeakCalling/dataset/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assays/PeakCalling/dataset/README.md b/assays/PeakCalling/dataset/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..94493ec7ccfdef5af08bc3681bf156dd5ddb2db9
--- /dev/null
+++ b/assays/PeakCalling/dataset/README.md
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c8327a7510d3266d1eb0e491449e35d9935e07c71adcff790e9fba169ab3fdb5
+size 1148
diff --git a/assays/PeakCalling/dataset/vbae074f7.jpeg b/assays/PeakCalling/dataset/vbae074f7.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..ced0a31700d7c66e24756aa931a55aec20751aea
--- /dev/null
+++ b/assays/PeakCalling/dataset/vbae074f7.jpeg
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c7f8fb60b659d2c1e9c94300d8ad889d1ae4f15ffb395a06dd437a634da76f57
+size 481791
diff --git a/assays/PeakCalling/isa.assay.xlsx b/assays/PeakCalling/isa.assay.xlsx
new file mode 100644
index 0000000000000000000000000000000000000000..439d7c0d741efc7ba9226de45235400fde2a7f0b
Binary files /dev/null and b/assays/PeakCalling/isa.assay.xlsx differ
diff --git a/assays/PeakCalling/protocols/.gitkeep b/assays/PeakCalling/protocols/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assays/PeakCalling/protocols/PeakCallingProtocol.md b/assays/PeakCalling/protocols/PeakCallingProtocol.md
new file mode 100644
index 0000000000000000000000000000000000000000..bc6e831bff1aa7bd9270c43c823a987b384ee6ff
--- /dev/null
+++ b/assays/PeakCalling/protocols/PeakCallingProtocol.md
@@ -0,0 +1,7 @@
+## Peak calling
+
+Peak calling on predictions and the experimental data was performed with MACS3 (Zhang et al. 2008). The sample bam files of the experimental data per species and dataset were merged. Then peaks were called on the merged bam files with MACS3’s “callpeak” command. The parameters for calling ATAC-seq peaks were the BAMPE format, a q-value of 0.01, keeping all duplicates, using the background lambda as local lambda (“no-lambda”) and the ungapped genome size of the species’ genome assembly (see Supplementary Table S2) as mappable genome size. For ChIP-seq peak calling, two parameters were added: broad and a broad cutoff of 0.1. The chosen q-value was the default 0.05. The ChIP-seq peaks of the species *S.polyrhiza* and *Chlamydomonas reinhardtii* were called using the format BAM instead of BAMPE. MACS3’s “bdgpeakcall” was used to call peaks on the test species predictions in bedGraph file format. The parameters for peak calling were the same ones MACS3’s “callpeak” determined for the experimental data, i.e. for paired end reads the minimum length and maximum gap are set to the predicted fragment size (Table 5). The cutoff value, i.e. the threshold of the minimum read coverage required to call a peak, was estimated by plotting the average read coverage of predictions around the TSS (see Fig. 5b).
+
+Different cutoff values were also examined. For the ATAC-seq predictions of *A. 
thaliana*, cutoffs in the range of 1 to 25 with a step of 1 were chosen, and for *O. sativa* cutoffs in the range of 5 to 200 with a step of 5, additionally including a cutoff of 1 at the start. For the ChIP-seq predictions of both species, cutoffs in the range of 5 to 100 with a step of 5, including a cutoff of 1 at the start, were chosen.
+
+The selected parameters of MACS3’s “bdgpeakcall” for each test species and dataset are listed.
\ No newline at end of file
diff --git a/assays/SpeciesSelection/README.md b/assays/SpeciesSelection/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assays/SpeciesSelection/dataset/.gitkeep b/assays/SpeciesSelection/dataset/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assays/SpeciesSelection/dataset/README.md b/assays/SpeciesSelection/dataset/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a47b74250dd6b03e4e99b19daed9b6dea03dccdd
--- /dev/null
+++ b/assays/SpeciesSelection/dataset/README.md
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:732c169207e9f3d10a75f3708837f60a38ae529bce130530cf45f0e9735617d0
+size 5441
diff --git a/assays/SpeciesSelection/dataset/vbae074f3.jpeg b/assays/SpeciesSelection/dataset/vbae074f3.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..915295ed05cf1e05dd72dd0dc720d727c0b7468a
--- /dev/null
+++ b/assays/SpeciesSelection/dataset/vbae074f3.jpeg
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b1862e49d4f63a10b82589f2d6799d138b48daeaf14eede9534087d719e9ee04
+size 429409
diff --git a/assays/SpeciesSelection/isa.assay.xlsx b/assays/SpeciesSelection/isa.assay.xlsx
new file mode 100644
index 0000000000000000000000000000000000000000..a0d25bd0be3c9674a3ebd966f861be01032dc323
Binary files /dev/null and b/assays/SpeciesSelection/isa.assay.xlsx differ
diff --git a/assays/SpeciesSelection/protocols/.gitkeep b/assays/SpeciesSelection/protocols/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/assays/SpeciesSelection/protocols/Cross-speciesPredictionModels.md b/assays/SpeciesSelection/protocols/Cross-speciesPredictionModels.md
new file mode 100644
index 0000000000000000000000000000000000000000..0a2100805fb2a648b6d8716ac24485f5a823c65f
--- /dev/null
+++ b/assays/SpeciesSelection/protocols/Cross-speciesPredictionModels.md
@@ -0,0 +1,9 @@
+## Cross-species prediction models
+
+Ensuring a diverse range of species in the training set, while simultaneously reserving enough data for validation and testing to effectively evaluate the models’ generalization ability, proved difficult. At the start of development, the amount of high-quality, publicly available ATAC-seq data was low. Around 60% of the plant ATAC-seq data on SRA available up until July 2023 needed to be discarded after the final quality control. This left the ATAC-seq data of the 14 plant species used in this study. In later development stages, three more ATAC-seq datasets, from *Actinidia chinensis*, *Panicum miliaceum* and *Sorghum bicolor*, and two more ChIP-seq datasets corresponding to acquired ATAC-seq datasets, from *A.chinensis* and *M.polymorpha*, became available. The low availability of high-quality data, especially in early development stages, turned out to be a major hindrance in providing the network with an appropriate amount of data to train on. 
Data of two species, *A.thaliana* and *O.sativa*, was set aside as a hold-out test set. In doing so, both a dicot and a monocot species with available ATAC- and ChIP-seq datasets could be used for final evaluation. The same applied to the two validation species, the dicot *Medicago truncatula* and the monocot *S.polyrhiza* (Table 3).
+
+The resulting training, validation, and test split for the ATAC-seq models, ChIP-seq models and Combined models was around 90% training set, 5% validation set and 5% test set (Fig. 3a).
+
+The model training pairs were visualized using the Uniform Manifold Approximation and Projection (UMAP) learning technique for dimension reduction (McInnes et al. 2018). Random training pairs, 5% of each species in the training set, were used to calculate the UMAPs. Gap subsequences and flagged sequences were not included. The chosen parameters were 10 neighbors, 0.1 minimum distance and the Euclidean distance metric. The additional species datasets, added in later development stages, were included. None of the available settings and metrics for UMAP computation showed distinct clusters based on the number of peaks within the input (Fig. 3b).
+
+The first seven models were trained only on the species for which high-quality experimental ATAC-seq data was available up until July 2023. The same applied to the BiHybrid_05 model using ChIP-seq data. The Combined model used both datasets. The Combined_02 model used additional data of four species. Gap subsequences were masked for all models; unplaced scaffolds and non-nuclear sequences were masked starting with model BiHybrid_04.
\ No newline at end of file
diff --git a/assays/SpeciesSelection/protocols/Intra-speciesModelsAndLeave-one-outCross-validation.md b/assays/SpeciesSelection/protocols/Intra-speciesModelsAndLeave-one-outCross-validation.md
new file mode 100644
index 0000000000000000000000000000000000000000..615b0fc185816349ce08e808211ccf817664ca40
--- /dev/null
+++ b/assays/SpeciesSelection/protocols/Intra-speciesModelsAndLeave-one-outCross-validation.md
@@ -0,0 +1,5 @@
+## Intra-species models and leave-one-out cross-validation
+
+Cross-species validation instead of an in-species split for the validation and training data was deemed closer to the real-world use case of predicting ATAC- and ChIP-seq data for an entire species. However, two models were trained using an intra-species training and validation split. These models, IS_10 and IS_20, used 10% and 20% of each species’ dataset as the validation set, respectively. The input files were split using Predmoter’s intra_species_train_val_split.py script in “side_scripts.” This method ensured that each sequence ID from the original fasta file was fully assigned to either the training or the validation set. Since the focus of this study is on cross-species prediction, all 25 plant species were used in leave-one-out cross-validation (LOOCV) to evaluate the best model setup on different species. All these setups were trained on ATAC- and ChIP-seq datasets simultaneously (Table 4). When performing LOOCV, the model performance was evaluated on all datasets available for the left-out species.
+
+All models excluded gap subsequences, subsequences of 21 384 bp only containing Ns, and flagged subsequences. For more details on exact model parameters see Supplementary Section S1.3 and Supplementary Table S4.
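To make the two split strategies above concrete, here is a small hedged sketch; the function, the sequence IDs and the species names are illustrative placeholders, not the implementation of intra_species_train_val_split.py.

```python
import random

def intra_species_split(seqids, val_fraction=0.1, seed=42):
    """Assign each sequence ID fully to either the training or the validation set.

    Simplification: the fraction is taken over the number of sequence IDs,
    not over base pairs as a split of the actual dataset would require.
    """
    rng = random.Random(seed)
    shuffled = list(seqids)
    rng.shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]  # training IDs, validation IDs

# hypothetical sequence IDs of one species; IS_10 ~ 10%, IS_20 ~ 20% validation
train_ids, val_ids = intra_species_split([f"Chr{i}" for i in range(1, 11)], val_fraction=0.2)

# leave-one-out cross-validation over the species set (three names shown as examples)
species = ["Arabidopsis_thaliana", "Oryza_sativa", "Medicago_truncatula"]
for left_out in species:
    train_species = [s for s in species if s != left_out]
    # a model would be trained on train_species and then evaluated on every
    # ATAC-/ChIP-seq dataset available for the left-out species
    print(f"LOOCV fold: train on {len(train_species)} species, test on {left_out}")
```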
\ No newline at end of file
diff --git a/isa.investigation.xlsx b/isa.investigation.xlsx
index f0b490a88535575f2a3a8668373309b9575e80a0..cad7afb6cf4d6a73b0a960fe2627d7f4925c598e 100644
Binary files a/isa.investigation.xlsx and b/isa.investigation.xlsx differ
diff --git a/studies/ArchitectureAndProposedModels/README.md b/studies/ArchitectureAndProposedModels/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/studies/ArchitectureAndProposedModels/isa.study.xlsx b/studies/ArchitectureAndProposedModels/isa.study.xlsx
new file mode 100644
index 0000000000000000000000000000000000000000..d30bf64aa6f3955d817388f52622ac1e5ca77fb5
Binary files /dev/null and b/studies/ArchitectureAndProposedModels/isa.study.xlsx differ
diff --git a/studies/ArchitectureAndProposedModels/protocols/.gitkeep b/studies/ArchitectureAndProposedModels/protocols/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/studies/ArchitectureAndProposedModels/protocols/ArchitectureAndProposedModelsProtocol.md b/studies/ArchitectureAndProposedModels/protocols/ArchitectureAndProposedModelsProtocol.md
new file mode 100644
index 0000000000000000000000000000000000000000..f3908387084c85c96751d8b56965576bd588349d
--- /dev/null
+++ b/studies/ArchitectureAndProposedModels/protocols/ArchitectureAndProposedModelsProtocol.md
@@ -0,0 +1,11 @@
+## Architecture and proposed models
+
+The model architectures were implemented using PyTorch Lightning (Falcon 2019) on top of PyTorch (Paszke et al. 2019). The model used supervised learning, a method that connects an input to an output based on example input–output pairs (Russell and Norvig 2016).
+
+The input for the model was a genomic DNA sequence. The nucleotides were encoded into four-dimensional vectors (see Supplementary Table S1). The DNA sequence of a given plant species was cut into subsequences of 21 384 bp. This number was large enough to contain typical gene lengths of plants while being divisible by ten of the numbers from one to twenty. An easily divisible subsequence length is a requirement for Predmoter (see Supplementary Section S1.2). As few chromosomes, scaffolds or contigs were divisible by 21 384 bp, sequence ends as well as short sequences were padded with the vector [0., 0., 0., 0.]. Padded base pairs were masked during training. If a subsequence only contained N bases, here referred to as “gap subsequence,” it was filtered out. Both strands, plus and minus, were used. Since the ATAC- and ChIP-seq data was PCR amplified, and it was therefore not possible to determine from which strand a read originated, the coverage information was always added to both strands. The model’s predictions for either ATAC-seq, ChIP-seq or both were compared to the experimental read coverage. The target data were represented per sample of experimental data. These samples were averaged beforehand, resulting in one coverage track per NGS dataset and plant species.
+
+Three main model architectures were evaluated for their performance. The first architecture consisted of convolutional layers followed by transposed convolutional layers for deconvolution (LeCun et al. 1989, LeCun and Bengio 1995). The deconvolution was added to output base-wise predictions. We refer here to this architecture as U-Net. To ensure that the new sequence lengths resulting from a convolution or deconvolution were correct, custom padding formulas were used (Supplementary Section S1.2). 
Our second approach was a hybrid network. A block of long short-term memory layers (LSTM) (Hochreiter and Schmidhuber 1997) was placed in between a convolutional layer block and a transposed convolutional layer block. The final approach was called bi-hybrid. Its architecture matched the hybrid architecture, except that the LSTM layers were replaced with bidirectional LSTM layers (BiLSTM) (Hochreiter and Schmidhuber 1997, Schuster and Paliwal 1997). Each convolutional and transposed convolutional layer was followed in all three approaches by the ReLU activation function (Glorot et al. 2011). Additional augmentations to the bi-hybrid network included adding batch normalization after each convolutional and transposed convolutional layer and adding a dropout layer after each BiLSTM layer except the last (Fig. 2). The Adam algorithm was used as an optimization method (Kingma and Ba 2014). The network’s base-wise predictions can be smoothed via a postprocessing step utilizing a rolling mean of a given window size. + +We examined 10 different model setups (Table 2). The best model of each architecture and dataset combination was used to develop the next combination test. The model reaching the highest Pearson’s correlation for the validation set was deemed the best model. Pre-tests showed that including gap subsequences, subsequences of 21 384 bp only containing Ns, led to a considerably lower Pearson’s correlation. The proportion of gap subsequences in the total data was 0.6%. Normalizing the NGS coverage data through a general approach of subtracting the average coverage from the dataset and using a ReLU transformation (Glorot et al. 2011) showed notably worse results during previous attempts. The approach of normalizing via an input sample was not feasible due to the considerable lack of available ATAC-seq input samples accompanying the experiments. Therefore, the target data was not adjusted towards its sequencing depth. For more information about the training process see Supplementary Section S1.3. + +All models excluded gap subsequences, subsequences of 21 384 bp only containing Ns. For more details on species selection and exact model parameters see Supplementary Table S4. Models excluding subsequences of unplaced scaffolds and non-nuclear sequences during training and testing are denoted with *. \ No newline at end of file diff --git a/studies/ArchitectureAndProposedModels/resources/.gitkeep b/studies/ArchitectureAndProposedModels/resources/.gitkeep new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/studies/ArchitectureAndProposedModels/resources/README.md b/studies/ArchitectureAndProposedModels/resources/README.md new file mode 100644 index 0000000000000000000000000000000000000000..4f23ba3fe9ed28c23d57fee5ebe590d1f50bff7e --- /dev/null +++ b/studies/ArchitectureAndProposedModels/resources/README.md @@ -0,0 +1,22 @@ +<img src=./dataset/vbae074f2.jpeg width=60%> + +**Figure 2.** Predmoter architecture and prediction process. The bi-hybrid architecture with batch normalization and dropout is schematically depicted. Not to scale. Hyperparameters are examples and can vary. The base-wise predictions and smoothed predictions are from an example subsequence from *A. thaliana*. + +## Figure source + +https://academic.oup.com/view-large/figure/464175134/vbae074f2.tif + +**Table 2.** Model architecture and dataset explanation (short). 
+ +| **Model name** | **Dataset** | **Architecture** | | +|----------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------|---| +| U-Net | ATAC-seq | 3 convolutional layers + 3 transposed convolutional layers | | +| Hybrid | ATAC-seq | U-Net + 2 LSTM layers | | +| BiHybrid | ATAC-seq | U-Net + 2 BiLSTM layers | | +| BiHybrid_02 | ATAC-seq | U-Net + 2 BiLSTM layers + 6 batch normalization layers | | +| BiHybrid_03.1 (see Fig. 2) | ATAC-seq | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) | | +| BiHybrid_03.2 | ATAC-seq | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.5) | | +| BiHybrid_04 | ATAC-seq, filtered flagged sequences* | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) | | +| BiHybrid_05 | ChIP-seq (H3K4me3), filtered flagged sequences* | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) | | +| Combined | ATAC-seq, ChIP-seq (H3K4me3), filtered flagged sequences* | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) | | +| Combined_02 | ATAC-seq, ChIP-seq (H3K4me3), filtered flagged sequences* (+ additional data) | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) | | \ No newline at end of file diff --git a/studies/ArchitectureAndProposedModels/resources/vbae074f2.jpeg b/studies/ArchitectureAndProposedModels/resources/vbae074f2.jpeg new file mode 100644 index 0000000000000000000000000000000000000000..63defdf4bf6d63d6c40decb1dc681ba63fbfc68b Binary files /dev/null and b/studies/ArchitectureAndProposedModels/resources/vbae074f2.jpeg differ diff --git a/studies/Data/README.md b/studies/Data/README.md new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/studies/Data/isa.study.xlsx b/studies/Data/isa.study.xlsx new file mode 100644 index 0000000000000000000000000000000000000000..e5f84c799430250a8c2a167e46cf41bcbb3725e5 Binary files /dev/null and b/studies/Data/isa.study.xlsx differ diff --git a/studies/Data/protocols/.gitkeep b/studies/Data/protocols/.gitkeep new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/studies/Data/protocols/DataOverviewAndPreprocessing.md b/studies/Data/protocols/DataOverviewAndPreprocessing.md new file mode 100644 index 0000000000000000000000000000000000000000..0f48f0e59c565e533570ba924837979400012fb2 --- /dev/null +++ b/studies/Data/protocols/DataOverviewAndPreprocessing.md @@ -0,0 +1,5 @@ +## Data overview and preprocessing + +The entire dataset consisted of 25 plant genomes, for 17 of which genome-wide ATAC-seq data was publicly available and for 21 of which genome-wide ChIP-seq (H3K4me3) data was publicly available (see Table 1 and Supplementary Table S2). A wide variety of tissues and treatments were used in these ATAC- and ChIP-seq experiments which are listed in Supplementary Table S3. The NGS data was downloaded from the sequence read archive (SRA) using the SRA-Toolkit 3.0.0 (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). The reads were trimmed with Trimmomatic 0.36 (Bolger et al. 
2014) and quality controlled using FastQC 0.11.9 (Andrews 2010) and MultiQC (Ewels et al. 2016). If the reads passed quality control, they were mapped to the reference genome using BWA 2.1 (Md et al. 2019). Conversion to bam files was performed using SamTools 1.6 (Danecek et al. 2021). The Picard Toolkit (Broad Institute ed 2019) was used to mark duplicates. The duplicates, unmapped reads, non-primary alignments and reads not passing platform quality checks were removed with SamTools. Plots for quality control were generated using deepTools 3.5.3 (Ramírez et al. 2016) and the necessary genome annotations were generated using Helixer v.0.3.1 (Stiehler et al. 2021, Holst et al. 2023). ATAC-seq data was deemed of high enough quality if the average coverage enrichment ±3 kbp around the TSS showed the expected peak and the average peak read coverage was at least 2.5 times the background coverage. The quality control for ChIP-seq data was performed using the same criteria. A detailed data preprocessing documentation is available at: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/data_preprocessing.md. The plant genome fasta files and final NGS data bam files were converted to h5 files using Helixer (Stiehler et al. 2021, Holst et al. 2023). The ATAC-seq reads were shifted +4 bp on the positive strand and −5 bp on the negative strand to adjust the read start sites to represent the center of the transposase binding site (Buenrostro et al. 2013). A detailed documentation of the h5 file creation and architecture is available at: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/h5_files.md.
+
+The species used in the development of Predmoter are separated into the four domains algae, mosses, monocots, and dicots. The availability and usage of the species dataset for ATAC- or ChIP-seq is indicated by a check mark.
\ No newline at end of file
diff --git a/studies/Data/protocols/FilteringFlaggedSequences.md b/studies/Data/protocols/FilteringFlaggedSequences.md
new file mode 100644
index 0000000000000000000000000000000000000000..0e4cdeb2f1ae3442f31af525d3d00817163e4ce2
--- /dev/null
+++ b/studies/Data/protocols/FilteringFlaggedSequences.md
@@ -0,0 +1,3 @@
+## Filtering flagged sequences
+
+A naïve filtering approach was used to reduce the noise in the dataset. The ATAC-seq data showed high coverage for non-nuclear sequences. The transposase primarily cuts open chromatin (Buenrostro et al. 2013) and therefore also cuts the chloroplast and mitochondrial genomes. When the organelles were not completely removed before the experiment, the data contained noise in the form of notably higher coverage in these regions. Unplaced scaffolds were also observed to contribute to this noise during the data quality control steps (Fig. 1a). Therefore, unplaced scaffolds and non-nuclear sequences were flagged during later development stages (see Section 2.2 and Tables 2 and 3). Assemblies on scaffold or contig level, *Bigelowiella natans, Eragrostis nindensis, Marchantia polymorpha, Oropetium thomaeum, Pyrus x bretschneideri*, and *Spirodela polyrhiza*, were not flagged. The flagged sequences were filtered out (Fig. 1b). 
The information about the assembly accessions of the unplaced scaffolds and non-nuclear sequences was extracted from the sequence report jsonl files available at NCBI’s RefSeq or GenBank and added to the h5 file (under “data/blacklist”) via add_blacklist.py in “side_scripts.” The flagged sequences made up around 7% of all genome assemblies used, not counting assemblies on scaffold or contig level.
\ No newline at end of file
diff --git a/studies/Data/resources/.gitkeep b/studies/Data/resources/.gitkeep
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/studies/Data/resources/README.md b/studies/Data/resources/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..171a17279ed36fd26356a6b8d78ee09cdb6281df
--- /dev/null
+++ b/studies/Data/resources/README.md
@@ -0,0 +1,37 @@
+<img src=./dataset/vbae074f1.jpeg width=60%>
+
+**Figure 1.** Average ATAC- and ChIP-seq coverage ±3 kbp around the TSS for each species in the dataset. (a) Average ATAC-seq coverage including unplaced scaffolds and non-nuclear sequences. (b) Average ATAC- and ChIP-seq coverage excluding unplaced scaffolds and non-nuclear sequences. The species are sorted into the three categories dicots, monocots, and algae/mosses.
+
+## Figure source
+
+https://academic.oup.com/view-large/figure/464175126/vbae074f1.tif
+
+**Table 1.** Plant genomes and available datasets.
+
+| **Domain** | **Species** | **ATAC-seq** | **ChIP-seq (H3K4me3)** |
+|------------|-----------------------------|--------------|------------------------|
+| Algae | _Bigelowiella natans_ | ✔ | |
+| | _Chlamydomonas reinhardtii_ | | ✔ |
+| Mosses | _Marchantia polymorpha_ | ✔ | ✔ |
+| Monocots | _Brachypodium distachyon_ | ✔ | ✔ |
+| | _Eragrostis nindensis_ | ✔ | ✔ |
+| | _Oropetium thomaeum_ | ✔ | |
+| | _Oryza brachyantha_ | | ✔ |
+| | _Oryza sativa_ | ✔ | ✔ |
+| | _Panicum miliaceum_ | ✔ | |
+| | _Setaria italica_ | | ✔ |
+| | _Sorghum bicolor_ | ✔ | |
+| | _Spirodela polyrhiza_ | ✔ | ✔ |
+| | _Zea mays_ | ✔ | ✔ |
+| Dicots | _Actinidia chinensis_ | ✔ | ✔ |
+| | _Arabidopsis thaliana_ | ✔ | ✔ |
+| | _Brassica napus_ | ✔ | ✔ |
+| | _Brassica oleracea_ | | ✔ |
+| | _Brassica rapa_ | | ✔ |
+| | _Glycine max_ | ✔ | ✔ |
+| | _Malus domestica_ | ✔ | ✔ |
+| | _Medicago truncatula_ | ✔ | ✔ |
+| | _Prunus persica_ | | ✔ |
+| | _Pyrus x bretschneideri_ | | ✔ |
+| | _Sesamum indicum_ | | ✔ |
+| | _Solanum lycopersicum_ | ✔ | ✔ |
\ No newline at end of file
diff --git a/studies/Data/resources/vbae074f1.jpeg b/studies/Data/resources/vbae074f1.jpeg
new file mode 100644
index 0000000000000000000000000000000000000000..6b1493d73a04b7ca98c2e3b6e48b49d8c1cea62f
Binary files /dev/null and b/studies/Data/resources/vbae074f1.jpeg differ
diff --git a/workflows/README.md b/workflows/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..d352f2fc35844147d9fdfc07c1a1da41fb5db3e3
--- /dev/null
+++ b/workflows/README.md
@@ -0,0 +1,5 @@
+## Sources
+
+Detailed data preprocessing documentation: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/data_preprocessing.md
+
+Detailed documentation of the h5 file creation and architecture: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/h5_files.md
\ No newline at end of file
diff --git a/workflows/data_preprocessing.md b/workflows/data_preprocessing.md
new file mode 100644
index 0000000000000000000000000000000000000000..a5fb843ee805dccfec588eab571b7d31975fcc18
--- /dev/null
+++ b/workflows/data_preprocessing.md
@@ -0,0 +1,265 @@
+# Data preprocessing documentation
+This documentation is about preprocessing ATAC- and ChIP-seq data. The pipeline
+is used to create bam files from public next generation sequencing (NGS) data
+(fastq files) downloaded from NCBI's (National Center for Biotechnology Information)
+SRA (Sequence Read Archive).
+
+## 1. Required tools
+The tool version numbers are the ones used during this project.
+
+- SRA-Toolkit 3.0.0 (SRA Toolkit - GitHub)
+- Java 1.8.0 (Arnold et al., 2005)
+- Trimmomatic 0.36 (Bolger et al., 2014)
+- FastQC 0.11.9 (Babraham Bioinformatics - FastQC A Quality Control Tool
+for High Throughput Sequence Data)
+- MultiQC 1.14 (Ewels et al., 2016)
+- BWA 2.1 (Md et al., 2019)
+- SamTools 1.6 (Danecek et al., 2021)
+- Picard (Picard Tools - By Broad Institute, n.d.)
+- deepTools 3.5.1 (Ramírez et al., 2016)
+- Helixer 0.3.1 (Stiehler et al., 2021; Weberlab: Institute of Plant Biochemistry,
+HHU: GitHub, n.d.)
+- Helixer Post (https://github.com/TonyBolger/HelixerPost)
+- gffread (Pertea & Pertea, 2020)
+- GeenuFF 0.3.0 (Weberlab: Institute of Plant Biochemistry, HHU: GitHub, n.d.)
+
+## 2. Pipeline
+| Category | Download data | Trimming | Quality control | Mapping | Quality control | Deduplication/Data cleaning | Quality control |
+|:--------:|:-------------:|:-------------------:|:---------------:|:---------------:|:---------------:|:---------------------------:|:-------------------------------:|
+| Tools | SRA Toolkit | Trimmomatic<br>Java | FastQC | BWA<br>SamTools | SamTools | Picard<br>Java<br>SamTools | deepTools<br>Helixer<br>gffread |
+
+
+### 2.1. Download data
+**Tools:** SRA Toolkit
+**Setup:** The path to the toolkit
+``export PATH=$PATH:<path_to>/sratoolkit.3.0.0-ubuntu64/bin`` needs to be in
+your ``.bashrc``. When first configuring the SRA toolkit, be sure to set the local
+file caching to a folder of sufficient size, so **not** your home directory
+(see https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration).
+
+There are two options for downloading the runs from SRA: fastq-dump and
+fasterq-dump. The program calls are similar, with fastq-dump downloading the
+files in gzip format, while fasterq-dump downloads faster but doesn't zip the
+fastq files.
+
+#### 2.1.1. Fastq-dump
+```bash
+fastq-dump <SRR_accession> --gzip --split-3
+
+```
+#### 2.1.2. Fasterq-dump
+```bash
+fasterq-dump <SRR_accession> --split-3
+
+gzip <SRR_accession>_1.fastq <SRR_accession>_2.fastq # for paired end data
+
+```
+
+#### 2.1.3. Special case: single cell data
+```bash
+prefetch <SRR_accession>
+fasterq-dump <SRR_accession> -S --include-technical
+
+gzip <SRR_accession>_3.fastq <SRR_accession>_4.fastq
+```
+Unzipped fastq files also work, as Trimmomatic accepts them as well.
+It is however necessary to rename the forward read from <SRR_accession>_3.fastq to
+<SRR_accession>_1.fastq and the reverse reads from <SRR_accession>_4.fastq to
+<SRR_accession>_2.fastq.
+
+### 2.2. Trimming
+**Tools:** Java 1.8.0, Trimmomatic 0.36
+**Hint:** ATAC-seq data usually uses Nextera adapters, so use NexteraPE-PE.fa for
+paired ATAC-seq and TruSeq3-PE-2.fa for paired ChIP-seq data.
+ +```bash +# paired end data +java -jar -Xmx2048m <path_to_trimmomatic>/trimmomatic-0.36.jar PE -threads 2 \ +-basein <sample_ID>_1.fastq.gz -baseout <out_path>/<sample_ID>.fastq.gz \ +ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-PE-2.fa:3:30:10:1:true \ +MAXINFO:36:0.7 + +# single end data +java -jar -Xmx2048m <path_to_trimmomatic>/trimmomatic-0.36.jar SE -threads 2 \ +<sample_ID>.fastq.gz <out_path>/<sample_ID>.fastq.gz \ +ILLUMINACLIP:<path_to_trimmomatic>/adapters/TruSeq3-SE.fa:2:30:10 MAXINFO:36:0.7 +``` + +### 2.3. Quality control +**Tools:** FastQC +```bash +# paired end data +fastqc <sample_ID>_1P.fastq.gz +fastqc <sample_ID>_2P.fastq.gz + +# single end data +fastqc <sample_ID>.fastq.gz +``` +**Hint:** [MultiQC](https://multiqc.info) was used to generate summary html reports. + +### 2.4. Mapping +**Tools:** BWA, SamTools + +#### 2.4.1. Genome indexing +```bash +bwa-mem2 index -p <path_to_genome>/<species> <path_to_genome>/<species>.fa +``` + +#### 2.4.2. Mapping +```bash +# paired end data +bwa-mem2 mem <path_to_genome>/<species> <(zcat <sample_ID>_1P.fastq.gz) \ +<(zcat <sample_ID>_2P.fastq.gz) > <sample_ID>.sam + +# single end data +bwa-mem2 mem <path_to_genome>/<species> <(zcat <sample_ID>.fastq.gz) > \ +<sample_ID>.sam + +# sort and index +samtools view <sample_ID>.sam -b |samtools sort -T tmp<sample_ID> -@1 \ +-o <sample_ID>.bam +samtools index <sample_ID>.bam +``` + +### 2.5. Quality control +**Tools:** SamTools + +Check for percentage of reads mapped, mates paired, etc. + +```bash +samtools flagstat <sample_ID>.bam > <sample_ID>.txt +``` + +### 2.6. Deduplication/Data cleaning +**Tools:** Picard, Java, SamTools + +Common errors are: +- the picard output just ends without signaling that picard finished and +there is no output bam file +- java.lang.OutOfMemoryError: Java heap space +- java.lang.OutOfMemoryError: GC overhead limit exceeded + +The **solution** is extending the java heap space, so replacing +-Xmx1G with -Xmx2G or even -Xmx3G. + +```bash +# picard +java -Xmx1G -jar <path_to_picard>/picard.jar MarkDuplicates I=<sample_ID>.bam \ +O=<sample_ID>_marked.bam M=<sample_ID>.stats + +# filter with samtools +samtools view -F 1796 <sample_ID>_marked.bam -b > <sample_ID>.bam +# -F 1796 filters out duplicates, unmapped reads, non-primary alignments +# and reads not passing platform quality checks +samtools index <sample_ID>.bam +``` + +### 2.7. Quality control +**Tools:** deepTools, Helixer, gffread + +Only the predictions/annotation from Helixer were used for consistency, but if the +genome already has an annotation you can also use that and skip the fourth step. +You can also blacklist regions (bed file) in the fifth step, if the previous plot +was too noisy. An easy way is to blacklist the chloroplast, mitochondrion and unplaced +scaffolds, as they have been observed to frequently cause noise in ATAC-seq data. + +#### 2.7.1. Plot coverage +```bash +files=`ls <folder_of_deduplicated_files>/*.bam` +<path_to_deeptools>/deepTools/bin/plotCoverage -b $files \ +--plotFile <species>Coverage.png --smartLabels --outRawCounts deduplicated/raw.txt +``` +#### 2.7.2. 
Plot fingerprint +```bash +files=`ls <folder_of_deduplicated_files>/*.bam` +<path_to_deeptools>/deepTools/bin/plotFingerprint -b $files \ +--plotFile <species>Fingerprint.png --smartLabels --binSize 500 \ +--numberOfSamples 500_000 +# if the genome is equal or smaller than 250 Mbp, please adjust the defaults of +# binsize (500) and numberOfSamples (500_000) oof plotFingerprint, so that +# binsize x numberOfSamples < genome size +``` +#### 2.7.3. Bam to bigwig +```bash +files=`ls deduplicated/*2.bam` + +for file in $files; do + <path_to_deeptools>/deepTools/bin/bamCoverage -b $file -o $(basename $file .bam).bw -of bigwig +done +``` +#### 2.7.4. Helixer annotation +```bash +<path_to_helixer>/Helixer/Helixer.py --lineage land_plant --fasta-path <species>.fa \ +--species <full_species_name> --gff-output-path <species>_land_plant_v0.3_a_0080_helixer.gff3 \ +--model-filepath <path_to_best_model>/land_plant_v0.3_a_0080.h5 --batch-size 150 \ +--subsequence-length 21384 --temporary-dir <temp_dir> +# change batch size/subsequence length as your GPU can handle/as you need +# full species name example: Arabidopsis_thaliana +``` +>**Warning:** To create the annotation, you will need an Nvidia GPU, their CUDA +> toolkit and cuDNN. (see https://github.com/weberlab-hhu/Helixer) + +#### 2.7.5. Compute matrix +```bash +# convert predictions or reference annotation to bed +gffread helixer_pred/<species>_land_plant_v0.3_a_0080_helixer.gff3 \ +--bed -o <species>_helixer.bed + +# compute matrix +files=`ls <path_to_bigwig_files>*.bw` +<path_to_deeptools>/deepTools/bin/computeMatrix reference-point \ +-S $files -R <species>_helixer.bed --smartLabels \ +-o <species>_matrix.mat.gz --upstream 1500 --downstream 1500 +``` + +#### 2.7.6. Plot heatmap +```bash +<path_to_deeptools>/deepTools/bin/plotHeatmap -m deepqc/<species>_matrix.mat.gz \ +-o <species>_tss_heatmap.png +``` + +The heatmap should show a peak in front of/around the transcription start site (TSS) +for ATAC-seq and behind the TSS for H3K4me3 ChIP-seq. + +**ATAC-seq example of _Brachypodium distachyon_:** + + +**ChIP-seq example of _Brachypodium distachyon_:** + + +### References +- Arnold, K., Gosling, J., & Holmes, D. (2005). The Java programming language. +Addison Wesley Professional. +- Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput +Sequence Data. (n.d.). Retrieved May 23, 2022, from +https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ +- Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: a flexible trimmer +for Illumina sequence data. Bioinformatics, 30(15), 2114. +https://doi.org/10.1093/BIOINFORMATICS/BTU170 +- Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., +Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve +years of SAMtools and BCFtools. GigaScience, 10(2). +https://doi.org/10.1093/GIGASCIENCE/GIAB008 +- Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC: summarize +analysis results for multiple tools and samples in a single report. Bioinformatics, +32(19), 3047–3048. https://doi.org/10.1093/BIOINFORMATICS/BTW354 +- Md, V., Misra, S., Li, H., & Aluru, S. (2019). Efficient architecture-aware +acceleration of BWA-MEM for multicore systems. Proceedings - 2019 IEEE 33rd +International Parallel and Distributed Processing Symposium, IPDPS 2019, 314–324. +https://doi.org/10.1109/IPDPS.2019.00041 +- Pertea, G., & Pertea, M. (2020). GFF Utilities: GffRead and GffCompare. +F1000Research, 9, 304. 
https://doi.org/10.12688/F1000RESEARCH.23297.2 +- Picard Tools - By Broad Institute. (n.d.). Retrieved May 23, 2022, +from https://broadinstitute.github.io/picard/ +- RamÃrez, F., Ryan, D. P., Grüning, B., Bhardwaj, V., Kilpert, F., Richter, +A. S., Heyne, S., Dündar, F., & Manke, T. (2016). deepTools2: a next generation +web server for deep-sequencing data analysis. Nucleic Acids Research, 44(W1), +W160–W165. https://doi.org/10.1093/NAR/GKW257 +- Stiehler, F., Steinborn, M., Scholz, S., Dey, D., Weber, A. P. M., & Denton, +A. K. (2021). Helixer: cross-species gene annotation of large eukaryotic genomes +using deep learning. Bioinformatics, 36(22–23), 5291–5298. +https://doi.org/10.1093/BIOINFORMATICS/BTAA1044 +- SRA Toolkit - GitHub. (n.d.). Retrieved September 19, 2022, +from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit +- Weberlab: Institute of Plant Biochemistry, HHU: GitHub. (n.d.). Retrieved +May 23, 2022, from https://github.com/weberlab-hhu diff --git a/workflows/h5_files.md b/workflows/h5_files.md new file mode 100644 index 0000000000000000000000000000000000000000..076ca607296448818a56217465f1c9abffa31401 --- /dev/null +++ b/workflows/h5_files.md @@ -0,0 +1,270 @@ +# H5 files +Predmoter uses h5 files as input and main output data. H5 files, +Hierarchical Data Format 5 (HDF5), are designed to store and organize large +amounts of data. They consist of two major components: + +- Datasets (multidimensional arrays) +- Groups (container structures; can hold datasets and other groups) + +## 1. H5 file architecture +### 1. Input files +The architecture of the h5 files used as input data for Predmoter looks like this: +```raw +<file>.h5 +| +| +├── data +| ├── X (encoded DNA sequence) +| ├── seqids (chromosome/scaffold names) +| ├── species +| └── start_ends (start and end base per chunk) +| +└── evaluation + ├── <dataset>_meta + | └── bam_files + ├── <dataset>_coverage (in reads per base pair) + └── <dataset>_spliced_coverage +``` + +#### 1.1. The data group +The base h5 file and all arrays of the data group is created using ``fasta2h5.py``. + +##### 1.1.1. X +The encoded DNA sequence. The original fasta file is cut into chunks of the same +size (default: 21384 base pairs). The shape of this array is (N, C, 4). The C +represents the chunk/sequence length and N the number of chunks. The four represents +the nucleotides/bases (example shape: (11202, 21384, 4)). The nucleotide encoding +is as follows: +- [1, 0, 0, 0] = C +- [0, 1, 0, 0] = A +- [0, 0, 1, 0] = T +- [0, 0, 0, 1] = G +- [0.25, 0.25, 0.25, 0.25] = N +- [0, 0, 0, 0] = padding + +Padding is used to pad out too short sequences or chromosome ends, since chromosomes +are rarely exactly divisible by the sequence length. Arrays always need to have +compatible shapes to be concatenated, so the sequence length needs to be identical +between all chunks. Shorter sequences will therefore be padded. Predmoter considers +Ns padding as well (so [0, 0, 0, 0]), as most Ns are gaps in the genome assembly +and as such not informative. + +##### 1.1.2 seqids +The ``seqids`` represent the fasta file headers/identifiers of the sequence +(chromosome, scaffold, contig, etc.) that a chunk comes from. The shape is (N,), i.e. +the number of chunks. + +##### 1.1.3. species +The species ID given by the user. The shape is (N,), i.e. the number of chunks. +Technically useful if multiple species are in one file. Predmoter assumes that +there is just one species per file, which is especially important for testing and +predicting. 
It should be possible to use multiple species in one file during +training, but it wasn't tested. + +##### 1.1.4. start_ends +The start and end coordinates of a given chunk (in that order). When the end +coordinate is smaller than the start coordinate (e.g. [21384, 0]), then the +chunk is from the negative strand. So [0, 21384] is the first 21384 bp of a given +sequence of the positive strand. On the positive strand start is inclusive and +end exclusive (it's the other way around on the negative strand). If +``|start - end|`` is less than the chunk length then there is padding. The shape is +(N, 2), i.e. two values (start and end) per chunk. + +#### Examples +##### 1. Chunk order +The DNA sequence order inside the h5 files is: chromosome 1-positive strand, +chromosome 1-negative strand, chromosome 2-positive strand, +chromosome 2-negative strand, etc. + +Example of one chromosome: +- sequence length: 21384 bp +- length of chromosome 1: 51321 bp + +| | | | | | | | +|:----------|:-----------|:---------------|:---------------|:---------------|:---------------|:-----------| +| chunk | 0 | 1 | 2 | 3 | 4 | 5 | +| strand | + | + | + | - | - | - | +| start/end | [0, 21384] | [21384, 42768] | [42768, 51321] | [51321, 42768] | [42768, 21384] | [21384, 0] | + +All strands are represented from 5' to 3', so the negative strand is in descending +order. Chunks 2 and 3 contain padding and chunk 3 is the reverse complement of chunk 2. + +##### 2. Padding +bases = A, T, C, G, N +padding = P +(both encoded inside the file) + +positive strand: + +| | | | | +|:-----------|:----------------|:----------------|:----------------| +| chunk | 0 | 1 | 2 | +| mini_array | [A, A, T, G, A] | [C, A, T, N, C] | [G, T, T, P, P] | +| start/end | [0, 5] | [5, 10] | [10, 13] | + +negative strand: + +| | | | | +|:-----------|:----------------|:----------------|:----------------| +| chunk | 3 | 4 | 5 | +| mini_array | [A, A, C, P, P] | [G, N, A, T, G] | [T, C, A, T, T] | +| start/end | [13, 10] | [10, 5] | [5, 0] | + +The padding is on the positive and negative strand always at the end of the array, +so chunk 2 and 3 are **not** exact mirrors/reverse complements of each other, while +for example chunk 0 and 5 are. + +#### 1.2 The evaluation group +All arrays of the evaluation group are added to the base h5 file using +``add_ngs_coverage.py``. Multiple datasets can be added to the h5 file. + +##### 1.2.1 dataset_meta +The meta of a dataset contains the array ``bam_files``. This is a list of the bam file +names (including the path) that were used to add the dataset. The shape is (B,), i.e. +the number of bam files. Multiple bam files can be added per dataset, but it is not +possible to add additional bam files to an existing dataset. In that case, +``add_ngs_coverage`` needs to be used on the desired h5 file not containing the dataset +and all bam files (old and new) can be added together. + +##### 1.2.2. dataset_coverage +Coverage is the number of reads mapping at a given position (cigar M or X). The +shape is (N, C, B). The C represents the chunk/sequence length, N the number of +chunks, and B the number of bam files (example shape: (11202, 21384, 5)). Predmoter +uses the average of all bam files per dataset to train. + +##### 1.2.3 dataset_spliced_coverage +Spliced coverage is the number of reads mapping with a gap, that is a read deletion +or spliced out section at a given position (cigar N or D). The shape is the same as +the shape of dataset_coverage. + +### 2. 
Prediction file +The prediction h5 file contains: +```raw +<input_file_basename>_prediction.h5 +| +├── data +| ├── seqids +| ├── start_ends +| └── species +| +└── prediction + ├── predictions + ├── model_name + └── datasets +``` + +The ``seqids``, ``start_ends`` and ``species`` arrays are taken from the input file. +The predictions array contains the models' predictions. The shape is (N, C, D); +the C represents the chunk/sequence length, N the number of chunks and D the number +of datasets predicted. The metadata contains the models' filename (including the path), +and the datasets the model predicted in the correct order (e.g., "atacseq", "h3k4me3"). +The order of datasets is very important, as it is used by the Converter to convert each +dataset into its own bigwig or bedgraph file. The Converter also needs ``seqids`` and +``start_ends`` to calculate chromosome order and chromosome lengths. + +## 2. H5 file creation +**Tools:** Helixer, GeenuFF (optional), Predmoter +(Stiehler et al., 2021; Weberlab: Institute of Plant Biochemistry, HHU: GitHub, n.d.) + +ATAC- and ChIP-seq data is PCR amplified. Therefore, you can not determine from which +strand which read originated unless you used barcodes to get strand specific data. +Hence, the coverage information of the read is always added to both strands. The +advantages are: +- Open chromatin and closed chromatin regions always apply to both strands anyway. +- The addition to both strands allows for built-in data augmentation. + +### 2.1. Create h5 file from fasta file +```bash +fasta2h5.py --species <full_species_name> --h5-output-path <species/filename>.h5 \ +--fasta-path <path_to_genome>/<species>.fa +# full species name example: Arabidopsis_thaliana +``` + +> **H5 base file:** To add annotations or datasets, the fasta h5 file needs to be +> created **first**. The additional data will then be added to the fasta h5 file. + +### 2.2. Add ATAC- and/or ChIP-seq data +Multiple datasets should be added separately. The shift argument is a special case +for ATAC-seq data to shift the reads +4 (+ strand) or -5 (- strand) base pairs as it is +typically done. +```bash +python3 <path_to_helixer>/helixer/evaluation/add_ngs_coverage.py \ +-s <full_species_name> -d <species>.h5 -b <path_to_bam_files>/*.bam \ +--dataset-prefix <ngs_prefix> --unstranded --threads <number_of_bam_files_to_add> \ +(--shift) +# multiple threads are faster, but --threads 0 also works +# ngs prefix example: atacseq +# --unstranded only for unstranded data +``` + +### 2.3. Add annotation(s) (optional) +To add the annotation to the h5 file, they need to be converted from gff/ggf3 to a +sqlite file using GeenuFF. If helixer and reference are desired one of them needs to +be added using ``--add-additional <annotation_name>`` (helixer is the better option; +suggested annotation name: helixer_post). 
+[More details about the annotation data](https://github.com/weberlab-hhu/Helixer/blob/main/docs/h5_data.md) +```bash +# to sqlite +import2geenuff.py --gff3 <helixer_annotation>.gff3 --fasta <species>.fa \ +--species <full_species_name> --db-path <helixer_geenuff>.sqlite3 \ +--log-file output.log +# --gff3 <reference>.gff to convert the reference annotation + +# add annotation +geenuff2h5.py --h5-output-path <species>.h5 --input-db-path <helixer_geenuff>.sqlite3 \ +--no-multiprocess --modes y,anno_meta,transitions +# modes: don't add X again when using add_additional +``` + +Resulting new file structure (when adding the reference annotation not in alternative +and adding the helixer annotation to alternative): +```raw +<file>.h5 +| +| +├── data +| ├── X (encoded DNA sequence) +| ├── seqids (chromosome/scaffold names) +| ├── species +| ├── start_ends (start and end base per chunk) +| ├── err_samples +| ├── fully_intergenic_samples +| ├── gene_lengths +| ├── is_annotated +| ├── phases +| ├── sample_weights +| ├── transitions +| └── y +| +├── evaluation +| ├── <dataset>_meta +| | └── bam_files +| ├── <dataset>_coverage (in reads per base pair) +| └── <dataset>_spliced_coverage +| +└── alternative + └── helixer_post + ├── err_samples + ├── fully_intergenic_samples + ├── gene_lengths + ├── is_annotated + ├── phases + ├── sample_weights + ├── transitions + └── y +``` + +**Common errors:** +- GeenuFF is very strict, so frequently converting the annotation to sqlite fails and +the reference gff needs to be edited slightly for it to be converted. +- The full species names don't match up between script calls. + +## References +- Stiehler, F., Steinborn, M., Scholz, S., Dey, D., Weber, A. P. M., & Denton, +A. K. (2021). Helixer: cross-species gene annotation of large eukaryotic genomes +using deep learning. Bioinformatics, 36(22–23), 5291–5298. +https://doi.org/10.1093/BIOINFORMATICS/BTAA1044 +- SRA Toolkit - GitHub. (n.d.). Retrieved September 19, 2022, +from https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit +- Weberlab: Institute of Plant Biochemistry, HHU: GitHub. (n.d.). Retrieved +May 23, 2022, from https://github.com/weberlab-hhu \ No newline at end of file