Merge branch 'ceplas-dm-viktoria' into 'main'

Ceplas dm viktoria

See merge request !2

Commit 651833ba authored 4 months ago by Dominik Brilhaus
--- a/assays/SpeciesSelection/isa.assay.xlsx
+++ b/assays/SpeciesSelection/isa.assay.xlsx
--- a/assays/SpeciesSelection/protocols/.gitkeep
+++ b/assays/SpeciesSelection/protocols/.gitkeep
--- a/assays/SpeciesSelection/protocols/Cross-speciesPredictionModels.md
+++ b/assays/SpeciesSelection/protocols/Cross-speciesPredictionModels.md
+## Cross-species prediction models
+
+Ensuring a diverse range of species in the training set, while simultaneously reserving enough data for validation and testing to effectively evaluate the models’ generalization ability, proved difficult. At the start of development, the amount of high-quality, publicly available ATAC-seq data was low. Around 60% of the plant ATAC-seq data on SRA available up until July 2023 needed to be discarded after the final quality control. This left the ATAC-seq data of the 14 plant species used in this study. In later development stages, 3 more ATAC-seq datasets, from *Actinidia chinensis*, *Panicum miliaceum*, and *Sorghum bicolor*, and 2 more ChIP-seq datasets corresponding to acquired ATAC-seq datasets, from *A. chinensis* and *M. polymorpha*, became available. The low availability of high-quality data, especially in early development stages, turned out to be a major hindrance in providing the network with an appropriate amount of data to train on. Data from two species, *A. thaliana* and *O. sativa*, was set aside as a hold-out test set. In doing so, both a dicot and a monocot species with available ATAC- and ChIP-seq datasets could be used for final evaluation. The same applied to the two validation species, the dicot *Medicago truncatula* and the monocot *S. polyrhiza* (Table 3).
+
+The resulting training, validation, and test split for the ATAC-seq models, ChIP-seq models and Combined models was around 90% training set, 5% validation set and 5% test set (Fig. 3a).
+
+The model training pairs were visualized using the Uniform Manifold Approximation and Projection (UMAP) learning technique for dimension reduction (McInnes et al. 2018). Random training pairs, 5% of each species in the training set, were used to calculate the UMAPs. Gap subsequences and flagged sequences were not included. The chosen parameters were 10 neighbors, a minimum distance of 0.1, and the Euclidean distance metric. The additional species datasets, added in later development stages, were included. None of the available settings and metrics for UMAP computation showed distinct clusters based on the number of peaks within the input (Fig. 3b).
+
+The first seven models were trained only on the species for which high-quality experimental ATAC-seq data was available up until July 2023. The same applied to the BiHybrid_05 model using ChIP-seq data. The Combined model used both datasets. The Combined_02 model used additional data of four species. Gap subsequences were masked for all models; unplaced scaffolds and non-nuclear sequences were masked starting with model BiHybrid_04.
\ No newline at end of file
--- a/assays/SpeciesSelection/protocols/Intra-speciesModelsAndLeave-one-outCross-validation.md
+++ b/assays/SpeciesSelection/protocols/Intra-speciesModelsAndLeave-one-outCross-validation.md
+## Intra-species models and leave-one-out cross-validation
+
+Cross-species validation, instead of an in-species split for the validation and training data, was deemed closer to the real-world use case of predicting ATAC- and ChIP-seq data for an entire species. However, two models were trained using an intra-species training and validation split. These models, IS_10 and IS_20, used 10% and 20% of each species dataset as the validation set, respectively. The input files were split using Predmoter's intra_species_train_val_split.py script in “side_scripts.” This method ensured that each sequence ID from the original fasta file was fully assigned to either the training or the validation set. Since the focus of this study is on cross-species prediction, all 25 plant species were used in leave-one-out cross-validation (LOOCV) to evaluate the best model setup on different species. All these setups were trained on ATAC- and ChIP-seq datasets simultaneously (Table 4). When performing LOOCV, the model performance was evaluated on all datasets available in the left-out species.
+
+All models excluded gap subsequences, subsequences of 21 384 bp only containing Ns, and flagged subsequences. For more details on exact model parameters see Supplementary Section S1.3 and Supplementary Table S4.
\ No newline at end of file
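The two splitting strategies described in Intra-speciesModelsAndLeave-one-outCross-validation.md can be illustrated with a minimal sketch. It is not the logic of Predmoter's `intra_species_train_val_split.py`; the function names, the greedy assignment, and the random seed are assumptions made for illustration only. The sketch only demonstrates the stated property that every sequence ID ends up entirely in one of the two sets, and how LOOCV folds over species can be enumerated.

```python
import random


def split_by_sequence_id(seq_lengths, val_fraction, seed=0):
    """Assign whole sequences to train or validation (hypothetical helper).

    seq_lengths: dict mapping sequence ID -> length in bp.
    Sequences are shuffled and added to the validation set until roughly
    val_fraction of the total bp is reached; the rest goes to training.
    The validation share can overshoot the target by up to one sequence.
    """
    ids = sorted(seq_lengths)
    random.Random(seed).shuffle(ids)
    target_bp = val_fraction * sum(seq_lengths.values())
    train_ids, val_ids, val_bp = [], [], 0
    for seq_id in ids:
        if val_bp < target_bp:
            val_ids.append(seq_id)
            val_bp += seq_lengths[seq_id]
        else:
            train_ids.append(seq_id)
    return train_ids, val_ids


def leave_one_out(species):
    """Yield (training species, left-out species) for each LOOCV fold."""
    for held_out in species:
        yield [s for s in species if s != held_out], held_out


# Toy usage with made-up sequence IDs and lengths
lengths = {"chr1": 300, "chr2": 250, "chr3": 200, "chr4": 150, "chr5": 100}
train, val = split_by_sequence_id(lengths, val_fraction=0.2)
folds = list(leave_one_out(["species_a", "species_b", "species_c"]))
```

A greedy assignment like this only approximates the requested fraction when a genome has few, long sequences, which is one reason the real script may behave differently.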
--- a/isa.investigation.xlsx
+++ b/isa.investigation.xlsx
--- a/studies/ArchitectureAndProposedModels/README.md
+++ b/studies/ArchitectureAndProposedModels/README.md
--- a/studies/ArchitectureAndProposedModels/isa.study.xlsx
+++ b/studies/ArchitectureAndProposedModels/isa.study.xlsx
--- a/studies/ArchitectureAndProposedModels/protocols/.gitkeep
+++ b/studies/ArchitectureAndProposedModels/protocols/.gitkeep
--- a/studies/ArchitectureAndProposedModels/protocols/ArchitectureAndProposedModelsProtocol.md
+++ b/studies/ArchitectureAndProposedModels/protocols/ArchitectureAndProposedModelsProtocol.md
+## Architecture and proposed models
+
+The model architectures were implemented using PyTorch Lightning (Falcon 2019) on top of PyTorch (Paszke et al. 2019). The model used supervised learning, a method that connects an input to an output based on example input–output pairs (Russell and Norvig 2016).
+
+The input for the model was a genomic DNA sequence. The nucleotides were encoded into four-dimensional vectors (see Supplementary Table S1). The DNA sequence of a given plant species was cut into subsequences of 21 384 bp. This number was large enough to contain typical gene lengths of plants while being divisible by ten of the numbers from one to twenty. An easily divisible subsequence length is a requirement for Predmoter (see Supplementary Section S1.2). As few chromosomes, scaffolds or contigs were divisible by 21 384 bp, sequence ends as well as short sequences were padded with the vector [0., 0., 0., 0.]. Padded base pairs were masked during training. If a subsequence only contained N bases, here referred to as “gap subsequence,” it was filtered out. Both strands, plus and minus, were used. Since the ATAC- and ChIP-seq data was PCR-amplified, it was not possible to determine from which strand a read originated; the coverage information was therefore always added to both strands. The model’s predictions for either ATAC-seq, ChIP-seq or both were compared to the experimental read coverage. The target data were represented per sample of experimental data. These were averaged beforehand, resulting in one coverage track per NGS dataset and plant species.
+
+Three main model architectures were examined for their performance. The first architecture consisted of convolutional layers followed by transposed convolutional layers for deconvolution (LeCun et al. 1989, LeCun and Bengio 1995). The deconvolution was added to output base-wise predictions. We refer here to this architecture as U-Net. To ensure that the new sequence lengths resulting from a convolution or deconvolution were correct, custom padding formulas were used (Supplementary Section S1.2). Our second approach was a hybrid network. A block of long short-term memory layers (LSTM) (Hochreiter and Schmidhuber 1997) was placed in between a convolutional layer block and a transposed convolutional layer block. The final approach was called bi-hybrid. Its architecture matched the hybrid architecture, except that the LSTM layers were replaced with bidirectional LSTM layers (BiLSTM) (Hochreiter and Schmidhuber 1997, Schuster and Paliwal 1997). Each convolutional and transposed convolutional layer was followed in all three approaches by the ReLU activation function (Glorot et al. 2011). Additional augmentations to the bi-hybrid network included adding batch normalization after each convolutional and transposed convolutional layer and adding a dropout layer after each BiLSTM layer except the last (Fig. 2). The Adam algorithm was used as an optimization method (Kingma and Ba 2014). The network’s base-wise predictions can be smoothed via a postprocessing step utilizing a rolling mean of a given window size.
+
+We examined 10 different model setups (Table 2). The best model of each architecture and dataset combination was used to develop the next combination test. The model reaching the highest Pearson’s correlation for the validation set was deemed the best model. Pre-tests showed that including gap subsequences, subsequences of 21 384 bp only containing Ns, led to a considerably lower Pearson’s correlation. The proportion of gap subsequences in the total data was 0.6%. Normalizing the NGS coverage data through a general approach of subtracting the average coverage from the dataset and using a ReLU transformation (Glorot et al. 2011) showed notably worse results during previous attempts. The approach of normalizing via an input sample was not feasible due to the considerable lack of available ATAC-seq input samples accompanying the experiments. Therefore, the target data was not adjusted towards its sequencing depth. For more information about the training process see Supplementary Section S1.3.
+
+All models excluded gap subsequences, subsequences of 21 384 bp only containing Ns. For more details on species selection and exact model parameters see Supplementary Table S4. Models excluding subsequences of unplaced scaffolds and non-nuclear sequences during training and testing are denoted with *.
\ No newline at end of file
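The input preparation described in ArchitectureAndProposedModelsProtocol.md (cutting into 21 384 bp subsequences, zero-padding, masking, and dropping gap subsequences) can be sketched roughly as follows. This is an illustration under stated assumptions, not Predmoter's or Helixer's code: the `ENCODING` table is hypothetical (the actual mapping is given in Supplementary Table S1, which is not reproduced here), and the function names are made up. Only the subsequence length and the padding vector come from the protocol text.

```python
SUBSEQ_LEN = 21_384  # subsequence length stated in the protocol

# Hypothetical encoding table (assumption); see Supplementary Table S1
# of the paper for the real four-dimensional vectors.
ENCODING = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "C": [0.0, 1.0, 0.0, 0.0],
    "G": [0.0, 0.0, 1.0, 0.0],
    "T": [0.0, 0.0, 0.0, 1.0],
    "N": [0.25, 0.25, 0.25, 0.25],
}
PAD = [0.0, 0.0, 0.0, 0.0]  # padding vector named in the protocol


def divisors_up_to(n, limit=20):
    """Return the numbers from 1 to limit that divide n without remainder."""
    return [d for d in range(1, limit + 1) if n % d == 0]


def chunk_sequence(seq, length=SUBSEQ_LEN):
    """Yield (encoded, mask) per subsequence, skipping gap subsequences.

    mask is True for real bases and False for padded positions, so padded
    positions can be excluded from the loss during training.
    """
    seq = seq.upper()
    for start in range(0, len(seq), length):
        sub = seq[start:start + length]
        if set(sub) == {"N"}:  # gap subsequence: only N bases
            continue
        # Unknown symbols are treated like N (assumption)
        encoded = [ENCODING.get(base, ENCODING["N"]) for base in sub]
        mask = [True] * len(sub)
        n_pad = length - len(sub)
        encoded.extend([PAD] * n_pad)
        mask.extend([False] * n_pad)
        yield encoded, mask


# Toy usage with a tiny subsequence length so the result is easy to inspect
chunks = list(chunk_sequence("ACGTNNNNAC", length=4))
```

`divisors_up_to(21_384)` lists the ten numbers between one and twenty that divide the subsequence length, which is the divisibility property the protocol refers to.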
--- a/studies/ArchitectureAndProposedModels/resources/.gitkeep
+++ b/studies/ArchitectureAndProposedModels/resources/.gitkeep
--- a/studies/ArchitectureAndProposedModels/resources/README.md
+++ b/studies/ArchitectureAndProposedModels/resources/README.md
+<img src=./dataset/vbae074f2.jpeg width=60%>
+
+**Figure 2.** Predmoter architecture and prediction process. The bi-hybrid architecture with batch normalization and dropout is schematically depicted. Not to scale. Hyperparameters are examples and can vary. The base-wise predictions and smoothed predictions are from an example subsequence from *A. thaliana*.
+
+## Figure source
+
+https://academic.oup.com/view-large/figure/464175134/vbae074f2.tif
+
+**Table 2.** Model architecture and dataset explanation (short).
+
+| **Model name**             | **Dataset**                                                                   | **Architecture**                                                                                      |
+|----------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------|
+| U-Net                      | ATAC-seq                                                                      | 3 convolutional layers + 3 transposed convolutional layers                                            |
+| Hybrid                     | ATAC-seq                                                                      | U-Net + 2 LSTM layers                                                                                 |
+| BiHybrid                   | ATAC-seq                                                                      | U-Net + 2 BiLSTM layers                                                                               |
+| BiHybrid_02                | ATAC-seq                                                                      | U-Net + 2 BiLSTM layers + 6 batch normalization layers                                                |
+| BiHybrid_03.1 (see Fig. 2) | ATAC-seq                                                                      | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) |
+| BiHybrid_03.2              | ATAC-seq                                                                      | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.5) |
+| BiHybrid_04                | ATAC-seq, filtered flagged sequences*                                         | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) |
+| BiHybrid_05                | ChIP-seq (H3K4me3), filtered flagged sequences*                               | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) |
+| Combined                   | ATAC-seq, ChIP-seq (H3K4me3), filtered flagged sequences*                     | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) |
+| Combined_02                | ATAC-seq, ChIP-seq (H3K4me3), filtered flagged sequences* (+ additional data) | U-Net + 2 BiLSTM layers + 6 batch normalization layers + 1 dropout layer (dropout probability of 0.3) |
\ No newline at end of file
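Figure 2 contrasts base-wise predictions with smoothed predictions, and the protocol states that smoothing uses a rolling mean of a given window size. A minimal sketch of such a rolling mean is given here. The centering of the window and the handling of the sequence ends (shrinking the window at the edges) are assumptions; the protocol does not specify them, and Predmoter's own postprocessing may differ.

```python
def rolling_mean(values, window):
    """Smooth a list of base-wise values with a centered rolling mean.

    Assumption: the window is centered on each position and truncated at
    the sequence ends, so edge values are averaged over fewer positions.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        start = i - window // 2
        end = start + window
        start, end = max(0, start), min(len(values), end)
        smoothed.append(sum(values[start:end]) / (end - start))
    return smoothed


# Toy usage: a single sharp peak is spread over its neighbors
smoothed = rolling_mean([0.0, 0.0, 3.0, 0.0, 0.0], window=3)
```

A larger window flattens narrow peaks more strongly, so the window size trades positional resolution against noise reduction.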
--- a/studies/ArchitectureAndProposedModels/resources/vbae074f2.jpeg
+++ b/studies/ArchitectureAndProposedModels/resources/vbae074f2.jpeg
--- a/studies/Data/README.md
+++ b/studies/Data/README.md
--- a/studies/Data/isa.study.xlsx
+++ b/studies/Data/isa.study.xlsx
--- a/studies/Data/protocols/.gitkeep
+++ b/studies/Data/protocols/.gitkeep
--- a/studies/Data/protocols/DataOverviewAndPreprocessing.md
+++ b/studies/Data/protocols/DataOverviewAndPreprocessing.md
+## Data overview and preprocessing
+
+The entire dataset consisted of 25 plant genomes, for 17 of which genome-wide ATAC-seq data was publicly available and for 21 of which genome-wide ChIP-seq (H3K4me3) data was publicly available (see Table 1 and Supplementary Table S2). A wide variety of tissues and treatments were used in these ATAC- and ChIP-seq experiments, which are listed in Supplementary Table S3. The NGS data was downloaded from the Sequence Read Archive (SRA) using the SRA-Toolkit 3.0.0 (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). The reads were trimmed with Trimmomatic 0.36 (Bolger et al. 2014) and quality controlled using FastQC 0.11.9 (Andrews 2010) and MultiQC (Ewels et al. 2016). If the reads passed quality control, they were mapped to the reference genome using BWA 2.1 (Md et al. 2019). Conversion to bam files was performed using SamTools 1.6 (Danecek et al. 2021). The Picard Toolkit (Broad Institute ed 2019) was used to mark duplicates. The duplicates, unmapped reads, non-primary alignments and reads not passing platform quality checks were removed with SamTools. Plots for quality control were generated using deepTools 3.5.3 (Ramírez et al. 2016) and the necessary genome annotations were generated using Helixer v.0.3.1 (Stiehler et al. 2021, Holst et al. 2023). ATAC-seq data was deemed of high enough quality if the average coverage enrichment ±3 kbp around the TSS showed the expected peak and the average peak read coverage was at least 2.5 times the background coverage. The quality control for ChIP-seq data was performed using the same criteria. Detailed data preprocessing documentation is available at: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/data_preprocessing.md. The plant genome fasta files and final NGS data bam files were converted to h5 files using Helixer (Stiehler et al. 2021, Holst et al. 2023). The ATAC-seq reads were shifted +4 bp on the positive strand and −5 bp on the negative strand to adjust the read start sites to represent the center of the transposon binding site (Buenrostro et al. 2013). Detailed documentation of the h5 file creation and architecture is available at: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/h5_files.md.
+
+The species used in the development of Predmoter are separated into the four domains algae, mosses, monocots, and dicots. The availability and usage of the species dataset for ATAC- or ChIP-seq is indicated by a check mark.
\ No newline at end of file
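Two numeric rules from DataOverviewAndPreprocessing.md lend themselves to a short sketch: the ATAC-seq read shift (+4 bp on the positive strand, −5 bp on the negative strand) and the 2.5-fold enrichment threshold around the TSS. The code is illustrative only. The function names are hypothetical, and the way peak and background are computed here (maximum of the averaged profile versus the mean of its outer flanks, with a flank width of 500 positions by default) is an assumption, since the protocol does not define them; the actual pipeline relied on deepTools plots and Helixer.

```python
def shift_atac_read_start(five_prime_pos, strand):
    """Shift a read's 5' position by +4 bp ('+') or -5 bp ('-')."""
    if strand == "+":
        return five_prime_pos + 4
    if strand == "-":
        return five_prime_pos - 5
    raise ValueError("strand must be '+' or '-'")


def passes_tss_enrichment(profile, flank=500, min_ratio=2.5):
    """Check the peak-to-background ratio of an averaged TSS profile.

    profile: average coverage per position across the TSS window.
    Assumptions: background = mean of the outermost `flank` positions on
    each side; peak = maximum of the profile.
    """
    if flank < 1 or len(profile) < 2 * flank + 1:
        raise ValueError("profile too short for the chosen flank width")
    background_values = profile[:flank] + profile[-flank:]
    background = sum(background_values) / len(background_values)
    if background <= 0:
        return False  # ratio undefined; treated as a failed check here
    return max(profile) / background >= min_ratio


# Toy usage with made-up coverage values
good_profile = [1.0] * 5 + [3.0] + [1.0] * 5
weak_profile = [1.0] * 5 + [2.0] + [1.0] * 5
good = passes_tss_enrichment(good_profile, flank=3)
weak = passes_tss_enrichment(weak_profile, flank=3)
```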
--- a/studies/Data/protocols/FilteringFlaggedSequences.md
+++ b/studies/Data/protocols/FilteringFlaggedSequences.md
+## Filtering flagged sequences
+
+A naïve filtering approach was used to reduce the noise in the dataset. The ATAC-seq data showed high coverage for non-nuclear sequences. The transposase cuts primarily open chromatin (Buenrostro et al. 2013) and as such also the chloroplast and mitochondrial genomes. When the organelles were not completely removed before the experiment, the data contained noise in the form of notably higher coverage in these regions. Unplaced scaffolds were also observed to contribute to this noise during the data quality control steps (Fig. 1a). Therefore, unplaced scaffolds and non-nuclear sequences were flagged during later development stages (see Section 2.2 and Tables 2 and 3). Assemblies at scaffold or contig level, *Bigelowiella natans*, *Eragrostis nindensis*, *Marchantia polymorpha*, *Oropetium thomaeum*, *Pyrus x bretschneideri*, and *Spirodela polyrhiza*, were not flagged. The flagged sequences were filtered out (Fig. 1b). The information about the assembly accessions of the unplaced scaffolds and non-nuclear sequences was extracted from the sequence report jsonl files available from NCBI’s RefSeq or GenBank and added to the h5 file (under “data/blacklist”) via add_blacklist.py in “side_scripts.” The flagged sequences made up around 7% of all genome assemblies used, not counting assemblies at scaffold or contig level.
\ No newline at end of file
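The flagging step described in FilteringFlaggedSequences.md can be pictured as a pass over a sequence report in JSON Lines format that collects the accessions of unplaced scaffolds and non-nuclear sequences. The sketch is not `add_blacklist.py`. The field names (`role`, `assignedMoleculeLocationType`, `refseqAccession`, `genbankAccession`) and the values matched against them are assumptions about the report layout and should be checked against the actual files; the accessions in the usage example are made up.

```python
import json

# Assumed location labels for organelle sequences (not confirmed by the source)
NON_NUCLEAR = {"Mitochondrion", "Chloroplast", "Plastid"}


def flagged_accessions(jsonl_lines):
    """Collect accessions of unplaced scaffolds and non-nuclear sequences.

    jsonl_lines: iterable of strings, one JSON record per line.
    Field names and values are assumptions about the report format.
    """
    flagged = set()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        is_unplaced = record.get("role") == "unplaced-scaffold"
        is_non_nuclear = record.get("assignedMoleculeLocationType") in NON_NUCLEAR
        if is_unplaced or is_non_nuclear:
            accession = record.get("refseqAccession") or record.get("genbankAccession")
            if accession:
                flagged.add(accession)
    return flagged


# Toy usage with made-up records
report = [
    json.dumps({"role": "assembled-molecule",
                "assignedMoleculeLocationType": "Chromosome",
                "refseqAccession": "SEQ_CHR1"}),
    json.dumps({"role": "assembled-molecule",
                "assignedMoleculeLocationType": "Chloroplast",
                "refseqAccession": "SEQ_PLASTID"}),
    json.dumps({"role": "unplaced-scaffold",
                "assignedMoleculeLocationType": "Chromosome",
                "genbankAccession": "SEQ_SCAFFOLD7"}),
]
flagged = flagged_accessions(report)
```

Subsequences belonging to the returned accessions would then be excluded during training and testing, as the protocol describes for models marked with *.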
--- a/studies/Data/resources/.gitkeep
+++ b/studies/Data/resources/.gitkeep
--- a/studies/Data/resources/README.md
+++ b/studies/Data/resources/README.md
+<img src=./dataset/vbae074f1.jpeg width=60%>
+
+**Figure 1.** Average ATAC- and ChIP-seq coverage ±3 kbp around the TSS for each species in the dataset. (a) Average ATAC-seq coverage including unplaced scaffolds and non-nuclear sequences. (b) Average ATAC- and ChIP-seq coverage excluding unplaced scaffolds and non-nuclear sequences. The species are sorted into the three categories dicots, monocots, and algae/mosses.
+
+## Figure source
+
+https://academic.oup.com/view-large/figure/464175126/vbae074f1.tif
+
+**Table 1.** Plant genomes and available datasets.
+
+| **Domain** | **Species**                 | **ATAC-seq** | **ChIP-seq (H3K4me3)** |
+|------------|-----------------------------|--------------|------------------------|
+| Algae      | _Bigelowiella natans_       |       ✔      |                        |
+|            | _Chlamydomonas reinhardtii_ |              |            ✔           |
+| Mosses     | _Marchantia polymorpha_     |       ✔      |            ✔           |
+| Monocots   | _Brachypodium distachyon_   |       ✔      |            ✔           |
+|            | _Eragrostis nindensis_      |       ✔      |            ✔           |
+|            | _Oropetium thomaeum_        |       ✔      |                        |
+|            | _Oryza brachyantha_         |              |            ✔           |
+|            | _Oryza sativa_              |       ✔      |            ✔           |
+|            | _Panicum miliaceum_         |       ✔      |                        |
+|            | _Setaria italica_           |              |            ✔           |
+|            | _Sorghum bicolor_           |       ✔      |                        |
+|            | _Spirodela polyrhiza_       |       ✔      |            ✔           |
+|            | _Zea mays_                  |       ✔      |            ✔           |
+| Dicots     | _Actinidia chinensis_       |       ✔      |            ✔           |
+|            | _Arabidopsis thaliana_      |       ✔      |            ✔           |
+|            | _Brassica napus_            |       ✔      |            ✔           |
+|            | _Brassica oleracea_         |              |            ✔           |
+|            | _Brassica rapa_             |              |            ✔           |
+|            | _Glycine max_               |       ✔      |            ✔           |
+|            | _Malus domestica_           |       ✔      |            ✔           |
+|            | _Medicago truncatula_       |       ✔      |            ✔           |
+|            | _Prunus persica_            |              |            ✔           |
+|            | _Pyrus x bretschneideri_    |              |            ✔           |
+|            | _Sesamum indicum_           |              |            ✔           |
+|            | _Solanum lycopersicum_      |       ✔      |            ✔           |
\ No newline at end of file
--- a/studies/Data/resources/vbae074f1.jpeg
+++ b/studies/Data/resources/vbae074f1.jpeg