add species selection assay and protocols

Commit 3527fe09 authored 5 months ago by Viktoria Petrova
--- a/assays/SpeciesSelection/README.md
+++ b/assays/SpeciesSelection/README.md
--- a/assays/SpeciesSelection/dataset/.gitkeep
+++ b/assays/SpeciesSelection/dataset/.gitkeep
--- a/assays/SpeciesSelection/isa.assay.xlsx
+++ b/assays/SpeciesSelection/isa.assay.xlsx
--- a/assays/SpeciesSelection/protocols/.gitkeep
+++ b/assays/SpeciesSelection/protocols/.gitkeep
--- a/assays/SpeciesSelection/protocols/Cross-speciesPredictionModels.md
+++ b/assays/SpeciesSelection/protocols/Cross-speciesPredictionModels.md
+## Cross-species prediction models
+
+Ensuring a diverse range of species in the training set while reserving enough data for validation and testing to evaluate the models’ generalization ability proved difficult. At the start of development, little high-quality ATAC-seq data was publicly available. Around 60% of the plant ATAC-seq data available on SRA up until July 2023 had to be discarded after the final quality control, leaving the ATAC-seq data of the 14 plant species used in this study. In later development stages, three more ATAC-seq datasets (from *Actinidia chinensis*, *Panicum miliaceum* and *Sorghum bicolor*) and two more ChIP-seq datasets corresponding to already acquired ATAC-seq datasets (from *A. chinensis* and *M. polymorpha*) became available. The low availability of high-quality data, especially in early development stages, was a major hindrance to providing the network with enough data to train on. Data from two species, *A. thaliana* and *O. sativa*, was set aside as a hold-out test set, so that both a dicot and a monocot species with available ATAC- and ChIP-seq datasets could be used for final evaluation. The same applied to the two validation species, the dicot *Medicago truncatula* and the monocot *S. polyrhiza* (Table 3).
+
+The resulting training, validation, and test split for the ATAC-seq models, ChIP-seq models and Combined models was around 90% training set, 5% validation set and 5% test set (Fig. 3a).
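
The species-level split described above can be sketched as follows. This is a minimal illustration, not the code used in the study: only the four held-out species names come from the text, while `assign_split`, `split_fractions` and the example counts are hypothetical.

```python
# Species held out for testing and validation, as named in the text.
TEST_SPECIES = {"A. thaliana", "O. sativa"}
VALIDATION_SPECIES = {"M. truncatula", "S. polyrhiza"}


def assign_split(species):
    """Assign a whole species to exactly one split (hypothetical helper)."""
    if species in TEST_SPECIES:
        return "test"
    if species in VALIDATION_SPECIES:
        return "validation"
    return "train"


def split_fractions(subsequence_counts):
    """Return the share of subsequences that ends up in each split.

    `subsequence_counts` maps a species name to its number of subsequences;
    the counts passed in are assumptions made for illustration only.
    """
    totals = {"train": 0, "validation": 0, "test": 0}
    for species, count in subsequence_counts.items():
        totals[assign_split(species)] += count
    overall = sum(totals.values())
    return {split: count / overall for split, count in totals.items()}
```

Because every species is assigned as a whole, the resulting percentages depend on genome sizes and cannot be set directly. This is why the split is reported as "around" 90/5/5.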
+
+The model training pairs were visualized using Uniform Manifold Approximation and Projection (UMAP), a learning technique for dimension reduction (McInnes et al. 2018). A random 5% of the training pairs of each species in the training set was used to calculate the UMAPs. Gap subsequences and flagged sequences were not included. The chosen parameters were 10 neighbors, a minimum distance of 0.1 and the Euclidean distance metric. The additional species datasets, added in later development stages, were included. None of the available settings and metrics for UMAP computation showed distinct clusters based on the number of peaks within the input (Fig. 3b).
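
A minimal sketch of the sampling and embedding step, assuming the training pairs are already filtered for gap and flagged sequences and that each pair has been flattened into a numeric feature vector. The helper names and the per-species dictionary layout are hypothetical; the `umap.UMAP` call uses the `umap-learn` package with the parameters stated above.

```python
import random


def sample_training_pairs(pairs_by_species, fraction=0.05, seed=0):
    """Draw a random fraction of training pairs from each species.

    `pairs_by_species` maps a species name to a list of training pairs.
    """
    rng = random.Random(seed)
    sampled = {}
    for species, pairs in pairs_by_species.items():
        # Keep at least one pair per non-empty species.
        k = min(len(pairs), max(1, round(len(pairs) * fraction)))
        sampled[species] = rng.sample(pairs, k)
    return sampled


def embed_with_umap(features):
    """Project a 2D feature array to two dimensions with UMAP.

    Requires the third-party `umap-learn` package, imported lazily so the
    sampling helper above works without it.
    """
    import umap

    reducer = umap.UMAP(n_neighbors=10, min_dist=0.1, metric="euclidean")
    return reducer.fit_transform(features)
```

Sampling per species rather than from the pooled data keeps species with small genomes represented in the embedding.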
+
+The first seven models were trained only on species for which high-quality experimental ATAC-seq data was available up until July 2023. The same applied to the BiHybrid_05 model, which used ChIP-seq data. The Combined model used both datasets. The Combined_02 model used additional data from four species. Gap subsequences were masked for all models; unplaced scaffolds and non-nuclear sequences were masked starting with model BiHybrid_04.
\ No newline at end of file
--- a/assays/SpeciesSelection/protocols/Intra-speciesModelsAndLeave-one-outCross-validation.txt
+++ b/assays/SpeciesSelection/protocols/Intra-speciesModelsAndLeave-one-outCross-validation.txt
+## Intra-species models and leave-one-out cross-validation
+
+Cross-species validation, rather than an in-species split of the training and validation data, was deemed closer to the real-world use case of predicting ATAC- and ChIP-seq data for an entire species. However, two models were trained using an intra-species training and validation split. These models, IS_10 and IS_20, used 10% and 20% of each species dataset as the validation set, respectively. The input files were split using Predmoter's intra_species_train_val_split.py script in “side_scripts”. This method ensured that each sequence ID from the original FASTA file was fully assigned to either the training or the validation set. Since the focus of this study is on cross-species prediction, all 25 plant species were used in leave-one-out cross-validation (LOOCV) to evaluate the best model setup on different species. All these setups were trained on ATAC- and ChIP-seq datasets simultaneously (Table 4). When performing LOOCV, the model performance was evaluated on all datasets available in the left-out species.
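
The two evaluation schemes can be illustrated with a short sketch. It is not Predmoter's intra_species_train_val_split.py: the function names are hypothetical, and the validation share is approximated here by the number of sequence IDs rather than by the amount of data, which is an assumption.

```python
import random


def split_sequence_ids(sequence_ids, val_fraction, seed=0):
    """Assign each sequence ID as a whole to training or validation."""
    ids = sorted(set(sequence_ids))
    random.Random(seed).shuffle(ids)
    n_val = round(len(ids) * val_fraction)
    validation = set(ids[:n_val])
    training = set(ids[n_val:])
    return training, validation


def leave_one_out(species):
    """Yield (training_species, left_out_species) for every species."""
    for left_out in species:
        yield [s for s in species if s != left_out], left_out
```

With `val_fraction=0.1` or `0.2` the first helper mirrors the IS_10 and IS_20 setups, and the second yields one training run per species, each evaluated on the species that was left out.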
+
+All models excluded gap subsequences (subsequences of 21 384 bp containing only Ns) and flagged subsequences. For more details on the exact model parameters, see Supplementary Section S1.3 and Supplementary Table S4.
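
As an illustration of the gap criterion, the sketch below cuts a sequence into windows of 21 384 bp and marks those consisting only of Ns. The function name is hypothetical, and how the actual pipeline pads or treats a shorter final window is not specified in this text; here it is simply checked as-is.

```python
SUBSEQUENCE_LENGTH = 21_384  # subsequence length in bp, as stated in the text


def gap_subsequence_mask(sequence, length=SUBSEQUENCE_LENGTH):
    """Return one boolean per subsequence: True if it contains only Ns."""
    mask = []
    for start in range(0, len(sequence), length):
        chunk = sequence[start:start + length]
        # A gap subsequence has "N" as its only distinct character.
        mask.append(set(chunk.upper()) == {"N"})
    return mask
```

Subsequences marked `True` would be dropped before training or evaluation, while subsequences that contain some Ns alongside other bases are kept.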
\ No newline at end of file