add data study and protocols

Commit cf650487 authored 6 months ago by Viktoria Petrova
--- a/isa.investigation.xlsx
+++ b/isa.investigation.xlsx
--- a/studies/Data/README.md
+++ b/studies/Data/README.md
--- a/studies/Data/isa.study.xlsx
+++ b/studies/Data/isa.study.xlsx
--- a/studies/Data/protocols/.gitkeep
+++ b/studies/Data/protocols/.gitkeep
--- a/studies/Data/protocols/DataOverviewAndPreprocessing.md
+++ b/studies/Data/protocols/DataOverviewAndPreprocessing.md
+## Data overview and preprocessing
+The entire dataset consisted of 25 plant genomes. Genome-wide ATAC-seq data was publicly available for 17 of them and genome-wide ChIP-seq (H3K4me3) data for 21 of them (see Table 1 and Supplementary Table S2). A wide variety of tissues and treatments was used in these ATAC- and ChIP-seq experiments; they are listed in Supplementary Table S3. The NGS data was downloaded from the Sequence Read Archive (SRA) using the SRA-Toolkit 3.0.0 (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). The reads were trimmed with Trimmomatic 0.36 (Bolger et al. 2014) and quality controlled using FastQC 0.11.9 (Andrews 2010) and MultiQC (Ewels et al. 2016). Reads that passed quality control were mapped to the reference genome using BWA 2.1 (Md et al. 2019). Conversion to BAM files was performed using SAMtools 1.6 (Danecek et al. 2021). The Picard Toolkit (Broad Institute ed 2019) was used to mark duplicates. Duplicates, unmapped reads, non-primary alignments, and reads not passing platform quality checks were then removed with SAMtools. Plots for quality control were generated using deepTools 3.5.3 (Ramírez et al. 2016), and the necessary genome annotations were generated using Helixer v.0.3.1 (Stiehler et al. 2021, Holst et al. 2023). ATAC-seq data was deemed of high enough quality if the average coverage enrichment ±3 kbp around the TSS showed the expected peak and the average peak read coverage was at least 2.5 times the background coverage. The quality control for ChIP-seq data used the same criteria. Detailed documentation of the data preprocessing is available at: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/data_preprocessing.md. The plant genome FASTA files and final NGS data BAM files were converted to h5 files using Helixer (Stiehler et al. 2021, Holst et al. 2023).
+The ATAC-seq reads were shifted +4 bp on the positive strand and −5 bp on the negative strand to adjust the read start sites to represent the center of the transposon binding site (Buenrostro et al. 2013). Detailed documentation of the h5 file creation and architecture is available at: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/h5_files.md.
+The species used in the development of Predmoter are separated into the four domains algae, mosses, monocots, and dicots. The availability and usage of the species dataset for ATAC- or ChIP-seq is indicated by a check mark.
\ No newline at end of file
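Review note on `DataOverviewAndPreprocessing.md`: the protocol states two numeric rules, the strand-specific ATAC-seq read shift (+4 bp / −5 bp) and the 2.5-fold peak-to-background coverage criterion. A minimal Python sketch of both follows. The function names, the 1D coordinate convention, and the way coverage values are passed in are assumptions made for illustration; they are not taken from Predmoter or Helixer, and the second function covers only the ratio part of the criterion, not the check that the expected peak shape is present.

```python
def shift_atac_read_start(five_prime_pos: int, strand: str) -> int:
    """Shift a read's 5' position as described in the protocol.

    Hypothetical helper: +4 bp on the positive strand, -5 bp on the
    negative strand, so the position represents the center of the
    transposon binding site.
    """
    if strand == "+":
        return five_prime_pos + 4
    if strand == "-":
        return five_prime_pos - 5
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


def passes_coverage_criterion(
    peak_coverage: float,
    background_coverage: float,
    min_ratio: float = 2.5,
) -> bool:
    """Check whether average peak coverage is at least min_ratio times background.

    Hypothetical helper: how the two averages are computed around the TSS
    is left to the caller.
    """
    if background_coverage <= 0:
        raise ValueError("background coverage must be positive")
    return peak_coverage >= min_ratio * background_coverage
```

For example, `shift_atac_read_start(100, "-")` moves the position 5 bp upstream in reference coordinates, and `passes_coverage_criterion(4.9, 2.0)` fails because 4.9 is below 2.5 × 2.0.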
--- a/studies/Data/protocols/FilteringFlaggedSequences.md
+++ b/studies/Data/protocols/FilteringFlaggedSequences.md
+## Filtering flagged sequences
+A naïve filtering approach was used to reduce the noise in the dataset. The ATAC-seq data showed high coverage for non-nuclear sequences. The transposase primarily cuts open chromatin (Buenrostro et al. 2013) and therefore also cuts the chloroplast and mitochondrial genomes. When the organelles were not completely removed before the experiment, the data contained noise in the form of notably higher coverage in these regions. Unplaced scaffolds were also observed to contribute to this noise during the data quality control steps (Fig. 1a). Therefore, unplaced scaffolds and non-nuclear sequences were flagged during later development stages (see Section 2.2 and Tables 2 and 3). Assemblies at scaffold or contig level (*Bigelowiella natans, Eragrostis nindensis, Marchantia polymorpha, Oropetium thomaeum, Pyrus x bretschneideri*, and *Spirodela polyrhiza*) were not flagged. The flagged sequences were filtered out (Fig. 1b). The assembly accessions of the unplaced scaffolds and non-nuclear sequences were extracted from the sequence report jsonl files available at the NCBI’s RefSeq or GenBank and added to the h5 file (under “data/blacklist”) via add_blacklist.py in “side_scripts”. The flagged sequences made up around 7% of all genome assemblies used, not counting assemblies at scaffold or contig level.
\ No newline at end of file
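Review note on `FilteringFlaggedSequences.md`: the flagging step described above can be sketched roughly as below. This is not the logic of `add_blacklist.py`. The JSON field names (`role`, `assignedMoleculeLocationType`, `genbankAccession`), the label values, and the example records with made-up accessions are assumptions about what a sequence report jsonl file might contain and should be checked against a real report before use.

```python
import json

# Assumed labels for non-nuclear sequences and unplaced scaffolds.
NON_NUCLEAR_LOCATIONS = {"Chloroplast", "Mitochondrion"}
UNPLACED_ROLE = "unplaced-scaffold"


def flagged_accessions(jsonl_text: str) -> set:
    """Collect accessions of unplaced scaffolds and non-nuclear sequences."""
    flagged = set()
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines between records
        record = json.loads(line)
        is_unplaced = record.get("role") == UNPLACED_ROLE
        is_non_nuclear = (
            record.get("assignedMoleculeLocationType") in NON_NUCLEAR_LOCATIONS
        )
        if is_unplaced or is_non_nuclear:
            flagged.add(record["genbankAccession"])
    return flagged


# Made-up records for illustration only; the accessions are placeholders.
EXAMPLE_REPORT = "\n".join(
    json.dumps(record)
    for record in [
        {"genbankAccession": "ACC1", "role": "assembled-molecule",
         "assignedMoleculeLocationType": "Chromosome"},
        {"genbankAccession": "ACC2", "role": "assembled-molecule",
         "assignedMoleculeLocationType": "Chloroplast"},
        {"genbankAccession": "ACC3", "role": "unplaced-scaffold",
         "assignedMoleculeLocationType": "Chromosome"},
    ]
)

# Sequences whose accession is in this set would be filtered out.
BLACKLIST = flagged_accessions(EXAMPLE_REPORT)
```

In this sketch the chromosome `ACC1` is kept, while the chloroplast sequence `ACC2` and the unplaced scaffold `ACC3` end up in `BLACKLIST`.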
--- a/studies/Data/resources/.gitkeep
+++ b/studies/Data/resources/.gitkeep