Data overview and preprocessing
The dataset comprised 25 plant genomes. Genome-wide ATAC-seq data was publicly available for 17 of them and genome-wide ChIP-seq (H3K4me3) data for 21 of them (see Table 1 and Supplementary Table S2). The ATAC- and ChIP-seq experiments covered a wide variety of tissues and treatments, which are listed in Supplementary Table S3. The NGS data was downloaded from the Sequence Read Archive (SRA) using the SRA-Toolkit 3.0.0 (https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit). The reads were trimmed with Trimmomatic 0.36 (Bolger et al. 2014) and quality controlled using FastQC 0.11.9 (Andrews 2010) and MultiQC (Ewels et al. 2016). Reads that passed quality control were mapped to the reference genome using BWA 2.1 (Md et al. 2019). The alignments were converted to BAM files using SAMtools 1.6 (Danecek et al. 2021). Duplicates were marked with the Picard Toolkit (Broad Institute ed 2019). Duplicates, unmapped reads, non-primary alignments, and reads not passing platform quality checks were then removed with SAMtools. Plots for quality control were generated using deepTools 3.5.3 (Ramírez et al. 2016), and the necessary genome annotations were generated using Helixer v.0.3.1 (Stiehler et al. 2021, Holst et al. 2023). ATAC-seq data was deemed of sufficient quality if the average coverage enrichment within ±3 kbp of the transcription start site (TSS) showed the expected peak and the average peak read coverage was at least 2.5 times the background coverage. The same criteria were applied to the ChIP-seq data. Detailed documentation of the data preprocessing is available at: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/data_preprocessing.md. The plant genome FASTA files and the final NGS data BAM files were converted to h5 files using Helixer (Stiehler et al. 2021, Holst et al. 2023).
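Two of the steps above, the read filtering and the 2.5-fold enrichment criterion, can be illustrated with a minimal Python sketch. It is not the pipeline's actual code. The flag bits are those defined in the SAM format specification; treating "non-primary" as secondary alignments only (leaving supplementary alignments, 0x800, untouched) is an assumption. The function names, the `flank` parameter, and the definitions of "peak" (maximum of the averaged profile) and "background" (mean of the outermost positions of the ±3 kbp window) are likewise hypothetical choices made for illustration.

```python
from statistics import mean

# SAM flag bits as defined in the SAM format specification.
UNMAPPED = 0x4     # read is unmapped
SECONDARY = 0x100  # non-primary (secondary) alignment
QC_FAIL = 0x200    # read failed platform/vendor quality checks
DUPLICATE = 0x400  # read marked as PCR or optical duplicate

# Combined exclusion mask. Assumption: supplementary alignments (0x800)
# are not part of the filter.
EXCLUDE_MASK = UNMAPPED | SECONDARY | QC_FAIL | DUPLICATE


def keep_read(flag: int) -> bool:
    """Return True if none of the excluded flag bits is set."""
    return (flag & EXCLUDE_MASK) == 0


def tss_enrichment_ratio(profile, flank=100):
    """Ratio of peak coverage to background coverage.

    `profile` is the average read coverage per position across the
    window around the TSS. Hypothetical definitions: the peak is the
    maximum of the profile, the background is the mean of the `flank`
    outermost positions on each side of the window.
    """
    values = list(profile)
    if len(values) <= 2 * flank:
        raise ValueError("profile is too short for the chosen flank size")
    background = mean(values[:flank] + values[-flank:])
    if background == 0:
        raise ValueError("background coverage is zero")
    return max(values) / background


def passes_enrichment_check(profile, flank=100, min_ratio=2.5):
    """Apply the 2.5-fold threshold stated in the text."""
    return tss_enrichment_ratio(profile, flank) >= min_ratio
```

The combined mask equals 1796 in decimal, the value that would be passed to a flag-based exclusion filter. The ratio check covers only the numeric threshold; whether the profile shows the expected peak shape still has to be judged from the quality control plots.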
The ATAC-seq reads were shifted +4 bp on the positive strand and −5 bp on the negative strand to adjust the read start sites to represent the center of the transposon binding site (Buenrostro et al. 2013). Detailed documentation of the h5 file creation and architecture is available at: https://github.com/weberlab-hhu/Predmoter/blob/main/docs/h5_files.md.
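The strand-specific shift can be written out as a short Python sketch. This illustrates the arithmetic only and is not Helixer's implementation. The function name is hypothetical, and the coordinate convention (0-based, half-open intervals, with the 5' end of a negative-strand read at its rightmost aligned base) is an assumption.

```python
def shifted_read_start(start: int, end: int, is_reverse: bool) -> int:
    """Return the shifted 5' position of an aligned ATAC-seq read.

    Assumed convention: the alignment covers the 0-based, half-open
    interval [start, end) on the reference. A positive-strand read
    starts at `start`; a negative-strand read starts at `end - 1`.
    """
    if is_reverse:
        # Negative strand: move the read start 5 bp towards lower coordinates.
        return (end - 1) - 5
    # Positive strand: move the read start 4 bp towards higher coordinates.
    return start + 4


# A positive-strand read aligned to [100, 150) is counted at position 104;
# a negative-strand read aligned to the same interval is counted at 144.
forward_site = shifted_read_start(100, 150, is_reverse=False)
reverse_site = shifted_read_start(100, 150, is_reverse=True)
```

Under a 1-based coordinate system the same offsets apply, but the negative-strand read start would be `end` rather than `end - 1`.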
The species used in the development of Predmoter are separated into four domains: algae, mosses, monocots, and dicots. A check mark indicates that ATAC- or ChIP-seq data was available and used for the given species.