Merge branch 'ceplas-dm-viktoria' into 'main'

Ceplas dm viktoria See merge request !2

651833ba · Dominik Brilhaus · 855897bb · 8c4f654a · 651833ba · 651833ba
Commit 651833ba authored 6 months ago by Dominik Brilhaus
--- a/assays/Metrics/README.md
+++ b/assays/Metrics/README.md
--- a/assays/Metrics/dataset/.gitkeep
+++ b/assays/Metrics/dataset/.gitkeep
--- a/assays/Metrics/dataset/README.md
+++ b/assays/Metrics/dataset/README.md
--- a/assays/Metrics/dataset/vbae074f4.jpeg
+++ b/assays/Metrics/dataset/vbae074f4.jpeg
--- a/assays/Metrics/dataset/vbae074f5.jpeg
+++ b/assays/Metrics/dataset/vbae074f5.jpeg
--- a/assays/Metrics/dataset/vbae074f6.jpeg
+++ b/assays/Metrics/dataset/vbae074f6.jpeg
--- a/assays/Metrics/isa.assay.xlsx
+++ b/assays/Metrics/isa.assay.xlsx
--- a/assays/Metrics/protocols/.gitkeep
+++ b/assays/Metrics/protocols/.gitkeep
--- a/assays/Metrics/protocols/MetricsProtocol.md
+++ b/assays/Metrics/protocols/MetricsProtocol.md
+## Metrics
+Five metrics were used to evaluate model performance: the Poisson loss, the Pearson correlation coefficient (Pearson’s r), precision, recall, and F1.
+The most prominent peak caller for ChIP-seq data, MACS (Zhang et al. 2008), which has also frequently been used for ATAC-seq data (Hiranuma et al. 2017, Thibodeau et al. 2018, Hentges et al. 2022), assumes that the ChIP-seq coverage data is Poisson distributed. Therefore, PyTorch’s Poisson negative log likelihood loss function (Poisson loss) was used as the loss function for all models (Equation 1).
+(1) $$loss=\frac{1}{n} \sum_{i=1}^n \left( e^{x_i} - y_i \ast x_i \right)$$
+The individual samples of the predictions $(x)$ and the targets $(y)$ are indexed with $(i)$. The sample size is denoted with $(n)$ (https://pytorch.org/docs/stable/generated/torch.nn.PoissonNLLLoss.html). This version of the Poisson loss caused the network to output logarithmic predictions. The desired, actual predictions were thus the exponential of the network’s output. The exponential function only yields positive real numbers, which suits the non-negative ATAC- and ChIP-seq read coverage.
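As an illustration of Equation 1, the following is a minimal plain-Python sketch, not the PyTorch implementation used for the models. The function names `poisson_nll_loss` and `to_coverage` and the example values are hypothetical. It assumes the network outputs are in log space, as described above.

```python
import math


def poisson_nll_loss(log_predictions, targets):
    """Mean Poisson negative log likelihood for log-space predictions (Equation 1)."""
    if len(log_predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    n = len(targets)
    # Each term is exp(x_i) - y_i * x_i; the sum is averaged over n samples.
    return sum(math.exp(x) - y * x for x, y in zip(log_predictions, targets)) / n


def to_coverage(log_predictions):
    """Convert log-space network outputs into coverage predictions."""
    return [math.exp(x) for x in log_predictions]


# Hypothetical example: log-space outputs that match the targets exactly.
targets = [1.0, 2.0, 5.0]
log_preds = [math.log(y) for y in targets]

loss_at_optimum = poisson_nll_loss(log_preds, targets)
loss_shifted = poisson_nll_loss([x + 0.5 for x in log_preds], targets)
coverage = to_coverage(log_preds)
```

In this sketch the loss is lowest when the exponentiated outputs equal the targets, so `loss_at_optimum` is smaller than `loss_shifted`, and every value in `coverage` is strictly positive.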
+To measure the “accuracy” of the model’s predictions, i.e. to translate the Poisson loss into a human-readable number, Pearson’s r was chosen (Equation 2), which measures the linear correlation between two variables.
+(2) $$r=\frac{\sum_{i=1}^n (x_i - \overline{x}) (y_i - \overline{y})}{\sqrt{\sum_{i=1}^n (x_i - \overline{x})^2 \sum_{i=1}^n (y_i - \overline{y})^2} + \epsilon}$$
+The sample size is denoted with $n$, the individual samples of the predictions $(x)$ and the targets $(y)$ are indexed with $i$. The additional epsilon $(\epsilon)$ equals 1e-8 and is used to avoid a division by zero. A value of 1 represents a perfect positive linear relationship, as would be the case if Predmoter’s predictions and the experimental NGS coverage data were identical. A value of 0 means no linear relationship between the predictions and targets. Finally, a value of −1 represents a perfect negative linear relationship.
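Equation 2 can be sketched in plain Python as follows. This is an illustrative sketch only: the function name `pearson_r` and the example inputs are hypothetical and not taken from Predmoter’s code. The epsilon of 1e-8 is added to the denominator as described above.

```python
import math

EPSILON = 1e-8  # value stated in the text; avoids division by zero


def pearson_r(predictions, targets, eps=EPSILON):
    """Pearson correlation coefficient with epsilon in the denominator (Equation 2)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    n = len(predictions)
    mean_x = sum(predictions) / n
    mean_y = sum(targets) / n
    # Numerator: sum of the products of the deviations from the means.
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictions, targets))
    # Denominator: square root of the product of the summed squared deviations.
    sum_sq_x = sum((x - mean_x) ** 2 for x in predictions)
    sum_sq_y = sum((y - mean_y) ** 2 for y in targets)
    return numerator / (math.sqrt(sum_sq_x * sum_sq_y) + eps)


r_positive = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_negative = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
r_constant = pearson_r([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])
```

Because of the epsilon, a perfect linear relationship gives a value marginally below 1 in magnitude, and constant predictions give 0 instead of raising a division error.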
+Precision, recall, and F1 were used to compare predicted peaks to experimental peaks for both test species (Equations 3–5). An F1 score of 1 indicates that the predicted peaks are at the same position as the experimental peaks. The lowest score possible is 0. Precision, recall, and F1 were calculated base-wise. Called peaks were denoted with 1, all other base pairs with 0. A confusion matrix containing the sum of True Positives (TP), False Positives (FP), and False Negatives (FN) for the two classes, peak and no peak, was computed for the average predicted coverage of both strands. Precision and recall were also used to plot precision-recall curves (PRC). The area under the precision-recall curve (AUPRC) was calculated using scikit-learn (Pedregosa et al. 2011). Flagged sequences were excluded from the calculations (see Section 2.1.2). The baseline AUPRC is equal to the fraction of positives, i.e. the percentage of peaks in the training set (Saito and Rehmsmeier 2015). The peak percentages were calculated using Predmoter’s compute_peak_f1.py script in “side_scripts”. The percentages are listed in Supplementary Table S8.
+(3) $$precision = \frac{TP}{TP+FP}$$
+(4) $$recall = \frac{TP}{TP+FN}$$
+(5) $$F_1 = 2 \ast \frac{precision \ast recall}{precision+recall}$$
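The base-wise calculation in Equations 3–5 can be sketched as below. This is not Predmoter’s compute_peak_f1.py; the function names and the example tracks are hypothetical. Returning 0 when a denominator is zero is an assumption of this sketch, not something the text specifies.

```python
def confusion_counts(predicted, experimental):
    """Count TP, FP, and FN base-wise; 1 marks a peak base, 0 any other base."""
    if len(predicted) != len(experimental):
        raise ValueError("both tracks must cover the same number of bases")
    tp = fp = fn = 0
    for p, e in zip(predicted, experimental):
        if p == 1 and e == 1:
            tp += 1
        elif p == 1 and e == 0:
            fp += 1
        elif p == 0 and e == 1:
            fn += 1
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Equations 3-5; zero denominators yield 0.0 (assumption of this sketch)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical 8-base example: three shared peak bases, one FP, one FN.
predicted_peaks = [0, 1, 1, 1, 0, 0, 1, 0]
experimental_peaks = [0, 1, 1, 0, 0, 1, 1, 0]

tp, fp, fn = confusion_counts(predicted_peaks, experimental_peaks)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

True Negatives are not counted because none of the three metrics uses them.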
\ No newline at end of file
--- a/assays/PeakCalling/README.md
+++ b/assays/PeakCalling/README.md
--- a/assays/PeakCalling/dataset/.gitkeep
+++ b/assays/PeakCalling/dataset/.gitkeep
--- a/assays/PeakCalling/dataset/README.md
+++ b/assays/PeakCalling/dataset/README.md
--- a/assays/PeakCalling/dataset/vbae074f7.jpeg
+++ b/assays/PeakCalling/dataset/vbae074f7.jpeg
--- a/assays/PeakCalling/isa.assay.xlsx
+++ b/assays/PeakCalling/isa.assay.xlsx
--- a/assays/PeakCalling/protocols/.gitkeep
+++ b/assays/PeakCalling/protocols/.gitkeep
--- a/assays/PeakCalling/protocols/PeakCallingProtocol.md
+++ b/assays/PeakCalling/protocols/PeakCallingProtocol.md
+## Peak calling
+Peak calling on predictions and the experimental data was performed with MACS3 (Zhang et al. 2008). The sample bam files of the experimental data per species and dataset were merged. Then peaks were called on the merged bam files with MACS3’s “callpeak” command. The parameters for calling ATAC-seq peaks were the BAMPE format, a q-value of 0.01, keeping all duplicates, using the background lambda as local lambda (“no-lambda”) and the ungapped genome size of the species’ genome assembly (see Supplementary Table S2) as mappable genome size. For ChIP-seq peak calling two parameters, broad and a broad cutoff of 0.1, were added. The chosen q-value was the default 0.05. The ChIP-seq peaks of the species *S. polyrhiza* and *Chlamydomonas reinhardtii* were called using the format BAM instead of BAMPE. MACS3’s “bdgpeakcall” was used to call peaks on the test species predictions in bedGraph file format. The parameters for peak calling were the same as those MACS3’s “callpeak” determined for the experimental data, i.e. for paired-end reads the minimum length and maximum gap were set to the predicted fragment size (Table 5). The cutoff value, i.e. the threshold of minimum read coverage required to call a peak, was estimated by plotting the average read coverage of predictions around the TSS (see Fig. 5b).
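The “callpeak” settings described above could be assembled into a command line roughly as follows. This sketch only builds the argument list and does not run MACS3. The function name, file name, output name, and genome size are hypothetical placeholders. The flag spellings (`-t`, `-f`, `-g`, `-n`, `-q`, `--keep-dup`, `--nolambda`, `--broad`, `--broad-cutoff`) reflect the MACS3 command-line interface as understood here and should be checked against `macs3 callpeak --help`. That ChIP-seq keeps the duplicate and lambda settings of ATAC-seq is an assumption drawn from the word “added” in the text.

```python
def callpeak_args(bam_path, genome_size, name, assay="ATAC", paired_end=True):
    """Build an argument list for `macs3 callpeak` (illustrative sketch only)."""
    args = [
        "macs3", "callpeak",
        "-t", bam_path,                          # merged BAM file of one species and dataset
        "-f", "BAMPE" if paired_end else "BAM",  # BAM for the two species noted in the text
        "-g", str(genome_size),                  # ungapped genome size of the assembly
        "-n", name,                              # output name prefix (hypothetical)
        "--keep-dup", "all",                     # keep all duplicates
        "--nolambda",                            # use background lambda as local lambda
    ]
    if assay == "ATAC":
        args += ["-q", "0.01"]
    elif assay == "ChIP":
        # q-value is left at the default of 0.05; broad peak settings are added.
        args += ["--broad", "--broad-cutoff", "0.1"]
    else:
        raise ValueError("assay must be 'ATAC' or 'ChIP'")
    return args


# Hypothetical file name and genome size, for illustration only.
atac_cmd = callpeak_args("merged_atac.bam", 100_000_000, "atac_example")
chip_cmd = callpeak_args("merged_chip.bam", 100_000_000, "chip_example",
                         assay="ChIP", paired_end=False)
```

The resulting lists could be passed to `subprocess.run` once the paths and genome size are replaced with real values.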
+Different cutoff values were also examined. For the ATAC-seq predictions of *A. thaliana*, cutoffs from 1 to 25 in steps of 1 were chosen. For the ATAC-seq predictions of *O. sativa*, a cutoff of 1 followed by cutoffs from 5 to 200 in steps of 5 were chosen. For the ChIP-seq predictions of both species, a cutoff of 1 followed by cutoffs from 5 to 100 in steps of 5 were chosen.
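The cutoff grids described above can be written out explicitly. The helper `cutoff_grid` and the variable names are hypothetical; the ranges are the ones stated in the text.

```python
def cutoff_grid(start, stop, step, include_one=False):
    """Return cutoffs from start to stop (inclusive), optionally preceded by 1."""
    grid = list(range(start, stop + 1, step))
    if include_one and 1 not in grid:
        grid = [1] + grid
    return grid


atac_a_thaliana = cutoff_grid(1, 25, 1)                   # 1, 2, ..., 25
atac_o_sativa = cutoff_grid(5, 200, 5, include_one=True)  # 1, 5, 10, ..., 200
chip_both_species = cutoff_grid(5, 100, 5, include_one=True)  # 1, 5, 10, ..., 100
```

This gives 25 cutoffs for the *A. thaliana* ATAC-seq predictions, 41 for the *O. sativa* ATAC-seq predictions, and 21 for the ChIP-seq predictions of each species.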
+The selected parameters of MACS3’s “bdgpeakcall” for each test species and dataset are listed.
\ No newline at end of file
--- a/assays/SpeciesSelection/README.md
+++ b/assays/SpeciesSelection/README.md
--- a/assays/SpeciesSelection/dataset/.gitkeep
+++ b/assays/SpeciesSelection/dataset/.gitkeep
--- a/assays/SpeciesSelection/dataset/README.md
+++ b/assays/SpeciesSelection/dataset/README.md
--- a/assays/SpeciesSelection/dataset/vbae074f3.jpeg
+++ b/assays/SpeciesSelection/dataset/vbae074f3.jpeg