Updated README.md

c969af21 · Sebastian Triesch · 632c4be4 · c969af21
Commit c969af21 authored 1 year ago by Sebastian Triesch
--- a/README.md
+++ b/README.md
@@ -21,13 +21,14 @@ This project was mainly conducted at the [Institute of Plant Biochemistry at HHU
 	- https://orcid.org/0000-0002-9134-6511
 - Benjamin Stich 
 	- https://orcid.org/0000-0001-6791-8068
- Steven Kelly
+- Steven Kelly 
 	- https://orcid.org/0000-0003-1250-7055
 - Andreas P.M.Weber 
 	- https://orcid.org/0000-0003-0970-4672


 ## Abstract
+
 C3-C4 intermediate photosynthesis has evolved at least five times convergently in the Brassicaceae, despite
 this family lacking _bona fide_ C4 species. The establishment of this carbon concentrating mechanism is known
 to require a complex suite of ultrastructural modifications as well as changes in spatial expression patterns,
@@ -45,3 +46,159 @@ upstream region which we conclude to be responsible for causing the spatial shif
 Our findings hint at a pivotal role of TEs in the evolution of C3-C4 intermediacy, especially in mediating
 differential spatial gene expression.

+
+## Breakdown of the computation analysis
+## Creating structural gene annotation with Helixer
+
+This was performed at the commits Helixer bb840b4, GeenuFF 1f6cffb, and 
+HelixerPost 08c6215
+
+Updates to the code since mean that this is not the _recommended_ way anymore,
+but uncommented code in the scripts are provided exactly as used for accuracy.
+
+Comments have been added to indicate where scripts or commands would need to be 
+changed to run with more current (e.g. v0.3) versions of the code.
+And for clarity / explanation.
+
+#### Structure
+
+The scripts assume the input is provided in the following format
+
+```
+raw/<researcher>/
+├── b_gravinae
+│   └── b_gravinae.fasta
+├── b_juncea
+│   └── b_juncea.fasta
+├── b_napus
+│   └── b_napus.fasta
+├── b_nigra
+│   └── b_nigra.fasta
+...
+```
+
+The final gff3 output + log files will then end up here
+
+```
+helixer_post/<researcher>/
+├── b_gravinae
+│   ├── b_gravinae.gff3
+│   ├── hp.err
+│   └── hp.out
+├── b_juncea
+│   ├── b_juncea.gff3
+│   ├── hp.err
+│   └── hp.out
+├── b_napus
+│   ├── b_napus.gff3
+│   ├── hp.err
+│   └── hp.out
+├── b_nigra
+│   ├── b_nigra.gff3
+│   ├── hp.err
+│   └── hp.out
+...
+```
+
+(example excerpt only)
+
+#### Annotating
+
+##### generate numeric encoding of genome sequence
+This step takes the CATGs from the fasta file, and encodes
+them as 1-hot (except ambiguity characters) numeric
+vectors for inputting into our network.
+
+
+example to run one sequence
+```
+bash toh5.sh raw/<researcher>/b_gravinae/b_gravinae.fasta
+```
+
+#### raw base-wise predictions of genic class & phase
+This part has to run on the GPU, and it was easiest
+to do so for the number of species used with nni,
+which uses the files 'config.yml' and 'search_space.json'.
+
+###### prep
+###### Acquire the trained model.
+E.g. 
+```
+wget https://uni-duesseldorf.sciebo.de/s/C68s4YLv5ZqqXus/download
+mv download land_plant_v0.3_m_0100.h5
+```
+(for clarity note that `land_plant_v0.3_m_0100.h5` and `fullmoon_211117_17.h5`
+are two names for the same model)
+
+###### prep search\_space.json
+This file requires full paths to the model and the h5 files created
+above to be set exactly for the machine in question. The provided
+file is an example only.
+
+##### Start all predictions
+```
+export hppath=<path/to/repository>/Helixer
+nnictl create -c config.yml
+```
+
+which then generates a folder `$HOME/nni-experiments/<NNI-ID>`
+with the results. Each species in search_space.json,
+will be in a different trial folder: `$HOME/nni-experiments/<NNI-ID>/trials/<TRIAL-ID>`
+
+##### post processing into final predictions
+
+run once for each trial ID / species
+
+```
+bash helixer_post.sh $HOME/nni-experiments/<NNI-ID>/trials/<TRIAL-ID>
+```
+
+And you're done, this should create the gff3 files, e.g. 
+
+`helixer_post/<researcher>/b_gravinae/b_gravinae.gff3`
+
+
+#### Recommendation.
+This three step process made sense when running the previous version
+of the code and still does for running many genomes with unbalanced 
+GPU vs CPU availability.
+However, to run on a single genome and also just to take advantage of usability
+improvements, the above could now be accomplished for any single genome
+as shown below using b\_granvinae as an example (structure simplified).
+
+```
+Helixer.py --fasta-path b_gravinae.fasta \
+    --gff-output-path b_gravinae.gff3 --species b_gravinae
+    --overlap-core-length=53460 --overlap-offset=13365
+    --lineage land_plant
+```
+
+This method additionally provides exact instructions on how to download the 
+best available model for the lineage, if not already present. 
+
+## performing TE annotation using EDTA
+run EDTA using something like: <br />
+```bash
+perl EDTA.pl --genome <genome> --anno 1 --sensitive 1 --overwrite 0
+```
+
+EDTA will produce multiple outputs:
+- for FRAGMENTED TEs use: .fasta.mod.EDTA.TEanno.gff3 file
+- for INTACT TEs use: .fasta.mod.EDTA.intact.gff3 file
+make sure to delete headers starting with ### in the respecrtive files, they can't be read by numpy!
+
+### Fig. 2: EDTA results analysis:
+The [code for the results presented in Fig.2](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig2_genome_size.ipynb) contains a basic analysis of the `EDTA` results. The genome size were hard-coded from the amount of bases in the respective genome .fasta files. It was distinguished between the "FRAGMENTED" and "INTACT" outputs of `EDTA`.
+
+### Fig.3: TE classes
+The [code for Fig.3](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig3_TE_types.ipynb) contains a breakdown of the TE classes as analyzed in `EDTA`. The lengths of the TEs (as numbers of base pairs) were counted and compared.
+
+### Fig.4: LTR age calculation
+For the [LTR age calculation as presented in Fig.4](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig4_LTR_age.ipynb) LTR-TEs were extracted from the `EDTA` results, sorted by photosynthesis phenotype and visualized. Furthermore, statistical parameters were calculated.
+
+### Fig.5 f: TE-gene association:
+The [TE-gene association analysis](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig5_TE_gene_association.py) was conducted using the .gff3 annotation files from `Helixer` and `EDTA`. Only the INTACT TEs predicted by `EDTA` were used, the FRAGMENTED were ignored to reduce the amount of false-positive hits. For each contigs, it was check if a TE was starting/ending in a gene, residing inside a gene, spanning a gene or residing up- or downstream of a gene. Strand specificity was considered. Results were written to a `.tsv` file and visualized using [this code](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig5_stackedbar.ipynb). <br>
+Single genes were visualized using [this code snippet](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig6_single_gene_visualization.ipynb). <br>
+Statistics were performed using [this code](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Tab1_statistics.ipynb).   
+
+