diff --git a/README.md b/README.md index 03a392827a909ea8d72dc311752ab70d26697633..2890b339e9e6540b2211818f60316afa0222ae84 100644 --- a/README.md +++ b/README.md @@ -21,13 +21,14 @@ This project was mainly conducted at the [Institute of Plant Biochemistry at HHU - https://orcid.org/0000-0002-9134-6511 - Benjamin Stich - https://orcid.org/0000-0001-6791-8068 -- Steven Kelly +- Steven Kelly - https://orcid.org/0000-0003-1250-7055 - Andreas P.M.Weber - https://orcid.org/0000-0003-0970-4672 ## Abstract + C3-C4 intermediate photosynthesis has evolved at least five times convergently in the Brassicaceae, despite this family lacking _bona fide_ C4 species. The establishment of this carbon concentrating mechanism is known to require a complex suite of ultrastructural modifications as well as changes in spatial expression patterns, @@ -45,3 +46,159 @@ upstream region which we conclude to be responsible for causing the spatial shif Our findings hint at a pivotal role of TEs in the evolution of C3-C4 intermediacy, especially in mediating differential spatial gene expression. + +## Breakdown of the computation analysis +## Creating structural gene annotation with Helixer + +This was performed at the commits Helixer bb840b4, GeenuFF 1f6cffb, and +HelixerPost 08c6215 + +Updates to the code since mean that this is not the _recommended_ way anymore, +but uncommented code in the scripts are provided exactly as used for accuracy. + +Comments have been added to indicate where scripts or commands would need to be +changed to run with more current (e.g. v0.3) versions of the code. +And for clarity / explanation. + +#### Structure + +The scripts assume the input is provided in the following format + +``` +raw/<researcher>/ +├── b_gravinae +│  └── b_gravinae.fasta +├── b_juncea +│  └── b_juncea.fasta +├── b_napus +│  └── b_napus.fasta +├── b_nigra +│  └── b_nigra.fasta +... +``` + +The final gff3 output + log files will then end up here + +``` +helixer_post/<researcher>/ +├── b_gravinae +│  ├── b_gravinae.gff3 +│  ├── hp.err +│  └── hp.out +├── b_juncea +│  ├── b_juncea.gff3 +│  ├── hp.err +│  └── hp.out +├── b_napus +│  ├── b_napus.gff3 +│  ├── hp.err +│  └── hp.out +├── b_nigra +│  ├── b_nigra.gff3 +│  ├── hp.err +│  └── hp.out +... +``` + +(example excerpt only) + +#### Annotating + +##### generate numeric encoding of genome sequence +This step takes the CATGs from the fasta file, and encodes +them as 1-hot (except ambiguity characters) numeric +vectors for inputting into our network. + + +example to run one sequence +``` +bash toh5.sh raw/<researcher>/b_gravinae/b_gravinae.fasta +``` + +#### raw base-wise predictions of genic class & phase +This part has to run on the GPU, and it was easiest +to do so for the number of species used with nni, +which uses the files 'config.yml' and 'search_space.json'. + +###### prep +###### Acquire the trained model. +E.g. +``` +wget https://uni-duesseldorf.sciebo.de/s/C68s4YLv5ZqqXus/download +mv download land_plant_v0.3_m_0100.h5 +``` +(for clarity note that `land_plant_v0.3_m_0100.h5` and `fullmoon_211117_17.h5` +are two names for the same model) + +###### prep search\_space.json +This file requires full paths to the model and the h5 files created +above to be set exactly for the machine in question. The provided +file is an example only. + +##### Start all predictions +``` +export hppath=<path/to/repository>/Helixer +nnictl create -c config.yml +``` + +which then generates a folder `$HOME/nni-experiments/<NNI-ID>` +with the results. Each species in search_space.json, +will be in a different trial folder: `$HOME/nni-experiments/<NNI-ID>/trials/<TRIAL-ID>` + +##### post processing into final predictions + +run once for each trial ID / species + +``` +bash helixer_post.sh $HOME/nni-experiments/<NNI-ID>/trials/<TRIAL-ID> +``` + +And you're done, this should create the gff3 files, e.g. + +`helixer_post/<researcher>/b_gravinae/b_gravinae.gff3` + + +#### Recommendation. +This three step process made sense when running the previous version +of the code and still does for running many genomes with unbalanced +GPU vs CPU availability. +However, to run on a single genome and also just to take advantage of usability +improvements, the above could now be accomplished for any single genome +as shown below using b\_granvinae as an example (structure simplified). + +``` +Helixer.py --fasta-path b_gravinae.fasta \ + --gff-output-path b_gravinae.gff3 --species b_gravinae + --overlap-core-length=53460 --overlap-offset=13365 + --lineage land_plant +``` + +This method additionally provides exact instructions on how to download the +best available model for the lineage, if not already present. + +## performing TE annotation using EDTA +run EDTA using something like: <br /> +```bash +perl EDTA.pl --genome <genome> --anno 1 --sensitive 1 --overwrite 0 +``` + +EDTA will produce multiple outputs: +- for FRAGMENTED TEs use: .fasta.mod.EDTA.TEanno.gff3 file +- for INTACT TEs use: .fasta.mod.EDTA.intact.gff3 file +make sure to delete headers starting with ### in the respecrtive files, they can't be read by numpy! + +### Fig. 2: EDTA results analysis: +The [code for the results presented in Fig.2](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig2_genome_size.ipynb) contains a basic analysis of the `EDTA` results. The genome size were hard-coded from the amount of bases in the respective genome .fasta files. It was distinguished between the "FRAGMENTED" and "INTACT" outputs of `EDTA`. + +### Fig.3: TE classes +The [code for Fig.3](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig3_TE_types.ipynb) contains a breakdown of the TE classes as analyzed in `EDTA`. The lengths of the TEs (as numbers of base pairs) were counted and compared. + +### Fig.4: LTR age calculation +For the [LTR age calculation as presented in Fig.4](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig4_LTR_age.ipynb) LTR-TEs were extracted from the `EDTA` results, sorted by photosynthesis phenotype and visualized. Furthermore, statistical parameters were calculated. + +### Fig.5 f: TE-gene association: +The [TE-gene association analysis](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig5_TE_gene_association.py) was conducted using the .gff3 annotation files from `Helixer` and `EDTA`. Only the INTACT TEs predicted by `EDTA` were used, the FRAGMENTED were ignored to reduce the amount of false-positive hits. For each contigs, it was check if a TE was starting/ending in a gene, residing inside a gene, spanning a gene or residing up- or downstream of a gene. Strand specificity was considered. Results were written to a `.tsv` file and visualized using [this code](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig5_stackedbar.ipynb). <br> +Single genes were visualized using [this code snippet](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig6_single_gene_visualization.ipynb). <br> +Statistics were performed using [this code](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Tab1_statistics.ipynb). + +