Skip to content
Snippets Groups Projects
Commit c969af21 authored by Sebastian Triesch's avatar Sebastian Triesch
Browse files

Updated README.md

parent 632c4be4
No related branches found
No related tags found
No related merge requests found
......@@ -21,13 +21,14 @@ This project was mainly conducted at the [Institute of Plant Biochemistry at HHU
- https://orcid.org/0000-0002-9134-6511
- Benjamin Stich
- https://orcid.org/0000-0001-6791-8068
- Steven Kelly
- Steven Kelly
- https://orcid.org/0000-0003-1250-7055
- Andreas P.M.Weber
- https://orcid.org/0000-0003-0970-4672
## Abstract
C3-C4 intermediate photosynthesis has evolved at least five times convergently in the Brassicaceae, despite
this family lacking _bona fide_ C4 species. The establishment of this carbon concentrating mechanism is known
to require a complex suite of ultrastructural modifications as well as changes in spatial expression patterns,
......@@ -45,3 +46,159 @@ upstream region which we conclude to be responsible for causing the spatial shif
Our findings hint at a pivotal role of TEs in the evolution of C3-C4 intermediacy, especially in mediating
differential spatial gene expression.
## Breakdown of the computation analysis
## Creating structural gene annotation with Helixer
This was performed at the commits Helixer bb840b4, GeenuFF 1f6cffb, and
HelixerPost 08c6215
Updates to the code since mean that this is not the _recommended_ way anymore,
but uncommented code in the scripts are provided exactly as used for accuracy.
Comments have been added to indicate where scripts or commands would need to be
changed to run with more current (e.g. v0.3) versions of the code.
And for clarity / explanation.
#### Structure
The scripts assume the input is provided in the following format
```
raw/<researcher>/
├── b_gravinae
│ └── b_gravinae.fasta
├── b_juncea
│ └── b_juncea.fasta
├── b_napus
│ └── b_napus.fasta
├── b_nigra
│ └── b_nigra.fasta
...
```
The final gff3 output + log files will then end up here
```
helixer_post/<researcher>/
├── b_gravinae
│ ├── b_gravinae.gff3
│ ├── hp.err
│ └── hp.out
├── b_juncea
│ ├── b_juncea.gff3
│ ├── hp.err
│ └── hp.out
├── b_napus
│ ├── b_napus.gff3
│ ├── hp.err
│ └── hp.out
├── b_nigra
│ ├── b_nigra.gff3
│ ├── hp.err
│ └── hp.out
...
```
(example excerpt only)
#### Annotating
##### generate numeric encoding of genome sequence
This step takes the CATGs from the fasta file, and encodes
them as 1-hot (except ambiguity characters) numeric
vectors for inputting into our network.
example to run one sequence
```
bash toh5.sh raw/<researcher>/b_gravinae/b_gravinae.fasta
```
#### raw base-wise predictions of genic class & phase
This part has to run on the GPU, and it was easiest
to do so for the number of species used with nni,
which uses the files 'config.yml' and 'search_space.json'.
###### prep
###### Acquire the trained model.
E.g.
```
wget https://uni-duesseldorf.sciebo.de/s/C68s4YLv5ZqqXus/download
mv download land_plant_v0.3_m_0100.h5
```
(for clarity note that `land_plant_v0.3_m_0100.h5` and `fullmoon_211117_17.h5`
are two names for the same model)
###### prep search\_space.json
This file requires full paths to the model and the h5 files created
above to be set exactly for the machine in question. The provided
file is an example only.
##### Start all predictions
```
export hppath=<path/to/repository>/Helixer
nnictl create -c config.yml
```
which then generates a folder `$HOME/nni-experiments/<NNI-ID>`
with the results. Each species in search_space.json,
will be in a different trial folder: `$HOME/nni-experiments/<NNI-ID>/trials/<TRIAL-ID>`
##### post processing into final predictions
run once for each trial ID / species
```
bash helixer_post.sh $HOME/nni-experiments/<NNI-ID>/trials/<TRIAL-ID>
```
And you're done, this should create the gff3 files, e.g.
`helixer_post/<researcher>/b_gravinae/b_gravinae.gff3`
#### Recommendation.
This three step process made sense when running the previous version
of the code and still does for running many genomes with unbalanced
GPU vs CPU availability.
However, to run on a single genome and also just to take advantage of usability
improvements, the above could now be accomplished for any single genome
as shown below using b\_granvinae as an example (structure simplified).
```
Helixer.py --fasta-path b_gravinae.fasta \
--gff-output-path b_gravinae.gff3 --species b_gravinae
--overlap-core-length=53460 --overlap-offset=13365
--lineage land_plant
```
This method additionally provides exact instructions on how to download the
best available model for the lineage, if not already present.
## performing TE annotation using EDTA
run EDTA using something like: <br />
```bash
perl EDTA.pl --genome <genome> --anno 1 --sensitive 1 --overwrite 0
```
EDTA will produce multiple outputs:
- for FRAGMENTED TEs use: .fasta.mod.EDTA.TEanno.gff3 file
- for INTACT TEs use: .fasta.mod.EDTA.intact.gff3 file
make sure to delete headers starting with ### in the respecrtive files, they can't be read by numpy!
### Fig. 2: EDTA results analysis:
The [code for the results presented in Fig.2](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig2_genome_size.ipynb) contains a basic analysis of the `EDTA` results. The genome size were hard-coded from the amount of bases in the respective genome .fasta files. It was distinguished between the "FRAGMENTED" and "INTACT" outputs of `EDTA`.
### Fig.3: TE classes
The [code for Fig.3](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig3_TE_types.ipynb) contains a breakdown of the TE classes as analyzed in `EDTA`. The lengths of the TEs (as numbers of base pairs) were counted and compared.
### Fig.4: LTR age calculation
For the [LTR age calculation as presented in Fig.4](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig4_LTR_age.ipynb) LTR-TEs were extracted from the `EDTA` results, sorted by photosynthesis phenotype and visualized. Furthermore, statistical parameters were calculated.
### Fig.5 f: TE-gene association:
The [TE-gene association analysis](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig5_TE_gene_association.py) was conducted using the .gff3 annotation files from `Helixer` and `EDTA`. Only the INTACT TEs predicted by `EDTA` were used, the FRAGMENTED were ignored to reduce the amount of false-positive hits. For each contigs, it was check if a TE was starting/ending in a gene, residing inside a gene, spanning a gene or residing up- or downstream of a gene. Strand specificity was considered. Results were written to a `.tsv` file and visualized using [this code](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig5_stackedbar.ipynb). <br>
Single genes were visualized using [this code snippet](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Fig6_single_gene_visualization.ipynb). <br>
Statistics were performed using [this code](https://git.nfdi4plants.org/hhu-plant-biochemistry/triesch2023_brassicaceae_transposons/-/blob/main/workflows/Tab1_statistics.ipynb).
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment