From c956bfeee368e3c0c9f6df7c37457c651dadaa0d Mon Sep 17 00:00:00 2001 From: Atemia <j.atemia@fz-juelich.de> Date: Wed, 11 Dec 2024 21:59:49 +0100 Subject: [PATCH] update: centralized lab-book --- README.md | 434 +++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 432 insertions(+), 2 deletions(-) diff --git a/README.md b/README.md index 702764ff30..60cfd42631 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# TEOSINTE Project: Comparative Genomics of Maize's Wild Relatives (Teosinte) +# From Habitat to Genotype: The Complex Interplay of Climate, Phenotypes, and Taxonomy in Teosinte ## Description @@ -61,6 +61,436 @@ This project is funded by the German Ministry for Science and Education (BMBF 03 - preprocessing_data - snp_gene_neighborhood_pipeline - - The file [`GWAS_Research_Methodology.md`](workflows/GWAS_Research_Methodology.md) documents the steps and order of the whole analysis. +--- + +## Phenotype and Climate Data Analysis + +### Hierarchical Clustering and PCA Analysis + +The analysis data and scripts for this section can be found in the directories: +`/workflows/phenotype_pca_and_hc/data/` (data) and `/workflows/phenotype_pca_and_hc/scripts/` (scripts). + +The raw data was cleaned and merged following the steps outlined in the notebook: +`/workflows/phenotype_pca_and_hc/scripts/phenotype_and_climate_data_notebook.ipynb`. + +- **Principal Component Analysis (PCA)** was performed using the script: + `pheno_and_climate_pca_in_r.R`. + +- **Hierarchical Clustering** and the generation of the corresponding Newick file were completed using the script: + `hierarchical_clustering_analysis.R`. + +Visualization of the Newick file was conducted using the ITOL web tool. The annotation files generated are located in `/workflows/phenotype_pca_and_hc/scripts/`, and the resulting plots are available in `/workflows/phenotype_pca_and_hc/results/`. + +### Relative Range Intersection (RRI) + +The objective is to analyze the intersection of climatic factor values among different taxa and visualize the results using bar plots, providing insights into ecological similarities and differences among teosinte taxa. + +The analysis data and script for this section are located in: +`/workflows/climate_factor_intersections/data` (data) and `/workflows/climate_factor_intersections/scripts/fcfi_jitter.py` (script). + +- Resulting bar plots are saved as SVG files in the directory: + `/workflows/climate_factor_intersections/results/`. + +--- + +## Population Structure and Genetic Diversity + +### Genetic Diversity + +**Objective**: Assess genetic differentiation. + +The analysis data and scripts for this section are located in: +`/workflows/hierfstats/data/` (data) and `/workflows/hierfstats/scripts/` (scripts). + +- **Principal Component Analysis (PCA)** of genotype data was conducted using the script: + `genotype_pca.R`. + +- **Fixation Index (Fst)** calculations were performed with `hierfstat` using the scripts: + `hierfst_data_prep_bash_commands.sh` and `hierfstats_hetero.Rmd`. + +- Result tables and plots are stored in the directory: + `/workflows/hierfstats/plots/`. + +### Genetic Clusters and Geographic Distribution of Teosinte + +**Objective**: Assess population structure. + +The analysis data for this section can be found in the directory: +`/studies/data/genetic/admixture_outs/`. + +- **Genetic Clusters and Geographic Distribution**: + Admixture proportions of teosinte individuals were visualized using pie plots placed on a geographical map, based on the coordinates of the individuals' collection sites. + Scripts are available in the directory: `/workflows/admixture_plots/geo_and_indiv_admixture_scripts/`, and results are in: `/workflows/admixture_plots/output_plots/geo_admixture_plots/`. + +- **ADMIXTURE Bar Plot of All Teosinte Individuals**: + The inferred ancestry of teosinte populations was plotted as a bar plot using the script: + `/workflows/admixture_plots/admixture_sorted_based_hc_pheno_plots/scripts/admixture_per_individual_accession_all_taxa.R`. + +An overall GWAS summary plot was generated using the script: +`/workflows/admixture_plots/admixture_sorted_based_hc_pheno_plots/scripts/gwas_summary_hc_admix_alleles_plot.R`. + +--- + +<!-- ## GWAS Analysis + +### Data Preprocessing for GWAS analysis + +The aforementioned analysis section's scripts and files are located in the directory [workflows/preprocessing_data](workflows/preprocessing_data). + +1. **Conversion of Genetic Data**: + The initial 'raw' genetic data, provided in VCF format, was converted to hapmap format using TASSEL software [doi:10.1093/bioinformatics/btm308]. The files were loaded directly into TASSEL and saved as diploid hapmap files. The respective raw data files used can be found in the following directories: + + - `/initial_data/data/genetic/filtered/*.vcf` + - `/initial_data/data/meta/ADN_pasap_3604.txt` + +2. **Hapmap File Cleaning**: + After conversion, species-specific tags were removed from the overall teosinte hapmap file IDs using the script `remove_spp_tags_in_hapmap_files.sh` to ensure standardized identifiers across all files. + +3. **Genotype-Phenotype Accession Matching and taxon subset data files**: + The Jupyter notebook `geno_pheno_accession_selection_3455_accesions_matched.ipynb` was used to filter out genotypes that did not have corresponding phenotype data, and vice versa, in preparation for GWAS analysis using GAPIT. This notebook also generated individual taxa subsets for further analysis. + +4. **Genotype File Conversion**: + The hapmap genotype files were further converted to various formats (Plink, numeric, H5) using scripts located in the `hapmap_convertion_scripts` subdirectory. These conversions facilitated compatibility with other analysis tools used in downstream analyses. + +5. **Phenotype Data Extraction**: + The script `extract_indiv_spp_pheno_data.sh` was used to extract and create individual species phenotype files. This step was essential to ensure that each species had its own specific phenotype dataset for analysis. + +The cleaned and filtered genotype and phenotype files were saved in the `/mnt/data/joseph/TEOSINTE/analyses/GWAS/data/` or `studies/processed_genotype_phenotype_teosinte_data` directory, ready for use in the GWAS pipeline. --> + +## GWAS Analysis + +### Data Preprocessing for GWAS Analysis + +The scripts and files for preprocessing data for GWAS are located in the directory [workflows/preprocessing_data](workflows/preprocessing_data). + +1. **Conversion of Genetic Data** + Raw genetic data in VCF format was converted to hapmap format using TASSEL software [doi:10.1093/bioinformatics/btm308]. Files were loaded into TASSEL and saved as diploid hapmap files. The source files can be found in the following directories: + + - `/initial_data/data/genetic/filtered/*.vcf` + - `/initial_data/data/meta/ADN_pasap_3604.txt` + +2. **Hapmap File Cleaning** + Species-specific tags were removed from teosinte hapmap file IDs using `remove_spp_tags_in_hapmap_files.sh`, standardizing identifiers across files. + +3. **Genotype-Phenotype Matching and Taxon Subset Data** + The Jupyter notebook `geno_pheno_accession_selection_3455_accesions_matched.ipynb` was used to align genotype and phenotype data and generate individual taxa subsets. + +4. **Genotype File Conversion** + Hapmap files were converted to Plink, numeric, and H5 formats using scripts in the `hapmap_convertion_scripts` directory for compatibility with downstream tools. + +5. **Phenotype Data Extraction** + The script `extract_indiv_spp_pheno_data.sh` created phenotype datasets specific to each species. + + Processed genotype and phenotype files are stored in: + + - `/mnt/data/joseph/TEOSINTE/analyses/GWAS/data/` + - `studies/processed_genotype_phenotype_teosinte_data` + +### GWAS Using GAPIT + +Scripts and files for GWAS analysis are available in [workflows/gwas_gapit_pipeline](workflows/gwas_gapit_pipeline). + +The GWAS was performed using GAPIT, evaluating models such as GLM, MLM, FarmCPU, MLMM, CMLM, SUPER, and BLINK. FarmCPU and MLMM were identified as the best-performing models based on QQ plots and BIC criteria. Input files included genotype and phenotype data from `/mnt/data/joseph/TEOSINTE/analyses/GWAS/data/`. + +Principal components (PCs) for population structure correction were selected based on eigenvalues calculated via PCA, using BIC for optimization. The main script for GWAS runs was `gwas_3455.R`, and various taxa-specific scripts include: + +- `run_gapit_all_taxa.sh` +- `run_gapit_diplo_plus_per.sh` +- `run_gapit_diplo.sh` +- `run_gapit_lux_plus_nica.sh` +- `run_gapit_lux.sh` +- `run_gapit_mex_chalco.sh` +- `run_gapit_mex_mesa.sh` +- `run_gapit_mex.sh` +- `run_gapit_parv_plus_hue.sh` +- `run_gapit_parv.sh` +- `run_gapit_perenn.sh` +- `run_gapit.sh` + +--- + +### Annotation of Significant SNPs + +**Objective:** Process and annotate GWAS result files from GAPIT. +The analysis scripts and files are available in [workflows/snp_gene_neighborhood_pipeline/scripts](workflows/snp_gene_neighborhood_pipeline/). + +Significant SNPs were annotated using the B73 reference genome. This process involved downloading the B73 reference sequence and its annotations, running Mercator and ProtScriber, and generating annotations for each SNP. The annotation pipeline utilizes `snp_physical_mapping_pl_v2.py` for mapping SNPs to nearby features within a +50kb window and `generate_reference_protein_function_annotations_v2.py` for annotating the SNPs with functional information from Mercator and InterProScan. The annotated results for each trait are stored in `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait/`. Additionally, heatmaps were generated to summarize the results, with scripts available in the directory `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/mercator_results_summarizer/`. These scripts provide both raw data and Z-transformed data. + +#### Prerequisites + +Before running this pipeline, ensure you have the following: + +- A GFF3 file of the B73 reference sequence located at `../misc/b73/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3`. +- Mercator and ProtScriber analysis results from the protein.fa and gene.fa sequences. The `gene.fa` file captures CDS information. +- Human-readable descriptions from the file `Zm-B73-REFERENCE-NAM.UniProt.GO.tab`. + +#### Step 1: Download Reference Sequence + +Download the B73 reference sequence and its annotations. + +#### Step 2: Run Mercator + +Execute Mercator using the downloaded protein and CDS sequences. + +#### Step 3: Generate Mercator and ProtScriber Annotations + +Generate Mercator and ProtScriber annotations from the Mercator FASTA results file with the following command: + +```bash +bash grep "^>" ../misc/b73/protein_seq_mercator_prot_swissprot_annot/b73_reference.fa | sed 's/>//g' > ../misc/b73/protein_seq_mercator_prot_swissprot_annot/b73_reference.fa.headers.txt +``` + +#### Step 4: Data Preparation and Annotation + +Data Preparation: Ensure the GWAS results and reference files are ready for annotation. + +Annotation Script: Run `bash /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/annotate.sh` to annotate significant SNPs in your Teosinte GWAS results using the following components: + +`python3 snp_physical_mapping_pl_v2.py`: Maps SNPs to nearby features from the GAPIT.Filter_GWAS_result.csv. +`python3 generate_reference_protein_function_annotations_v2.py`: Annotates SNPs with Mercator, InterProScan, and ProtScriber information. + +Fast Processing Option: Use `annotate_v2.sh` for faster annotation, which runs`process_trait.sh`. This script includes refactored versions of the above scripts. + +Results: Each trait’s annotated SNP features was stored in `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait`. + +Final SNP Annotation Table: The final annotated SNP table, combining information from Mercator, InterProScan, and ProtScriber, was stored in the `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait` directory with filenames like `${trait}_annotated_snp_table_${model}.csv`. + +Combined Annotations: The combined annotations for each trait and model was stored in the directory `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait` directory with filenames like `${taxon}_final_annotation_${trait}_${model}.csv`. + +#### Step 5: Cleanup and Merging + +Remove empty files generated from models with no significant SNPs: + +Clean the repository by removing 0-byte files created from models with no significant SNPs: +`find /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results -size 0c -delete` + +Merge all annotation files into a mega CSV file containing all annotations for the final analysis +`/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/merge_all_annotatated_files.py` + +--- + +### Heatmaps and Summary Tables + +Scripts for heatmap generation and summary are available in [mercator_results_summarizer](workflows/snp_gene_neighborhood_pipeline/scripts/mercator_results_summarizer). + +- `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/mercator_summary.py`: Summarizes Mercator annotations into two count tables—one with raw data and another with Z-transformed data. These tables help analyze SNP functional associations with genes and biological pathways. + +- `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/plot_heatmap.R`: Generates heatmaps of Mercator annotations using the output of mercator_summary.py. Hardcoded path to phenotype data: `/mnt/data/joseph/TEOSINTE/analyses-on-pheotype_and_climatic_data/downstream-data-and-outputs/data/phenotype_and_env_data_vars_labeled.csv` + +```bash +# parviglumis +python3 mercator_summary.py -p parameter_parv.csv -d 2 + +Rscript plot_heatmap.R parviglumis_bin_depth_2_vs_traits_df.csv parviglumis_depth_2 + +# Mexicana - All races +python3 mercator_summary.py -p parameter_mex.csv -d 2 + +Rscript plot_heatmap.R mexicana_bin_depth_2_vs_traits_df.csv mexicana_depth_2 + +# Mexicana - Mesa Central +python3 mercator_summary.py -p parameter_mex_mesa.csv -d 2 + +Rscript plot_heatmap.R mexicana_mesa_bin_depth_2_vs_traits_df.csv mexicana_mesa_central_depth_2 + +# Mexicana - Chalco +python3 mercator_summary.py -p parameter_mex_chalco.csv -d 2 + +Rscript plot_heatmap.R mexicana_chalco_bin_depth_2_vs_traits_df.csv mexicana_chalco_depth_2 + +# Diploperennis +python3 mercator_summary.py -p parameter_diplop.csv -d 2 + +Rscript plot_heatmap.R diplo_perennis_bin_depth_2_vs_traits_df.csv diplo_perennis_depth_2 +``` + +- `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/unique_bins.R`: Preprocesses data to identify unique annotations for each taxon-trait combination and saves them in uniq_mercator_annotations.csv + +-`/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/unique_bins.R/plot_counts_snps_n_protein_coding_genes.py`: Summarizes the number of SNPs and protein-coding genes found in LD windows and generates horizontal bar plots. + +--- + +### Enrichment Analysis and Jaccard Similarity Index (JSI) analysis + +The scripts dirctory is located in the directory `/workflows/snp_gene_neighborhood_pipeline/scripts/mercator_enrichment` + +The result directory for enrichment analysis tables, plots and JSI plots `/workflows/snp_gene_neighborhood_pipeline/results_all_factors_snps_n_annotations` + +#### Generate the case annotation tables for enrichment analyisis + +The Python script `generate_enrichment_input_tables_v2.py` processes annotation tables and generates a combined DataFrame for specified clades. The tables will be used in the enrichment analyis. + +Execution flow: + +1. The `process_annotation_table` function processes an annotation table and extracts relevant information. It takes the path to an annotation table (`annot_table_path`), the desired bin depth (`bin_depth`), the trait name (`trait`), and the clade name (`clade`) as input. This function reads the annotation table, extracts information about genes and their associated bins, and creates a DataFrame with columns: "ID," "PFam," "Trait," and "Clade" for the specified clade. PFam column header is used inplace of mercator annotations to be compatible with Asis's `enrichment_funks.R` functions. + +2. The `main` function is the entry point of the script. It parses command-line arguments using `argparse`. The arguments include `--bin_depth` (bin depth), `--param_file` (path to a parameter file), and `--output_dir` (output directory for the resulting CSV file). + +3. Inside the `main` function: + + - The `param_df` DataFrame is created by reading the parameter file. + - The script iterates over unique clade names specified in the parameter file. + - For each clade, it processes the corresponding annotation table and generates a DataFrame using the `process_annotation_table` function. + - The results are concatenated into the `clade_combined_df` DataFrame. + +4. After processing all clades, the script removes rows with PFam values (mercator values) starting with "Enzyme" from the `clade_combined_df`. + +5. Finally, it defines the output CSV file name based on bin depth, taxon, and clade, and writes the `clade_combined_df` DataFrame to the CSV file in the specified output directory. + +This script can be used to generate combined annotation data for multiple clades from the results of the snp_annotation_pipeline, making it easier to analyze and visualize the results for various biological traits. + +Script executions. + +```sh +python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_mex.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_mex.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_parv.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_parv.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_mex_chalco.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_mex_chalco.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_mex_mesa.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_mex_mesa.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_mex_diplo.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_mex_diplo.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_diplop.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ + +python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_diplop.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/ +``` + +Output tables stored in `../../results_mercator_heatmaps_enrichment/enrichment/input_tables/` + +#### Enrichment analysis + +##### main result output ora tables and plots + +Utilize R scripts `mercator_enrichment_v2.R` to analyze the generated CSV files and produce results for each clade. + +```sh +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_parviglumis_clade_combined_enrich_input.csv parviglumis_d1 1 + +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_parviglumis_clade_combined_enrich_input.csv parviglumis_d2 2 + +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_mexicana_clade_combined_enrich_input.csv mexicana_d1 1 + +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_mexicana_clade_combined_enrich_input.csv mexicana_d2 2 + +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_mexicana_mesa_central_clade_combined_enrich_input.csv mexicana_mesa_central_d1 1 + +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_mexicana_mesa_central_clade_combined_enrich_input.csv mexicana_mesa_central_d2 2 + +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_mexicana_chalco_clade_combined_enrich_input.csv mexicana_chalco_d1 1 + +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_mexicana_chalco_clade_combined_enrich_input.csv mexicana_chalco_d2 2 + +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_diplo_perennis_clade_combined_enrich_input.csv diplo_perennis_d1 1 + +Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_diplo_perennis_clade_combined_enrich_input.csv diplo_perennis_d2 2 + +``` + +##### merge results ora outputs tables and plot + +`merge_tbls_plot_ora.sh` merges various result tables and generate summary plots. + +```sh +bash ./merge_tbls_plot_ora.sh + +``` + +##### Enrichment plot protein functions count plots + +The script `grouped_enrichment_bar_plots.py` creates bar plots to visualize protein function counts based on the analysis output. + +```sh + +python3 grouped_enrichment_bar_plots.py mexicana ../../results_mercator_heatmaps_enrichment/enrichment/ora_mapman_result_tables/mexicana/mexicana_d2_ora_mapman_result_tbl/mexicana_merged_ORA_output_tbl.csv --output_dir=../../results_all_factors_snps_n_annotations/ + +python3 grouped_enrichment_bar_plots.py parviglumis ../../results_mercator_heatmaps_enrichment/enrichment/ora_mapman_result_tables/parviglumis/parviglumis_d2_ora_mapman_result_tbl/parviglumis_merged_ORA_output_tbl.csv --output_dir=../../results_all_factors_snps_n_annotations/ +``` + +##### Main result output Under Representation Analysus (URA) tables and plots + +```sh +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_parviglumis_clade_combined_enrich_input.csv parviglumis_d1 1 + +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_parviglumis_clade_combined_enrich_input.csv parviglumis_d2 2 + +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_mexicana_clade_combined_enrich_input.csv mexicana_d1 1 + +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_mexicana_clade_combined_enrich_input.csv mexicana_d2 2 + +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_mexicana_mesa_central_clade_combined_enrich_input.csv mexicana_mesa_central_d1 1 + +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_mexicana_mesa_central_clade_combined_enrich_input.csv mexicana_mesa_central_d2 2 + +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_mexicana_chalco_clade_combined_enrich_input.csv mexicana_chalco_d1 1 + +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_mexicana_chalco_clade_combined_enrich_input.csv mexicana_chalco_d2 2 + +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_diplo_perennis_clade_combined_enrich_input.csv diplo_perennis_d1 1 + +Rscript mercator_enrichment_ura.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_diplo_perennis_clade_combined_enrich_input.csv diplo_perennis_d2 2 + +``` + +#### Jaccard Similarity index + +This was calculated and plotted following the steps outlined in the Jupyter notebook located at: `workflows/snp_gene_neighborhood_pipeline/scripts/snps_and_proteins_jaccard_index` --- + +## Teosinte GWAS SNPs and literature SNPs comparisons + +The analysis scripts and files mentioned above are available in the directory [workflows/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps](workflows/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps) + +### Description and retrival of the Literature data + +- Source: [Diversity, SNP, and Trait panel MaizeGDB Genome browser](https://www.maizegdb.org/diversity) +- On the left panel, under Diversity section, activate the tracks; + + - `GWAS SNPs for 21 traits (Li. 2022 - Wang & Wang Labs)` + The Nature Plants paper by Li et al. 2022 performed genome-wide association studies (GWAS) for 21 important agronomic traits across 1,604 inbred lines at five agro-ecologically distinct locations in China in 2018 and 2019. On the basis of a cut-off of P < 10-6 for GWAS, they identified 2,360 significant associations at 1,847 SNPs. Raw data is found in Supplementary Table 6 of their paper. The SNP positions were originally mapped to B73 RefGen_v4. Each glyph on this track shows the SNP position and trait name. + + MaizeGDB remapped this dataset to B73 RefGen_v5, by taking the 100bp flanking sequeces for each SNP in RefGenv4 and BLASTing them to B73 RefGen_v5. The Blast results were converted to GFF. Each of the SNPs uniquely mapped to the B73 RefGen_v5 with 100% coverage and sequence identity. + + Mapping commands: + bedtools getfasta -fi Zm-B73-REFERENCE-GRAMENE-4.0.fa -bed snps.bed -s -name > snps.fa + makeblastdb -in Zm-B73-REFERENCE-NAM-5.0.fa -dbtype nucl -parse_seqids + blastn -db Zm-B73-REFERENCE-NAM-5.0.fa -query snps.fa -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -qcov_hsp_perc 100 -perc_identity 100 -out snps_positions.txt + + Publication: + Genomic insights into historical improvement of heterotic groups during modern hybrid maize breeding. + + [Raw data](https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA009749) + + [Original source code](https://www.nature.com/articles/s41477-022-01190-2#Sec29) + + - `GWAS SNPs from GWAS Atlas database` + (Updated February 24 2022) This track describes genome-wide association mapping (GWAS) from 133 papers covering 531 studies for 279 traits across ~42,000 loci overlapping ~8,400 genes. The data was compiled and remapped to B73_v4 at the GWAS Atlas database by the National Genomics Data Center at the Chinese Academy of Sciences (PMID: 31566222). Right-click on a feature to view details for each GWAS hit including trait name, tissue, position, allele, p-value, R^2 value, RS number, RS reference allele, RS alternate allele, publication information, and PubMed link. Track features are colored based on tissue. + + All of the original coordinates for B73 RefGen_v2, v3, and v4 have been remapped to B73 RefGen_v5. MaizeGDB used the 100bp upstream and downstream flanking sequences (retrieved using bedtools getfasta) for each SNP (in RefGen_v4) from the RefGen_v4 SNP genomic coordinates from the GWAS Atlas database (retrieved using bedtools getfasta) as query sequences aligned to B73 RefGen_v5 with at least 98% coverage and 98% sequence identity. This track shows the top hit (see command below) for each of those query sequences. If multiple top hits exist, the hit nearest the original location is chosen. RS numbers were assigned based on coordinate positions for the B73 RefGen_v4 SNPs downloaded from the European Variation Archive (March 2020). MaizeGDB also added information about overlapping features (using bedtools intersect) including open chromatin peaks (leaf,ear), 32 TF-binding sites peaks, and RefGen_v5 gene models. + + BLAST was used for the mapping (example command syntax: "blastn -db Zm-B73-REFERENCE-NAM-5.0.fa -query flanking_sequences.fasta -perc_identity 98 -qcov_hsp_perc 98"). + +- Right click on respectic SNPs track and download `Whole reference sequence` as `GFF3` format. +- Download was done per chromosome + +### Teosinte GWAS SNPs and Literature Comparisons + +The results of the GWAS conducted on teosinte were systematically compared with established SNP datasets, including the GWAS SNPs for 21 traits from the Li et al. (2022) study and those from the GWAS Atlas database. Comparison tables were generated and stored alongside the annotated SNP results. The following commands were utilized to perform the comparisons: + +```bash +python3 literature_snps_comparisons.py -t /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results_all_factors_snps_n_annotations/mexicana -a /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_from_GWAS_Atlas_database_merged.gff3 -l /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_for_21_traits_Li_2022_Wang_Labs_merged.gff3 + +python3 literature_snps_comparisons.py -t /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results_all_factors_snps_n_annotations/parviglumis -a /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_from_GWAS_Atlas_database_merged.gff3 -l /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_for_21_traits_Li_2022_Wang_Labs_merged.gff3 +``` -- GitLab