Skip to content
Snippets Groups Projects

TEOSINTE Project: Comparative Genomics of Maize's Wild Relatives (Teosinte)

Description

Project Start: 01.10.2021

Project Status: Active

Responsible Investigator: Joseph Atemia

Supervisor: Asis Hallab

Project Partners: Alicia Mastretta-Yanes (Instituto de Ecologia, UNAM, Mexico), Ana Wegier (Jardín Botánico, Instituto de Biologia, UNAM, México)

Crop wild relatives represent important sources of genetic variation that could be of great aid to breeding programs, especially regarding adaptation to extreme environmental conditions [ISBN:978-1-78064-197-3]. Maize close wild relatives are teosintes, which have a wide ecogeographic distribution in Mexico spanning extreme ranges of precipitation and temperatures [doi:10.1371/journal.pone.0192676]. The identification of the genetic variants associated with adaptation to such conditions is a necessary step in order to better monitor, conserve and use such diversity in applied projects. As part of preliminary research, the Mexican partners performed an extensive teosinte sampling, including ca. 4000 individuals of 276 populations of all the 7 teosinte species and subspecies distributed in Mexico. These samples were phenotyped in a greenhouse common garden, and genotyping by sequencing (GBS) was applied yielding ca. 60,000 SNPs. Ecogeographical analyses were also performed [doi:10.1371/journal.pone.0192676]. This dataset encompasses probably the most complete teosinte sampling done so far providing genomic data at a population level. However, teosinte genomes have proved to be incredibly diverse and complex, and during the course of domestication the differences among maize and its parental teosintes (​Zea mays ssp. ​ parviglumis and ssp. ​ mexicana) not only included re-shuffling allele-frequencies,but also structural and expression changes [doi:10.1038/ng.2309]. Therefore, fully exploring the variation associated to teosinte adaptation to diverse environments requires new approaches. Here, we aim to identify the genetic variants associated with teosinte adaptation to different environmental conditions, in the context of the genomic changes associated with domestication.


Funding

This project is funded by the German Ministry for Science and Education (BMBF 031B0921).


Data and Code Availability

  • Genotyping-by-sequencing (GBS) Data: The GBS data used in this study was obtained from publicly available sources and is detailed in the following published papers:

  • Repository:
    All code, scripts, and associated files used for the Genome-Wide Association Study (GWAS) are hosted in this repository.

  • Data and File Tracking:

    • Large datasets and other outputs generated during the study are stored and version-controlled using Git LFS (Large File Storage) for efficient tracking of changes and sharing of substantial files within the repository.
    • The specific data and outputs related to this GWAS study are included and organized in the repository for direct access.

Directory Structure

  1. Project Overview

    • Overall Description of the project.
  2. assay

  3. runs

    • GAPIT GWAS results outputs
  4. sudies

    • Initial 'raw' data
    • Processed data eventually used in the GWAS analysis using GAPIT – Genome Association and Prediction Integrated Tool.
  5. workflows

    • Contains the following subdirectories that habours scipts and respective results files:

      • admixture_plots
      • climate_factor_intersections
      • genetic_diversity
      • gwas_gapit_pipeline
      • gwas_ps_robustness
      • phenotype_distribution_plots
      • phylogeny_based_GBS
      • preprocessing_data
      • snp_gene_neighborhood_pipeline

Steps in Analysis

The sections that follow outlines the key steps taken to explore and analyze the relationships described in the manuscript From Habitat to Genotype: The Complex Interplay of Climate, Phenotypes, and Taxonomy in Teosinte. Each step is detailed to ensure reproducibility and transparency in the analysis.

Caution: Please note that some of the directory paths in the scripts may need to be modified to match your local environment in order to ensure proper reproducibility.


Phenotype and Climate Data Analysis

Hierarchical Clustering and PCA Analysis

The analysis data and scripts for this section can be found in the directories:
/workflows/phenotype_pca_and_hc/data/ (data) and /workflows/phenotype_pca_and_hc/scripts/ (scripts).

The raw data was cleaned and merged following the steps outlined in the notebook:
/workflows/phenotype_pca_and_hc/scripts/phenotype_and_climate_data_notebook.ipynb.

  • Principal Component Analysis (PCA) was performed using the script:
    pheno_and_climate_pca_in_r.R.

  • Hierarchical Clustering and the generation of the corresponding Newick file were completed using the script:
    hierarchical_clustering_analysis.R.

Visualization of the Newick file was conducted using the ITOL web tool. The annotation files generated are located in /workflows/phenotype_pca_and_hc/scripts/, and the resulting plots are available in /workflows/phenotype_pca_and_hc/results/.

Relative Range Intersection (RRI)

The objective is to analyze the intersection of climatic factor values among different taxa and visualize the results using bar plots, providing insights into ecological similarities and differences among teosinte taxa.

The analysis data and script for this section are located in:
/workflows/climate_factor_intersections/data (data) and /workflows/climate_factor_intersections/scripts/fcfi_jitter.py (script).

  • Resulting bar plots are saved as SVG files in the directory:
    /workflows/climate_factor_intersections/results/.

Population Structure and Genetic Diversity

Genetic Diversity

Objective: Assess genetic differentiation.

The analysis data and scripts for this section are located in:
/workflows/hierfstats/data/ (data) and /workflows/hierfstats/scripts/ (scripts).

  • Principal Component Analysis (PCA) of genotype data was conducted using the script:
    genotype_pca.R.

  • Fixation Index (Fst) calculations were performed with hierfstat using the scripts:
    hierfst_data_prep_bash_commands.sh and hierfstats_hetero.Rmd.

  • Result tables and plots are stored in the directory:
    /workflows/hierfstats/plots/.

Genetic Clusters and Geographic Distribution of Teosinte

Objective: Assess population structure.

The analysis data for this section can be found in the directory:
/studies/data/genetic/admixture_outs/.

  • Genetic Clusters and Geographic Distribution:
    Admixture proportions of teosinte individuals were visualized using pie plots placed on a geographical map, based on the coordinates of the individuals' collection sites.
    Scripts are available in the directory: /workflows/admixture_plots/geo_and_indiv_admixture_scripts/, and results are in: /workflows/admixture_plots/output_plots/geo_admixture_plots/.

  • ADMIXTURE Bar Plot of All Teosinte Individuals:
    The inferred ancestry of teosinte populations was plotted as a bar plot using the script:
    /workflows/admixture_plots/admixture_sorted_based_hc_pheno_plots/scripts/admixture_per_individual_accession_all_taxa.R.

An overall GWAS summary plot was generated using the script:
/workflows/admixture_plots/admixture_sorted_based_hc_pheno_plots/scripts/gwas_summary_hc_admix_alleles_plot.R.


GWAS Analysis

Data Preprocessing for GWAS Analysis

The scripts and files for preprocessing data for GWAS are located in the directory workflows/preprocessing_data.

  1. Conversion of Genetic Data
    Raw genetic data in VCF format was converted to hapmap format using TASSEL software [doi:10.1093/bioinformatics/btm308]. Files were loaded into TASSEL and saved as diploid hapmap files. The source files can be found in the following directories:

    • studies/initial_data/resources/data/genetic/filtered/*.vcf
    • studies/initial_data/resources/data/meta/ADN_pasap_3604.txt
  2. Hapmap File Cleaning
    Species-specific tags were removed from teosinte hapmap file IDs using remove_spp_tags_in_hapmap_files.sh, standardizing identifiers across files.

  3. Genotype-Phenotype Matching and Taxon Subset Data
    The Jupyter notebook geno_pheno_accession_selection_3455_accesions_matched.ipynb was used to align genotype and phenotype data and generate individual taxa subsets.

  4. Genotype File Conversion
    Hapmap files were converted to Plink, numeric, and H5 formats using scripts in the hapmap_convertion_scripts directory for compatibility with downstream tools.

  5. Phenotype Data Extraction
    The script extract_indiv_spp_pheno_data.sh created phenotype datasets specific to each species.

    Processed genotype and phenotype files are stored in:

    • studies/processed_genotype_phenotype_teosinte_data/resources/gwas_data_3455_accessions/genotype*
    • studies/processed_genotype_phenotype_teosinte_data/resources/gwas_data_3455_accessions/phenotype

GWAS Using GAPIT

Scripts and files for GWAS analysis are available in workflows/gwas_gapit_pipeline.

The GWAS was performed using GAPIT, evaluating models such as GLM, MLM, FarmCPU, MLMM, CMLM, SUPER, and BLINK. FarmCPU and MLMM were identified as the best-performing models based on QQ plots and BIC criteria. Input files included genotype and phenotype data from studies/processed_genotype_phenotype_teosinte_data/resources/gwas_data_3455_accessions/genotype* and studies/processed_genotype_phenotype_teosinte_data/resources/gwas_data_3455_accessions/phenotype.

Principal components (PCs) for population structure correction were selected based on eigenvalues calculated via PCA, using BIC for optimization. The main script for GWAS runs was gwas_3455.R, and various taxa-specific scripts include:

  • run_gapit_all_taxa.sh
  • run_gapit_diplo_plus_per.sh
  • run_gapit_diplo.sh
  • run_gapit_lux_plus_nica.sh
  • run_gapit_lux.sh
  • run_gapit_mex_chalco.sh
  • run_gapit_mex_mesa.sh
  • run_gapit_mex.sh
  • run_gapit_parv_plus_hue.sh
  • run_gapit_parv.sh
  • run_gapit_perenn.sh
  • run_gapit.sh

Annotation of Significant SNPs

Objective: Process and annotate GWAS result files from GAPIT. The analysis scripts and files are available in workflows/snp_gene_neighborhood_pipeline/scripts.

Significant SNPs were annotated using the B73 reference genome. This process involved downloading the B73 reference sequence and its annotations, running Mercator and ProtScriber, and generating annotations for each SNP. The annotation pipeline utilizes snp_physical_mapping_pl_v2.py for mapping SNPs to nearby features within a +50kb window and generate_reference_protein_function_annotations_v2.py for annotating the SNPs with functional information from Mercator and InterProScan. The annotated results for each trait are stored in workflows/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait/. Additionally, heatmaps were generated to summarize the results, with scripts available in the directory workflows/snp_gene_neighborhood_pipeline/scripts/mercator_results_summarizer/. These scripts provide both raw data and Z-transformed data.

Prerequisites

Before running this pipeline, ensure you have the following:

  • A GFF3 file of the B73 reference sequence located at ../misc/b73/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3.
  • Mercator and ProtScriber analysis results from the protein.fa and gene.fa sequences. The gene.fa file captures CDS information.
  • Human-readable descriptions from the file Zm-B73-REFERENCE-NAM.UniProt.GO.tab.

Step 1: Download Reference Sequence

Download the B73 reference sequence and its annotations.

Step 2: Run Mercator

Execute Mercator using the downloaded protein and CDS sequences.

Step 3: Generate Mercator and ProtScriber Annotations

Generate Mercator and ProtScriber annotations from the Mercator FASTA results file with the following command:

bash grep "^>" ../misc/b73/protein_seq_mercator_prot_swissprot_annot/b73_reference.fa | sed 's/>//g' > ../misc/b73/protein_seq_mercator_prot_swissprot_annot/b73_reference.fa.headers.txt

Step 4: Data Preparation and Annotation

Data Preparation: Ensure the GWAS results and reference files are ready for annotation.

Annotation Script: Run bash workflows/snp_gene_neighborhood_pipeline/scripts/annotate.sh to annotate significant SNPs in your Teosinte GWAS results using the following components:

python3 snp_physical_mapping_pl_v2.py: Maps SNPs to nearby features from the GAPIT.Filter_GWAS_result.csv. python3 generate_reference_protein_function_annotations_v2.py: Annotates SNPs with Mercator, InterProScan, and ProtScriber information.

Fast Processing Option: Use annotate_v2.sh for faster annotation, which runsprocess_trait.sh. This script includes refactored versions of the above scripts.

Results: Each trait’s annotated SNP features was stored in workflows/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait.

Final SNP Annotation Table: The final annotated SNP table, combining information from Mercator, InterProScan, and ProtScriber, was stored in the workflows/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait directory with filenames like ${trait}_annotated_snp_table_${model}.csv.

Combined Annotations: The combined annotations for each trait and model was stored in the directory workflows/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait directory with filenames like ${taxon}_final_annotation_${trait}_${model}.csv.

Step 5: Cleanup and Merging

Remove empty files generated from models with no significant SNPs:

Clean the repository by removing 0-byte files created from models with no significant SNPs: find workflows/snp_gene_neighborhood_pipeline/results -size 0c -delete

Merge all annotation files into a mega CSV file containing all annotations for the final analysis workflows/snp_gene_neighborhood_pipeline/scripts/merge_all_annotatated_files.py


Heatmaps and Summary Tables

Scripts for heatmap generation and summary are available in mercator_results_summarizer.

  • workflows/snp_gene_neighborhood_pipeline/scripts/mercator_results_summarizer/mercator_summary.py: Summarizes Mercator annotations into two count tables—one with raw data and another with Z-transformed data. These tables help analyze SNP functional associations with genes and biological pathways.

  • workflows/snp_gene_neighborhood_pipeline/scripts/plot_heatmap.R: Generates heatmaps of Mercator annotations using the output of mercator_summary.py. Hardcoded path to phenotype data: workflows/phenotype_pca_and_hc/phenotype_and_env_data_vars_labeled.csv

# parviglumis
python3 mercator_summary.py -p parameter_parv.csv -d 2

Rscript plot_heatmap.R parviglumis_bin_depth_2_vs_traits_df.csv parviglumis_depth_2

# Mexicana - All races
python3 mercator_summary.py -p parameter_mex.csv -d 2

Rscript plot_heatmap.R mexicana_bin_depth_2_vs_traits_df.csv mexicana_depth_2

# Mexicana - Mesa Central
python3 mercator_summary.py -p parameter_mex_mesa.csv -d 2

Rscript plot_heatmap.R mexicana_mesa_bin_depth_2_vs_traits_df.csv mexicana_mesa_central_depth_2

# Mexicana - Chalco
python3 mercator_summary.py -p parameter_mex_chalco.csv -d 2

Rscript plot_heatmap.R mexicana_chalco_bin_depth_2_vs_traits_df.csv mexicana_chalco_depth_2

# Diploperennis
python3 mercator_summary.py -p parameter_diplop.csv -d 2

Rscript plot_heatmap.R diplo_perennis_bin_depth_2_vs_traits_df.csv diplo_perennis_depth_2
  • workflows/snp_gene_neighborhood_pipeline/scripts/unique_bins.R: Preprocesses data to identify unique annotations for each taxon-trait combination and saves them in uniq_mercator_annotations.csv

-workflows/snp_gene_neighborhood_pipeline/scripts/unique_bins.R/plot_counts_snps_n_protein_coding_genes.py: Summarizes the number of SNPs and protein-coding genes found in LD windows and generates horizontal bar plots.


Enrichment Analysis and Jaccard Similarity Index (JSI) analysis

The scripts dirctory is located in the directory /workflows/snp_gene_neighborhood_pipeline/scripts/mercator_enrichment

The result directory for enrichment analysis tables, plots and JSI plots /workflows/snp_gene_neighborhood_pipeline/results_all_factors_snps_n_annotations

Generate the case annotation tables for enrichment analyisis

The Python script generate_enrichment_input_tables_v2.py processes annotation tables and generates a combined DataFrame for specified clades. The tables will be used in the enrichment analyis.

Execution flow:

  1. The process_annotation_table function processes an annotation table and extracts relevant information. It takes the path to an annotation table (annot_table_path), the desired bin depth (bin_depth), the trait name (trait), and the clade name (clade) as input. This function reads the annotation table, extracts information about genes and their associated bins, and creates a DataFrame with columns: "ID," "PFam," "Trait," and "Clade" for the specified clade. PFam column header is used inplace of mercator annotations to be compatible with Asis's enrichment_funks.R functions.

  2. The main function is the entry point of the script. It parses command-line arguments using argparse. The arguments include --bin_depth (bin depth), --param_file (path to a parameter file), and --output_dir (output directory for the resulting CSV file).

  3. Inside the main function:

    • The param_df DataFrame is created by reading the parameter file.
    • The script iterates over unique clade names specified in the parameter file.
    • For each clade, it processes the corresponding annotation table and generates a DataFrame using the process_annotation_table function.
    • The results are concatenated into the clade_combined_df DataFrame.
  4. After processing all clades, the script removes rows with PFam values (mercator values) starting with "Enzyme" from the clade_combined_df.

  5. Finally, it defines the output CSV file name based on bin depth, taxon, and clade, and writes the clade_combined_df DataFrame to the CSV file in the specified output directory.

This script can be used to generate combined annotation data for multiple clades from the results of the snp_annotation_pipeline, making it easier to analyze and visualize the results for various biological traits.

Script executions.

python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_mex.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_mex.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_parv.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_parv.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_mex_chalco.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_mex_chalco.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_mex_mesa.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_mex_mesa.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_mex_diplo.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_mex_diplo.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 1 --param_file ../mercator_results_summarizer/parameter_diplop.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

python3 generate_enrichment_input_tables_v2.py --bin_depth 2 --param_file ../mercator_results_summarizer/parameter_diplop.csv --output_dir ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

Output tables stored in ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/

Enrichment analysis

main result output ora tables and plots

The script results are located in:

  • workflows/snp_gene_neighborhood_pipeline/scripts/mercator_enrichment (scripts)
  • workflows/snp_gene_neighborhood_pipeline/results_mercator_heatmaps_enrichment (results)

Used R scripts mercator_enrichment_v2.R to analyze the generated CSV files and produce results for each clade.

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_parviglumis_clade_combined_enrich_input.csv parviglumis_d1 1

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_parviglumis_clade_combined_enrich_input.csv parviglumis_d2 2

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_mexicana_clade_combined_enrich_input.csv mexicana_d1 1

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_mexicana_clade_combined_enrich_input.csv mexicana_d2 2

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_mexicana_mesa_central_clade_combined_enrich_input.csv mexicana_mesa_central_d1 1

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_mexicana_mesa_central_clade_combined_enrich_input.csv mexicana_mesa_central_d2 2

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_mexicana_chalco_clade_combined_enrich_input.csv mexicana_chalco_d1 1

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_mexicana_chalco_clade_combined_enrich_input.csv mexicana_chalco_d2 2

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_1_diplo_perennis_clade_combined_enrich_input.csv diplo_perennis_d1 1

Rscript mercator_enrichment_v2.R ../../results_mercator_heatmaps_enrichment/enrichment/input_tables/depth_2_diplo_perennis_clade_combined_enrich_input.csv diplo_perennis_d2 2
merge results ora outputs tables and plot

merge_tbls_plot_ora.sh merges various result tables and generate summary plots.

bash ./merge_tbls_plot_ora.sh
Enrichment plot protein functions count plots

The script grouped_enrichment_bar_plots.py creates bar plots to visualize protein function counts based on the analysis output.

python3 grouped_enrichment_bar_plots.py mexicana ../../results_mercator_heatmaps_enrichment/enrichment/ora_mapman_result_tables/mexicana/mexicana_d2_ora_mapman_result_tbl/mexicana_merged_ORA_output_tbl.csv --output_dir=../../results_all_factors_snps_n_annotations/

python3 grouped_enrichment_bar_plots.py parviglumis ../../results_mercator_heatmaps_enrichment/enrichment/ora_mapman_result_tables/parviglumis/parviglumis_d2_ora_mapman_result_tbl/parviglumis_merged_ORA_output_tbl.csv --output_dir=../../results_all_factors_snps_n_annotations/

Jaccard Similarity index

This was calculated and plotted following the steps outlined in the Jupyter notebook located at: workflows/snp_gene_neighborhood_pipeline/scripts/snps_and_proteins_jaccard_index.ipynb


Teosinte GWAS SNPs and literature SNPs comparisons

The analysis scripts and files mentioned above are available in the directory workflows/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps

Description and retrival of the Literature data

  • Source: Diversity, SNP, and Trait panel MaizeGDB Genome browser

  • On the left panel, under Diversity section, activate the tracks;

    • GWAS SNPs for 21 traits (Li. 2022 - Wang & Wang Labs) The Nature Plants paper by Li et al. 2022 performed genome-wide association studies (GWAS) for 21 important agronomic traits across 1,604 inbred lines at five agro-ecologically distinct locations in China in 2018 and 2019. On the basis of a cut-off of P < 10-6 for GWAS, they identified 2,360 significant associations at 1,847 SNPs. Raw data is found in Supplementary Table 6 of their paper. The SNP positions were originally mapped to B73 RefGen_v4. Each glyph on this track shows the SNP position and trait name.

      MaizeGDB remapped this dataset to B73 RefGen_v5, by taking the 100bp flanking sequeces for each SNP in RefGenv4 and BLASTing them to B73 RefGen_v5. The Blast results were converted to GFF. Each of the SNPs uniquely mapped to the B73 RefGen_v5 with 100% coverage and sequence identity.

      Mapping commands: bedtools getfasta -fi Zm-B73-REFERENCE-GRAMENE-4.0.fa -bed snps.bed -s -name > snps.fa makeblastdb -in Zm-B73-REFERENCE-NAM-5.0.fa -dbtype nucl -parse_seqids blastn -db Zm-B73-REFERENCE-NAM-5.0.fa -query snps.fa -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -qcov_hsp_perc 100 -perc_identity 100 -out snps_positions.txt

      Publication: Genomic insights into historical improvement of heterotic groups during modern hybrid maize breeding.

      Raw data

      Original source code

    • GWAS SNPs from GWAS Atlas database (Updated February 24 2022) This track describes genome-wide association mapping (GWAS) from 133 papers covering 531 studies for 279 traits across ~42,000 loci overlapping ~8,400 genes. The data was compiled and remapped to B73_v4 at the GWAS Atlas database by the National Genomics Data Center at the Chinese Academy of Sciences (PMID: 31566222). Right-click on a feature to view details for each GWAS hit including trait name, tissue, position, allele, p-value, R^2 value, RS number, RS reference allele, RS alternate allele, publication information, and PubMed link. Track features are colored based on tissue.

      All of the original coordinates for B73 RefGen_v2, v3, and v4 have been remapped to B73 RefGen_v5. MaizeGDB used the 100bp upstream and downstream flanking sequences (retrieved using bedtools getfasta) for each SNP (in RefGen_v4) from the RefGen_v4 SNP genomic coordinates from the GWAS Atlas database (retrieved using bedtools getfasta) as query sequences aligned to B73 RefGen_v5 with at least 98% coverage and 98% sequence identity. This track shows the top hit (see command below) for each of those query sequences. If multiple top hits exist, the hit nearest the original location is chosen. RS numbers were assigned based on coordinate positions for the B73 RefGen_v4 SNPs downloaded from the European Variation Archive (March 2020). MaizeGDB also added information about overlapping features (using bedtools intersect) including open chromatin peaks (leaf,ear), 32 TF-binding sites peaks, and RefGen_v5 gene models.

      BLAST was used for the mapping (example command syntax: "blastn -db Zm-B73-REFERENCE-NAM-5.0.fa -query flanking_sequences.fasta -perc_identity 98 -qcov_hsp_perc 98").

  • Right click on respectic SNPs track and download Whole reference sequence as GFF3 format.

  • Download was done per chromosome

Teosinte GWAS SNPs and Literature Comparisons

The results of the GWAS conducted on teosinte were systematically compared with established SNP datasets, including the GWAS SNPs for 21 traits from the Li et al. (2022) study and those from the GWAS Atlas database. Comparison tables were generated and stored alongside the annotated SNP results. The following commands were utilized to perform the comparisons:

python3 literature_snps_comparisons.py -t workflows/snp_gene_neighborhood_pipeline/results_all_factors_snps_n_annotations/mexicana -a workflows/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_from_GWAS_Atlas_database_merged.gff3 -l workflows/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_for_21_traits_Li_2022_Wang_Labs_merged.gff3

python3 literature_snps_comparisons.py -t workflows/snp_gene_neighborhood_pipeline/results_all_factors_snps_n_annotations/parviglumis -a workflows/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_from_GWAS_Atlas_database_merged.gff3 -l workflows/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_for_21_traits_Li_2022_Wang_Labs_merged.gff3