diff --git a/workflows/GWAS_Research_Methodology.md b/workflows/GWAS_Research_Methodology.md deleted file mode 100644 index cbc573737385e9420f9ed7f17207988f7b6430d3..0000000000000000000000000000000000000000 --- a/workflows/GWAS_Research_Methodology.md +++ /dev/null @@ -1,181 +0,0 @@ -# GWAS Analysis Workflows Documentation - -## Preprocessing - -The aforementioned analysis section's scripts and files are located in the directory [workflows/preprocessing_data](workflows/preprocessing_data). - -1. **Conversion of Genetic Data**: - The initial 'raw' genetic data, provided in VCF format, was converted to hapmap format using TASSEL software [doi:10.1093/bioinformatics/btm308]. The files were loaded directly into TASSEL and saved as diploid hapmap files. The respective raw data files used can be found in the following directories: - - - `/initial_data/data/genetic/filtered/*.vcf` - - `/initial_data/data/meta/ADN_pasap_3604.txt` - -2. **Hapmap File Cleaning**: - After conversion, species-specific tags were removed from the overall teosinte hapmap file IDs using the script `remove_spp_tags_in_hapmap_files.sh` to ensure standardized identifiers across all files. - -3. **Genotype-Phenotype Accession Matching and taxon subset data files**: - The Jupyter notebook `geno_pheno_accession_selection_3455_accesions_matched.ipynb` was used to filter out genotypes that did not have corresponding phenotype data, and vice versa, in preparation for GWAS analysis using GAPIT. This notebook also generated individual taxa subsets for further analysis. - -4. **Genotype File Conversion**: - The hapmap genotype files were further converted to various formats (Plink, numeric, H5) using scripts located in the `hapmap_convertion_scripts` subdirectory. These conversions facilitated compatibility with other analysis tools used in downstream analyses. - -5. **Phenotype Data Extraction**: - The script `extract_indiv_spp_pheno_data.sh` was used to extract and create individual species phenotype files. This step was essential to ensure that each species had its own specific phenotype dataset for analysis. - -The cleaned and filtered genotype and phenotype files were saved in the `/mnt/data/joseph/TEOSINTE/analyses/GWAS/data/` or `studies/processed_genotype_phenotype_teosinte_data` directory, ready for use in the GWAS pipeline. - -## GWAS Analysis - -The analysis scripts and files mentioned above are available in the directory [workflows/gwas_gapit_pipeline](workflows/gwas_gapit_pipeline) - -The Genome-Wide Association Studies (GWAS) were conducted using the GAPIT tool, evaluating several models including the General Linear Model (GLM), Mixed Linear Model (MLM), FarmCPU, MLMM, CMLM, SUPER, and BLINK. QQ plots were analyzed to assess the performance of the models, with FarmCPU and MLMM identified as the most effective for the dataset. Population structure models were selected based on BIC model selection criteria. Input files for the analysis included genotype and phenotype files located at `/mnt/data/joseph/TEOSINTE/analyses/GWAS/data/` or `studies/processed_genotype_phenotype_teosinte_data`, and the resulting GWAS output is stored in `runs` or `/mnt/data/joseph/TEOSINTE/analyses/GWAS/gapit_pipeline/results/.` Directories. - -GAPIT parameters were fine-tuned to identify the most suitable principal components (PCs) for GWAS analysis by calculating eigenvalues through principal component analysis (PCA) of the genotype data. Utilizing the Bayesian Information Criterion (BIC) method, GAPIT selects the top PCs that account for the highest genetic variance, thereby optimizing population structure correction in association tests. The script `gwas_3455.R` serves as the foundation for running analyses on individual teosinte data subsets, as outlined below: - -- `run_gapit_all_taxa.sh` -- `run_gapit_diplo_plus_per.sh` -- `run_gapit_diplo.sh` -- `run_gapit_lux_plus_nica.sh` -- `run_gapit_lux.sh` -- `run_gapit_mex_chalco.sh` -- `run_gapit_mex_mesa.sh` -- `run_gapit_mex.sh` -- `run_gapit_parv_plus_hue.sh` -- `run_gapit_parv.sh` -- `run_gapit_perenn.sh` -- `run_gapit.sh` - -### GWAS Significant SNPs annotation and Summary tables - -The analysis scripts and files mentioned above are available in the directory [workflows/snp_gene_neighborhood_pipeline/scripts](workflows/snp_gene_neighborhood_pipeline/scripts) - -Significant SNPs were annotated using the B73 reference genome. This process involved downloading the B73 reference sequence and its annotations, running Mercator and ProtScriber, and generating annotations for each SNP. The annotation pipeline utilizes `snp_physical_mapping_pl_v2.py` for mapping SNPs to nearby _+50kb window_ features and `generate_reference_protein_function_annotations_v2.py` for annotating the SNPs with functional information from Mercator and InterProScan. The annotated results for each trait are stored in /`mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait/`. Additionally, heatmaps were generated to summarize the results, with scripts available in the directory `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/mercator_results_summarizer/`. These scripts provide both raw data and Z-transformed data. - -- The aim was to processes and annotates GWAS results files from GAPIT. - Prerequisites - Before running this pipeline, ensure you have the following: - - - A GFF3 file of the B73 reference sequence located at ../misc/b73/Zm-B73-REFERENCE-NAM-5.0_Zm00001eb.1.gff3. - - Mercator and ProtScriber analysis results from the protein.fa and gene.fa sequences. The gene.fa file was used to capture CDS information. - - Human-readable descriptions from the file Zm-B73-REFERENCE-NAM.UniProt.GO.tab . - - Step 1: Download Reference Sequence - Download the B73 reference sequence and its annotations. - - Step 2: Run Mercator - Execute Mercator using the downloaded protein and CDS sequences. - - Step 3: Generate Mercator and ProtScriber Annotations - - The following command to generate Mercator and ProtScriber annotations from the Mercator FASTA results file: - - `bash grep "^>" ../misc/b73/protein_seq_mercator_prot_swissprot_annot/b73_reference.fa | sed 's/>//g' > ../misc/b73/protein_seq_mercator_prot_swissprot_annot/b73_reference.fa.headers.txt` - - Step 4: Data Preparation and Annotation - Data Preparation: Ensure the GWAS results and reference files are ready for annotation. - - Annotation Script: Run `bash /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/annotate.sh` to annotate significant SNPs in your Teosinte GWAS results using the following components: - - `python3 snp_physical_mapping_pl_v2.py`: Maps SNPs to nearby features from the GAPIT.Filter_GWAS_result.csv. - `python3 generate_reference_protein_function_annotations_v2.py`: Annotates SNPs with Mercator, InterProScan, and ProtScriber information. - - Fast Processing Option: Use `annotate_v2.sh` for faster annotation, which runs`process_trait.sh`. This script includes refactored versions of the above scripts. - - Results: Each trait’s annotated SNP features was stored in `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait`. - - Final SNP Annotation Table: The final annotated SNP table, combining information from Mercator, InterProScan, and ProtScriber, was stored in the `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait` directory with filenames like `${trait}_annotated_snp_table_${model}.csv`. - - Combined Annotations: The combined annotations for each trait and model was stored in the directory `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results/taxon/all_features/trait` directory with filenames like `${taxon}_final_annotation_${trait}_${model}.csv`. - - Step 5: Cleanup and Merging - Remove empty files generated from models with no significant SNPs: - - Clean the repository by removing 0-byte files created from models with no significant SNPs: - `find /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results -size 0c -delete` - - Merge all annotation files into a mega CSV file containing all annotations for the final analysis - `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/merge_all_annotatated_files.py` - -- Heatmaps and overall GWAS results summary - The directory [mercator_results_summarizer](/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/mercator_results_summarizer) contains scripts to summarize Mercator annotations and visualize GWAS results using heatmaps. - - - `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/mercator_summary.py`: Summarizes Mercator annotations into two count tables—one with raw data and another with Z-transformed data. These tables help analyze SNP functional associations with genes and biological pathways. - - - `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/plot_heatmap.R`: Generates heatmaps of Mercator annotations using the output of mercator_summary.py. Hardcoded path to phenotype data: `/mnt/data/joseph/TEOSINTE/analyses-on-pheotype_and_climatic_data/downstream-data-and-outputs/data/phenotype_and_env_data_vars_labeled.csv` - -```bash -# parviglumis -python3 mercator_summary.py -p parameter_parv.csv -d 2 - -Rscript plot_heatmap.R parviglumis_bin_depth_2_vs_traits_df.csv parviglumis_depth_2 - -# Mexicana - All races -python3 mercator_summary.py -p parameter_mex.csv -d 2 - -Rscript plot_heatmap.R mexicana_bin_depth_2_vs_traits_df.csv mexicana_depth_2 - -# Mexicana - Mesa Central -python3 mercator_summary.py -p parameter_mex_mesa.csv -d 2 - -Rscript plot_heatmap.R mexicana_mesa_bin_depth_2_vs_traits_df.csv mexicana_mesa_central_depth_2 - -# Mexicana - Chalco -python3 mercator_summary.py -p parameter_mex_chalco.csv -d 2 - -Rscript plot_heatmap.R mexicana_chalco_bin_depth_2_vs_traits_df.csv mexicana_chalco_depth_2 - -# Diploperennis -python3 mercator_summary.py -p parameter_diplop.csv -d 2 - -Rscript plot_heatmap.R diplo_perennis_bin_depth_2_vs_traits_df.csv diplo_perennis_depth_2 -``` - -- `/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/unique_bins.R`: Preprocesses data to identify unique annotations for each taxon-trait combination and saves them in uniq_mercator_annotations.csv - --`/mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/src/unique_bins.R/plot_counts_snps_n_protein_coding_genes.py`: Summarizes the number of SNPs and protein-coding genes found in LD windows and generates horizontal bar plots. - -## Teosinte GWAS SNPs and literature SNPs comparisons - -The analysis scripts and files mentioned above are available in the directory [workflows/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps](workflows/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps) - -### Description and retrival of the Literature data - -- Source: [Diversity, SNP, and Trait panel MaizeGDB Genome browser](https://www.maizegdb.org/diversity) -- On the left panel, under Diversity section, activate the tracks; - - - `GWAS SNPs for 21 traits (Li. 2022 - Wang & Wang Labs)` - The Nature Plants paper by Li et al. 2022 performed genome-wide association studies (GWAS) for 21 important agronomic traits across 1,604 inbred lines at five agro-ecologically distinct locations in China in 2018 and 2019. On the basis of a cut-off of P < 10-6 for GWAS, they identified 2,360 significant associations at 1,847 SNPs. Raw data is found in Supplementary Table 6 of their paper. The SNP positions were originally mapped to B73 RefGen_v4. Each glyph on this track shows the SNP position and trait name. - - MaizeGDB remapped this dataset to B73 RefGen_v5, by taking the 100bp flanking sequeces for each SNP in RefGenv4 and BLASTing them to B73 RefGen_v5. The Blast results were converted to GFF. Each of the SNPs uniquely mapped to the B73 RefGen_v5 with 100% coverage and sequence identity. - - Mapping commands: - bedtools getfasta -fi Zm-B73-REFERENCE-GRAMENE-4.0.fa -bed snps.bed -s -name > snps.fa - makeblastdb -in Zm-B73-REFERENCE-NAM-5.0.fa -dbtype nucl -parse_seqids - blastn -db Zm-B73-REFERENCE-NAM-5.0.fa -query snps.fa -outfmt "6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore" -qcov_hsp_perc 100 -perc_identity 100 -out snps_positions.txt - - Publication: - Genomic insights into historical improvement of heterotic groups during modern hybrid maize breeding. - - [Raw data](https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA009749) - - [Original source code](https://www.nature.com/articles/s41477-022-01190-2#Sec29) - - - `GWAS SNPs from GWAS Atlas database` - (Updated February 24 2022) This track describes genome-wide association mapping (GWAS) from 133 papers covering 531 studies for 279 traits across ~42,000 loci overlapping ~8,400 genes. The data was compiled and remapped to B73_v4 at the GWAS Atlas database by the National Genomics Data Center at the Chinese Academy of Sciences (PMID: 31566222). Right-click on a feature to view details for each GWAS hit including trait name, tissue, position, allele, p-value, R^2 value, RS number, RS reference allele, RS alternate allele, publication information, and PubMed link. Track features are colored based on tissue. - - All of the original coordinates for B73 RefGen_v2, v3, and v4 have been remapped to B73 RefGen_v5. MaizeGDB used the 100bp upstream and downstream flanking sequences (retrieved using bedtools getfasta) for each SNP (in RefGen_v4) from the RefGen_v4 SNP genomic coordinates from the GWAS Atlas database (retrieved using bedtools getfasta) as query sequences aligned to B73 RefGen_v5 with at least 98% coverage and 98% sequence identity. This track shows the top hit (see command below) for each of those query sequences. If multiple top hits exist, the hit nearest the original location is chosen. RS numbers were assigned based on coordinate positions for the B73 RefGen_v4 SNPs downloaded from the European Variation Archive (March 2020). MaizeGDB also added information about overlapping features (using bedtools intersect) including open chromatin peaks (leaf,ear), 32 TF-binding sites peaks, and RefGen_v5 gene models. - - BLAST was used for the mapping (example command syntax: "blastn -db Zm-B73-REFERENCE-NAM-5.0.fa -query flanking_sequences.fasta -perc_identity 98 -qcov_hsp_perc 98"). - -- Right click on respectic SNPs track and download `Whole reference sequence` as `GFF3` format. -- Download was done per chromosome - -### Teosinte GWAS SNPs and Literature Comparisons - -The results of the GWAS conducted on teosinte were systematically compared with established SNP datasets, including the GWAS SNPs for 21 traits from the Li et al. (2022) study and those from the GWAS Atlas database. Comparison tables were generated and stored alongside the annotated SNP results. The following commands were utilized to perform the comparisons: - -```bash -python3 literature_snps_comparisons.py -t /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results_all_factors_snps_n_annotations/mexicana -a /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_from_GWAS_Atlas_database_merged.gff3 -l /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_for_21_traits_Li_2022_Wang_Labs_merged.gff3 - -python3 literature_snps_comparisons.py -t /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/results_all_factors_snps_n_annotations/parviglumis -a /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_from_GWAS_Atlas_database_merged.gff3 -l /mnt/data/joseph/TEOSINTE/analyses/GWAS/pipelines/snp_gene_neighborhood_pipeline/gwas_snps_vs_published_snps/data/processed_data/GWAS_SNPs_for_21_traits_Li_2022_Wang_Labs_merged.gff3 -``` diff --git a/workflows/README.md b/workflows/README.md new file mode 100644 index 0000000000000000000000000000000000000000..f33cad631d4b62873540287eaaac836a05e64bed --- /dev/null +++ b/workflows/README.md @@ -0,0 +1,22 @@ +# Directory Documentation + +The `workflows` directory contains independent analysis directories, scripts, and intermediate data generated for the GWAS study. Each subdirectory focuses on a specific analysis or task. + +The central documentation references these directories according to their respective analysis sequences. + +## Structure + +```plaintext +workflows/ +│ +├── admixture_plots/ +├── climate_factor_intersections/ +├── genetic_diversity/ +├── gwas_gapit_pipeline/ +├── manuscript_figures/ +├── phenotype_distribution_plots/ +├── phenotype_pca_and_hc/ +├── phylogeny_based_GBS/ +├── preprocessing_data/ +└── snp_gene_neighborhood_pipeline/ +```