Folder Structure
The dataset has the following folder structure:
./ or the root folder has the scripts used for analysis in R Markdown files as well as the corresponding .html output from running these scripts.
./data/ has the raw data and the intermediate data saved during the analysis.
./functions/ has one file "functions_for_selection_sweep_analysis.R" that has the custom functions written for the analysis in the manuscript.
./output/ has the analysis results and figures used in the manuscript.
./output/mapchart/ has the MapChart input files for drawing linkage maps of candidate selective sweeps that were filtered for Fst, Rsb, and xpEHH thresholds of 2 standard deviations.
./output/mapchart_sd2.5/ has the MapChart input files for drawing linkage maps of candidate selective sweeps that were filtered for Fst, Rsb, and xpEHH thresholds of 2.5 standard deviations.
./rehh_files/ has two subfolders, /genotype and /map, that store the intermediate files generated by the R package 'rehh' to calculate Rsb and xpEHH.
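The 2 and 2.5 standard deviation thresholds mentioned for the two MapChart folders can be pictured with the minimal sketch below. It assumes a one-sided cutoff of mean + k × sample standard deviation over a vector of statistic values. The exact rule used in the manuscript (one-sided or two-sided, genome-wide or per chromosome) is not stated in this README, and the function name flag_outliers is hypothetical.

```python
from statistics import mean, stdev

def flag_outliers(values, n_sd=2.0):
    """Return the indices of values above mean + n_sd * sample SD.

    Assumption: a one-sided cutoff; the analysis scripts may use a different rule.
    """
    cutoff = mean(values) + n_sd * stdev(values)
    return [i for i, v in enumerate(values) if v > cutoff]

# Made-up statistic values for six markers, for illustration only.
scores = [0, 0, 0, 0, 0, 1]
print(flag_outliers(scores, n_sd=2.0))
print(flag_outliers(scores, n_sd=2.5))
```

A stricter threshold can only shrink the set of flagged markers, which is why the 2.5 SD folder holds a subset of the candidates in the 2 SD folder under this assumption.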
Raw data files
The analysis in the manuscript uses the following raw data files. Data files not in this list are all intermediate files created by the analysis scripts.
./data/90k_SNP_type.txt
A tab-delimited file with 4 columns as described below:
Index: serial number of genetic markers/loci on the 90K wheat SNP chip.
Name: Unique names of the genetic markers/loci on the 90K wheat SNP chip.
SNP: Alleles present in the single nucleotide polymorphism (SNP) marker/loci.
SNPTYPE: Same information as in column SNP but in a format without square brackets and /
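The relationship between the SNP and SNPTYPE columns can be sketched as follows. The example assumes SNP values are written like "[A/G]". The README only says that SNPTYPE omits the square brackets and the slash, so the exact input encoding and the function name snp_to_snptype are assumptions.

```python
def snp_to_snptype(snp: str) -> str:
    """Remove square brackets and slashes from an allele string."""
    # str.maketrans with a third argument lists characters to delete.
    return snp.translate(str.maketrans("", "", "[]/"))

print(snp_to_snptype("[A/G]"))  # prints "AG"
```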
./data/KIM_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt
A tab-delimited file with information on known informative markers (KIM) recorded in the 8 columns described below.
Marker: Name of the marker to be used as the label in the linkage maps in Supplemental Figures.
Chromosome: Chromosome label for wheat.
Start1.0: Physical position in base pairs in the 'Chinese Spring' wheat reference genome sequence version 1.0. This information was not used in the current study.
Start: Physical position in base pairs in the 'Chinese Spring' wheat reference genome sequence version 2.1.
Prop: Proportion sequence match for the marker to the reference genome sequence version 2.1.
SNP_ID: Alternative name for the marker. This information was not used in the current study.
Gene: Name of the gene.
Function: Function of the gene.
./data/R-generated-genotype-for-analysis-imputed-AB-format.csv
Raw 90K wheat SNP chip data after quality filtering and imputation using LinkImpute, as described in Sthapit et al. The dataset includes the 7 information columns described below, followed by 753 columns with genotype information in the AB format.
Name: Unique names of the genetic markers/loci on the 90K wheat SNP chip.
SNPid: Unique IWA and IWB SNP names of the genetic markers/loci on the 90K wheat SNP chip.
Chrom: Wheat chromosome labels.
Ord: Order of the marker. This information was not used for analysis.
cM: Centimorgan position of the marker. This information was not used for analysis.
Comment: Notes on manual classification of genotype calls in GenomeStudio.
Remaining columns have variety names and their corresponding genotype calls in AB format.
./data/R-generated-genotype-for-analysis-imputed-nucleotide-format.csv
Same information as in ./data/R-generated-genotype-for-analysis-imputed-AB-format.csv, but the genotype information in the last 753 columns is recorded in the nucleotide (ACGT) format.
./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt
Contains physical base pair positions on the 'Chinese Spring' wheat reference sequence version 2.1 for the 90K SNP chip markers. The file has 5 columns without column headers. The column descriptions are given below.
First column has unique names of the genetic markers/loci on the 90K Wheat SNP chip.
Second column has wheat chromosome labels.
Third column has the starting base pair position of the marker on the reference sequence version 2.1.
Fourth column has the ending base pair position of the marker on the reference sequence version 2.1.
Fifth column has the mid-point of the third and fourth columns, which was used as the SNP position for the marker in this study.
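Because this file has no header row, column names have to be supplied when reading it. The sketch below is one way to do that with the Python standard library. It assumes the file is tab-delimited (the README does not state the delimiter), and the field names, function names, and the example record are all hypothetical.

```python
import csv
import io

def read_snp_positions(handle):
    """Parse the headerless 5-column position file into a list of dicts."""
    records = []
    for row in csv.reader(handle, delimiter="\t"):  # assumption: tab-delimited
        if not row:
            continue  # skip blank lines
        marker, chrom, start, end, mid = row
        records.append({
            "marker": marker,
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "mid": float(mid),
        })
    return records

def midpoint(start, end):
    """Mid-point of the start and end positions, as described for column five."""
    return (start + end) / 2

# Hypothetical record, for illustration only.
example = io.StringIO("hypothetical_marker\t1A\t100\t200\t150\n")
records = read_snp_positions(example)
print(records[0]["mid"] == midpoint(records[0]["start"], records[0]["end"]))
```

For the real file, open("./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt", newline="") would replace the in-memory example.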
./data/variety_details.txt
Contains information about the 753 wheat varieties used as the diversity panel for this study. The file contains 12 columns, which are described below:
GS.Sample.ID: Names of the samples/varieties as they were in the raw output from the Illumina SNP calling software Genome Studio.
Corrected.Sample.ID: Names of the samples/varieties after correction of typos (for example, 'Eric' to 'Erik') and removal of the prefix "varname" from varieties that only have numbers in their names (for example, 'varname2154' to '2154').
ACNO: Accession number of the varieties from the NPGS-GRIN database.
Habit: Growth habit (spring or winter) of the varieties.
Region: U.S. wheat growing regions: EAS, Eastern; GPL, Great Plains; NOR, Northern; PAC, Pacific; PNW, Pacific Northwest. How states were assigned to these regions is described in the methods section of the manuscript.
State: U.S. state the varieties are from.
Year: The year the variety was released in the U.S.
MC: Market class of the wheat variety: HRS, hard red spring; HRW, hard red winter; SRW, soft red winter; SWS, soft white spring; SWW, soft white winter.
HeadType: Designates if the spike or head of the wheat is club or common.
Sector: Whether the variety came from the public or private sector. Information in this column is incomplete and hence was not used for any analysis in the manuscript.
Decade: Decade the variety was released.
BP: Breeding period the variety was released.
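The Region and MC codes listed above can be expanded with simple lookup tables. The sketch below copies the code definitions from this README; the function name describe_variety and the output wording are hypothetical.

```python
# Code definitions taken from the Region and MC column descriptions.
REGIONS = {
    "EAS": "Eastern",
    "GPL": "Great Plains",
    "NOR": "Northern",
    "PAC": "Pacific",
    "PNW": "Pacific Northwest",
}
MARKET_CLASSES = {
    "HRS": "hard red spring",
    "HRW": "hard red winter",
    "SRW": "soft red winter",
    "SWS": "soft white spring",
    "SWW": "soft white winter",
}

def describe_variety(region, market_class):
    """Expand the Region and MC codes into a readable label."""
    return f"{MARKET_CLASSES[market_class]} wheat, {REGIONS[region]} region"

print(describe_variety("PNW", "SWW"))
```

An unknown code raises a KeyError, which is a cheap way to catch typos in the Region or MC columns.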
Description of Scripts
Here we describe the scripts in order along with the input data files used and the output files these scripts produced.
./00_import_RefSeqv2.1_physical_positions.Rmd
./00_import_RefSeqv2.1_physical_positions.html (R Markdown output html)
The study uses genotype data generated from our previous study (https://doi.org/10.1002/tpg2.20196) that had marker physical positions based on wheat reference sequence version 1. This script updates the marker physical positions to the wheat reference sequence version 2.1 and saves the updated genotype files for subsequent analyses.
Input files:
./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt
./data/R-generated-genotype-for-analysis-imputed-nucleotide-format.csv
./data/R-generated-genotype-for-analysis-imputed-AB-format.csv
Output files:
./data/genotype_AB_format_13995_loci_imputed.txt
./data/genotype_nucleotide_format_13995_loci_imputed.txt
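The position update performed by this script can be thought of as a join on marker name. The sketch below is a Python illustration, not the R code in the script. It assumes that markers without a version 2.1 position are dropped, and the field name Pos, the function name update_positions, and the example records are hypothetical.

```python
def update_positions(genotype_rows, positions):
    """Attach updated chromosome and position values to genotype rows.

    genotype_rows: list of dicts, each with a 'Name' key (marker name).
    positions: dict mapping marker name -> (chromosome, position).
    """
    updated = []
    for row in genotype_rows:
        if row["Name"] not in positions:
            continue  # assumption: markers without a new position are dropped
        chrom, pos = positions[row["Name"]]
        updated.append({**row, "Chrom": chrom, "Pos": pos})
    return updated

# Hypothetical records, for illustration only.
rows = [
    {"Name": "marker_a", "Chrom": "1A"},
    {"Name": "marker_b", "Chrom": "2B"},
]
new_positions = {"marker_a": ("1A", 150.0)}
print(update_positions(rows, new_positions))
```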
01_define_populations.Rmd
01_define_populations.html (R Markdown output html)
This script assigns varieties to sub-populations as described in the methods section of the manuscript.
Input files:
./functions/functions_for_selection_sweep_analysis.R
./data/variety_details.txt
Output files:
./data/populations.rds
./output/first_last_varieties.csv
02_calculate_iHH_iES_inES.Rmd
02_calculate_iHH_iES_inES.html (R Markdown output html)
This script uses the 'rehh' package function 'scan_hh' called through the custom function 'scan_population' to calculate the integrated extended haplotype homozygosity (iHH), integrated site-specific extended haplotype homozygosity (iES), and integrated normalized site-specific extended haplotype homozygosity (inES) for all markers of all 21 chromosomes and all wheat sub-populations in the study. The intermediate files needed to run these calculations were written to the folders ./rehh_files/genotype and ./rehh_files/map. The output is saved as an RDS file to be used as input for subsequent scripts.
Input files:
./functions/functions_for_selection_sweep_analysis.R
./data/genotype_nucleotide_format_13995_loci_imputed.txt
./data/populations.rds
Output files:
./output/scan_hh_ihs_results_polFALSE_sgap2.5MB_mgapNAMB_discardBorderTRUE.rds
03_calculate_allele_freq_Fst_Rsb_xpEHH.Rmd
03_calculate_allele_freq_Fst_Rsb_xpEHH.html (R Markdown output html)
This script calculates allele frequencies for all the sub-populations and the Fst, Rsb, and xpEHH statistics for defined sub-population pairs.
Input files:
./functions/functions_for_selection_sweep_analysis.R
./data/genotype_nucleotide_format_13995_loci_imputed.txt
./data/genotype_AB_format_13995_loci_imputed.txt
./output/scan_hh_ihs_results_polFALSE_sgap2.5MB_mgapNAMB_discardBorderTRUE.rds
Output files:
./output/allele_freq_Fst_Rsb_xpEHH.Rds
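As an illustration of the allele frequency step only, the sketch below computes the frequency of allele A at one marker from AB-format calls. It assumes diploid calls coded as 'AA', 'AB', or 'BB' and treats anything else as missing. The actual coding in the genotype files and the calculation in the R script may differ, and the function name allele_a_frequency is hypothetical.

```python
def allele_a_frequency(calls):
    """Frequency of allele A among recognised diploid calls at one marker.

    Returns None when no call is recognised.
    """
    a_counts = {"AA": 2, "AB": 1, "BA": 1, "BB": 0}
    scored = [a_counts[c] for c in calls if c in a_counts]  # skip missing calls
    if not scored:
        return None
    return sum(scored) / (2 * len(scored))

# Made-up calls for four varieties, for illustration only.
print(allele_a_frequency(["AA", "AB", "BB", "BB"]))  # prints 0.375
```

Running the same function on the varieties of each sub-population gives the per-population frequencies that statistics such as Fst compare.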
04_calculate_candidate_regions.Rmd
04_calculate_candidate_regions.html (R Markdown output html)
This script scans for candidate selective sweeps for all the population pairs using the Fst, Rsb, and xpEHH statistics calculated in the previous script.
Input files:
./functions/functions_for_selection_sweep_analysis.R
./data/genotype_nucleotide_format_13995_loci_imputed.txt
./data/genotype_AB_format_13995_loci_imputed.txt
./data/90k_SNP_type.txt
./output/allele_freq_Fst_Rsb_xpEHH.Rds
Output files:
./output/raw_results_for_cr_search.RDS
./output/cr_consolidated.csv
05_view_details_of_candidate_regions.Rmd
05_view_details_of_candidate_regions.html (R Markdown output html)
For every candidate selective sweep identified in the study, this script creates a table with informative details about the SNPs, physical position, polymorphic information scores, allele frequencies, Fst, Rsb, and xpEHH.
Input files:
./output/raw_results_for_cr_search.RDS
./output/cr_consolidated.csv
Output files:
./output/cr_pic_freq_consolidated.csv
06_format_MapChart_input.Rmd
06_format_MapChart_input.html (R Markdown output html)
This script generates the summaries for Table 1 and creates Figure 3 from the manuscript. It also generates the MapChart input files (.mct) that were used to create Supplemental Figures S3 and S4.
Input files:
./functions/functions_for_selection_sweep_analysis.R
./output/raw_results_for_cr_search.RDS
./output/cr_pic_freq_consolidated.csv
./data/KIM_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt
./data/genotype_AB_format_13995_loci_imputed.txt
Output files:
./output/categories.rds
./output/cr_99_consolidated.csv
./output/mapchart_sd2.5/*.mct
07_Fst_distribution.Rmd
07_Fst_distribution.html (R Markdown output html)
This script generates Figure 2 from the manuscript.
Input files:
./output/allele_freq_Fst_Rsb_xpEHH.Rds
./functions/functions_for_selection_sweep_analysis.R
Output files:
./output/Fst_differentiation_plot.pdf