Candidate selective sweeps in U.S. wheat populations

Sthapit, Sajal; Ruff, Travis; Hooker, Marcus; See, Deven

Published Nov 06, 2024 on Dryad. https://doi.org/10.5061/dryad.ghx3ffbx0

Data files

Nov 06, 2024 version files (274.54 MB total):

candidate_selective_sweeps_in_US_wheat.zip (274.53 MB)

README.md (6.46 KB)

Abstract

Exploration of novel alleles from ex situ collections is still limited in modern plant breeding because these alleles exist in the genetic backgrounds of landraces that are not adapted to modern production environments. Backcross breeding preserves the adapted background of elite parents but leaves little room for novel alleles from landraces to be incorporated. Selecting adaptation-associated linkage blocks instead of the entire adapted background may allow breeders to incorporate more of the landrace’s genetic background and to observe and evaluate novel alleles. Important adaptation-associated linkage blocks would have been selected over multiple cycles of breeding and hence are likely to exhibit signatures of positive selection, or selective sweeps. We conducted a genome-wide scan for candidate selective sweeps (CSS) using Fst, Rsb, and xpEHH in state, regional, spring, winter, and market class population pairs and report 446 CSS in 19 population pairs over time and 1033 CSS in 44 population pairs across geography and class. Further validation of these candidate selective sweeps in specific breeding programs may lead to the identification of sets of loci that can be selected to restore population-specific adaptation without multiple backcrossing.


Data set:

R scripts documented as RMarkdown (.Rmd) files for the data preparation and analysis reported in the study titled “Candidate selective sweeps in U.S. wheat populations.” The R scripts are numbered 00_ to 07_ and should be run sequentially in that order to reproduce the results presented. The data set also includes the .html outputs generated from running the R scripts in the .Rmd files.

Files and descriptions:

Files and folders in the root directory and their contents are described below.

File or Directory Name	Description
00_import_RefSeqv2.1_physical_positions.Rmd	Adds ‘Chinese Spring’ wheat reference genome ver. 2.1 physical positions to the genotype data files.
01_define_populations.Rmd	Defines which varieties belong to each population for the analysis.
02_calculate_iHH_iES_inES.Rmd	Uses the custom function ‘scan_populations’ to call the function ‘rehh::scan_hh’ and calculate integrated extended haplotype homozygosity.
03_calculate_allele_freq_Fst_Rsb_xpEHH.Rmd	Calculates allele frequencies, Fst, Rsb, and xpEHH statistics.
04_calculate_candidate_regions.Rmd	Scans for clusters of candidate selective sweeps and also calculates the polymorphic information content (PIC).
05_view_details_of_candidate_regions.Rmd	Generates SNP tables with PIC, allele frequency, Fst, Rsb, and xpEHH statistics for all candidate selective sweeps detected.
06_format_MapChart_input.Rmd	Generates input files (.mct) for the software MapChart to draw selection maps.
07_Fst_distribution.Rmd	Draws a plot of Fst distribution for all the population pairs in the study.
00_import_RefSeqv2.1_physical_positions.html	Knitted Rmarkdown file for the corresponding .Rmd file.
01_define_populations.html	Knitted Rmarkdown file for the corresponding .Rmd file.
02_calculate_iHH_iES_inES.html	Knitted Rmarkdown file for the corresponding .Rmd file.
03_calculate_allele_freq_Fst_Rsb_xpEHH.html	Knitted Rmarkdown file for the corresponding .Rmd file.
04_calculate_candidate_regions.html	Knitted Rmarkdown file for the corresponding .Rmd file.
05_view_details_of_candidate_regions.html	Knitted Rmarkdown file for the corresponding .Rmd file.
06_format_MapChart_input.html	Knitted Rmarkdown file for the corresponding .Rmd file.
07_Fst_distribution.html	Knitted Rmarkdown file for the corresponding .Rmd file.
data/	Folder contains the original as well as intermediate data files.
functions/functions_for_selection_sweep_analysis.R	Text file with the custom functions called by the .Rmd files for analysis.
output/	Folder contains the output plots, tables, and intermediate files.
output/mapchart/	Folder contains MapChart input files (.mct) to draw selection maps.
output/mapchart_sd2.5/	Folder contains MapChart input files (.mct) generated using a standard deviation threshold of 2.5 (instead of the default 2.0) for Rsb and xpEHH to produce less dense selection maps.
rehh_files/	Folder contains intermediate files generated by the R package ‘rehh’ to calculate extended haplotype homozygosity.

Folder Structure

The dataset has the following folder structure:

./ or the root folder has the scripts used for analysis in R Markdown files as well as the corresponding .html output from running these scripts.

./data/ has the raw data and the intermediate data saved during the analysis.

./functions/ has one file "functions_for_selection_sweep_analysis.R" that has the custom functions written for the analysis in the manuscript.

./output/ has the analysis results and figures used in the manuscript

./output/mapchart/ has the MapChart input files for drawing linkage maps of candidate selective sweeps that were filtered for Fst, Rsb, and xpEHH thresholds of 2 standard deviations.

./output/mapchart_sd2.5/ has the MapChart input files for drawing linkage maps of candidate selective sweeps that were filtered for Fst, Rsb, and xpEHH thresholds of 2.5 standard deviations.

./rehh_files/ has two subfolders /genotype and /map that store the intermediate files generated by the R package 'rehh' to calculate Rsb and xpEHH.

Raw data files

The analysis in the manuscript uses the following raw data files. Data files not in this list are all intermediate files created by the analysis scripts.

./data/90k_SNP_type.txt

A tab-delimited file with 4 columns as described below:

Index: serial number of genetic markers/loci on the 90K wheat SNP chip.

Name: Unique names of the genetic markers/loci on the 90K wheat SNP chip.

SNP: Alleles present in the single nucleotide polymorphism (SNP) marker/loci.

SNPTYPE: Same information as in column SNP but in a format without square brackets and /
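
The relationship between the SNP and SNPTYPE columns can be sketched in a few lines of Python. This is only an illustration: the example value '[A/G]' is an assumption about how the alleles are written in the SNP column (the README states only that SNPTYPE omits the square brackets and the slash), and the function name snp_to_snptype is hypothetical.

```python
def snp_to_snptype(snp: str) -> str:
    """Drop the square brackets and the slash from an allele string.

    Assumes the SNP column looks like '[A/G]'; that exact layout is an
    assumption made for illustration, not something the README confirms.
    """
    return snp.replace("[", "").replace("]", "").replace("/", "")


# Hypothetical example values.
for snp in ("[A/G]", "[T/C]"):
    print(snp, "->", snp_to_snptype(snp))
```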

./data/KIM_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt

A tab-delimited file with information on known informative markers (KIM) recorded in 8 columns described below.

Marker: Name of the marker to be used as the label in the linkage maps in Supplemental Figures.

Chromosome: Chromosome label for wheat.

Start1.0: Physical position in base pairs in the 'Chinese Spring' wheat reference genome sequence version 1.0. This information was not used in the current study.

Start: Physical position in base pairs in the 'Chinese Spring' wheat reference genome sequence version 2.1.

Prop: Proportion sequence match for the marker to the reference genome sequence version 2.1.

SNP_ID: Alternative name for the marker. This information was not used in the current study.

Gene: Name of the gene.

Function: Function of the gene.

./data/R-generated-genotype-for-analysis-imputed-AB-format.csv

Raw 90K wheat SNP chip data after quality filtering and imputation using LinkImpute as described in Sthapit et al. The dataset includes the 7 information columns described below, followed by 753 columns with genotype information in the AB format.

Name: Unique names of the genetic markers/loci on the 90K wheat SNP chip.

SNPid: Unique IWA and IWB SNP names of the genetic markers/loci on the 90K wheat SNP chip.

Chrom: Wheat chromosome labels.

Ord: Order of the marker. This information was not used for analysis.

cM: Centimorgan position of the marker. This information was not used for analysis.

Comment: Notes on manual classification of genotype calls in GenomeStudio.

Remaining columns have variety names and their corresponding genotype calls in AB format.

./data/R-generated-genotype-for-analysis-imputed-nucleotide-format.csv

Same information as in ./data/R-generated-genotype-for-analysis-imputed-AB-format.csv, but the genotype information in the last 753 columns is recorded in the nucleotide (ACGT) format.

./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt

Contains physical base pair positions on the 'Chinese Spring' wheat reference sequence version 2.1 for the 90K SNP chip markers. The file has 5 columns without column headers. The column descriptions are given below.

First column has unique names of the genetic markers/loci on the 90K Wheat SNP chip.

Second column has wheat chromosome labels.

Third column has the starting base pair position of the marker on the reference sequence version 2.1.

Fourth column has the ending base pair position of the marker on the reference sequence version 2.1.

Fifth column has the mid-point of the third and fourth columns, which was used as the SNP position for the marker in this study.
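
As a minimal sketch of the layout described above, the following Python snippet reads a headerless, tab-delimited file with these five columns and checks that the fifth column is the mid-point of the third and fourth. The marker names and positions are invented, and whether the real file rounds the mid-point is not stated in this README, so the example keeps it as a floating-point value.

```python
import csv
import io

# Invented rows in the described layout: marker name, chromosome,
# start position, end position, mid-point (tab-delimited, no header).
sample = "marker_1\t1A\t1000\t1100\t1050\nmarker_2\t2B\t500\t601\t550.5\n"


def read_positions(handle):
    """Parse the five headerless columns into a list of dictionaries."""
    rows = []
    for name, chrom, start, end, mid in csv.reader(handle, delimiter="\t"):
        rows.append({
            "name": name,
            "chrom": chrom,
            "start": int(start),
            "end": int(end),
            "mid": float(mid),
        })
    return rows


def midpoint(start, end):
    """Mid-point of the start and end base pair positions."""
    return (start + end) / 2


positions = read_positions(io.StringIO(sample))
for row in positions:
    # The stored fifth column should agree with the recomputed mid-point.
    assert midpoint(row["start"], row["end"]) == row["mid"]
```

To apply the same check to the real file, the io.StringIO(sample) handle could be replaced by an open file handle for ./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt.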

./data/variety_details.txt

Contains information about the 753 wheat varieties used as the diversity panel for this study. The file contains 12 columns, which are described below:

GS.Sample.ID: Names of the samples/varieties as they were in the raw output from the Illumina SNP calling software Genome Studio.

Corrected.Sample.ID: Names of the samples/varieties after correction of typos (for example, 'Eric' to 'Erik') and removal of the prefix "varname" from varieties that have only numbers in their names (for example, 'varname2154' to '2154').

ACNO: Accession number of the varieties from the NPGS-GRIN database.

Habit: Growth habit (spring or winter) of the varieties.

Region: U.S. wheat growing regions: EAS, Eastern; GPL, Great Plains; NOR, Northern; PAC, Pacific; PNW, Pacific Northwest. A description of how states were assigned to these regions is in the methods section of the manuscript.

State: U.S. state the varieties are from.

Year: The year the variety was released in the U.S.

MC: Market class of the wheat variety: HRS, hard red spring; HRW, hard red winter; SRW, soft red winter; SWS, soft white spring; SWW, soft white winter.

HeadType: Designates if the spike or head of the wheat is club or common.

Sector: Whether the variety was from the public or private sector. Information in this column is incomplete and hence was not used for any analysis in the manuscript.

Decade: Decade the variety was released.

BP: Breeding period the variety was released.
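
The two kinds of correction described for the Corrected.Sample.ID column can be sketched as follows. This is a hedged illustration in Python, not the authors' code: the lookup table contains only the single typo named in this README, and the function name correct_sample_id is hypothetical.

```python
import re

# Only the typo named in the README; any further entries would be guesses.
TYPO_FIXES = {"Eric": "Erik"}


def correct_sample_id(gs_id: str) -> str:
    """Fix a known typo, then strip 'varname' from all-numeric names."""
    fixed = TYPO_FIXES.get(gs_id, gs_id)
    # Remove the prefix only when nothing but digits follows it.
    return re.sub(r"^varname(?=\d+$)", "", fixed)


for raw in ("Eric", "varname2154"):
    print(raw, "->", correct_sample_id(raw))
```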

Description of Scripts

Here we describe the scripts in order along with the input data files used and the output files these scripts produced.

00_import_RefSeqv2.1_physical_positions.Rmd

00_import_RefSeqv2.1_physical_positions.html (R Markdown output html)

The study uses genotype data generated from our previous study (https://doi.org/10.1002/tpg2.20196) that had marker physical positions based on wheat reference sequence version 1. This script updates the marker physical positions to the wheat reference sequence version 2.1 and saves the updated genotype files for subsequent analyses.

Input files:

./data/SNP_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt

./data/R-generated-genotype-for-analysis-imputed-nucleotide-format.csv

./data/R-generated-genotype-for-analysis-imputed-AB-format.csv

Output files:

./data/genotype_AB_format_13995_loci_imputed.txt

./data/genotype_nucleotide_format_13995_loci_imputed.txt

01_define_populations.Rmd

01_define_populations.html (R Markdown output html)

The script assigns varieties to sub-populations as described in the methods section of the manuscript.

Input files:

./functions/functions_for_selection_sweep_analysis.R

./data/variety_details.txt

Output files:

./data/populations.rds

./output/first_last_varieties.csv
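
To illustrate the kind of grouping this step performs, the Python sketch below splits varieties into groups by the value of one column of a table laid out like ./data/variety_details.txt. The variety names are invented, the way the Habit values are spelled in the real file is an assumption, and the actual population definitions (including the breeding periods) follow the rules in the manuscript, which this example does not reproduce.

```python
import csv
import io
from collections import defaultdict

# Invented rows using a subset of the documented column names.
sample = (
    "Corrected.Sample.ID\tHabit\tRegion\tMC\n"
    "VarA\tspring\tPNW\tSWS\n"
    "VarB\twinter\tPNW\tSWW\n"
    "VarC\tspring\tNOR\tHRS\n"
    "VarD\twinter\tGPL\tHRW\n"
)


def group_by(rows, column):
    """Map each distinct value of `column` to the list of variety names."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[column]].append(row["Corrected.Sample.ID"])
    return dict(groups)


rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
by_habit = group_by(rows, "Habit")
by_region = group_by(rows, "Region")
print(by_habit)
print(by_region)
```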

02_calculate_iHH_iES_inES.Rmd

02_calculate_iHH_iES_inES.html (R Markdown output html)

This script uses the 'rehh' package function 'scan_hh' called through the custom function 'scan_population' to calculate the integrated extended haplotype homozygosity (iHH), integrated site-specific extended haplotype homozygosity (iES), and integrated normalized site-specific extended haplotype homozygosity (inES) for all markers of all 21 chromosomes and all wheat sub-populations in the study. The intermediate files needed to run these calculations were written to the folders ./rehh_files/genotype and ./rehh_files/map. The output is saved as an RDS file to be used as input for subsequent scripts.

Input files:

./functions/functions_for_selection_sweep_analysis.R

./data/genotype_nucleotide_format_13995_loci_imputed.txt

./data/populations.rds

Output files:

./output/scan_hh_ihs_results_polFALSE_sgap2.5MB_mgapNAMB_discardBorderTRUE.rds

03_calculate_allele_freq_Fst_Rsb_xpEHH.Rmd

03_calculate_allele_freq_Fst_Rsb_xpEHH.html (R Markdown output html)

This script calculates allele frequencies for all the sub-populations and Fst, Rsb, and xpEHH statistics for the defined sub-population pairs.

Input files:

./functions/functions_for_selection_sweep_analysis.R

./data/genotype_nucleotide_format_13995_loci_imputed.txt

./data/genotype_AB_format_13995_loci_imputed.txt

./output/scan_hh_ihs_results_polFALSE_sgap2.5MB_mgapNAMB_discardBorderTRUE.rds

Output files:

./output/allele_freq_Fst_Rsb_xpEHH.Rds
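
For readers unfamiliar with the quantities involved, here is a minimal Python sketch of an allele frequency and a simple two-population Fst for one biallelic marker. It rests on two assumptions that this README does not confirm: that AB-format calls are written as 'AA', 'AB', or 'BB', and that Fst is computed as (Ht - Hs) / Ht with equal population weights. The estimator actually used in the script may differ.

```python
VALID_CALLS = ("AA", "AB", "BB")


def allele_a_freq(calls):
    """Frequency of allele A among valid AB-format genotype calls."""
    alleles = "".join(c for c in calls if c in VALID_CALLS)
    if not alleles:
        raise ValueError("no valid genotype calls")
    return alleles.count("A") / len(alleles)


def fst_two_pops(p1, p2):
    """Simple Fst = (Ht - Hs) / Ht for two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_t == 0:
        return 0.0  # marker is monomorphic across both populations
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    return (h_t - h_s) / h_t


# Invented genotype calls for one marker in two hypothetical populations.
pop1 = ["AA", "AA", "AA", "AB"]
pop2 = ["BB", "BB", "AB", "BB"]
p1, p2 = allele_a_freq(pop1), allele_a_freq(pop2)
print(p1, p2, fst_two_pops(p1, p2))
```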

04_calculate_candidate_regions.Rmd

04_calculate_candidate_regions.html (R Markdown output html)

Scans for candidate selective sweeps for all the population pairs using the Fst, Rsb, xpEHH statistics calculated in the previous script.

Input files:

./functions/functions_for_selection_sweep_analysis.R

./data/genotype_nucleotide_format_13995_loci_imputed.txt

./data/genotype_AB_format_13995_loci_imputed.txt

./data/90k_SNP_type.txt

./output/allele_freq_Fst_Rsb_xpEHH.Rds

Output files:

./output/raw_results_for_cr_search.RDS

./output/cr_consolidated.csv
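
The effect of the 2 versus 2.5 standard deviation thresholds mentioned in this README can be sketched as below. This Python example is an assumption-laden illustration: it flags markers whose score lies more than n_sd sample standard deviations from the mean in either direction, which may not match how the script standardizes each statistic or clusters flagged markers into candidate regions. The scores are invented.

```python
from statistics import mean, stdev


def flag_extreme(scores, n_sd=2.0):
    """Indices of scores more than n_sd sample SDs away from the mean."""
    mu = mean(scores)
    sd = stdev(scores)
    return [i for i, s in enumerate(scores) if abs(s - mu) > n_sd * sd]


# Invented per-marker scores; index 9 is a deliberately extreme value.
scores = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, 5.0]
default_hits = flag_extreme(scores, n_sd=2.0)
strict_hits = flag_extreme(scores, n_sd=2.5)
print(default_hits, strict_hits)
```

A stricter threshold can only keep a subset of the markers flagged by a looser one, which is consistent with the less dense maps described for ./output/mapchart_sd2.5/.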

05_view_details_of_candidate_regions.Rmd

05_view_details_of_candidate_regions.html (R Markdown output html)

For every candidate selective sweep identified in the study, this script creates a table with informative details about the SNPs, physical position, polymorphic information scores, allele frequencies, Fst, Rsb, and xpEHH.

Input files:

./output/raw_results_for_cr_search.RDS

./output/cr_consolidated.csv

Output files:

./output/cr_pic_freq_consolidated.csv
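
As a point of reference for the polymorphic information scores in these tables, the sketch below computes the polymorphic information content (PIC) of a marker from its allele frequencies using the commonly cited formula PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2. Whether the script uses exactly this formula is an assumption; the README does not say.

```python
def pic(freqs):
    """Polymorphic information content from a list of allele frequencies."""
    homozygosity = sum(p * p for p in freqs)
    cross_term = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return 1 - homozygosity - cross_term


# A biallelic marker with equal allele frequencies reaches the biallelic
# maximum; a fixed marker carries no information.
print(pic([0.5, 0.5]))
print(pic([1.0, 0.0]))
```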

06_format_MapChart_input.Rmd

06_format_MapChart_input.html (R Markdown output html)

This script generates the summaries for Table 1 and creates Figure 3 from the manuscript. It also generates the MapChart input files (.mct) that were used to create Supplemental Figures S3 and S4.

Input files:

./functions/functions_for_selection_sweep_analysis.R

./output/raw_results_for_cr_search.RDS

./output/cr_pic_freq_consolidated.csv

./data/KIM_physical_positions_on_IWGSC_CS_RefSeq_v2.1.txt

./data/genotype_AB_format_13995_loci_imputed.txt

Output files:

./output/categories.rds

./output/cr_99_consolidated.csv

./output/mapchart_sd2.5/*.mct

07_Fst_distribution.Rmd

07_Fst_distribution.html (R Markdown output html)

This script generates Figure 2 from the manuscript.

Input files:

./output/allele_freq_Fst_Rsb_xpEHH.Rds

./functions/functions_for_selection_sweep_analysis.R

Output files:

./output/Fst_differentiation_plot.pdf
