
Data from: A nested association mapping panel in Arabidopsis thaliana for mapping and characterizing genetic architecture

Cite this dataset

Brock, Marcus (2021). Data from: A nested association mapping panel in Arabidopsis thaliana for mapping and characterizing genetic architecture [Dataset]. Dryad. https://doi.org/10.5061/dryad.c2fqz614w

Abstract

Linkage and association mapping populations are crucial public resources that facilitate the characterization of trait genetic architecture in natural and agricultural systems.  We define a large nested association mapping panel (NAM) from 14 publicly available recombinant inbred populations (RILs) of Arabidopsis thaliana, which share a common recurrent parent (Col-0).  Using a genotyping-by-sequencing approach (GBS), we identified single nucleotide polymorphisms (SNPs; range 563-1525 per population) and subsequently built updated linkage maps in each of the 14 RIL sets.  Simulations in individual RIL populations indicate that our GBS markers have improved power to detect small-effect QTL and enhanced resolution of QTL support intervals in comparison to the original linkage maps.  Using these robust linkage maps, we imputed a common set of publicly available parental SNPs into each RIL linkage map, generating overlapping markers across all populations.  Though the outcome ultimately depends on allele frequencies at causal loci, simulations of the NAM panel suggest that surveying between 4 and 7 of the 14 RIL populations provides high resolution of the genetic architecture of complex traits, relative to a single mapping population.

Methods

SNP discovery and curation:  We selected 14 Arabidopsis thaliana RIL populations from the Institut National de la Recherche Agronomique (INRA; Versailles, France) that use Col-0 as a common recurrent parent.  From each population, we selected the 150 most informative RILs, for a total of 2100 unique F8 RILs.  To build high-density linkage maps across all 14 Arabidopsis thaliana RIL populations, we took a genotyping-by-sequencing (GBS) approach to SNP discovery.  We digested each DNA sample with the restriction endonucleases EcoRI and HindIII and then ligated to each fragment customized adapters containing the Illumina adapter sequences and 8-10 bp barcode sequences.  Ligated fragments were PCR amplified in two separate reactions, and the resulting products were pooled to limit stochastic effects on the relative abundance of fragments.  PCR products were then pooled across individuals, and libraries were size selected for fragments of 250-700 bp using a BluePippin (Beverly, MA, USA).  Initial GBS libraries were sent to the RTSF Genomics Core (Michigan State University, East Lansing, MI, USA), and follow-up runs were sent to the Genomic Sequencing and Analysis Facility (University of Texas, Austin, TX, USA).  At both facilities, libraries were sequenced on the Illumina HiSeq 2500 platform (1 × 100 bp), and over 1 billion reads were assigned to barcoded samples.

Reads were mapped onto the Arabidopsis thaliana reference genome (TAIR10) using two separate approaches, and the resulting SNP calls were merged.  First, we used SOAP (SOAPaligner ver. 2.21 and SOAPsnp ver. 1.03) in order to set priors on genotype calls based on the probability of expected homozygosity in an F8 RIL population.  Second, we used BWA’s aln and samse algorithms (ver. 0.7) to map reads to the TAIR10 reference.  We then called SNPs using the SAMtools mpileup and BCFtools view (ver. 0.1.19) algorithms.  In both approaches, we retained only uniquely mapping reads and only SNP genotype calls with a read depth of eight or more.  We used custom Perl scripts to combine the two SNP calling approaches, merging the novel SNPs from BWA/SAMtools into the SOAPsnp results.  Finally, we merged the SNPs originally genotyped in each RIL population (INRA; Versailles, France) into our GBS SNP set after converting the INRA SNPs to the TAIR10 coordinate system.
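The depth filter and merge described above can be pictured with a small sketch.  The custom Perl scripts are not described further here, so the Python below is only an illustration under assumed data structures: each caller's output is represented as a dictionary keyed by (sample, chromosome, position) that holds a hypothetical (genotype, read depth) pair.  The function names, the toy data, and the rule that SOAPsnp calls take precedence at shared sites are all assumptions of this sketch, not the authors' implementation.

```python
MIN_DEPTH = 8  # read-depth threshold stated in the methods


def filter_by_depth(calls, min_depth=MIN_DEPTH):
    """Keep only genotype calls supported by at least min_depth reads."""
    return {key: (gt, dp) for key, (gt, dp) in calls.items() if dp >= min_depth}


def merge_calls(soap_calls, bwa_calls, min_depth=MIN_DEPTH):
    """Add BWA/SAMtools calls only at sites where no SOAPsnp call was retained.

    Giving SOAPsnp precedence at shared sites is an assumption of this sketch.
    """
    merged = filter_by_depth(soap_calls, min_depth)
    for key, call in filter_by_depth(bwa_calls, min_depth).items():
        merged.setdefault(key, call)
    return merged


# Hypothetical toy data; keys are (sample, chromosome, TAIR10 position).
soap = {
    ("X2RV4", 1, 36150): ("A", 12),
    ("X2RV4", 1, 40210): ("G", 5),   # below the depth threshold
}
bwa = {
    ("X2RV4", 1, 36150): ("A", 9),
    ("X2RV4", 1, 40210): ("G", 8),
    ("X2RV4", 1, 51877): ("T", 10),  # site seen only by BWA/SAMtools
}

merged = merge_calls(soap, bwa)
for key in sorted(merged):
    print(key, merged[key])
```

In this toy case the low-depth SOAPsnp call is dropped before merging, so the BWA/SAMtools call at the same site is the one that survives.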

RIL linkage map construction and SNP imputation:  For each population, we combined our new GBS SNP markers with existing INRA markers and imported these data into the R/qtl package (Broman et al. 2003), with SNP order based on physical location.  In each RIL population, we estimated marker map locations (est.map; R/qtl) for each chromosome using the Kosambi mapping function.  We then imputed missing data across markers in each RIL set using R/qtl’s fill.geno function to “fill in” missing genotypes between markers with identical genotypes (ignoring chromosome ends and recombination breakpoint regions).  We then removed any imputed genotypes for which the genotype probability estimated from multipoint marker data (calc.genoprob; R/qtl) was less than 99%.
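The fill-in rule can be illustrated with a toy re-implementation.  This is a sketch of the idea only, not R/qtl's fill.geno code: the function name and the use of None for missing calls are assumptions, and the subsequent 99% genotype-probability filter is not reproduced.

```python
def fill_between_identical(genotypes, missing=None):
    """Fill runs of missing calls flanked on both sides by the same genotype.

    Runs at chromosome ends (only one flank) and runs spanning a recombination
    breakpoint (flanking genotypes differ) are left as missing.
    """
    filled = list(genotypes)
    last_idx = None  # index of the most recent observed genotype
    for i, g in enumerate(genotypes):
        if g == missing:
            continue
        if last_idx is not None and i - last_idx > 1 and genotypes[last_idx] == g:
            for j in range(last_idx + 1, i):
                filled[j] = g
        last_idx = i
    return filled


# One chromosome of one RIL, markers in map order (hypothetical calls).
chrom = [None, "AA", None, None, "AA", None, "BB", "BB", None, "BB", None]
print(fill_between_identical(chrom))
```

Here the gaps inside the AA block and inside the BB block are filled, while the two chromosome ends and the gap between the AA and BB blocks stay missing.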

NAM population SNP imputation and joint-linkage map construction:  Because our GBS markers rarely overlapped across populations, we used the robust linkage maps of each individual population to impute a common set of SNP markers across all 14 RIL sets.  We used the publicly available 250K SNP Arabidopsis dataset for imputation (Horton et al. 2012; Atwell et al. 2010) because it contained 211,786 overlapping SNPs from 13 of the 14 alternate parents.  For the remaining parent, Ita-0, we interrogated publicly available BAM files (Durvasula et al. 2017) at each SNP location in the 250K dataset to determine Ita-0 marker states.

In each RIL population, we interpolated map positions of the 250K SNPs and again used fill.geno to impute the 250K marker states from each parent between GBS markers with identical marker states, e.g., we filled alternate parent SNP states from the 250K dataset into intervals anchored at both ends by alternate GBS marker states.  Given that the intervals in the GBS-derived linkage maps are on average 0.4 cM (see results), this fill-in approach has an average imputation error rate of 0.0016% (i.e., the probability of a double crossover in intervals anchored by like parental marker states) and a maximal error rate of 1.93% (in the largest interval across all populations, 13.9 cM).  All 14 RIL populations were merged together based on imputed, overlapping SNPs, and neighboring markers in perfect linkage with respect to both marker state and missing data were reduced to a single entry.  These SNPs could be used to map trait genetic architecture via GWA-style analyses that control for population structure.  Alternatively, the genetic architecture of complex traits in NAM populations can be resolved via extensions of traditional linkage mapping approaches in concert with a joint-linkage map.
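The two error rates quoted above are reproduced by squaring the interval length expressed in Morgans.  The short check below makes that arithmetic explicit; treating the recombination fraction as equal to the map distance and the two crossovers as independent is an assumption inferred from the quoted numbers, not a derivation stated in the methods.

```python
def double_crossover_rate(interval_cm):
    """Approximate probability of two crossovers within one interval.

    Assumes the recombination fraction equals the map distance in Morgans
    (reasonable for short intervals) and that the crossovers are independent.
    """
    r = interval_cm / 100.0  # centimorgans -> Morgans
    return r * r


for label, cm in [("average interval", 0.4), ("largest interval", 13.9)]:
    print(f"{label}: {cm} cM -> {double_crossover_rate(cm) * 100:.4f}%")
```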

Using the final imputed SNP files of our merged NAM population, we also generated a joint-linkage map for all 14 populations.  Because recombination events can only be detected between polymorphic SNPs, we selected SNPs for which at least 11 of the 14 alternate parents shared a SNP state and differed from the recurrent parent.  SNPs in populations that were not polymorphic for a specific marker were encoded as missing data.  Markers were imported into R/qtl with ordering based on physical location, and genetic map locations were estimated using the Kosambi mapping function (est.map; R/qtl).
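The marker selection rule for the joint-linkage map can be sketched as follows.  The parent labels (P1 to P14), the function names, and the toy base calls are hypothetical, and the handling of parents with missing base calls is not addressed; this is an illustration of the stated rule rather than the code used to build the map.

```python
from collections import Counter

MIN_SHARED = 11  # threshold stated in the methods


def shared_alt_state(col0_base, parent_bases, min_shared=MIN_SHARED):
    """Return the non-Col-0 base carried by at least min_shared parents, else None."""
    counts = Counter(b for b in parent_bases.values() if b != col0_base)
    if not counts:
        return None
    base, n = counts.most_common(1)[0]
    return base if n >= min_shared else None


def encode_population(ril_calls, parent_base, alt_base):
    """Code a population as missing ("-") when its parent lacks the shared ALT state."""
    if parent_base != alt_base:
        return ["-"] * len(ril_calls)
    return list(ril_calls)


# Hypothetical marker: Col-0 carries C; 12 of 14 alternate parents carry T.
col0 = "C"
parents = {f"P{i}": "T" for i in range(1, 13)}
parents.update({"P13": "C", "P14": "C"})

alt = shared_alt_state(col0, parents)
print(alt)
print(encode_population(["AA", "BB", "AA"], parents["P1"], alt))
print(encode_population(["AA", "AA", "AA"], parents["P13"], alt))
```

Populations P13 and P14 are not polymorphic at this marker in the toy data, so their RILs are coded as missing instead of being treated as informative Col-0 genotypes.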

LITERATURE CITED

Atwell, S., Y.S. Huang, B.J. Vilhjalmsson, G. Willems, M. Horton et al., 2010 Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 465 (7298):627-631.

Broman, K.W., H. Wu, Ś. Sen, and G.A. Churchill, 2003 R/qtl: QTL mapping in experimental crosses. Bioinformatics 19:889-890.

Durvasula, A., A. Fulgione, R.M. Gutaker, S.I. Alacakaptan, P.J. Flood et al., 2017 African genomes illuminate the early history and transition to selfing in Arabidopsis thaliana. Proceedings of the National Academy of Sciences of the United States of America 114 (20):5213-5218.

Horton, M.W., A.M. Hancock, Y.S. Huang, C. Toomajian, S. Atwell et al., 2012 Genome-wide patterns of genetic variation in worldwide Arabidopsis thaliana accessions from the RegMap panel. Nature Genetics 44 (2):212-216.

 

Usage notes

USAGE NOTES FOR INRA RIL FILES:

  • INRA_2RV_bla1_col0_imputed.csv  to  INRA_29RV_ita0_col0_imputed.csv

GBS SNPs for each of 14 INRA Arabidopsis thaliana RIL populations with Col-0 as a common recurrent parent.  GBS markers and original INRA markers were merged, ordered by physical position, and imported into R/qtl, where linkage maps were estimated.

HEADER legend:
marker:  marker names composed of chromosome and bp position in TAIR10, e.g., 1_36150 is chr1 at 36150 bp
chr:  chromosome
pos:  centimorgan position
X2RV4..X23RV499--list of INRA RILs.  e.g., X2RV4 is population 2RV, RIL 4
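The naming conventions in this legend are easy to unpack programmatically.  The helpers below are a minimal sketch that assumes every marker name follows the chromosome_position pattern and every RIL column follows the X, population, RIL number pattern shown in the examples; the function names are hypothetical.

```python
import re

# "X", then a population label such as 2RV or 23RV, then the RIL number.
RIL_COLUMN = re.compile(r"^X(\d+RV)(\d+)$")


def parse_marker(name):
    """Split a marker name such as '1_36150' into (chromosome, TAIR10 bp position)."""
    chrom, pos = name.split("_")
    return int(chrom), int(pos)


def parse_ril_column(column):
    """Split a column name such as 'X2RV4' into (population, RIL number)."""
    match = RIL_COLUMN.match(column)
    if match is None:
        raise ValueError(f"not a RIL column: {column!r}")
    return match.group(1), int(match.group(2))


print(parse_marker("1_36150"))
print(parse_ril_column("X2RV4"))
print(parse_ril_column("X23RV499"))
```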


USAGE NOTES FOR INRA NAM FILES:

  • INRA_NAM_imputed_SNPs_chr1.csv to INRA_NAM_imputed_SNPs_chr5.csv

Imputed SNPs in 14 Arabidopsis thaliana RIL populations that share Col-0 as a common recurrent parent.  Neighboring SNP markers in perfect linkage with respect to both marker state and missing data were reduced to a single entry.  However, for subsets of the 14 populations, many of these SNPs could be collapsed again to reduce file size and SNP number.

HEADER legend:
marker:  marker names composed of chromosome and bp position in TAIR10, e.g., 1_36150 is chr1 at 36150 bp
chr:  chromosome
pos:  position in bp
col0_base:  SNP call in Col-0
col0_allele:  defined as REF
ALT_base:  SNP call in alternate parents that differ from Col-0
ALT_allele:  defined as ALT
X2RV4..X23RV499--list of INRA RILs and their allele states (REF vs. ALT for each marker)  e.g., X2RV4 is population 2RV, RIL 4
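When working with a subset of populations, the additional collapsing mentioned above could look like the sketch below.  The inline CSV excerpt follows the column layout of this legend, but its values are made up, and the function name and the choice to keep the first marker of each run are assumptions of this example.

```python
import csv
import io

# Hypothetical excerpt in the layout of the legend above (values are made up).
SAMPLE = """marker,chr,pos,col0_base,col0_allele,ALT_base,ALT_allele,X2RV4,X2RV7,X23RV499
1_100,1,100,A,REF,G,ALT,REF,ALT,REF
1_250,1,250,C,REF,T,ALT,REF,ALT,ALT
1_300,1,300,G,REF,A,ALT,REF,ALT,REF
1_900,1,900,T,REF,C,ALT,ALT,ALT,REF
"""


def collapse_for_subset(rows, ril_columns):
    """Keep the first of each run of neighboring markers with identical states.

    Only the chosen RIL columns are compared, and runs never span chromosomes.
    Missing values are compared as-is, so they must match for a collapse.
    """
    kept, previous = [], None
    for row in rows:
        signature = (row["chr"], tuple(row[c] for c in ril_columns))
        if signature != previous:
            kept.append(row)
            previous = signature
    return kept


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
subset = [c for c in rows[0] if c.startswith("X2RV")]  # RILs of population 2RV only
collapsed = collapse_for_subset(rows, subset)
print([r["marker"] for r in collapsed])
```

In this toy excerpt the first three markers are distinguishable only through the 23RV RIL, so they reduce to one entry once the comparison is restricted to population 2RV.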


USAGE NOTES FOR INRA JOINT-LINKAGE FILES:

  • INRA_NAM_joint_linkage_map.csv

Using INRA_NAM_imputed_SNPs, we identified markers where at least 11 of the 14 alternate parents shared a SNP state and differed from the recurrent parent.  SNPs in populations that were not polymorphic for a specific marker were encoded as missing data.  Markers were imported into R/qtl with ordering based on physical location, and genetic map locations were estimated using the Kosambi mapping function (est.map; R/qtl).

HEADER legend:
marker:  marker names composed of chromosome and bp position in TAIR10, e.g., 1_48181 is chr1 at 48181 bp
chr:  chromosome
cM:  centimorgan position
X2RV4..X23RV499--list of INRA RILs and their genotypes at each marker.  e.g., X2RV4 is population 2RV, RIL 4
  AA is homozygous for the Reference allele (Col-0)
  BB is homozygous for the Alternate allele
  "-" is missing data


USAGE NOTES FOR INRA JOINT-LINKAGE FILES:

  • INRA_NAM_joint_linkage_map_SNP_states.csv

SNP states at markers in the INRA_NAM_joint_linkage_map.csv

HEADER legend:
marker:  marker names composed of chromosome and bp position in TAIR10, e.g., 1_48181 is chr1 at 48181 bp
REF_SNP_state:  base call in RILs carrying the REF allele (i.e., Col-0 allele)
ALT_SNP_state:  base call in RILs carrying the ALT allele
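The two joint-linkage files can be combined to recover base calls for each RIL.  The sketch below assumes the genotype codes AA, BB, and "-" documented for INRA_NAM_joint_linkage_map.csv; the in-memory dictionaries stand in for the parsed files, and their contents and the function name are hypothetical.

```python
def genotype_to_base(code, ref_base, alt_base):
    """Translate a joint-linkage genotype code into the underlying base call."""
    if code == "AA":
        return ref_base
    if code == "BB":
        return alt_base
    if code == "-":
        return None  # missing data
    raise ValueError(f"unexpected genotype code: {code!r}")


# Hypothetical stand-ins for rows parsed from the two files.
snp_states = {"1_48181": ("C", "T")}  # marker -> (REF_SNP_state, ALT_SNP_state)
genotypes = {"1_48181": {"X2RV4": "AA", "X2RV7": "BB", "X23RV499": "-"}}

bases = {
    marker: {
        ril: genotype_to_base(code, *snp_states[marker])
        for ril, code in calls.items()
    }
    for marker, calls in genotypes.items()
}
print(bases)
```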
 

Funding

National Science Foundation, Award: IOS-0923752

National Science Foundation, Award: IOS-1444571
