A common resequencing‐based genetic marker dataset for global maize diversity
Grzybowski, Marcin et al. (2023), A common resequencing‐based genetic marker dataset for global maize diversity, Dryad, Dataset, https://doi.org/10.5061/dryad.bnzs7h4f1
Maize (Zea mays ssp. mays) populations exhibit vast amounts of genetic and phenotypic diversity. As sequencing costs have declined, an increasing number of projects have sought to measure genetic differences between and within maize populations using whole genome resequencing strategies, identifying millions of segregating single-nucleotide polymorphisms (SNPs) and insertions/deletions (InDels). Unlike older genotyping strategies like microarrays and genotyping by sequencing, resequencing should, in principle, frequently identify and score common genetic variants. However, in practice, different projects frequently employ different analytical pipelines, often employ different reference genome assemblies, and consistently filter for minor allele frequency within the study population. This constrains the potential to reuse and remix data on genetic diversity generated from different projects to address new biological questions in new ways. Here we employ resequencing data from 1,276 previously published maize samples and 239 newly resequenced maize samples to generate a single unified marker set of ~366 million segregating variants and ~46 million high confidence variants scored across crop wild relatives, landraces as well as tropical and temperate lines from different breeding eras. We demonstrate that the new variant set provides increased power to identify known causal flowering time genes using previously published trait datasets, as well as the potential to track changes in the frequency of functionally distinct alleles across the global distribution of modern maize.
This maize genomic variant data set contains a subset (~46 million) of the ~366 million single-nucleotide polymorphisms (SNPs) and small insertions and deletions (InDels) in the full variant set. The data set was created by aligning whole-genome resequencing data from 1,515 maize individuals to the maize B73 reference genome v5 (Hufford et al., 2022; 10.1126/science.abg5289). Variants were identified with GATK v4 and saved in Variant Call Format (VCF) v4.2 files. Only variants that passed hard-filtering criteria (following GATK recommendations) were retained. The full variant set can be downloaded from MaizeGDB (www.maizegdb.org).
The data set in this repository contains only the high-confidence genetic variant set. The filtered and imputed variant set was generated by first removing variants with more than two alleles observed in the population, variants with > 50% missing data, variants with extremely low (< 1,515) or extremely high (> 33,550) sequencing depth, and variants with inbreeding coefficients > 0, leaving ~46 million variants. Imputation was then performed with Beagle 5.0 (Browning et al., 2021; 10.1016/j.ajhg.2021.08.005) using default settings.
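As a reading aid, the four removal criteria can be written as a single predicate. This is only a minimal sketch: the function and argument names are hypothetical, the thresholds and comparison directions are copied from the description above, and the description does not say which VCF fields supplied the depth and inbreeding coefficient values, so those are passed in as plain numbers.

```python
def passes_high_confidence_filter(n_alleles, missing_fraction,
                                  total_depth, inbreeding_coeff):
    """Return True if a variant survives the four removal criteria.

    All argument names are illustrative assumptions:
      n_alleles        -- alleles observed in the population (REF counts as one)
      missing_fraction -- share of samples without a genotype call, 0.0 to 1.0
      total_depth      -- sequencing depth summed over all samples
      inbreeding_coeff -- per-variant inbreeding coefficient
    """
    if n_alleles > 2:                 # more than two alleles observed
        return False
    if missing_fraction > 0.50:       # more than 50% missing data
        return False
    if total_depth < 1515 or total_depth > 33550:  # extreme depth
        return False
    if inbreeding_coeff > 0:          # direction as stated in the description
        return False
    return True
```

For example, a biallelic variant with 10% missing data, a total depth of 10,000 and an inbreeding coefficient of -0.2 would be kept under these rules, while the same variant with three observed alleles would be removed.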
Each VCF file contains data for a single chromosome.
In addition, a VCF file for 752 lines from the Wisconsin Diversity Panel (Mazaheri et al., 2019; https://doi.org/10.1186/s12870-019-1653-x) is provided. These variants were selected from the high-confidence genetic variant set by requiring a minor allele frequency (MAF) > 5%.
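To make the MAF > 5% criterion concrete, the sketch below computes the minor allele frequency of one biallelic variant from its VCF genotype strings. The function name is hypothetical, and this is not necessarily how the filter was applied when the file was built; it only illustrates the quantity being thresholded.

```python
def minor_allele_frequency(genotypes):
    """Minor allele frequency for one biallelic variant.

    `genotypes` is a list of VCF GT strings such as '0/0', '0|1' or './.'.
    Missing alleles ('.') are ignored. Returns None if nothing was called.
    """
    ref_count = 0
    alt_count = 0
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') separators the same way.
        for allele in gt.replace("|", "/").split("/"):
            if allele == "0":
                ref_count += 1
            elif allele == "1":
                alt_count += 1
    called = ref_count + alt_count
    if called == 0:
        return None
    return min(ref_count, alt_count) / called


def passes_maf_filter(genotypes, threshold=0.05):
    """True when the variant's MAF is strictly above the threshold."""
    maf = minor_allele_frequency(genotypes)
    return maf is not None and maf > threshold
```

Note that a variant with a MAF of exactly 5% does not pass a strict "> 5%" test.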
All VCF files are compressed with bgzip and split by chromosome. A tabix index is provided for each file.
Files can be processed with software such as bcftools, bedtools, PLINK, TASSEL, or basic Linux programs like awk or sed.
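The files can also be streamed without any of these tools, because bgzip output is a series of standard gzip blocks that Python's built-in gzip module can read. The following is a minimal sketch using only the standard library. The function name is hypothetical, and the code assumes that GT is the first FORMAT key, as the VCF specification requires when GT is present.

```python
import gzip


def iter_vcf_records(path):
    """Yield (chrom, pos, ref, alt, genotypes) for each record in a
    gzip- or bgzip-compressed VCF file.

    `genotypes` maps each sample name to its GT string. Assumes GT is the
    first FORMAT key.
    """
    samples = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue  # meta-information lines
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]  # sample names follow the 9 fixed columns
                continue
            chrom, pos, _id, ref, alt = fields[:5]
            genotypes = {
                sample: call.split(":")[0]
                for sample, call in zip(samples, fields[9:])
            }
            yield chrom, int(pos), ref, alt, genotypes
```

A typical use would be a loop such as `for chrom, pos, ref, alt, gts in iter_vcf_records("chromosome_file.vcf.gz"):`, where the file name is a placeholder and not an actual file name from this repository. This approach reads a file from start to finish. For random access to a genomic region, the provided tabix indexes are the better route.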
Files with variants for 1,515 maize individuals, one file per chromosome.
File with variants for 752 maize inbred lines from the Wisconsin Diversity Panel.
U.S. Department of Energy, Award: DE-SC0020355
National Science Foundation, Award: OIA-182678
National Institute of Food and Agriculture, Award: 2021-67021-35329
Foundation for Food and Agriculture Research, Award: 602757
Narodowym Centrum Nauki, Award: 2012/05/B/NZ9/03407
Narodowym Centrum Nauki, Award: 2017/27/B/NZ9/00995