Dataset for genome-wide association study of maize phosphorus efficiency
Data files
Jan 17, 2025 version files 52.21 MB
- maize_Dent.vcf (12.72 MB)
- maize_Flint.vcf (11.89 MB)
- maize_LR.vcf (27.61 MB)
- README.md (2.14 KB)
Abstract
This dataset comprises genotypic data for 398 maize (Zea mays) genotypes: 100 Dent (maize_Dent), 100 Flint (maize_Flint), and 198 doubled haploid Landraces (maize_LR) sourced from six European lines. The data are provided in Variant Call Format (VCF), with one file per population; the populations were analyzed separately. Each VCF file contains meta-information lines, a header line, and data lines, one per genomic position. Meta-information lines begin with the string ##. The header line names eight fixed, tab-delimited fields per record: chromosome (CHROM), position (POS), identifier (ID), reference base(s) (REF), alternate base(s) (ALT), quality (QUAL), filter status (FILTER), and additional information (INFO). Missing values are denoted by a dot ('.'). A FORMAT field then specifies the data types and their order, followed by one field per sample. Genotypes are encoded as allele values separated by either '/' or '|'. This dataset was used in a genome-wide association study (GWAS) of phosphorus use efficiency in this diversity panel.
README: Genome-wide association study of maize phosphorus efficiency
https://doi.org/10.5061/dryad.9cnp5hqs0
Description of the data and file structure
A VCF (Variant Call Format) file is a text file that stores genetic variation data, such as SNPs or insertions/deletions. It contains meta-information lines, a header line that names the data fields, and data lines, each describing one genomic position with details such as chromosome, position, alleles, quality, and additional information. The dataset provides one VCF file for each of three distinct maize populations: Dent (maize_Dent), Flint (maize_Flint), and doubled haploid Landraces (maize_LR). In each file, the header line names eight fixed fields per record: chromosome (CHROM), position (POS), identifier (ID), reference base(s) (REF), alternate base(s) (ALT), quality (QUAL), filter status (FILTER), and additional information (INFO). These are followed by a FORMAT field and one genotype field per sample. Genotypes are encoded as allele values separated by '/' or '|'.
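The layout described above can be illustrated with a minimal Python sketch that splits VCF text into meta-information lines, sample names, and records. The function name `parse_vcf_lines`, the sample names `line_A` and `line_B`, and the two example records are hypothetical and are not taken from the dataset files; the sketch also assumes every record carries a GT entry in its FORMAT field.

```python
FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_lines(lines):
    """Split VCF text into meta lines, sample names, and per-record dicts."""
    meta, samples, records = [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information lines start with '##'.
            meta.append(line)
        elif line.startswith("#"):
            # Header line: 8 fixed fields, then FORMAT, then one column per sample.
            columns = line.lstrip("#").split("\t")
            samples = columns[9:]
        else:
            fields = line.split("\t")
            record = dict(zip(FIXED_FIELDS, fields[:8]))
            record["POS"] = int(record["POS"])
            # Assumption: GT is listed in the FORMAT field of every record.
            gt_index = fields[8].split(":").index("GT")
            genotypes = {}
            for sample, entry in zip(samples, fields[9:]):
                gt = entry.split(":")[gt_index]
                # Alleles are separated by '/' or '|'; '.' marks a missing value.
                alleles = gt.replace("|", "/").split("/")
                genotypes[sample] = [None if a == "." else int(a) for a in alleles]
            record["genotypes"] = genotypes
            records.append(record)
    return meta, samples, records


# Hypothetical two-sample example, not real data from this dataset.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tline_A\tline_B",
    "1\t12345\tsnp_1\tA\tG\t.\tPASS\t.\tGT\t0/0\t1|1",
    "1\t67890\tsnp_2\tC\tT\t.\tPASS\t.\tGT\t./.\t0/1",
]
meta, samples, records = parse_vcf_lines(example)
print(samples)
print(records[0]["genotypes"])
```

For the full files, a dedicated VCF reader is likely a better choice than this hand-rolled parser, which ignores INFO parsing and other parts of the format.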
The relationships between the data files lie in their shared format and genomic content, with each file representing a specific maize population. Researchers can analyze the data collectively or focus on individual populations based on their research objectives.
Quality control was applied uniformly across all files: markers with more than 5% heterozygotes or more than 50% missing values were removed, and an additional threshold of 20% missing values per genotype was applied in the complete dataset. Subpopulation-specific quality control followed: heterozygous marker calls were set to 'NA', and a minor allele frequency (MAF) filter of 3% was applied. Imputation was carried out separately for each subpopulation using BEAGLE 5.0, and the imputed data were then filtered to retain markers with a MAF greater than 5%.
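As a rough illustration of the per-marker thresholds described above, the following Python sketch computes the missing rate, heterozygote rate, and MAF for a single marker and checks them against the stated cut-offs. It is not the authors' pipeline. The function names `marker_stats` and `passes_marker_qc` are hypothetical, the markers are assumed to be biallelic, the heterozygote rate is assumed to be computed over called genotypes only, and the sketch collapses the dataset-wide and subpopulation-specific steps into a single check for brevity.

```python
def marker_stats(genotypes):
    """Compute QC statistics for one marker from GT strings such as '0/0' or './.'."""
    n_total = len(genotypes)
    n_missing = 0
    n_het = 0
    allele_counts = {}
    for gt in genotypes:
        alleles = gt.replace("|", "/").split("/")
        if "." in alleles:
            n_missing += 1
            continue
        if len(set(alleles)) > 1:
            n_het += 1
        for allele in alleles:
            allele_counts[allele] = allele_counts.get(allele, 0) + 1
    n_called = n_total - n_missing
    total_alleles = sum(allele_counts.values())
    # Assumption: biallelic markers, so the MAF is the rarer allele's frequency.
    if len(allele_counts) > 1:
        maf = min(allele_counts.values()) / total_alleles
    else:
        maf = 0.0
    return {
        "missing_rate": n_missing / n_total if n_total else 0.0,
        # Assumption: heterozygote rate is taken over called genotypes only.
        "het_rate": n_het / n_called if n_called else 0.0,
        "maf": maf,
    }


def passes_marker_qc(stats, max_het=0.05, max_missing=0.50, min_maf=0.03):
    """Apply the thresholds from the text; boundary handling is an assumption."""
    return (
        stats["het_rate"] <= max_het
        and stats["missing_rate"] <= max_missing
        and stats["maf"] >= min_maf
    )


# Hypothetical marker: 8 reference homozygotes, 1 alternate homozygote, 1 missing.
stats = marker_stats(["0/0"] * 8 + ["1/1"] + ["./."])
print(stats, passes_marker_qc(stats))
```

Passing `min_maf=0.05` would approximate the post-imputation filter, although the text specifies "greater than 5%", so a strict inequality may be more faithful there.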
Code/Software
All of these files can be opened in TASSEL 5 as VCF files. Because they are plain text, they can also be opened with a standard text editor.
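Because the files are plain text, they can also be read line by line in a script. The sketch below streams a file and counts data lines per chromosome without loading the whole file into memory. The function name `count_records_per_chromosome` is hypothetical, and the example assumes that maize_Dent.vcf sits in the working directory and is UTF-8 compatible; it does nothing if the file is absent.

```python
import os
from collections import Counter


def count_records_per_chromosome(path):
    """Count VCF data lines per CHROM value, skipping meta and header lines."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Lines starting with '#' are meta-information or the header.
            if line.startswith("#") or not line.strip():
                continue
            chrom = line.split("\t", 1)[0]
            counts[chrom] += 1
    return counts


# Assumption: the file has been downloaded to the current working directory.
vcf_path = "maize_Dent.vcf"
if os.path.exists(vcf_path):
    for chrom, n_records in sorted(count_records_per_chromosome(vcf_path).items()):
        print(chrom, n_records)
```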
Methods
All genotypes in the study had marker data generated with the Illumina® MaizeSNP50 BeadChip. Quality control was applied to ensure data integrity. Markers with more than 5% heterozygotes or more than 50% missing values were excluded, and a further threshold of 20% missing values per genotype was applied across the entire dataset. Quality control was then applied separately to each subpopulation: heterozygous marker calls were set to NA, and a minor allele frequency (MAF) filter of 3% was imposed. Imputation was conducted separately for each subpopulation using BEAGLE 5.0. The imputed data were filtered again to retain markers with a MAF greater than 5%. The physical positions of the single nucleotide polymorphisms (SNPs) refer to version 4 of the B73 reference genome.