SNP data set of Peruvian highland maize races
Data files
May 28, 2024 version files 2.25 MB
- Peruvian_races__of_maize_Arbizu_et_al.gz
- README.md
Abstract
Peruvian maize exhibits significant morphological diversity, with landraces cultivated from sea level up to 3,500 meters above sea level. Previous research based on morphological descriptors identified at least 52 Peruvian maize races, but their genetic diversity and population structure remain largely unknown. In this study, we used genotyping-by-sequencing (GBS) to infer the genetic structure and diversity of 423 maize accessions from the Genebank of La Molina National Agrarian University (UNALM). These accessions represent nine races and one sub-race, along with 15 open-pollinated lines (purple corn) and two yellow maize hybrids. We obtained 14,235 high-quality SNPs distributed along the 10 maize chromosomes. Gene diversity ranged from 0.33 (Pachia) to 0.362 (Ancashino), with Cusco showing the lowest inbreeding coefficient (0.205) and Ancashino the highest (0.274) among the landraces. Population divergence (FST) was very low (mean = 0.017), indicating extensive interbreeding among Peruvian maize varieties. Population structure analysis revealed that these 423 distinct genotypes could be grouped into 10 clusters, with some maize races clustering together. Peruvian maize races did not form monophyletic groups; instead, our phylogenetic tree identified two clades corresponding to the chronological classification of Peruvian maize races: Anciently Derived or Primary Races (ADPR) and Lately Derived or Secondary Races (LDSR). These clades also align with the geographic origins of the maize races, reflecting their mixed evolutionary backgrounds. Further investigation of Peruvian maize germplasm using modern technologies is essential to enhance their use in breeding programs, particularly in the Andean region of Peru.
README: SNP data set of Peruvian highland maize races
We examined (i) 406 accessions of nine races and one sub-race of Peruvian maize that are currently cultivated in 10 Andean geographic departments of Peru, (ii) 15 open-pollinated (OP) purple maize lines and (iii) two yellow maize hybrids (423 individuals in total).
Genotyping-by-sequencing (GBS) libraries were created following the protocol of Elshire et al. (2011). Genomic DNA was digested with the ApeKI enzyme, and the fragments were ligated to Illumina sequencing adapters and to sample-specific barcodes, which allow the sample identity of each sequenced DNA fragment to be recovered after multiplexing. The pooled samples were sequenced on the Illumina NovaSeq 6000 platform, yielding 100 bp single-end reads.
The quality of the raw data was examined with FastQC v0.11.7 (Andrews, 2010). We then used the TASSEL v5.2.42 bioinformatic pipeline (Bradbury et al., 2007; Glaubitz et al., 2014) for SNP calling, with maize Zm-B73-REFERENCE-NAM-5.0 (Hufford et al., 2021) as the reference genome. Pipeline parameters were the same as in Huaringa-Joaquin et al. (2023). Data curation was performed with VCFtools v0.1.16 (Danecek et al., 2011) using the following retention criteria: (i) a minimum minor allele frequency of 0.1, (ii) no more than two alleles per site, and (iii) a maximum proportion of missing data of 0.1. SNPs in linkage disequilibrium (LD) were then removed at a threshold of r² = 0.2 with the snpgdsLDpruning function of the SNPRelate package (Zheng et al., 2012) in R v4.2.2 (R Core Team, 2022). Finally, TASSEL was used to convert the .vcf file to PHYLIP format with the argument -exportType Phylip_Inter.
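
The three site-level retention criteria can be illustrated with a small Python sketch. It is not the authors' pipeline (they used VCFtools, and the exact command line is not given here); it only shows what each threshold means on toy data. The function names, the genotype coding (alternate-allele counts 0/1/2, with `None` for missing) and the example SNPs are all assumptions made for illustration. In VCFtools, thresholds like these are commonly expressed with `--maf 0.1`, `--max-alleles 2` and `--max-missing 0.9`, where `--max-missing` is the minimum proportion of non-missing genotypes. The LD pruning step is not covered.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical coding: alternate-allele count per diploid sample, None = missing call.
Genotype = Optional[int]


def site_stats(genotypes: Sequence[Genotype]) -> Tuple[float, float]:
    """Return (missing_rate, minor_allele_frequency) for one SNP."""
    called = [g for g in genotypes if g is not None]
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if not called:
        return missing_rate, 0.0
    alt_freq = sum(called) / (2 * len(called))  # two alleles per diploid sample
    return missing_rate, min(alt_freq, 1.0 - alt_freq)


def keep_site(
    genotypes: Sequence[Genotype],
    n_alleles: int,
    min_maf: float = 0.1,      # criterion (i)
    max_alleles: int = 2,      # criterion (ii)
    max_missing: float = 0.1,  # criterion (iii)
) -> bool:
    """Apply the three retention criteria described in the text to one SNP."""
    missing_rate, maf = site_stats(genotypes)
    return n_alleles <= max_alleles and maf >= min_maf and missing_rate <= max_missing


# Toy data (invented): SNP name -> (genotypes for 10 samples, number of alleles observed)
sites: Dict[str, Tuple[List[Genotype], int]] = {
    "snp_a": ([0, 1, 2, 1, 0, 1, 2, 0, 1, 1], 2),        # passes all criteria
    "snp_b": ([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], 2),        # MAF = 0.05, too rare
    "snp_c": ([0, 1, None, None, 2, 1, 0, 1, 2, 1], 2),  # 20% missing
    "snp_d": ([0, 1, 2, 1, 0, 1, 2, 0, 1, 1], 3),        # more than two alleles
}

kept = [name for name, (gts, n_alleles) in sites.items() if keep_site(gts, n_alleles)]
print(kept)
```

Only `snp_a` survives in this toy example: the other three sites each fail exactly one criterion.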
References
Elshire, R. J. et al. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6, (2011).
Andrews, S. FastQC: A Quality Control Tool for High Throughput Sequence Data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (2010).
Bradbury, P. J. et al. TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23, 2633–2635 (2007).
Glaubitz, J. C. et al. TASSEL-GBS: A high capacity genotyping by sequencing analysis pipeline. PLoS One 9, (2014).
Hufford, M. B. et al. De novo assembly, annotation, and comparative analysis of 26 diverse maize genomes. Science 373, 655–662 (2021).
Huaringa-Joaquin, A. et al. Assessment of the genetic diversity and population structure of the Peruvian Andean legume, Tarwi (Lupinus mutabilis), with high quality SNPs. Diversity 15, (2023).
Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).
Zheng, X. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012).
R Core Team. R: A language and environment for statistical computing. https://www.R-project.org/ (2022).