Single nucleotide polymorphism (SNP) identification, genetic diversity, and population structure of ryegrass from the northeastern highlands of Peru
Data files
Jul 24, 2025 version, files: 5.21 MB
- Identification_of_experimental_lines.csv (2.55 KB)
- Passport_information_of_64_Lolium_multiflorum_genotypes.csv (6.68 KB)
- README.md (3.12 KB)
- SNP_matrix_coded_ryegrass.csv (5.20 MB)
Abstract
Ryegrass (Lolium multiflorum) is a combined forage used in livestock support systems in temperate climates. Its favorable characteristics include high nutritional value, rapid regrowth, and remarkable environmental adaptability. Understanding its genetic diversity is essential for guiding conservation strategies and creating improved cultivars. This study aimed to identify single nucleotide polymorphisms (SNPs) and evaluate the genetic diversity and population structure of 64 ryegrass lines from northeastern Peru. DNA was extracted and genotyped using the genotyping-by-sequencing (GBS) technique, followed by bioinformatic analysis using the Stacks pipeline and population genetics tools. A total of 12,199 high-quality SNPs were identified, evenly distributed across seven chromosomes, with an average density of 5.4 SNPs/Mb. A proportion of these markers were located in coding regions, where they had a significant impact on functionally relevant genes, thus improving their usefulness in marker-assisted selection. The average expected heterozygosity (0.2129) slightly exceeded the observed value (0.2009), suggesting a relatively balanced genetic structure among the lines analyzed. However, a significant reduction in heterozygosity was observed in the Cajamarca population, possibly linked to artificial selection and geographical isolation. The analysis of genetic structure indicated varying levels of admixture and differentiation among the lines, and the “Kumimarca” variety exhibited clear genetic separation. These findings demonstrate the effectiveness of the GBS approach in characterizing genetic diversity and providing a robust set of SNPs with the potential to support genetic improvement programs, conserve forage resources, and develop cultivars adapted to the agroecological conditions of the highlands.
This dataset contains a high-quality SNP matrix (n = 12,199) obtained through genotyping-by-sequencing (GBS) of 64 ryegrass (Lolium multiflorum) lines collected from the northeastern highlands of Peru. The data support the genetic diversity and population structure analyses described in the study “Single Nucleotide Polymorphism (SNP) Identification, Genetic Diversity, and Population Structure of Ryegrass from the Northeastern Highlands of Peru”.
The SNP matrix was generated by aligning raw sequencing reads to the 'Rabiosa' reference genome (L. multiflorum), followed by bioinformatic filtering to ensure data quality and representativeness and to remove markers in linkage disequilibrium.
Data Structure and File Description
The file SNP_matrix_coded_ryegrass.csv contains the SNP matrix in wide format, with:
- Rows: Individual SNPs (12,199 in total).
- Columns: Genotypes of the 64 ryegrass lines evaluated.
- Encoding: Genotypes are expressed in standard VCF format (e.g., 0/0, 0/1, 1/1, ./.).
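For downstream numeric analysis, the VCF-style genotype codes are often converted to alternate-allele counts. The following is a minimal sketch of that conversion, assuming biallelic, diploid calls as in the examples above; the function name `gt_to_dosage` is hypothetical and is not part of the dataset or of any package used in the study.

```python
from typing import Optional


def gt_to_dosage(gt: str) -> Optional[int]:
    """Return the number of alternate alleles (0, 1, or 2), or None if missing.

    Assumes a diploid call written as 'a/b' (or phased 'a|b').
    """
    alleles = gt.strip().replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None  # missing or malformed call, e.g. './.'
    # Any allele other than the reference ('0') counts as alternate.
    return sum(1 for allele in alleles if allele != "0")


row = ["0/0", "0/1", "1/1", "./."]
dosages = [gt_to_dosage(g) for g in row]
print(dosages)
```

Missing calls are kept as `None` rather than imputed, so that later steps can decide how to handle them.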
Samples include native ecotypes collected by INIA in Cajamarca and commercial materials evaluated at the experimental agrostological garden in Chachapoyas.
Each SNP marker was filtered using the following criteria:
- Minor allele frequency (MAF) ≥ 0.05
- Maximum missing data per marker ≤ 10%
- Markers in linkage disequilibrium removed (pruned using PLINK)
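The first two criteria can be checked per marker directly from the genotype codes. The sketch below is illustrative only: it assumes biallelic diploid calls, the function names and the two example markers are made up, and it does not reproduce the VCFtools or PLINK implementations (LD pruning is not covered).

```python
def site_stats(genotypes):
    """Return (maf, missing_rate) for one biallelic SNP given VCF-style calls."""
    called = [g for g in genotypes if "." not in g]
    missing_rate = (len(genotypes) - len(called)) / len(genotypes)
    if not called:
        return 0.0, missing_rate
    # Count alternate ('1') alleles over all called diploid genotypes.
    alt = sum(g.replace("|", "/").split("/").count("1") for g in called)
    alt_freq = alt / (2 * len(called))
    return min(alt_freq, 1 - alt_freq), missing_rate


def passes(genotypes, min_maf=0.05, max_missing=0.10):
    """Apply the MAF and missing-data thresholds listed above."""
    maf, missing_rate = site_stats(genotypes)
    return maf >= min_maf and missing_rate <= max_missing


# Hypothetical markers scored on 20 samples each.
snp_a = ["0/0"] * 16 + ["0/1"] * 3 + ["./."]      # 5% missing, MAF = 3/38
snp_b = ["0/0"] * 15 + ["0/1"] * 2 + ["./."] * 3  # 15% missing

print(passes(snp_a), passes(snp_b))
```

Here the MAF is computed from called genotypes only, which is one reasonable convention; other tools may differ in how they treat missing calls.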
Identification_of_experimental_lines.csv
This file contains identifiers and naming codes used for the 64 genotypes. It links sample codes (e.g., L1-A, L1-B) with their associated ecotype or commercial line names.
Passport_information_of_64_Lolium_multiflorum_genotypes.csv
Provides passport metadata for the 64 genotypes, including:
- Collection site (district, province, department)
- Geographic coordinates (in degrees, minutes, seconds)
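Mapping and spatial tools usually expect decimal degrees rather than degrees, minutes, and seconds. The sketch below shows one way to convert, assuming the coordinates are written with a trailing hemisphere letter (for example 6°30'00"S); the exact string layout in the passport file is not specified here, so the pattern is an assumption, and the example coordinates are invented rather than taken from the dataset.

```python
import re

# Degrees, minutes, seconds separated by any non-digit characters,
# followed by a hemisphere letter (assumed layout).
_DMS = re.compile(r"(\d+)\D+(\d+)\D+(\d+(?:\.\d+)?)\D*?([NSEW])\s*$", re.IGNORECASE)


def dms_to_decimal(dms: str) -> float:
    """Convert a DMS string such as 6°30'00"S to signed decimal degrees."""
    match = _DMS.search(dms.strip())
    if match is None:
        raise ValueError(f"unrecognized DMS coordinate: {dms!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    # South and west are negative by convention.
    return -value if hemisphere.upper() in "SW" else value


print(dms_to_decimal("6°30'00\"S"))
print(dms_to_decimal("78° 15' 36\" W"))
```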
Data Sharing/Access
This dataset is associated with the manuscript:
- Bobadilla et al. “Single Nucleotide Polymorphism (SNP) Identification, Genetic Diversity, and Population Structure of Ryegrass from the Northeastern Highlands of Peru”, currently being prepared for submission to The Plant Genome journal.
Links to other publicly accessible data sources:
- https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_030979885.1/ – Reference genome of L. multiflorum (Rabiosa)
Code/Software
Bioinformatic processing and filtering were performed using the following tools:
- BWA v0.7.18: sequence alignment
- SAMtools v1.13: BAM file management
- Stacks v2.68: SNP calling
- VCFtools v0.1.16: filtering by MAF and missing data
- PLINK v1.90: LD pruning (indep 50 5 2)
- R v4.1.3: genetic diversity and population structure analysis (packages: vcfR, adegenet, ggplot2, ggtree, etc.)
To reproduce the analysis, it is recommended to use the specified versions or later.
1.1. Plant material
During 2024, ryegrass samples were collected from the genebank of the National Institute of Agrarian Innovation (INIA) in Peru. The collection involved randomly sampling various ryegrass populations to ensure a genetically representative background for each population. The samples correspond to material originally collected in 2002 in the Cajamarca Region, from the following locations: Cutervo (5), Tacabamba (7), Sendamal (5), Micuypampa (4), Paccha (4), El Agrario (4), Calquis (4), Bambamarca (4), Campiña (3), Santa Cruz (2), Cochan (4), and Baños del Inca (4).
Additionally, 14 commercial ryegrass genotypes were collected from materials currently established at the Chachapoyas Agrarian Experimental Station, within the agrostological garden in the Amazonas region. These include the following cultivars: Wanca Grass (3), Bison II (3), Inglés (2), Max (4), and the AGP Ecotype (2) (Table 1, uploaded as Identification_of_experimental_lines.csv).
1.2. DNA extraction and sequencing
Ten leaves were collected from each ryegrass line and stored in Ziploc bags with silica gel to preserve the plant material prior to extraction (Wilkie et al., 2013). Leaf tissue from each line was then cut until a total weight of 650 mg per line was obtained. The samples were placed in Eppendorf tubes and disrupted using a tissue homogenizer. Grinding was performed with liquid nitrogen to ensure efficient homogenization.
DNA extraction was carried out using the NucleoSpin® Plant II kit from MACHEREY-NAGEL, designed for genomic DNA purification from plant samples, following the manufacturer’s instructions for grasses. Upon completion of the extraction process, the quality of the extracted DNA was assessed using the Quantus™ Fluorometer (Promega, Madison, USA) along with the Promega quantification kit.
DNA integrity was verified through 1% agarose gel electrophoresis using a 1X Tris-Acetate-EDTA (TAE) buffer. For each sample, 2 µL of extracted DNA and 3 µL of loading buffer with GelRed were loaded. Electrophoresis was conducted at 80 V for 30 minutes. The DNA samples were then stored at -20°C for preservation. The material was later sent to the NOVAGENE laboratory in the United States for sequencing.
Genotyping-by-sequencing (GBS) libraries were prepared following the protocol of Elshire et al. (2011). In summary, genomic DNA was digested with the ApeKI enzyme, and the fragments were ligated to Illumina sequencing adapters and to barcodes unique to each sample, which allowed the sample identity of each sequenced DNA fragment to be recovered after multiplexing. Pooled samples were sequenced on the Illumina NovaSeq 6000 platform, obtaining 100-bp paired-end reads. Raw data quality was examined using FastQC v0.11.7.
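The role of the sample barcodes can be illustrated with a toy demultiplexing step. This is a simplified sketch, not the procedure used by the sequencing provider or by Stacks: it assumes inline barcodes at the start of each read with no mismatches allowed, and the barcode sequences, their assignment to the sample codes, and the reads are all hypothetical.

```python
def demultiplex(reads, barcode_to_sample):
    """Group reads by their leading barcode and strip the barcode.

    Returns (reads_by_sample, unassigned_reads).
    """
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    # Try longer barcodes first so a short barcode cannot shadow a longer one.
    barcodes = sorted(barcode_to_sample, key=len, reverse=True)
    for read in reads:
        for barcode in barcodes:
            if read.startswith(barcode):
                by_sample[barcode_to_sample[barcode]].append(read[len(barcode):])
                break
        else:
            unassigned.append(read)  # no barcode matched
    return by_sample, unassigned


# Hypothetical barcodes assigned to two sample codes.
barcodes = {"ACGT": "L1-A", "TTGCA": "L1-B"}
reads = ["ACGTCAGCAAAT", "TTGCACTGCGGG", "GGGGCAGCTTTT"]

by_sample, unassigned = demultiplex(reads, barcodes)
print(by_sample)
print(unassigned)
```

Production tools additionally tolerate sequencing errors in the barcode and check for the remnant of the restriction site, which this sketch omits.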
Clean reads were aligned to the L. multiflorum reference genome, cultivar Rabiosa (GenBank accession GCA_030979885.1; https://www.ncbi.nlm.nih.gov/datasets/genome/GCA_030979885.1/), using BWA v0.7.18 (Li & Durbin, 2009). Outputs were processed using SAMtools v1.13 (Danecek et al., 2021; Li et al., 2009) to obtain ordered BAM files. Subsequently, the bioinformatics pipeline Stacks v2.68 (Catchen et al., 2013) was used for SNP calling, using the BAM alignment files as input.
Filtering of the initial VCF (Variant Call Format) output from the Stacks workflow was performed using VCFtools v0.1.16 (Danecek et al., 2011) with the following retention criteria: (i) a minimum minor allele frequency of 0.05, and (ii) a maximum missing-data rate of 0.1 (SNPs with missing genotypes in more than 10% of individuals were removed). PLINK (Chang et al., 2015) was then used to remove SNPs in linkage disequilibrium (LD) using the indep option with parameters 50 5 2.
1.3. Genetic diversity
1.3.1. Calculation of expected and observed heterozygosity
Expected heterozygosity (He) and observed heterozygosity (Ho) were calculated for each locus across the seven chromosomes of L. multiflorum using the population statistics summary file generated by the populations program of the Stacks pipeline. Expected heterozygosity was determined using a formula based on allele frequencies (p_i), as described by Schmidt et al. (2021), and average values were calculated for each chromosome. Observed heterozygosity was calculated as the proportion of heterozygous individuals at each locus. Both He and Ho values were computed for all L. multiflorum populations and specifically for the Cajamarca population.
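As an illustration of the two quantities, the sketch below computes Ho and He for a single biallelic locus. It assumes the common form He = 1 − Σ p_i², which may differ from the exact expression in Schmidt et al. (2021) or from the values reported by Stacks (for example, through a sample-size correction); the function name and the example genotypes are hypothetical.

```python
def heterozygosity(genotypes):
    """Return (Ho, He) for one biallelic locus from VCF-style diploid calls.

    He uses the assumed form 1 - sum(p_i^2); missing calls are ignored.
    """
    called = [g.replace("|", "/").split("/") for g in genotypes if "." not in g]
    n = len(called)
    if n == 0:
        return None, None
    # Ho: proportion of individuals carrying two different alleles.
    ho = sum(1 for a, b in called if a != b) / n
    # p: frequency of the reference allele; 1 - p: alternate allele.
    p = sum(pair.count("0") for pair in called) / (2 * n)
    he = 1 - (p ** 2 + (1 - p) ** 2)
    return ho, he


locus = ["0/0", "0/0", "0/1", "0/1", "1/1", "./."]
ho, he = heterozygosity(locus)
print(round(ho, 4), round(he, 4))
```

In this example the five called individuals give a reference allele frequency of 0.6, so He exceeds Ho, the same direction of difference reported for the dataset as a whole.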
1.4. Analysis of population structure
Population structure was determined using the Bayesian model-based parametric clustering method implemented in STRUCTURE v.2.3 (Pritchard & Wen, 2004, http://pritch.bsd.uchicago.edu/structure.html), which assigns individuals to K clusters based on a membership coefficient (qi). For each K (ranging from 2 to 15), ten independent runs of the admixture model (INFERALPHA = 1) were performed with correlated allele frequencies for SNP markers (FREQSCORR = 1), a burn-in period of 100,000 iterations followed by 100,000 Markov chain Monte Carlo (MCMC) iterations, and RANDOMIZE = 1.
The optimal K value was determined using the ad-hoc ΔK statistic (Evanno et al., 2005) estimated with Structure Harvester software (Earl & vonHoldt, 2012). The graphical representation of the genetic clustering analysis was generated with the Clumpak program (Kopelman et al., 2015). Principal component analysis (PCA) was performed in R version 4.1.3. The VCF file used as input was processed using the vcfR package (Knaus & Grünwald, 2017). Eigenvectors and eigenvalues were calculated with the adegenet package (Jombart, 2008). PCA results were visualized using ggplot2 (Wickham, 2016), RColorBrewer (Neuwirth, 2022), and ggrepel (Slowikowski, 2024), while population data were incorporated using the tibble package (Müller & Wickham, 2023). Finally, a UPGMA tree based on bitwise distances between samples was constructed using the poppr package (Kamvar et al., 2014) with 100 bootstrap replicates. The tree was visualized with the ggtree package (Yu et al., 2017).
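The ΔK statistic can be sketched as follows. This is an illustrative reading of Evanno et al. (2005), not the Structure Harvester code: it assumes ΔK(K) = |mean L(K+1) − 2·mean L(K) + mean L(K−1)| / sd L(K), where L(K) is the log-likelihood reported by each replicate run. The function name and the log-likelihood values are invented for the example.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Compute Delta K for every K that has both neighbours K-1 and K+1.

    lnp_by_k maps K to the list of L(K) values from replicate runs.
    Raises ZeroDivisionError if all replicates of some K are identical.
    """
    means = {k: mean(values) for k, values in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            result[k] = second_diff / stdev(lnp_by_k[k])
    return result


# Made-up log-likelihoods, three replicate runs per K.
runs = {
    2: [-1000, -1002, -998],
    3: [-900, -901, -899],
    4: [-890, -892, -888],
    5: [-885, -880, -890],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

ΔK is undefined for the smallest and largest K tested, so a search over K = 2 to 15 yields values only for K = 3 to 14.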
References
Catchen, J., Hohenlohe, P. A., Bassham, S., Amores, A., & Cresko, W. A. (2013). Stacks: An analysis tool set for population genomics. Molecular Ecology, 22(11), 3124–3140.
Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: Rising to the challenge of larger and richer datasets. GigaScience, 4, 7.
Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., ... & Durbin, R. (2011). The variant call format and VCFtools. Bioinformatics, 27(15), 2156–2158.
Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., ... & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience, 10, giab008.
Earl, D. A., & vonHoldt, B. M. (2012). STRUCTURE HARVESTER: A website and program for visualizing STRUCTURE output and implementing the Evanno method. Conservation Genetics Resources, 4(2), 359–361.
Elshire, R. J., Glaubitz, J. C., Sun, Q., Poland, J. A., Kawamoto, K., Buckler, E. S., & Mitchell, S. E. (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE, 6(5), e19379. https://doi.org/10.1371/journal.pone.0019379
Evanno, G., Regnaut, S., & Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: A simulation study. Molecular Ecology, 14(8), 2611–2620.
Jombart, T. (2008). adegenet: An R package for the multivariate analysis of genetic markers. Bioinformatics, 24(11), 1403–1405.
Kamvar, Z. N., Tabima, J. F., & Grünwald, N. J. (2014). Poppr: An R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ, 2, e281.
Knaus, B. J., & Grünwald, N. J. (2017). Vcfr: A package to manipulate and visualize variant call format data in R. Molecular Ecology Resources, 17(1), 44–53.
Kopelman, N. M., Mayzel, J., Jakobsson, M., Rosenberg, N. A., & Mayrose, I. (2015). Clumpak: A program for identifying clustering modes and packaging population structure inferences across K. Molecular Ecology Resources, 15(5), 1179–1191.
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), 1754–1760.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., ... & Durbin, R. (2009). The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), 2078–2079.
Müller, K., & Wickham, H. (2023). tibble: Simple data frames (Version 3.2.1) [R package].
Neuwirth, E. (2022). RColorBrewer: ColorBrewer palettes (Version 1.1-3) [R package].
Pritchard, J. K., & Wen, W. (2004). Documentation for STRUCTURE software: Version 2. Chicago. http://www.pritch.bsd.uchicago.edu/software/structure2_1.html
Schmidt, T. L., Swan, T., Chung, J., Karl, S., Demok, S., Yang, Q., Field, M. A., Muzari, M. O., Ehlers, G., Brugh, M., Bellwood, R., Horne, P., Burkot, T. R., Ritchie, S., & Hoffmann, A. A. (2021). Spatial population genomics of a recent mosquito invasion. Molecular Ecology, 30(5), 1174–1189. https://doi.org/10.1111/mec.15792
Slowikowski, K. (2024). ggrepel: Automatically position non-overlapping text labels with 'ggplot2' (Version 0.9.5) [R package].
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag.
Wilkie, P., Dalberg, A., Harris, D., & Forrest, L. L. (2013). The collection and storage of plant material for DNA extraction: The Teabag Method. Gardens’ Bulletin Singapore, 65, 231–234.
Yu, G., Smith, D. K., Zhu, H., Guan, Y., & Lam, T. T.-Y. (2017). ggtree: An R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution, 8(1), 28–36. https://doi.org/10.1111/2041-210X.12628
Supplementary material
Supplemental Table S1 (Passport_information_of_64_Lolium_multiflorum_genotypes.csv) provides comprehensive passport data for the 64 genotypes of Lolium multiflorum used in the present study. These entries include experimental and commercial lines collected across diverse agroecological zones in the Cajamarca and Amazonas regions of Peru. Each entry is characterized by its sequence code, genotype name, collection site, locality, province, department, and precise geographical coordinates (latitude and longitude in degrees, minutes, and seconds).