Dryad

Genomic characterization and gene bank curation of Aegilops using genotyping-by-sequencing

Cite this dataset

Adhikari, Laxman; Poland, Jesse; Raupp, John; Wu, Shuangye (2024). Genomic characterization and gene bank curation of Aegilops using genotyping-by-sequencing [Dataset]. Dryad. https://doi.org/10.5061/dryad.mgqnk994n

Abstract

In this study, genotyping-by-sequencing (GBS) was performed on 1,041 Aegilops accessions representing 23 species. These accessions are maintained by the Wheat Genetics Resource Center (WGRC) at Kansas State University. The GBS FASTQ files have been uploaded to the NCBI SRA public repository under BioProject accession number PRJNA985892. Other files related to the data analysis, such as the barcode key file, the SNP matrices, and the taxonomic information for the accessions, are provided in this Dryad repository. The aim of the study was to explore the genetic and genomic characteristics of Aegilops, the wild relatives of wheat, using a large number of SNP markers. We also curated the WGRC gene bank Aegilops collection by identifying misclassified accessions and genetically identical (redundant) accessions. Further, we explored the genomic relationships between wheat and the different Aegilops species.

README

Genomic Characterization and Gene Bank Curation of Aegilops: The Wild Relatives of Wheat

Laxman Adhikari*, John Raupp, Shuangye Wu, Dal-Hoe Koo, Bernd Friebe, and Jesse Poland*
Emails: laxman.adhikari@kaust.edu.sa and jesse.poland@kaust.edu.sa

Data Set
In this study, Aegilops accessions preserved and maintained by single seed descent (SSD) at the Wheat Genetics Resource Center at Kansas State University were subjected to genotyping-by-sequencing. DNA was extracted from seedlings of individual plants, and GBS libraries were prepared as described in the manuscript. Where genome assemblies were available, variants were detected from the GBS data using a reference-based pipeline such as TASSEL GBS. For accessions without an available genome assembly, the sequence FASTQ files were aligned to a mock reference generated from the raw GBS data. The accessions with the most sequence data were chosen to generate the mock reference. Variants were called using both the GBS-SNP-CROP and bcftools pipelines. Subsequently, the SNP matrices were filtered and used for various analyses.

Files and Descriptions

  1. Readme.md
    This file contains basic information about the dataset, software, and the other files uploaded to Dryad.

  2. barcode.file.all.accessions.AegilopsProject.txt
    This file contains the flowcell, lane, sample ID, sequence file names in the NCBI SRA, and the NCBI BioProject ID. This information is sufficient for demultiplexing the GBS data using TASSEL GBS [https://www.maizegenetics.net/tassel] and Sabre [https://github.com/najoshi/sabre].

  3. sample.info.after.adjusted.missclassified.txt
    The file contains species-level taxonomy information for all accessions after curating the misclassified accessions. The final genetic cluster and PCA plots were colored based on this information. Each species was represented by a single number, as demonstrated in the "All.aegilops.with.mock_Ref.phylogenetic.PCA.Rmd" file. Additionally, there are several "line.info" or "sample.info" files corresponding to specific analyses and the inclusion of species in those particular analyses.

  4. Ae.all.species.denovo.mockref.SNP.matrix.final.hmp.txt.zip
    This file contains the unfiltered matrix of 54,667 SNPs generated for all Aegilops accessions. We filtered the genotype data by minor allele frequency (MAF), missing data, and heterozygosity. After applying the filters (MAF > 0.01, missing data < 30%, heterozygosity < 10%), we retained 46,879 SNPs for 1,041 individuals.

  5. filtered.numeric.coded.SNP.matrix.hmp.txt.zip
    The SNP matrix for all accessions, with genotypes coded numerically as [1, 0, -1]. After filtering, the matrix consisted of 46,879 SNPs, which were used for phylogenetic clustering and PCA.

  6. U-genome.clade.SNP.matrix.txt.zip
    The SNP matrix file was used to generate the phylogenetic tree of the U-genome clade, as depicted in the manuscript.

  7. S-genome.and.mutica_SNP.matrix_speltoides-ref.txt.zip
    The SNP matrix was used to cluster the Sitopsis section Aegilops species and Ae. mutica. The provided line info file, [S.genome.and.mutica.lineinfo.txt], is essential for coloring the NJ phylogenetic tree generated using this matrix.

  8. crassa.juvenalis.vavilovii.SNP.matrix.txt
    The SNP matrix was used to cluster the hexaploid Aegilops species.

  9. Neglecta.columnaris.SNP.matrix.txt.zip
    The SNP matrix was used to cluster and distinguish between Ae. neglecta and Ae. columnaris.

  10. F_MAF0.01_Miss50_Het20-all.diploid.wheat.geno-tags.fastq-all.wheat.D.genome.ref.hmp.txt.zip
    The SNP matrix includes all diploid Aegilops, including the previously curated Ae. tauschii collection and the CIMMYT wheat lines. Information regarding the GBS data for these Ae. tauschii and CIMMYT lines can be found in Supplementary Table 1.

  11. gene.bank.curation.tauschii_F_MAF0.05_Miss50_Het10-tauschii.genome-tags.fastq-all.vcf.hmp.txt.zip
    The SNP matrix was utilized to curate the Ae. tauschii collection, specifically in identifying and removing duplicates that were not curated in the previous study.

  12. gene.bank.curation.data-sharonensis.longissima.0.01maf.0.5miss.0.1het.19Ksnps.hmp.txt.zip
    The SNP matrix was used to curate the Ae. sharonensis and Ae. longissima collections, with a specific focus on identifying duplicates.

  13. gene.bank.curation.data-searsii.with.Searsii.genome.0.01maf.0.5miss.0.1het.SNPs11663.zip
    The SNP matrix was used to curate the Ae. searsii collection, with a specific focus on identifying duplicates.

R-script

  1. All.aegilops.with.mock_Ref.phylogenetic.PCA.Rmd
    This file contains the R script used to generate a phylogenetic tree and PCA plot for the whole Aegilops collection.

  2. All.species.diversity.and.seg.loci.Rmd
    This file includes the R script used to compute Nei's genetic diversity indices and the number of segregating loci for each of the 23 species.

  3. Aegilops.duplicated.accessions.Rmd
    This file includes the R script used to identify genetically duplicated accessions. The script was specifically executed on diploid Aegilops species for which a reference genome is available.

  4. U-genome.clade.Rmd
    This file contains the R script used to generate the U-genome phylogenetic tree.

  5. neglecta-columnaris-de-novo.Rmd
    The R script file was used to cluster Ae. neglecta and Ae. columnaris, and to study their genomic relations and genome formula.

  6. M.genome.loci.in.tetraploids.Rmd
    The file contains an R script to study genetic variations in tetraploid Aegilops species that share the M genome. The variants in the input file were initially called on the diploid M genome species (Ae. comosa).

  7. crassa-vavilovii-juvenalis.Rmd
    This file contains an R script used to generate a phylogenetic tree of the hexaploid Aegilops group.

  8. Genome.vs.snp.plot.sitopsis.seg.loci.Rmd
    This file contains an R script to generate a bar chart showing the number of loci from diploid Aegilops species of the Sitopsis section that mapped to each wheat subgenome.

    In addition to these files, there are several line info text files that contain taxonomic information about the accessions. These files are essential for coloring the NJ phylogenetic tree and the PCA plots.
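
For readers unfamiliar with the diversity statistics named above, the following is a minimal sketch of Nei's gene diversity (expected heterozygosity, 1 minus the sum of squared allele frequencies, averaged over loci) together with a count of segregating loci for one species. It is written in Python for illustration and does not reproduce the repository's R script. It assumes single-character IUPAC genotype calls with "N" for missing data, and it omits any small-sample correction. The function names are hypothetical.

```python
from collections import Counter

MISSING = "N"  # assumed symbol for a missing genotype call
# IUPAC ambiguity codes for heterozygous biallelic calls
IUPAC_HET = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def allele_counts(calls):
    """Count allele copies at one locus, treating each accession as diploid."""
    counts = Counter()
    for call in calls:
        if call == MISSING:
            continue
        if call in IUPAC_HET:
            counts.update(IUPAC_HET[call])  # one copy of each allele
        else:
            counts[call] += 2               # homozygote: two copies
    return counts


def nei_diversity(loci):
    """Return (mean gene diversity, number of segregating loci).

    loci is a list of per-locus call sequences for the accessions of one species.
    Loci with no called genotypes are skipped.
    """
    per_locus = []
    segregating = 0
    for calls in loci:
        counts = allele_counts(calls)
        total = sum(counts.values())
        if total == 0:
            continue
        per_locus.append(1 - sum((n / total) ** 2 for n in counts.values()))
        if len(counts) > 1:
            segregating += 1
    mean_h = sum(per_locus) / len(per_locus) if per_locus else 0.0
    return mean_h, segregating
```

Calling `nei_diversity` once per species, on the subset of columns listed for that species in a line info file, would give one diversity value and one segregating-locus count per species.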

Methods

In this study, Aegilops accessions preserved and maintained by single seed descent (SSD) at the Wheat Genetics Resource Center at Kansas State University were subjected to genotyping-by-sequencing. DNA was extracted from seedlings of individual plants, and GBS libraries were prepared as described in the manuscript. Where genome assemblies were available, variants were detected from the GBS data using a reference-based pipeline such as TASSEL GBS. For accessions without an available genome assembly, the sequence FASTQ files were aligned to a mock reference generated from the raw GBS data. The accessions with the most sequence data were chosen to generate the mock reference. Variants were called using both the GBS-SNP-CROP and bcftools pipelines. Subsequently, the SNP matrices were filtered and used for various analyses.
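
The filtering step can be sketched as follows, using the thresholds the README gives for the all-species matrix (MAF > 0.01, missing data < 30%, heterozygosity < 10%). This is a Python illustration under stated assumptions, not the authors' pipeline. It assumes single-character IUPAC genotype calls with "N" for missing data, computes heterozygosity over called genotypes only, and uses hypothetical function names.

```python
from collections import Counter

MISSING = "N"  # assumed symbol for a missing genotype call
# IUPAC ambiguity codes for heterozygous biallelic calls
IUPAC_HET = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def site_stats(calls):
    """Return (minor allele frequency, missing rate, heterozygosity) for one SNP."""
    called = [c for c in calls if c != MISSING]
    missing_rate = 1 - len(called) / len(calls)
    if not called:
        return 0.0, missing_rate, 0.0
    alleles = Counter()
    n_het = 0
    for call in called:
        if call in IUPAC_HET:
            n_het += 1
            alleles.update(IUPAC_HET[call])  # one copy of each allele
        else:
            alleles[call] += 2               # homozygote: two copies
    ranked = sorted(alleles.values(), reverse=True)
    # The second most common allele is taken as the minor allele.
    maf = ranked[1] / sum(ranked) if len(ranked) > 1 else 0.0
    return maf, missing_rate, n_het / len(called)


def keep_site(calls, min_maf=0.01, max_missing=0.30, max_het=0.10):
    """Apply the three README thresholds to one SNP's calls."""
    maf, missing_rate, het = site_stats(calls)
    return maf > min_maf and missing_rate < max_missing and het < max_het
```

Applying `keep_site` to each row of a genotype matrix and keeping the rows that pass reproduces the logic of the filter, although exact SNP counts depend on details (for example, how heterozygosity is normalized) that this sketch only assumes.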

Usage notes

R and RStudio

Funding

Kansas State University