Genome-wide SNP datasets for Indonesian Curcuma longa
Data files
Oct 15, 2024 version files (2.05 MB)
- Curcuma_longa_SNP.vcf (2.04 MB)
- README.md (3.29 KB)
Abstract
Turmeric is a beneficial crop with notable health advantages, yet limited genetic data poses challenges for developing new varieties. This study assessed the genetic diversity of 40 turmeric genotypes using single nucleotide polymorphism (SNP) markers. We genotyped the accessions with DArTseq technology and then applied several filtering steps to maintain SNP quality. Filtering removed secondary loci, loci with a call rate <90%, monomorphic loci, loci with reproducibility <99%, loci with read depth <5, and loci with a minor allele frequency (MAF) <0.01, leaving 10,092 SNPs. We then used these SNPs in an association study for rhizome weight and curcuminoid content.
https://doi.org/10.5061/dryad.547d7wmhz
Description of the data and file structure
Data Description:
The SNP dataset contains 10,092 SNPs derived from 40 genotypes of Indonesian Curcuma longa (turmeric) using DArTseq (Diversity Arrays Technology sequencing). The identified SNPs underwent rigorous quality control filtering, and the following criteria were applied:
- Removal of secondary loci
- Call rate threshold: SNPs with a call rate <90% were excluded.
- Monomorphic loci exclusion: Only polymorphic SNPs were retained.
- Reproducibility threshold: SNPs with reproducibility <99% were discarded.
- Read depth filter: SNPs with a read depth <5 were eliminated.
- Minor Allele Frequency (MAF): SNPs with MAF <0.01 were removed.
A total of 10,092 high-quality SNPs were retained after filtering. The positions of 8,039 SNP markers (about 79.7%) were mapped to specific chromosomes, while 2,053 SNPs (about 20.3%) were chromosomally non-assigned.
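Three of the criteria above (call rate, monomorphic loci, and MAF) can be recomputed directly from genotype calls. The following is a minimal sketch, assuming biallelic SNPs with diploid `GT` strings such as `0/1` and `./.` for missing calls; the function names and the example genotypes are hypothetical and are not taken from the DArT pipeline or the dataset.

```python
def call_rate(genotypes):
    """Fraction of samples with a non-missing genotype call."""
    called = [g for g in genotypes if "." not in g]
    return len(called) / len(genotypes)


def minor_allele_frequency(genotypes):
    """MAF of a biallelic SNP from diploid GT strings such as '0/1'."""
    alleles = []
    for g in genotypes:
        if "." in g:
            continue  # skip missing calls
        alleles.extend(g.replace("|", "/").split("/"))
    if not alleles:
        return 0.0
    alt_freq = alleles.count("1") / len(alleles)
    return min(alt_freq, 1.0 - alt_freq)


def passes_filters(genotypes, min_call_rate=0.90, min_maf=0.01):
    """Apply the call-rate and MAF thresholds.

    A monomorphic locus has MAF == 0, so it fails the MAF threshold too.
    """
    maf = minor_allele_frequency(genotypes)
    return call_rate(genotypes) >= min_call_rate and maf >= min_maf


# Hypothetical genotypes for one SNP across 10 samples (not real data)
example = ["0/0", "0/1", "1/1", "0/0", "0/0",
           "0/1", "0/0", "0/0", "0/0", "./."]
print(call_rate(example), minor_allele_frequency(example), passes_filters(example))
```

Reproducibility and read depth are reported by the genotyping service per locus and cannot be derived from genotype calls alone, so they are left out of this sketch.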
Technology Used:
The SNPs were identified using DArTseq technology, combining genome complexity reduction through restriction enzymes with next-generation sequencing on the Illumina HiSeq2500 platform.
File Format:
The SNP data is provided in VCF (Variant Call Format), a standard format for storing variants including SNPs.
File Contents:
- VCF File: Contains 10,092 high-quality SNPs in VCF format. Of these, 8,039 SNPs have known chromosomal positions, while 2,053 SNPs are chromosomally non-assigned.
- Filtered SNP Data: SNPs that passed all quality control steps are included in the VCF file.
Dataset Overview
This Variant Call Format (VCF) file contains single nucleotide polymorphism (SNP) data from 40 Curcuma longa accessions from various regions of Indonesia. It contains the 10,092 high-quality SNPs retained after the quality filtering described above.
Curcuma_longa_SNP.vcf
File Details
- File Name: Curcuma_longa_SNP.vcf
- File Format: VCF (Variant Call Format) Version 4.2
- Date: March 26, 2024
- Source Software: PLINK v1.90
Data Description
The VCF file includes the following columns standard to the format:
- #CHROM: Chromosome number
- POS: Position of the SNP on the chromosome
- ID: Identifier of the SNP
- REF: Reference base
- ALT: Alternate base(s)
- QUAL: Quality score of the SNP
- FILTER: Filter status
- INFO: Additional information (e.g., allele frequency, number of samples)
- FORMAT: Data format
- Sample columns: One per individual, containing genotype information
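To illustrate how these columns fit together, here is a minimal sketch that splits one tab-separated VCF data line into its fixed columns and per-sample genotypes. The header, the record, the sample names (`S1`, `S2`), and the SNP identifier are hypothetical and only mimic the VCF 4.2 layout; they are not lines from Curcuma_longa_SNP.vcf.

```python
from typing import NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int
    snp_id: str
    ref: str
    alt: list
    genotypes: dict  # sample name -> GT string, e.g. "0/1"


def parse_record(line, sample_names):
    """Parse one VCF data line (the 9 fixed columns, then one column per sample)."""
    fields = line.rstrip("\n").split("\t")
    fixed, sample_fields = fields[:9], fields[9:]
    # FORMAT lists the keys of each sample column, e.g. "GT" or "GT:DP"
    gt_index = fixed[8].split(":").index("GT")
    genotypes = {
        name: value.split(":")[gt_index]
        for name, value in zip(sample_names, sample_fields)
    }
    return VcfRecord(fixed[0], int(fixed[1]), fixed[2], fixed[3],
                     fixed[4].split(","), genotypes)


# Hypothetical header and record (not taken from the dataset)
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
line = "1\t12345\tsnp_0001\tA\tG\t.\t.\tPR\tGT\t0/1\t1/1"

samples = header.split("\t")[9:]
record = parse_record(line, samples)
print(record)
```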
How to Open and Analyze the Data
The VCF file can be opened and analyzed with various bioinformatics tools and software. Some of the most common include:
- IGV (Integrative Genomics Viewer): Useful for visualizing genomic data.
- BCFtools: Command-line tool for working with VCF/BCF files, including filtering, viewing, and converting.
- GATK (Genome Analysis Toolkit): Provides tools to analyze high-throughput sequencing data with a focus on variant discovery.
- vcftools: Command-line program designed to work specifically with VCF files to perform various types of analyses.
- RStudio: Environment for analyzing the data in R, for example with the GAPIT package used in this study.
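For a quick look without installing any of these tools, a VCF file can also be scanned as plain text. The sketch below counts SNP records per chromosome using only the Python standard library. The embedded VCF text is hypothetical, and treating the label `0` as "chromosomally non-assigned" is an assumption (it is PLINK's usual code for unplaced variants); check how the actual file labels such SNPs.

```python
import io
from collections import Counter


def count_snps_by_chromosome(handle):
    """Count data records per #CHROM value, skipping header lines."""
    counts = Counter()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # meta-information, column header, or blank line
        counts[line.split("\t", 1)[0]] += 1
    return counts


# Hypothetical miniature VCF (not taken from the dataset)
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n"
    "1\t100\tsnp_a\tA\tG\t.\t.\t.\tGT\t0/1\n"
    "1\t250\tsnp_b\tC\tT\t.\t.\t.\tGT\t0/0\n"
    "0\t0\tsnp_c\tG\tA\t.\t.\t.\tGT\t1/1\n"
)

counts = count_snps_by_chromosome(io.StringIO(vcf_text))
total = sum(counts.values())
unassigned = counts.get("0", 0)  # assumption: "0" marks unplaced SNPs
print(dict(counts), total, unassigned)
```

To run the same count on the real data, pass an open file handle instead of the `io.StringIO` object, for example `open("Curcuma_longa_SNP.vcf")`.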
Data Collection
A diversity panel of 40 turmeric (Curcuma longa) genotypes from various regions of Indonesia, including six widely marketed commercial accessions, was established. These genotypes were cultivated at the Ciparanje Experimental Field, Faculty of Agriculture, Padjadjaran University, from January to October 2023. The field is located at 753 meters above sea level, with coordinates of 6°55’0.72804’’ S latitude and 107°46’18.46056’’ E longitude.
Genomic DNA was extracted from fresh leaves of three-month-old plants, using a single plant per genotype. DNA extraction followed the manufacturer's protocols, utilizing the Wizard® Genomic DNA Purification Kit (Promega, USA) in combination with the Geneaid Plant Genomic DNA Mini Kit (GP100) columns (Geneaid, Taiwan). DNA concentration and purity were verified with a Thermo Scientific NanoDrop Lite Spectrophotometer and confirmed by 2% agarose gel electrophoresis. Samples with a minimum concentration of 50 ng/µl and a 260/280 absorbance ratio of 1.8-2.0 were selected for SNP genotyping.
SNP genotyping was conducted in Australia using the Diversity Arrays Technology (DArT) sequencing approach. Genomic representations were generated by digesting 100 ng of DNA with PstI and either BstNI or TaqI frequent cutter enzymes in a solution of 10 mM Tris-OAc, 50 mM KOAc, 10 mM Mg(OAc)2, and 5 mM DTT (Jaccoud et al., 2001; Kilian et al., 2012). The resulting amplicons were organized in a 96-well microtiter plate and sequenced on the Illumina HiSeq2500 platform for 77 cycles. DNA sequences were processed with the DArT pipeline, which filters out low-quality sequences by setting a minimum Phred score of 30 for barcoded regions and a Phred score of 10 for the remaining sequences. Approximately 2.5 million sequences per sample were retained for SNP marker calling.
Data analysis was conducted using DArTsoft14 and DArTdb (Adu et al., 2021). Filtering criteria included the exclusion of secondary loci, call rates below 90%, monomorphic loci, reproducibility less than 99%, read depth below 5, and a minor allele frequency (MAF) threshold of 0.01 (Gruber et al., 2018).
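The Phred thresholds used in the sequence-quality step can be read as error probabilities through the standard relation P = 10^(-Q/10). The sketch below shows that conversion; the function name is hypothetical, and the code only illustrates the arithmetic, not how the DArT pipeline applies the thresholds internally.

```python
def phred_to_error_probability(q):
    """Convert a Phred quality score Q to the base-call error probability."""
    return 10 ** (-q / 10)


# The two thresholds mentioned in the text
barcode_error = phred_to_error_probability(30)  # stricter, for barcode regions
read_error = phred_to_error_probability(10)     # looser, for the rest of the read

print(f"Q30 -> {barcode_error:.4f}, Q10 -> {read_error:.4f}")
```

A Phred score of 30 corresponds to roughly one expected error in 1,000 base calls, and a score of 10 to roughly one in 10, which is why the barcode region, used to assign reads to samples, is held to the stricter threshold.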
Analyses Performed using the Data Set
In our study, we analyzed the population structure and kinship of turmeric accessions using a combination of Bayesian clustering and principal component analysis (PCA). Population structure was inferred using STRUCTURE software version 2.3.4 (Pritchard et al., 2000), which employs a Bayesian model-based clustering algorithm to estimate individual membership coefficients across potentially mixed populations. To determine the optimal number of genetic clusters, we evaluated K values ranging from 1 to 10 with 10,000 burn-in generations followed by 10,000 Markov Chain Monte Carlo (MCMC) iterations (Zanetta et al., 2024). The best-fit K was determined with the delta K method of Evanno et al. (2005), using StructureSelector (https://lmme.ac.cn/StructureSelector/; Li & Liu, 2018), which clarified the subpopulation structure within the accessions. Kinship coefficients between pairs of accessions were calculated using TASSEL v5.2.93, providing insight into genetic relatedness across the population. The PCA and kinship results were visualized using the GAPIT v3 package in RStudio 4.3.2, allowing us to identify major patterns of genetic variation and relatedness (Qu et al., 2015). This combined approach enabled a detailed examination of the genetic grouping and the extent of admixture among the turmeric accessions.
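For readers unfamiliar with the delta K criterion, the sketch below shows one common way to compute it from replicate STRUCTURE runs: the absolute second-order difference of the mean log-likelihood L(K), divided by the standard deviation of L(K) across runs. It assumes consecutive K values and at least two runs per K. The log-likelihood values are hypothetical and are not results from this study, and the code is not the StructureSelector implementation.

```python
from statistics import mean, stdev


def delta_k(log_likelihoods):
    """Compute delta K for each interior K.

    log_likelihoods maps K -> list of L(K) values from replicate runs.
    delta K = |mean L(K+1) - 2 * mean L(K) + mean L(K-1)| / sd(L(K))
    """
    ks = sorted(log_likelihoods)
    means = {k: mean(log_likelihoods[k]) for k in ks}
    result = {}
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        second_diff = abs(means[next_k] - 2 * means[k] + means[prev_k])
        spread = stdev(log_likelihoods[k])
        if spread > 0:  # delta K is undefined when all runs agree exactly
            result[k] = second_diff / spread
    return result


# Hypothetical L(K) values from three replicate runs per K (not real results)
runs = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4500.0, -4501.0, -4499.0],
    3: [-4400.0, -4420.0, -4380.0],
    4: [-4390.0, -4430.0, -4350.0],
}

scores = delta_k(runs)
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Delta K cannot be evaluated at the smallest or largest K tested, because the second-order difference needs a neighbor on each side.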
We measured rhizome weight and curcuminoid content (curcumin, demethoxycurcumin, and bisdemethoxycurcumin) in the 40 turmeric accessions to assess phenotypic variation. We then used these phenotypic data to test phenotype-genotype associations in a genome-wide association study (GWAS). The GWAS was conducted using the Genomic Association and Prediction Integrated Tool (GAPIT) in RStudio 4.3.2 (Wang & Zhang, 2021). The FarmCPU model was applied with PCA (Q-matrix) and kinship (K-matrix) as covariates to identify significant associations between SNP markers and rhizome weight and curcuminoid content (Bararyenya et al., 2020). Candidate genes were identified by annotating regions flanking the significant SNPs (500 bp upstream and downstream) using BLASTn against the NCBI nucleotide database, focusing on regions with the highest similarity and lowest e-value. This comprehensive genetic analysis provided insights into the genetic basis of key agronomic traits, informing potential breeding strategies for turmeric improvement (Jiang et al., 2021).
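The flanking-region step can be sketched as a small coordinate calculation: given a 1-based SNP position, take 500 bp on either side, clamped to the ends of the reference sequence. The function names and the short example sequence are hypothetical; the extracted sequence would then be submitted to BLASTn separately.

```python
def flanking_window(position, flank=500, sequence_length=None):
    """Return 1-based inclusive (start, end) coordinates around a SNP."""
    start = max(1, position - flank)
    end = position + flank
    if sequence_length is not None:
        end = min(end, sequence_length)
    return start, end


def extract_flank(sequence, position, flank=500):
    """Slice the flanking region (SNP included) out of a reference sequence."""
    start, end = flanking_window(position, flank, len(sequence))
    # Convert 1-based inclusive coordinates to a 0-based Python slice
    return sequence[start - 1:end]


print(flanking_window(12345))                       # window in the interior
print(flanking_window(200))                         # clamped at the start
print(flanking_window(900, sequence_length=1000))   # clamped at the end
print(extract_flank("ACGTACGTAC", 5, flank=2))      # toy sequence, 2 bp flanks
```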
References
- Adu, B. G., Akromah, R., Amoah, S., Nyadanu, D., Yeboah, A., Aboagye, L. M., Amoah, R. A., & Owusu, E. G. (2021). High-density DArT-based SilicoDArT and SNP markers for genetic diversity and population structure studies in cassava (Manihot esculenta Crantz). PLoS ONE, 16(7). https://doi.org/10.1371/journal.pone.0255290
- Bararyenya, A., Olukolu, B. A., Tukamuhabwa, P., Grüneberg, W. J., Ekaya, W., Low, J., Ochwo-Ssemakula, M., Odong, T. L., Talwana, H., Badji, A., Kyalo, M., Nasser, Y., Gemenet, D., Kitavi, M., & Mwanga, R. O. M. (2020). Genome-wide association study identified candidate genes controlling continuous storage root formation and bulking in hexaploid sweetpotato. BMC Plant Biology, 20(1). https://doi.org/10.1186/s12870-019-2217-9
- Evanno, G., Regnaut, S., & Goudet, J. (2005). Detecting the number of clusters of individuals using the software STRUCTURE: A simulation study. Molecular Ecology, 14(8), 2611–2620. https://doi.org/10.1111/j.1365-294X.2005.02553.x
- Gruber, B., Unmack, P. J., Berry, O. F., & Georges, A. (2018). dartr: An R package to facilitate analysis of SNP data generated from reduced representation genome sequencing. Molecular Ecology Resources, 18(3), 691–699. https://doi.org/10.1111/1755-0998.12745
- Jaccoud, D., Peng, K., Feinstein, D., & Kilian, A. (2001). Diversity Arrays: a solid state technology for sequence information independent genotyping. Nucleic Acids Research, 29(4), e25. https://doi.org/10.1093/nar/29.4.e25
- Jiang, J., Cao, Y., Shan, H., Wu, J., Song, X., & Jiang, Y. (2021). The GWAS Analysis of Body Size and Population Verification of Related SNPs in Hu Sheep. Frontiers in Genetics, 12. https://doi.org/10.3389/fgene.2021.642552
- Li, Y. L., & Liu, J. X. (2018). StructureSelector: A web-based software to select and visualize the optimal number of clusters using multiple methods. Molecular Ecology Resources, 18(1), 176–177. https://doi.org/10.1111/1755-0998.12719
- Pritchard, J. K., Stephens, M., & Donnelly, P. (2000). Inference of Population Structure Using Multilocus Genotype Data. Genetics, 155(2), 945–959. https://doi.org/10.1093/genetics/155.2.945
- Qu, C. M., Li, S. M., Duan, X. J., Fan, J. H., Jia, L. D., Zhao, H. Y., Lu, K., Li, J. N., Xu, X. F., & Wang, R. (2015). Identification of candidate genes for seed glucosinolate content using association mapping in Brassica napus L. Genes, 6(4), 1215–1229. https://doi.org/10.3390/genes6041215
- Wang, J., & Zhang, Z. (2021). GAPIT Version 3: Boosting Power and Accuracy for Genomic Association and Prediction. Genomics, Proteomics & Bioinformatics, 19(4), 629–640. https://doi.org/10.1016/j.gpb.2021.08.005
- Zanetta, C. U., Gali, K. K., Rafii, M. Y., Jaafar, J. N., Waluyo, B., Warkentin, T. D., & Ramlee, S. I. (2024). Dissecting genetic variation and association mapping for agro-morphological traits under high temperature stress in pea (Pisum sativum L.). Euphytica, 220(2). https://doi.org/10.1007/s10681-023-03279-x
