A genotype-by-sequencing dataset and identity-by-state matrix of genetic variation in Pinus radiata from 16 countries
Data files
Aug 21, 2025 version, files totaling 5.55 GB:
- GBS_metadata.csv (45.80 KB)
- NXGSQCAGRF24050329-3_variants_cleaned.tsv (5.55 GB)
- README.md (3.05 KB)
- Similarity_reduced_GBS_GLOBAL_June25_updated.csv (3.34 MB)
Abstract
Pinus radiata is a commercially important softwood species planted extensively worldwide, while its native populations remain endangered and of high conservation concern. This dataset provides genome-wide SNP genotyping data generated via genotype-by-sequencing (GBS) using double-digest RAD sequencing (ddRADseq) on needle-derived DNA. Samples were collected from 16 countries and include both domesticated breeding material and wild-origin individuals from native populations. The dataset includes a SNP-by-sample matrix and an identity-by-state (IBS) matrix describing pairwise genetic similarity among individuals. These data support applications in population structure analysis, breeding program design, genetic resource management, and conservation planning for P. radiata.
Dataset DOI: 10.5061/dryad.bzkh189pc
Description of the data and file structure
Needles were collected from four positions around each tree canopy and pooled per tree. DNA was extracted from homogenized needle tissue using the Qiagen DNeasy Plant Pro Kit, with replicate extractions pooled to maximize yield and representation. Genotyping was performed using genotype-by-sequencing (GBS/ddRADseq) with EcoRI and MseI digestion, ligation of barcoded adapters, size selection (280–375 bp), and sequencing on an Illumina NovaSeq 6000. Reads were demultiplexed, clustered de novo, and SNPs called using a Bayesian genotyping model. Variant data were filtered and combined with sample metadata to produce the final SNP dataset. Identity-by-state (IBS) genetic similarity matrices were also generated using PLINK.
Files and variables
File: Similarity_reduced_GBS_GLOBAL_June25_updated.csv
Description: Pairwise similarity matrix comparing every sample with every other sample in the dataset. Sample names link to the metadata file (GBS_metadata.csv), which records the GPS location and source country of each sample.
Variables
- Sample columns: One per individual, containing the identity-by-state (IBS) similarity between that individual and each other sample
File: GBS_metadata.csv
Description: Metadata for each sample. Contains the sample name, the country where the sample was collected, whether the sample came from an endemic or introduced source and, for endemic samples, which endemic population it belonged to. GPS coordinates are given for each sample.
Variables
- Sample_Name
- Country
- Endemic_Introduced
- Endemic_Source
- Latitude
- Longitude
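As a minimal sketch of how these columns might be used, the Python below reads the metadata with the standard-library `csv` module and counts samples per country. The column names are the ones listed above. The function name `count_by_country` and the demo rows (sample names, countries, category labels, coordinates) are made up for illustration and are not values from the dataset.

```python
import csv
import io
from collections import Counter


def count_by_country(handle):
    """Count samples per value of the Country column (hypothetical helper)."""
    reader = csv.DictReader(handle)
    return Counter(row["Country"] for row in reader)


# Made-up rows for illustration only. For the real file, pass
# open("GBS_metadata.csv", newline="") instead of this buffer.
demo = io.StringIO(
    "Sample_Name,Country,Endemic_Introduced,Endemic_Source,Latitude,Longitude\n"
    "S1,CountryA,Introduced,,-38.1,176.2\n"
    "S2,CountryA,Introduced,,-38.2,176.3\n"
    "S3,CountryB,Endemic,PopX,36.5,-121.9\n"
)
counts = count_by_country(demo)
print(counts.most_common())
```

The same pattern extends to the other columns, for example grouping by Endemic_Introduced before comparing similarity values between groups.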
File: NXGSQCAGRF24050329-3_variants_cleaned.tsv
Description: This dataset is derived from a Variant Call Format (VCF) file that has been exported into tab-separated values (TSV) format and contains Single Nucleotide Polymorphism (SNP) data from 821 individuals.
Variables
- #CHROM: Chromosome number
- POS: Position of the SNP on the chromosome
- REF: Reference base
- ALT: Alternate base(s)
- Sample columns: One per individual, containing genotype information
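Because the TSV is 5.55 GB, reading it one row at a time avoids loading it into memory. The sketch below assumes the first line of the file is a header containing the four fixed columns above followed by the sample names. The helper `iter_variants` is hypothetical, and the genotype strings in the demo are placeholders, since the genotype encoding is not described here. Genotype values are passed through unchanged.

```python
import csv
import io

FIXED_COLUMNS = ["#CHROM", "POS", "REF", "ALT"]


def iter_variants(handle, samples):
    """Yield (site, calls) per row, keeping only the requested sample columns.

    site  -> tuple of the four fixed columns, as strings
    calls -> dict mapping sample name to its raw genotype string
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # assumption: first line is the header
    index = {name: i for i, name in enumerate(header)}
    missing = [s for s in samples if s not in index]
    if missing:
        raise KeyError(f"samples not found in header: {missing}")
    for row in reader:
        site = tuple(row[index[c]] for c in FIXED_COLUMNS)
        calls = {s: row[index[s]] for s in samples}
        yield site, calls


# Placeholder data for illustration. For the real file, pass
# open("NXGSQCAGRF24050329-3_variants_cleaned.tsv", newline="").
demo = io.StringIO(
    "#CHROM\tPOS\tREF\tALT\tS1\tS2\n"
    "c1\t10\tA\tG\t0/0\t0/1\n"
    "c1\t25\tC\tT\t1/1\t0/0\n"
)
rows = list(iter_variants(demo, ["S2"]))
```

Selecting sample columns by name rather than by position keeps the code independent of the column order in the file.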
Code/software
The .tsv file (tab-separated values) can be opened and analyzed using a wide range of software and tools, depending on what you want to do:
- R / Python (pandas, tidyverse, etc.): Ideal for more complex statistical analysis, filtering variants, or creating custom visualizations.
- Database systems (SQLite, PostgreSQL, etc.): If the file is large, importing into a database makes it easier to query specific variants or samples.
- Spreadsheet software (Excel, Google Sheets, LibreOffice Calc): Convenient for opening and exploring a table, filtering columns, and performing simple analyses. At 5.55 GB, however, the full file is too large for this option to be practical.
- Custom scripts / pipelines: Since the file is plain-text tabular, you can parse it with bash, awk, or other scripting tools.
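For the database route mentioned above, one possible approach is to stream the TSV into SQLite in batches using only the Python standard library. This is a sketch under stated assumptions: the first line is taken to be the header, every column is stored as TEXT, and the names `load_tsv`, `quote_ident`, and the table name `variants` are illustrative. The demo rows are placeholders, not data from the file.

```python
import csv
import io
import sqlite3
from itertools import islice


def quote_ident(name):
    """Quote an SQL identifier so names such as #CHROM are accepted."""
    return '"' + name.replace('"', '""') + '"'


def load_tsv(conn, handle, table="variants", batch=10_000):
    """Create a table from the TSV header and insert rows in batches."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)  # assumption: first line is the header
    columns = ", ".join(f"{quote_ident(c)} TEXT" for c in header)
    conn.execute(f"CREATE TABLE {quote_ident(table)} ({columns})")
    placeholders = ", ".join("?" for _ in header)
    insert = f"INSERT INTO {quote_ident(table)} VALUES ({placeholders})"
    while True:
        rows = list(islice(reader, batch))  # read at most `batch` rows
        if not rows:
            break
        conn.executemany(insert, rows)
    conn.commit()


# Placeholder data; a real run would use a database file on disk and
# open("NXGSQCAGRF24050329-3_variants_cleaned.tsv", newline="").
demo = io.StringIO(
    "#CHROM\tPOS\tREF\tALT\tS1\tS2\n"
    "c1\t10\tA\tG\t0/0\t0/1\n"
    "c1\t25\tC\tT\t1/1\t0/0\n"
    "c2\t7\tG\tA\t0/1\t0/1\n"
)
conn = sqlite3.connect(":memory:")
load_tsv(conn, demo)
hits = conn.execute(
    'SELECT "POS" FROM variants WHERE "#CHROM" = ? ORDER BY rowid', ("c1",)
).fetchall()
```

If the file holds only the four fixed columns plus one column per individual, the table stays below SQLite's default limit of 2000 columns. Adding an index on the `#CHROM` column after loading would speed up per-chromosome queries.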
The CSV similarity matrix can be opened with a variety of software. PRIMER and R-based programs are commonly used to analyze this type of file.
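The similarity matrix is also small enough (3.34 MB) to load whole in Python. The sketch below assumes a square layout in which the header row lists sample names and the first column of each row repeats the sample name as a row label. That layout should be checked against the file before use. The helpers `load_similarity` and `nearest` are hypothetical, and the demo values are invented.

```python
import csv
import io


def load_similarity(handle):
    """Read a labelled square matrix into a nested dict: row -> column -> value."""
    reader = csv.reader(handle)
    header = next(reader)
    columns = header[1:]  # assumption: first header cell is a blank corner label
    matrix = {}
    for row in reader:
        matrix[row[0]] = {c: float(v) for c, v in zip(columns, row[1:])}
    return matrix


def nearest(matrix, sample, k=1):
    """Return the k samples most similar to `sample`, excluding itself."""
    others = [(value, name) for name, value in matrix[sample].items() if name != sample]
    others.sort(reverse=True)  # highest similarity first
    return [name for _, name in others[:k]]


# Invented values for illustration. For the real file, pass
# open("Similarity_reduced_GBS_GLOBAL_June25_updated.csv", newline="").
demo = io.StringIO(
    ",S1,S2,S3\n"
    "S1,1.0,0.9,0.7\n"
    "S2,0.9,1.0,0.8\n"
    "S3,0.7,0.8,1.0\n"
)
sim = load_similarity(demo)
closest = nearest(sim, "S3", k=2)
```

The sample names returned by `nearest` can then be looked up in GBS_metadata.csv to compare country and endemic or introduced status between close matches.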
