A genotype-by-sequencing dataset and identity-by-state matrix of genetic variation in Pinus radiata from 16 countries

Addison, Sarah

Research facility: Scion

Published Aug 21, 2025 on Dryad. https://doi.org/10.5061/dryad.bzkh189pc

Data files

Aug 21, 2025 version files: 5.55 GB

GBS_metadata.csv (45.80 KB)
NXGSQCAGRF24050329-3_variants_cleaned.tsv (5.55 GB)
README.md (3.05 KB)
Similarity_reduced_GBS_GLOBAL_June25_updated.csv (3.34 MB)

Abstract

Pinus radiata is a commercially important softwood species planted extensively worldwide, while its native populations remain endangered and of high conservation concern. This dataset provides genome-wide SNP genotyping data generated via genotype-by-sequencing (GBS) using double-digest RAD sequencing (ddRADseq) on needle-derived DNA. Samples were collected from 16 countries and include both domesticated breeding material and wild-origin individuals from native populations. The dataset includes a SNP-by-sample matrix and an identity-by-state (IBS) matrix describing pairwise genetic similarity among individuals. These data support applications in population structure analysis, breeding program design, genetic resource management, and conservation planning for P. radiata.

Dataset DOI: 10.5061/dryad.bzkh189pc

Description of the data and file structure

Needles were collected from four positions around each tree canopy and pooled per tree. DNA was extracted from homogenized needle tissue using the Qiagen DNeasy Plant Pro Kit, with replicate extractions pooled to maximize yield and representation. Genotyping was performed using genotype-by-sequencing (GBS/ddRADseq) with EcoRI and MseI digestion, ligation of barcoded adapters, size selection (280–375 bp), and sequencing on an Illumina NovaSeq 6000. Reads were demultiplexed, clustered de novo, and SNPs called using a Bayesian genotyping model. Variant data were filtered and combined with sample metadata to produce the final SNP dataset. Identity-by-state (IBS) genetic similarity matrices were also generated using PLINK.
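To make the identity-by-state measure concrete, the sketch below computes IBS similarity for one pair of individuals as the proportion of alleles shared across loci where both have a genotype call. It assumes genotypes coded as alternate-allele counts (0, 1, 2) with None for missing calls. The function name and the encoding are illustrative assumptions, and the exact PLINK options used to produce the deposited matrix are not stated here.

```python
from typing import Optional, Sequence


def ibs_similarity(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> float:
    """Proportion of alleles shared identical-by-state between two individuals.

    Genotypes are alternate-allele counts (0, 1, 2); None marks a missing call.
    """
    shared = 0.0
    compared = 0
    for ga, gb in zip(a, b):
        if ga is None or gb is None:
            continue  # skip loci where either genotype is missing
        # Two shared alleles score 1.0, one scores 0.5, none scores 0.0.
        shared += 1.0 - abs(ga - gb) / 2.0
        compared += 1
    if compared == 0:
        raise ValueError("no loci with calls in both individuals")
    return shared / compared


# Made-up genotypes for two individuals at four loci.
print(ibs_similarity([0, 1, 2, None], [0, 2, 0, 1]))
```

The fourth locus is skipped because one call is missing, so the score is averaged over the three remaining loci.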

Files and variables

File: Similarity_reduced_GBS_GLOBAL_June25_updated.csv

Description: Pairwise similarity matrix comparing every sample with every other sample in the dataset. Sample names link to the metadata file (GBS_metadata.csv), which records the GPS location and source country of each sample.

Variables

Sample columns: One per individual, containing the identity-by-state (IBS) similarity between that individual and each other sample
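As a minimal sketch of reading this matrix, the example below loads a square similarity table into a nested dictionary using only the Python standard library. It assumes the first row holds sample names and the first column of each later row holds that row's sample name. This layout is an assumption and should be checked against the actual file. The demo values and the function name load_similarity are made up for illustration.

```python
import csv
import io


def load_similarity(handle):
    """Read a square similarity matrix into {row_sample: {col_sample: value}}.

    Assumes a header row of sample names and row labels in the first column.
    """
    reader = csv.reader(handle)
    header = next(reader)
    columns = header[1:]  # skip the corner cell above the row labels
    matrix = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        matrix[row[0]] = {
            col: float(value)
            for col, value in zip(columns, row[1:])
            if value != ""  # leave empty cells out rather than guessing
        }
    return matrix


# Tiny made-up matrix standing in for the real file. For the real data, pass
# open("Similarity_reduced_GBS_GLOBAL_June25_updated.csv", newline="") instead.
demo = io.StringIO(",S1,S2,S3\nS1,1.0,0.8,0.6\nS2,0.8,1.0,0.7\nS3,0.6,0.7,1.0\n")
similarity = load_similarity(demo)
print(similarity["S1"]["S2"])
```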

File: GBS_metadata.csv

Description: Metadata for each sample. Contains the sample name, the country where the sample was collected, whether the sample came from an endemic or introduced source, and, for endemic samples, the endemic population it belongs to. GPS coordinates are provided for each sample.

Variables

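Sample_Name: Sample identifier, matching the sample names used in the similarity matrix
Country: Country where the sample was collected
Endemic_Introduced: Whether the sample came from an endemic or introduced source
Endemic_Source: Endemic population to which an endemic sample belongs
Latitude: GPS latitude of the sample
Longitude: GPS longitude of the sample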
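The metadata columns can be summarized with a few lines of standard-library Python. The sketch below counts samples per value of a chosen column, for example Country. The column names come from the list above, while the rows in the demo table and the helper name count_by are invented purely for illustration.

```python
import csv
import io
from collections import Counter


def count_by(handle, column):
    """Count how many metadata rows share each value of `column`."""
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader)


# Made-up rows standing in for GBS_metadata.csv; real values will differ.
demo = io.StringIO(
    "Sample_Name,Country,Endemic_Introduced,Endemic_Source,Latitude,Longitude\n"
    "S1,CountryA,Endemic,PopX,0.0,0.0\n"
    "S2,CountryA,Endemic,PopY,0.0,0.0\n"
    "S3,CountryB,Introduced,,0.0,0.0\n"
)
per_country = count_by(demo, "Country")
print(per_country.most_common())
```

Passing "Endemic_Introduced" or "Endemic_Source" as the column gives the corresponding breakdowns.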

File: NXGSQCAGRF24050329-3_variants_cleaned.tsv

Description: This dataset is derived from a Variant Call Format (VCF) file that has been exported into tab-separated values (TSV) format and contains Single Nucleotide Polymorphism (SNP) data from 821 individuals.

Variables

#CHROM: Chromosome number
POS: Position of the SNP on the chromosome
REF: Reference base
ALT: Alternate base(s)
Sample columns: One per individual, containing genotype information
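Because the variant table is large, it is usually easier to stream it one row at a time than to load it whole. The sketch below separates the four site columns listed above from the per-sample genotype columns and counts sites with a single alternate allele. It assumes multiple ALT alleles are comma-separated, as in VCF. The genotype strings in the demo are made up, since the exact encoding in the sample columns is not described here. The names iter_variants and count_biallelic are illustrative.

```python
import csv
import io

SITE_COLUMNS = ("#CHROM", "POS", "REF", "ALT")


def iter_variants(handle):
    """Yield (site, genotypes) pairs from a tab-separated variant table."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        site = {name: row[name] for name in SITE_COLUMNS}
        genotypes = {k: v for k, v in row.items() if k not in SITE_COLUMNS}
        yield site, genotypes


def count_biallelic(handle):
    """Count sites whose ALT field lists exactly one allele (no comma)."""
    return sum(1 for site, _ in iter_variants(handle) if "," not in site["ALT"])


# Made-up rows standing in for NXGSQCAGRF24050329-3_variants_cleaned.tsv.
demo = io.StringIO(
    "#CHROM\tPOS\tREF\tALT\tS1\tS2\n"
    "1\t10\tA\tG\t0/0\t0/1\n"
    "1\t25\tC\tT,G\t0/1\t1/2\n"
    "2\t7\tG\tA\t1/1\t0/0\n"
)
print(count_biallelic(demo))
```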

Code/software

The .tsv file (tab-separated values) can be opened and analyzed using a wide range of software and tools, depending on what you want to do:

R / Python (pandas, tidyverse, etc.): Ideal for more complex statistical analysis, filtering variants, or creating custom visualizations.
Database systems (SQLite, PostgreSQL, etc.): Because the file is large (5.55 GB), importing it into a database makes it easier to query specific variants or samples.
Spreadsheet software (Excel, Google Sheets, LibreOffice Calc): Suitable for exploring a table, filtering columns, and performing simple analyses, but at 5.55 GB this file is too large for most spreadsheet applications to open in practice.
Custom scripts / pipelines: Since the file is plain-text tabular, you can parse it with bash, awk, or other scripting tools.
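As one possible illustration of the database option, the sketch below streams the site columns of a tab-separated variant table into SQLite and indexes them by chromosome and position. The table name sites, the index name, and the function load_sites are hypothetical choices, and the demo rows are made up. Only the four site columns are stored here; the per-sample genotype columns would need their own table.

```python
import csv
import io
import sqlite3


def load_sites(handle, connection):
    """Stream #CHROM, POS, REF and ALT from a TSV into an indexed SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS sites "
        "(chrom TEXT, pos INTEGER, ref TEXT, alt TEXT)"
    )
    reader = csv.DictReader(handle, delimiter="\t")
    # A generator keeps memory use flat regardless of file size.
    rows = ((r["#CHROM"], int(r["POS"]), r["REF"], r["ALT"]) for r in reader)
    connection.executemany("INSERT INTO sites VALUES (?, ?, ?, ?)", rows)
    connection.execute(
        "CREATE INDEX IF NOT EXISTS idx_sites_chrom_pos ON sites (chrom, pos)"
    )
    connection.commit()


# In-memory database and made-up rows; use a file path and the real TSV in practice.
connection = sqlite3.connect(":memory:")
demo = io.StringIO(
    "#CHROM\tPOS\tREF\tALT\tS1\n"
    "1\t10\tA\tG\t0/1\n"
    "1\t25\tC\tT\t0/0\n"
    "2\t7\tG\tA\t1/1\n"
)
load_sites(demo, connection)
on_chrom_1 = connection.execute(
    "SELECT COUNT(*) FROM sites WHERE chrom = ?", ("1",)
).fetchone()[0]
print(on_chrom_1)
```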

The CSV similarity matrix can be opened with a variety of software. PRIMER and R-based programs are commonly used to analyze this type of file.
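Many downstream analyses expect distances rather than similarities. A minimal sketch, assuming the similarity values lie between 0 and 1, converts each value to a distance as 1 minus similarity and finds each sample's closest relative. The nested-dictionary input, the demo values, and the function names are illustrative assumptions rather than part of the deposited workflow.

```python
def to_distance(similarity):
    """Convert {a: {b: similarity}} into {a: {b: 1 - similarity}}."""
    return {
        a: {b: 1.0 - value for b, value in row.items()}
        for a, row in similarity.items()
    }


def nearest_neighbour(distance, sample):
    """Return the other sample with the smallest distance to `sample`."""
    others = {b: d for b, d in distance[sample].items() if b != sample}
    return min(others, key=others.get)


# Made-up similarity values for three samples.
similarity_demo = {
    "S1": {"S1": 1.0, "S2": 0.8, "S3": 0.6},
    "S2": {"S1": 0.8, "S2": 1.0, "S3": 0.7},
    "S3": {"S1": 0.6, "S2": 0.7, "S3": 1.0},
}
distance_demo = to_distance(similarity_demo)
print(nearest_neighbour(distance_demo, "S1"))
```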
