Landscape genomics analysis: A comprehensive guide to enhance the conservation and use of plant genetic resources
Data files
Dec 28, 2024 version files (42.44 MB in total)
- IndBng_SPET.vcf (42.27 MB)
- Meta_envdata.csv (170.27 KB)
- README.md (5.21 KB)
Abstract
Applying landscape genomics will significantly advance our understanding of biodiversity, informing effective genetic rescue and conservation strategies as well as crop development programs. Continued research will expand and refine these methods, broadening the range of taxa available for comparison. These data were used to develop a landscape genomics manual that offers insights into landscape genomic studies using several commonly applied methods. The manual also includes a collection of R scripts for achieving specific outcomes and creating simplified graphical displays of the results, using a case study based on Indian eggplant accessions.
README: Landscape genomics analysis: A comprehensive guide to enhance the conservation and use of plant genetic resources
https://doi.org/10.5061/dryad.08kprr5c7
Description of the data and file structure
1. Environmental variables data
This file (Meta_envdata.csv) contains metadata and environmental data for 324 eggplant samples from the Indian subcontinent.
The bioclimatic variables describe temperature, precipitation, solar radiation, wind speed, and vapor pressure. They were obtained from WorldClim 2.0 (Fick and Hijmans, 2017) at a 2.5 arc-minute (~5 km) resolution. The data represent a 30-year average from 1970 to 2000. We averaged the monthly solar radiation, wind speed, and vapor pressure rasters for this period to obtain annual-value rasters.
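The monthly-to-annual averaging described above can be sketched as a cell-by-cell mean over twelve layers. This is a minimal illustration only: the README does not state which tool performed the averaging, and the function name `annual_mean`, the nested-list grid layout, and the use of `None` for no-data cells are all assumptions made for this example.

```python
def annual_mean(monthly_layers):
    """Average 12 monthly grids cell by cell.

    Each layer is a list of rows; None marks a no-data cell (assumed
    convention). A cell with no data in any month stays None.
    """
    if len(monthly_layers) != 12:
        raise ValueError("expected 12 monthly layers")
    n_rows = len(monthly_layers[0])
    n_cols = len(monthly_layers[0][0])
    result = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            values = [layer[r][c] for layer in monthly_layers]
            if any(v is None for v in values):
                row.append(None)
            else:
                row.append(sum(values) / 12)
        result.append(row)
    return result


# Toy 1 x 2 grid: the first cell holds the month number (1..12),
# the second cell is missing in June.
layers = [[[float(m), None if m == 6 else 1.0]] for m in range(1, 13)]
annual = annual_mean(layers)
```

With real rasters the same operation would normally be done by a raster library rather than by nested loops; the loops are used here only to make the arithmetic explicit.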
The soil variables included nitrogen, soil organic carbon, organic carbon density, organic carbon stock, cation exchange capacity, pH, and clay, sand, and silt content. We downloaded the soil data from the SoilGrids database released in 2016 (https://soilgrids.org/) through ISRIC - WDC Soils (Hengl et al., 2017) at 250-meter resolution and a depth of 15-30 cm. The raster package in R (Hijmans and Van Etten, 2023) was used to extract the values at each sampling point from the georeferenced collection sites of the accessions. Table 1 below describes the variables in the environmental dataset in detail.
Table 1: Detailed descriptions of the samples’ metadata and the environmental variables (Bioclimatic and soil variables)
| Code | Description |
| --- | --- |
| G2P.Code | Sample genotyping codes |
| Accession | Genebank accession codes |
| lat | Latitude |
| lon | Longitude |
| Country | Country of sample origin |
| C_code | Country code |
| Region | Region of sampling in the country of origin |
| AMT_1 | BIO1 = Annual Mean Temperature (°C) |
| MTWaQ_10 | BIO10 = Mean Temperature of Warmest Quarter (°C) |
| MTCoQ_11 | BIO11 = Mean Temperature of Coldest Quarter (°C) |
| AP_12 | BIO12 = Annual Precipitation (mm) |
| PWeM_13 | BIO13 = Precipitation of Wettest Month (mm) |
| PDrM_14 | BIO14 = Precipitation of Driest Month (mm) |
| PSe_15 | BIO15 = Precipitation Seasonality (Coefficient of Variation) (%) |
| PWeQ_16 | BIO16 = Precipitation of Wettest Quarter (mm) |
| PDrQ_17 | BIO17 = Precipitation of Driest Quarter (mm) |
| PWaQ_18 | BIO18 = Precipitation of Warmest Quarter (mm) |
| PCoQ_19 | BIO19 = Precipitation of Coldest Quarter (mm) |
| MDiR_2 | BIO2 = Mean Diurnal Range (mean of monthly (max temp - min temp)) (°C) |
| Iso_3 | BIO3 = Isothermality (BIO2/BIO7) (×100) (%) |
| TSe_4 | BIO4 = Temperature Seasonality (standard deviation ×100) (°C) |
| MTWaM_5 | BIO5 = Max Temperature of Warmest Month (°C) |
| MiTCoQ_6 | BIO6 = Min Temperature of Coldest Month (°C) |
| TAR_7 | BIO7 = Temperature Annual Range (BIO5-BIO6) (°C) |
| MTWeQ_8 | BIO8 = Mean Temperature of Wettest Quarter (°C) |
| MTDrQ_9 | BIO9 = Mean Temperature of Driest Quarter (°C) |
| tmin | Minimum temperature (°C) |
| tmax | Maximum temperature (°C) |
| srad | Solar radiation (kJ m-2 day-1) |
| vapr | Water vapor pressure (kPa) |
| wind | Wind speed (m s-1) |
| cec | Cation exchange capacity (at pH 7) (mmol(c)/kg) |
| clay | Soil clay content (g/kg) |
| nitrogen | Soil nitrogen content (cg/kg) |
| ocd | Organic carbon density (hg/m³) |
| ocs | Soil organic carbon stock (t/ha) |
| pH | Soil pH in water (pH × 10) |
| sand | Soil sand content (g/kg) |
| silt | Soil silt content (g/kg) |
| soc | Soil organic carbon (dg/kg) |
| Elevs | Elevation (m) |
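Several soil variables in Table 1 are stored in scaled units (for example pH × 10 and nitrogen in cg/kg). The sketch below shows one way to rescale them to more conventional units after reading the CSV. It assumes that the column names in Meta_envdata.csv match the codes in Table 1 and that missing values appear as empty strings or `NA`; the names `SOIL_DIVISORS` and `to_conventional_units` are hypothetical and not part of the published scripts.

```python
# Divisors that convert the scaled units listed in Table 1
# into conventional units (assumed column names from Table 1).
SOIL_DIVISORS = {
    "cec": 10,        # mmol(c)/kg -> cmol(c)/kg
    "clay": 10,       # g/kg -> %
    "nitrogen": 100,  # cg/kg -> g/kg
    "ocd": 10,        # hg/m³ -> kg/m³
    "ocs": 10,        # t/ha -> kg/m²
    "pH": 10,         # pH × 10 -> pH
    "sand": 10,       # g/kg -> %
    "silt": 10,       # g/kg -> %
    "soc": 10,        # dg/kg -> g/kg
}


def to_conventional_units(row):
    """Return a copy of one CSV row with the soil columns rescaled.

    Columns that are absent or hold a missing-value marker are left as-is.
    """
    converted = dict(row)
    for column, divisor in SOIL_DIVISORS.items():
        value = row.get(column)
        if value not in (None, "", "NA"):
            converted[column] = float(value) / divisor
    return converted


# Made-up example row; the values are illustrative, not taken from the dataset.
example = {"G2P.Code": "X1", "pH": "65", "nitrogen": "150", "clay": "320"}
rescaled = to_conventional_units(example)
```

To apply this to the real file, each row yielded by `csv.DictReader` over Meta_envdata.csv could be passed through `to_conventional_units`.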
2. Genomic data
This file (IndBng_SPET.vcf) contains a dataset of 4,308 SNPs.
SPET library construction
DNA was extracted using the Qiagen plant mini-prep, the LGC Sbeadex kit, the SILEX protocol (Vilanova et al., 2020), or a modified CTAB method. A total of 33 DNA samples of the reference S. melongena ‘67/3’ line (nearly one per plate), obtained from a unique seed batch (Barchi et al., 2021; Barchi et al., 2019), were included as controls. The previously identified final set of 5,082 (5K) probes was used, and libraries were prepared as previously reported (Barchi et al., 2019) to genotype the whole set of accessions at IGATech (Udine, Italy). Sequencing was performed on an Illumina NextSeq 500 platform (Illumina, Inc., San Diego, CA, USA) using 150 bp single-end (150SE) chemistry. The raw sequencing data are available at NCBI SRA (BioProject IDs PRJNA808188 and PRJNA542231). Accessions with an average read depth below 10 were discarded from subsequent analyses.
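The accession-level depth filter at the end of this paragraph can be illustrated as follows. This is a sketch under assumptions: the README does not say how average read depth was computed, so the example simply takes a list of per-target depths for each accession. The function name `low_depth_accessions` and the accession labels are hypothetical.

```python
def low_depth_accessions(depths_by_accession, min_mean_depth=10.0):
    """Return the accessions whose mean read depth is below the threshold.

    depths_by_accession maps an accession label to a list of depth values.
    An accession with no depth values is treated as depth 0 (assumption).
    """
    flagged = []
    for accession, depths in depths_by_accession.items():
        mean_depth = sum(depths) / len(depths) if depths else 0.0
        if mean_depth < min_mean_depth:
            flagged.append(accession)
    return sorted(flagged)


# Made-up depths for three hypothetical accessions.
to_discard = low_depth_accessions(
    {"ACC_A": [12, 15, 9], "ACC_B": [4, 8, 11], "ACC_C": []}
)
```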
Read alignment and variant calling
Base calling and demultiplexing were carried out using the standard Illumina pipeline. Read quality checks and adapter trimming were carried out using ERNE (Del Fabbro et al., 2013) and Cutadapt (Martin, 2011). Reads were aligned to the reference eggplant genome (Barchi et al., 2021) using BWA-MEM (Li, 2013) with default parameters. SNPs were then called with GATK 4.1.9 (DePristo et al., 2011), following the software's best practices for germline short variant discovery as of June 2021 and as previously described in Barchi et al. (2019). To extract high-confidence SNPs, VCFtools (Danecek et al., 2011) was applied with the following parameters: --min-meanDP 15 and no more than 5% missing data.
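The two site-level filters named above (a minimum mean depth of 15 and at most 5% missing genotypes) can be sketched in plain Python to show what they test. This is an illustration, not a reimplementation of VCFtools: averaging depth only over samples that report a DP value, and counting any genotype containing `.` as missing, are assumptions of this example. The function names are hypothetical.

```python
def site_stats(vcf_line):
    """Return (mean_depth, missing_fraction) for one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    gt_idx = format_keys.index("GT")
    dp_idx = format_keys.index("DP") if "DP" in format_keys else None
    samples = fields[9:]
    depths = []
    missing = 0
    for sample in samples:
        values = sample.split(":")
        if "." in values[gt_idx]:  # "./." or a partially missing call
            missing += 1
        if dp_idx is not None and dp_idx < len(values) and values[dp_idx].isdigit():
            depths.append(int(values[dp_idx]))
    mean_depth = sum(depths) / len(depths) if depths else 0.0
    return mean_depth, missing / len(samples)


def keep_site(vcf_line, min_mean_dp=15.0, max_missing=0.05):
    """Apply both thresholds to one site."""
    mean_depth, missing_fraction = site_stats(vcf_line)
    return mean_depth >= min_mean_dp and missing_fraction <= max_missing


# Three made-up sites with four samples each.
prefix = ["1", "100", ".", "A", "G", ".", "PASS", ".", "GT:DP"]
site_ok = "\t".join(prefix + ["0/1:20", "0/0:18", "1/1:16", "0/0:22"])
site_shallow = "\t".join(prefix + ["0/1:5", "0/0:8", "1/1:10", "0/0:12"])
site_missing = "\t".join(prefix + ["./.:.", "0/0:20", "1/1:20", "0/0:20"])
```

Here `site_ok` passes both thresholds, `site_shallow` fails on depth, and `site_missing` fails because one of four genotypes (25%) is missing.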
Usage Notes
.csv files can be viewed in a standard text editor or in Excel.
.vcf files can be viewed in some text editors (e.g., glogg), but are better viewed in IGV or BAMseek.
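Because a VCF file is tab-delimited text, a quick sanity check is also possible without a viewer. The sketch below counts the samples and variant records in an open VCF stream. The function name `summarize_vcf` is hypothetical, and the in-memory example stands in for the real file.

```python
import io


def summarize_vcf(handle):
    """Return (number_of_samples, number_of_variant_records) for a VCF stream."""
    samples = []
    n_records = 0
    for line in handle:
        if line.startswith("##"):
            continue  # meta-information lines
        if line.startswith("#CHROM"):
            # Sample names follow the nine fixed columns of the header line.
            samples = line.rstrip("\n").split("\t")[9:]
        elif line.strip():
            n_records += 1
    return len(samples), n_records


# Tiny made-up VCF with two samples and two records.
demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t10\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0\n"
    "1\t20\t.\tC\tT\t.\tPASS\t.\tGT\t1/1\t./.\n"
)
summary = summarize_vcf(demo)
```

Running the same function on an open handle to IndBng_SPET.vcf should report 4,308 records if the file matches the description above.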
Methods
Landscape genomics manual data
1. Environmental variables data
This file (Meta_envdata.csv) contains metadata and environmental data for 324 eggplant samples from the Indian subcontinent.
The bioclimatic variables describe temperature, precipitation, solar radiation, wind speed, and vapor pressure. They were obtained from WorldClim 2.0 (Fick and Hijmans, 2017) at a 2.5 arc-minute (~5 km) resolution. The data represent a 30-year average from 1970 to 2000. We averaged the monthly solar radiation, wind speed, and vapor pressure rasters for this period to obtain annual-value rasters.
The soil variables included nitrogen, soil organic carbon, organic carbon density, organic carbon stock, cation exchange capacity, pH, and clay, sand, and silt content. We downloaded the soil data from the SoilGrids database released in 2016 (https://soilgrids.org/) through ISRIC - WDC Soils (Hengl et al., 2017) at 250-meter resolution and a depth of 15-30 cm.
2. Genomic data
This file (IndBng_SPET.vcf) contains a dataset of 4,308 SNPs.
SPET library construction
DNA was extracted using the Qiagen plant mini-prep, the LGC Sbeadex kit, the SILEX protocol (Vilanova et al., 2020), or a modified CTAB method. A total of 33 DNA samples of the reference S. melongena ‘67/3’ line (nearly one per plate), obtained from a unique seed batch (Barchi et al., 2021; Barchi et al., 2019), were included as controls. The previously identified final set of 5,082 (5K) probes was used, and libraries were prepared as previously reported (Barchi et al., 2019) to genotype the whole set of accessions at IGATech (Udine, Italy). Sequencing was performed on an Illumina NextSeq 500 platform (Illumina, Inc., San Diego, CA, USA) using 150 bp single-end (150SE) chemistry. The raw sequencing data are available at NCBI SRA (BioProject IDs PRJNA808188 and PRJNA542231). Accessions with an average read depth below 10 were discarded from subsequent analyses.
Read alignment and variant calling
Base calling and demultiplexing were carried out using the standard Illumina pipeline. Read quality checks and adapter trimming were carried out using ERNE (Del Fabbro et al., 2013) and Cutadapt (Martin, 2011). Reads were aligned to the reference eggplant genome (Barchi et al., 2021) using BWA-MEM (Li, 2013) with default parameters. SNPs were then called with GATK 4.1.9 (DePristo et al., 2011), following the software's best practices for germline short variant discovery as of June 2021 and as previously described in Barchi et al. (2019). To extract high-confidence SNPs, VCFtools (Danecek et al., 2011) was applied with the following parameters: --min-meanDP 15 and no more than 5% missing data.