Data and code from: Integrating genomics, collections, and community science to delimit species clarifies the taxonomy of a variable monitor lizard (Varanus tristis)
Data files
Oct 22, 2025 version files (74.39 MB total):
- Custom_scripts.zip (6.48 KB)
- Data.zip (74.30 MB)
- MMRR_Wang2013_code.zip (1.04 KB)
- README.md (16.69 KB)
- Script.R (67.71 KB)
Abstract
The accurate characterization of species diversity is a vital prerequisite for ecological and evolutionary research, as well as conservation. Thus, it is necessary to generate robust hypotheses of species limits based on the inference of evolutionary processes. Integrative species delimitation, the inference of species limits based on multiple sources of evidence, can provide unique insight into species diversity and the processes behind it. Here, we show how community observations can be integrated with standard molecular and phenotypic datasets under an integrative framework to identify the processes generating genetic and phenotypic variation. We implement this approach in Varanus tristis, a widespread and variable complex of Australian monitor lizards. Using genomic, phenotypic (linear and geometric morphometrics, coloration), spatial, and environmental data, we show that disparity in this complex is inconsistent with intraspecific variation and instead suggests that speciation has occurred. Based on our results, we provide an updated taxonomy for this complex and identify the processes that may have been responsible for the geographic sorting of variation. Our workflow provides a guideline for the integrative analysis of several types of data to identify the occurrence and causes of speciation. Furthermore, our study highlights the benefits and caveats associated with community science and machine learning—two tools used here—in taxonomic research.
Dataset DOI: 10.5061/dryad.tqjq2bwb0
Description of the data and file structure
This repository contains the data and code necessary to replicate the analyses in our manuscript, which uses an integrative approach to delimit species in the monitor lizard Varanus tristis.
Files and variables
File: Custom_scripts.zip
Description: Contains custom R scripts needed to run the analyses:
- gl2fasta_snp_cjpv.R transforms a "genlight" object to a "fasta" file.
- gl2svdquartets_cjpv.R transforms a "genlight" object to an input file for SVDquartets.
- KeepFromCorrs.R selects independent variables from a set of correlated variables based on a correlation coefficient.
- KeepFromGraphs.R selects independent variables from a set of correlated variables based on graphs.
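One way correlation-based variable selection can work is sketched below in base R. This is not the code in KeepFromCorrs.R: the function name `keep_uncorrelated`, the greedy rule, and the cutoff of 0.7 are assumptions made for illustration only.

```r
# Hypothetical helper (not part of Custom_scripts.zip): greedily keep a
# variable only if its absolute Pearson correlation with every variable
# already kept is below the cutoff.
keep_uncorrelated <- function(df, cutoff = 0.7) {
  cors <- abs(cor(df, use = "pairwise.complete.obs"))
  kept <- character(0)
  for (v in colnames(cors)) {
    # all() on an empty comparison is TRUE, so the first variable is always kept
    if (all(cors[v, kept] < cutoff)) {
      kept <- c(kept, v)
    }
  }
  kept
}

# Toy data: b is almost a copy of a, c is unrelated to both
set.seed(1)
a <- rnorm(100)
toy <- data.frame(a = a, b = a + rnorm(100, sd = 0.01), c = rnorm(100))
selected <- keep_uncorrelated(toy)
print(selected)
```

The result of a greedy rule like this depends on column order, so the variables of greatest interest would need to come first in the data frame.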
File: MMRR_Wang2013_code.zip
Description: Contains the function to implement the multiple matrix regression with randomization approach of Wang (2013) (https://doi.org/10.1111/evo.12134)
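The general idea of matrix regression with randomization can be sketched in base R as follows. This is a simplified illustration, not the function bundled in MMRR_Wang2013_code.zip: the name `mmrr_sketch`, its arguments, and the use of R² as the only test statistic are assumptions made for brevity.

```r
# Minimal sketch of matrix regression with randomization.
# mmrr_sketch and its arguments are hypothetical names.
lower <- function(m) m[lower.tri(m)]

mmrr_sketch <- function(Y, X, nperm = 199) {
  # Y: symmetric response distance matrix
  # X: named list of symmetric predictor distance matrices
  preds <- as.data.frame(lapply(X, lower))
  fit_r2 <- function(Ymat) {
    dat <- data.frame(y = lower(Ymat), preds)
    summary(lm(y ~ ., data = dat))$r.squared
  }
  observed <- fit_r2(Y)
  n <- nrow(Y)
  permuted <- replicate(nperm, {
    idx <- sample(n)
    # Permute rows and columns together so pairwise structure is preserved
    fit_r2(Y[idx, idx])
  })
  # One-sided p-value: share of permutations with R^2 at least as large
  p_value <- (sum(permuted >= observed) + 1) / (nperm + 1)
  list(r.squared = observed, p.value = p_value)
}

# Toy example: "genetic" distance built from geographic distance plus noise
set.seed(42)
coords <- matrix(runif(40), ncol = 2)        # 20 toy localities
geo <- as.matrix(dist(coords))
noise <- matrix(0, 20, 20)
noise[lower.tri(noise)] <- rnorm(190, sd = 0.05)
noise <- noise + t(noise)                    # keep the matrix symmetric
gen <- geo + noise
env <- as.matrix(dist(rnorm(20)))            # unrelated toy predictor
res <- mmrr_sketch(gen, list(geo = geo, env = env))
print(res)
```

Rows and columns are permuted jointly because the entries of a distance matrix are not independent observations, so an ordinary regression p-value would be unreliable.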
File: Script.R
Description: Script used to run all the analyses described in the paper.
File: Data.zip
Description: Data used by Script.R:
- a_hat.csv contains the pairwise a-hat metric of genetic distances. "NA" included to allow R to correctly read the file. Contains lower triangle of matrix only (blank cells used for diagonal and upper triangle).
- all_data_inds.csv is the metadata for the individuals that have all types of data available:
- Collection source: Herpetological collections where the specimens associated with each tissue sample are deposited.
- Collection voucher: Unique identifier for each specimen within its respective collection.
- Tissue source: Biological sample collections from which the tissue samples were sourced.
- Tissue voucher: Unique identifier for each tissue sample within its respective collection.
- Lon: Longitude in decimal degrees.
- Lat: Latitude in decimal degrees.
- MorphologyID: Indicates how the individuals are labeled in the morphological data files.
- MolecularID: Indicates how the individuals are labeled in the molecular data files.
- bil2d_dorsal_supersom.csv contains the geometric morphometric data for the dorsal head shape for the individuals included in the SuperSOM analyses. First column (missing header; containing source collection and voucher) read by R as row names. V1–V66 represent landmark/semi-landmark coordinates. X-value and Y-value for each of the 33 landmarks/semi-landmarks given consecutively.
- citizen_science_data.csv contains metadata for the community science photographs used in the coloration analyses:
- decimalLatitude: Latitude in decimal degrees.
- decimalLongitude: Longitude in decimal degrees.
- url: URL associated with each image.
- head: Head color.
- col_df.rdata contains the environmental data used in the SuperSOM analyses. Once read into R, it returns a data frame with the following columns:
- Collection source: Herpetological collections where the specimens associated with each tissue sample are deposited.
- Collection voucher: Unique identifier for each specimen within its respective collection.
- Tissue source: Biological sample collections from which the tissue samples were sourced.
- Tissue voucher: Unique identifier for each tissue sample within its respective collection.
- Lon: Longitude in decimal degrees.
- Lat: Latitude in decimal degrees.
- MorphologyID: Indicates how the individuals are labeled in the morphological data files.
- MolecularID: Indicates how the individuals are labeled in the molecular data files.
- elev: Elevation (in m).
- solar: Annual mean solar radiation (in kJ/m2/day).
- bio_1: Annual mean temperature (in °C).
- bio_4: Temperature seasonality (standard deviation in °C * 100).
- bio_5: Maximum temperature of warmest month (in °C).
- bio_6: Minimum temperature of coldest month (in °C).
- bio_12: Annual precipitation (in mm).
- bio_15: Precipitation seasonality (coefficient of variation).
- bio_13: Precipitation of wettest month (in mm).
- bio_14: Precipitation of driest month (in mm).
- color_pred_df.rdata contains the color data and predictors used in the SuperSOM analyses. Once read into R, it returns a data frame with the following columns:
- individual: Unique identifier for each individual.
- head: Head color.
- decimalLatitude: Latitude in decimal degrees * −1. Log-transformed and scaled.
- dist_to_closest: Great-circle distance to closest individual in dataset (in km). Log-transformed and scaled.
- elev: Elevation (in m). Log-transformed and scaled.
- solar: Annual mean solar radiation (in kJ/m2/day). Log-transformed and scaled.
- bio_1: Annual mean temperature (in °C). Log-transformed and scaled.
- bio_12: Annual precipitation (in mm). Log-transformed and scaled.
- examined_specimens_color.csv contains the color data for examined specimens:
- download_path: Path to image of each specimen.
- recordID: Collection and voucher of each specimen.
- scientificName: Taxon to which each specimen belongs.
- decimalLatitude: Latitude in decimal degrees.
- decimalLongitude: Longitude in decimal degrees.
- kml_coords.txt defines the polygon used in the EEMS analyses. Given as the coordinates of successive points: longitude in decimal degrees, latitude in decimal degrees, 0 (0 stands for vertical dimension, which is absent from the polygon).
- proc2d_lateral_supersom.csv contains the geometric morphometric data for the lateral head shape for the individuals included in the SuperSOM analyses. First column (missing header; containing source collection and voucher) read by R as row names. 1.X,1.Y–10.X,10.Y represent landmark coordinates. X-value and Y-value for each of the 10 landmarks given consecutively.
- raw_morphometric_data.csv contains the raw linear morphometric data. Missing measurements (e.g., due to incomplete toes) given as "NA":
- Species: Species.
- Clade: Subspecific taxon to which each specimen was assigned (V. t. tristis or V. t. orientalis).
- Museum: Herpetological collections where the specimens are deposited.
- Museum.No.: Unique identifier for each specimen within its respective collection.
- Latitude: Latitude in decimal degrees.
- Longitude: Longitude in decimal degrees.
- Sex: Female (F), male (M), or uncertain (NA).
- Mature.: Yes (Y), no (N), possibly mature based on body size (Y?), or uncertain (?).
- Tail.complete: Is the tail complete? Yes or no.
- SVL: Snout-vent length (in mm).
- Tail.length..dorsally.: Tail length measured along the dorsal midline (in mm).
- Body.length..gular.fold.vent..dorsally.: Body length (between vent and gular fold) measured along the dorsal midline (in mm).
- Head.Length.Mid..snout.ant.ear.: Head length (between anterior end of snout and level of anterior edge of ear opening) measured along the dorsal midline (in mm).
- Head.width..ant.ears.: Head width measured between anterior edge of ear openings (in mm).
- Head.depth..mid.eyes.: Head depth measured at level of middle of eyes (in mm).
- Neck.length: Neck length (measured between gular fold and anterior edge of ear opening) (in mm).
- Hip.width..between.legs..dorsally.: Hip width measured dorsally between hindlimbs (in mm).
- X1.3.tail.width: Width of tail measured at the level of the posterior edge of the anteriormost third of its length (in mm).
- X1.3.tail.depth: Depth of tail measured at the level of the posterior edge of the anteriormost third of its length (in mm).
- Upper.arm.length: Length of humerus (in mm).
- Lower.arm.length: Length of ulna (in mm).
- Hand.length..wrist.base.of.finger.IV.: Hand length (between wrist and base of finger IV) (in mm).
- X4th.finger.length..NOT.inc..webbing.: Length of finger IV, excluding any interdigital webbing (in mm).
- Hand.width..perpendicular.base.finger.V.: Hand width, measured perpendicular to base of finger V (in mm).
- Upper.leg.length: Femur length (in mm).
- Lower.leg.length: Tibia length (in mm).
- Foot.length..wrist.base.of.toe.IV.: Foot length (between ankle and base of toe IV) (in mm).
- X4th.toe.length..NOT.inc..webbing.: Length of toe IV, excluding any interdigital webbing (in mm).
- Foot.width..perpendicular.base.toe.V.: Foot width, measured perpendicular to base of toe V (in mm).
- ref_table_dorsal.csv contains the metadata for the individuals for which geometric morphometric data for dorsal head shape are available. First column (missing header) read by R as row names:
- Photo: Path, on the local computer used for data collection, to the photograph from which the data were obtained.
- Voucher: Collection and voucher of each specimen.
- Species: Taxon to which each specimen was assigned (V. t. tristis or V. t. orientalis).
- Sex: Female (F) or male (M).
- Mature: Indicates whether the individual was sexually mature at the time of collection.
- ref_table_lateral.csv contains the metadata for the individuals for which geometric morphometric data for lateral head shape are available. First column (missing header) read by R as row names:
- Photo: Path, on the local computer used for data collection, to the photograph from which the data were obtained.
- Voucher: Collection and voucher of each specimen.
- Species: Taxon to which each specimen was assigned (V. t. tristis or V. t. orientalis).
- Sex: Female (F) or male (M).
- Mature: Indicates whether the individual was sexually mature at the time of collection.
- Report_DVara19-4371_SNP_1.csv is the file containing the DArTseq data (two-row format; for each locus, Reference row first, SNP row second) as delivered by DArT, including outgroups. Each allele is scored in a binary fashion ("1" = presence, "0" = absence, "-" = null allele). Heterozygotes are therefore scored as 1/1 (presence for both alleles, i.e., in both rows). The first 18 columns contain the metadata for each locus; asterisks are used to fill space and have no other meaning. The following columns contain the data for each sample; the seven header rows indicate the order number to which the sample belongs, DArT plate barcode, client plate barcode, well row position, well column position, sample comments, and genotype name, respectively. The first 18 columns contain the following information:
- AlleleID: Unique identifier for the sequence in which the SNP marker occurs.
- AlleleSequence: The sequence of the Reference allele is in the Reference row, the sequence of the SNP allele in the SNP row.
- AvgCountRef: The sum of the tag read counts for all samples, divided by the number of samples with non-zero tag read counts, for the Reference allele row.
- AvgCountSnp: The sum of the tag read counts for all samples, divided by the number of samples with non-zero tag read counts, for the SNP allele row.
- AvgPIC: The average of the polymorphism information content (PIC) of the Reference and SNP allele rows.
- CallRate: The proportion of samples for which the genotype call is either "1" or "0", rather than "-".
- CloneID: Unique identifier for the sequence in which the SNP marker occurs.
- FreqHets: The proportion of samples which score as heterozygous.
- FreqHomRef: The proportion of samples which score as homozygous for the Reference allele.
- FreqHomSnp: The proportion of samples which score as homozygous for the SNP allele.
- OneRatioRef: The proportion of samples for which the genotype score is "1", in the Reference allele row.
- OneRatioSnp: The proportion of samples for which the genotype score is "1", in the SNP allele row.
- PICRef: The polymorphism information content (PIC) for the Reference allele row.
- PICSnp: The polymorphism information content (PIC) for the SNP allele row.
- RepAvg: The proportion of technical replicate assay pairs for which the marker score is consistent.
- SNP: This column is blank in the Reference row, and contains the base position and base variant details in the SNP row.
- SnpPosition: The position (zero indexed) in the sequence tag at which the defined SNP variant base occurs.
- TrimmedSequence: Same as the full sequence, but with adapters removed in short marker tags.
- Report_DVara19-4371_SNP_2.csv is the file containing the DArTseq data (two-row format; for each locus, Reference row first, SNP row second) as delivered by DArT, excluding outgroups. Each allele is scored in a binary fashion ("1" = presence, "0" = absence, "-" = null allele). Heterozygotes are therefore scored as 1/1 (presence for both alleles, i.e., in both rows). The first 26 columns contain the metadata for each locus; asterisks are used to fill space and have no other meaning. The following columns contain the data for each sample; the seven header rows indicate the order number to which the sample belongs, DArT plate barcode, client plate barcode, well row position, well column position, sample comments, and genotype name, respectively. The first 26 columns contain the following information:
- AlleleID: Unique identifier for the sequence in which the SNP marker occurs.
- AlleleSequence: The sequence of the Reference allele is in the Reference row, the sequence of the SNP allele in the SNP row.
- AvgCountRef: The sum of the tag read counts for all samples, divided by the number of samples with non-zero tag read counts, for the Reference allele row.
- AvgCountSnp: The sum of the tag read counts for all samples, divided by the number of samples with non-zero tag read counts, for the SNP allele row.
- AvgPIC: The average of the polymorphism information content (PIC) of the Reference and SNP allele rows.
- CallRate: The proportion of samples for which the genotype call is either "1" or "0", rather than "-".
- CloneID: Unique identifier for the sequence in which the SNP marker occurs.
- FreqHets: The proportion of samples which score as heterozygous.
- FreqHomRef: The proportion of samples which score as homozygous for the Reference allele.
- FreqHomSnp: The proportion of samples which score as homozygous for the SNP allele.
- OneRatioRef: The proportion of samples for which the genotype score is "1", in the Reference allele row.
- OneRatioSnp: The proportion of samples for which the genotype score is "1", in the SNP allele row.
- PICRef: The polymorphism information content (PIC) for the Reference allele row.
- PICSnp: The polymorphism information content (PIC) for the SNP allele row.
- RepAvg: The proportion of technical replicate assay pairs for which the marker score is consistent.
- SNP: This column is blank in the Reference row, and contains the base position and base variant details in the SNP row.
- SnpPosition: The position (zero indexed) in the sequence tag at which the defined SNP variant base occurs.
- TrimmedSequence: Same as the full sequence, but with adapters removed in short marker tags.
- AlnCnt_[...]: Total count of aligning markers / tags with selection criteria described below.
- AlnEvalue_[...]: E value of the best alignment to an existing model genome.
- ChromPos_[...]: Position(s) on contig(s) with the best alignment of marker / tag to an existing model genome.
- Chrom_[...]: Contig(s) with the best alignment of marker / tag to an existing model genome.
- sampling_dart.csv is the metadata for the individuals that were sequenced with DArTseq:
- Voucher: Collection and voucher of each specimen.
- Lat: Latitude in decimal degrees.
- Lon: Longitude in decimal degrees.
- tps_dorsal.txt contains the raw geometric morphometric data for dorsal head shape in tps format (for each specimen, the following is given: LM = number of landmarks/semi-landmarks, landmark coordinates, and ID = collection and voucher).
- tps_lateral.txt contains the raw geometric morphometric data for lateral head shape in tps format (for each specimen, the following is given: LM = number of landmarks/semi-landmarks, landmark coordinates, and ID = collection and voucher).
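The TPS layout described for tps_dorsal.txt and tps_lateral.txt (an `LM=` line, the coordinate lines, then an `ID=` line) can be parsed with a few lines of base R. The reader below is a hypothetical sketch that assumes exactly that layout, with the `ID=` line directly after the coordinates. It is shown on toy input rather than on the real files. `geomorph::readland.tps` is the more general tool and is likely what a full workflow would use.

```r
# Hypothetical base-R reader for a simple TPS layout:
# LM=<n>, then n lines of "x y", then ID=<label>.
read_tps_simple <- function(lines) {
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]
  starts <- grep("^LM=", lines, ignore.case = TRUE)
  specimens <- list()
  for (s in starts) {
    n <- as.integer(sub("^LM=", "", lines[s], ignore.case = TRUE))
    # Split each coordinate line on whitespace into X and Y
    xy <- do.call(rbind, strsplit(lines[s + seq_len(n)], "[[:space:]]+"))
    coords <- matrix(as.numeric(xy), ncol = 2,
                     dimnames = list(NULL, c("X", "Y")))
    # Assumes the ID line directly follows the coordinates
    id <- sub("^ID=", "", lines[s + n + 1], ignore.case = TRUE)
    specimens[[id]] <- coords
  }
  specimens
}

# Toy input with two specimens of three landmarks each (made-up values)
toy <- c("LM=3", "1.0 2.0", "3.5 4.5", "5.0 6.0", "ID=TOY_001",
         "LM=3", "1.1 2.1", "3.4 4.6", "5.2 6.1", "ID=TOY_002")
shapes <- read_tps_simple(toy)
print(names(shapes))
print(shapes[["TOY_001"]])
```

On the real data, the same function could be called as `read_tps_simple(readLines(path))`, where `path` points to wherever tps_dorsal.txt or tps_lateral.txt was unzipped.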
Code/software
Analyses were run in R 4.2.2. The following packages are required:
- dartR 2.7.2
- LEA 3.10.2
- maps 3.4.1
- conStruct 1.0.5
- randomForest 4.7.1.1
- factoextra 1.0.7
- cluster 2.1.4
- mclust 6.1
- MASS 7.3.58.1
- adegenet 2.1.10
- kohonen 3.0.12
- viridis 0.6.2
- missForest 1.5
- geomorph 4.0.5
- GroupStruct 0.1.0
- tidyr 1.3.0
- chopper 0
- gdata 2.18.0.1
- sp 1.6.0
- stringr 1.5.0
- genepop 1.2.2
- ape 5.7.1
- gdm 1.5.0.9.1
- mapproj 1.2.11
- raster 3.6.14
- ggthemes 4.2.4
- dplyr 1.1.1
- brms 2.22.0
- pgirmess 2.0.3
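Before running Script.R, it may help to compare the locally installed package versions against the list above. The helper below is a hypothetical convenience, not part of the repository. It covers only a subset of the packages, which can be extended by adding entries to `required`.

```r
# Subset of the packages and versions listed above (extend as needed)
required <- c(dartR = "2.7.2", LEA = "3.10.2", kohonen = "3.0.12",
              geomorph = "4.0.5", brms = "2.22.0")

# Hypothetical helper: report whether each package is missing, or older,
# the same as, or newer than the listed version.
check_versions <- function(required) {
  installed <- installed.packages()[, "Version"]
  status <- vapply(names(required), function(pkg) {
    if (!pkg %in% names(installed)) return("missing")
    cmp <- compareVersion(installed[[pkg]], required[[pkg]])
    c("older", "same", "newer")[cmp + 2]   # cmp is -1, 0, or 1
  }, character(1))
  data.frame(package = names(required), listed = unname(required),
             status = unname(status), row.names = NULL)
}

report <- check_versions(required)
print(report)
```

A status of "newer" does not necessarily mean the analyses will fail, but exact replication is most likely with the versions listed.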
