Establishing species boundaries in Bornean geckos
Data files
Aug 13, 2024 version files 35.30 MB
Abstract
Species delimitation using mitochondrial DNA (mtDNA) remains an important and accessible approach for discovering and delimiting species. However, delimiting species with a single locus (e.g. DNA barcoding) is biased towards overestimating species diversity. The highly diverse gecko genus Cyrtodactylus is one such group where delimitation using mtDNA remains the paradigm. In this study, we use genomic data to test putative species boundaries established using mtDNA within three recognized species of Cyrtodactylus on the island of Borneo. We predict that multilocus genomic data will estimate fewer species than mtDNA, which could have important ramifications for the species diversity within the genus. We aim to 1) investigate the correspondence between species delimitations using mtDNA and genomic data; 2) infer species trees for each target species; and 3) quantify gene flow and identify migration patterns to assess population connectivity. We show mtDNA approaches overestimate species diversity compared to genomic methods, underscoring the value of using genomic data to reassess mtDNA-based species delimitations for taxa lacking clear species boundaries. We expect the number of recognized species within Cyrtodactylus to continue increasing, but when possible, genomic data should be included to inform more accurate species boundaries.
README: Establishing species boundaries in Bornean geckos
https://doi.org/10.5061/dryad.w0vt4b90w
Description of the data and file structure
Fieldwork was conducted in Borneo in the Malaysian state of Sarawak between 2014 and 2018 (Permits: NPW.907.4.4.(Jld.14)-79; (119)JHS/NCCD/600-7/2/107).
To delimit species using mtDNA, we assembled and aligned NADH dehydrogenase subunit 2 (ND2) data. We generated an assembly for Cyrtodactylus species in Borneo, comprising two evolutionarily distinct clades referred to herein as the small-bodied (Cyrtodactylus cavernicolus, C. miriensis, C. pubisulcus, C. hantu) and large-bodied (C. consobrinus, C. hutan, C. kapitensis, C. malayanus) clades. Sample sizes for the target species are as follows: C. miriensis = 19, C. pubisulcus = 20, and C. consobrinus = 9. We used the assembly to apply commonly used species delimitation methods (BPTP; MPTP; GMYC; ASAP) and calculate pairwise-distances (p-distance) using SPDEL, and estimated genealogies using BEAST2.
For genomic data, we prepared double digest restriction-site associated DNA sequencing (ddRADseq) libraries. Sample sizes for the target species are as follows: C. miriensis = 26; C. pubisulcus = 32; and C. consobrinus = 15. We double-digested each sample using the digestion enzymes SbfI and MspI in CutSmart Buffer (New England Biolabs) for 7 hours at 37 ºC For fragment purification, we used Sera-Mag SpeedBeads. We then ligated eight distinct barcodes to the cut sites of the fragmented DNA and subsequently size-selected (between 415 and 515 bp after accounting for adapter length) each library on a Blue Pippin Prep size fractionator (Sage Science). For the final library amplification, we used Phusion Hi-Fidelity DNA Polymerase and Illumina's indexed primers. We determine the concentration and size distribution of each indexed pool using an Agilent 2200 TapeStation. Lastly, we sent the quantified pools to QB3-Berkeley Genomics, Vincent J. Coates Genomics Sequencing Laboratory, UC Berkeley for qPCR to determine sequenceable library concentrations before multiplexing equimolar amounts of each pool for sequencing on two Illumina HiSeq 4000 lane (51-bp, single-end reads; 9 pools containing 8 samples). All sequences are uploaded to the Sequence Read Archive (SRA; BioProject ID: PRJNA1117503).
We produced separate assemblies for the small and large-bodied clades and used branching in IPYRAD to generate species-specific assemblies from their respective clades. We removed individuals from each assembly containing less than 500,000 raw reads. We applied a sequence similarity threshold of 85% to cluster reads within samples and loci between samples and removed consensus sequences with low coverage (< 6 reads), excessive undetermined or heterozygous sites (> 5%), too many alleles for a sample (> 2 for diploids), or an excess of shared heterozygosity among samples (paralog filter = 0.5). For clade-specific assemblies, we required approximately 60% of individuals to share any given locus. For population genetic analyses, we filtered each dataset using VCFTOOLS to only allow one SNP per locus (--thin 50) and filtered out variable sites present in less than 5% of individuals (--maf 0.05).
Files and variables
File: consobrinus_imap.txt
Description: Population assignments used as the input for running the multispecies coalescent with migration (MSC-M) model for Cyrtodactylus consobrinus. KUC = Kuching region; SER = Serian region.
File: consobrinus_merge.txt
Description: Necessary set up file to indicate prior information to run the MSC-M model for C. consobrinus.
File: miriensis_imap.txt
Description: Population assignments used as the input for running the multispecies coalescent with migration (MSC-M) model for Cyrtodactylus miriensis. MIR = Miri region; LAW = Lawas region; MUL = Gunung Mulu region.
File: miriensis_merge.txt
Description: Necessary set up file to indicate prior information to run the MSC-M model for C. miriensis. The merge function is applied in the HHSD program to test whether two lineages should be considered one species, based on the genealogical diversity index (gdi). In this model, each lineage input is considered a distinct species, but the null is rejected if the two lineages are more accurately represented as a single lineage.
File: consobrinus_bpp.txt
Description: Phased ddRAD sequences used for to perform the MSC-M model for C. consobrinus.
File: miriensis_split.txt
Description: Necessary set up file to indicate prior information to run the MSC-M model for C. miriensis. The split function is applied in the HHSD program to test whether two lineages should be considered one species, based on the genealogical diversity index (gdi). In this model, all lineage are considered conspecific, but the null is rejected if the two lineages are more accurately represented as distinct lineages.
File: consobrinus_split.txt
Description: Necessary set up file to indicate prior information to run the MSC-M model for C. consobrinus. The split function is applied in the HHSD program to test whether two lineages should be considered one species, based on the genealogical diversity index (gdi). In this model, all lineage are considered conspecific, but the null is rejected if the two lineages are more accurately represented as distinct lineages.
File: pubisulcus_imap.txt
Description: Population assignments used as the input for running the multispecies coalescent with migration (MSC-M) model for Cyrtodactylus miriensis. KUC = Kuching region; BH = Borneo Highlands region.
File: pubisulcus_merge.txt
Description: Necessary set up file to indicate prior information to run the MSC-M model for C. pubisulcus. The merge function is applied in the HHSD program to test whether two lineages should be considered one species, based on the genealogical diversity index (gdi). In this model, each lineage input is considered a distinct species, but the null is rejected if the two lineages are more accurately represented as a single lineage.
File: pubisulcus_split.txt
Description: Necessary set up file to indicate prior information to run the MSC-M model for C. pubisulcus. The split function is applied in the HHSD program to test whether two lineages should be considered one species, based on the genealogical diversity index (gdi). In this model, all lineage are considered conspecific, but the null is rejected if the two lineages are more accurately represented as distinct lineages.
File: consobrinus.vcf
Description: Variant call format file used to conduct population genetic analyses (PCA, ADMIXTURE) for C. consobrinus.
File: miriensis_bpp.txt
Description: Phased ddRAD sequences used for to perform the MSC-M model for C. miriensis.
File: miriensis.vcf
Description: Variant call format file used to conduct population genetic analyses (PCA, ADMIXTURE) for* C. miriensis*.
File: mtDNA.mafft
Description: Alignment of the mitochondrial NADH2 gene, which was used to assemble gene trees. Both the large- and small-bodied samples were aligned together.
File: pubisulcus_bpp.txt
Description: Phased ddRAD sequences used for to perform the MSC-M model for C. pubisulcus.
File: SNAPP_Large.xml
Description: Input file used to obtain secondary fossil-calibration dates for the large-bodied dataset using biallelic SNP data. Dates were added to the .xml file using the snapp*-*prep-master script noted in the methods.
File: pubisulcus.vcf
Description: Variant call format file used to conduct population genetic analyses (PCA, ADMIXTURE) for* C. pubisulcus*.
File: Large_Body.vcf
Description: Variant call format file used to conduct population genetic analyses of all large-bodied species together. Analyses not shown in manuscript.
File: SNAPP_Small.xml
Description: Input file used to obtain secondary fossil-calibration dates for the small-bodied dataset using biallelic SNP data. Dates were added to the .xml file using the snapp*-*prep-master script noted in the methods.
File: Small_Body.vcf
Description: Variant call format file used to conduct population genetic analyses of all small-bodied species together. Analyses not shown in manuscript.
File: README.md
Description: Data collection and files used to conduct the study: "Establishing Species Boundaries in Bornean Geckos".
Code/software
To repeat the experiments in this study, the following programs are needed:
SPDel:
This program tests four distinct single-locus species delimitation models, and conducts pairwise-distance calculations. The pipeline can be found here: https://github.com/jolobito/SPdel.
ADMIXTURE v1.3:
This program is used to test for signatures of admixture among samples. The program can be found here: https://dalexander.github.io/admixture/download.html
HHSD:
HHSD is a wrapper script for conducting BPP and genealogical diversity index (gdi) estimation. The pipeline can be found here: https://github.com/abacus-gene/hhsd
BEAST2 v.2.7.4:
BEAST2 is a program for conducting Bayesian phylogenetic analyses, and allows for trees to be time calibrated. The program can be found here: https://www.beast2.org/
Access information
Other publicly accessible locations of the data:
- NCBI Sequence Read Archive (SRA): BioProject ID: PRJNA1117503
Methods
To generate genomic data, we conducted double digest restriction-site associated DNA sequencing (ddRADseq). We double-digested each sample using the digestion enzymes SbfI and MspI in CutSmart Buffer (New England Biolabs) for 7 hours at 37º C. For fragment purification, we used Sera-Mag SpeedBeads. We then ligated eight distinct barcodes to the cut sites of the fragmented DNA and subsequently size-selected (between 415 and 515 bp after accounting for adapter length) each library on a Blue Pippin Prep size fractionator (Sage Science). For the final library amplification, we used Phusion Hi-Fidelity DNA Polymerase and Illumina's indexed primers. We determine the concentration and size distribution of each indexed pool using an Agilent 2200 TapeStation. Lastly, we sent the quantified pools to QB3-Berkeley Genomics, UC Berkeley for qPCR to determine sequenceable library concentrations before multiplexing equimolar amounts of each pool for sequencing on two Illumina HiSeq 4000 lane (51-bp, single-end reads; 9 pools containing 8 samples). We combined these data with publically available ddRADseq data.
We removed individuals from each assembly containing less than 500,000 raw reads. For all assemblies, we applied a sequence similarity threshold of 85% to cluster reads within samples and loci between samples. We removed consensus sequences with low coverage (< 6 reads), excessive undetermined or heterozygous sites (> 5%), too many alleles for a sample (> 2 for diploids), or an excess of shared heterozygosity among samples (paralog filter = 0.5). For clade-specific assemblies, we required approximately 60% of individuals to share any given locus. For population genetic analyses, we filtered each dataset using VCFtools to only allow one SNP per locus (--thin 50) and filtered out variable sites present in less than 5% of individuals (-- maf 0.05).
Using a reduced dataset for computational efficiency, we inferred time-calibrated species trees using SNAPP within the BEAST2 framework for both the small and large-bodied datasets. We applied secondary calibrations to the roots of each tree using the snapp_prep ruby script, which allows the generation of an .xml file with a molecular clock matschiner2022species. For the large-bodied dataset, we constrained the crown age of the tree to 18.83 mya with a normal distribution (sigma = 2), and for the small-bodied clade we constrained the crown age to 25.80 mya with a normal distribution (sigma = 2).To maximize the number of loci from the ddRADseq dataset, we generated separate assemblies for each Bornean clade (large- and small-bodied). For the time-calibrated species BEAST2 tree, we thinned the number of samples in our dataset to 32 small-bodied and 28 large-bodied individuals for computational efficiency.