Data from: Historical biogeography and genetic status of the enigmatic pig-nosed turtle (Carettochelys insculpta) within the Australo-Papuan region

Young, Matthew¹; Georges, Arthur¹

Research facility: Oeko Institut

Published Feb 27, 2025 on Dryad. https://doi.org/10.5061/dryad.qrfj6q5pb

Data files

Feb 27, 2025 version files 7.83 MB

CR_Nucleotide_alignment(1).nex – 77.87 KB
DIYABC_SNP_input.snp – 3.73 MB
ms_initial_read.R – 2.25 KB
ND4_Nucleotide_alignment(2).nex – 97.95 KB
README.md – 9 KB
SNP_dataset_initial.Rdata – 3.91 MB

Abstract

This dataset was used to examine the phylogeographic genetic structure of an iconic species, the pig-nosed turtle Carettochelys insculpta. The species is the last remaining member of a once globally widespread family and is now restricted to the Australo-Papuan region, which has a dynamic and complex geological and geographical history. The dataset comprises aligned mtDNA sequences (Nd4 and Control Region) and SNP data used for population genetics and phylogenetic reconstruction.

This Dryad entry contains the data files and an associated R script that prepares the data for analysis, including the analyses presented in the companion review article. The files include SNP datasets for the pig-nosed turtle and aligned mtDNA sequences.

Description of the data and file structure

SNP analyses

The turtle SNP data comprise a matrix of entities (individuals) versus attributes (loci), with each cell taking the state 0 for homozygous reference allele, 1 for heterozygous, or 2 for homozygous alternate allele. Missing data are scored as NA. The data are stored in compressed form as an adegenet genlight object with associated locus metadata (e.g. callrate, reproducibility) and individual metadata (e.g. latitude, longitude, population). The SNP scores can be examined in R with as.matrix(gl); the names of the locus metadata can be obtained with names(gl@other$loc.metrics); the names of the individual metadata can be obtained with names(gl@other$ind.metrics).
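As an illustration of the 0/1/2/NA encoding described above, the following base R sketch computes per-locus summaries from a small made-up matrix. The individual and locus names and all scores are invented for the example; with the real data, the matrix would come from as.matrix(gl).

```r
# Toy genotype matrix: rows are individuals, columns are loci.
# 0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate,
# NA = missing. All values are invented for illustration.
geno <- matrix(
  c(0, 1, 2, NA,
    0, 2, 2, 1,
    1, 1, NA, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ind1", "ind2", "ind3"),
                  c("loc1", "loc2", "loc3", "loc4"))
)

# Call rate per locus: proportion of individuals with a non-missing score.
call_rate <- colMeans(!is.na(geno))

# Alternate-allele frequency per locus: each score counts alternate alleles
# out of two, so the mean score divided by 2 gives the frequency.
alt_freq <- colMeans(geno, na.rm = TRUE) / 2

# Observed heterozygosity per locus: proportion of scored individuals
# with state 1.
obs_het <- colMeans(geno == 1, na.rm = TRUE)

print(round(call_rate, 3))
print(round(alt_freq, 3))
print(round(obs_het, 3))
```

Because the scores count alternate alleles, simple column means are enough for these summaries; the accessors in {adegenet} and the reports in dartR.base provide fuller equivalents.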

SNP_dataset_initial.Rdata – a disk copy of the genlight object containing the SNP scores for the turtle SNP genotypes. It can be read in R with readRDS("filename"). Refer to the accompanying ms_initial_read.R.

ms_initial_read.R – an R script to read the data held in SNP_dataset_initial.Rdata, provide some initial descriptive statistics and run a sample analysis.

To access the datasets (those with .Rdata extension), use my.genlight.object <- readRDS("path/filename") or dartR.base::gl.load("path/filename"). To interrogate the datasets after loading, use the accessors provided in the R package {adegenet}. Refer to the accompanying ms_initial_read.R.
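A minimal sketch of the loading step, assuming the file sits in the working directory. The helper load_snp_data() is hypothetical and is not part of ms_initial_read.R; nInd(), nLoc() and pop() are standard {adegenet} accessors.

```r
# Hypothetical helper (not part of ms_initial_read.R): read the genlight
# object only if the file is present, and return NULL otherwise.
load_snp_data <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  # The README reads this file with readRDS(), despite the .Rdata extension.
  readRDS(path)
}

gl <- load_snp_data("SNP_dataset_initial.Rdata")  # adjust the path as needed

# Inspect the object only if it loaded and adegenet is installed.
if (!is.null(gl) && requireNamespace("adegenet", quietly = TRUE)) {
  library(adegenet)
  print(nInd(gl))                      # number of individuals
  print(nLoc(gl))                      # number of loci
  print(table(pop(gl)))                # sample sizes per population
  print(names(gl@other$loc.metrics))   # locus metadata fields
  print(names(gl@other$ind.metrics))   # individual metadata fields
}
```

The guard on file.exists() is only a convenience for this sketch; the accompanying script remains the reference for how the authors read and summarise the data.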

DIYABC_SNP_input.snp – SNP data converted to the input format for DIYABC-RF (Collin et al. 2021). We first removed all but one of the SNPs that co-occurred on a single sequence tag, retaining one at random. We then removed all loci with missing data, and finally filtered on read depth (> 10x, in accordance with the recommendations of Collin et al. 2021).
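The three filtering steps can be sketched in base R on a toy locus table. This is an illustration of the logic only, not the code used to produce the file; the column names (snp, tag, n_missing, read_depth) and all values are hypothetical.

```r
set.seed(1)  # make the random choice of one SNP per tag repeatable

# Toy locus table; column names are hypothetical, not those in the Dryad files.
loci <- data.frame(
  snp        = paste0("snp", 1:8),
  tag        = c("t1", "t1", "t2", "t3", "t3", "t3", "t4", "t5"),
  n_missing  = c(0, 0, 0, 2, 0, 0, 0, 1),
  read_depth = c(15, 15, 9, 20, 20, 20, 30, 12)
)

# Step 1: keep one SNP per sequence tag, chosen at random.
# sample() on a single number samples from 1:n, so single-SNP tags are
# returned directly.
pick_one <- function(idx) if (length(idx) == 1) idx else sample(idx, 1)
keep_idx <- unlist(lapply(split(seq_len(nrow(loci)), loci$tag), pick_one))
step1 <- loci[sort(keep_idx), ]

# Step 2: drop loci with any missing genotypes.
step2 <- step1[step1$n_missing == 0, ]

# Step 3: keep loci with read depth above 10x.
step3 <- step2[step2$read_depth > 10, ]

print(step3)
```

The order matters: because the random choice within a tag happens first, a tag can be lost entirely if the retained SNP later fails the missing-data or read-depth filter.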

mtDNA analyses

The turtle mtDNA data comprise two fragments: an 883 bp fragment of the mtDNA NADH dehydrogenase subunit 4 (Nd4) with 69 bp of adjacent tRNAHis and 26 bp of adjacent tRNASer, and a 741 bp fragment of the mtDNA Control Region with 29 bp of tRNAPro.

ND4_Nucleotide_alignment.nex – the mtDNA sequences for Nd4, aligned and trimmed. Nexus format.

CR_Nucleotide_alignment.nex – the mtDNA sequences for Control Region, aligned and trimmed. Nexus format.

These files can be used to generate a phylogeny using appropriate methods, e.g. Maximum Likelihood.

Refer to GenBank for raw haplotype sequences: Nd4, Accession Numbers PP213549 - PP213645; Control Region, PP213646 - PP213742.

Sharing/Access information

The data and script can be used for any purpose, preferably with the Dryad citation provided.

Code/Software

The script used for the SNP analyses depends on the R package dartR.base, available on the Comprehensive R Archive Network (CRAN). dartR.base works with adegenet genlight objects such as those listed above.

Software for undertaking phylogenetic analysis on the SNPs includes SVDQuartets (Singular Value Decomposition Quartets, Chifman & Kubatko, 2014), available through PAUP* (Swofford, 2003).

Software for undertaking phylogenetic analysis on the mtDNA sequences includes IQ-TREE (Minh et al., 2020) and BEAST2 2.6.6 (Bouckaert et al., 2019).

References

Abreu-Grobois, F. A., Horrocks, J., Formia, A., Dutton, P., LeRoux, R., Vélez-Zuazo, X., Soares, L., & Meylan, P. (2006). New mtDNA Dloop primers which work for a variety of marine turtle species may increase the resolution of mixed stock analyses. In M. Frick, A. Panagopoulou, A. Rees, & F. K. Williams (Eds.), Book of Abstracts. Twenty Sixth Annual Symposium on Sea Turtle Biology and Conservation. International Sea Turtle Society, Athens, Greece. 376 pp. (p. 179). https://internationalseaturtlesociety.org/wp-content/uploads/2021/02/26-turtle.pdf#page=179

Bouckaert, R., Vaughan, T. G., Barido-Sottani, J., Duchêne, S., Fourment, M., Gavryushkina, A., Heled, J., Jones, G., Kühnert, D., De Maio, N., Matschiner, M., Mendes, F. K., Müller, N. F., Ogilvie, H. A., Du Plessis, L., Popinga, A., Rambaut, A., Rasmussen, D., Siveroni, I., … Drummond, A. J. (2019). BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS Computational Biology, 15, 1–28. https://doi.org/10.1371/journal.pcbi.1006650

Chifman, J., & Kubatko, L. (2014). Quartet inference from SNP data under the coalescent model. Bioinformatics, 30, 3317–3324. https://doi.org/10.1093/bioinformatics/btu530

Collin, F-D., Durif, G., Raynal, L., Gautier, M., Vitalis, R., Lombaert, E., Marin, J-M., & Estoup, A. (2021). Extending approximate Bayesian computation with supervised machine learning to infer demographic history from genetic polymorphisms using DIYABC Random Forest. Molecular Ecology Resources, 21, 2598–2613. https://doi.org/10.1111/1755-0998.13413

Do, C., Waples, R. S., Peel, D., Macbeth, G., Tillett, B. J., & Ovenden, J. R. (2014). NeEstimator v2: re‐implementation of software for the estimation of contemporary effective population size (Ne) from genetic data. Molecular Ecology Resources, 14, 209–214. https://doi.org/10.1111/1755-0998.12157

Georges, A., Gruber, B., Pauly, G., Adams, M., White, D., Young, M., Kilian, A., Zhang, X., Shaffer, H. B., & Unmack, P. J. (2018). Genome-wide SNP markers breathe new life into phylogeography and species delimitation for the problematic short-necked turtles (Chelidae: Emydura) of eastern Australia. Molecular Ecology, 27, 5195–5213. https://doi.org/10.1111/mec.14925

Gruber, B., Unmack, P. J., Berry, O. F., & Georges, A. (2018). dartR: An R package to facilitate analysis of SNP data generated from reduced representation genome sequencing. Molecular Ecology Resources, 18, 691–699. https://doi.org/10.1111/1755-0998.12745

Kilian, A., Wenzl, P., Huttner, E., Carling, J., Xia, L., Blois, H., Caig, V., Heller-Uszynska, K., Jaccoud, D., Hopper, C., Aschenbrenner-Kilian, M., Evers, M., Peng, K., Cayla, C., Hok, P., & Uszynski, G. (2012). Diversity Arrays Technology: a generic genome profiling technology on open platforms. In F. Pompanon & A. Bonin (Eds.), Data Production and Analysis in Population Genomics: Methods and Protocols (pp. 67–89). Humana Press. https://doi.org/10.1007/978-1-61779-870-2_5

Mijangos, J., Gruber, B., Berry, O., Pacioni, C., & Georges, A. (2022). dartR v2: an accessible genetic analysis platform for conservation, ecology, and agriculture. Methods in Ecology and Evolution, 13, 2150–2158. https://doi.org/10.1111/2041-210X.13918

Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., Von Haeseler, A., Lanfear, R., & Teeling, E. (2020). IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Molecular Biology and Evolution, 37, 1530–1534. https://doi.org/10.1093/molbev/msaa015

Nie, L.-W., Wang, L., Xiong, L., & Zhou, K. (2010). The mitochondrial genome complete sequence and organization of the Pig-nosed Turtle Carettochelys insculpta (Testudines, Carettochelyidae) and its phylogeny position in Testudines. Amphibia-Reptilia, 31, 541–551.

Ochoa, A., & Storey, J. D. (2021). Estimating FST and kinship for arbitrary population structures. PLoS Genetics, 17, e1009241. https://doi.org/10.1371/journal.pgen.1009241

Stuart, B. L., & Parham, J. F. (2004). Molecular phylogeny of the critically endangered Indochinese box turtle (Cuora galbinifrons). Molecular Phylogenetics and Evolution, 31, 164–177. https://doi.org/10.1016/S1055-7903(03)00258-6

Swofford, D. L. (2003). PAUP*: Phylogenetic Analysis Using Parsimony (*and Other Methods). Version 4. Sinauer Associates.

SNP analyses

Skin tissues and extracted DNA were provided to DArT for processing, sequencing and informative SNP marker identification using DArTseq™ (Kilian et al., 2012). DArT performed genome complexity reduction using double digestion of genomic DNA with two restriction endonucleases, PstI (5′-CTGCA|G-3′) and SphI (5′-GCATG|C-3′), followed by fragment-size selection and next-generation sequencing on an Illumina HiSeq2500 (CA, USA). Sequences were processed using proprietary DArT analytical pipelines (for full details refer to Georges et al., 2018). Initial filtering was based primarily on the average and variance of sequencing depth, average allele counts and call rate across samples. Approximately one third of samples were sequenced twice as technical replicates, and scoring consistency between replicates was used to identify high-quality SNP markers with low error rates. We applied further quality control filtering using the R package dartR 2.7.2 (Gruber et al., 2018; Mijangos et al., 2022). These filters removed loci with reproducibility across technical replicates < 99%, loci and individuals with > 5% missing data (call rate), loci with read depth < 8x or > 50x (to remove low-coverage SNPs and potential paralogs), and all but one of multiple SNPs per locus. Specimens with close kinship probabilities (≥ 0.23), assessed using the R package popkin (Ochoa & Storey, 2021), were removed from non-bottlenecked populations. We then filtered on linkage disequilibrium to account for the presence of monomorphic heterozygous loci from a possible gene duplication, resulting in a stringently filtered dataset of 16,002 SNPs.
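The threshold filters on reproducibility, call rate and read depth can be sketched in base R as follows. This is a toy illustration of the thresholds stated above, not the dartR workflow itself; the column names are hypothetical stand-ins for the fields held in gl@other$loc.metrics, and the values are invented.

```r
# Toy locus metrics; column names are hypothetical stand-ins for the
# fields stored in gl@other$loc.metrics.
metrics <- data.frame(
  locus           = paste0("loc", 1:6),
  reproducibility = c(1.00, 0.98, 1.00, 0.995, 1.00, 0.99),
  call_rate       = c(0.99, 0.99, 0.90, 0.97, 1.00, 0.95),
  read_depth      = c(12, 20, 25, 60, 7, 8)
)

keep <- with(metrics,
  reproducibility >= 0.99 &           # drop loci below 99% reproducibility
  call_rate >= 0.95 &                 # drop loci with > 5% missing data
  read_depth >= 8 & read_depth <= 50  # drop low-coverage loci and putative paralogs
)

filtered <- metrics[keep, ]
print(filtered$locus)
```

In this toy table only loc1 and loc6 pass all three thresholds; loc6 sits exactly on each boundary, which shows that the cut-offs are treated as inclusive here. The filters on individuals, secondaries, kinship and linkage disequilibrium are not covered by this sketch.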

mtDNA analyses

Nd4 sequence was amplified using primers CiND4-F (5′-CACGATGAGGCAACCAAATAGAAC-3′) and CiND4-R (5′-ATTACTTTTACTTGGAATTGCACCA-3′). Control Region sequence was amplified using primers CiCR-F (5′-CTCTATCCCCAAAGCACTGG-3′) and CiCR-R (5′-TTCTTGTATTTAGGGGTTT-3′). Primer CiND4-R was modified from the H-Leu primer developed by Stuart and Parham (2004), and primers CiCR-F and CiCR-R were modified from the LCM15382 and H950g primers developed by Abreu-Grobois et al. (2006). Each was modified to better match the Carettochelys insculpta mitochondrial genome sequence of Nie et al. (2010). Amplification was done in separate 25 µL reactions for each locus, using 50 ng of template DNA, 1 × MyTaq HS Red Mix (Bioline) and 0.4 µM each of forward and reverse primer. PCR was performed using an EPGradient Thermal Cycler (Eppendorf Mastercycler Pro S 6325). Cycling conditions for Nd4 were 95°C for 2 min, followed by 35 cycles of 20 s at 95°C, 20 s at 60°C and 20 s at 72°C, and a final extension of 1 min at 72°C. Cycling conditions for the Control Region fragment were the same as for Nd4 except that the annealing temperature was 50°C.

Amplified products were purified with ExoSAP-IT PCR Product Cleanup Reagent (Thermo Fisher Scientific, Melbourne); 5 µl of PCR product was combined with 2 µl ExoSAP-IT and incubated at 37°C for 15 min, followed by enzyme inactivation at 80°C for 15 min. Sequencing reactions consisted of 1 µl purified PCR product, 0.25 µl BigDye® v3.1 (Applied Biosystems, ThermoFisher Scientific, Melbourne), 1× sequencing buffer, 0.16 µM primer and ddH2O to a total volume of 20 µl. PCR cycling conditions were 96°C for 1 min, followed by 35 cycles of 96°C for 10 s, 50°C for 5 s and 60°C for 3 min. Sequencing reactions were purified using an ethanol/EDTA precipitation method (cms 081527, ThermoFisher Scientific, Melbourne). Sequencing was performed on an ABI 3730xl DNA Analyser at the ACRF Biomolecular Resource Facility within the John Curtin School of Medical Research, Australian National University.

Forward and reverse sequences were aligned, trimmed based on quality and manually edited in Geneious Prime 2020.2.2 (Biomatters, Auckland, New Zealand). Coding sequences were translated to amino acids in the same software to check for unexpected frame shifts or stop codons.