
Data from: A new lineage of Galapagos giant tortoises identified from museum samples

Cite this dataset

Jensen, Evelyn L. et al. (2022). Data from: A new lineage of Galapagos giant tortoises identified from museum samples [Dataset]. Dryad.


The Galapagos Archipelago is recognized as a natural laboratory for studying evolutionary processes. San Cristóbal was one of the first islands colonized by tortoises, which radiated from there across the archipelago to inhabit 10 islands. Here, we sequenced the mitochondrial control region from six historical giant tortoises from San Cristóbal (five long-deceased individuals found in a cave and one found alive during an expedition in 1906) and discovered that the five from the cave belong to a clade that is distinct among known Galapagos giant tortoises but closely related to the species from Española and Pinta Islands. The haplotype of the individual collected alive in 1906 is in the same clade as the haplotype in the contemporary population. To search for traces of a second lineage in the contemporary population on San Cristóbal, we examined the population closely by sequencing the mitochondrial control region for 129 individuals and genotyping 70 of these at both 21 microsatellite loci and >12 000 genome-wide single nucleotide polymorphisms (SNPs). Only a single mitochondrial haplotype was found, with no evidence to suggest substructure based on the nuclear markers. The dataset archived here consists of a VCF file of the SNPs genotyped through ddRAD and a structure file of the 21 microsatellites, each containing genotypes for the same 64 individuals.


Blood (0.1-1.9 mL) was collected from live individuals during expeditions to San Cristóbal Island in 1999, 2012 and 2016. Genomic DNA was extracted using a DNeasy Blood and Tissue Kit (Qiagen) following the manufacturer’s protocol for blood.

For 60 samples, double digest Restriction-site Associated DNA (ddRAD) libraries were created following Peterson et al. (2012), as described in Miller et al. (2018), and sequenced on two lanes of an Illumina HiSeq 2000 at the Yale Center for Genome Analysis. Sequences from these new libraries were combined with previously collected data for an additional 10 contemporary San Cristóbal individuals from Miller et al. (2018). These 70 individuals (10 sampled in 1999, 14 in 2006, and 46 in 2016) represent the geographic breadth of sampling locations across San Cristóbal Island.

The sequences were processed and aligned to the Chelonoidis abingdonii reference genome (assembly ASM359739v1, Quesada et al. 2019) using the PALEOMIX bam pipeline (version 1.2.14, Schubert et al. 2014). Reads with more than 4 mismatches to the reference were then removed using BAMTOOLS (version 2.5.1, Barnett et al. 2011). Variant detection and genotype calling were performed using BCFtools (Li et al. 2009) mpileup and call, excluding reads with a mapping quality score of less than 30, ignoring indels and outputting only variants.

The resulting VCF file was filtered using VCFtools (Danecek et al. 2011) to exclude repetitive regions of the genome, sites with a sequencing read depth greater than one standard deviation above the mean depth (mean = 21.5, SD = 19.0), and sites with >20% missing data. Only biallelic SNP loci with a minor allele count of at least three were retained. Following these steps, we assessed missingness per individual and removed six individuals with >50% missing data. We then re-filtered the original VCF file, restricted to the retained individuals, using the same steps. Finally, we applied a filter for Hardy-Weinberg Equilibrium (HWE) using the correction for false discovery rate described by Benjamini and Yekutieli (2001) (adjusted p-value = 0.004888) and thinned the loci to retain one SNP per 10 000 bp in order to reduce linkage among SNP loci.
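As a rough illustration of three of the filtering steps above (the depth cutoff, the Benjamini-Yekutieli correction, and the 10 000 bp thinning), the following Python sketch works on toy values. It is not the VCFtools workflow that was actually run: the function names, the scaffold names, and the greedy thinning rule are assumptions made for illustration, and the sketch does not attempt to reproduce the reported adjusted p-value of 0.004888.

```python
from math import fsum


def depth_cutoff(mean_depth, sd_depth):
    """Upper depth limit: one standard deviation above the mean depth."""
    return mean_depth + sd_depth


def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted p-values in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic-sum term that distinguishes BY from Benjamini-Hochberg.
    c_m = fsum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never decrease as the raw p-values increase.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m * c_m / rank)
        adjusted[idx] = running_min
    return adjusted


def thin_snps(sites, min_distance=10_000):
    """Greedy thinning (an assumed rule): keep a site only if it lies at
    least min_distance bp past the last site kept on the same scaffold."""
    kept = []
    last_kept = {}
    for scaffold, pos in sorted(sites):
        if scaffold not in last_kept or pos - last_kept[scaffold] >= min_distance:
            kept.append((scaffold, pos))
            last_kept[scaffold] = pos
    return kept


# Depth cutoff from the values reported above (mean = 21.5, SD = 19.0).
cutoff = depth_cutoff(21.5, 19.0)

# Hypothetical HWE p-values and SNP positions, for illustration only.
adjusted = benjamini_yekutieli([0.01, 0.04, 0.03])
thinned = thin_snps([("s1", 100), ("s1", 5000), ("s1", 10100),
                     ("s1", 10500), ("s2", 50)])
```

With the reported mean and standard deviation, sites with a depth above 40.5 would be excluded. A greedy left-to-right pass is only one way to obtain one SNP per 10 000 bp; other rules (for example, one SNP per fixed window) would retain a different subset.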

We genotyped the same 70 contemporary individuals included in the ddRAD dataset at 21 microsatellite loci, following the procedures described in Quinzin et al. (2019).

Sequencing of the ddRAD libraries yielded an average of 19.7 million reads per individual that aligned to the reference genome (range 5.6-87.9 million reads). These data were combined with those from the 10 individuals sequenced previously (Miller et al. 2018). The final filtered ddRAD dataset retained 64 individuals (six were dropped due to high levels of missing data), genotyped at 12 192 loci with a mean depth of 16.5x and 12% missing data. We omitted the same six individuals from the microsatellite analyses in order to have fully overlapping datasets of the same 64 individuals.
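The sample accounting and the per-individual missingness rule can be sketched in Python as follows. This is a minimal illustration, assuming genotypes are available as VCF-style GT strings in which `./.` marks a missing call. The sample names and genotypes are hypothetical and are not taken from the archived files.

```python
def missing_fraction(genotypes):
    """Fraction of missing calls in one individual's list of GT strings."""
    if not genotypes:
        return 0.0
    missing = sum(1 for gt in genotypes if gt in ("./.", ".|.", "."))
    return missing / len(genotypes)


def retained_individuals(calls, max_missing=0.5):
    """Keep individuals whose missing fraction does not exceed max_missing,
    mirroring the removal of individuals with >50% missing data."""
    return [name for name, gts in calls.items()
            if missing_fraction(gts) <= max_missing]


# Sample accounting as reported in the text.
new_libraries = 60          # individuals with new ddRAD libraries
previously_sequenced = 10   # individuals from Miller et al. (2018)
dropped = 6                 # individuals with >50% missing data
total = new_libraries + previously_sequenced
final = total - dropped

# Hypothetical genotype calls for three made-up individuals.
calls = {
    "A": ["0/0", "0/1", "./.", "1/1"],   # 25% missing, kept
    "B": ["./.", "./.", "./.", "0/1"],   # 75% missing, removed
    "C": ["./.", "./.", "0/0", "0/1"],   # exactly 50% missing, kept
}
kept = retained_individuals(calls)
```

The totals work out to 70 individuals before filtering and 64 after, matching the number of individuals in both archived files.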