Data from: Exon capture museomics deciphers the nine-banded armadillo species complex and identifies a new species endemic to the Guiana Shield
Data files
Jun 14, 2024 version files 200.56 MB
-
Abba_shotgun.fasta
1.67 KB
-
Abba_shotgun.fasta.treefile
329 B
-
Astral_consensus_tree_TESTNEW.txt
2.13 KB
-
Barthe_et_al_SystBiol_Supplementary.pdf
7.05 MB
-
Concatenated_mitochondrial_genes_partition.txt.txt
1.11 KB
-
Concatenated_mitochondrial_genes_TESTNEW.treefile
3.63 KB
-
Concatenated_mitochondrial_genes.fasta
1.13 MB
-
Concatenated_nuclear_loci_partition.txt
53.31 KB
-
Concatenated_nuclear_loci_TESTNEW.treefile
2.47 KB
-
Concatenated_nuclear_loci.fasta
31.79 MB
-
Diploid_837_nuclear_loci.fasta
63.99 MB
-
Dloop_alignment.fasta
98.38 KB
-
Dloop_alignment.fasta.treefile
10.38 KB
-
Final_bait_set.fasta
2.15 MB
-
Input_for_bpp.phy
31.38 MB
-
Input_for_PCA.fasta
59.81 MB
-
Lineages_for_BPP
841 B
-
Location_loci_targeted.bed
35.94 KB
-
Nuclear_gene_trees.treefile
1.87 MB
-
Output_PCA.txt
412.04 KB
-
PTPh_Support_Partition.txt
934 B
-
PTPh_tree_partition.png
764.09 KB
-
README.md
4.40 KB
-
Ultrametric_tree_concatenated_nuclear_loci.treefile
3.64 KB
Jul 20, 2024 version files 2.15 MB
-
Final_bait_set.fasta
2.15 MB
-
README.md
1.18 KB
Abstract
The nine-banded armadillo (Dasypus novemcinctus) is the most widespread xenarthran species across the Americas. Recent studies have suggested it is composed of four morphologically and genetically distinct lineages of uncertain taxonomic status. To address this issue, we used a museomic approach to sequence 80 complete mitogenomes and capture 997 nuclear loci for 71 Dasypus individuals sampled across the entire distribution. We carefully cleaned up potential genotyping errors and cross contaminations that could blur species boundaries by mimicking gene flow. Our results unambiguously support four distinct lineages within the D. novemcinctus complex. We found cases of mito-nuclear phylogenetic discordance but only limited contemporary gene flow confined to the margins of the lineage distributions. All available evidence including the restricted gene flow, phylogenetic reconstructions based on both mitogenomes and nuclear loci, and phylogenetic delimitation methods consistently supported the four lineages within D. novemcinctus as four distinct species. Comparable genetic differentiation values to other recognized Dasypus species further reinforced their status as valid species. Considering congruent morphological results from previous studies, we provide an integrative taxonomic view to recognise four species within the D. novemcinctus complex: D. novemcinctus, D. fenestratus, D. mexicanus, and D. guianensis sp. nov., a new species endemic of the Guiana Shield that we describe here. The two available individuals of D. mazzai and D. sabanicola were consistently nested within D. novemcinctus lineage and their status remains to be assessed. The present work offers a case study illustrating the power of museomics to reveal cryptic species diversity within a widely distributed and emblematic species of mammals.
https://doi.org/10.5061/dryad.95x69p8sz
Description of the data and file structure
● Supplementary Material
- Barthe_et_al_SystBiol_Supplementary.pdf: Supplementary Text, Figures and Tables
● Supplementary Data
- Final_bait_set.fasta: Sequences of the 16,146 probes used to capture the 997 nuclear loci (exons and flanking regions).
Sharing/Access information
Links to other publicly accessible locations of the data:
● Zenodo public online repository (https://zenodo.org/records/11283950).
● Digital 3D models of the Dasypus guianensis holotype specimen are freely available through MorphoMuseuM (https://morphomuseum.com/specimenfiles/send-file-specimenfile/1200/4a65cc83; https://morphomuseum.com/specimenfiles/send-file-specimenfile/1201/7ed725a9).
Biological sampling
A total of 80 armadillo tissue samples were obtained over the years for this study through multiple sources (Table S1). Our sampling notably includes 46 individuals sampled in the form of dried skin pieces. Among these, 38 were obtained from museum specimens collected between 1894 and 2000 and stored in 12 different collections. The other eight were received in the form of dried ear biopsies not associated with museum vouchers. The remaining samples were fresh tissue biopsies stored in 95% ethanol. In accordance with the policy of sharing benefits and advantages (APA TREL1916196S/224), biological material from French Guiana has been deposited in the JAGUARS collection supported by the Collectivité Territoriale de Guyane and the Direction Générale des Territoires et de la Mer de la Guyane.
DNA extractions and sequencing
Total genomic DNA extractions from fresh tissue biopsies preserved in 95% ethanol were performed using the DNeasy Blood & Tissue Kit (QIAGEN) according to the manufacturer’s instructions. Museum dried skin samples were processed sequentially under a dedicated UV hood, cleaned between samples to minimize cross-contamination. DNA extractions were then performed in batches of 12 samples including an extraction blank, using the same DNeasy Blood & Tissue Kit under a UV hood in a dedicated clean room. The only minor modification was to reduce the final elution volume to 70 μl instead of 100 μl. Illumina libraries were constructed from DNA extracts according to the cost-effective version of the Meyer and Kircher (2010) protocol proposed by Tilak et al. (2015). Mitogenomes were obtained through shotgun Illumina sequencing, using single-end 100 bp reads for the fresh samples (sequenced at GATC Biotech, Konstanz, Germany) and paired-end 150 bp reads for the museum specimens (sequenced by Daicel Arbor Biosciences, Ann Arbor, MI, USA). To obtain complementary nuclear data, single-copy orthologous exons shared by the nine-banded armadillo (D. novemcinctus) and the Hoffmann’s two-toed sloth (Choloepus hoffmanni) were selected from the OrthoMam v8 database (Douzery et al. 2014) based on their size, that was set to be around 200 bp, until a total of 1000 exons was reached. The sequences of these 1000 exons plus 400 bp of flanking introns (200 bp on each side) were then extracted from the Dasnov3.1 nine-banded armadillo genome assembly (GCF_000208655.2) and used to design RNA capture probes resulting in a final set of 16,146 probes targeting 997 nuclear loci of about 600 bp length each. We verified that these 997 loci provide a representative sampling of the nine-banded armadillo genome by locating them on the latest chromosome-scale reference assembly (GCF_030445035.1) (Fig. S1). Probe design and synthesis, library preparation on DNA extracts from museum specimens, capture reactions on previously constructed libraries, and Illumina sequencing of the 997 nuclear loci were outsourced to Daicel Arbor Biosciences.
Dataset assembly
Mitochondrial dataset. The raw reads obtained by shotgun sequencing were cleaned with FastP v0.21.0 (Chen et al. 2018). The reference mitogenome of Dasypus novemcinctus (NC_001821.1; Arnason et al. 1997) was used to map the reads of each individual using bwa mem v.0.7.17 (Li 2013) with default parameters. Samtools v1.9 (Li et al. 2009) and Picard v2.25.5 (Picard Toolkit 2019) were respectively used to convert the mapping files and to order and index reads according to their position on the reference genome. MarkDuplicates v2.25.5 (Picard Toolkit 2019) was used to mark duplicate reads. Samtools depth v1.9 (Li et al. 2009) was used to estimate depth coverage (Fig. S2). Variant calling for haploid data (--ploidy 1) was then performed with Freebayes v1.3.1 (Garrison and Marth 2012). Finally, the vcf file was converted with bcftools consensus v1.14 (Danecek et al. 2021) using haploid parameters and filtering out positions with less than 5x depth coverage to obtain sequences in fasta format. Furthermore, we assembled two additional mitochondrial datasets of shorter fragments including samples from previous studies on 16S rRNA (Abba et al. 2018) and D-loop (Arteaga et al. 2020) to evaluate the consistency between these sequences and our new mitogenomes (Supplementary Material on Dryad, https://datadryad.org/XXXXXX).
Contamination exploration. We implemented a method inspired by Green et al. (2008, 2010) for detecting human contamination in Neanderthal genomes. To identify cross-contamination between lineages, we identified the mitochondrial diagnostic positions (DPs) for each of the four lineages of the species complex and D. pilosus using the apolist command of PAUP* v4.0a (Swofford 1998). A total of 122 DPs were identified for the Guianan lineage, 31 for the Southern lineage, 26 for the Central lineage, 28 for the Northern lineage, and 143 for D. pilosus. Then, for each individual, we recorded the proportion of the lineage DPs supported by at least three reads (Fig. S3). The frequency of reads supporting these positions (i.e., the number of reads with the diagnostic allele divided by the total number of reads mapping at the position) was also taken into account (Fig. S3). Proportions of DPs and read frequencies were estimated using bam-readcount (https://github.com/genome/bam-readcount) and the custom script Estimate_lineage_support.py (https://github.com/Mathilde-Barthe/Mitochondrial_support). The support for a specific lineage was then estimated as the product between the proportion of DPs supporting this lineage and their read frequency (Lineage support = proportion of DPs x read frequency; Fig. S3) and plotted using the custom script Plot_lineage_support.R (https://github.com/Mathilde-Barthe/Mitochondrial_support) and the ggplot2, dplyr, colorspace, cowplot, grid, gridExtra, ggpubr R packages (Auguie and Antonov 2017; R Core Team 2018; Villanueva and Chen 2019; Wilke 2019; Zeileis et al. 2019; Kassambara 2020; Wickham et al. 2023). DPs of D. pilosus were used as controls as laboratory experiments (DNA extractions and library preparations) for these two individuals were performed separately from those of the D. novemcinctus complex. Thus, their proportion within D. novemcinctus samples represents shared genotyping errors expected between individuals that did not cross-contaminated each other.
Nuclear dataset. Reads from exon capture sequencing were cleaned with FastP v0.21.0 (Chen et al. 2018). The 997 targeted sequences used to define the capture probes were used as references to map the reads of each individual using bwa mem v.0.7.17 (Li 2013) with default parameters for paired-end data. As for the mitogenomes, we used Samtools v1.9 (Li et al. 2009), Picard v2.25.5 (Picard Toolkit 2019) and MarkDuplicates v2.25.5 (Picard Toolkit 2019) to respectively convert the mapping files, order and index reads according to their position on the reference genome, and mark duplicate reads. Variant calling for diploid data was then performed with Freebayes v1.3.1 (Garrison and Marth 2012). Positions with a coverage under 10x and a quality score below 20 were considered ambiguous and called as 'N'. The vcf file was modified with a custom script to ensure that heterozygous positions were supported by reads at frequency 0.3 to 0.7, otherwise, the most frequent allele was called as a homozygous position. Depth of coverage was estimated with samtools depth v1.9 (Li et al. 2009). We excluded from the analysis 3,541 sequences (57 loci per individual on average) with more than 40% missing data (Table S2) and nine individuals with more than 55% average missing data across loci (see Table S1 for details). We also excluded one locus containing less than three individuals as this is the minimal number required to reconstruct a phylogenetic tree. Note however that only seven loci contained less than 10 individuals. Next, we used the “–hardy” option from vcftools to filter out 159 loci in which at least 75% of the individuals in a given nuclear lineage were heterozygous at a single position. Inbreeding coefficient (F) and heterozygosity (He) were estimated with the “–het” option from vcftools and a custom script, respectively. Using these cleaned vcf files, 25,742 variant positions out of a total of 506,355 nucleotide positions across the 837 retained loci were reconstructed in fasta format using the VCF2FastaFreebayes custom script (https://github.com/benoitnabholz/VCF2Fasta). Finally, bcftools consensus v1.14 was used to obtain the fasta sequence, with heterozygous sites encoded as IUPAC characters.