Validating a target-enrichment design for capturing uniparental haplotypes in ancient domesticated animals
Data files
Apr 09, 2025 version, files: 633.28 KB
- README.md (4.26 KB)
- Supplementary_File_X.zip (201.98 KB)
- Supplementary_File_Y.zip (427.04 KB)
Abstract
In the last three decades, DNA sequencing of ancient animal osteological assemblages has become an important tool complementing standard archaeozoological approaches to reconstruct the history of animal domestication. However, osteological assemblages from key archaeological contexts are not always available, or do not necessarily preserve sufficient amounts of ancient DNA for a cost-effective genetic analysis. Here, we develop an in-solution target-enrichment approach based on 80-mer species-specific RNA probes (ranging from 306 to 1,686 per species) to characterize, in single experiments, the mitochondrial genetic variation of eight domesticated animal species of major economic interest: cattle, chickens, dogs, donkeys, goats, horses, pigs, and sheep. We also illustrate how our design can be adapted to enrich DNA library content and map the Y-chromosomal diversity within Equus caballus. By applying our target-enrichment assay to an extensive panel of ancient osteological remains, farm soil, and cave sediments spanning the last 43 kyrs, we demonstrate that minimal sequencing efforts are necessary to exhaust the DNA library complexity and to characterize mitogenomes to an average depth-of-coverage of 19.4- to 2,003.7-fold. Our assay further retrieved horse mitogenome and Y-chromosome data from Late Pleistocene coprolites, as well as bona fide mitochondrial sequences from species that were not part of the probe design, such as bison and cave hyena. Our methodology will prove especially useful to minimize costs related to the genetic analyses of maternal and paternal lineages of a wide range of domesticated and wild animal species, and for mapping changes in their diversity over space and time, including from environmental samples. Here we provide the FASTA files of the probes.
https://doi.org/10.5061/dryad.612jm64cr
Description of the data and file structure
We screened the literature to collect an extensive representation of full mitogenome sequences from a total of eight species of domesticated animals, namely cattle (Bos taurus), chickens (Gallus gallus), dogs (Canis familiaris), donkeys (Equus asinus), goats (Capra aegagrus hircus), horses (Equus caballus), pigs (Sus scrofa domesticus), and sheep (Ovis aries) (Table S2). We then added the mitochondrial haplotypes from three equid and three canid outgroups (Equus hemionus, N=1; E. africanus somaliensis, N=1, and E. hydruntinus, N=1; Canis latrans, N=5; Canis himalayensis, N=2, and C. lupus signatus, N=2, respectively) to the E. asinus and C. familiaris data sets. We also added to this dataset 476 full mitogenome sequences from hominins, including 433 Homo sapiens sapiens with a worldwide distribution, 34 Homo sapiens neanderthaliensis, seven Homo altaiensis, and two Homo heidelbergensis. We also included a total of 409 bear mitochondrial sequences. Independent multiple sequence alignments (MSA) were built for each of the species mentioned above using MAFFT v7.453 (Nakamura et al. 2018), and manually curated with AliView v1.28 (Larsson 2014) and Seaview v4.7 (Gouy, Guindon & Gascuel 2010). A maximum likelihood (ML) tree was then generated for each data set using IQ-Tree v2.0.3 (Minh et al. 2020), with 1,000 bootstrap pseudoreplicates (option -b 1000), and visualized in iTOL v6.7 (Letunic & Bork 2021). The resulting tree topologies were manually compared with published ones to confirm the haplogroup structure within each species group. MSAs were used to identify the biallelic Single Nucleotide Polymorphisms (SNPs) private to each individual haplogroup, which were used to prioritize probe design. In a first step, we defined a set of 90-bp-long primary probes centered on every SNP defining a haplogroup, considering as candidate probes all reported haplotypes starting 45 and 44 nucleotides prior to the underlying SNP position.
This first step was necessary to account for the possible presence of deletions in specific MSA regions. DNA variation consisting of gaps larger than three nucleotides and/or present only once was disregarded. In a second step, we defined a set of 80-mer candidate probes by further trimming each individual 90-mer symmetrically from both ends. Probes with a GC content within the 0.33 to 0.6 range were retained. Where two or more probes were available for the target SNP, the one showing the highest entropy value, indicative of greater information content and hence sequence complexity, was preferentially selected (otherwise, the selection was random), delivering the final set of 80-mer probes (Fig. 1). Our procedure resulted in a final set of 306 to 1,686 probes per species group (Table S2). The sites overlapping with probes in each MSA were sub-selected for ML tree reconstruction in IQ-Tree, following the same procedure as above, to confirm their capacity to accurately identify the entire range of DNA variation. A total of 18,780 probes targeting Y-chromosomal variation in horses were designed by applying the same procedure to an MSA containing 403 horse Y-chromosomal sequences (272 ancient and 131 modern; Fages et al. 2019, Librado et al. 2021, 2024), with the exception that 60-mers were considered instead of 80-mers. The mitochondrial and Y-chromosomal probe sequences were then sent to Arbor Biosciences, Ann Arbor, USA, for production as two independent sets of RNA baits. These sets are provided as Supplementary File X and Supplementary File Y.
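The second step above (symmetric trimming, GC filtering, entropy-based choice) can be illustrated with a minimal Python sketch. This is not the authors' code. The function names are hypothetical, and the entropy measure is assumed to be the Shannon entropy of single-nucleotide frequencies, since the description does not state which entropy definition was used. Ties are broken at random here, which is one possible reading of "otherwise, the selection was random".

```python
import math
import random
from collections import Counter


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 to 1.0)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def shannon_entropy(seq):
    """Shannon entropy (bits) of single-nucleotide frequencies.

    Assumption: the source only says "entropy value"; this is one
    common definition, not necessarily the one used by the authors.
    """
    counts = Counter(seq.upper())
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def trim_symmetric(seq, target_len=80):
    """Trim the same number of bases from both ends to reach target_len."""
    excess = len(seq) - target_len
    if excess < 0 or excess % 2:
        raise ValueError("cannot trim symmetrically to the target length")
    cut = excess // 2
    return seq[cut:len(seq) - cut]


def select_probe(candidates, gc_min=0.33, gc_max=0.6, target_len=80, rng=None):
    """Pick one probe for a target SNP from a list of primary candidates.

    Candidates are trimmed, filtered on GC content, and the one with the
    highest entropy is returned; ties are resolved at random.
    Returns None if no candidate passes the GC filter.
    """
    rng = rng or random.Random()
    trimmed = [trim_symmetric(s, target_len) for s in candidates]
    kept = [s for s in trimmed if gc_min <= gc_fraction(s) <= gc_max]
    if not kept:
        return None
    best = max(shannon_entropy(s) for s in kept)
    return rng.choice([s for s in kept if shannon_entropy(s) == best])
```

For the Y-chromosomal set, the same sketch would be called with `target_len=60`; the length of the corresponding primary probes is not stated in the description, so it would need to be set accordingly.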
Files and variables
File: Supplementary_File_Y.zip
Description: This file contains FASTA sequences of the Y-chromosomal probes.
File: Supplementary_File_X.zip
Description: This file contains FASTA sequences of the mitochondrial probes. The species name is written in the FASTA header of each probe sequence, and the number denotes the position of the probe's target SNP relative to the reference sequence.
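As a rough illustration of how the probe files could be inspected, the Python sketch below reads FASTA records and tallies probes per species. The exact header layout is not documented here, so the pattern used (a species name followed by a separator and a trailing integer position, e.g. `Bos_taurus_1234`) is an assumption that should be checked against the actual files. The function names are hypothetical.

```python
import re
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Assumed (hypothetical) header layout: <species><separator><SNP position>
HEADER_RE = re.compile(r"^(?P<species>.+?)[_\s|:-]+(?P<pos>\d+)$")


def parse_header(header):
    """Split a header into (species, position); position is None if absent."""
    match = HEADER_RE.match(header)
    if match is None:
        return header, None
    return match.group("species"), int(match.group("pos"))


def probes_per_species(lines):
    """Count probe records per species name parsed from the headers."""
    counts = Counter()
    for header, _seq in read_fasta(lines):
        species, _pos = parse_header(header)
        counts[species] += 1
    return counts
```

After unzipping an archive, a file could be passed in as `probes_per_species(open(path))`, where `path` points to one of the extracted FASTA files. The resulting counts can then be compared with the 306 to 1,686 probes per species group reported above.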