Data from: Comparing species tree estimation with large anchored phylogenomic and small Sanger-sequenced molecular datasets: an empirical study on Malagasy pseudoxyrhophiine snakes
Ruane, Sara et al. (2015), Data from: Comparing species tree estimation with large anchored phylogenomic and small Sanger-sequenced molecular datasets: an empirical study on Malagasy pseudoxyrhophiine snakes, Dryad, Dataset, https://doi.org/10.5061/dryad.kp400
Background: Using molecular data generated by high throughput next generation sequencing (NGS) platforms to infer phylogeny is becoming common as costs go down and the ability to capture loci from across the genome goes up. While there is a general consensus that greater numbers of independent loci should result in more robust phylogenetic estimates, few studies have compared phylogenies resulting from smaller datasets for commonly used genetic markers with the large datasets captured using NGS. Here, we determine how a 5-locus Sanger dataset compares with a 377-locus anchored genomics dataset for understanding the evolutionary history of the pseudoxyrhophiine snake radiation centered in Madagascar. The Pseudoxyrhophiinae comprise ~86 % of Madagascar’s serpent diversity, yet they are poorly known with respect to ecology, behavior, and systematics. Using the 377-locus NGS dataset and the summary statistics species-tree methods STAR and MP-EST, we estimated a well-supported species tree that provides new insights concerning intergeneric relationships for the pseudoxyrhophiines. We also compared how these and other methods performed with respect to estimating tree topology using datasets with varying numbers of loci. Methods: Using Sanger sequencing and an anchored phylogenomics approach, we sequenced datasets comprised of 5 and 377 loci, respectively, for 23 pseudoxyrhophiine taxa. For each dataset, we estimated phylogenies using both gene-tree (concatenation) and species-tree (STAR, MP-EST) approaches. We determined the similarity of resulting tree topologies from the different datasets using Robinson-Foulds distances. In addition, we examined how subsets of these data performed compared to the complete Sanger and anchored datasets for phylogenetic accuracy using the same tree inference methodologies, as well as the program *BEAST to determine if a full coalescent model for species tree estimation could generate robust results with fewer loci compared to the summary statistics species tree approaches. We also examined the individual gene trees in comparison to the 377-locus species tree using the program MetaTree. Results: Using the full anchored dataset under a variety of methods gave us the same, well-supported phylogeny for pseudoxyrhophiines. The African pseudoxyrhophiine Duberria is the sister taxon to the Malagasy pseudoxyrhophiines genera, providing evidence for a monophyletic radiation in Madagascar. In addition, within Madagascar, the two major clades inferred correspond largely to the aglyphous and opisthoglyphous genera, suggesting that feeding specializations associated with tooth venom delivery may have played a major role in the early diversification of this radiation. The comparison of tree topologies from the concatenated and species-tree methods using different datasets indicated the 5-locus dataset cannot be used to infer a correct phylogeny for the pseudoxyrhophiines under any method tested here and that summary statistics methods require 50 or more loci to consistently recover the species-tree inferred using the complete anchored dataset. However, as few as 15 loci may infer the correct topology when using the full coalescent species tree method *BEAST. MetaTree analyses of each gene tree from the Sanger and anchored datasets found that none of the individual gene trees matched the 377-locus species tree, and that no gene trees were identical with respect to topology. Conclusions: Our results suggest that ≥50 loci may be necessary to confidently infer phylogenies when using summary species-tree methods, but that the coalescent-based method *BEAST consistently recovers the same topology using only 15 loci. These results reinforce that datasets with small numbers of markers may result in misleading topologies, and further, that the method of inference used to generate a phylogeny also has a major influence on the number of loci necessary to infer robust species trees.