Targeted sequence capture is becoming a widespread tool for generating large phylogenomic data sets to address difficult phylogenetic problems. However, this methodology often generates data sets in which increasing the number of taxa and loci increases amounts of missing data. Thus, a fundamental (but still unresolved) question is whether sampling should be designed to maximize sampling of taxa or genes, or to minimize the inclusion of missing data cells. Here, we explore this question for an ancient, rapid radiation of lizards, the pleurodont iguanians. Pleurodonts include many well-known clades (e.g., anoles, basilisks, iguanas, and spiny lizards) but relationships among families have proven difficult to resolve strongly and consistently using traditional sequencing approaches. We generated up to 4921 ultraconserved elements with sampling strategies including 16, 29, and 44 taxa, from 1179 to approximately 2.4 million characters per matrix and approximately 30% to 60% total missing data. We then compared mean branch support for interfamilial relationships under these 15 different sampling strategies for both concatenated (maximum likelihood) and species tree (NJst) approaches (after showing that mean branch support appears to be related to accuracy). We found that both approaches had the highest support when including loci with up to 50% missing taxa (matrices with ∼40–55% missing data overall). Thus, our results show that simply excluding all missing data may be highly problematic as the primary guiding principle for the inclusion or exclusion of taxa and genes. The optimal strategy was somewhat different for each approach, a pattern that has not been shown previously. For concatenated analyses, branch support was maximized when including many taxa (44) but fewer characters (1.1 million). For species-tree analyses, branch support was maximized with minimal taxon sampling (16) but many loci (4789 of 4921). 
We also show that the choice of these sampling strategies can be critically important for phylogenomic analyses, since some strategies lead to demonstrably incorrect inferences (using the same method) that have strong statistical support. Our preferred estimate provides strong support for most interfamilial relationships in this important but phylogenetically challenging group.
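The locus-filtering step described above (keeping only loci with up to a given proportion of missing taxa, then reporting total missing data for the resulting matrix) can be illustrated with a minimal sketch. This is not the pipeline used in the study. The data layout (one dictionary per locus mapping taxon name to aligned sequence), the function names, and the choice to count `-`, `?`, and `N` as missing cells are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical layout: one dict per locus, taxon name -> aligned sequence.
Locus = Dict[str, str]

# Assumption: gaps, '?' and 'N' all count as missing-data cells.
MISSING_CHARS = set("-?Nn")


def filter_loci(loci: List[Locus], n_taxa: int, max_missing_taxa: float) -> List[Locus]:
    """Keep loci whose fraction of absent taxa is at most max_missing_taxa."""
    kept = []
    for locus in loci:
        missing_fraction = 1.0 - len(locus) / n_taxa
        # Small tolerance so that e.g. exactly 50% missing passes a 0.5 cutoff.
        if missing_fraction <= max_missing_taxa + 1e-9:
            kept.append(locus)
    return kept


def total_missing_fraction(loci: List[Locus], taxa: List[str]) -> float:
    """Fraction of missing cells in the matrix formed by concatenating the loci."""
    total_cells = 0
    missing_cells = 0
    for locus in loci:
        if not locus:
            continue  # a locus with no sequences contributes no columns
        length = len(next(iter(locus.values())))
        for taxon in taxa:
            total_cells += length
            seq = locus.get(taxon)
            if seq is None:
                # Taxon absent from this locus: every cell in the row is missing.
                missing_cells += length
            else:
                missing_cells += sum(ch in MISSING_CHARS for ch in seq)
    return missing_cells / total_cells if total_cells else 0.0


# Toy example with four taxa and three loci.
taxa = ["A", "B", "C", "D"]
loci = [
    {"A": "ACGT", "B": "ACGT", "C": "AC-T", "D": "ACGT"},  # 0% missing taxa
    {"A": "ACG", "B": "ACG"},                              # 50% missing taxa
    {"A": "AC"},                                           # 75% missing taxa
]
kept = filter_loci(loci, n_taxa=len(taxa), max_missing_taxa=0.5)
overall = total_missing_fraction(kept, taxa)
```

In the toy example the third locus is dropped, and the two retained loci give a matrix of 28 cells of which 7 are missing. This illustrates why a per-locus threshold on missing taxa and the total missing data in the matrix are distinct quantities.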
Table S1
Voucher specimens from which tissues were obtained for molecular data. Standard abbreviations are as follows: Louisiana State University Museum of Natural Science, LSU; Yale Peabody Museum, YPM; California Academy of Sciences, CAS; University of Kansas, KU; Museum of Vertebrate Zoology, University of California Berkeley, MVZ; San Diego State University, SDSU; University of Michigan Museum of Zoology, UMMZ. Non-standard abbreviations include the following: APR (Achille P. Raselimanana field series); ATOL (Assembling the Tree of Life voucher series); BPN (Brice P. Noonan field series); JAC (Jonathan A. Campbell field series); JAM (Jimmy A. McGuire field series); JJK (Jason J. Kolbe field series); JPV (John Pablo Valladares field series); LJA (Luciano J. Avila field series); SDZoo (Oliver Ryder, San Diego Zoo); POE (Steve Poe field series); RAN (Ronald A. Nussbaum field series); REE (Richard E. Etheridge field series); RLB (Robert L. Bezy field series); SBH (S. Blair Hedges field series); SDZoo (San Diego Zoo Series); TJS (Thomas J. Sanger field series); TWR (Tod W. Reeder field series); WL (William Lamar); YPM (Yale Peabody Museum). For each newly sequenced sample, Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra) and GenBank numbers are provided.
Table_S1_28_July_2015.doc
Table S2
Oligonucleotides used for construction of 43 uniquely barcoded adaptors. Index sequences (i.e., barcodes) are bolded.
Table_S2_28_July_2015.doc
Figure S1
Histograms depicting variation across ultraconserved elements (UCEs) generated for this study. (A–C) The number of segregating sites, (D–F) length in base pairs, and (G–I) number of taxa are reported for all three sampling strategies (16 taxa each with more than 3000 UCEs, 29 taxa each with more than 2000 UCEs, and 44 taxa each with more than 120 UCEs), using the datasets that allowed up to 50% missing taxa per locus. These datasets contained 2,716, 4,319, and 4,789 UCE loci total (16 taxa, 29 taxa, and 44 taxa, respectively).
S1_uce_summary.pdf
Figures S2-S37
Phylogenetic trees from individual analyses performed in this study.
Figs_S2-S37.zip
RAxML_phylip_alignments
Zip archive containing the following alignments:
Alignment 1: 16 taxa 20% missing taxa dataset analyzed in RAxML.
Alignment 2: 16 taxa 30% missing taxa dataset analyzed in RAxML.
Alignment 3: 16 taxa 40% missing taxa dataset analyzed in RAxML.
Alignment 4: 16 taxa 50% missing taxa dataset analyzed in RAxML.
Alignment 5: 16 taxa 60% missing taxa dataset analyzed in RAxML.
Alignment 6: 29 taxa 20% missing taxa dataset analyzed in RAxML.
Alignment 7: 29 taxa 30% missing taxa dataset analyzed in RAxML.
Alignment 8: 29 taxa 40% missing taxa dataset analyzed in RAxML.
Alignment 9: 29 taxa 50% missing taxa dataset analyzed in RAxML.
Alignment 10: 29 taxa 60% missing taxa dataset analyzed in RAxML.
Alignment 11: 44 taxa 20% missing taxa dataset analyzed in RAxML.
Alignment 12: 44 taxa 30% missing taxa dataset analyzed in RAxML.
Alignment 13: 44 taxa 40% missing taxa dataset analyzed in RAxML.
Alignment 14: 44 taxa 50% missing taxa dataset analyzed in RAxML.
Alignment 15: 44 taxa 60% missing taxa dataset analyzed in RAxML.
Gene tree set 1
16 taxa 20% missing taxa dataset analyzed in NJst.
16_taxa_0.20_bootstrap_zip.zip
Gene tree set 2
16 taxa 30% missing taxa dataset analyzed in NJst.
16_taxa_0.30_bootstrap_zip.zip
Gene tree set 3
16 taxa 40% missing taxa dataset analyzed in NJst.
16_taxa_0.40_bootstrap_zip.zip
Gene tree set 4
16 taxa 50% missing taxa dataset analyzed in NJst and ASTRAL.
16_taxa_0.50_bootstrap_zip.zip
Gene tree set 5
16 taxa 60% missing taxa dataset analyzed in NJst.
16_taxa_0.60_bootstrap_zip.zip
Gene tree set 6
29 taxa 20% missing taxa dataset analyzed in NJst.
29_taxa_0.20_bootstrap.zip
Gene tree set 7
29 taxa 30% missing taxa dataset analyzed in NJst.
29_taxa_0.30_bootstrap.zip
Gene tree set 8
29 taxa 40% missing taxa dataset analyzed in NJst.
29_taxa_0.40_bootstrap.zip
Gene tree set 9
29 taxa 50% missing taxa dataset analyzed in NJst and ASTRAL.
29_taxa_0.50_bootstrap_zip.zip
Gene tree set 10
29 taxa 60% missing taxa dataset analyzed in NJst.
29_taxa_0.60_bootstrap_zip.zip
Gene tree set 11
44 taxa 20% missing taxa dataset analyzed in NJst.
44_taxa_0.20_bootstrap_zip.zip
Gene tree set 12
44 taxa 30% missing taxa dataset analyzed in NJst.
44_taxa_0.30_bootstrap.zip
Gene tree set 13
44 taxa 40% missing taxa dataset analyzed in NJst.
44_taxa_0.40_bootstrap.zip
Gene tree set 14
44 taxa 50% missing taxa dataset analyzed in NJst and ASTRAL.
44_taxa_0.50_bootstrap.zip
Gene tree set 15
44 taxa 60% missing taxa dataset analyzed in NJst.
44_taxa_0.60_bootstrap.zip
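The numbering of the 15 alignments and 15 gene tree sets listed above follows a regular grid: three taxon-sampling levels crossed with five missing-taxa thresholds, ordered by taxon count first. The short sketch below reproduces that numbering, which may be convenient when looking up a dataset by index. The function names are hypothetical and are not part of the deposited files.

```python
from itertools import product

# Values taken from the dataset descriptions in this listing.
TAXON_SETS = (16, 29, 44)
MISSING_TAXA_THRESHOLDS = (20, 30, 40, 50, 60)  # percent missing taxa allowed per locus


def dataset_grid():
    """Map dataset index (1-15) to (number of taxa, missing-taxa threshold in %)."""
    combos = product(TAXON_SETS, MISSING_TAXA_THRESHOLDS)
    return {index: combo for index, combo in enumerate(combos, start=1)}


def describe(index):
    """Return the description used in this listing for a given dataset index."""
    n_taxa, pct = dataset_grid()[index]
    return f"{n_taxa} taxa {pct}% missing taxa dataset"


grid = dataset_grid()
```

For example, `describe(9)` returns the 29-taxon, 50% missing taxa dataset, matching Alignment 9 and Gene tree set 9.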