Data from: A protocol for targeted enrichment of intron-containing sequence markers for recent radiations: a phylogenomic example from Heuchera (Saxifragaceae)
Folk, Ryan A., The Ohio State University
Mandel, Jennifer R., University of Memphis
Freudenstein, John V., The Ohio State University
Published Jun 09, 2016 on Dryad.
Cite this dataset
Folk, Ryan A.; Mandel, Jennifer R.; Freudenstein, John V. (2016). Data from: A protocol for targeted enrichment of intron-containing sequence markers for recent radiations: a phylogenomic example from Heuchera (Saxifragaceae) [Dataset]. Dryad. https://doi.org/10.5061/dryad.4cn66
Premise of the study: Phylogenetic inference is moving to large multilocus data sets, yet there remains uncertainty in the choice of marker and sequencing method at low taxonomic levels. To address this gap, we present a method for enriching long loci spanning intron-exon boundaries in the genus Heuchera.
Methods: Two hundred seventy-eight loci were designed using a splice-site prediction method combining transcriptomic and genomic data. Biotinylated probes were designed for enrichment of these loci. Reference-based assembly was performed using genomic references; additionally, chloroplast and mitochondrial genomes were used as references for off-target reads. The data were aligned and subjected to coalescent and concatenated phylogenetic analyses to demonstrate support for major relationships.
Results: Complete or nearly complete (>99%) sequences were assembled from essentially all loci from all taxa. Aligned introns showed a fourfold increase in divergence relative to exons. Concatenated analysis gave decisive support to all nodes, and support was also high and relationships mostly similar in the coalescent analysis. Organellar phylogenies were also well supported and conflicted with the nuclear signal.
Discussion: Our approach shows promise for resolving a recent radiation. Enrichment for introns is highly successful with little or no sequencing dropout at low taxonomic levels despite higher substitution and indel frequencies, and should be exploited in studies of species complexes.
Concatenated low-copy nuclear matrix
Concatenated data matrix of all 277 successfully enriched loci, assembled by BWA and aligned by MAFFT.
Concatenated nuclear matrix, exons only
Concatenated nuclear matrix, trimmed of introns.
Concatenated nuclear matrix, introns only
The concatenated nuclear matrix, with exonic regions deleted. The resultant matrix was shortened to match the exon-only alignment exactly in length, in order to fairly compare phylogenetic signal between the two regions.
Chloroplast genome matrix
Chloroplast genome alignment, assembled by BWA and aligned by MAFFT.
Mitochondrial genome matrix
Mitochondrial genome alignment, assembled by BWA and aligned by Mauve. Sequence labels refer to specimen voucher numbers.
Concatenated nuclear ML tree
Maximum likelihood tree inferred on the concatenated low-copy nuclear data using RAxML. Tree labels refer to specimen voucher numbers.
Chloroplast ML tree
Maximum likelihood tree inferred on the chloroplast data using RAxML. Tree labels refer to specimen voucher numbers.
Mitochondrial ML tree
Maximum likelihood tree inferred on the mitochondrial data using RAxML. Tree labels refer to specimen voucher numbers.
Annotated target loci
The 278 loci for which probes were developed, with exonic and intronic regions annotated, in GenBank flatfile format. The loci are numbered in descending order of length, so that "Locus 1" is the longest and "Locus 278" the shortest. One locus was not consistently enriched and was dropped from analyses; this is labeled "Locus 4".
Individual nuclear locus alignments
Individual locus alignments for the 277 successfully enriched loci. Labeling matches other files; hence loci are labeled by descending target length.
ML gene trees (inferred in RAxML) for each of the 277 enriched loci. The trees are not labeled by locus, but they are in precisely the same order as the other files (descending locus length), one tree per line. Because Locus 4 is again omitted, lines 1-3 contain the trees for Loci 1-3 and lines 4-277 contain the trees for Loci 5-278. Gene tree tip labels refer to specimen voucher numbers.
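The line-to-locus correspondence described above can be sketched as follows. This is a minimal illustration, not part of the deposited data: the function names are hypothetical, and it assumes the gene tree file holds exactly one Newick tree per line.

```python
DROPPED_LOCUS = 4     # per the description, Locus 4 was dropped from analyses
N_TARGET_LOCI = 278   # loci for which probes were developed


def locus_for_line(line_number: int) -> int:
    """Map a 1-based line number in the gene tree file to its locus number."""
    n_trees = N_TARGET_LOCI - 1  # 277 trees, one per enriched locus
    if not 1 <= line_number <= n_trees:
        raise ValueError(f"line number must be between 1 and {n_trees}")
    # Lines before the dropped locus map directly; later lines are shifted by one.
    return line_number if line_number < DROPPED_LOCUS else line_number + 1


def label_gene_trees(newick_lines):
    """Pair each non-empty Newick line with its locus number (hypothetical helper)."""
    trees = [line.strip() for line in newick_lines if line.strip()]
    return {locus_for_line(i): tree for i, tree in enumerate(trees, start=1)}
```

Passing an open file handle to `label_gene_trees` would give a dictionary keyed by locus number, which can then be matched against the individual locus alignments.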
Species tree inferred in MP-EST; the STAR tree was topologically identical. Branch labels are coalescent branch lengths. Given the lack of infraspecific sampling, it is impossible to estimate tip branch lengths; coalescent programs generally plot these as the maximum possible value (here, 9), but these values should be ignored. Internal branch estimates are correct assuming gene tree discord is solely due to the coalescent. Tree labels refer to specimen voucher numbers.
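Since the tip branch lengths carry no information, one option is to remove them before plotting or downstream use. The sketch below is an assumption-laden illustration rather than the authors' procedure: it assumes the tree is a plain Newick string with branch lengths written after colons and with unquoted tip labels, and the function name is hypothetical.

```python
import re

# A tip label directly follows "(" or ","; internal-branch lengths follow ")"
# and are therefore left untouched by this pattern.
_TIP_LENGTH = re.compile(r"(?<=[(,])([^(),:;]+):[0-9.eE+-]+")


def strip_tip_lengths(newick: str) -> str:
    """Remove branch lengths from terminal branches only (hypothetical helper)."""
    return _TIP_LENGTH.sub(r"\1", newick)
```

Internal branch lengths are kept as they are, since those are the estimates the description treats as meaningful.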