Restriction-site associated DNA (RAD) sequencing and related methods rely on the conservation of enzyme recognition sites to isolate homologous DNA fragments for sequencing, with the consequence that mutations disrupting these sites lead to missing information. There is thus a clear expectation for how missing data should be distributed, with fewer loci recovered between more distantly related samples. This observation has led to a related expectation: that RAD-seq data are insufficiently informative for resolving deeper scale phylogenetic relationships. Here we investigate the relationship between missing information among samples at the tips of a tree and information at edges within it. We re-analyze and review the distribution of missing data across ten RAD-seq data sets and carry out simulations to determine expected patterns of missing information. We also present new empirical results for the angiosperm clade Viburnum (Adoxaceae, with a crown age >50 Ma) for which we examine phylogenetic information at different depths in the tree and with varied sequencing effort. The total number of loci, the proportion that are shared, and phylogenetic informativeness varied dramatically across the examined RAD-seq data sets. Insufficient or uneven sequencing coverage accounted for similar proportions of missing data as dropout from mutation-disruption. Simulations reveal that mutation-disruption, which results in phylogenetically distributed missing data, can be distinguished from the more stochastic patterns of missing data caused by low sequencing coverage. In Viburnum, doubling sequencing coverage nearly doubled the number of parsimony informative sites, and increased by >10X the number of loci with data shared across >40 taxa. Our analysis leads to a set of practical recommendations for maximizing phylogenetic information in RAD-seq studies.
Supplementary Figure 1
Simulation procedure for dropping RAD-seq data by mutation-disruption in the program simrrls.
DRYAD_fig_S1.pdf
Supplementary Figure 2
The impact of missing data on quartet informativeness for simulated data (a-f) and the empirical Viburnum data set (g). This is an extension of Fig. 1. Data were simulated on three topologies: a balanced tree, an imbalanced tree, and the Viburnum topology with branch lengths scaled by penalized likelihood and the outgroup removed. The number of loci that are quartet informative for each split is shown under each tree. In the absence of missing data all 1,000 simulated loci are informative about every edge (a-f; black circles). Under mutation-disruption (a-c) quartet information is lost faster in double-digest data (light grey) than in single-digest data (dark grey), and its effect varies depending on tree shape (see description in Fig. 1). Data simulated at low sequencing coverage (d-f) had either 50% (dark grey) or 80% (light grey) of data randomly missing. Here the effect of tree shape is more pronounced. Nearly all information is recovered across the deepest splits in the balanced topology (d) due to its hierarchical redundancy, but no data are recovered in the imbalanced topology (e), which does not increase in hierarchical redundancy across deeper edges. The empirical Viburnum topology is relatively balanced, and data simulated on this topology (c, f) appear similar to those simulated on the balanced topology (a, d). The true distribution of quartet informativeness recovered in the Viburnum RAD-seq data set (g) is similar to the expectation when data were simulated on this topology under low sequencing coverage (f).
DRYAD_fig_S2.pdf
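The hierarchical redundancy described for Supplementary Figure 2 can be illustrated with a minimal Python sketch. It is not the code used for the figure. It assumes that a locus counts as quartet informative for an edge when it has data for at least one sample in each of the four subtrees joined by that edge, and that low coverage removes each sample from each locus independently. The 32-tip trees, the subtree partitions, and all function names are hypothetical choices made for illustration.

```python
import random

def quartet_informative(locus_samples, subtrees):
    """True if the locus has data for >= 1 sample in each of the four subtrees.

    Assumed definition of quartet informativeness, for illustration only.
    """
    return all(locus_samples & subtree for subtree in subtrees)

def count_informative(loci, subtrees):
    """Number of loci that are quartet informative for one edge."""
    return sum(quartet_informative(locus, subtrees) for locus in loci)

def simulate_random_missing(samples, n_loci, p_missing, rng):
    """Each sample independently lacks each locus with probability p_missing."""
    return [
        {s for s in samples if rng.random() >= p_missing}
        for _ in range(n_loci)
    ]

# Hypothetical 32-tip trees; tips are labeled 0..31.
tips = list(range(32))

# Deepest edge of a balanced tree: four subtrees of 8 tips each.
balanced_edge = [set(range(i, i + 8)) for i in range(0, 32, 8)]

# A deep edge of a fully imbalanced (ladder-like) tree: one large
# subtree and three single tips.
imbalanced_edge = [set(range(0, 29)), {29}, {30}, {31}]

rng = random.Random(42)
loci = simulate_random_missing(tips, n_loci=1000, p_missing=0.5, rng=rng)

balanced_n = count_informative(loci, balanced_edge)
imbalanced_n = count_informative(loci, imbalanced_edge)
print("balanced deep edge:  ", balanced_n, "of 1000 loci informative")
print("imbalanced deep edge:", imbalanced_n, "of 1000 loci informative")
```

Under these assumptions, with 50% of data missing at random, the expected informative fraction is (1 - 0.5^8)^4, about 0.98, for the balanced edge, but only about 0.5^3 = 0.125 for the imbalanced edge, because three of its four subtrees are single tips with no redundancy.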
Supplementary Figure 3
The effect of two forms of mutation-disruption in causing allelic dropout in simulations. Single digest (restriction recognition site length = 4, 6, or 8 bp; grey) and double digest (enzyme1 recognition length = 4, 6, or 8 bp, and enzyme2 recognition length = 4 bp; black) differ in the number of loci retained. (a) When only mutations occurring within cut sites cause disruption (dashed lines), longer cutters recover fewer data than short cutters. When only mutations giving rise to new cut sites within sequences cause disruption (solid lines), shorter cutters recover fewer data than long cutters. In both cases, the double digest data recover fewer loci than single digest, due to the greater opportunity for disruption. (b) When both forms of disruption cause dropout simultaneously, the length of a single cutter (4, 6, or 8 bp) has little effect on the amount of data loss, while adding a second independent cutter causes the rate of mutation-disruption to approximately double.
DRYAD_fig_S3.pdf
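The opposing trends in panel (a) of Supplementary Figure 3 can be sketched with a small Monte Carlo example in Python. This is not the simrrls implementation. It assumes a single digest, a uniform random ancestral sequence, and independent per-base substitutions with probability d; the motifs, the locus length, the value of d, and all function names are illustrative assumptions. Quantitative results, including the combined rate, depend on those choices and are not expected to match the figure.

```python
import random

BASES = "ACGT"

def mutate(seq, d, rng):
    """Substitute each base independently with probability d."""
    out = []
    for base in seq:
        if rng.random() < d:
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def random_locus(length, motif, rng):
    """Random sequence that does not already contain the motif."""
    while True:
        seq = "".join(rng.choice(BASES) for _ in range(length))
        if motif not in seq:
            return seq

def dropout_rates(motif, locus_len=100, d=0.05, reps=1000, seed=1):
    """Estimate the two forms of mutation-disruption for one cutter.

    Returns (fraction of loci whose cut site is lost,
             fraction of loci that gain a new cut site internally,
             fraction of loci disrupted by either form).
    """
    rng = random.Random(seed)
    site_lost = new_site = either = 0
    for _ in range(reps):
        # Form 1: a substitution falls within the recognition site.
        lost = mutate(motif, d, rng) != motif
        # Form 2: substitutions create the motif inside the locus.
        locus = random_locus(locus_len, motif, rng)
        gained = motif in mutate(locus, d, rng)
        site_lost += lost
        new_site += gained
        either += lost or gained
    return site_lost / reps, new_site / reps, either / reps

# Illustrative 4, 6, and 8 bp palindromic recognition sequences.
motifs = ["GATC", "GAATTC", "GCGGCCGC"]
results = {motif: dropout_rates(motif) for motif in motifs}
for motif, (lost, gained, both) in results.items():
    print(f"{len(motif)} bp cutter: site lost={lost:.3f}, "
          f"new site={gained:.3f}, either={both:.3f}")
```

In this sketch a longer recognition site offers more positions at which a substitution can destroy it, while a shorter motif is more likely to arise by chance within the sequenced fragment, which is the qualitative contrast between the dashed and solid lines.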
Supplementary Figure 4
Scatterplots of the number of shared loci among quartets of sampled individuals in ten empirical data sets and their relationship with two predictor variables: phylogenetic distance and log median number of input reads. All values are mean-standardized.
DRYAD_fig_S4.pdf
Supplementary Figure 5
Histograms of sequencing depth (coverage) in clusters recovered by pyrad across ten empirical data sets. In each, the sample with the fewest excluded low depth clusters (loci at depth <6X; green) is shown. The proportion of low coverage loci varies greatly across data sets with respect to total sequencing effort and the evenness of sequencing (Table 1).
DRYAD_fig_S5.pdf
Supplementary Table 1
Archived locations of raw data files for ten empirical RAD-seq data sets. Jupyter notebooks containing the code and assembly statistics for each assembled data set are available in the online repository: https://github.com/dereneaton/RADmissing.
DRYAD_tab_S1.pdf
Supplementary Table 2
Sequence read archive metadata for bioproject accession PRJNA299402 -- Viburnum RAD sequences.
SRA_metadata_final.csv
Supplementary Figure 6
Maximum likelihood phylogeny of Viburnum inferred from the full (a) and half (b) min4 data sets, and (c) a species tree constructed by quartet-joining with quartets inferred from the full min4 SNP alignment.
DRYAD_fig_S6.pdf