Data from: Unforeseen consequences of excluding missing data from next-generation sequences: simulation study of RAD sequences

Citation

Huang, Huateng; Knowles, L. Lacey (2014), Data from: Unforeseen consequences of excluding missing data from next-generation sequences: simulation study of RAD sequences, Dryad, Dataset, https://doi.org/10.5061/dryad.jf361

Abstract

There is a lack of consensus on how next-generation sequence data should be considered for phylogenetic and phylogeographic estimates, with some studies excluding loci with missing data, while others include them, even when sequences are missing from a large number of individuals. Here we use simulations, focusing specifically on RAD sequences, to highlight some of the unforeseen consequences of excluding missing data from next-generation sequencing. Specifically, we show that in addition to the obvious effects associated with reducing the amount of data used to make historical inferences, the decisions we make about missing data (such as the minimum number of individuals with a sequence for a locus to be included in the study) also impact the types of loci sampled for a study. In particular, as the tolerance for missing data becomes more stringent, the mutational spectrum represented in the sampled loci becomes truncated such that loci with the highest mutation rates are disproportionately excluded. This effect is exacerbated further by factors involved in the preparation of the genomic library (i.e., the use of reduced representation libraries, as well as the coverage) and the taxonomic diversity represented in the library (i.e., the level of divergence among the individuals). We demonstrate that the intuitive appeal of being conservative by removing loci may be misguided.
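
The filtering effect described in the abstract can be illustrated with a toy sketch. This is not the authors' simulation pipeline and does not use the files in this dataset. It assumes, purely for illustration, that the chance a locus fails to be sequenced in an individual rises linearly with the locus's relative mutation rate, which builds the direction of the effect into the model. The function names, the `0.8` dropout scaling, and the uniform rate distribution are all hypothetical choices. The sketch then applies a "minimum number of individuals with a sequence" threshold and compares the mean mutation rate of the retained loci.

```python
import random


def simulate_loci(n_loci, n_individuals, seed=0):
    """Return a list of (relative_rate, n_individuals_with_sequence) per locus.

    ASSUMPTION (not from the dataset): per-individual dropout probability
    grows linearly with the locus's relative mutation rate.
    """
    rng = random.Random(seed)
    loci = []
    for _ in range(n_loci):
        rate = rng.uniform(0.0, 1.0)   # hypothetical relative mutation rate
        p_drop = 0.8 * rate            # hypothetical dropout model
        present = sum(
            1 for _ in range(n_individuals) if rng.random() > p_drop
        )
        loci.append((rate, present))
    return loci


def filter_loci(loci, min_individuals):
    """Keep loci sequenced in at least `min_individuals` individuals."""
    return [(rate, present) for rate, present in loci
            if present >= min_individuals]


def mean_rate(loci):
    """Mean relative mutation rate of a set of loci (NaN if empty)."""
    if not loci:
        return float("nan")
    return sum(rate for rate, _ in loci) / len(loci)


if __name__ == "__main__":
    loci = simulate_loci(n_loci=5000, n_individuals=20)
    # Sweep from a lenient to a stringent missing-data threshold.
    for min_ind in (2, 10, 18):
        kept = filter_loci(loci, min_ind)
        print(min_ind, len(kept), round(mean_rate(kept), 3))
```

Under these assumptions, raising the threshold both reduces the number of retained loci and lowers their mean mutation rate, which is the kind of truncation of the mutational spectrum the abstract describes. The magnitude depends entirely on the assumed dropout model.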

Usage Notes