Data from: A preliminary framework for DNA barcoding, incorporating the multispecies coalescent

Cite this dataset

Dowton, Mark; Meiklejohn, Kelly; Cameron, Stephen L.; Wallman, James (2014). Data from: A preliminary framework for DNA barcoding, incorporating the multispecies coalescent [Dataset]. Dryad.


The capacity to identify an unknown organism using the DNA sequence from a single gene has many applications. These include the development of biodiversity inventories (Janzen et al. 2005), forensics (Meiklejohn et al. 2011), biosecurity (Armstrong and Ball 2005), and the identification of cryptic species (Smith et al. 2006). The popularity and widespread use (Teletchea 2010) of the DNA barcoding approach (Hebert et al. 2003), despite broad misgivings (e.g., Smith 2005; Will et al. 2005; Rubinoff et al. 2006), attest to this. However, one major shortcoming of the standard barcoding approach is that it assumes that gene trees and species trees are synonymous, an assumption that is known not to hold in many cases (Pamilo and Nei 1988; Funk and Omland 2003). Biological processes that violate this assumption include incomplete lineage sorting and interspecific hybridization (Funk and Omland 2003). Indeed, simulation studies indicate that the concatenation approach (in which these two processes are ignored) can lead to statistically inconsistent estimation of the species tree (Kubatko and Degnan 2007). Recent developments, moreover, make a single-locus barcoding approach outdated. The cost of sequencing multiple gene fragments is no longer prohibitive and, more importantly, a range of analytical approaches has been developed that account for incomplete lineage sorting (Degnan and Salter 2005; Edwards et al. 2007; Liu et al. 2008; Kubatko et al. 2009; Heled and Drummond 2010; Yang and Rannala 2010). These approaches incorporate coalescent theory into the analysis of species trees and species delimitation (Fujita et al. 2012) and are conveniently accessible as software programs (e.g., BEST, BPP, *BEAST, MrBayes v. 3.2, STEM, and COAL). Although the general mixed Yule coalescent (GMYC) approach has also been developed for species delimitation (Pons et al. 2006), we do not consider it further here.
The GMYC approach operates quite differently from the approaches outlined above (i.e., BEST, BPP, *BEAST, MrBayes v. 3.2, STEM, and COAL): it seeks to identify the shift in the rate of lineage branching that should be evident when interspecific evolutionary processes switch to population-level processes (Pons et al. 2006). Both empirical (Esselstyn et al. 2012) and simulation studies (Esselstyn et al. 2012; Fujisawa and Barraclough 2013) report that it performs poorly when effective population sizes and speciation rates are high, even when they remain within biologically relevant ranges. Ideally, a “next-generation” barcoding approach would (1) identify a minimal set of barcoding genes (perhaps specific to certain lineages), (2) generate a large and cladistically divergent database for comparisons, and (3) identify species using species delimitation approaches that incorporate the multispecies coalescent. The first two of these requirements are straightforward, needing only discussion (requirement 1) and resources (requirement 2). The third requirement, however, is much more problematic. Some of the recently developed approaches for species delimitation cannot be used alone; for example, BPP requires a user-specified guide tree (Yang and Rannala 2010). All of the recently developed approaches are computationally intensive (Degnan and Rosenberg 2009), and many have practical limits on the number of individuals that can be compared. By contrast, the current barcoding approach can compare enormous numbers of sequences in a very short time, primarily because it is analytically simple: a single query sequence is compared with every sequence in the database by calculating all possible pairwise Kimura 2-parameter (K2P) distances. As long as the database contains exemplars whose K2P distance to the query falls below some predetermined threshold (usually 4%), the species is considered identified. The speed of analysis is due primarily to the use of distance-based measures.
The purpose of this article is to initiate the development of a framework for “next-gen barcoding”: one that incorporates the multispecies coalescent, but does so by comparing multiple gene sequences from an unknown taxon with a database of sequences.
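
As an illustration of why the current single-locus approach is so fast, the sketch below computes the standard Kimura 2-parameter distance, d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions, and then applies a distance threshold. It is a minimal sketch and not the authors' code or the pipeline of any barcoding database: the function names, the `(label, sequence)` reference format, and the choice to skip gapped or ambiguous sites are assumptions made for this example. The 4% default follows the threshold quoted in the abstract.

```python
import math

# Purine <-> purine and pyrimidine <-> pyrimidine changes are transitions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two pre-aligned DNA sequences.

    Sites where either sequence has a gap or ambiguity code are skipped
    (an assumption of this sketch).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared    # proportion of transitions (P)
    q = transversions / compared  # proportion of transversions (Q)
    term1 = 1 - 2 * p - q
    term2 = 1 - 2 * q
    if term1 <= 0 or term2 <= 0:
        return math.inf  # sequences too divergent for the K2P correction
    return -0.5 * math.log(term1 * math.sqrt(term2))


def identify_by_threshold(query, references, threshold=0.04):
    """Return (label, distance) pairs within the threshold, closest first.

    `references` is a hypothetical list of (species_label, aligned_sequence).
    """
    hits = []
    for label, ref_seq in references:
        d = k2p_distance(query, ref_seq)
        if d <= threshold:
            hits.append((label, d))
    return sorted(hits, key=lambda hit: hit[1])


if __name__ == "__main__":
    query = "ACGTACGTAC" * 10
    references = [
        ("species_A", "G" + query[1:]),                  # one transition
        ("species_B", "TGCATGCATG" * 2 + query[20:]),    # 20 transversions
    ]
    for label, d in identify_by_threshold(query, references):
        print(f"{label}\t{d:.4f}")
```

Each comparison is a single pass over the alignment, so the cost grows linearly with the number of reference sequences. This is the property that the coalescent-based methods listed above do not share.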

Usage notes