
Data from: Resolving ambiguity of concatenation in multi-locus sequence data for the construction of phylogenetic supermatrices

Cite this dataset

Vogler, Alfried P.; Chesters, Douglas (2013). Data from: Resolving ambiguity of concatenation in multi-locus sequence data for the construction of phylogenetic supermatrices [Dataset]. Dryad.


The construction of supermatrices from mining of DNA metadata is problematic due to incomplete species identification and incongruence of gene trees, both of which hamper sequence concatenation based on Linnaean binomials. We applied methods from graph theory to minimize ambiguity of concatenation globally over a large data set. An initial step establishes sequence clusters for each locus that broadly correspond to Linnaean species. These clusters are frequently inconsistent with binomials and specimen identifiers, which greatly complicates the concatenation of clusters across multiple loci. A multipartite heuristic algorithm is used to match clusters across loci and to generate a global set of concatenates that minimizes conflict of taxonomic names. The procedure was applied to all available data on GenBank for the Coleoptera (beetles), including >10500 taxon labels for >23500 sequences of four loci. The BlastClust algorithm was used in the initial clustering step, resulting in 11241 clusters or divergent singletons. Clusters were first used for name assignment of unidentified sequences, resulting in 510 new identifications (13.9% of total unidentified sequences), of which nearly half were obtained by clustering of a specimen at a secondary locus. Concatenation was straightforward only for the 12.8% of all binomials that were represented by a singleton sequence at each locus with an available entry, while the majority of binomials were associated with multi-sequence clusters in at least one locus. Concatenation of clusters is particularly problematic where the limits of DNA-based clusters are inconsistent with the Linnaean binomials, with a cluster either containing more than one binomial or a binomial being split among multiple clusters. The current data set contained 1518 such clusters (13.5% of total). By applying a scoring scheme for full and partial name matches in pairs of clusters, the maximum weight set of concatenates produced a matrix of at least 7366 terminals. Varying the match weights for partial matches had little effect on the number of terminals, although if partial matches were disallowed, the number of terminals increased greatly. The resulting supermatrices generally produced tree topologies in good agreement with the Linnaean taxonomy, with fewer terminals compared to trees generated according to standard species labels. The study illustrates a strategy for assembling the Tree-of-Life from an ever more complex primary database.
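
The idea of scoring name matches between clusters and selecting a high-weight set of concatenates can be illustrated with a small sketch. This is not the authors' multipartite algorithm: it is a simplified greedy heuristic for two loci only, and every name in it (`match_score`, `greedy_concatenate`, the `full` and `partial` weights) is hypothetical. It also assumes that a "full" match means two clusters share at least one binomial and that a "partial" match means they share only a genus name, which the abstract does not specify.

```python
def genus(binomial):
    """Return the genus part of a 'Genus species' label."""
    return binomial.split()[0]


def match_score(a, b, full=1.0, partial=0.5):
    """Score two clusters (sets of binomials) taken from different loci.

    Assumption: a shared binomial is a full match, a shared genus
    alone is a partial match, anything else scores zero.
    """
    if a & b:
        return full
    if {genus(n) for n in a} & {genus(n) for n in b}:
        return partial
    return 0.0


def greedy_concatenate(locus1, locus2, full=1.0, partial=0.5):
    """Greedily pair clusters across two loci, highest score first.

    Returns a list of terminals as (index_in_locus1, index_in_locus2);
    None marks a cluster left unpaired at the other locus.
    """
    scored = []
    for i, a in enumerate(locus1):
        for j, b in enumerate(locus2):
            s = match_score(a, b, full, partial)
            if s > 0:
                scored.append((s, i, j))
    # Highest score first; ties broken by cluster index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))

    used1, used2, terminals = set(), set(), []
    for _, i, j in scored:
        if i not in used1 and j not in used2:
            used1.add(i)
            used2.add(j)
            terminals.append((i, j))
    # Unpaired clusters remain as separate terminals.
    terminals += [(i, None) for i in range(len(locus1)) if i not in used1]
    terminals += [(None, j) for j in range(len(locus2)) if j not in used2]
    return terminals


# Toy clusters, invented for illustration only.
locus1 = [{"Carabus auratus"}, {"Carabus nitens", "Carabus sp."}, {"Agabus bipustulatus"}]
locus2 = [{"Carabus auratus"}, {"Carabus granulatus"}, {"Dytiscus marginalis"}]

with_partial = greedy_concatenate(locus1, locus2)
without_partial = greedy_concatenate(locus1, locus2, partial=0.0)
print(len(with_partial), len(without_partial))
```

In this toy input, setting the partial weight to zero leaves more clusters unpaired and therefore yields more terminals, which mirrors the direction of the effect reported in the abstract. A greedy selection does not guarantee a maximum weight solution; it is used here only to keep the sketch short.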

Usage notes