Data from: In search of an optimal DNA diagnosis for taxonomic descriptions with MOLD, a novel tool to identify diagnostic nucleotide characters
Fedosov, Alexander (2020), Data from: In search of an optimal DNA diagnosis for taxonomic descriptions with MOLD, a novel tool to identify diagnostic nucleotide characters, Dryad, Dataset, https://doi.org/10.5061/dryad.pnvx0k6m0
While DNA characters are increasingly used for phylogenetic inference, taxon delimitation and identification, their use in the formal description of taxa remains scarce and inconsistent. Until recently, the major impediment was the lack of a suitable algorithm to identify signature DNA characters. The years 2019-2020, however, were marked by the almost simultaneous release of three software tools that are simple to run and designed specifically for taxonomists. There is, nevertheless, a major concern as to whether taxonomy will benefit from the wide application of these or any of the previously available tools. The reluctance to use DNA data in taxonomy is partly due to concerns about the insufficient reliability of DNA characters, as the robustness of DNA-based diagnoses, as a function of the sampled fraction of a species' diversity, has not thus far been assessed.
We propose a novel program, named MOLD, that recovers diagnostic nucleotide combinations (DNCs) for selected taxa with DNA sequences available. We carried out random iterated haplotype subsampling on species in six published DNA data sets of varying complexity, providing a diagnosis for each subsample, to evaluate how the robustness of a DNA-based diagnosis changes depending on the sampled fraction of the taxon’s diversity. We demonstrate that the currently used diagnostic DNA characters, or combinations thereof (DNCs), often do not exist for a particular species in a particular data set, or are not sufficiently reliable. We propose a new type of DNA diagnosis, termed herein rDNCs, which is compiled to meet pre-defined criteria of reliability and is implemented in MOLD. We demonstrate that rDNCs can be successfully identified even in data sets comprising hundreds of species, and allow for notably more reliable diagnoses than the currently used diagnostic DNA characters. MOLD recovers reliable and reproducible diagnoses in traditionally problematic cases, such as cryptic species or species with pronounced genetic structure, and shows unparalleled efficiency in large DNA data sets, making it a valuable complement to the existing toolkit.
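The idea of a diagnostic nucleotide combination and of subsampling-based robustness can be illustrated with a minimal Python sketch. This is not MOLD's algorithm or interface. It is a brute-force toy that assumes pre-aligned sequences of equal length, and the function names (`is_diagnostic`, `find_dncs`, `subsample_support`), the parameters, and the example sequences are all hypothetical.

```python
from itertools import combinations
import random


def is_diagnostic(sites, focal, others):
    """True if all focal sequences share one state combination at `sites`
    and no non-focal sequence carries that same combination."""
    focal_states = {tuple(seq[i] for i in sites) for seq in focal}
    if len(focal_states) != 1:
        return False
    state = next(iter(focal_states))
    return all(tuple(seq[i] for i in sites) != state for seq in others)


def find_dncs(focal, others, max_size=2):
    """Brute-force search for the smallest diagnostic site combinations
    (illustrative only; exhaustive search does not scale to large alignments)."""
    length = len(focal[0])
    # Only sites invariant within the focal taxon can be part of a diagnosis.
    fixed = [i for i in range(length) if len({s[i] for s in focal}) == 1]
    for size in range(1, max_size + 1):
        hits = [c for c in combinations(fixed, size)
                if is_diagnostic(c, focal, others)]
        if hits:
            return hits
    return []


def subsample_support(focal, others, fraction=0.67, reps=50, seed=1):
    """Share of random subsamples whose first recovered combination is still
    diagnostic for the full data set (None if no subsample yields one)."""
    rng = random.Random(seed)
    held = tested = 0
    for _ in range(reps):
        sub_f = rng.sample(focal, max(1, round(fraction * len(focal))))
        sub_o = rng.sample(others, max(1, round(fraction * len(others))))
        dncs = find_dncs(sub_f, sub_o)
        if not dncs:
            continue
        tested += 1
        held += is_diagnostic(dncs[0], focal, others)
    return held / tested if tested else None


# Hypothetical toy alignment (0-based site indices).
focal = ["ACGTA", "ACGTA", "ACGTC"]
others = ["ACGAA", "TCGTA", "TCGAC"]

dncs = find_dncs(focal, others)
support = subsample_support(focal, others)
print(dncs, support)
```

In this toy alignment no single site separates the focal taxon from the others, but sites 0 and 3 taken together do, which is the sense in which a combination can be diagnostic where individual characters are not. The `subsample_support` value gives a rough indication of how often a diagnosis derived from partial sampling survives when the remaining haplotypes are added.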
Seven published DNA datasets were analysed using MOLD, a novel software tool to recover diagnostic DNA characters for taxonomy.
Russian Science Foundation, Award: 19-74-10020