Data from: Challenges with using names to link digital biodiversity information
Patterson, David J.; Mozzherin, Dmitry; Shorthouse, David Peter; Thessen, Anne (2017), Data from: Challenges with using names to link digital biodiversity information, Dryad, Dataset, https://doi.org/10.5061/dryad.3160r
The need for a names-based cyber-infrastructure for digital biology rests on the argument that scientific names serve as a standardized metadata system that has been used consistently and near universally for 250 years. As we move towards data-centric biology, name-strings can be called on to discover, index, manage, and analyze accessible digital biodiversity information from multiple sources. Known impediments to the use of scientific names as metadata include synonyms, homonyms, misspellings, and the use of other strings as identifiers. Here we compare the name-strings in GenBank, Catalogue of Life (CoL), and the Dryad Digital Repository (DRYAD) to assess how effectively the current names-management toolkit developed by Global Names achieves interoperability among distributed data sources. New tools used here include Parser (to break name-strings into component parts and to promote the use of canonical versions of the names), a modified TaxaMatch fuzzy-matcher (to help manage typographical, transliteration, and OCR errors), and Cross-Mapper (to make comparisons among data sets). The data sources include scientific names at multiple ranks, vernacular (common) names, acronyms, strain identifiers, and other surrogates, including idiosyncratic abbreviations and concatenations. About 40% of the name-strings in GenBank are scientific names representing about 400,000 species or infraspecies and their synonyms. Of the formally named terminal taxa (species and lower taxa) represented, about 82% have a match in CoL. Using a subset of content in DRYAD, about 45% of the identifiers are names of species and infraspecies, and of these only about a third have a match in CoL. With simple processing, the extent of matching between DRYAD and CoL can be improved to over 90%.
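The parse, fuzzy-match, and cross-map steps described above can be illustrated with a minimal Python sketch. It is not the Global Names toolkit: `canonical_form` and `cross_map` are hypothetical helpers, the regular-expression parsing is a deliberately rough approximation of what Parser does, and the similarity ratio from Python's standard `difflib` module stands in for the modified TaxaMatch matcher. The `cutoff` value is an arbitrary assumption.

```python
import difflib
import re


def canonical_form(name_string):
    """Rough canonical form: a capitalized genus plus any lowercase epithets.

    Authorship, years, and other trailing tokens are dropped. Strings that do
    not start with a genus-like token (acronyms, strain codes, vernacular
    names in lowercase) return None. Real name-strings need a far more
    careful parser than this.
    """
    tokens = name_string.strip().split()
    if not tokens or not re.fullmatch(r"[A-Z][a-z]+", tokens[0]):
        return None
    parts = [tokens[0]]
    for token in tokens[1:]:
        if re.fullmatch(r"[a-z][a-z-]+", token):
            parts.append(token)
        else:
            break  # first non-epithet token (author, year, rank marker, ...)
    return " ".join(parts)


def cross_map(source_names, reference_names, cutoff=0.85):
    """Map each source name-string to a reference canonical form.

    Returns {name_string: (matched_canonical_or_None, match_kind)} where
    match_kind is "exact", "fuzzy", "no match", or "unparsed".
    """
    reference = sorted({c for c in map(canonical_form, reference_names) if c})
    result = {}
    for name in source_names:
        canon = canonical_form(name)
        if canon is None:
            result[name] = (None, "unparsed")
        elif canon in reference:
            result[name] = (canon, "exact")
        else:
            # difflib is a stand-in for a dedicated taxonomic fuzzy matcher.
            close = difflib.get_close_matches(canon, reference, n=1, cutoff=cutoff)
            result[name] = (close[0], "fuzzy") if close else (None, "no match")
    return result
```

Comparing canonical forms rather than raw name-strings is what lets `Homo sapiens L.` and `Homo sapiens Linnaeus, 1758` count as the same name, which is the kind of simple processing the abstract credits with raising the match rate.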
The findings confirm the necessity for name-processing tools and the value of scientific names as a mechanism to interconnect distributed data, and identify specific areas of improvement for taxonomic data sources. Some areas of diversity (bacteria and viruses) are not well represented by conventional scientific names. These groups, along with other forms of strings (acronyms, identifiers, and other surrogates) that are used instead of names, need to be managed in reconciliation services, which map alternative name-strings for the same taxon together. Online resolution services will bring older scientific names up to date or, where such names exist, convert surrogate name-strings to scientific names. Examples are given of many of the aberrant forms of ‘names’ that make their way into these databases. The occurrence of scientific names with incorrect authors, such as chresonyms within synonymy lists, is a quality-control issue in need of attention. We propose a future-proofing solution that will empower stakeholders to take advantage of the name-based infrastructure at little cost. This proposed infrastructure includes a standardized system that adopts or creates UUIDs for name-strings, software that can identify name-strings in sources and apply the UUIDs, reconciliation and resolution services to manage the name-strings, and an annotation environment for quality control by users of name-strings.
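One way to picture the proposed UUIDs for name-strings is a name-based (version 5) UUID, which is derived deterministically from the string itself, so independent data sources can compute the same identifier without consulting a central registry. The sketch below uses Python's standard `uuid` module. The abstract does not specify a UUID version or namespace, so both are assumptions here: the namespace `names.example.org` and the helpers `name_string_uuid` and `index_by_uuid` are hypothetical.

```python
import uuid

# Hypothetical namespace for illustration only; a real service would publish
# its own fixed namespace so that all parties derive identical UUIDs.
NAME_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "names.example.org")


def name_string_uuid(name_string):
    """Return a deterministic version 5 UUID for an exact name-string."""
    return uuid.uuid5(NAME_NAMESPACE, name_string)


def index_by_uuid(name_strings):
    """Build a {UUID: name_string} index, collapsing exact duplicates."""
    return {name_string_uuid(s): s for s in name_strings}
```

Because the identifier is tied to the exact string, `Homo sapiens` and `Homo sapiens Linnaeus, 1758` receive different UUIDs. Linking such variants to one taxon would remain the job of the reconciliation services described above.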