Updating splits, lumps, and shuffles: Reconciling GenBank names with standardized avian taxonomies
Cite this dataset
Hosner, Peter A et al. (2022). Updating splits, lumps, and shuffles: Reconciling GenBank names with standardized avian taxonomies [Dataset]. Dryad. https://doi.org/10.5061/dryad.gtht76hqf
Biodiversity research has advanced by testing expectations of ecological and evolutionary hypotheses through the linking of large-scale genetic, distributional, and trait datasets. The rise of molecular systematics over the past 30 years has resulted in a wealth of DNA sequences from around the globe. Yet, advances in molecular systematics also have created taxonomic instability, as new estimates of evolutionary relationships and interpretations of species limits have required widespread scientific name changes. Taxonomic instability, colloquially “splits, lumps, and shuffles,” presents logistical challenges to large-scale biodiversity research because (1) the same species or sets of populations may be listed under different names in different data sources, or (2) the same name may apply to different sets of populations representing different taxonomic concepts. Consequently, distributional and trait data are often difficult to link directly to primary DNA sequence data without extensive and time-consuming curation. Here, we present RANT: Reconciliation of Avian NCBI Taxonomy. RANT applies taxonomic reconciliation to standardize avian taxon names in use in NCBI GenBank, a primary source of genetic data, to a widely used and regularly updated avian taxonomy: eBird/Clements. Of 14,341 avian species/subspecies names in GenBank, 11,031 directly matched an eBird/Clements name; these link to more than 6 million nucleotide sequences. For the remaining unmatched avian names in GenBank, we used Avibase’s system of taxonomic concepts, taxonomic descriptions in Cornell’s Birds of the World, and DNA sequence metadata to identify corresponding eBird/Clements names. Reconciled names linked to more than 600,000 nucleotide sequences, ~9% of all avian sequences on GenBank. Nearly 10% of eBird/Clements names had nucleotide sequences listed under 2 or more GenBank names.
Our taxonomic reconciliation is a first step towards rigorous and open-source curation of avian GenBank sequences and is available at GitHub, where it can be updated to correspond to future annual eBird/Clements taxonomic updates.
We downloaded all names from the NCBI Taxonomy database (Schoch et al., 2020) that descended from “Aves” (TaxID: 8782) on 3 May 2020 (Data Repository D2). From this list, we extracted all species and subspecies names as well as their NCBI Taxonomy ID (TaxID) numbers. We then ran a custom Perl script (Data Repository D3) to exactly match binomial (genus, species) and trinomial (genus, species, subspecies) names from NCBI Taxonomy to the names recognized by eBird/Clements v2019 Integrated Checklist (August 2019; Data Repository D4). For each NCBI Taxonomy name without an exact match, we then identified the equivalent eBird/Clements species or subspecies. We first searched for names in Avibase (Lepage et al., 2014). However, Avibase’s search function currently facilitates only exact matches to taxonomies it implements. For names that were not an exact match to an Avibase taxonomic concept, we ran web searches (Google), which often identified minor spelling differences, consulted Cornell’s Birds of the World Online (https://birdsoftheworld.org), and consulted relevant literature, often the papers that first published those sequence data.
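The exact-matching step can be sketched as follows. This is a minimal Python illustration, not the Perl script in the data repository; the function name, the tuple layout, and the placeholder TaxID 0 are assumptions made for the example.

```python
def exact_match(genbank_records, clements_names):
    """Separate GenBank records into exact matches and mismatches.

    genbank_records: iterable of (taxid, scientific_name, rank) tuples.
    clements_names: iterable of eBird/Clements scientific names.
    """
    clements = set(clements_names)  # set lookup keeps matching fast
    matched, unmatched = [], []
    for taxid, name, rank in genbank_records:
        # Exact string comparison only: no fuzzy or partial matching.
        if name in clements:
            matched.append((taxid, name, rank))
        else:
            unmatched.append((taxid, name, rank))
    return matched, unmatched


genbank = [
    (56274, "Otus megalotis everetti", "subspecies"),
    (56287, "Mimizuku gurneyi", "species"),
    (0, "Strix varia", "species"),  # 0 is a placeholder, not a real TaxID
]
clements = ["Otus everetti", "Otus gurneyi", "Strix varia"]

matched, unmatched = exact_match(genbank, clements)
print([name for _, name, _ in matched])
print([name for _, name, _ in unmatched])
```

Names left in the unmatched list are the ones that would go on to manual reconciliation.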
We classified nine categories of naming mismatches resulting from discrepancies between GenBank and eBird/Clements names: split, lump, shuffle, new, spelling, hybrid, extinct, domesticated, and unidentified (Table 2). Split is a name that corresponds to a subspecies rank in GenBank, but a species rank in eBird/Clements. For example, the GenBank subspecies name Otus megalotis everetti (TaxID: 56274) corresponds to the species name Otus everetti in eBird/Clements. Lump is a name that corresponds to a species rank in GenBank, but a subspecies rank in eBird/Clements. For example, the GenBank name Megascops colombianus (TaxID: 1740167) corresponds to Megascops ingens colombianus in eBird/Clements. Shuffle is a taxon that has an equivalent rank in GenBank and eBird/Clements, but different name usage. Most often shuffles stem from changes in genera, but a few species epithets have changed because of new evidence regarding nomenclature priority. For example, the GenBank name Mimizuku gurneyi (TaxID: 56287) corresponds to Otus gurneyi in eBird/Clements, reflecting a change in the generic name. New is a species or subspecies that was undescribed when its sequences were initially uploaded to GenBank. To preserve nomenclature priority, GenBank avoids unpublished or in-press names of undescribed taxa, instead assigning an informal placeholder name. Typically, the placeholder name consists of the genus, the data uploaders' initials, and the year of first upload. For example, Megascops_sp._SMD-2015 (TaxID: 1740173) corresponds to the Santa Marta Screech-Owl, Megascops gilesi, Krabbe, 2017. Spelling is a taxon that has an equivalent name in GenBank and eBird/Clements, but for which a slightly different spelling is implemented. For example, the GenBank name Glaucidium nanum (TaxID: 126809) corresponds to the eBird/Clements name Glaucidium nana. Hybrid is a hybrid individual, usually identified in GenBank by a name comprising the putative parental species separated by a cross “x”, for example the GenBank name Strix occidentalis x Strix varia. Hybrids were not reconciled to eBird/Clements names, although eBird taxonomy does include and organize names for some frequent avian hybrid parental combinations. Extinct is an extinct taxon that is not regulated by eBird/Clements because it was not documented in the modern era. For example, the elephant bird Aepyornis maximus (TaxID: 748142) is known from Holocene bones and eggshell materials that have yielded DNA sequences, but this name is not regulated by eBird/Clements. Domesticated is a domesticated breed or line. For example, GenBank has a listing for the domesticated “Society Finch” as Lonchura striata domestica (TaxID: 299123), but in eBird/Clements it corresponds to Lonchura striata because domesticated forms are not generally considered subspecies. Finally, Unidentified refers to TaxIDs to which we were unable to assign a species name. These were generally samples not identified to species, or environmental DNA samples.
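The rank-based part of these definitions lends itself to a small illustration. The Python sketch below is an assumption-laden aid, not the procedure used for the dataset, where categories were assigned by manual curation; the function name and return values are hypothetical. It covers only split, lump, and hybrid, because separating shuffle, spelling, and new among same-rank mismatches requires human judgment.

```python
def rank_based_category(genbank_name, genbank_rank, clements_rank):
    """Return 'hybrid', 'split', 'lump', or None.

    None means the ranks are equal, so the mismatch is a shuffle,
    a spelling difference, or a new taxon, which needs manual review.
    """
    # Hybrids are named as two parental species joined by a cross "x".
    if " x " in genbank_name:
        return "hybrid"
    # Split: subspecies in GenBank, species in eBird/Clements.
    if genbank_rank == "subspecies" and clements_rank == "species":
        return "split"
    # Lump: species in GenBank, subspecies in eBird/Clements.
    if genbank_rank == "species" and clements_rank == "subspecies":
        return "lump"
    return None


examples = [
    ("Otus megalotis everetti", "subspecies", "species"),
    ("Megascops colombianus", "species", "subspecies"),
    ("Mimizuku gurneyi", "species", "species"),
    ("Strix occidentalis x Strix varia", "species", None),
]
categories = [rank_based_category(*example) for example in examples]
print(categories)
```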
We summarized the total number and proportion of reconciled GenBank TaxIDs by bird order and, within the largest bird order, Passeriformes, by family. We also summarized the number of GenBank nucleotide sequences and number of reconciliations for each IUCN conservation status category. We placed any taxon without a direct match to an IUCN name under “Not Assessed”.
GenBank sequences associated with avian names
We tallied the number of core nucleotide sequences in GenBank associated with each taxonomic ID by downloading the “nucl_gb.accession2taxid” file on 2 November 2020 (Data Repository D5). This file lists the accession number for each sequence in the GenBank nucleotide database and its corresponding taxonomic ID number. We then wrote a Perl script (Data Repository D6) to count the number of nucleotide sequences associated with each avian taxonomic ID. To obtain counts of the number of runs in the NCBI Sequence Read Archive (SRA) associated with each bird species, we downloaded the “RunInfo” for the SRA runs (“SraRunInfo.csv”) within “Aves” on August 1, 2021 (Data Repository D7). To obtain counts of the number of genome sequences in GenBank associated with each name, we downloaded from NCBI on September 5, 2021 a summary of the NCBI Genome files (“genome_result.txt”) within “Aves” (Data Repository D8).
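The sequence tally can be sketched in Python as below. This is an illustration rather than the Perl script in the data repository. It assumes the file is tab-separated with a header row and the TaxID in the third column (accession, accession.version, taxid, gi); the accession numbers in the sample are made up.

```python
import csv
import io
from collections import Counter


def count_sequences(lines, avian_taxids):
    """Count nucleotide records per TaxID, keeping only avian TaxIDs.

    lines: any iterable of text lines in accession2taxid layout.
    avian_taxids: set of TaxIDs (as strings) to retain.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row
    counts = Counter()
    for row in reader:
        taxid = row[2]  # third column holds the TaxID (assumed layout)
        if taxid in avian_taxids:
            counts[taxid] += 1
    return counts


# Small in-memory stand-in for the real file; accessions are invented.
sample = io.StringIO(
    "accession\taccession.version\ttaxid\tgi\n"
    "AA000001\tAA000001.1\t56274\t1\n"
    "AA000002\tAA000002.1\t56274\t2\n"
    "AA000003\tAA000003.1\t9606\t3\n"
    "AA000004\tAA000004.1\t56287\t4\n"
)
counts = count_sequences(sample, {"56274", "56287"})
print(dict(counts))
```

For the real download, the same function could be given a handle from `gzip.open("nucl_gb.accession2taxid.gz", "rt")`, which streams the file line by line instead of loading it into memory.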
Linking eBird/Clements names to geographic realms
For TaxIDs that were successfully assigned to eBird/Clements species names (either by direct name match or taxonomic reconciliation), we delimited their geographic realms using the associated IOC breeding ranges (eight terrestrial realms and four oceanic realms). Here we used IOC rather than eBird/Clements geographic information, because eBird/Clements does not summarize species occurrence by geographic realm. We also manually assigned geographic realms for species without range information available in the IOC v10.1 checklist (master_ioc_list_v10.1.xlsx). We defined species that occur in only one realm as realm endemics, and species that occur in two or more realms as widespread. We then summarized the number of reconciliations and the number of GenBank nucleotide sequences for each realm and for widespread species.
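The endemic versus widespread rule and the per-realm summary can be sketched as follows. This Python example is illustrative only: the taxon labels, realm codes, and sequence counts are invented, and the function name is hypothetical.

```python
from collections import defaultdict


def summarize_by_realm(records):
    """Sum sequence counts per realm, pooling multi-realm species.

    records: iterable of (name, realms, n_sequences), where realms is a
    list of realm codes for the species' breeding range.
    """
    totals = defaultdict(int)
    for name, realms, n_sequences in records:
        unique = set(realms)
        if not unique:
            raise ValueError(f"no realm assigned for {name}")
        # One realm: realm endemic. Two or more realms: widespread.
        key = next(iter(unique)) if len(unique) == 1 else "widespread"
        totals[key] += n_sequences
    return dict(totals)


# Invented records: realm codes "NA" and "SA" are placeholders.
records = [
    ("Taxon A", ["NA"], 10),
    ("Taxon B", ["NA", "SA"], 5),
    ("Taxon C", ["SA"], 7),
    ("Taxon D", ["NA"], 3),
]
totals = summarize_by_realm(records)
print(totals)
```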
Linking eBird/Clements names to other databases
We used audio data as an example to examine the extent to which name-reconciled GenBank sequences apply to large avian comparative databases, such as Macaulay Library and Xeno-canto. Since Macaulay Library uses eBird/Clements taxonomy for its bird images, audio, and video, we can readily link these media resources to the GenBank nucleotide data under the same eBird/Clements names. We downloaded a summary of available audio data (April 2021) from Macaulay Library (https://www.macaulaylibrary.org/resources/media-target-species/; Data Repository D9). We also examined Xeno-canto, a global avian vocalization database, which uses the IOC taxonomy. To match Xeno-canto’s 10,909 avian names to eBird/Clements names, we first set aside the species with a direct name match and then reconciled the remaining names using Avibase taxonomic concepts. Lastly, we summed the number of Xeno-canto sound recordings (October 2020; https://www.xeno-canto.org/collection/species/all; Data Repository D10) under the same eBird/Clements name. For example, the Xeno-canto name Colinus leucopogon had 26 sound recordings and Colinus cristatus had 57, but the eBird/Clements name C. cristatus would have 83, because C. leucopogon is treated as a subspecies of C. cristatus by eBird/Clements.
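The pooling of recordings under a reconciled name can be sketched in Python using the Colinus example above. The function and the mapping dictionary are hypothetical; a real mapping would come from the Avibase-based reconciliation.

```python
from collections import Counter


def pool_recordings(recordings, reconciliation):
    """Sum recording counts under their eBird/Clements names.

    recordings: dict of Xeno-canto (IOC) name -> number of recordings.
    reconciliation: dict of IOC name -> eBird/Clements name, listing
    only the names that differ between the two taxonomies.
    """
    pooled = Counter()
    for name, n_recordings in recordings.items():
        # Names absent from the mapping are direct matches.
        pooled[reconciliation.get(name, name)] += n_recordings
    return pooled


xeno_canto = {"Colinus leucopogon": 26, "Colinus cristatus": 57}
ioc_to_clements = {"Colinus leucopogon": "Colinus cristatus"}

pooled = pool_recordings(xeno_canto, ioc_to_clements)
print(pooled["Colinus cristatus"])  # 83
```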
"PetersVsClements2Final.txt" - This file tells which species from the Peters taxonomy match the 2019 Clements/ebird taxonomy. The first column has a species name from the Peters taxonomy. In the second column, "Clements" indicates that the species name matches the Clements/ebird taxonomy, "No" means it does match, and "Close" means that the names match when you disregard the last two letters.
"SibleyMonroeVsClements_Final.txt" - This file tells which species from the Sibley Monroe taxonomy match the 2019 Clements/ebird taxonomy. The first column has a species ID number from the Sibley Monroe taxonomy. The second column has the species scientific name from the Sibley Monroe taxonomy. The third column has the common name from the Sibley Monroe taxonomy. In the fourth column, "Clements" indicates that the species name matches the Clements/ebird taxonomy, "No" means it does match, and "Close" means that the names match when you disregard the last two letters.
"taxonomy_result.unix.xml" - XML file with NCBI taxonomy with the names descending from "Aves" (downloaded May 3, 2020).
"GenBank.AvesSpecies.txt" - This text file has the GenBank species and subspecies names within "Aves". The first column has the GenBank taxon ID number. The second column has the scientific name corresponding to the taxon ID number, and the third column lists whether this name corresponds to a species or a subspecies.
"extractGBnames.pl" - Perl script that reads in "taxonomy_result.unix.xml" and outputs the taxon ID numbers, their corresponding scientific names, and ther rank(e.g. "species").
"compare.pl" - Perl script that reads in the Clements Ebird 2019 taxonomy (the file "EbirdClements.txt" in D4) and the list of GenBank taxon names from Aves (the file "GenBank.AvesSpecies.txt" in S2). If the GenBank taxon name exactly matches a name in the Clements/Ebird taxonomy, it outputs the GenBank taxon ID, GenBank name, GenBank rank, and all the information associated with the name in the Clements/Ebird taxonomy. If the GenBank name does not match, it just outputs the GenBank taxon ID, GenBank name, and GenBank rank.
"EbirdClements.txt" - text file with the taxonomic names and associated metadata from the 2019 Ebird Clements dataset. The first column has the code associated with species names; the second column has the rank, the third column has the common name for the species, the fourth column has the scientific name; the fifth column has the range; the sixth column has the order name, and the last column has the family name (with the common family name next to it in parentheses).
"nucl_gb.accession2taxid.gz" - compressed file that has the GenBank accession numbers from the core nucleotide database and the taxon ID associated with that sequence. This was downloaded from NCBI on Novemeber 2, 2020.
"taxonomy_result.txt" - text file with a list of GenBank taxon ID numbers associated with species and subspecies within Aves.
"countgb.pl" - Perl script that reads in "taxonomy_result.txt" and "nucl_gb.accession2taxid" (from S5) and outputs the number of GenBank sequences associated with each avian species or subspecies ID number.
"SraResultInfo.csv" - CSV file that summarizes the data from each run in the NCBI SRA database associated with an Aves taxon. The 28th column has the GenBank taxon ID associatetd with the SRA run. This information was downloaded from NCBI on August 1, 2021.
"genome_result.txt" - text file with a summary of the genome files in NCBI associated with taxa within Aves. This file was downloaded on September 5, 2021. The taxon name is next to the number of each entry.
"getnames.pl" - Perl script that reads in "genome_result.txt" and outputs a list of avian taxa with genome files and the number of genome files associated with each taxon.
"MacaulayLibrary_MediaSummary_April_2021.csv" – CSV file summarizing Macaulay Library audio recordings and GenBank nucleotide sequences associated with eBird/Clements 2019 names (downloaded April 2021)
"Xeno-canto_MediaSummary_October2020.csv" - CSV file summarizing Xeno-canto audio recordings and GenBank nucleotide sequences associated with eBird/Clements 2019 names (downloaded October 2020)
"GenBank_eBird/Clements2019_taxonomic_reconciliation_12Nov2021.csv" - CSV file reconciling GenBank TaxIDs with eBird/Clements 2019 taxonomy
"TaxonomicReconciliation_IUCNstatus.csv" – CSV file reconciling GenBank TaxIDs with eBird/Clements 2019 taxonomy, with respect to IUCN status
"Taxonomic_reconciliation_related_to_geographic_realm.csv — CSV file with reconciliation status related to geographic realms
"TaxonomicReconciliation_Xeno-canto.csv" – CSV file reconciling eBird/Clements 2019 taxonomy, with Xeno-canto, which uses IOC taxonomy
Villum Fonden, Award: 25925
National Science Foundation, Award: DEB-1655683