NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition
Data files
Jul 09, 2021 version files (1.07 MB total)
NLM-Gene-Annotation-Guidelines.docx (117.68 KB)
NLM-Gene-Corpus.zip (952.18 KB)
Pmidlist.Test.txt (900 B)
Pmidlist.Train.txt (4.05 KB)
Abstract
The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. The NLM-Gene corpus is a high-quality, manually annotated corpus for genes. It covers ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per article, and represents a broader range of species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) than previous gene annotation corpora. NLM-Gene consists of 550 PubMed articles from 156 biomedical journals, doubly annotated by six experienced NLM indexers who were randomly paired for each article to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. Using the new resource, we developed a new gene-finding algorithm based on deep learning that improves on both the precision and the recall of existing tools. The NLM-Gene annotated corpus is freely available at Dryad and at https://www.ncbi.nlm.nih.gov/research/bionlp/. The gene-finding results of applying this tool to the entire PubMed/PMC are freely accessible through our web-based tool PubTator.
Data Selection: Our goal was to identify articles where manual curation is useful for tool improvement, otherwise known as difficult articles, where existing automated tools do not produce accurate results. These articles share the following characteristics: they contain more gene mentions than average; they mention genes from a variety of organisms, often more than one organism per article; they contain ambiguous gene mentions; and they discuss genes in relation to other biomedical topics such as diseases, chemicals, mutations, etc.
The data were doubly annotated in three rounds, until the annotators reached 100% agreement.
The annotation load was distributed so that all annotators annotated a similar number of documents and a similar number of entities. Annotators did not know the identity of their partners until the very end. All pairings were made at the document level, so each annotator was paired with every other annotator. Six annotators were attached to the project from beginning to end.
Inter-annotator agreement (IAA) was measured for Gene ID annotations, since annotators had almost perfect agreement for mention recognition.
IAA was 74% for the first round of annotations, 86% for the second round, and 100% after collaborative discussions.
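The description above does not state the exact formula used to compute IAA. As a rough illustration only, the sketch below assumes one common choice: the Jaccard overlap between the sets of gene identifiers that two annotators assigned to the same document, averaged over documents. The function names and the toy identifiers are hypothetical and are not taken from the corpus or the manuscript.

```python
def id_agreement(ids_a, ids_b):
    """Jaccard overlap of two annotators' gene ID sets for one document.

    Assumption: agreement is measured on sets of identifiers per document;
    the source does not specify the actual IAA formula.
    """
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        # Both annotators found no genes: treat as full agreement.
        return 1.0
    return len(a & b) / len(a | b)


def mean_agreement(pairs):
    """Average the per-document agreement over (ids_a, ids_b) pairs."""
    scores = [id_agreement(a, b) for a, b in pairs]
    return sum(scores) / len(scores)


# Toy example (invented identifiers for two documents).
toy_pairs = [
    ({"672", "7157"}, {"672"}),  # annotators agree on one of two IDs
    ({"1"}, {"1"}),              # annotators agree completely
]
```

Other measures, such as pairwise F1 or a chance-corrected statistic, would be equally reasonable; the manuscript is the place to check which one the reported percentages refer to.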
NLM-Gene is available in BioC XML and has been partitioned into a training set and a test set. The training set consists of 450 articles, and the test set consists of 100 articles.
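A minimal sketch of how the BioC XML annotations might be read and split with the PMID lists, using only the Python standard library. Several details here are assumptions rather than documented facts about the release: that Pmidlist.Train.txt and Pmidlist.Test.txt hold one PMID per line, that each BioC document id is the article's PMID, and that the gene identifier sits in an infon whose key is "NCBI Gene identifier". The embedded XML is a toy example invented for illustration, not corpus data.

```python
import xml.etree.ElementTree as ET

# Assumption: infon key under which the gene identifier is stored.
ID_KEY = "NCBI Gene identifier"

# Toy BioC-style collection (invented for illustration, not corpus data).
SAMPLE_XML = """<collection>
  <document>
    <id>111</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 interacts with TP53.</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <infon key="NCBI Gene identifier">672</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
      <annotation id="1">
        <infon key="type">Gene</infon>
        <infon key="NCBI Gene identifier">7157</infon>
        <location offset="21" length="4"/>
        <text>TP53</text>
      </annotation>
    </passage>
  </document>
  <document>
    <id>222</id>
    <passage>
      <offset>0</offset>
      <text>BRCA1 is a tumor suppressor.</text>
      <annotation id="0">
        <infon key="type">Gene</infon>
        <infon key="NCBI Gene identifier">672</infon>
        <location offset="0" length="5"/>
        <text>BRCA1</text>
      </annotation>
    </passage>
  </document>
</collection>"""


def read_pmids(text):
    """Parse a PMID list, assuming one PMID per line; blank lines are skipped."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def gene_annotations(bioc_xml):
    """Map each document id to a list of (mention text, gene identifier)."""
    root = ET.fromstring(bioc_xml)
    result = {}
    for doc in root.iter("document"):
        mentions = []
        for ann in doc.iter("annotation"):
            infons = {i.get("key"): (i.text or "") for i in ann.findall("infon")}
            mentions.append((ann.findtext("text"), infons.get(ID_KEY)))
        result[doc.findtext("id")] = mentions
    return result


def split_by_pmid(annotations, train_pmids, test_pmids):
    """Partition the annotation map into training and test subsets."""
    train = {k: v for k, v in annotations.items() if k in train_pmids}
    test = {k: v for k, v in annotations.items() if k in test_pmids}
    return train, test


def per_article_stats(annotations):
    """Return (mean mentions per article, mean unique identifiers per article)."""
    n_docs = len(annotations)
    mentions = sum(len(v) for v in annotations.values())
    unique_ids = sum(len({gid for _, gid in v if gid}) for v in annotations.values())
    return mentions / n_docs, unique_ids / n_docs
```

For the real files, the XML string would come from the unpacked NLM-Gene-Corpus.zip and the PMID text from the two Pmidlist files; the infon key should be checked against an actual document before relying on it.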
For annotation details, please refer to the annotation guidelines. For the methodology, gene recognition results, corpus characteristics, and further details, please refer to the manuscript.
We believe this resource can be of significant value to researchers in both the life sciences and informatics communities. Specifically, people involved in data curation and biomedical tool development will find this corpus very useful.
The corpus can be used in combination with the GNorm+ corpus and the BioCreative gene-annotated corpora to create a richer dataset. NLM-Gene, being richer in the number of species and more complex in terms of bio-entities, should provide an invaluable resource for testing hard-to-predict cases and for building algorithms that address harder named entity recognition issues.
NLM-Gene consists of 550 PubMed articles, from 156 journals, and contains more than 15 thousand unique gene names, corresponding to more than five thousand gene identifiers (NCBI Gene taxonomy). This corpus contains gene annotation data from 28 organisms. The annotated articles contain on average 29 gene names, and 10 gene identifiers per article. These characteristics demonstrate that this article set is an important benchmark dataset to test the accuracy of gene recognition algorithms both on multi-species and ambiguous data. The NLM-Gene corpus will be invaluable for advancing text-mining techniques for gene identification tasks in biomedical text.
To achieve robust gene entity recognition that could translate to real-life applications, we upgraded the GNormPlus system with a deep learning component for named entity recognition and with several features that improve the accuracy of species recognition and the detection of false-positive predictions. The new results are superior: the system identifies genes in the NLM-Gene test dataset at a level close to human inter-annotator agreement. The tool has been streamlined to process all PubMed articles in daily updates: https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/.