Dryad

Data from: Automatic recognition of self-acknowledged limitations in clinical research literature

Cite this dataset

Kilicoglu, Halil; Rosemblat, Graciela; Malički, Mario; ter Riet, Gerben (2019). Data from: Automatic recognition of self-acknowledged limitations in clinical research literature [Dataset]. Dryad. https://doi.org/10.5061/dryad.06ds7

Abstract

Objective: To automatically recognize self-acknowledged limitations in clinical research publications to support efforts in improving research transparency.

Materials and Methods: To develop our recognition methods, we used a set of 8,431 sentences from 1,197 PubMed Central articles. A subset of these sentences was manually annotated for training/testing and inter-annotator agreement was calculated. We cast the recognition problem as a binary classification task, in which we determine whether a given sentence from a publication discusses self-acknowledged limitations or not. We experimented with three methods: a rule-based approach based on document structure, supervised machine learning, and a semi-supervised method that uses self-training to expand the training set in order to improve classification performance. The machine learning algorithms used were logistic regression (LR) and support vector machines (SVM).

Results: Annotators had good agreement in labeling limitation sentences (Krippendorff’s α=0.781). Of the three methods used, the rule-based method yielded the best performance with 91.5% accuracy (95% CI [90.1-92.9]), while self-training with SVM led to a small improvement over fully supervised learning (89.9%, 95% CI [88.4-91.4] vs. 89.6%, 95% CI [88.1-91.1]).

Discussion: We attribute the effectiveness of the rule-based method to the highly localized and formulaic language used in reporting of limitations in clinical research publications. Experiments with training size and composition show that more data does not necessarily lead to higher accuracy in the machine learning-based approaches.

Conclusion: The approach presented can be incorporated into the workflows of stakeholders focusing on research transparency to improve reporting of limitations in clinical studies.
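
The self-training idea mentioned in the abstract (train on labeled sentences, label the unlabeled pool, and fold confident predictions back into the training set) can be illustrated with a minimal sketch. This is not the authors' implementation: a small naive Bayes classifier stands in for the LR and SVM models used in the study, and the function names, the 0.9 confidence threshold, and the round limit are all assumptions made for illustration. Label 1 means "limitation sentence" and label 0 means "other".

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return sentence.lower().split()


def train_nb(labeled):
    """Fit a tiny multinomial naive Bayes model on (sentence, label) pairs."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for sentence, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(tokenize(sentence))
    if class_counts[0] == 0 or class_counts[1] == 0:
        raise ValueError("the labeled set must contain both classes")
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def prob_limitation(model, sentence):
    """Return P(label = 1 | sentence) using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in (0, 1):
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total)
        for tok in tokenize(sentence):
            if tok in vocab:  # tokens unseen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    e0 = math.exp(log_scores[0] - top)
    e1 = math.exp(log_scores[1] - top)
    return e1 / (e0 + e1)


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Expand the labeled set with confidently predicted pool sentences.

    Returns (final model, expanded labeled set, sentences left unlabeled).
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, remaining = [], []
        for sentence in pool:
            p = prob_limitation(model, sentence)
            if p >= threshold:
                confident.append((sentence, 1))
            elif p <= 1 - threshold:
                confident.append((sentence, 0))
            else:
                remaining.append(sentence)
        if not confident:  # nothing new to learn from; stop early
            break
        labeled.extend(confident)
        pool = remaining
    return train_nb(labeled), labeled, pool
```

A call such as `self_train(seed_pairs, unlabeled_sentences)` returns the retrained model together with the expanded training set. Because pseudo-labels can be wrong, a larger expanded set is not guaranteed to classify better, which is consistent with the abstract's remark that more data does not necessarily lead to higher accuracy.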

Usage notes