Background: Attribution to the original contributor upon reuse of published data is important both as a reward for data creators and to document the provenance of research findings. Previous studies have found that papers with publicly available datasets receive a higher number of citations than similar studies without available data. However, few previous analyses have had the statistical power to control for the many variables known to predict citation rate, which has led to uncertain estimates of the "citation benefit". Furthermore, little is known about patterns in data reuse over time and across datasets.

Method and Results: Here, we look at citation rates while controlling for many known citation predictors, and investigate the variability of data reuse. In a multivariate regression on 10,555 studies that created gene expression microarray data, we found that studies that made data available in a public repository received 9% (95% confidence interval: 5% to 13%) more citations than similar studies for which the data was not made available. Date of publication, journal impact factor, open access status, number of authors, first and last author publication history, corresponding author country, institution citation history, and study topic were included as covariates. The citation benefit varied with date of dataset deposition: a citation benefit was most clear for papers published in 2004 and 2005, at about 30%. Authors published most papers using their own datasets within two years of their first publication on the dataset, whereas data reuse papers published by third-party investigators continued to accumulate for at least six years. To study patterns of data reuse directly, we compiled 9,724 instances of third-party data reuse via mention of GEO or ArrayExpress accession numbers in the full text of papers. The level of third-party data use was high: for 100 datasets deposited in year 0, we estimated that 40 papers in PubMed reused a dataset by year 2, 100 by year 4, and more than 150 data reuse papers had been published by year 5. Data reuse was distributed across a broad base of datasets: a very conservative estimate found that 20% of the datasets deposited between 2003 and 2007 had been reused at least once by third parties.

Conclusion: After accounting for other factors affecting citation rate, we find a robust citation benefit from open data, although a smaller one than previously reported. We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data. Other factors that may also contribute to the citation benefit are considered. We further conclude that, at least for gene expression microarray data, a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.
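The reuse figures quoted in the abstract are stated per 100 deposited datasets. As a small worked example, the sketch below restates them as an average number of reuse papers per dataset. The variable and function names are illustrative only and do not come from the study's scripts; the year-5 value is a lower bound, since the abstract says "more than 150".

```python
# Cumulative third-party reuse papers per 100 datasets deposited in year 0,
# as quoted in the abstract. The year-5 entry is a lower bound ("more than 150").
cumulative_reuse_per_100 = {2: 40, 4: 100, 5: 150}


def reuse_per_dataset(cumulative_per_100):
    """Convert 'papers per 100 datasets' into 'papers per dataset'."""
    return {year: papers / 100 for year, papers in cumulative_per_100.items()}


per_dataset = reuse_per_dataset(cumulative_reuse_per_100)
for year, rate in sorted(per_dataset.items()):
    print(f"by year {year}: about {rate:.1f} reuse papers per deposited dataset")
```

These are averages over all deposits. They do not imply that every dataset is reused: the abstract's conservative estimate is that 20% of datasets deposited between 2003 and 2007 had been reused at least once by third parties.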

manuscript source file (knitr)

The source file for running the statistics and inserting the results into the manuscript text. The manuscript does not include final wording revisions made in the peer review process. See the "about this doc" section for instructions on running this R knitr markdown file to produce stats.md and stats.R files.

stats_knit_.md

manuscript compiled file, with stats

This file is the result of running stats_knitr_.md through knitr. It does not include changes made during the peer review process.

stats.md

citations

Citations used by stats_knitr_.md to build the references list.

citation11k.bib

helpers.r

Helper R functions used by stats_knitr_.md to run the statistics.

helpers.R

preprocess_raw_data.r

Helper R functions used by stats_knitr_.md to preprocess data before running the statistics.

preprocess_raw_data.R

pubmed_gse_count.csv

The number of GSE data sets added to the NCBI's GEO repository each year, 2000-2011.

pubmed_pmc_ratios.csv

The fraction of PubMed articles in PMC that are indexed with the MeSH term “gene expression profiling”, by year of publication (2000-2011), as measured in 2012.

PLoSONE2011_rawdata.txt

Data from Piwowar HA (2011) Data from: Who shares? Who doesn’t? Factors associated with openly archiving raw research data. Dryad Digital Repository. doi:10.5061/dryad.mf1sd. Reproduced here to make it easy to rerun scripts.

scopus_all.csv

Scopus citation data for publications in the cohort.

GEO_dataset_attributes.csv

One row for every GEO dataset reuse detected by searching PMC for GEO accession numbers. Columns list the GEO accession, gse number, gds number, related submit_pmids, identified reuse_pmcid, the reuse_pmids_for_pmc, the submission authors, the reuse authors, whether the submission authors and the reuse authors overlap, the submission affiliation, the release date, columns to detect reuse and data creation keywords, excerpts around the accession numbers when available (that is, when the reuse paper was OA), the reuse journal, year, and date_published, the medline_status, whether the reuse was listed on the NCBI GEO reuse webpage, whether the reuse is OA, and whether it is listed as a meta-analysis by MEDLINE.
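As a minimal sketch of how this table could be used, the example below counts third-party reuse rows per year, keeping only rows where the submission and reuse authors do not overlap. It is written in Python for illustration and is not part of the study's R scripts. The header names (`gse`, `reuse_pmcid`, `authors_overlap`, `year`), the `0`/`1` encoding of the overlap flag, and the sample rows are all assumptions; check the actual CSV header before adapting it.

```python
import csv
import io
from collections import Counter

# Made-up rows for illustration only. The real file's header names and value
# encodings may differ from the hypothetical ones used here.
SAMPLE = """gse,reuse_pmcid,authors_overlap,year
GSE0001,PMC0000001,0,2008
GSE0001,PMC0000002,1,2008
GSE0002,PMC0000003,0,2009
"""


def third_party_reuse_by_year(handle):
    """Count reuse rows per year where submitters and reusers do not overlap."""
    counts = Counter()
    for row in csv.DictReader(handle):
        # Assumed encoding: "0" means no author overlap (third-party reuse).
        if row["authors_overlap"] == "0":
            counts[int(row["year"])] += 1
    return counts


counts = third_party_reuse_by_year(io.StringIO(SAMPLE))
print(sorted(counts.items()))
```

To run this against the real file, pass an open file handle for GEO_dataset_attributes.csv instead of the in-memory sample, after replacing the assumed column names with the real ones.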

Mendeley_annotated_250_of_11k.csv

Manual annotation of a random sample of 250 papers drawn from the 10,555 papers in the study. The manual review determined whether each study did indeed generate gene expression microarray data. Rows labeled "created-microarray-data" were identified as generating microarray data; rows labeled "created-microarray-data-not" were identified as not actually generating gene expression microarray data, despite being flagged as such by our automated filter.
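The two labels above are enough to estimate how often the automated filter was right. The sketch below, in Python for illustration, computes that proportion from a list of labels. The labels are compared exactly because "created-microarray-data" is a prefix of "created-microarray-data-not", so a substring search would count negative rows as positive. How the labels are stored in the CSV is not described here, so the example takes a plain list, and its values are placeholders rather than the real annotation results.

```python
from collections import Counter

POSITIVE = "created-microarray-data"
NEGATIVE = "created-microarray-data-not"


def filter_precision(labels):
    """Share of manually reviewed papers confirmed to have created microarray data."""
    counts = Counter(labels)  # exact-match counting, never substring matching
    reviewed = counts[POSITIVE] + counts[NEGATIVE]
    if reviewed == 0:
        raise ValueError("no annotated rows found")
    return counts[POSITIVE] / reviewed


# Placeholder labels for illustration; these are not the study's results.
example_labels = [POSITIVE, POSITIVE, NEGATIVE, POSITIVE]
precision = filter_precision(example_labels)
print(f"confirmed: {precision:.2f}")
```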

tracking1k_20111008.csv

Manually annotated citation contexts for citations to papers that created publicly available datasets. This study uses the subset related to GEO data: "dataset reused" means that the citation was judged to occur in the context of data reuse.

Data from: Data reuse and the open data citation advantage
