Leveraging metrics to drive data sharing at the Science journals
Data files
Dec 24, 2025 version, files total 863.58 KB
AAAS_Open_Science_Metrics_data_2021_to_2024.csv (850.01 KB)
README.md (12.66 KB)
Summary_data_Science_and_comparators_.csv (909 B)
Abstract
For the scientific research published in Science to be accessible, it is important that the data, methods, results, and code are transparently reported and openly shared. Science has policies on material, data, and code sharing that support our goals of transparency and openness. These policies require that all papers have a data availability statement and that all data and code be available in the paper or deposited in a permanent public repository. Exceptions are made, for example, if there are security concerns or to protect personal privacy. In 2024, Science partnered with the company DataSeer to determine the extent to which our research papers share data and code. DataSeer uses natural language processing (NLP) to measure a number of Open Science Indicators in published articles. The dataset “AAAS Open Science Metrics data 2021 to 2024” provides article-level data for 2680 Science papers published between 2021 and 2024. This includes article metadata, such as the DOI and publication date, as well as the first listed country for the first author (obtained from OpenAlex), data on whether data and code were generated and whether and how they were shared, and data on whether the paper was preprinted (based on fuzzy matching of the article title and authors against articles from major preprint servers). We used this dataset to calculate aggregate data for data and code sharing, as well as whether data was shared in a repository. All papers had a data availability statement; 69% of papers shared data in a repository, online, or in a supplementary data table (6% of papers did not generate or share data); and 23% of papers shared code (46% of papers did not generate or share code). We compared the Science aggregate data with publicly available data from the Public Library of Science (PLOS) and the academic publisher Taylor and Francis.
Summary data are provided in the file “Summary data Science and comparators.” This file gives the total number of publications for each source, the number sharing data overall (in a repository, online, or in the supplementary material), the number sharing data in a repository, and the number generating and sharing code. Overall data sharing was 69% for Science, 74% for PLOS, and 24% for Taylor and Francis, whereas data shared in a repository was 56% for Science, 26% for PLOS, and 11% for Taylor and Francis. For the papers that generated code, code sharing was at 41% for Science, 29% for PLOS, and 8% for Taylor and Francis. This provides a baseline as we implement policies and processes to further improve data and code sharing.
Dataset DOI: 10.5061/dryad.zkh1893qt
Description of the data and file structure
Science has policies on material, data, and code sharing that support our goals of transparency and openness, but we lack data on compliance. In 2024, Science partnered with the company DataSeer to determine the extent to which our research papers share data and code. DataSeer uses natural language processing (NLP) to measure a number of Open Science Indicators in published articles. We compared data for Science with publicly available data for the Public Library of Science (PLOS) and Taylor and Francis.
Files and variables
File: AAAS_Open_Science_Metrics_data_2021_to_2024.csv
Description:
The dataset AAAS_Open_Science_Metrics_data_2021_to_2024.csv was obtained by the company DataSeer. Science provided DataSeer with the XML for 2680 research papers, all of which were Research Articles and Reports published in Science between 2021 and 2024. DataSeer used natural language processing to quantify Open Science Metrics across these articles. Information on DOI, publication date, and data and code generation and sharing was obtained from the XML.
Data Sharing: Articles sharing data as supplemental material were identified by locating the list of supplemental files in the full text and finding terms like ‘dataset’ or ‘data file’ in either the supplemental file name or description. Articles sharing data online contained a) links to data repositories (e.g. Zenodo or Dryad) or other online locations (e.g. GitHub), b) DOIs or other Permanent Identifiers (PIDs) for data repositories, or c) accession numbers associated with data repositories. When an article was found to share data both in the supplemental material and online, it was classified as “online”. Online repositories were identified through their DOIs or other PIDs, the site linked to by a dataset URL, or the characteristic format of an accession number. To identify articles producing shareable code objects, the full text was searched for the names of command line software or programming environments. Articles mentioning one or more software packages or environments were classified as ‘generating code’.
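The online-sharing indicators described above (URLs, DOIs, and accession numbers) can be illustrated with a small pattern-matching sketch. This is not DataSeer's implementation, which is not described in this README. The regular expressions, the function name `find_sharing_evidence`, and the two accession formats shown (GEO series and BioProject identifiers) are assumptions chosen for illustration, and the accession `GSE123456` in the example text is made up.

```python
import re

# Illustrative patterns only; these are assumptions, not DataSeer's rules.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
URL_PATTERN = re.compile(r"https?://[^\s\"<>)]+")
# Two example accession formats (assumed for illustration).
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"\bGSE\d+\b"),
    "BioProject": re.compile(r"\bPRJ(?:NA|EB|DB)\d+\b"),
}


def _strip_trailing_punctuation(value):
    """Drop sentence punctuation that the greedy patterns may pick up."""
    return value.rstrip(".,;)")


def find_sharing_evidence(text):
    """Return the DOIs, URLs, and accession numbers found in a block of text."""
    accessions = [
        (name, match)
        for name, pattern in ACCESSION_PATTERNS.items()
        for match in pattern.findall(text)
    ]
    return {
        "dois": [_strip_trailing_punctuation(m) for m in DOI_PATTERN.findall(text)],
        "urls": [_strip_trailing_punctuation(m) for m in URL_PATTERN.findall(text)],
        "accessions": accessions,
    }


# Made-up data availability text; only the Dryad DOI comes from this README.
example = (
    "Sequencing data are available at GEO under accession GSE123456. "
    "Processed tables are deposited at Dryad (https://doi.org/10.5061/dryad.zkh1893qt)."
)
evidence = find_sharing_evidence(example)
```

A real pipeline would also need a curated list of repositories to decide which URLs count as repository deposits, as noted for the Data_Sharing_Repositories variable.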
Code Sharing: To detect code sharing, DataSeer searched the full text for a) links or b) DOIs (or other PIDs) for repositories where code objects could be shared. They also checked the supplemental file titles and descriptions for terms like ‘script’ or ‘code’ to detect code sharing in the supplemental material. Articles found to share code both in the supplemental material and online were classified as “online”.
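The precedence rule used for both data and code (online wins when an article shares in both places) can be written as a short function. This is a minimal sketch: the function name and the “Supplementary” label are assumptions, since this README only quotes the “Online” and “NA” labels.

```python
def classify_location(shared_online, shared_in_supplement):
    """Apply the stated precedence: online sharing overrides supplemental sharing."""
    if shared_online:
        return "Online"
    if shared_in_supplement:
        return "Supplementary"  # assumed label; not confirmed by this README
    return "NA"  # no evidence of sharing
```

For example, `classify_location(True, True)` returns `"Online"`, matching the rule that articles sharing in both locations are counted as online.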
DataSeer also provided information on the first listed country for the first listed author from OpenAlex. They also assessed whether articles were posted as a preprint by performing a fuzzy match of the article title and authors against a listing of articles from all of the major preprint servers. Preprints with a posting date on or after the article publication date (also known as postprints) were excluded.
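The preprint matching and postprint exclusion described above can be sketched as follows. The README does not say which fuzzy-matching algorithm or threshold DataSeer used, so this example substitutes Python's standard-library `difflib.SequenceMatcher`, a 0.9 similarity threshold, and a shared-surname check, all of which are assumptions. The records are invented for illustration.

```python
from datetime import date
from difflib import SequenceMatcher


def normalize(title):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(title.lower().split())


def title_similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized titles."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_preprint_match(article, candidate, threshold=0.9):
    """Return True if the candidate looks like a preprint of the article.

    The threshold and the author-overlap rule are assumptions for this sketch.
    """
    # Exclude postprints: posted on or after the article publication date.
    if candidate["posted"] >= article["published"]:
        return False
    # Require at least one author surname in common.
    if not set(article["authors"]) & set(candidate["authors"]):
        return False
    return title_similarity(article["title"], candidate["title"]) >= threshold


# Invented records for illustration only.
article = {
    "title": "A hypothetical study of protein folding",
    "authors": ["Smith", "Jones"],
    "published": date(2023, 6, 1),
}
preprint = {
    "title": "A Hypothetical  Study of Protein Folding",
    "authors": ["Smith", "Lee"],
    "posted": date(2022, 12, 15),
}
postprint = dict(preprint, posted=date(2023, 6, 1))
unrelated = dict(preprint, title="An unrelated survey of soil microbes")
```

Here `preprint` matches, `postprint` is rejected because its posting date is not earlier than the publication date, and `unrelated` is rejected on title similarity.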
Variables
DOI: Digital Object Identifier assigned to a Science article, obtained from article XML
Field: Classification of the field of study, obtained from OpenAlex
Subfield: Classification of the subfield of study obtained from OpenAlex
First_Author_Country: The first listed country of the first listed author, obtained from OpenAlex
Publication_Day: The day of the month on which the article was published in Science, obtained from article XML
Publication_Month: The month in which the article was published in Science, obtained from article XML.
Publication_Year: The year in which the article was published in Science, obtained from article XML.
Data_Generated: States whether data was generated, as assessed by the DataSeer algorithm scanning the article XML. Labeled “Yes” or “No”.
Data_Section_Text_Generated: The sections of the research article that indicated data was generated, as assessed by the DataSeer algorithm scanning the article XML. The label DA is the Data Availability Statement. The label NA indicates that the DataSeer algorithm did not find evidence that new data was generated.
Data_Shared: States whether data was shared, as assessed by the DataSeer algorithm scanning the XML. Labeled “Yes” or “No”.
Data_Section_Text_Shared: The sections of the research article that indicated data was shared, as assessed by the DataSeer algorithm scanning the article XML. The label DA is the Data Availability Statement. The label NA indicates that the DataSeer algorithm did not find evidence that data was shared.
Data_Location: Indicates whether the data was shared online or in the supplementary material, as determined by the DataSeer algorithm scanning the XML. Online data includes data in repositories; articles sharing data both online and in the supplementary material were classified as “Online”.
Data_Sharing_Accessions: Accession numbers associated with repositories found in the XML.
Data_Sharing_URLs: URLs indicating where data is shared found in the XML. This includes URLs for repositories, but also other online locations.
Data_Sharing_DOIs: DOIs associated with shared datasets found in the XML.
Data_Sharing_Repositories: Data repositories identified by DataSeer in the XML. Only repositories from a standard list are considered repositories.
DAS (Data Availability Statement): DataSeer extracted the text of the DAS, but in this file only the header “Data and Material Availability” was retained to avoid including personal emails in the dataset.
Code_Generated: States whether code was generated, as assessed by the DataSeer algorithm applied to the XML. Labeled “Yes” or “No”.
Code_Section_Text_Generated: The sections of the research article that indicated code was generated, as assessed by the DataSeer algorithm scanning the article XML. The label DA is the Data Availability Statement. The label NA indicates that the DataSeer algorithm did not find evidence that new code was generated.
Code_Shared: States whether code was shared, as assessed by the DataSeer algorithm scanning the XML. Labeled “Yes” or “No”.
Code_Section_Text_Shared: The sections of the research article that indicated code was shared, as assessed by the DataSeer algorithm scanning the article XML. The label DA is the Data Availability Statement. The label NA indicates that the DataSeer algorithm did not find evidence that code was shared.
Code_Location: Indicates whether the code was shared online or in the supplementary material, as determined by the DataSeer algorithm scanning the XML. The label NA indicates that the DataSeer algorithm did not find evidence that code was shared.
Code_Sharing_URLs: URLs found in the XML indicating where code is shared.
Code_Sharing_Repositories: Repositories identified by DataSeer in the XML. Only repositories from a standard list are considered repositories.
Preprint_Match: States whether DataSeer identified a preprint for the article by performing a fuzzy match of the article title and authors against a listing of articles from all of the major preprint servers. Labeled “Yes” or “No”.
Preprint_DOI: DOI of identified preprint match obtained from the preprint metadata.
Preprint_Day: Day of preprint posting.
Preprint_Month: Month of preprint posting.
Preprint_Year: Year of preprint posting.
Preprint_URL: URL to access the preprint.
Preprint_Server: The server where the preprint is hosted.
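As a starting point for working with the article-level file, the sketch below reads rows with Python's standard `csv` module and combines the three publication columns into a single date. It assumes that Publication_Day, Publication_Month, and Publication_Year are stored as integers, which this README does not state. The two sample rows and their DOIs are invented placeholders, not records from the dataset.

```python
import csv
import io
from datetime import date

# Invented sample rows; column names follow the variable list above.
SAMPLE_CSV = """DOI,Publication_Day,Publication_Month,Publication_Year,Data_Shared,Data_Location
10.1126/science.example1,5,3,2022,Yes,Online
10.1126/science.example2,17,11,2023,No,NA
"""


def read_articles(handle):
    """Read article rows and add a Publication_Date built from day, month, year.

    Assumes the three date columns hold integers (an assumption of this sketch).
    """
    rows = []
    for row in csv.DictReader(handle):
        row["Publication_Date"] = date(
            int(row["Publication_Year"]),
            int(row["Publication_Month"]),
            int(row["Publication_Day"]),
        )
        rows.append(row)
    return rows


articles = read_articles(io.StringIO(SAMPLE_CSV))
# DOIs of articles that the Data_Shared column marks as sharing data.
shared = [row["DOI"] for row in articles if row["Data_Shared"] == "Yes"]
```

To read the real file, replace `io.StringIO(SAMPLE_CSV)` with `open("AAAS_Open_Science_Metrics_data_2021_to_2024.csv", newline="", encoding="utf-8")`; the encoding is an assumption. The `csv` module keeps the label “NA” as a literal string, whereas pandas converts “NA” to a missing value by default, which matters for any comparison against “NA”.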
File: Summary_data_Science_and_comparators_.csv
Description:
The file Summary_data_Science_and_comparators_.csv provides aggregate data on data and code sharing calculated from the DataSeer article-level data. Similar aggregate data was calculated from publicly available datasets from the Public Library of Science (PLOS) and Taylor and Francis.
Variables
Summary data: Each row indicates which value is calculated (details below).
Total: The aggregate value for each row (details below).
Science No. of publications: Number of articles provided to Dataseer by Science.
Science No. sharing data overall: Articles are counted as sharing data if the “Data_Shared” column in “AAAS Open Science Metrics data 2021 to 2024” contains “Yes”.
Science % sharing data overall = (Science No. sharing data overall / Science No. of publications) x 100
Science No. sharing data in a repository: Articles are counted as having data shared in a repository if the “Data_Sharing_Repositories” column in “AAAS Open Science Metrics data 2021 to 2024” does not contain “NA”.
Science % sharing data in a repository = (Science No. sharing data in a repository / Science No. of publications) x 100
PLOS No. of publications: Number of DOI entries in the file “PLOS-Dataset_v10_Jul25.csv” available at https://plos.figshare.com/articles/dataset/PLOS_Open_Science_Indicators/21687686
PLOS No. sharing data overall: Articles are counted as sharing data if the “Data_Shared” column in “PLOS-Dataset_v10_Jul25.csv” contains “TRUE”.
PLOS % sharing data overall = (PLOS No. sharing data overall / PLOS No. of publications) x 100
PLOS No. sharing data in a repository: Articles are counted as having data shared in a repository if the “Repositories Data” column in “PLOS-Dataset_v10_Jul25.csv” does not contain “NA”.
PLOS % sharing data in a repository = (PLOS No. sharing data in a repository / PLOS No. of publications) x 100
T&F No. of publications: Number of DOI entries in the dataset made available by Taylor and Francis at https://doi.org/10.6084/m9.figshare.30316342
T&F No. sharing data overall: Articles are counted as having data shared in a repository if the “Data_Repositories” column in the T&F dataset does not contain “NA”.
T&F % sharing data overall = (T&F No. sharing data overall / T&F No. of publications) x 100
T&F No. sharing data in a repository: Articles are counted as having data shared in a repository if the “Data_Repositories” column in the T&F dataset does not contain “NA”.
T&F % sharing data in a repository = (T&F No. sharing data in a repository / T&F No. of publications) x 100
Science No. generating code: Articles are counted if the “Code_Generated” column in “AAAS Open Science Metrics data 2021 to 2024” contains “Yes”.
Science No. sharing generated code: Articles are counted if the “Code_Generated” column contains “Yes” and the “Code_Shared” column contains “Yes” in “AAAS Open Science Metrics data 2021 to 2024”.
Science % sharing generated code = (Science No. sharing generated code / Science No. generating code) x 100
PLOS No. generating code: Articles are counted if the “Code_Generated” column in “PLOS-Dataset_v10_Jul25.csv” contains “TRUE”.
PLOS No. sharing generated code: Articles are counted if the “Code_Generated” column contains “TRUE” and the “Code_Shared” column contains “TRUE” in “PLOS-Dataset_v10_Jul25.csv”.
PLOS % sharing generated code = (PLOS No. sharing generated code / PLOS No. generating code) x 100
T&F No. generating code: Articles are counted if the “Code_Generated” column in the T&F dataset contains “Yes”.
T&F No. sharing generated code: Articles are counted if the “Code_Generated” column contains “Yes” and the “Code_Shared” column contains “Yes” in the T&F dataset.
T&F % sharing generated code = (T&F No. sharing generated code / T&F No. generating code) x 100
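The Science rows of the summary file follow directly from the counting rules above. The summary values in this dataset were calculated in Excel (see Code/software); the Python sketch below only restates the same formulas. The function names are hypothetical and the four sample rows are invented, so the resulting percentages are not the published figures.

```python
def percent(numerator, denominator):
    """Return numerator / denominator x 100, or NaN if the denominator is zero."""
    return 100.0 * numerator / denominator if denominator else float("nan")


def summarize(rows):
    """Apply the Science counting rules to article-level rows (dicts of strings)."""
    total = len(rows)
    sharing_data = sum(r["Data_Shared"] == "Yes" for r in rows)
    in_repository = sum(r["Data_Sharing_Repositories"] != "NA" for r in rows)
    generating_code = sum(r["Code_Generated"] == "Yes" for r in rows)
    sharing_code = sum(
        r["Code_Generated"] == "Yes" and r["Code_Shared"] == "Yes" for r in rows
    )
    return {
        "No. of publications": total,
        "% sharing data overall": percent(sharing_data, total),
        "% sharing data in a repository": percent(in_repository, total),
        # Code sharing uses articles that generated code as the denominator.
        "% sharing generated code": percent(sharing_code, generating_code),
    }


# Invented rows for illustration only.
sample_rows = [
    {"Data_Shared": "Yes", "Data_Sharing_Repositories": "Dryad",
     "Code_Generated": "Yes", "Code_Shared": "Yes"},
    {"Data_Shared": "Yes", "Data_Sharing_Repositories": "Zenodo",
     "Code_Generated": "Yes", "Code_Shared": "No"},
    {"Data_Shared": "Yes", "Data_Sharing_Repositories": "NA",
     "Code_Generated": "No", "Code_Shared": "No"},
    {"Data_Shared": "No", "Data_Sharing_Repositories": "NA",
     "Code_Generated": "No", "Code_Shared": "No"},
]
summary = summarize(sample_rows)
```

Note the different denominators: the data-sharing percentages divide by all publications, while the code-sharing percentage divides only by articles that generated code.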
Code/software
Microsoft® Excel® for Microsoft 365 MSO (Version 2510 Build 16.0.19328.20244) 64-bit
Access information
Data was derived from the following sources:
