Data for: Sustainable connectivity in a community repository
Data files
Dec 07, 2023 version files 169.52 MB
Identifiers of many kinds are the key to creating unambiguous and persistent connections between research objects and other items in the global research infrastructure (GRI). Many repositories are implementing mechanisms to collect and integrate these identifiers into their submission and record curation processes. This bodes well for a well-connected future, but many existing resources submitted in the past are missing these identifiers, thus missing the connections required for inclusion in the connected infrastructure. Re-curation of these metadata is required to make these connections.
The Dryad Data Repository has existed since 2008 and has successfully re-curated the repository metadata several times, adding identifiers for research organizations, funders, and researchers. Understanding and quantifying these successes depends on measuring repository and identifier connectivity. Metrics are described and applied to the entire repository here.
Identifiers for papers (DOIs) connected to datasets in Dryad have long been a critical part of the Dryad metadata creation and curation processes. Since 2019, the percentage of datasets with connected papers has decreased from 100% to less than 40%. This decrease has significant ramifications for the re-curation efforts described above, as connected papers are an important source of metadata. In addition, missing connections to papers make understanding and re-using datasets more difficult.
Connections between datasets and papers are often difficult to make because of time lags between submission and publication, the lack of clear mechanisms for citing datasets and other research objects from papers, the changing focus of researchers, and other obstacles. The Dryad community of members (users, research institutions, publishers, and funders) has a vested interest in identifying these connections and a critical role in the curation and re-curation efforts. Their engagement will be critical in building on the successes Dryad has already achieved and ensuring sustainable connectivity in the future.
README: Data For: Sustainable Connectivity in a Community Repository
This readme.txt file was generated on 20231110 by Ted Habermann
Title of Dataset
Data For: Sustainable Connectivity in a Community Repository
Author Information
Principal Investigator Contact Information
Name: Ted Habermann (0000-0003-3585-6733)
Institution: Metadata Game Changers
ORCID: 0000-0003-3585-6733
Date published or finalized for release:
November 10, 2023
Date of data collection (single date, range, approximate date)
May and June 2023
Information about funding sources that supported the collection of the data:
National Science Foundation (Crossref Funder ID: 100000001) Award 2134956.
Overview of the data (abstract):
These data are metadata retrieved from Dryad and translated into csv files. There are two datasets:
- DryadJournalDataset was retrieved from Dryad using the ISSNs in the file DryadJournalDataset_ISSNs.txt, although some had no data.
- DryadOrganizationDataset was retrieved from Dryad using the RORs in the file DryadOrganizationDataset_RORs.txt, although some had no data.
Each dataset includes four types of metadata: identifiers, funders, keywords, and related works, each in a separate comma-delimited (.csv) or tab-delimited (.tsv) file. There are also Microsoft Excel files (.xlsx) for the identifier metadata and connectivity summaries (*.html) for each dataset. The connectivity summaries describe each parameter in all four data files with definitions, counts, unique counts, most frequent values, and completeness.
These data formed the basis for an analysis of the connectivity of the Dryad repository for organizations, funders, and people.
Size | FileName |
90541505 | DryadJournalDataset_Identifiers__20230520_12.csv |
9017051 | DryadJournalDataset_funders__20230520_12.tsv |
29108477 | DryadJournalDataset_keywords__20230520_12.tsv |
8833842 | DryadJournalDataset_relatedWorks__20230520_12.tsv |
18260935 | DryadOrganizationDataset_funders__20230601_12.tsv |
240128730 | DryadOrganizationDataset_identifiers__20230601_12.tsv |
39600659 | DryadOrganizationDataset_keywords__20230601_12.tsv |
11520475 | DryadOrganizationDataset_relatedWorks__20230601_12.tsv |
40726143 | DryadJournalDataset_identifiers__20230520_12.xlsx |
81894301 | DryadOrganizationDataset_identifiers__20230601_12.xlsx |
842827 | DryadJournalDataset_ConnectivitySummary.html |
387551 | DryadOrganizationDataset_ConnectivitySummary.html |
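The delimited files listed above can be read with the Python standard library by choosing the delimiter from the file extension. The following is a minimal sketch, assuming each file starts with a header row; the helper names `delimiter_for` and `read_rows` and the sample values are hypothetical and are not part of the dataset or of summarizeConnectivity.ipynb.

```python
import csv
import io
from pathlib import Path

# Delimiters implied by the two text extensions named in this README.
DELIMITERS = {".csv": ",", ".tsv": "\t"}


def delimiter_for(filename):
    """Return the field delimiter implied by a file's extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in DELIMITERS:
        raise ValueError(f"unsupported extension: {suffix!r}")
    return DELIMITERS[suffix]


def read_rows(handle, filename):
    """Read delimited text from an open handle into a list of dicts."""
    reader = csv.DictReader(handle, delimiter=delimiter_for(filename))
    return list(reader)


# In-memory stand-in for a funders file; the values are made up.
sample = io.StringIO(
    "DOI\torganization\tawardNumber\n"
    "10.5061/dryad.example\tExample Funder\t12345\n"
)
rows = read_rows(sample, "DryadJournalDataset_funders__20230520_12.tsv")
print(rows[0]["organization"])
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and may need adjusting.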
Field Definitions
Licenses/restrictions placed on the data:
Creative Commons Public Domain License (CC0)
Links to publications that cite or use the data:
Was data derived from another source?
File List
A. *Dataset_identifiers__YYYYMMDD_HH.*sv:
Short description: Identifier metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
B. *Dataset_funders__YYYYMMDD_HH.*sv:
Short description: Funder metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
C. *Dataset_keywords__YYYYMMDD_HH.*sv:
Short description: Keyword metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
D. *Dataset_relatedWorks__YYYYMMDD_HH.*sv:
Short description: Related work metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
E. *Dataset_identifiers__YYYYMMDD_HH.xlsx:
Short description: Excel spreadsheet with identifier metadata from Dryad for Dataset collected at YYYYMMDD_HH using the Dryad API.
F. *Dataset_ConnectivitySummary.html:
Short description: Connectivity summary for Dataset.
G. summarizeConnectivity.ipynb
Short description: Python notebook with code for creating connectivity summaries and plots.
Relationship between files:
All files with the same dataset name make up a dataset. The .*sv files contain the original metadata extracted from Dryad.
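The naming patterns in the file list can be used to group files by dataset name. The sketch below is illustrative only: the regular expression is inferred from the patterns above, the function names are hypothetical, and the metadata type is lowercased because the capitalization of file names in the size table varies.

```python
import re
from collections import defaultdict

# Pattern inferred from "*Dataset_<type>__YYYYMMDD_HH.<ext>" in the file list.
PATTERN = re.compile(
    r"^(?P<dataset>\w+?Dataset)_(?P<kind>[A-Za-z]+)__"
    r"(?P<stamp>\d{8}_\d{2})\.(?P<ext>csv|tsv|xlsx)$"
)


def parse_name(filename):
    """Split a data file name into its parts, or return None if it does not match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    return {
        "dataset": match["dataset"],
        "kind": match["kind"].lower(),
        "timestamp": match["stamp"],
        "extension": match["ext"],
    }


def group_by_dataset(filenames):
    """Map each dataset name to the metadata types found for it."""
    groups = defaultdict(list)
    for name in filenames:
        parsed = parse_name(name)
        if parsed is not None:
            groups[parsed["dataset"]].append(parsed["kind"])
    return dict(groups)


names = [
    "DryadJournalDataset_Identifiers__20230520_12.csv",
    "DryadJournalDataset_funders__20230520_12.tsv",
    "DryadOrganizationDataset_keywords__20230601_12.tsv",
    "DryadJournalDataset_ConnectivitySummary.html",  # no timestamp, so skipped
]
print(group_by_dataset(names))
```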
Description of methods used for collection/generation of data:
Most of the analysis is simply extracting and comparing counts of various metadata elements.
See connectivity summaries (*ConnectivitySummary.html) for a list of parameters in each file and summaries of their values.
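The kind of per-parameter summary described here (counts, unique counts, most frequent values, and completeness) can be sketched as follows. This is not the code in summarizeConnectivity.ipynb; the function name is hypothetical, and completeness is assumed to mean the share of rows with a non-empty value, which this README does not define.

```python
from collections import Counter


def summarize_column(rows, field, top_n=3):
    """Summarize one field across rows given as dicts of strings."""
    values = [(row.get(field) or "").strip() for row in rows]
    present = [value for value in values if value]
    counts = Counter(present)
    return {
        "count": len(present),                      # rows with a value
        "unique": len(counts),                      # distinct values
        "most_frequent": counts.most_common(top_n),
        # Assumed definition: non-empty rows divided by all rows.
        "completeness": len(present) / len(values) if values else 0.0,
    }


# Made-up rows standing in for a funders file.
sample_rows = [
    {"DOI": "10.5061/dryad.a", "awardNumber": "A"},
    {"DOI": "10.5061/dryad.b", "awardNumber": ""},
    {"DOI": "10.5061/dryad.c", "awardNumber": "A"},
    {"DOI": "10.5061/dryad.d", "awardNumber": "B"},
]
summary = summarize_column(sample_rows, "awardNumber")
print(summary)
```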
Identifier Metadata
The identifier metadata datasets include the following fields:
Field | Definition |
DOI | Digital object identifier for the dataset |
title | Title for the dataset |
datePublished | Date dataset published |
relatedPublicationISSN | International Standard Serial Number for journal with related publication |
primary_article | Digital object identifier for primary article |
primary_article_type | relation type for primary article [primary_article, article] |
primary_article_source | Source of primary article (metadata for primary articles from Dryad metadata) |
primary_article_timestamp | Date and hour of primary article identification (typically metadata retrieval) format:YYYYMMDD_HH |
partyName | Author name |
contributorType | Author role from DataCite (creator or contributor) |
partyIdentifier | Author identifier |
partyIdentifierType | Type of author identifier (typically ORCID) |
partyIdentifier_source | Source of author identifier (metadata) |
partyIdentifier_timestamp | Date and hour of author identifier identification (typically metadata retrieval) format:YYYYMMDD_HH |
organization | Author affiliation |
affiliation_source | Source of author affiliation |
affiliation_timestamp | Date and hour of affiliation identification (typically metadata retrieval) format:YYYYMMDD_HH |
affiliationIdentifier | Identifier for affiliation |
affiliationIdentifierType | Type of organizational identifier (typically ROR) |
affiliationIdentifier_source | Source of organizational identifier (typically metadata) |
affiliationIdentifier_timestamp | Date and hour of affiliation identifier retrieval (typically metadata retrieval) format:YYYYMMDD_HH |
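The fields above are enough to estimate the share of datasets with a connected primary article per publication year. The sketch below rests on several assumptions that are not confirmed by this README: that a dataset DOI can appear on more than one row (for example, one row per author), that datePublished begins with a four-digit year, and that an empty primary_article means no connected paper. The function name and sample values are hypothetical.

```python
from collections import defaultdict


def connected_share_by_year(rows):
    """Return {year: share of datasets with a primary article}."""
    year_of = {}
    has_article = {}
    for row in rows:
        doi = row["DOI"]
        # Assumption: datePublished starts with YYYY.
        year_of[doi] = row["datePublished"][:4]
        connected = bool(row["primary_article"].strip())
        has_article[doi] = has_article.get(doi, False) or connected

    totals = defaultdict(int)
    connected_counts = defaultdict(int)
    for doi, year in year_of.items():
        totals[year] += 1
        if has_article[doi]:
            connected_counts[year] += 1
    return {year: connected_counts[year] / totals[year] for year in sorted(totals)}


# Made-up rows; dataset "a" appears twice, as it would with two authors.
sample_rows = [
    {"DOI": "10.5061/dryad.a", "datePublished": "2019-03-01", "primary_article": "10.1000/example.1"},
    {"DOI": "10.5061/dryad.a", "datePublished": "2019-03-01", "primary_article": "10.1000/example.1"},
    {"DOI": "10.5061/dryad.b", "datePublished": "2019-07-15", "primary_article": ""},
    {"DOI": "10.5061/dryad.c", "datePublished": "2022-01-10", "primary_article": ""},
]
shares = connected_share_by_year(sample_rows)
print(shares)
```

Counting each DOI once keeps datasets with many authors from being over-weighted in the yearly share.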
Funder Metadata
The funder metadata datasets include the following fields:
Field | Definition |
DOI | Digital object identifier for the dataset |
datePublished | Date dataset published |
relatedPublicationISSN | International Standard Serial Number for journal with related publication |
organization | Funder name |
identifier | Funder identifier |
identifierType | Funder identifier type (typically Crossref Funder ID) |
awardNumber | Funder award number |
funders_source | Source of funder metadata (metadata if from Dryad) |
funders_timestamp | Date and hour of funder identification (typically metadata retrieval) format:YYYYMMDD_HH |
Keyword Metadata
The keyword metadata datasets include the following fields:
Field | Definition |
DOI | Digital object identifier for the dataset |
datePublished | Date dataset published |
relatedPublicationISSN | International Standard Serial Number for journal with related publication |
keyword | Subject keyword |
keywords_source | Source of keyword metadata (metadata if from Dryad) |
keywords_timestamp | Date and hour of keyword identification (typically metadata retrieval) format:YYYYMMDD_HH |
Related Work Metadata
The related work metadata datasets include the following fields:
Field | Definition |
DOI | Digital object identifier for the dataset |
datePublished | Date dataset published |
relatedPublicationISSN | International Standard Serial Number for journal with related publication |
relationship | The type of the relation between the dataset and the related work. |
identifier | The identifier of the related work. |
identifierType | Related identifier type (typically DOI) |
relatedWorks_source | Source of related work metadata (metadata if from Dryad) |
relatedWorks_timestamp | Date and hour of relatedWork identification (typically metadata retrieval) format:YYYYMMDD_HH |
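The *_timestamp fields in the tables above share the format YYYYMMDD_HH, which maps onto the standard `datetime.strptime` format string `%Y%m%d_%H`. A minimal sketch follows; the function name is hypothetical, and treating an empty value as missing is an assumption.

```python
from datetime import datetime

# strptime equivalent of the YYYYMMDD_HH format used in the field definitions.
TIMESTAMP_FORMAT = "%Y%m%d_%H"


def parse_timestamp(value):
    """Parse a YYYYMMDD_HH string; return None for an empty value."""
    value = (value or "").strip()
    if not value:
        return None
    # Raises ValueError if the value does not follow the documented format.
    return datetime.strptime(value, TIMESTAMP_FORMAT)


stamp = parse_timestamp("20230520_12")
print(stamp.isoformat())
```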