Human subjects data

Data archived in Dryad are publicly available, and any human subjects data must be properly anonymized and prepared under applicable legal and ethical guidelines. When de-identifying your data, both direct and indirect identifiers need to be considered.

Dryad does not allow any direct identifiers, but a dataset may contain up to 3 indirect identifiers. Direct identifiers include variables such as a participant’s name, initials, email address, and postal code; indirect identifiers are variables that, when combined with each other or with other available data, might lead to identification (see below).

Note: While every submission is evaluated by our curation team and we provide guidance for researchers to follow, the researchers themselves are responsible for ensuring that data do not contain information which identifies, or which can be used in conjunction with other publicly available information to personally identify, any individual (see Dryad’s Terms of Service).

Tips for preparing human subjects data for Dryad

  • Ensure that there are no direct identifiers.
  • Limit indirect identifiers. (Dryad allows a maximum of 3.)
  • Remove any nonessential identifying details.
  • Aggregate data – potentially revealing variables, e.g., age, can be grouped into ranges.
  • Reduce the precision of a variable – e.g., reduce a date of birth to the year of birth; use county instead of city; add or subtract a small, randomly chosen number.
  • Restrict the upper or lower ranges of a continuous variable to hide outliers by collapsing them into a single code.
  • Provide good documentation of your data in a README file.
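
The aggregation, precision-reduction, and range-restriction tips above can be sketched in a few lines of Python. This is only an illustration, not a Dryad tool: the function names, the 10-year band width, and the cap of 90 are assumptions chosen for the example, and the right values depend on your data.

```python
import random
from datetime import date


def age_band(age: int, width: int = 10) -> str:
    """Aggregate an exact age into a band such as '30-39' (width is an assumption)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def year_only(d: date) -> int:
    """Reduce precision: keep the year and drop the day and month."""
    return d.year


def top_code(value: int, cap: int) -> str:
    """Collapse every value at or above the cap into a single code such as '90+'."""
    return f"{cap}+" if value >= cap else str(value)


def jitter(value: int, max_shift: int, rng: random.Random) -> int:
    """Add or subtract a small, randomly chosen whole number (at most max_shift)."""
    return value + rng.randint(-max_shift, max_shift)
```

For example, `age_band(34)` gives `"30-39"`, `year_only(date(1987, 5, 14))` gives `1987`, and `top_code(97, 90)` gives `"90+"`. Whatever transformations you apply should be described in the README so that users know how the values were altered.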

NIH states that "researchers should consider removing indirect identifiers and other information that could lead to 'deductive disclosure' of participants' identities. Deductive disclosure of individual subjects becomes more likely when there are unusual characteristics or the joint occurrence of several unusual variables. Samples drawn from small geographic areas, rare populations, and linked datasets can present particular challenges to the protection of subjects' identities."

README File

All data in Dryad should be well documented. Variables and their allowable values should be defined. If you have masked your data, removed variables, or otherwise altered your dataset to de-identify it, it is best practice to include this information in your README file. See Dryad’s guidance on README files for more information.

Direct and Indirect Identifiers*

Direct Identifiers (none allowed)

  • Name
  • Initials
  • Address, including full or partial postal code
  • Telephone or fax numbers or contact information
  • Electronic mail addresses
  • Unique identifying numbers (e.g., social security number, student ID)
  • Vehicle identifiers
  • Medical device identifiers
  • Web or internet protocol addresses
  • Biometric data
  • Facial photograph or comparable image (fMRI data must have facial structures removed through tools such as pydeface or mri_deface)
  • Audiotapes of participants’ voices
  • Names of relatives
  • Dates related to an individual (e.g., date of birth, date of doctor visit, date of interview)

Indirect Identifiers (may present a risk if combined with other data – Dryad allows up to 3)

  • Sex
  • Rare disease or treatment
  • Place of treatment
  • Name of health professional responsible for care
  • Sensitive data such as illicit drug use or risky behavior**
  • Criminal record
  • Place of birth
  • Socioeconomic data, such as occupation or place of work, income, or education
  • Household and family composition
  • Organizations that participant belongs to (religious, political, trade, etc.)
  • Anthropometric measures (e.g., height, weight)
  • Multiple pregnancies
  • Ethnicity, race, indigenous status
  • Small denominators—population size of less than 100
  • Very small numerators—event counts of less than 3
  • Year of birth or age
  • Verbatim responses or transcripts
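
One way to look for risky combinations of indirect identifiers before submission is to count how many records share each combination of values. The sketch below is a rough self-check under stated assumptions, not part of Dryad's curation process: the function name, the column names, and the sample records are hypothetical, and the threshold of 3 is borrowed from the small-numerator item in the list above rather than from any stated Dryad rule for this kind of check.

```python
from collections import Counter

MAX_INDIRECT = 3  # Dryad allows up to 3 indirect identifiers per dataset


def small_cells(records, indirect_columns, min_count=3):
    """Return each combination of indirect-identifier values that is shared
    by fewer than min_count records, with the number of records sharing it."""
    if len(indirect_columns) > MAX_INDIRECT:
        raise ValueError(f"more than {MAX_INDIRECT} indirect identifiers")
    counts = Counter(tuple(r[c] for c in indirect_columns) for r in records)
    return {combo: n for combo, n in counts.items() if n < min_count}


# Hypothetical records, already aggregated into age bands and counties.
records = [
    {"sex": "F", "age_band": "30-39", "county": "A"},
    {"sex": "F", "age_band": "30-39", "county": "A"},
    {"sex": "F", "age_band": "30-39", "county": "A"},
    {"sex": "M", "age_band": "80-89", "county": "B"},
]
flagged = small_cells(records, ["sex", "age_band", "county"])
```

Here `flagged` contains only the combination `("M", "80-89", "B")`, which occurs once. A flagged combination suggests that further aggregation or removal of a variable may be worth considering; an empty result does not by itself show that the data are safe to share.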

* Sources used:

** "The sensitivity of personal data is related to the potential for harm or stigma that might attach to the identification of an individual because of the nature of the information... Researchers should also be aware of information that communities may consider sensitive because, for example, of its potential to stigmatize a community" (CIHR, p. 30).

Considerations regarding qualitative data

Guidelines gathered from the Qualitative Data Repository (QDR) at Syracuse University’s Center for Qualitative and Multi-Method Inquiry:

Tips for anonymizing qualitative data:

  • Removing major (direct) identifying details (e.g., real names, locations); replacing them with pseudonyms, replacement terms (e.g., “paternal grandfather”), vaguer descriptors, or a coding system; and using a cross-referencing system for pseudonyms that will not be made available to users
  • Removing information in a transcript or notes from a human encounter that may reveal the identity of project participants
  • Aggregating or reducing the precision of information or a variable, e.g., replacing date of birth by age groups or city names by county names
  • Generalizing the meaning of detailed text, e.g., replacing a doctor’s detailed area of medical expertise with an area of medical specialty
  • Noting the replacement of identifying details in text and the removal or modification of information in a meaningful way (for instance, in transcribed interviews, indicating replaced text with [brackets])
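
The replacement and bracketing tips above can be sketched as a simple text substitution. This is a minimal illustration, assuming you already hold a list of identifying terms for each transcript: the function name and the sample names are hypothetical, and plain string matching will miss misspellings, nicknames, and indirect references, so the output still needs a manual read-through.

```python
import re


def pseudonymize(text: str, replacements: dict[str, str]) -> str:
    """Replace each identifying term with its substitute, marked with [brackets]."""
    if not replacements:
        return text
    # Try longer terms first so a full name is matched before a first name alone.
    terms = sorted(replacements, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(term) for term in terms))
    return pattern.sub(lambda m: f"[{replacements[m.group(0)]}]", text)


# Hypothetical cross-reference key; keep it private and out of the deposit.
key = {
    "Maria Lopez": "Participant 7",
    "Juan": "paternal grandfather",
    "Syracuse": "city A",
}
original = "Maria Lopez said Juan drove her to the clinic in Syracuse."
cleaned = pseudonymize(original, key)
```

With this key, `cleaned` reads "[Participant 7] said [paternal grandfather] drove her to the clinic in [city A]." The `key` dictionary plays the role of the cross-referencing system mentioned above, so it should not be included with the shared files.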

See also: UK Data Archive: Anonymisation
