The extent of gender and race/ethnicity imbalance in infectious disease dynamics research
Data files
May 30, 2025 version files 23.18 MB
- df9_articledata_0.7_public.csv (8.33 MB)
- df9_consensus_authorship_public.csv (108.41 KB)
- df9_consensus_citation_public.csv (14.08 MB)
- input_jif_regression_nosf_20241212_public.csv (655.24 KB)
- README.md (6.84 KB)
Abstract
https://doi.org/10.5061/dryad.djh9w0w98
Description of the data and file structure
This repository provides the source code for the following study: "The extent of gender and race/ethnicity imbalance in infectious disease dynamics research." We also provide anonymized data to reproduce the figures in the text in compliance with our data-sharing agreement with Clarivate.
The data files are already embedded with the appropriate directory structure in the zip file. These data comprise articles in the field of infectious disease dynamics, the inferred gender and race of their first and last authors, the last author's country of affiliation, and the Journal Impact Factor of the journal where each article was published. We analyze patterns in these data to understand how authorship and citation practices vary across the field.
Data are first pulled from Web of Science using the workflow and files outlined in the pull_data folder. Data are then processed using the files in the process_data folder; this step requires some files published with Dworkin et al.'s paper on citation practices in neuroscience, which are copied into this repository. Finally, figures in the main text can be reproduced with the code in make_figures and the data available in the coreidd folder.
The IDD_bib_analysis directory contains the code for our tool to analyze one's bibliography relative to the makeup of the field. Directions are provided in the cleanBib Jupyter Notebook.
Files and variables
Files are labeled numerically according to the order in which they were run within each folder. Across folders the order is 1) pull data, 2) process data, 3) make figures.
Data dictionaries for the data files are provided below.
coreidd/df9_articledata_0.7_public.csv
- article_id: a unique identifier for each article
- PY: year of publication
- PD: month of publication, contains NAs
- AG: gender pairing of first and last authors of the article using genderize.io; M stands for man, W for woman, and U for unknown; first letter is first author, second letter is last author
- ARbin_ethni: race pairing of first and last authors of the article using ethnicolr and binned into White (W) and non-White (N); first letter is first author, second letter is last author
- FA_genderize: first author gender from genderize.io; M stands for man, W for woman, and U for unknown
- LA_genderize: last author gender from genderize.io; M stands for man, W for woman, and U for unknown
- FA_ethni_race: first author race from ethnicolr; HL stands for Hispanic, W_NL for non-Hispanic White, B_NL for non-Hispanic Black, A for Asian, and U for unknown
- LA_ethni_race: last author race from ethnicolr; HL stands for Hispanic, W_NL for non-Hispanic White, B_NL for non-Hispanic Black, A for Asian, and U for unknown
- FA_binrace_ethni: first author binary race from ethnicolr; W stands for White, N for non-White (including Hispanic, Asian, and Black non-Hispanic), and U for unknown
- LA_binrace_ethni: last author binary race from ethnicolr; W stands for White, N for non-White (including Hispanic, Asian, and Black non-Hispanic), and U for unknown
- Country: country of affiliation for last author of article (string)
- global_north: indicator variable for whether the last author's country of affiliation falls in the Global North (0 or 1)
- has_single_author: boolean variable for whether the article has only a single author
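As a rough illustration of how the two-letter pairing codes can be read, the sketch below tallies AG values using only the Python standard library. The rows in TOY_CSV are invented for illustration and are not taken from the real file; the helper names split_pair and tally_pairings are hypothetical, and the sketch assumes every AG value is exactly two letters.

```python
import csv
import io
from collections import Counter

# Toy rows invented for illustration; they are NOT taken from the real file.
TOY_CSV = """article_id,PY,AG
a1,2010,MM
a2,2011,WM
a3,2011,MW
a4,2012,UM
a5,2012,WM
"""


def split_pair(code):
    """Split a two-letter pairing code into (first_author, last_author)."""
    if len(code) != 2:
        raise ValueError(f"expected a two-letter pairing code, got {code!r}")
    return code[0], code[1]


def tally_pairings(handle, column="AG", drop_unknown=True):
    """Count pairing codes, optionally skipping pairs with an unknown (U) author."""
    counts = Counter()
    for row in csv.DictReader(handle):
        first, last = split_pair(row[column])
        if drop_unknown and "U" in (first, last):
            continue
        counts[row[column]] += 1
    return counts


counts = tally_pairings(io.StringIO(TOY_CSV))
for code, n in sorted(counts.items()):
    print(code, n)
```

To run this against the real data, replace io.StringIO(TOY_CSV) with an open file handle for coreidd/df9_articledata_0.7_public.csv. Whether pairs with an unknown author should be dropped depends on the analysis, so it is left as an option here.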
coreidd/df9_consensus_authorship_public.csv
This file contains the article ids and publication years (PY) for articles in the authorship dataset.
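A minimal sketch of restricting the article data to the authorship dataset is shown below. It assumes the identifier column in the authorship file is named article_id, as in the other files; the README does not spell this out. The toy rows and the helper name authorship_subset are invented for illustration.

```python
import csv
import io

# Toy rows invented for illustration; they are NOT taken from the real files.
ARTICLES = """article_id,PY,AG
a1,2010,MM
a2,2011,WM
a3,2012,MW
"""

AUTHORSHIP = """article_id,PY
a1,2010
a3,2012
"""


def authorship_subset(article_handle, authorship_handle, key="article_id"):
    """Keep only article rows whose id appears in the authorship file."""
    wanted = {row[key] for row in csv.DictReader(authorship_handle)}
    return [row for row in csv.DictReader(article_handle) if row[key] in wanted]


subset = authorship_subset(io.StringIO(ARTICLES), io.StringIO(AUTHORSHIP))
print([row["article_id"] for row in subset])
```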
coreidd/df9_consensus_citation_public.csv
This file contains the articles in the citation dataset including which article they were cited by, and whether they should be included in the analysis depending on the threshold of inclusion.
- article_id: the article's unique identifier so it can be linked to the full dataset
- in_bib_of_article_id: article id of the citing article
- PY: publication year of cited article
- include_95: boolean for whether article should be included with 95th percentile consensus citation threshold
- include_90: boolean for whether article should be included with 90th percentile consensus citation threshold. The main analysis uses this threshold; see the paper for a more detailed description of the construction and interpretation of these thresholds
- include_75: boolean for whether article should be included with 75th percentile consensus citation threshold
- include_50: boolean for whether article should be included with 50th percentile consensus citation threshold
- is_self_cite: indicator for whether the citation is a self citation by the first or last author (0 or 1)
input_jif_regression_nosf_20241212_public.csv
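The threshold columns can be used to select citation pairs, roughly as sketched below. The text encoding of the boolean columns is an assumption (the helper accepts TRUE/True/1), the toy rows are invented for illustration, and the helper names as_bool and select_citations are hypothetical. Dropping self citations is offered as an option only; the README does not say whether the main analysis excludes them.

```python
import csv
import io

# Toy rows invented for illustration; they are NOT taken from the real file.
TOY_CITATIONS = """article_id,in_bib_of_article_id,PY,include_95,include_90,include_75,include_50,is_self_cite
c1,a1,2005,FALSE,TRUE,TRUE,TRUE,0
c2,a1,2007,FALSE,FALSE,TRUE,TRUE,0
c3,a2,2009,TRUE,TRUE,TRUE,TRUE,1
c1,a2,2005,FALSE,TRUE,TRUE,TRUE,0
"""


def as_bool(value):
    """Interpret common text encodings of a boolean (assumed, not documented)."""
    return value.strip().lower() in {"true", "t", "1"}


def select_citations(handle, threshold="include_90", drop_self_cites=True):
    """Return (citing_id, cited_id) pairs that pass the chosen threshold."""
    kept = []
    for row in csv.DictReader(handle):
        if not as_bool(row[threshold]):
            continue
        if drop_self_cites and row["is_self_cite"].strip() == "1":
            continue
        kept.append((row["in_bib_of_article_id"], row["article_id"]))
    return kept


pairs = select_citations(io.StringIO(TOY_CITATIONS))
print(pairs)
```

Passing threshold="include_75" or threshold="include_50" selects the looser thresholds in the same way.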
This file contains data necessary for the impact factor regression in the main text.
- pair_gender: gender pairing of first and last authors of the article using genderize.io; M stands for man, W for woman; first letter is first author, second letter is last author
- pair_race: race pairing of first and last authors of the article using ethnicolr; W stands for White, N for non-White; first letter is first author, second letter is last author
- FA_gender: first author gender from genderize.io; M stands for man, W for woman
- LA_gender: last author gender from genderize.io; M stands for man, W for woman
- FA_race: first author race from ethnicolr; W stands for White, N for non-White
- LA_race: last author race from ethnicolr; W stands for White, N for non-White
- JIF: 2023 Journal Impact Factor of journal where article was published
- year: year of publication
- num_citations: number of times article was cited by articles in authorship dataset
- log_num_citations: natural log of num_citations column
- citation_rate: number of citations divided by years since publication
- log_citation_rate: natural log of citation_rate column
- WC: Web of Science category
- num_cited_refs: number of references in the article's bibliography
- WC_1st: first entry in WC
- Country: country of affiliation for last author of article (string)
- global_north: indicator variable for whether the last author's country of affiliation falls in the Global North (0 or 1)
- article_id: article unique identifier
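The derived columns follow simple formulas, which the sketch below restates in Python. The reference year used for "years since publication" is not given in this README, so it is a required argument here; how zero citation counts are handled in the log columns is also not stated, so the sketch returns None in that case as a placeholder. All function names are hypothetical.

```python
import math


def citation_rate(num_citations, year, reference_year):
    """Citations divided by years since publication (reference year is assumed)."""
    years_since = reference_year - year
    if years_since <= 0:
        raise ValueError("reference_year must be later than the publication year")
    return num_citations / years_since


def natural_log_or_none(value):
    """Natural log for positive values; None otherwise (placeholder behavior)."""
    return math.log(value) if value > 0 else None


def pair_code(first_author, last_author):
    """Build a pairing code: first letter is first author, second is last author."""
    return first_author + last_author


rate = citation_rate(num_citations=10, year=2015, reference_year=2020)
print(rate)                       # 2.0
print(natural_log_or_none(rate))  # natural log of the rate
print(pair_code("W", "M"))        # WM
```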
Code/software
This code uses both R and Python, depending on the file.
Access information
Other publicly accessible locations of the data:
Data was derived from the following sources:
- Clarivate Web of Science
