Data and code from: A system-wide snapshot: A multi-campus survey of open source contributors at the University of California
Data files
May 13, 2026 version files (470.80 KB)
- dryad_data_final.tar.gz: 458.53 KB
- README.md: 12.27 KB
May 14, 2026 version files (470.86 KB)
- dryad_data_final.tar.gz: 458.53 KB
- README.md: 12.33 KB
Abstract
The University of California (UC) OSPO Network is working to develop infrastructure for open source education, discovery, and sustainability at UC. To develop our strategic priorities and assess the state of UC open source, we conducted a survey in April 2025 of 294 UC-affiliated open source contributors. This deposit contains materials from the study, as a companion to our PLOS One paper "A system-wide snapshot: A multi-campus survey of open source contributors at the University of California".
Data:
The Dryad deposit contains the quantitative responses (.tsv format) and qualitative responses (.docx format) from the survey, as well as some intermediate data files generated during the analysis. Minimal pre-processing has already been applied to the data, including removal of personally identifiable information. These data are suitable for replication and/or reuse. The compressed data file was produced with tar and gzip on Ubuntu Linux 24.04.3 LTS. See the README file for more information.
Code:
The Zenodo deposit, which is linked to the Dryad deposit, contains the code from this study. It is a static snapshot of this repository at the time of publication: https://github.com/UC-OSPO-Network/ospo-survey-analysis. See the README within the zip file for more information. The zip file was downloaded from GitHub using GitHub's built-in 'download .zip' feature.
This README file was generated on November 21, 2025 by Virginia Scarlett.
Updated March 26, 2026.
It applies to the unpacked tarball, dryad_data_final.tar.gz. To unpack the data on a Unix system, use the tar command:
tar -xvzf dryad_data_final.tar.gz
To unpack on Windows, use a file archiver such as 7-Zip or WinRAR, or use the built-in tar command in Command Prompt or PowerShell.
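If neither option is convenient, the archive can also be unpacked from Python with the standard-library `tarfile` module. The following is a minimal sketch, not part of the deposit; the function name `unpack_archive` and the destination folder are placeholders chosen for illustration.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as tar:
        names = tar.getnames()
        try:
            # Newer Python versions accept an extraction filter that
            # refuses unsafe paths inside the archive.
            tar.extractall(path=dest, filter="data")
        except TypeError:
            # Older Python versions do not know the filter argument.
            tar.extractall(path=dest)
    return names
```

For example, `unpack_archive("dryad_data_final.tar.gz", "dryad_data")` would extract the files into a folder named `dryad_data` (a folder name assumed here).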
GENERAL INFORMATION
- Title of Dataset: Survey responses from the 2025 UC OSPO Network open source survey
- Author Information:
  * Corresponding Author Contact Information
    * Name: Virginia Scarlett
    * Institution: UC Santa Barbara Library
    * Email: virginiascarlett@ucsb.edu
    * Alternate email: virginia.t.scarlett@gmail.com
  * Principal Investigator Contact Information
    * Name: Amber Budden
    * Institution: UC Santa Barbara Library
    * Email: aebudden@ucsb.edu
  * Alternate Contact Information
    * Name: Renata Curty
    * Institution: UC Santa Barbara Library
    * Email: rcurty@ucsb.edu
- Date of data collection: March 31, 2025 - May 9, 2025
- Funding information: Solicited grant from the Alfred P. Sloan Foundation
  * Title: "Building an OSPO Network at the University of California"
  * Grant number: G-2024-22424
SHARING/ACCESS INFORMATION
- Licenses: Data are CC-0. Code is BSD-3, copyright Regents of the University of California.
- Links to publications that cite or use the data:
SocArXiv Preprint: https://doi.org/10.31235/osf.io/p8bx6_v1
Paper submitted and currently in review at PLOS One.
- Was data derived from another source? no
- Recommended citation for this dataset:
Scarlett, Curty, Gomez et al. (2026). Survey responses from the 2025 UC OSPO Network open source survey [Data set]. Dryad. https://doi.org/10.5061/dryad.2280gb662
If you are citing our study, ideas, or findings, please cite the PLOS One paper, which is currently in review at the time of writing this README.
METHODOLOGICAL INFORMATION
- Data collection methods: See the paper for details. Briefly, we used a snowball sampling approach to survey UC-affiliated open source contributors across UC campuses. Data were collected anonymously using Qualtrics. The survey instrument and protocol were approved by the UC Santa Barbara (UCSB) Office of Research IRB Human Subjects Committee (HSC) (protocol #1-25-0264) and deemed exempt from human subjects review under Federal Regulations 45 CFR 46.104(d), category 2. The survey consisted mostly of quantitative questions (e.g., multiple choice or rating scales) with a few qualitative questions (i.e., free-text boxes). Survey flow depended on previous selections, so no participant saw all possible questions. While optional contact information was requested for follow-up, the final dataset has been de-identified to ensure participant confidentiality.
- Methods for processing the data: I downloaded the raw data from Qualtrics and immediately separated them into quantitative data, qualitative data, and contact information/PII. See individual file descriptions for info on how they were processed.
- Methods for de-identification of data: The quantitative data contain only indirect identifiers. The most informative identifiers in the quantitative data are campus plus either field of study for academics (e.g. Life Sciences) or "work area" for staff (e.g. Information Technology (IT)). Since it would be extremely difficult to identify a participant based on these fields alone, no efforts have been made to de-identify the quantitative data.
Several steps were taken to de-identify the qualitative data:
* Order of responses was shuffled, so quantitative and qualitative data cannot be linked.
* Possible identifiers were replaced with a bracketed word, e.g. "College of Engineering" might become [my college]. Possible identifiers include specific open source projects the participant contributed to and clubs, departments, or colleges they belong to.
* Some words were removed where the participant added unnecessary identifying information, for example their name, details about the kind of work they do, or details about their workplace.
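The first two steps can be illustrated with a short Python sketch. This README does not say how the steps were carried out, so the code below is only an illustration of the idea, not the procedure used for this dataset. The function `deidentify` and its `replacements` mapping are hypothetical.

```python
import random
import re


def deidentify(responses, replacements, seed=None):
    """Replace listed identifiers with bracketed placeholders, then shuffle."""
    cleaned = []
    for text in responses:
        for identifier, placeholder in replacements.items():
            # Case-insensitive literal match; the lambda keeps the
            # placeholder from being parsed as a regex template.
            text = re.sub(
                re.escape(identifier),
                lambda _match: placeholder,
                text,
                flags=re.IGNORECASE,
            )
        cleaned.append(text)
    # Shuffling breaks any link between response order and respondent order.
    random.Random(seed).shuffle(cleaned)
    return cleaned
```

Removing unnecessary identifying details, the third step, requires reading each response and is not something a simple substitution like this can do.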
- Software needed to interpret the data: The .qsf version of the survey instrument should be opened with Qualtrics. The .docx version should be opened with Microsoft Word. To view the .tsv (tab-delimited text) files, I recommend either Microsoft Excel (or similar) or VSCode with the Rainbow CSV extension.
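The .tsv files can also be read programmatically. As a minimal sketch using only the Python standard library, the helper below (the name `read_tsv` is made up for this example) loads a tab-delimited file into a list of dictionaries. The two-column sample stands in for a real file, and its values are invented.

```python
import csv
import io


def read_tsv(handle):
    """Return a list of dicts, one per row, keyed by the header row."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Stand-in for a real file. With the unpacked data you would pass something
# like open("all_quant.tsv", newline="", encoding="utf-8") (path assumed).
sample = io.StringIO("campus\tproject_size\nUC Santa Barbara\tSmall\n")
rows = read_tsv(sample)
print(rows[0]["campus"])
```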
DATA & FILE OVERVIEW
- File List:
SURVEY INSTRUMENT
* survey_instrument.qsf: Survey in Qualtrics format. Does not contain any responses.
* survey_instrument.docx: Same as above, but in Microsoft Word format.
* (PDF is in the paper supplement and provided here as well)
"RAW" DATA
* all_quant.tsv: All of the quantitative survey responses.
* qual.docx: All of the anonymized qualitative survey responses.
PROCESSED DATA
- clean_data/: This folder contains the same data as in all_quant.tsv, but reformatted for easier analysis.
ANALYSIS OUTPUTS
* data_for_plots/: This folder contains the data frames that were used to generate the figures in the paper.
* curation_of_disciplines/: The results from matching write-in responses to Q18 (participant's field of study) with the Digital Commons taxonomy of disciplines.
- Additional data that were not included in the current data package: The raw Qualtrics export has not been included because it contains personally identifiable information (PII), including email addresses and GitHub usernames. It also contains a lot of irrelevant and/or confusing information that is automatically generated by Qualtrics (e.g., the entire consent form; internal "response id" used by Qualtrics; how long it took the participant to take the survey; participant language (English for all)).
- Are there multiple versions of the dataset? no
DATA-SPECIFIC INFORMATION FOR INDIVIDUAL FILES
DATA-SPECIFIC INFORMATION FOR clean_data/:
The question numbers below (Q1, Q2, etc.) come from the survey instrument.
- Files included:
  * importance_Q2.tsv
  * contributor_status_Q3.tsv
  * contributor_roles_Q4.tsv
  * project_size_Q5.tsv
  * motivations_Q6.tsv
  * project_types_Q7.tsv
  * hosting_services_Q8.tsv
  * challenges_Q9.tsv
  * solutions_Q10.tsv
  * future_contributors_Q15.tsv
  * other_quant.tsv
- Number of variables/columns: Varies
- Number of cases/rows: 332
- Variable information: Here, the columns for each particular survey question were taken from all_quant.tsv and placed in separate files. Each row corresponds to a survey participant. The data have been tidied a bit, including renaming the columns and simplifying the entries. For example, "select all that apply" questions have been modified so that 1s indicate that the option was selected and 0s indicate that the option was not selected. The column names should be fairly self-explanatory, but in case they are not, they are also in the same order that they appeared on the survey. The more complex questions each have their own file. All the simple multiple choice questions have been put in one file (other_quant.tsv).
- Missing data codes: All of these questions were mandatory. Where the respondent did not answer because they never saw the question, their lack of response appears as an empty string "".
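To show how the 1/0 coding and the empty-string missing code work together, here is a hedged Python sketch that computes, for each option in a "select all that apply" file, the share of respondents who selected it among those who saw the question. It assumes every column in the file is a 1/0 option column, which may not hold for every file in clean_data/. The column names `option_a` and `option_b` are invented.

```python
import csv
import io


def selection_rates(handle):
    """Fraction of answering respondents who selected each option column."""
    reader = csv.DictReader(handle, delimiter="\t")
    selected = {name: 0 for name in reader.fieldnames}
    answered = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name in reader.fieldnames:
            value = row[name]
            if not value:  # empty string: respondent never saw the question
                continue
            answered[name] += 1
            selected[name] += int(value)
    return {n: selected[n] / answered[n] for n in selected if answered[n]}


# Hypothetical two-option file; the last row never saw the question.
sample = io.StringIO("option_a\toption_b\n1\t0\n1\t1\n\t\n")
rates = selection_rates(sample)
```

Leaving empty strings out of the denominator keeps respondents who were never shown the question from being counted as "not selected".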
DATA-SPECIFIC INFORMATION FOR all_quant.tsv:
- Number of variables/columns: 96
- Number of cases/rows: 332
- Variable information: Rows correspond to individual survey respondents. Columns correspond to survey options. The question numbers below (Q1, Q2, etc.) come from the survey instrument.
campus: Q1
challenges: Q9
consent_form: Q0
contributor_role: Q4
contributor_status: Q3
favorite_solution: Q11
field_of_study: Q17
future_contributors: Q15
hosting_services: Q8
importance_src: Q2
job_category: Q16
motivations: Q6
project_size: Q5
project_types: Q7
solution_offerings: Q10
staff_categories: Q19
I have no idea why Qualtrics sometimes numbers columns non-sequentially, but the order corresponds to the order on the survey. So if the column names are contributor_role_1, contributor_role_4, contributor_role_6, and contributor_role_7, those columns correspond to options 1, 2, 3, and 4 from the contributor roles question. I have kept the original column names that Qualtrics generated in case they have some hidden significance.
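The renumbering described above can be sketched in Python as follows. The helper `option_positions` is hypothetical. It assumes the option columns are named `<prefix>_<number>` and relies on their order in the file rather than on the Qualtrics numbers.

```python
import re


def option_positions(columns, prefix):
    """Map each '<prefix>_<n>' column to its 1-based position on the survey."""
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d+)$")
    # Keep file order, since column order follows the order on the survey.
    matched = [c for c in columns if pattern.match(c)]
    return {col: pos for pos, col in enumerate(matched, start=1)}


header = ["campus", "contributor_role_1", "contributor_role_4",
          "contributor_role_6", "contributor_role_7"]
positions = option_positions(header, "contributor_role")
```

Here `positions["contributor_role_6"]` is 3, meaning that column holds the third option of the contributor roles question.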
For rating (matrix) questions, the entry indicates which rating was selected. For check-all-that-apply questions, the presence of a non-empty string indicates that that option was selected.
Missing data codes: All of these survey questions were mandatory. Where the respondent did not answer because they never saw the question, their lack of response appears as an empty string "".
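As an illustration of these coding rules, the Python sketch below converts one respondent's check-all-that-apply answers into 1/0 flags. It treats a respondent with every option empty as never having seen the question. That reading is an assumption on my part, based on the questions being mandatory, and the helper name and column names are hypothetical.

```python
def to_indicators(row, option_columns):
    """Turn one respondent's check-all-that-apply answers into 1/0 flags.

    Returns None when every option is empty, which is read here as
    "never saw the question" (an assumption; the questions were mandatory,
    so anyone who saw the question selected at least one option).
    """
    values = [row.get(col, "") for col in option_columns]
    if all(v == "" for v in values):
        return None
    # Any non-empty string means the option was selected.
    return {col: int(v != "") for col, v in zip(option_columns, values)}


cols = ["hosting_services_1", "hosting_services_2"]
seen = to_indicators({"hosting_services_1": "GitHub", "hosting_services_2": ""}, cols)
unseen = to_indicators({"hosting_services_1": "", "hosting_services_2": ""}, cols)
```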
DATA-SPECIFIC INFORMATION FOR data_for_plots:
You don't really need to know what the rows/columns in these tables mean. They are only useful for reproducing my figures, and if you are reproducing my figures, then presumably you are reusing my code, which will give you a better explanation of how each table was produced than I can provide here.
DATA-SPECIFIC INFORMATION FOR curation_of_disciplines/:
Word of warning: this part was very messy and involved manual curation.
- Files:
- Disciplines_taxonomy_2025-03.pdf: The Digital Commons taxonomy of disciplines I downloaded from the internet (see notes file for details).
- digital_commons_disciplines_curation_notes.txt: Notes on the choices I made when I transferred the PDF taxonomy of disciplines to a txt file.
- digital_commons_disciplines.txt: The taxonomy of disciplines converted to a txt.
- qual_disciplines_shuffled.tsv: see below
- Number of variables/columns: 5
- Number of cases/rows: 196 (174 participants, some appear multiple times)
- Variable information: "NEWparticipantID" is a participant identifier. Participant IDs have been randomly shuffled, and row order has also been randomly shuffled (while keeping responses from the same participant contiguous). So the "NEW" part indicates that the participant IDs here do not correspond to the participant order in any other file in this data set. "response" indicates what the participant wrote in for Q18. "Level" columns are the classification results.
If the same participant has multiple "response" values, that means they entered multiple disciplines. Multiple disciplines separated with a slash or comma were parsed automatically. Multiple disciplines separated with something else were reviewed manually, in which case I did not bother to parse the response. So if someone wrote "Math, Stats", they will have one row where response==Math and another row where response==Stats. But if they wrote "Math & Stats", then they will have two rows where response is the same, i.e. response==Math & Stats.
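The automatic parsing rule can be sketched in Python as below. This is an illustration of the rule as described, not the code used in the analysis, and `split_disciplines` is a made-up name.

```python
import re


def split_disciplines(response):
    """Split a write-in on slashes or commas; leave other separators alone."""
    parts = [p.strip() for p in re.split(r"[/,]", response)]
    return [p for p in parts if p]


print(split_disciplines("Math, Stats"))   # ['Math', 'Stats']
print(split_disciplines("Math & Stats"))  # ['Math & Stats']
```

A response that comes back as a single item but names more than one discipline, like the second example, is the kind that was reviewed manually.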
Missing data codes: N/A
DATA-SPECIFIC INFORMATION FOR qual_data.docx:
See above for de-identification methods. In addition to de-identification, uninformative responses such as "N/A", "blank", or "?" were removed. Many comments contain typos or grammatical errors. These are original to the respondent and we have deliberately left them as-is.
Participant responses are numbered for visual clarity, but these numbers are arbitrary and do not correspond to the participant order in any other file in this data set.
Human subjects data
The survey instrument and protocol were approved by the UC Santa Barbara (UCSB) Office of Research IRB Human Subjects Committee (HSC) (protocol #1-25-0264) and were deemed exempt from human subjects review under Federal Regulations 45 CFR 46.104(d), category 2. The survey contained a consent form that participants were required to accept in order to proceed. Direct PII, i.e. names, email addresses, and GitHub usernames, was removed. Remediation of indirect PII is described in detail in the data README. Briefly, references to specific projects, clubs, departments, or colleges in the qualitative data were replaced with e.g. [my department], or were simply removed if they added no useful information or context.
