
Data from: Designing data science workshops for data-intensive environmental science research

Cite this dataset

Theobold, Allison; Hancock, Stacey; Mannheimer, Sara (2020). Data from: Designing data science workshops for data-intensive environmental science research [Dataset]. Dryad.


Over the last 20 years, statistics preparation has become vital for a broad range of scientific fields, and statistics coursework has been readily incorporated into undergraduate and graduate programs. However, a gap remains between the computational skills taught in statistics service courses and those required for the use of statistics in scientific research. Ten years after the publication of "Computing in the Statistics Curriculum," the nature of statistics continues to change, and computing skills are more necessary than ever for modern scientific researchers. In this paper, we describe research on the design and implementation of a suite of data science workshops for environmental science graduate students, providing students with the skills necessary to retrieve, view, wrangle, visualize, and analyze their data using reproducible tools. These workshops help to bridge the gap between the computing skills necessary for scientific research and the computing skills with which students leave their statistics service courses. Moreover, though targeted to environmental science graduate students, these workshops are open to the larger academic community. As such, they promote the continued learning of the computational tools necessary for working with data, and provide resources for incorporating data science into the classroom.


Surveys from Carpentries-style workshops, the results of which are presented in the accompanying manuscript.

Pre- and post-workshop surveys for each workshop (Introduction to R, Intermediate R, Data Wrangling in R, Data Visualization in R) were collected via Google Forms.

  • The surveys administered during the 2018-2019 academic year (fall 2018 and spring 2019) are included as pre_workshop_survey and post_workshop_assessment PDF files.
  • The raw versions of these data are included in the Excel files ending in survey_raw or assessment_raw.
    • Files whose names include survey contain raw data from the pre-workshop surveys; files whose names include assessment contain raw data from the post-workshop assessment surveys.
  • The annotated RMarkdown files used to clean the pre-workshop surveys and post-workshop assessments are included as workshop_survey_cleaning and workshop_assessment_cleaning, respectively. 
  • The cleaned pre- and post-workshop survey data are included in the Excel files ending in clean.
  • The summaries and visualizations presented in the manuscript are produced by the annotated RMarkdown file named analysis.
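
The naming conventions listed above can be summarized in a short sketch. This is only an illustration in Python (standard library only), not part of the dataset: the function name classify_dataset_file is hypothetical, and the example file names and extensions (.xlsx, .Rmd, .pdf) are assumptions, since only the name stems and suffixes are described above.

```python
from pathlib import PurePath


def classify_dataset_file(filename):
    """Return (role, phase) for a file, based on the naming conventions
    described in the list above. Hypothetical helper, for illustration only."""
    stem = PurePath(filename).stem.lower()

    # Role: determined by the end of the name (or the whole name).
    if stem.endswith("_cleaning"):
        role = "cleaning script"
    elif stem.endswith("raw"):
        role = "raw data"
    elif stem.endswith("clean"):
        role = "cleaned data"
    elif stem == "analysis":
        role = "analysis script"
    elif stem in ("pre_workshop_survey", "post_workshop_assessment"):
        role = "survey instrument"
    else:
        role = "unknown"

    # Phase: "survey" marks pre-workshop files, "assessment" post-workshop.
    if "assessment" in stem:
        phase = "post-workshop"
    elif "survey" in stem:
        phase = "pre-workshop"
    else:
        phase = None

    return role, phase


if __name__ == "__main__":
    # Example names are assumed, not taken from the dataset.
    for name in ["intro_r_survey_raw.xlsx",
                 "intro_r_assessment_clean.xlsx",
                 "workshop_survey_cleaning.Rmd",
                 "analysis.Rmd"]:
        print(name, "->", classify_dataset_file(name))
```

The role is checked before the phase because the two are independent: a cleaning script and a raw data file can both belong to the pre-workshop survey.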

Usage notes

To clean the data:

  1. Open the workshop_survey_cleaning and workshop_assessment_cleaning annotated RMarkdown files. 
  2. Load the corresponding raw data into each RMarkdown file.
  3. Execute the data cleaning R code implemented in each RMarkdown file. 
  4. The resulting cleaned datasets correspond to the included data files whose names end in clean.

To recreate the visualizations and summaries included in the manuscript:

  1. Open the analysis annotated RMarkdown file.
  2. Read in the cleaned pre- and post-workshop survey data, each merged into a single Excel file (one for the surveys and one for the assessments).
  3. Execute the R code implemented in the RMarkdown file, where each section corresponds to a summary or visualization presented in the manuscript.   


National Network of Libraries of Medicine, Award: PND Data Engagement Grant