Refining impact assessment in undergraduate STEM education: Differential item functioning analysis of field-based learning interventions
Data files
May 16, 2025 version (94.41 KB total)
- README.md (8.57 KB)
- refining_impact_assessment_code.txt (50.11 KB)
- refining_impact_assessment_data.csv (35.72 KB)
Abstract
This dataset contains self-efficacy survey responses from undergraduate biology students enrolled in three different course formats (lecture, introductory field, and intensive field) at the University of California, Santa Cruz from 2016-2019. The data structure includes pre/post survey responses measuring students' self-efficacy across four skill areas (species identification, experimental design, oral presentation, and field research), demographic information (URM status, first-generation status, gender, and Educational Opportunity Program status), and calculated change scores for 564 students. The dataset demonstrates how Differential Item Functioning (DIF) analysis can quantify both the magnitude and demographic patterns of educational interventions with greater precision than traditional assessment methods. Analysis revealed field course students were significantly more likely to report higher self-efficacy ratings compared to lecture course students (odds ratios ranging from 2-167 times higher), with historically minoritized students showing greater gains in field settings. This dataset has significant reuse potential for researchers studying educational interventions, assessment methodology, field-based learning, and equity in STEM education. All data were collected under IRB approval (UCSC #HS3230) with student identifiers anonymized to ensure ethical compliance and privacy protection.
Dataset DOI: 10.5061/dryad.zcrjdfnqs
Description of the data and file structure
This dataset contains self-efficacy survey responses from undergraduate biology students enrolled in three different course formats at the University of California, Santa Cruz from Fall 2016 through Spring 2019. The data were collected to analyze how field-based learning experiences impact students' self-efficacy development across demographic groups. The dataset includes pre/post survey responses measuring students' self-efficacy in several areas, demographic information, and course enrollment data.
This data is an extension of research originally published by Beltran et al. (2020) and archived at Dryad (https://doi.org/10.7291/D1DM3P). The current dataset includes additional analyses examining differential item functioning (DIF) to quantify how field-based learning impacts self-efficacy development across different student populations.
Files and variables
refining_impact_assessment_data.csv
This file contains the complete dataset with pre/post survey responses, demographic information, and calculated change scores for all participants across the three course formats.
Variable Definitions
Course and Demographic Variables
Course: The course format in which the student was enrolled. Values include "BIOE20C" (lecture course), "BIOE82" (introductory field course), or "CEC" (California Ecology and Conservation, intensive field course).
StudentID: Anonymized randomly generated identifier for each student participant. Not linked to any identifiable student information.
first4: First four characters of StudentID.
URM: Underrepresented minority status. "Yes" indicates student identifies as African American/Black, American Indian/Alaskan Native, or Hispanic/Latino. "No" indicates student does not identify as URM.
EOP: Educational Opportunity Program status. "Yes" indicates student is in EOP program, "No" indicates student is not in EOP program.
FIF: First in family to attend college status. "Yes" indicates student is first in family to attend college, "No" indicates student is not first in family.
Gender: Self-reported gender identity. Values include "Female" and "Male".
Self-Efficacy Pre-Survey Variables
All pre-survey items were measured on a 5-point Likert scale: 1 = Strongly Disagree, 2 = Disagree, 3 = Neither Agree Nor Disagree, 4 = Agree, 5 = Strongly Agree
FloraPre: Pre-survey response to "I am familiar with the flora, fauna, and ecosystems of California" (1-5 scale).
ExpDesignPre: Pre-survey response to "I have strong experimental design skills" (1-5 scale).
OralPresPre: Pre-survey response to "I have strong oral presentation skills" (1-5 scale).
FieldResearchPre: Pre-survey response to "I know how to conduct field research projects from start to finish" (1-5 scale).
CareerSciencePre: Pre-survey response to "I am interested in pursuing a career in science" (1-5 scale).
GradDegreePre: Pre-survey response to "I plan to pursue a graduate degree" (1-5 scale).
Self-Efficacy Post-Survey Variables
All post-survey items were measured on the same 5-point Likert scale as the pre-survey items.
FloraPost: Post-survey response to "I am familiar with the flora, fauna, and ecosystems of California" (1-5 scale).
ExpDesignPost: Post-survey response to "I have strong experimental design skills" (1-5 scale).
OralPresPost: Post-survey response to "I have strong oral presentation skills" (1-5 scale).
FieldResearchPost: Post-survey response to "I know how to conduct field research projects from start to finish" (1-5 scale).
CareerSciencePost: Post-survey response to "I am interested in pursuing a career in science" (1-5 scale).
GradDegreePost: Post-survey response to "I plan to pursue a graduate degree" (1-5 scale).
Calculated Change Score Variables
These variables represent the difference between post- and pre-survey responses (Post minus Pre).
FloraDiff: Change in self-efficacy for flora/fauna/ecosystem familiarity (FloraPost - FloraPre).
ExpDesignDiff: Change in self-efficacy for experimental design skills (ExpDesignPost - ExpDesignPre).
OralPresDiff: Change in self-efficacy for oral presentation skills (OralPresPost - OralPresPre).
FieldResearchDiff: Change in self-efficacy for field research skills (FieldResearchPost - FieldResearchPre).
CareerScienceDiff: Change in interest in pursuing a career in science (CareerSciencePost - CareerSciencePre).
GradDegreeDiff: Change in plans to pursue a graduate degree (GradDegreePost - GradDegreePre).
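The change-score variables above can be recomputed directly from the pre/post columns as a consistency check. A minimal R sketch, using a toy data frame in place of the archived CSV (the column names follow the definitions above; the toy values are ours):

```r
# Toy rows standing in for refining_impact_assessment_data.csv;
# the real file would be loaded with read.csv("refining_impact_assessment_data.csv")
df <- data.frame(
  FloraPre  = c(2, 3, 4),
  FloraPost = c(4, 5, 4)
)

# Change score = Post minus Pre, matching the FloraDiff definition
df$FloraDiff <- df$FloraPost - df$FloraPre
df$FloraDiff  # 2 2 0
```

The same Post-minus-Pre subtraction applies to each of the six item pairs.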
Missing Data
Missing values are left blank. A blank cell indicates that data were not available for that particular student and variable. Missing values may occur for several reasons, including a student not answering a specific question on either the pre- or post-survey.
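When loading the CSV in R, blank cells should be mapped to NA explicitly so that change scores propagate missingness. A short sketch, with a two-row stand-in text parsed in place of the archived file:

```r
# Blank cells denote missing responses; na.strings = "" maps them to NA
# (base read.csv already treats blank numeric fields as NA, but being
# explicit also covers character columns). Toy stand-in for the real CSV:
csv_text <- "StudentID,FloraPre,FloraPost\nA1B2,4,\nC3D4,,5"
df <- read.csv(text = csv_text, na.strings = "")

# Pairwise change scores then propagate NA automatically
df$FloraDiff <- df$FloraPost - df$FloraPre
sum(is.na(df$FloraDiff))  # 2: each row lacks one of the pre/post pair
```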
Course Descriptions
- BIOE 20C Ecology and Evolution (labeled as "BIOE20C" in the dataset)
  - Referred to as "lecture course" in the manuscript
  - 5 units
  - Introduction to ecology and evolution covering principles of evolution at the molecular, organismal, and population levels, along with behavioral, population, and community ecology.
- BIOE 82 Introduction to Field Research and Conservation (labeled as "BIOE82" in the dataset)
  - Referred to as "intro field course" in the manuscript
  - 2 units
  - Introductory class exposing freshman, sophomore, and transfer students to local habitats throughout the Central Coast of California and to methods for conducting field-based studies.
- California Ecology and Conservation (labeled as "CEC" in the dataset)
  - Referred to as "intensive field course" in the manuscript
  - 19 units
  - Intensive science instruction and research training undertaken entirely within the UC Natural Reserve System, where students spend seven weeks living and studying at different reserves.
Code/software
refining_impact_assessment_code.txt
This file contains the R code scripts used to conduct Differential Item Functioning (DIF) analysis of the survey data. The scripts analyze pre-survey, post-survey, and change scores across course formats and demographic groups, generating statistical outputs and visualizations that quantify the magnitude of differences in self-efficacy development.
The dataset can be viewed with any spreadsheet software or text editor that handles CSV files. The analysis was conducted using R version 4.2.2 with the following packages:
- tidyverse (for data manipulation and visualization)
- ordinal (for ordinal logistic regression analysis)
- ggplot2 (for creating visualizations)
- dplyr (for data transformation)
- readr (for reading CSV files)
The R scripts included in the submission perform Differential Item Functioning (DIF) analysis to evaluate the impact of field-based learning on students' self-efficacy across three undergraduate biology courses. The workflow consists of several analysis components:
- Course Pre-Survey DIF Analysis (examining pre-instructional differences)
- Course Post-Survey DIF Analysis (examining post-instructional differences)
- Course Change Score DIF Analysis (examining developmental trajectories)
- Demographic DIF Analysis (comparing responses across demographic groups)
Each analysis produces both statistical outputs and visualizations that quantify the magnitude of differences between course formats and demographic groups.
Access information
The dataset presented in this submission is derived from an expanded analysis of data originally published in Beltran et al. (2020). The original dataset is publicly available at Dryad: https://doi.org/10.7291/D1DM3P
This submission extends the analysis of the original dataset using Differential Item Functioning (DIF) methodology to quantify intervention effectiveness and equity implications of field-based learning in undergraduate biology education.
The original data were collected under UCSC IRB protocol #HS3230. All data collection and analysis comply with institutional ethics guidelines.
Human subjects data
We received explicit consent from participants, under our approved UCSC IRB protocol #HS3230, to publish the de-identified data in the public domain.
StudentIDs were anonymized and do not represent a means to identify participants.
Study Context and Participants
We conducted several sets of new analyses on student response data from a pre/post survey previously administered by Beltran et al. (2020) at the University of California, Santa Cruz. The original survey responses span Fall 2016 through Spring 2019 and cover three undergraduate biology courses with different levels of the intervention (field-based learning): (1) BIOE 20C Ecology and Evolution (lecture course; no field-based learning; n = 81); (2) BIOE 82 Introduction to Field Research and Conservation (introductory field-based learning; n = 190); and (3) California Ecology and Conservation (intensive field-based learning; n = 293). Complete course descriptions are provided in Supplement 1. For demographic analyses, we compared five dichotomous groups based on self-reported student identities: (1) URM versus non-URM status, (2) FIF versus non-FIF status, (3) female versus male gender identity, (4) Educational Opportunity Program (EOP) versus non-EOP status, and (5) nondominant versus dominant identity status (not previously aggregated or analyzed by Beltran et al. [2020]). Students were categorized as “nondominant” if they identified with at least one historically minoritized population (URM, FIF, or EOP). Our choice of the term “nondominant” follows Gutiérrez et al. (2009), who explain that this terminology acknowledges power relations rather than numerical representation or characteristic descriptions, unlike terms such as “minority” or “students of color.” For the intensive field course, only URM and FIF data were available for classification. All data were collected under UCSC IRB protocol #HS3230 and are archived in the publicly accessible repository Dryad at https://doi.org/10.7291/D1DM3P.
Survey and Study Design
We analyzed pre/post student response data from a previously validated self-efficacy survey instrument with sufficient internal consistency (see Beltran et al. [2020] for details). Students rated themselves on four self-efficacy items using a 5-point Likert scale from Strongly Disagree to Strongly Agree (1 = Strongly Disagree, 2 = Disagree, 3 = Neither Agree Nor Disagree, 4 = Agree, and 5 = Strongly Agree). The four items (and the name we use hereafter, in italics) were:
- I am familiar with the flora, fauna, and ecosystems of California (*Species Identification*)
- I have strong experimental design skills (*Experimental Design*)
- I have strong oral presentation skills (*Oral Presentation*)
- I know how to conduct field research projects from start to finish (*Field Research*)
For each item, we calculated initial ratings from the first week of instruction (pre-survey), final ratings from the last week of instruction (post-survey), and differences between post and pre ratings (change scores). This matched pre/post design allowed us to compare self-efficacy ratings before and after the instructional intervention as well as developmental trajectories across course formats and student populations.
Statistical Analyses
Baseline (non-DIF) Analysis. We first conducted baseline analyses to establish initial points of comparison between course types and demographic groups, which we later compared with the DIF analysis. This primary analysis did not use DIF and instead served as a “traditional” approach based on raw Likert score comparisons (e.g., comparing pre/post means of an item). For each item, we compared pre, post, and change scores in the field courses versus the lecture course. To examine potential differences between course types, we conducted pairwise comparisons using independent samples t-tests with the lecture course as the reference group, comparing it separately with the intro and intensive field courses. To examine potential differences based on student identities, we also conducted pairwise comparisons between the dichotomous demographic groupings in each course. For example, we compared dominant versus nondominant students’ ratings on each of the four items in the lecture course student population, then compared those same groups in the intro field course, and finally in the intensive field course. Each of the four other demographic groupings were also compared in each of the three courses.
We applied the Benjamini-Hochberg (BH) correction (Benjamini & Hochberg, 1995) to maintain a false discovery rate of 5%, with our four study items representing a family of tests with critical values of p < 0.0125, p < 0.025, p < 0.0375, and p < 0.05 for first through fourth ranked p-values respectively. This correction minimized false discoveries while maximizing statistical power, which was necessary given our multiple comparisons across items and groups. We also calculated effect sizes (Cohen’s d) and interpreted these values based on thresholds proposed by Kraft (2020) in the context of educational interventions: d < 0.05 indicating small effect, d = 0.05 to < 0.20 indicating medium effect, and d ≥ 0.20 indicating large effect. We then compared these baseline analyses to the DIF analyses described below to see how the results aligned or differed. Additionally, we assessed whether the DIF procedures added any further insights on intervention impacts that were potentially missed or understated by the baseline analysis.
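The baseline workflow above (pairwise t-tests, BH correction across the four-item family, pooled-SD Cohen's d) can be sketched in base R. This is an illustrative sketch on simulated Likert ratings, not the archived analysis in refining_impact_assessment_code.txt; the object names and placeholder p-values are ours:

```r
set.seed(1)
# Simulated post-survey ratings for one item: lecture (n = 81) vs. a field course (n = 190)
lecture <- sample(1:5, 81,  replace = TRUE, prob = c(.15, .25, .30, .20, .10))
field   <- sample(1:5, 190, replace = TRUE, prob = c(.05, .10, .25, .35, .25))

# Independent-samples t-test with the lecture course as the reference group
tt <- t.test(field, lecture)

# Cohen's d using the pooled standard deviation
pooled_sd <- sqrt(((length(field) - 1) * var(field) +
                   (length(lecture) - 1) * var(lecture)) /
                  (length(field) + length(lecture) - 2))
d <- (mean(field) - mean(lecture)) / pooled_sd

# BH correction across the family of four item-level p-values
# (the other three are placeholders here)
p_family <- c(tt$p.value, 0.03, 0.2, 0.001)
p_adj <- p.adjust(p_family, method = "BH")
```

`p.adjust(..., method = "BH")` implements the ranked critical values described above, returning adjusted p-values that can be compared directly against 0.05.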
DIF Analysis. Building on the baseline comparisons, we conducted DIF analyses to more deeply examine potential group differences in item responses. We used ordinal logistic regression models in R (R Core Team, 2024) to analyze the ordered Likert data. We utilized the ordinal package (Christensen, 2023) for model fitting and the tidyverse suite (Wickham et al., 2019) for data manipulation and visualization. This specific DIF approach suited our context and response data because it appropriately handled the ordered response categories of the Likert scale, accommodated unequal sample sizes, and provided interpretable measures of group differences. For each item, the ordinal logistic regression models treated responses as ordered factors (1-5) and used centered total scores (combining all four items) as a control for overall self-efficacy levels. This analysis compared two nested models: a base model using only total scores to predict responses, and a DIF model that added group membership as a predictor. These DIF methods extended beyond the traditional mean comparisons that were done in the baseline analysis by allowing us to examine how response patterns differed between groups while controlling for overall ability levels.
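The nested-model comparison can be sketched as follows. The archived code uses the ordinal package's clm(); MASS::polr() (shipped with standard R distributions) is substituted here so the sketch runs without extra packages, and the data are simulated rather than the archived responses:

```r
library(MASS)  # polr(); the archived analysis uses ordinal::clm instead
set.seed(42)

n <- 300
group   <- factor(rep(c("lecture", "field"), each = n / 2),
                  levels = c("lecture", "field"))  # lecture = reference group
total_c <- rnorm(n)  # stand-in for the centered total score (overall self-efficacy control)
latent  <- 0.8 * total_c + 1.2 * (group == "field") + rnorm(n)
item    <- factor(cut(latent, breaks = c(-Inf, -1, 0, 1, 2, Inf), labels = 1:5),
                  ordered = TRUE)  # ordered 1-5 Likert response

# Base model: total score only; DIF model: adds group membership
m_base <- polr(item ~ total_c, Hess = TRUE)
m_dif  <- polr(item ~ total_c + group, Hess = TRUE)

# Likelihood ratio test of the nested models (1 df for the added predictor)
lrt_p <- pchisq(as.numeric(2 * (logLik(m_dif) - logLik(m_base))),
                df = 1, lower.tail = FALSE)

# DIF magnitude as the change in McFadden's R^2 relative to an intercept-only model
m_null    <- polr(item ~ 1)
r2_change <- as.numeric((logLik(m_base) - logLik(m_dif)) / logLik(m_null))

# Odds ratio for the focal (field) group
or_field <- exp(coef(m_dif)["groupfield"])
```

With simulated data favoring the field group, the LRT flags DIF, the McFadden R² change is positive, and the odds ratio exceeds 1, mirroring the quantities reported in the manuscript.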
We performed two sets of DIF analyses: a Course DIF Analysis to broadly assess intervention impacts and a Demographic DIF Analysis to assess whether those impacts were equitable. Our hypotheses for these analyses across the three courses are summarized in Tables 1 (Course DIF hypotheses) and 2 (Demographic DIF hypotheses).
[Table 1: Hypotheses for course DIF analysis (Pre/Post/Change)] - see manuscript
[Table 2: Hypotheses for demographic DIF analysis (Post/Change)] - see manuscript
For the Course DIF Analysis (Table 1), the two field-based courses (focal groups) were compared to the lecture-based course (reference group). Each pre-survey item was analyzed for DIF to test for any initial differences in the student populations before instruction. We expected no DIF on any of the pre-survey items in each course comparison (i.e., intro field course vs. lecture course, intensive field course vs. lecture course). We then performed a DIF analysis on the post-survey items to evaluate how the different levels of field-based learning in each course (i.e., different levels of the intervention) influenced self-efficacy development. We also tested for DIF on the change scores for each item. Since this analysis was done after instruction, we expected DIF on each of the post-survey items in each field-versus-lecture course comparison. We also expected the intensive field course versus lecture course to show greater DIF than the intro field course versus lecture course considering the major differences in course load (19 units versus 2 units, respectively) and levels of field-based immersion.
For the Demographic DIF Analysis (Table 2), we compared each of the five dichotomous demographic groups with the historically dominant identity as the reference group. These demographic DIF comparisons were done separately for each course (i.e., lecture course demographic groups compared to each other, intro field course compared to each other, and intensive field course compared to each other). For each comparison, we conducted separate DIF analyses for the pre- and post-survey response data, as well as on change scores, to investigate how these student populations may have differentially experienced self-efficacy development within each unique course context. In the lecture course, we expected the known disproportionately negative impacts of direct instruction on students with nondominant identities (Eddy & Hogan, 2014; Harris et al., 2020) to result in more instances of DIF between dominant and nondominant groups than in the field courses. We expected field courses to facilitate more equitable self-efficacy gains regardless of student identity in comparison to the lecture course. We predicted no demographic DIF on the pre-survey items in each course (since this was measured prior to instruction) and less demographic DIF on the post-survey items (and change score results) in the field courses compared to the lecture course. Comparatively less DIF in the field courses would indicate that after instruction, the demographic groups were similarly responding to the self-efficacy items (i.e., greater measurement invariance across demographic groups in the field courses versus the lecture course).
For all DIF analyses, we evaluated the presence of DIF based on multiple criteria. We assessed statistical significance using likelihood ratio tests comparing the nested models, applying the aforementioned BH correction as recommended by Kim and Oshima (2013). We quantified DIF magnitude using McFadden’s R² values interpreted based on thresholds established specifically for education research by Jodoin and Gierl (2001): negligible DIF (R² < 0.035), moderate DIF (0.035 ≤ R² ≤ 0.070), and large DIF (R² > 0.070). We also calculated odds ratios (ORs) as additional measures of practical significance with thresholds adapted from Bjorner et al. (1998): no DIF for ORs between 0.65-1.53, slight to moderate DIF for ORs outside 0.65-1.53, and moderate to large DIF for ORs outside 0.53-1.89. We then translated these ORs into statements about how many times more or less likely one group was to give higher ratings compared to the reference group (e.g., OR of 1.00 indicates exactly equal likelihood).
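The odds-ratio thresholds above can be encoded as a small classification helper. A sketch (the function name is ours; the cut-points are the Bjorner et al. [1998] thresholds as quoted):

```r
# Classify DIF magnitude from an odds ratio using the thresholds quoted above:
# no DIF within 0.65-1.53; moderate-to-large DIF outside 0.53-1.89;
# slight-to-moderate DIF in between.
classify_dif_or <- function(or) {
  if (or >= 0.65 && or <= 1.53) "no DIF"
  else if (or < 0.53 || or > 1.89) "moderate to large DIF"
  else "slight to moderate DIF"
}

classify_dif_or(1.00)  # "no DIF": exactly equal likelihood of higher ratings
classify_dif_or(1.70)  # "slight to moderate DIF"
classify_dif_or(2.40)  # "moderate to large DIF"
```

An OR above 1 means the focal group was that many times more likely to give a higher rating than the reference group; an OR below 1 means less likely.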
