Survey data on student expectations for faculty, teaching assistant, and peer support in engineering education
Data files
Mar 20, 2026 version (384.12 KB total):
- README.md (16.01 KB)
- TopicModellingStudentFeedbackData.xlsx (368.11 KB)
Abstract
This study compares five short text topic modeling (STTM) techniques for analyzing qualitative student feedback on instructional support in engineering education. Student feedback was collected using short answer questions that yielded 1,667, 1,592, and 1,376 expectations for faculty support, teaching assistant (TA) support, and peer support, respectively, as part of a larger survey conducted via convenience sampling in over 40 engineering courses offered at a single large university between 2016 and 2023. After cleaning and preprocessing the data, short text responses were analyzed using five unsupervised topic models implemented in Python: four traditional models (Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Non-Negative Matrix Factorization (NMF), and k-means) and one deep learning model (BERTopic). Model performance was evaluated using topic coherence and external performance metrics. Two approaches to establishing ground truth were evaluated: (a) keywords from each topic model guided manual (human) coding of the data (a machine-led approach); and (b) themes in the data were extracted and coded independently by a domain expert (a human-led approach). NMF achieved the highest average performance in two of the three datasets, reaching 75.6% accuracy, 75.7% F1-score, and 0.63 interrater reliability for the peer support dataset and 72.6% accuracy, 72.0% F1-score, and 0.57 interrater reliability for the TA support dataset. The human-led approach yielded higher accuracy and F1-scores for faculty and peer support but failed for TA support, where the topics extracted by the topic models did not align with the themes identified by a domain expert. These findings highlight the need for humans to be involved in the analysis of short text data in contexts like education research, where high performance is necessary to achieve appropriate rigor. Domain expert intervention also enables strategic use of topic models to optimize their role in qualitative data analysis.
Dataset DOI: 10.5061/dryad.x3ffbg81n
Description of the data and file structure
Information regarding TopicModellingStudentFeedbackData.xlsx
The data file noted above was used in conjunction with the following Jupyter Notebook files to generate the results in the PLOS ONE article entitled "A Comparative Analysis of Topic Modelling Techniques for the Thematic Analysis of Student Feedback" by Neha Kardam and Denise Wilson:
- 01_preprocessing.ipynb
- 02_topic_models.ipynb
- 03_evaluation.ipynb
This README document describes the data contained in the MS Excel file in the context of the comparative analysis of topic models described in the corresponding article:
Column A (Subject Code): A de-identified number assigned to a unique student response (no students responded twice to the survey).
Column B (Gender): Male, Female, Other
Column C (US_Status): US Citizen, Permanent US Resident, International (foreign) student, Other. The "Other" category reflects respondents who did not identify with any of the listed US residency statuses and could not be meaningfully merged into another category without misrepresenting survey responses.
Column D (Race_Ethnicity): Asian, Black, Latino/a, Native American, Pacific Islander, Non-Hispanic White, Mixed Race, Other. The "Mixed Race" category groups all respondents who identified with two or more racial/ethnic backgrounds.
Faculty Support
Column E (FacultySupportResponse): Student response to the following prompt: "What one action can your professors at [university] take to best support you in your classes (please be as specific as possible)?"
Column F (FacultySupport_Approach1_Code): A code was assigned by a domain expert/researcher based on the keywords extracted from the top words corresponding to each topic from each of the five topic models considered in the comparative analysis. The three possible codes that could be assigned to each document were labelled "examples and experience," "interactions," and "teaching practice."
- Examples & Experience Code Keywords: k-means (questions, material); LDA (examples, students, class); LSA (resources, exam, homework, tests, extra, solutions, problems, practice); NMF (examples, extra, exam, tests, homework, solutions, problems, practice); BERTopic (problems, practice, examples, material, tasks, support, notes, content, concepts)
- Interactions Code Keywords: k-means (hours, office, help, class, questions, helpful, available); LDA (ask, helpful, lecture, time, questions, hours, office); LSA (frequently, times, week, hour, available, hold, hours, office); NMF (extra, hold, available, help, offer, hours, office); BERTopic (office, hours, help, available, session, meet, support)
- Teaching Practice Code Keywords: k-means (lecture, problems, class, practice, examples, make, exams); LDA (time, material, homework, exams, provide, lecture, practice, problems); LSA (questions, practice, lectures, provide, problems, students, lecture, class, hours, office); NMF (professors, make, notes, examples, questions, time, students, lectures, class); BERTopic (lectures, class, slides, topics, discussion, instruction, plan)
Column G (FacultySupport_Approach2_Code): Based on the optimal number of topics recommended by the topic models (three), a second domain expert developed descriptions for three themes and coded the data according to those descriptions, without reviewing or having access to the topic assignments from the topic models. This approach assessed how well the topic models (natural language processing) were able to replicate the human/traditional code assignment process.
- Examples and experience code: emphasizes experiential learning where students are active participants in their educational activities ranging from in-class active learning opportunities to scaffolded examples and laboratory exercises
- Interactions code: emphasizes the importance that students place on interactions with others (faculty, TA, peers) to enhance learning through a wide range of formats (e.g., office hours, discussion forums, feedback, email, etc.)
- Teaching practice code: refers to all actions that faculty take to prepare, carry out, and follow-up on instruction ranging from providing supplemental resources for learning to recording and posting lectures and lecture notes to implementing an engaging teaching or lecture style to effective assessment practices
Column H (FacultySupport_MultipleTopics): Indicates whether multiple topics were present in the student response (Yes) or only a single topic was present (No) as assessed by the same domain expert involved in coding the data according to Approach 2 (A2).
Column I (FacultySupport_Ambiguous): Indicates ambiguity/uncertainty (Yes) in the domain expert's assessment of the student's response, as opposed to a clear code assignment (No), as assessed by the same domain expert involved in coding the data according to Approach 2 (A2).
TA Support
Column J (TASupportResponse): Student response to the following prompt: "What one action can your TAs at [university] take to best support you in your classes (please be as specific as possible)?"
Column K (TASupport_Approach1_Code): A code was assigned by a domain expert/researcher based on the keywords extracted from the top words corresponding to each topic from each of the five topic models considered in the comparative analysis. The three possible codes that could be assigned to each document were labelled "availability," "Q&A," and "teaching practice."
- Availability Code Keywords: k-means (office, hour, hold, offer, available, email); LDA (hour, office, session, offer, email, hold); LSA (office, hour, available, hold); NMF (office, hour, hold, offer, available, flexible); BERTopic (TAs, TA, helpful, hours, office)
- Q&A Code Keywords: k-means (question, answer, ask, email, homework, quickly); LDA (question, answer, ask, available, helpful); LSA (question, answer, section, problem, example, quiz); NMF (question, answer, ask, email, homework, quickly); BERTopic (questions, available, zoom, answer, email)
- Teaching Practice Code Keywords: k-means (lab, problem, section, example, material, quiz, work); LDA (lab, problem, section, example, material, homework, quiz); LSA (section, lab, problem, example, quiz, material); NMF (lab, section, problem, example, quiz, material, practice); BERTopic (lab, problems, section, material, quiz, homework, sections)
Column L (TASupport_Approach2_Code): Based on the optimal number of topics recommended by the topic models (three), a second domain expert developed descriptions for three themes and coded the data according to those descriptions, without reviewing or having access to the topic assignments from the topic models. This approach assessed how well the topic models (natural language processing) were able to replicate the human/traditional code assignment process.
- Examples and experience code: emphasizes experiential learning where students are active participants in their educational activities ranging from in-class active learning opportunities to scaffolded examples and laboratory exercises
- Interactions code: emphasizes the importance that students place on interactions with others (faculty, TA, peers) to enhance learning through a wide range of formats (e.g., office hours, discussion forums, feedback, email, etc.)
- Teaching Practice code: refers to all actions that TAs take to prepare, carry out, and follow-up on instruction ranging from providing supplemental resources for learning to recording and posting lectures and lecture notes to implementing an engaging teaching or lecture style to effective assessment practices
Column M (TASupport_MultipleTopics): Indicates whether multiple topics were present in the student response (Yes) or only a single topic was present (No) as assessed by the same domain expert involved in coding the data according to Approach 2 (A2).
Column N (TASupport_Ambiguous): Indicates ambiguity/uncertainty (Yes) in the domain expert's assessment of the student's response, as opposed to a clear code assignment (No), as assessed by the same domain expert involved in coding the data according to Approach 2 (A2).
Peer Support
Column O (PeerSupportResponse): Student response to the following prompt: "What one action can students in your class take to improve your educational experience (please be as specific as possible)?"
Column P (PeerSupport_Approach1_Code): A code was assigned by a domain expert/researcher based on the keywords extracted from the top words corresponding to each topic from each of the five topic models considered in the comparative analysis. The three possible codes that could be assigned to each document were labelled "engagement," "questioning practice," and "civility."
- Engagement Code Keywords: k-means (help, group, participate, study, discussion, work); LDA (participate, discussion, engage, breakout, room); LSA (study, group, form, work, willing, homework); NMF (study, group, form, work, willing, homework, open, discussion); BERTopic (help, study, groups, work, discord, group, participate, breakout, homework)
- Questioning Code Keywords: k-means (question, ask, answer, good, think, afraid); LDA (question, ask, answer, think, respectful); LSA (question, ask, answer, good, participate, think); NMF (question, ask, answer, good, think, afraid, relevant); BERTopic (questions, ask, asking, confused, confusing, good, think)
- Civility Code Keywords: k-means (talk, stop, distract, people, learn, friend); LDA (talk, distract, chat, group, study, work); LSA (talk, distract, respectful, participate); NMF (talk, distract, respectful, stop, mute, quiet); BERTopic (lecture, professor, dont, talking, lectures, time)
Column Q (PeerSupport_Approach2_Code): Based on the optimal number of topics recommended by the topic models (three), a second domain expert developed descriptions for three themes and coded the data according to those descriptions, without reviewing or having access to the topic assignments from the topic models. This approach assessed how well the topic models (natural language processing) were able to replicate the human/traditional code assignment process.
- Interactions code: emphasizes the importance that students place on interacting with their peers to enhance learning through collaborative activities including study groups, group projects, homework collaboration, and participation in class discussions and breakout rooms
- Questioning code: emphasizes the importance of peer questioning in the classroom, where students ask thoughtful, relevant questions during lectures to support a more effective and interactive learning environment for all students
- Civility code: includes a broad range of civil behaviors expected of peers in the classroom including but not limited to refraining from disruptive talking, avoiding distracting activities, being punctual, muting microphones when not speaking, and maintaining a respectful learning environment
Column R (PeerSupport_MultipleTopics): Indicates whether multiple topics were present in the student response (Yes) or only a single topic was present (No) as assessed by the same domain expert involved in coding the data according to Approach 2 (A2).
Column S (PeerSupport_Ambiguous): Indicates ambiguity/uncertainty (Yes) in the domain expert's assessment of the student's response, as opposed to a clear code assignment (No), as assessed by the same domain expert involved in coding the data according to Approach 2 (A2).
COVID Timing
Column T (COVID_Code): PreCOVID (class offered before spring 2020); DuringCOVID (class offered during remote learning between spring 2020 and spring 2021 inclusive); PostCOVID (class offered after spring 2021)
Missing values
Empty cells in this dataset fall into two categories:
1. Demographic columns (Gender, US_Status, Race_Ethnicity, COVID_Code)
A small number of rows have blank demographic values (Gender: 13; US_Status: 15; Race_Ethnicity: 41; COVID_Code: 1), reflecting individual students who chose not to disclose that information. All demographic blanks indicate "not available": the information was not provided by the student.
2. Survey response and coding columns
Some cells in all three survey response columns and their associated coding columns are blank. These indicate "not applicable": the student did not respond to that particular prompt, so no coding was performed. Blank counts by dataset: Faculty Support (FacultySupportResponse): 31; TA Support (TASupportResponse): 115; Peer Support (PeerSupportResponse): 332.
These cells have been intentionally left blank (rather than filled with "n/a") to preserve one-to-one row correspondence across all three survey questions and to ensure compatibility with the analysis scripts, which detect blank cells programmatically to exclude them from topic modeling while retaining all rows in the output files.
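The blank-cell convention can be illustrated with pandas. This is a minimal sketch using a toy data frame (not the authors' actual scripts): pandas reads blank Excel cells as NaN, so they can be detected programmatically and excluded from modeling while every row is retained.

```python
import pandas as pd

# Toy frame mimicking the workbook's layout; blank Excel cells arrive as NaN.
df = pd.DataFrame({
    "Subject Code": [1, 2, 3],
    "FacultySupportResponse": ["give more examples", None, "hold more office hours"],
})

# Every row is retained in the output, but only non-blank responses
# are passed on to the topic models.
faculty_docs = df["FacultySupportResponse"].dropna().tolist()
```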
Code/software
The analysis was conducted using Python 3.12.7 with the following Jupyter notebooks:
- 01_preprocessing.ipynb: Text preprocessing (lowercase, contraction expansion, punctuation/special character/number removal, POS-aware lemmatization via spaCy, stopword removal)
- 02_topic_models.ipynb: Five topic models (K-means, LDA, LSA, NMF, BERTopic) with elbow method for optimal topic selection
- 03_evaluation.ipynb: Evaluation metrics (accuracy, precision, recall, F1, Cohen's kappa, UMass coherence)
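The preprocessing steps in 01_preprocessing.ipynb can be sketched with the standard library alone. The notebook itself uses the contractions package for full contraction expansion and spaCy for POS-aware lemmatization; this simplified function expands only a few contractions and omits lemmatization.

```python
import re

def preprocess(text: str, stopwords: set[str]) -> list[str]:
    """Lowercase, expand a few contractions, strip punctuation/numbers, drop stopwords."""
    text = text.lower()
    # The notebook uses the `contractions` package; a handful suffice for illustration.
    for pattern, replacement in {"can't": "cannot", "n't": " not", "'re": " are"}.items():
        text = text.replace(pattern, replacement)
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation, special characters, numbers
    return [token for token in text.split() if token not in stopwords]

print(preprocess("The TAs can't answer 100 questions!", {"the"}))
```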
Required Python packages: pandas, numpy, openpyxl, spacy (with en_core_web_sm model), contractions, matplotlib, seaborn, wordcloud, scikit-learn, bertopic, sentence-transformers, umap-learn, gensim, python-docx
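Of these packages, scikit-learn supplies the external performance metrics reported in the article (accuracy, F1, Cohen's kappa). A hypothetical example with made-up code labels:

```python
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score

# Hypothetical codes: topic-model assignments vs. human ground truth.
human = ["interactions", "teaching", "examples", "interactions", "teaching"]
model = ["interactions", "examples", "examples", "interactions", "teaching"]

acc = accuracy_score(human, model)               # fraction of exact matches
f1 = f1_score(human, model, average="weighted")  # weighted across the codes
kappa = cohen_kappa_score(human, model)          # chance-corrected interrater agreement
```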
The complete code, installation instructions, and requirements.txt are available at github.com/nehakardam/topic-model-student-feedback. A citable, DOI-versioned archive of the code repository is also available at Zenodo: 10.5281/zenodo.19099505.
Workflow: Run notebooks sequentially (01 → 02 → 03). Each notebook has a DATASET configuration variable that can be set to 'Faculty', 'TA', or 'Peer' to reproduce results for each dataset. All models use random_state=42 for reproducibility.
Access information
Code repository: github.com/nehakardam/topic-model-student-feedback (MIT License)
The data were collected via anonymous student surveys administered at the University of Washington as part of instructional feedback research. The survey data are original and were not derived from other sources.
Human subjects data
Data were collected via anonymous student surveys under IRB approval (STUDY00000378) at the University of Washington, where completion of the survey constituted consent to participate in the research and acknowledgment that anonymized responses may be used for research purposes. Data were de-identified prior to archiving by: (1) removing all direct identifiers including names, student IDs, and course section codes; (2) replacing individual identifiers with sequential numeric codes (1, 2, 3...) with no link to any enrollment record; and (3) retaining only three demographic indirect identifiers (gender, domestic/international status, and race/ethnicity). An additional variable, COVID_Code, indicates study period (Pre/During/Post COVID-19) and is a data collection classification, not a personal attribute. No combination of retained variables uniquely identifies any individual participant given the large sample sizes (1,376–1,667 respondents per dataset).
