[Stimulus Set] Evoking the N400 event-related potential (ERP) component using a publicly available novel set of sentences with semantically incongruent or congruent eggplants (endings)

Toffolo, Kathryn 1 ; Freedman, Edward 1 ; Foxe, John 1

Research facility: University of Rochester

Published May 16, 2022; Updated Sep 21, 2024 on Dryad. https://doi.org/10.5061/dryad.9ghx3ffkg

Data files

May 16, 2022 version files (95.05 MB)

README.txt: 14.68 KB
stimuli.zip: 95.04 MB

Dec 02, 2022 version files (95.05 MB)

README.txt: 14.68 KB
stimuli.zip: 95.04 MB

Sep 21, 2024 version files (95.27 MB)

N400Stimset_cloze-probability-survey_results.json: 431 B
N400Stimset_cloze-probability-survey_results.tsv: 10.69 KB
N400Stimset_cloze-probability-survey.json: 449 B
N400Stimset_cloze-probability-survey.tsv: 175.12 KB
N400Stimset_stimuli_parameters.json: 3.15 KB
N400Stimset_stimuli_parameters.tsv: 75.69 KB
README: 17.54 KB
README.md: 15.72 KB
stimuli.zip: 94.97 MB

Abstract

During speech comprehension, the ongoing context of a sentence is used to predict sentence outcome by limiting subsequent word likelihood. Neurophysiologically, violations of context-dependent predictions result in amplitude modulations of the N400 event-related potential (ERP) component. While the N400 is widely used to measure semantic processing and integration, no auditory stimulus set is publicly available to standardize approaches across the field. Here, we developed an auditory stimulus set of 442 sentences that utilized the semantic anomaly paradigm, provided cloze probability for all stimuli, and was developed for both children and adults. With 20 neurotypical adults, we validated that this set elicits robust N400s, as well as two additional semantically related ERP components: the recognition potential (~250 ms) and the late positivity component (~600 ms). This stimulus set (https://doi.org/10.5061/dryad.9ghx3ffkg) and the 20 high-density (128-channel) electrophysiological datasets (https://doi.org/10.5061/dryad.6wwpzgmx4) are made publicly available to promote data sharing and reuse. Future studies that use this stimulus set to investigate sentential semantic comprehension in both control and clinical populations may benefit from the increased comparability and reproducibility within this field of research.

Task Description:

The task was first explained to the participant during the consent process and then again before the experimental session. Individuals were asked to refrain from excessive movement and to focus on a fixation cross throughout the task in order to reduce movement artifacts. The experimental session began by explaining the task for a third time. All instructions were presented both visually on the screen and auditorily through the headphones. Instructions were followed by two practice trials, which were the same for every participant. Feedback was given about a participant’s response only during practice trials (2 example stimuli) and not during experimental trials. Trials were presented as follows: 1. A fixation cross was on the screen while an auditory sentence stimulus was presented through headphones; 2. A 2-second pause; and 3. A question (presented both visually and auditorily) asked the participant if the sentence ended as expected, and subjects responded with the right or left arrow key when sentences ended as expected (congruent) or unexpectedly (incongruent), respectively, to end the trial. A two-second delay was inserted between a subject’s response and the start of the next sentence. During the experiment, a total of 440 stimuli were presented to participants in the same order. This was done to ensure that every participant had the same experience throughout the task for every sentence. Stimuli were separated into 11 blocks with optional breaks between each block. Participants could continue onto the next block by pressing the spacebar.

Contact information regarding analyses

First Author: Kathryn Toffolo
University Email: kathryn_toffolo@urmc.rochester.edu
Unaffiliated Email: kattoffolo@gmail.com
ORCID iD: orcid.org/0000-0002-5728-3174
Linkedin: Kathryn Toffolo

Software used:

Audacity® software --> Used for sentence compilation and pitch/pace adjustments
Praat (PRAAT v. 6.1, University of Amsterdam, the Netherlands) --> Used to measure the timing of the sentences and target words
JASP (JASP Team [2020], Version 0.12.2) --> Statistical analysis
MATLAB (MathWorks Inc., Natick, MA) --> EEG preprocessing and analysis
EEGLAB (Delorme & Makeig, 2004) --> EEG preprocessing and analysis
FieldTrip toolbox (Oostenveld et al., 2011) --> Topography statistical plots

Descriptions of Updates:

9/21/24- Although onsets were previously individually determined, in response to a reviewer’s inquiry, we performed an additional inter-rater reliability assessment to ensure that onsets were indeed accurately registered. Three additional listeners were enlisted to independently record final word onsets. Listeners were trained to use the program Praat and assigned ~100 random stimuli (not including the "LD5_Eliminated" stimuli) to record the timing. Afterward, we compared the previously recorded onset times with the times recorded by the three listeners. Out of 302 stimuli, we noticed only 6 outlier mistakes. The first author then reviewed the remaining stimuli of this set, including the "LD5_Eliminated" stimuli (118 stimuli), and found 4 outlier mistakes. These 10 outlier mistakes are now corrected on the “stimuli_parameter” sheet provided with the original stimulus set. Additionally, minor changes (i.e., 8 ms < onset discrepancy < 20 ms) were made to the “stimuli_parameter” sheet for 9 stimuli. Excluding outliers but including minor changes, the average difference between the previously recorded onsets and the three independent reviewers was 1.1 ± 1.8 ms. As such, we are highly confident in the onset determinations.

These outlier onset differences do not affect collected data because they are parameters only needed for analysis. The one sentence that was incorrectly presented to subjects has been corrected and uploaded, but will not affect the grand average ERP given the number of stimuli presented (>400). These outlier onset differences also do not affect the analysis of the data for the same reason (i.e., the number of stimuli presented is very large (>400) relative to the outliers). Minor changes (i.e., 8 ms < onset discrepancy < 20 ms) will be even less influential on analysis.

Below are the specific changes that were made:

Outlier mistakes:
(LD5_Eliminated)NPC_chain - ~805.1ms change
(LD5_Eliminated)NPC_warm - ~73.1ms change
(LD5_Eliminated)NPI_eggs(brake) - ~638.7ms change
NPC_blue - ~90.0ms change
NPC_cream - ~73.5ms change
NPC_gloves - ~76.1ms change
NPC_spider - ~105.1ms change
NPI_clown(dice) - ~87.9ms change (Incorrect Sentence. Correct stimulus has been uploaded.)
NPI_brakes(cob) - ~38.6ms change
NPI_out(ant) - ~22.3ms change

Minor Changes:
(LD5_Eliminated)NPI_coal(chain) - ~16.6ms change
(LD5_Eliminated)NPI_fried(mom) - ~14.7ms change
NPC_bath - ~17.0ms change
NPC_bright - ~14.5ms change
NPC_gum - ~12.6ms change
NPI_bath(roar) - ~14.9ms change
NPI_cough(friend) - ~13.9ms change
NPI_dancers(pirate) - ~8.6ms change
NPI_gum(coat) - ~11.9ms change

Description of file(s):

Stimulus Set ("stimuli"):

This stimulus set includes 221 sentence pairs for a total of 442 auditory sentence stimuli. These sentences are mono (not stereo) files in .wav format. The congruent member of each sentence pair is denoted by the prefix "NPC_" for "non-prosodic congruent" and the incongruent member is denoted by the prefix "NPI_" for "non-prosodic incongruent". Notice that the names of the incongruent stimuli first show the congruent ending followed by the incongruent ending in parentheses (e.g., NPC_bake vs. NPI_bake(milk)). Additionally, there are 20 sentence pairs with the prefix "(LD5_Eliminated)". These stimuli were removed from the final analysis because they: 1. Had endings that could make sense to children (e.g., in a universe where animals are personified, such as in cartoons, the incongruent ending of “My dog digs holes/spring” could make sense in a context where the verb “dig” is synonymous with “like”); 2. Did not match in syllable number; 3. Were hyphenated phrases; or 4. Contained cultural references. The specific linguistic division information for all stimuli is provided in the TSV file "N400Stimset_stimuli_parameters". Although eliminated from our final analysis, all corresponding timing and divisions for these stimuli are provided in this TSV file if future studies would like to use them in their research. Furthermore, if researchers would like to analyse the responses to these stimuli, the responses to all stimuli are in the raw EEG files for each participant. For the sentence stimuli, individual words from the word list ("N400Stimset_stimuli_parameters" TSV file) were recorded from a female speaker, who was instructed to voice words with minimal inflection, stress, or intonation (i.e., in a monotonous non-prosodic manner). Words were then compiled into complete sentences using the Audacity® software. These artificially compiled sentences were adjusted to have similar pitch frequency and pacing across each word within a sentence and between all sentences.
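
The naming convention described above can be parsed programmatically. The following is a minimal Python sketch and is not part of the original dataset: the function name `parse_stimulus_name`, the returned field names, and the tolerance for an optional ".wav" suffix are assumptions made for illustration only.

```python
import re

# Matches names such as "NPC_bake", "NPI_bake(milk)", or
# "(LD5_Eliminated)NPC_chain". An optional ".wav" suffix is tolerated
# (assumption: the README lists names without the extension).
_NAME = re.compile(
    r"(?P<elim>\(LD5_Eliminated\))?"      # optional eliminated-pair marker
    r"(?P<prefix>NPC|NPI)_"               # congruent / incongruent prefix
    r"(?P<congruent>[^()]+?)"             # congruent ending of the pair
    r"(?:\((?P<incongruent>[^()]+)\))?"   # incongruent ending, if any
    r"(?:\.wav)?"
)

def parse_stimulus_name(name: str) -> dict:
    """Split a stimulus name into its documented parts (hypothetical helper)."""
    m = _NAME.fullmatch(name.strip())
    if m is None:
        raise ValueError(f"unrecognized stimulus name: {name!r}")
    condition = "congruent" if m["prefix"] == "NPC" else "incongruent"
    return {
        "eliminated": m["elim"] is not None,
        "condition": condition,
        "congruent_ending": m["congruent"],
        # For congruent files the presented ending is the congruent one.
        "presented_ending": m["incongruent"] or m["congruent"],
    }
```

For example, `parse_stimulus_name("NPI_bake(milk)")` reports an incongruent stimulus whose sentence pair ends in "bake" when congruent and whose presented ending is "milk".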

In addition to the stimulus set, 15 task-related sentences are included. These audio files were recorded by a female speaker, who was instructed to voice the sentences as she would when talking to a young adult. "(Audio_01)DidThisSentenceEndCorrectly" follows each sentence stimulus and is played so the participant knows it is time to respond. "(Audio_02)DoYouWantToTakeABreak" is played at the onset of each break period. "(Audio_03)Congratulations" is played at the end of the task. Audio files with the prefixes "(Intro_01)"-"(Intro_05)" are descriptions of the task and should be played in order. Audio files with the prefixes "(Intro_06)"-"(Intro_11)" are for the practice example section. This includes audio introducing the practice session "(Intro_06)Practice_Intro", audio for between examples "(Intro_07)Practice_LetsTryAnotherOne", and 4 possible responses depending on how the subject answers: 1. Correct for the congruent example "(Intro_08)PracticeFeedback_Correct4Congruent"; 2. Correct for the incongruent example "(Intro_09)PracticeFeedback_Correct4Inongruent"; 3. Incorrect for the incongruent example "(Intro_10)PracticeFeedback_Incorrect4Incongruent"; and 4. Incorrect for the congruent example "(Intro_11)PracticeFeedback_Incorrect4Congruent". Lastly, the file "(Intro_12) Practice_LetsStartTheTask" lets the participant know that the task is starting.

"N400Stimset_stimuli_parameters":

This .tsv file was essential to this study and is therefore described below. The triggers in the raw EEG data were recorded when each stimulus began, not at the onset of the target word. To align the data to the target word, the "target_onset(s)" value in this file was added to the onset of each trigger in the EEG data. Before this value was added to the stimulus onset times, it was converted from seconds to data points (samples) by multiplying by the sample rate (512 Hz). This file can also be used to see what is said in each sentence. The top half of the file contains the sentences with congruent endings, while the bottom half contains the sentences with incongruent endings. The first 3 columns are the order in which a stimulus was played, the stimulus key, and the stimulus file name so that each sentence can be matched to an audio file. Following this are the sentences separated by each word. This file may be useful to N400 investigations that want to have a visual presentation of the stimuli in addition to or instead of an auditory presentation.
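
The trigger adjustment described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function name `target_onset_sample` is hypothetical, and rounding to the nearest whole sample is an assumption, since the text only states that the onset in seconds was multiplied by the sample rate.

```python
SAMPLE_RATE_HZ = 512  # sample rate stated in this README

def target_onset_sample(trigger_sample: int, target_onset_s: float,
                        sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Shift a sentence-onset trigger to the onset of the target word.

    trigger_sample : sample index of the trigger (start of the sentence)
    target_onset_s : value from the "target_onset(s)" column, in seconds
    """
    # Convert seconds to samples, then add to the trigger position.
    # Rounding to the nearest sample is an assumption of this sketch.
    offset_samples = round(target_onset_s * sample_rate_hz)
    return trigger_sample + offset_samples
```

With a trigger at sample 1000 and a target onset of 1.5 s, the target word would begin at sample 1000 + 768 = 1768.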

"stim_key"- is the order number of the stimulus representing when in the dataset the stimulus was presented (1st sound, 2nd sound, ... etc.).
"stim_file"- is the audio file name for the presented stimulus.
"1"- is the first word in the stimulus (sentence).
"2"- is the second word in the stimulus (sentence).
"3"- is the third word in the stimulus (sentence).
"4"- is the fourth word in the stimulus (sentence).
"5"- is the fifth word in the stimulus (sentence).
"6"- is the sixth word in the stimulus (sentence).
"7"- is the seventh word in the stimulus (sentence).
"8"- is the eighth word in the stimulus (sentence).
"stim_dur(s)"- is the entire length of each stimulus in seconds.
"target_onset(s)"- is the time from the beginning of the sentence (the raw trigger time in the raw EEG data) to the START of the target/ending word in seconds.
"target_end(s)"- is the time from the beginning of the sentence (where the trigger was placed) to the END of the target/ending word in seconds.
"target_dur(s)"- is the time between the START and the END of the target/ending word in seconds.
"time-quarter_div"- is the division to investigate for an effect of time. The stimuli are broken into 4 groups (1-4) by quarter. Because the stimuli in this file are in the order that they were presented to each participant, this column will be in order from 1-4.
"order-group_div"- is the division to investigate for an effect of order. The stimuli are broken into 4 groups: 1. Congruent stimuli for the scenario in which the congruent stimulus of a pair was presented before the incongruent stimulus; 2. Incongruent stimuli for the scenario in which the congruent stimulus of a pair was presented before the incongruent stimulus; 3. Congruent stimuli for the scenario in which the incongruent stimulus of a pair was presented before the congruent stimulus; 4. Incongruent stimuli for the scenario in which the incongruent stimulus of a pair was presented before the congruent stimulus.
"cloze-probability%_div"- are the cloze probability (CP) scores for each sentence. To investigate the effect of CP, these scores were broken into 4 different groups: 1. Sentence pairs with CP greater than or equal to 96%; 2. Sentence pairs with greater than or equal to 90% CP and less than 96% CP; 3. Sentence pairs with greater than or equal to 80% CP and less than 90% CP; 4. Sentence pairs with less than 80% CP.
"linguistic-group_div"- is the division to investigate for an effect of linguistic error. The stimuli are broken into 5 groups: 1. Incongruent sentences that contain only semantic ending errors, along with their congruent pair; 2. Incongruent sentences with both semantic errors and a syntactic number error, along with their congruent pair; 3. Incongruent sentences with both semantic errors and syntactic adjective/noun errors, along with their congruent pair; 4. Incongruent sentence stimuli with both semantic errors and syntactic verb/noun ending errors, along with their congruent pair; and 5. The eliminated linguistic division group, which contained 19 sentence pairs whose endings could make sense to children, did not match in syllable number, were hyphenated phrases, or contained cultural references. The final analysis combined groups 2-4 into one larger group of semantic and syntactic error sentence pairs in order to contrast with sentence pairs containing just semantic errors.
"linguistic-group_reasoning"- are quick explanations/descriptions for why a stimulus was placed in a particular linguistic group.

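The cloze probability groups and the combined linguistic division described in the column list above can be expressed as two small Python functions. This is a sketch for illustration: the function names and the text labels returned by `collapse_linguistic_group` are hypothetical and do not appear in the dataset; only the thresholds and group numbers are taken from the descriptions above.

```python
def cp_group(cp_percent: float) -> int:
    """Map a cloze probability (in percent) to the documented groups 1-4."""
    if cp_percent >= 96:
        return 1          # CP >= 96%
    if cp_percent >= 90:
        return 2          # 90% <= CP < 96%
    if cp_percent >= 80:
        return 3          # 80% <= CP < 90%
    return 4              # CP < 80%

def collapse_linguistic_group(group: int) -> str:
    """Combine linguistic groups 2-4, as done for the final analysis.

    The returned labels are illustrative names, not dataset values.
    """
    if group == 1:
        return "semantic"
    if group in (2, 3, 4):
        return "semantic+syntactic"
    if group == 5:
        return "eliminated"
    raise ValueError(f"unknown linguistic group: {group}")
```
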
"N400Stimset_cloze-probability-survey"

This file includes the RedCap survey answers used to establish the cloze probability for each sentence in this experiment. In this survey, each sentence in this stimulus set was presented with the final word missing (blank) and participants were required to fill in this blank with the first single word that came to mind. If participants could not think of an answer, they were encouraged to guess rather than leave a blank. Non-answers were not counted toward CP scores; participants were removed from the survey data if they answered fewer than 10 questions out of the 221; and participants were removed if they scored 3 standard deviations from the mean. Given these elimination criteria, ID numbers 54, 119, and 135 were eliminated, and the survey used the responses from 134 individuals to assess the CP of each sentence. The rows represent each participant and the columns represent their answers. The first columns include the participant #, their 1st language, other languages they know, their age, their gender, and their percent correct out of all the questions. Following this are their individual answers to each fill-in-the-blank. Below the answers to this survey is the answer key to each fill-in-the-blank question (i.e., the intended congruent ending for each sentence). The JSON file "N400Stimset_cloze-probability-survey" provides more information about this dataset including the average percent correct, the standard deviation of the percent correct, and the mean age of participants.
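
The participant elimination criteria above could be applied as in the following Python sketch. Several details are assumptions, because the text does not specify them: low responders are removed before the mean and standard deviation are computed, the sample standard deviation is used, and "3 standard deviations from the mean" is read as a deviation of more than 3 SD in either direction. The function name and argument names are hypothetical.

```python
from statistics import mean, stdev

def keep_participants(percent_correct, n_answered,
                      min_answered=10, sd_limit=3.0):
    """Return the IDs that survive both exclusion criteria.

    percent_correct : dict mapping participant ID -> percent correct
    n_answered      : dict mapping participant ID -> questions answered
    """
    # Criterion 1: drop participants who answered fewer than 10 questions.
    ids = [pid for pid in percent_correct if n_answered[pid] >= min_answered]

    # Criterion 2: drop scores more than sd_limit SDs from the mean
    # (assumption: statistics are computed after criterion 1).
    scores = [percent_correct[pid] for pid in ids]
    mu, sd = mean(scores), stdev(scores)
    return [pid for pid in ids
            if abs(percent_correct[pid] - mu) <= sd_limit * sd]
```
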

"N400Stimset_cloze-probability-survey_results"

This file is the final assessment of each fill-in-the-blank question after participant elimination. "#" is the order number of each fill-in-the-blank question. This corresponds to the number at the top of the "N400Stimset_cloze-probability-survey_answers". "total_answers" is the number of participants that answered the particular question, while "correct_answers" is the number of participants that answered correctly. "cloze_probability" is the cloze probability score for each sentence (i.e., correct answers/total answers). This is followed by the sentential context given to the participants for each fill-in-the-blank ("sentential_context"), and the intended correct answer for the congruent sentence ("key").
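
As a worked illustration of the "cloze_probability" column, the Python sketch below computes correct answers divided by total answers for one question, with non-answers left out of the total. The function name is hypothetical, and the case-insensitive, whitespace-trimmed comparison against the key is an assumption; the scoring rules actually used for spelling variants are not described here.

```python
def cloze_probability(answers, key):
    """Return (total_answers, correct_answers, cloze_probability).

    answers : iterable of responses to one fill-in-the-blank question;
              None or empty strings are treated as non-answers
    key     : intended congruent ending for the sentence
    """
    # Non-answers do not count toward the total.
    given = [a.strip().lower() for a in answers if a and a.strip()]
    total = len(given)
    # Assumption: matching ignores case and surrounding whitespace.
    correct = sum(1 for a in given if a == key.strip().lower())
    cp = correct / total if total else None
    return total, correct, cp
```

For the responses "cake", "Cake ", a blank, and "pie" with the key "cake", this gives 3 total answers, 2 correct answers, and a cloze probability of 2/3.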

2.1 Participants

Twenty-four neurotypical adults were recruited and provided written informed consent to participate in this study. Four subjects were excluded from data analysis due to failure to remain alert or to sit still during data collection (n=2), or due to noisy EEG data (n=2) (Figure S1). The remaining participants make up the fully-analyzed dataset. These twenty subjects ranged in age from 18 to 35 (mean age = 25.5 +/- 4.36), nine were female, and three were left-handed. Every participant spoke English as their first language, and twelve participants were mono-lingual while eight participants reported being bi- or multi-lingual. Demographic information for all participants, including those removed from further analysis, is reported in Table 1.

2.2 Stimuli

The semantic anomaly paradigm consisted of 221 sentence pairs with incongruent and congruent endings, for a total of 442 stimuli in this stimulus set. However, twenty sentence pairs were eliminated before analysis because their endings did not match in syllable number, contained hyphenated phrases, or cultural references, or upon closer examination, the supposed incongruent endings were in fact congruent. The remaining 402 stimuli (201 congruent and 201 incongruent sentence pairs) ranged from four to eight words in length. Sentences were created using simple words, derived from a set of age-appropriate words contained in the Medical Research Council Psycholinguistic (MRCP) database (https://websites.psychology.uwa.edu.au/school/mrcdatabase/uwa_mrc.htm). For example, from the word “cake”, the congruent sentence “He baked a birthday cake” was created. The words selected from the MRCP database took into consideration the age of acquisition (from a database provided by (Gilhooly and Logie, 1980)) and/or written word frequency (from a normed written word frequency set (Francis and Kucera, 1967)). This was to ensure that each sentence could be readily comprehended by NT individuals aged 5 years or older. The majority of these age-appropriate words were monosyllabic, but a few basic words with 2-syllables (e.g. present) and 3-syllables (e.g. animal) were included, due to their high-written-word frequency or early age of acquisition. Similar to previous studies using the MRCP database (Ross et al., 2011, 2007), we were sensitive to the fact that everyday language use has changed since the creation of these sets, so we carefully selected words that are still in common use.

This stimulus set included sentence pairs where the incongruent endings were just semantic errors. After elimination, this amounts to 132 / 201 stimulus pairs including example stimuli. These incongruent endings were matched to their congruent pair in word type (e.g. noun or verb) and number (e.g. plural or singular). There were also sentence pairs where incongruent endings contained both semantic and syntactic errors. After elimination, this amounts to 69 / 201 stimulus pairs. Semantically incongruent endings were also classified as syntactic errors if the ending deviated from the syntax provided by the sentential context. The types of syntactic errors included: 1. Endings where a plural noun was expected but a singular noun was presented, or vice versa (27 stimulus pairs); 2. Endings where an adjective was expected but a noun was presented, or vice versa (20 stimulus pairs); and 3. Endings where a verb was expected but a noun was presented, or vice versa (22 stimulus pairs). During the final analysis, these 3 types of syntactic errors were combined into a single linguistic division (LD) in order to compare the overall response to sentences with both semantic and syntactic errors with the response to sentences with only semantic errors.

Language comprehension in NT individuals is highly influenced by communicative cues such as prosody, especially when sentence meaning relies on syntactic prosodic cues (Cutler et al., 1997; Dahan, 2015; Frazier et al., 2006; Thorson, 2018). The intention of this manuscript was to create a stimulus set that could be used for all populations equally. Therefore, this stimulus set was constructed without prosody to ensure that NT individuals would not have an advantage in language comprehension over other populations known to have difficulty with communication or prosody, such as those with ASD (DePape et al., 2012; Eigsti et al., 2012; Martzoukou et al., 2017; McCann et al., 2007; O’Connor, 2012; Wang and Tsao, 2015) or schizophrenia (Leitman et al., 2007, 2005). To do so, individual words from the word list were recorded by a female speaker, who was instructed to voice the words with minimal inflection, stress, and intonation (i.e. in a monotonous non-prosodic manner). Words were then compiled into complete sentences using the Audacity Software (Version 3.0.0. Audacity® software is copyright © 1999 - 2021 Audacity Team. https://audacityteam.org/). These artificially compiled sentences were manually adjusted to have similar pitch frequency between each word within a sentence and between all sentences. Concurrently, time gaps were added between words so that all sentences would have similar pacing and so that researchers would be able to trigger discretely on each word. Both the artificial timing and the frequency adjustments add to the robotic nature of the stimuli. A future initiative will add the prosodic versions of these sentences to this public stimulus set so that researchers can explore more communicative aspects of language.

2.3 Procedure

Participants were fitted with a 128-electrode cap (BioSemi B.V. Amsterdam, the Netherlands) and seated in a sound-attenuating, electrically shielded booth (Industrial Acoustics Company, The Bronx, NY) with a computer monitor (Acer Predator Z35 Curved HD, Acer Inc.) and a standard keyboard (Dell Inc.). The task was created with Presentation® Software (Version 18.0, Neurobehavioral Systems, Inc. Berkeley, CA). The task was first explained to the participant during the consent process and then again before the experimental session. Individuals were asked to refrain from excessive movement and to focus on a fixation cross throughout the task in order to reduce movement artifacts. The experimental session began by explaining the task for a third time. All instructions were presented both visually on the screen and auditorily through the headphones (Sennheiser Electronic GmbH & Co. KG, USA). Instructions were followed by two practice trials, which were the same for every participant. Feedback was given about a participant’s response only during practice trials and not during experimental trials. Trials were presented as follows: 1. A fixation cross was on the screen while an auditory sentence stimulus was presented through headphones; 2. A two-second pause; and 3. A question (presented both visually and auditorily) asked the participant if the sentence ended as expected, and subjects responded with the right or left arrow key when sentences ended as expected (congruent) or unexpectedly (incongruent), respectively, to end the trial. A two-second delay was inserted between a subject’s response and the start of the next sentence. A total of 442 stimuli were presented to participants in the same order. This was done to ensure that every participant had the same experience throughout the task for every sentence. Two of these stimuli (one congruent and one incongruent) were used for example trials. The remaining 440 were used for the experiment. Stimuli were separated into 11 blocks with optional breaks between each block. Participants could continue onto the next block by pressing the spacebar. After elimination, the responses to 400 out of the 440 stimuli contributed to the analysis of this experiment.

2.4 Data preprocessing

Data were digitized online at a rate of 512 Hz, with a DC to 150 Hz pass-band, and referenced to the common mode sense (CMS) active electrode. EEG data were preprocessed and analyzed offline using in-house scripts leveraging EEGLAB functions (Delorme & Makeig, 2004). Data were filtered using a Chebyshev II spectral filter with a bandpass of 0.1 - 45 Hz. Channels were rejected automatically, and interpolated using EEGLAB spherical interpolation, if the recorded data from an electrode deviated by more than 3 standard deviations from the mean variance and amplitude of all electrodes. Data were then re-referenced to the common average. Prior to analysis, the time from the beginning of a sentence to the onset of the last word was measured for each stimulus using Praat (PRAAT v. 6.1, University of Amsterdam, the Netherlands). These measures were used to adjust the time stamp of each stimulus, so that the data could be aligned to the onset of the last word (i.e. 0 ms), rather than the beginning of the sentence. For all participants, epochs from -200 to 1000 ms were created using a baseline of the 200 ms interval before the onset of the last word. Trials were rejected automatically based on an artifact rejection threshold of 250 mV and if a trial contained amplitudes greater than two standard deviations from the mean amplitude across all channels. Grand average ERP waveforms were generated by first averaging the trials per condition per electrode for each participant, and then averaging across participants.
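
The epoching and baseline step can be illustrated for a single channel with the Python sketch below. It is not the authors' EEGLAB pipeline: the function name is hypothetical, the signal is a plain list of samples, the epoch excludes the sample at +1000 ms, and rounding the 200 ms baseline to 102 samples at 512 Hz is an assumption of this sketch.

```python
from statistics import fmean

SAMPLE_RATE_HZ = 512  # stated digitization rate

def epoch_around_target(signal, target_sample,
                        sample_rate_hz=SAMPLE_RATE_HZ,
                        tmin_s=-0.2, tmax_s=1.0):
    """Cut a -200 to 1000 ms epoch around the target-word onset and
    subtract the mean of the 200 ms pre-onset baseline."""
    start = target_sample + round(tmin_s * sample_rate_hz)
    stop = target_sample + round(tmax_s * sample_rate_hz)
    if start < 0 or stop > len(signal):
        raise ValueError("epoch window extends beyond the recording")

    segment = signal[start:stop]
    n_baseline = target_sample - start          # samples before 0 ms
    baseline = fmean(segment[:n_baseline])      # mean pre-onset amplitude
    return [value - baseline for value in segment]
```

At 512 Hz this yields 614 samples per epoch (102 before and 512 after the target-word onset).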

2.5 Statistical analysis

JASP (Jeffreys’s Amazing Statistics Program; JASP Team [2020], Version 0.12.2) was used for statistical analyses. Three midline electrodes (Fz, Cz, and Pz) were chosen a priori for investigation (Lau et al., 2008; Luck, 2005; Osterhout and Holcomb, 1992). Other electrodes (F7, T7, and P7) were investigated post hoc. For every participant, these selected electrodes were assessed for an effect of condition using a repeated-measures ANOVA (rmANOVA) at four time points of interest (250 ms, 400 ms, 600 ms, and 700 ms). Amplitude values for these electrodes were acquired by averaging the amplitudes across a 10 ms time window centered at the time point of interest. Additional rmANOVAs were conducted to assess for a main effect of CP, order, linguistic division, and time. F-scores and p-values for the main effect of condition at the midline electrodes are shown in Table 2. Other main effects at the midline electrodes are shown in Table 3. All main effects for electrodes F7, T7, and P7 are in Table 4.
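
The amplitude measure used for the rmANOVAs, the mean over a 10 ms window centered on a time point of interest, can be sketched in Python as follows. The function name and the inclusive window edges are assumptions made for illustration; at 512 Hz a 10 ms window spans roughly five samples.

```python
from statistics import fmean

def window_mean(epoch, times_ms, center_ms, width_ms=10.0):
    """Average the amplitudes within width_ms centered on center_ms.

    epoch    : amplitudes of one averaged waveform at one electrode
    times_ms : time of each sample relative to target-word onset (ms)
    """
    half = width_ms / 2.0
    # Keep samples whose time stamp lies inside the window (edges inclusive).
    values = [v for v, t in zip(epoch, times_ms)
              if abs(t - center_ms) <= half]
    if not values:
        raise ValueError("no samples fall inside the requested window")
    return fmean(values)
```

Calling `window_mean(erp, times_ms, 400)` on a participant's averaged waveform would give the value entered into the analysis for the 400 ms time point, where `erp` and `times_ms` are placeholders for the user's own data.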

Topography plot statistics were generated using the FieldTrip toolbox (Oostenveld et al., 2011) for MATLAB and displayed using the EEGLAB toolbox. A group-level cluster-based permutation test was conducted using two-tailed, independent sample t-statistics with a critical alpha level of 0.05. This test applied the Monte Carlo method to estimate significance probability, the triangulation method of the neighbours function for spatial clustering, and a multiple-comparison correction. Single sample clusters were combined using “maxsum” and a 5% two-sided cutoff criterion was applied to both positive and negative clusters. Topography statistics are presented as the average significance over a 10 ms time window centered at the time point of interest (Figure S2).

2.6 Cloze probability

To further characterize the stimulus set, a RedCap survey was employed to test the CP of all sentence endings. Each sentence in the set was presented with the final word missing (blank) and participants were required to fill in this blank with the first singular word that came to mind. If participants could not think of an answer, they were encouraged to guess rather than leave a blank. Non-answers were not counted towards CP scores; participants were removed from the survey data if they answered fewer than 10 questions out of the 221; and participants were removed if their percent correct was three standard deviations from the mean. After elimination, the survey used the responses from 134 individuals to assess the CP of each sentence. The majority of these stimuli had greater than 80% CP. The CP distribution of sentences is shown in Figure S3.

2.7 Data Availability

This stimulus set (https://doi.org/10.5061/dryad.9ghx3ffkg) and the supporting datasets (https://doi.org/10.5061/dryad.6wwpzgmx4) are available through Dryad for the scientific community to use freely in their experiments. The stimulus set provides the auditory files for all 442 stimuli and a stimulus parameter file that includes stimulus information such as duration, target word onset, derivative divisions (i.e. CP, order, linguistic error, and time), and most importantly, the written form of each stimulus so that semantic comprehension via reading can be investigated. Cloze probability survey answers and result files are also within the stimulus set download.

The dataset download provides the 24 datasets in BIDS format via guidelines provided by (Pernet et al., 2019) and all the aforementioned stimulus set files. Participant information is also detailed in the dataset (.tsv and .json). The full dataset includes unfiltered EEG data (.bdf), corresponding event files, and channel rejection files for each participant (.tsv), as well as recording information, electrode positioning, and event file information (.tsv and/or .json). We additionally provide preprocessed ERP derivatives for this study (.mat), the corresponding trial rejection information per derivative for each participant (.tsv), and filtering parameters (.json). Refer to the README.txt files in both the dataset and stimulus set in order to use them appropriately. Use of this dataset, stimulus set, or presenting examples from this stimulus set should include a citation to this paper.

2.8 Code Availability

The code generated for the analysis of these datasets as well as the Presentation® code is available through Zenodo via Dryad (https://doi.org/10.5061/dryad.6wwpzgmx4). The provided code was utilized to create the preprocessed ERP derivatives as well as figure components.

Usage Notes

If using this stimulus set for your research or in examples, please cite the root paper as a reference (Toffolo, Kathryn; Freedman, Edward; Foxe, John (2022), Evoking the N400 Event-Related Potential (ERP) Component Using a Publicly Available Novel Set of Sentences with Semantically Incongruent or Congruent Eggplants (Endings)). If you are looking for the full dataset of this paper, you can find it here: https://doi.org/10.5061/dryad.6wwpzgmx4.