Evoking the N400 Event-Related Potential (ERP) component using a publicly available novel set of sentences with semantically incongruent or congruent eggplants (endings)
Data files
Oct 13, 2022 version (24.37 GB): N400-EEG.zip (24.37 GB); README.txt (29.42 KB)
Dec 14, 2022 version (24.37 GB): N400Stimset-EEG.zip (24.37 GB); README.txt (29.42 KB)
Jan 13, 2023 version (24.37 GB): N400Stimset-EEG.zip (24.37 GB); README.txt (29.42 KB)
Jun 13, 2024 version (24.37 GB): N400Stimset_manuscriptdata.zip (24.37 GB); README.txt (29.42 KB)
Oct 01, 2024 version (24.37 GB): N400Stimset_manuscriptdata.zip (24.37 GB); README (32.34 KB); README.md (30.09 KB)
Abstract
During speech comprehension, the ongoing context of a sentence is used to predict the sentence's outcome by limiting the likelihood of subsequent words. Neurophysiologically, violations of context-dependent predictions result in amplitude modulations of the N400 event-related potential (ERP) component. While the N400 is widely used to measure semantic processing and integration, no publicly available auditory stimulus set exists to standardize approaches across the field. Here, we developed an auditory stimulus set of 442 sentences that utilized the semantic anomaly paradigm, provided cloze probability for all stimuli, and was developed for both children and adults. With 20 neurotypical adults, we validated that this set elicits robust N400s, as well as two additional semantically related ERP components: the recognition potential (~250 ms) and the late positivity component (~600 ms). This stimulus set (https://doi.org/10.5061/dryad.9ghx3ffkg) and the 20 high-density (128-channel) electrophysiological datasets (https://doi.org/10.5061/dryad.6wwpzgmx4) are made publicly available to promote data sharing and reuse. Future studies that use this stimulus set to investigate sentential semantic comprehension in both control and clinical populations may benefit from the increased comparability and reproducibility within this field of research.
Dataset DOI: https://doi.org/10.5061/dryad.6wwpzgmx4
Description of the data and file structure
Citation
Toffolo, K.K., Freedman, E.G., Foxe, J.J. Evoking the N400 Event-related Potential (ERP) Component Using a Publicly Available Novel Set of Sentences with Semantically Incongruent or Congruent Eggplants (Endings). Neuroscience, 501 (2022): 143-158. https://doi.org/10.1016/j.neuroscience.2022.07.030
Project name and executive summary
Title:
Evoking the N400 Event-Related Potential (ERP) Component Using a Publicly Available Novel Set of Sentences with Semantically Incongruent or Congruent Eggplants (Endings)
Task Description:
The task was first explained to the participant during the consent process and then again before the experimental session. Individuals were asked to refrain from excessive movement and to focus on a fixation cross throughout the task in order to reduce movement artifacts. The experimental session began by explaining the task for a third time. All instructions were presented both visually on the screen and auditorily through the headphones. Instructions were followed by two practice trials, which were the same for every participant. Feedback about a participant's response was given only during the practice trials (2 example stimuli) and not during experimental trials. Trials were presented as follows: 1. A fixation cross was on the screen while an auditory sentence stimulus was presented through headphones; 2. A two-second pause; and 3. A question (presented both visually and auditorily) asked the participant whether the sentence ended as expected; subjects responded with a right or left arrow key when sentences ended as expected (congruent) or unexpectedly (incongruent), respectively, to end the trial. A two-second delay was inserted between a subject's response and the start of the next sentence. During the experiment, a total of 440 stimuli were presented to participants in the same order. This was done to ensure that every participant had the same experience throughout the task for every sentence. Stimuli were separated into 11 blocks with optional breaks between each block. Participants could continue to the next block by pressing the spacebar.
Contact information regarding analyses
First Author: Kathryn Toffolo
University Email: kathryn_toffolo@urmc.rochester.edu
Unaffiliated Email: kattoffolo@gmail.com
ORCID iD: orcid.org/0000-0002-5728-3174
LinkedIn: Kathryn Toffolo
Sharing/Access Information
Raw file access:
There are many ways to open .bdf files (e.g., BESA, FieldTrip via MATLAB, or converting .bdf to .edf to use BrainVision), but our lab accesses and analyzes the data with EEGLAB via MATLAB. First, download MATLAB and then EEGLAB. Put the EEGLAB folder within the "code" folder for easy access, and then put the "code" folder within the dataset folder (i.e., "N400-EEG"). The "code" folder contains the preprocessing code used to make the ERPs for each derivative and the presentation code used for task administration. We recommend that researchers use their own code to process the data. However, to open the .bdf files in MATLAB, you can use lines 1-16 of the "EEGMaster" file. When using this code, first add the EEGLAB folder to your MATLAB path and replace the file location at line 10 with the location of this dataset. You may now run lines 4-16. A window will pop up in which you select the folder containing the EEG data (i.e., "N400-EEG"). The bdf2mat script will run through all the subjects and output a .mat file within each subject folder. Each subject's .mat file can be loaded into MATLAB by dragging and dropping, clicking on the file within MATLAB, or using the function "load". The variable "EEG" will appear in the workspace as a struct; EEG.data contains the time-series data per channel.
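For orientation, below is a minimal MATLAB sketch of the same conversion for a single subject, assuming EEGLAB (with its BIOSIG import extension) has been downloaded into the "code" folder as described above; the file paths follow the BIDS layout described later in this README and should be adjusted to your local copy.

    % Minimal single-subject sketch; assumes EEGLAB + BIOSIG extension are installed.
    addpath(genpath(fullfile('code', 'eeglab')));   % put EEGLAB and its plugins on the path
    bdfFile = fullfile('N400-EEG', 'sub-01', 'eeg', 'sub-01_task-N400Stimset_eeg.bdf');
    EEG = pop_biosig(bdfFile);                      % import the BioSemi .bdf recording
    % EEG is a struct: EEG.data is channels x timepoints; EEG.srate should read 512 (Hz)
    save(fullfile('N400-EEG', 'sub-01', 'sub-01_eeg.mat'), 'EEG');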
All analysis code is written in MATLAB and therefore requires MATLAB to run. These files can be opened and edited with a standard text editor, but cannot be run without MATLAB. The stimulus presentation code requires "Presentation" by Neurobehavioral Systems, Inc. All .tsv and .json files can be read with a standard text editor.
Software used:
Audacity® software --> Used for sentence compilation and pitch/pace adjustments
Praat (PRAAT v. 6.1, University of Amsterdam, the Netherlands) --> Used to measure the timing of the sentences and target words
JASP (JASP Team [2020], Version 0.12.2) --> Statistical analysis
MATLAB (MathWorks Inc., Natick, MA) --> EEG preprocessing and analysis
EEGLAB (Delorme & Makeig, 2004) --> EEG preprocessing and analysis
FieldTrip toolbox (Oostenveld et al. 2011) --> Topography statistical plots
Presentation® Software (Version 18.0, Neurobehavioral Systems, Inc. Berkeley, CA) --> Presenting stimuli to participants
Descriptions of Updates:
6/13/24- Although onsets were previously individually determined, in response to a reviewer's inquiry we performed an additional inter-rater reliability assessment to ensure that onsets were indeed accurately registered. Three additional listeners were enlisted to independently record final-word onsets. Listeners were trained to use the program Praat and were each assigned ~100 random stimuli (not including the "LD5_Eliminated" stimuli) for timing. Afterwards, we compared the previously recorded onset times with the times recorded by the three listeners. Out of 302 stimuli, we noticed only 6 outlier mistakes. The first author then reviewed the remaining stimuli of this set, including the "LD5_Eliminated" stimuli (118 stimuli), and found 4 outlier mistakes. These 10 outlier mistakes are now corrected on the "stimuli_parameter" sheet provided with the original stimulus set. Additionally, minor changes (onset discrepancies between 8 ms and 20 ms) were made to the "stimuli_parameter" sheet for 9 stimuli. Excluding outliers but including minor changes, the average difference between the previously recorded onsets and the three independent reviewers was 1.1 ± 1.8 ms. As such, we are highly confident in the onset determinations.
These outlier onset differences do not affect the collected data, because they are parameters needed only for analysis. The one sentence that was incorrectly presented to subjects has been corrected and uploaded, but it will not affect the grand-average ERP given the number of stimuli presented (>400). For the same reason, these outlier onset differences do not affect the analysis of the data (i.e., the number of stimuli presented is very large (>400) relative to the number of outliers). Minor changes (onset discrepancies between 8 ms and 20 ms) are even less influential on the analysis. Therefore, the analysis in this manuscript was not adjusted.
Below are the specific changes that were made:
Outlier mistakes:
(LD5_Eliminated)NPC_chain- ~805.1ms change
(LD5_Eliminated)NPC_warm- ~73.1ms change
(LD5_Eliminated)NPI_eggs(brake)- ~638.7ms change
NPC_blue- ~90.0ms change
NPC_cream- ~73.5ms change
NPC_gloves- ~76.1ms change
NPC_spider- ~105.1ms change
NPI_clown(dice)- ~87.9ms change (Incorrect Sentence. Correct stimulus has been uploaded.)
NPI_brakes(cob)- ~38.6ms change
NPI_out(ant)- ~22.3ms change
Minor Changes:
(LD5_Eliminated)NPI_coal(chain)- ~16.6ms change
(LD5_Eliminated)NPI_fried(mom)- ~14.7ms change
NPC_bath- ~17.0ms change
NPC_bright- ~14.5ms change
NPC_gum- ~12.6ms change
NPI_bath(roar)- ~14.9ms change
NPI_cough(friend)- ~13.9ms change
NPI_dancers(pirate)- ~8.6ms change
NPI_gum(coat)- ~11.9ms change
Description of file(s):
Stimulus Set
"stimuli":
This stimulus set includes 221 sentence pairs, for a total of 442 auditory sentence stimuli. These sentences are mono, not stereo, and are in .wav format. The congruent member of each pair is denoted by the prefix "NPC_" for "non-prosodic congruent" and the incongruent member by the prefix "NPI_" for "non-prosodic incongruent". Note that the incongruent file names show the congruent ending first, followed by the incongruent ending in parentheses (e.g., NPC_bake vs. NPI_bake(milk)). Additionally, there are 20 sentence pairs with the prefix "(LD5_Eliminated)". These stimuli were removed from final analysis because: 1. The endings could make sense to children (e.g., in a universe where animals are personified, such as in cartoons, the incongruent ending of "My dog digs holes/spring" could make sense in a context where the verb "digs" is slang for "likes"); 2. The endings did not match in syllable number; 3. The endings were hyphenated phrases; or 4. The endings contained cultural references. The specific linguistic division information for all stimuli is provided in the TSV file "N400Stimset_stimuli_parameters". Although eliminated from our final analysis, all corresponding timing and divisions for these stimuli are provided in this TSV file in case future studies would like to use them. Furthermore, if researchers would like to analyze the responses to these stimuli, the responses to all stimuli are in the raw EEG files for each participant.
For the sentence stimuli, individual words from the word list (in the "N400Stimset_stimuli_parameters" TSV file) were recorded from a female speaker, who was instructed to voice words with minimal inflection, stress, or intonation (i.e., in a monotonous, non-prosodic manner). Words were then compiled into complete sentences using the Audacity® software. These artificially compiled sentences were adjusted to have similar pitch frequency and pacing across each word within a sentence, and between all sentences.
In addition to the stimulus set, 15 task-related sentences are included. These audio files were recorded from a female speaker, who was instructed to voice the sentences as she would when talking to a young adult. "(Audio_01)DidThisSentenceEndCorrectly" follows each sentence stimulus and is played so the participant knows it is time to respond. "(Audio_02)DoYouWantToTakeABreak" is played at the onset of each break period. "(Audio_03)Congratulations" is played at the end of the task. Audio files with the prefixes "(Intro_01)"-"(Intro_05)" are descriptions of the task and should be played in order. Audio files with the prefixes "(Intro_06)"-"(Intro_11)" are for the practice example section. This includes audio introducing the practice session "(Intro_06)Practice_Intro", audio for between examples "(Intro_07)Practice_LetsTryAnotherOne", and 4 possible responses depending on how the subject answers: 1. If the subject answered correctly for the congruent example, "(Intro_08)PracticeFeedback_Correct4Congruent"; 2. Correctly for the incongruent example, "(Intro_09)PracticeFeedback_Correct4Incongruent"; 3. Incorrectly for the incongruent example, "(Intro_10)PracticeFeedback_Incorrect4Incongruent"; and 4. Incorrectly for the congruent example, "(Intro_11)PracticeFeedback_Incorrect4Congruent". Lastly, "(Intro_12)Practice_LetsStartTheTask" lets the participant know that the task is starting.
"N400Stimset_stimuli_parameters":
The use of this .tsv file was imperative to this study, and as such it is described below. The triggers in the raw EEG data mark when each stimulus began, not the onset of the target word. To time-lock analyses to the target word, the value in the "target_onset" column of this file was added to the onset of each trigger in the EEG data; before it was added, "target_onset(s)" was converted to datapoints by multiplying by the sample rate (512). A sketch of this conversion follows the column definitions below. This file can also be used to see what is said in each sentence. The top half of the file contains the sentences with congruent endings, while the bottom half contains the sentences with incongruent endings. The first 3 columns are the order in which a stimulus was played, the stimulus key, and the stimulus file name, so that each sentence can be matched to an audio file. Following these are the sentences separated into individual words. This file may be useful for N400 investigations that want visual presentation of the stimuli in addition to, or instead of, auditory presentation.
"stim_key"- is the order number of the stimulus representing when in the dataset the stimulus was presented (1st sound, 2nd sound, ... etc.).
"stim_file"- is the audio file name for the presented stimulus.
"1"- is the first word in the stimulus (sentence).
"2"- is the second word in the stimulus (sentence).
"3"- is the third word in the stimulus (sentence).
"4"- is the fourth word in the stimulus (sentence).
"5"- is the fifth word in the stimulus (sentence).
"6"- is the sixth word in the stimulus (sentence).
"7-" is the seventh word in the stimulus (sentence).
"8"- is the eighth word in the stimulus (sentence).
"stim_dur(s)"- is the entire length of each stimulus in seconds.
"target_onset(s)"- is the time from the beginning of the sentence (the raw trigger time in the raw eeg data) to the START of the target/ending word in seconds.
"target_end(s)"- is the time from the beginning of the sentence (where the trigger was placed) to the END of the target/ending word in seconds.
"target_dur(s)"- is the time between the START and the END of the target/ending word in seconds.
"time-quarter_div"- is the division to investigate for an effect of time. The stimuli are broken into 4 groups (1-4) by quarter. Because the stimuli in this file are in the order that they were presented to each participant, this column will be in order from 1-4.
"order-group_div"- is the division to investigate for an effect of order. The stimuli are broken into 4 groups: 1. Is congruent stimuli for the scenario in which the congruent stimulus pair was presented before the incongruent stimulus pair; 2. Is incongruent stimuli for the scenario in which the congruent stimulus pair is presented before the incongruent stimulus pair; 3. Is congruent stimuli for the scenario in which the incongruent stimulus pair was presented before the congruent stimulus pair; 4. Is incongruent stimuli for the scenerio in which the incongruent stimulus pair was presented before the congruent stimulus pair.
"cloze-probability%_div"- are the cloze probability (CP) scores for each sentence. To investigate for an effect of CP, these scores were broken into 4 different groups: 1. Sentence pairs with CP greater of equal to 96%; 2. Sentence pairs with greater than or equal to 90% CP and less than 96% CP; 3. Sentence pairs with greater than or equal to 80% CP and less than 90% CP; 4. Sentence pairs with less than 80% CP.
"linguistic-group_div"- is the division to investigate for an effect of linguistic error. The stimuli are broken into 5 groups: 1. Incongruent sentence that contain only semantic ending errors, along with their congruent pair; 2. Incongruent sentences with both semantic errors and a syntactic number error, along with their congruent pair; 3. Incongruent sentences with both semantic errors and syntactic adjective/noun errors, along with their congruent pair; 4. Incongruent sentence stimuli with both semantic errors and syntactic verb/noun ending errors, along with their congruent pair; and 5. Eliminated linguistic division group contained 19 sentence pairs of which the endings could make sense to children, did not match in syllable number, were hyphenated phrases, or contained cultural references. Final analysis combined groups 2-4 into one larger group of semantic and syntactic error sentence pairs in order to contrast with sentence pairs containing just semantic errors.
"linguistic-group_reasoning"- are quick explanations/descriptions for why a stimulus was placed in a particular linguistic group.
"N400Stimset_cloze-probability-survey"
This file includes the REDCap survey answers used to establish the cloze probability for each sentence in this experiment. In this survey, each sentence in the stimulus set was presented with the final word missing (blank), and participants were required to fill in the blank with the first single word that came to mind. If participants could not think of an answer, they were encouraged to guess rather than leave a blank. Non-answers were not counted towards CP scores; participants were removed from the survey data if they answered fewer than 10 of the 221 questions; and participants were removed if they scored more than 3 standard deviations from the mean. Given these elimination criteria, ID numbers 54, 119, and 135 were eliminated, and the survey used the responses from 134 individuals to assess the CP of each sentence.
The rows represent each participant and the columns represent their answers. The first columns include the participant #, their 1st language, other languages they know, their age, their gender, and their percent correct across all questions. Following these are their individual answers to each fill-in-the-blank question. Below the answers to this survey is the answer key to each fill-in-the-blank question (i.e., the intended congruent ending for each sentence). The JSON file "N400Stimset_cloze-probability-survey" provides more information about this dataset, including the average percent correct, the standard deviation of the percent correct, and the mean age of participants.
"N400Stimset_cloze-probability-survey_results"
This file is the final assessment of each fill-in-the-blank question after participant elimination. "#" is the order number of each fill-in-the-blank question; this corresponds to the number at the top of "N400Stimset_cloze-probability-survey_answers". "total_answers" is the number of participants that answered the particular question, while "correct_answers" is the number of participants that answered correctly. "cloze_probability" is the cloze probability score for each sentence (i.e., correct_answers/total_answers). This is followed by the sentential context given to the participants for each fill-in-the-blank question ("sentential_context") and the intended correct answer for the congruent sentence ("key").
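A minimal MATLAB sketch of that computation (assuming the file is read as a tab-separated table; the import call and extension may need adjusting to your local copy):

    res = readtable('N400Stimset_cloze-probability-survey_results.tsv', ...
        'FileType', 'text', 'Delimiter', '\t');
    clozeProbability = res.correct_answers ./ res.total_answers;   % per-sentence CP
    group1 = res(clozeProbability >= 0.96, :);                     % e.g., the >=96% CP division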
Complete Dataset:
Data for the 24 subjects are provided in EEG BIDS format. Raw unfiltered data are in 24 subject folders (sub-##). Filtered ERP data for each division (CP, LD, etc.) of each subject (sub-##) can be found in the "erps" folder of "derivatives". The code used to create these ERP data is provided in the "code" folder.
"dataset_description"- (JSON) Description of the dataset.
"License"- License for the datset
"Participants"- (JSON and TSV) Describes the 24 subjects: ID, gender, age, dominant hand, first language, number of other known languages, whether they were included in the final dataset, and reason for exclusion if excluded.
"task-N400Stimset_eeg"- (JSON) Information about aquiring the raw eeg data.
"task-N400Stimset_electrodes"- (TSV) Information about electrode location.
"task-N400Stimset_events"- (JSON) Information about the event file that is within each subjects raw data folder.
"sub-##":
The "eeg" folder of each participant contains the raw EEG (BDF) data ("sub-##_task-N400Stimset_eeg"), info (TSV) about channel rejection ("sub-##_task-N400Stimset_channels"), as well as an event file (TSV) for each participant ("sub-##_task-N400Stimset_events").
Event files start with the "onset" and "duration" of the target word within a stimulus (the final word of the sentence). For example, in the sentence "I baked a birthday cake/clue", the congruent target word would be "cake" and the incongruent target word would be "clue". The onset and duration of these target words are provided in the first 2 columns, respectively. There are also "stim_onset" and "stim_dur" columns, in seconds: "stim_onset" is the onset of the entire stimulus, essentially the trigger time at the beginning of the sentence, and "stim_dur" is the duration of the entire stimulus (sentence). "type" includes information about how the stimulus was presented: auditorily only, both auditorily and written on the screen, or whether the trigger onset was simply the participant responding to the question. "trial_type" is the primary experimental classification. This column indicates whether a stimulus was an introduction, feedback, an example stimulus, a break, or a right/left arrow key press by the participant, and whether the experimental stimulus was congruent (NPC) or incongruent (NPI). Lastly, "stim_file" provides the name of the stimulus file that was played.
Due to a user error with the presentation software while recording, the event files for sub-01 and sub-02 are slightly incomplete relative to other participants (i.e., they are missing the intro file ("intro#") onsets). That said, the event times of the task stimuli ("sound##") and the participants' responses ("left" or "right" arrow key press) are intact and accurate, and thus their data can be analyzed the same as other participants'.
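A minimal MATLAB sketch of pulling the experimental sentence trials out of an event file (the NPC/NPI matching is an assumption based on the trial_type labels described above):

    ev = readtable('sub-01_task-N400Stimset_events.tsv', 'FileType', 'text', 'Delimiter', '\t');
    isSentence  = contains(ev.trial_type, 'NPC') | contains(ev.trial_type, 'NPI');
    isCongruent = contains(ev.trial_type, 'NPC');    % NPC = congruent ending
    targetOnsetSec = ev.onset(isSentence);           % target-word onsets, in seconds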
"derivatives":
If you decide to use the preprocessed ERP data, which were created with EEGLAB (Delorme & Makeig, 2004) functions for MATLAB (MathWorks Inc., Natick, MA), read the following. Within the "erps" folder are the 24 subjects (sub-##), the filtering parameters (JSON) applied during preprocessing ("task-N400Stimset_erp-filters"), and 5 files (JSON) describing trial rejection per division (GA, CP, LD, order, and time), which correspond with the 5 trial rejection (TSV) files in each subject folder. Within each included subject folder (sub-##) are 5 preprocessed MAT files (sub-##_task-N400Stimset_erp-X) containing epoched ERP data, as well as their corresponding trial rejection files (TSV). These trial rejection files report how many trials were accepted into the final analysis of that division relative to the total trials in each condition. Note: sub-05, sub-10, sub-15, and sub-18 only have the grand average division because they were eliminated from final analysis.

Each preprocessed MAT file contains 3 variables: 1. "fs", the sampling rate (i.e., 512); 2. "t", the total time of the epoch in datapoints (i.e., 615). Each epoch spans from 200 ms before to 1000 ms after target-word onset (to convert ms to datapoints, use ms*fs/1000); and 3. "ERPs", the filtered and epoched ERP data. "ERPs" is made up of cells containing structs for each condition (e.g., for "GA" there are 2 cells, one containing ERP data in response to congruent stimuli and the other in response to incongruent stimuli). The corresponding trial rejection file for each division indicates which condition is in each cell. The condition structs have several fields constructed by the pop_epoch function of EEGLAB. The most useful fields are: 1. "data", organized as 128 electrodes * 615 datapoints * the number of trials accepted for that condition for that subject. To look at a condition, select an electrode (1-128) and average across trials before averaging across subjects; 2. "chanlocs", which contains the location information and labels for each of the 128 electrodes; and 3. "event", which contains the accepted trials used for this condition. The "type" field within "event" holds a number in the thousands corresponding to the condition. This number depends on the divisions within each preprocessed ERP group described below, but one can also refer to the corresponding trial rejection file.
Preprocessed ERP Divisions-
"sub-##_task-N400Stimset_erp-GA": This is the grand average of the 2 main conditions "congruent" vs. "Incongruent". Within the ERPs file, there will be 2 cells, one for each condition. Cell 1 is for the congruent condition and cell 2 is for the incongruent condition. To ensure the proper comparison is being made, look at the field "event" within the ERPs struct. The "type" column of "event" will correspond to the condition, i.e. if "type" contains the number "1000", the struct is for the congruent condition, whereas "2000" is for the incongruent condition. For the grand average compare 1000 vs 2000. (Used for Figure 1-5). To see the trial rejection per condition, refer to the group's TSV file ("sub-##_task-N400Stimset_erp-GA_trialrej").
"sub-##_task-N400Stimset_erp-CP": This is the grand average of the 2 main conditions separated by 4 cloze probability groups: 1. 96%-100% CP (54 sentence pairs); 2. 90%-<96% CP (56 sentences pairs); 3. 80%-<90% CP (46 sentence pairs); 4. <80% CP (43 sentence pairs). The congruent event types are "1000", "2000", "3000", and "4000". Incongruent event types are "5000", "6000", "7000", and "8000". To ensure proper comparisons of conditions based on CP group, compare: 1. "1000" vs. "5000"; 2. "2000" vs. "6000"; 3. "3000" vs. "7000"; and 4. "4000" vs. "8000". (Used for Figure 6 and S.4) To see the trial rejection per condition, refer to the group's TSV file ("sub-##_task-N400Stimset_erp-CP_trialrej").
"sub-##_task-N400Stimset_erp-LD": This is the grand average of the 2 main conditions separated into 2 linguistic groups: 1. Incongruent sentence stimuli with only semantic ending errors, along with their congruent pair (264 Stimuli); and 2. Incongruent sentence stimuli with both semantic and syntactic ending errors, along with their congruent pair (138 Stimuli). It is important to note that these sentences are broken down into 5 linguistic groups noted in the TSV file "N400Stimset_stim_timing-divisions": 1. Incongruent sentences that contain only semantic ending errors, along with their congruent pair (264 Stimuli); 2. Incongruent sentences with both semantic errors and a syntactic number error, along with their congruent pair (54 Stimuli); 3. Incongruent sentences with both semantic errors and syntactic adjective/noun errors, along with their congruent pair (40 Stimuli); 4. Incongruent sentence stimuli with both semantic errors and syntactic verb/noun ending errors, along with their congruent pair (44 Stimuli); and 5. The 20 sentence pairs that were eliminated from the study and not counted in any analysis. Because the sample size per condition was much smaller than the 1st linguistic group, linguistic groups 2-4 were combined for a better comparison between sentence pairs that only contained semantic only errors, and sentence pairs with both semantic and syntactic errors. The congruent event types are "1000" and "2000" for semantic and semantic/syntactic errors respectively, and the incongruent event types are "3000", and "4000". To ensure proper comparisons of conditions based on LD group, compare: 1. "1000" vs. "3000"; and 2. "2000" vs. "4000". (Used for S.7 and S.8) To see the trial rejection per condition, refer to the group's TSV file ("sub-##_task-N400Stimset_erp-LD_trialrej").
"sub-##_task-N400Stimset_erp-Order": This is the grand average of the 2 main conditions separated by the order in which they were received, i.e. we compared the difference between when the congruent stimulus was heard before the incongruent pair, and when the incongruent stimulus was heard before the congruent pair. The congruent event types are "1000" and "3000" for congruent before incongruent and vice versa, and the incongruent event types are "2000", and "4000". To ensure proper comparisons of conditions based on order compare "1000" vs. "2000" for congruent before incongruent, and "3000" vs. "4000" for incongruent before congruent. (Used for S.5 and S.6) To see the trial rejection per condition, refer to the group's TSV file ("sub-##_task-N400Stimset_erp-Order\_trialrej").
"sub-##_task-N400Stimset_erp-Time": This is the grand average of the 2 main conditions separated by quarters of time in the experiment: 1. First quarter of the experiment (51 congruent and 48 incongruent); 2. Second quarter of the experiment (46 congruent and 53 incongruent); 3. Third quarter of the experiment (52 congruent and 50 incongruent); and 4. Last quarter of the experiment (52 congruent and 50 incongruent). The congruent event types are "1000", "2000", "3000", and "4000" for each quarter of the experiment, and the incongruent event types are "5000", "6000", "7000", and "8000". To ensure proper comparisons of conditions based on LD group, compare: 1. "1000" vs. "5000"; 2. "2000" vs. "6000"; 3. "3000" vs. "7000"; and 4. "4000" vs. "8000". (Used for S.8 and S.9) To see the trial rejection per condition, refer the group's TSV file ("sub-##_task-N400Stimset_erp-Time_trialrej").
Note: The stimuli labeled as linguistic group 5 (LD5; refer to the "N400Stimset_stim_timing-divisions" TSV file for groups) were eliminated from all analyses. These were 20 sentence pairs whose endings could make sense to children (for example, in a universe where animals are personified, such as in cartoons, the incongruent ending of "My dog digs holes/spring" could make sense in a context where the verb "digs" is slang for "likes"), did not match in syllable number, were hyphenated phrases, or contained cultural references.
Subjects 05, 10, 15, and 18 were excluded from final analysis for the reasons indicated in their subject folders. The preprocessed data for these subjects include the linguistic group 5 (LD5) stimuli, because the decision to eliminate these individuals was made before those stimuli were removed from the final analysis.
"Code":
Provided in this folder are the preprocessing code used to make the ERPs for each derivative and the presentation code used for task administration. Although we recommend that researchers use their own code to process the data, this code is provided for transparency. The presentation code does not include the 20 stimuli removed from final analysis: although these stimuli were presented to participants, we recommend that researchers not use them in their studies, and thus removed them from the code. They may be added back in at the researcher's discretion. Refer to the "requirements" for basic usage notes and necessary program downloads. Otherwise, the code has a fair amount of annotation.
2.1 Participants
Twenty-four neurotypical adults were recruited and provided written informed consent to participate in this study. Four subjects were excluded from data analysis due to failure to remain alert or to sit still during data collection (n=2), or due to noisy EEG data (n=2) (Figure S1). The remaining participants make up the fully-analyzed dataset. These twenty subjects ranged in age from 18 to 35 (mean age = 25.5 +/- 4.36), nine were female, and three were left-handed. Every participant spoke English as their first language; twelve participants were monolingual, while eight reported being bi- or multilingual. Demographic information for all participants, including those removed from further analysis, is reported in Table 1.
2.2 Stimuli
The semantic anomaly paradigm consisted of 221 sentence pairs with incongruent and congruent endings, for a total of 442 stimuli in this stimulus set. However, twenty sentence pairs were eliminated before analysis because their endings did not match in syllable number, contained hyphenated phrases or cultural references, or, upon closer examination, the supposedly incongruent endings were in fact congruent. The remaining 402 stimuli (201 congruent and 201 incongruent sentences) ranged from four to eight words in length. Sentences were created using simple words derived from a set of age-appropriate words contained in the Medical Research Council Psycholinguistic (MRCP) database (https://websites.psychology.uwa.edu.au/school/mrcdatabase/uwa_mrc.htm). For example, from the word "cake", the congruent sentence "He baked a birthday cake" was created. The words selected from the MRCP database took into consideration the age of acquisition (from a database provided by Gilhooly and Logie, 1980) and/or written word frequency (from a normed written word frequency set; Francis and Kucera, 1967). This was to ensure that each sentence could be readily comprehended by neurotypical (NT) individuals aged 5 years or older. The majority of these age-appropriate words were monosyllabic, but a few basic words with 2 syllables (e.g., present) and 3 syllables (e.g., animal) were included, due to their high written-word frequency or early age of acquisition. Similar to previous studies using the MRCP database (Ross et al., 2011, 2007), we were sensitive to the fact that everyday language use has changed since the creation of these sets, so we carefully selected words that are still in common use.
This stimulus set included sentence pairs whose incongruent endings were semantic errors only. After elimination, this amounts to 132 of the 201 stimulus pairs (including the example stimuli). These incongruent endings were matched to their congruent pairs in word type (e.g., noun or verb) and number (e.g., plural or singular). There were also sentence pairs whose incongruent endings contained both semantic and syntactic errors; after elimination, this amounts to 69 of the 201 stimulus pairs. Semantically incongruent endings were also classified as syntactic errors if the ending deviated from the syntax implied by the sentential context. The types of syntactic errors included: 1. Endings where a plural-noun expectation was violated by a singular noun or vice versa (27 stimulus pairs); 2. Endings where an adjective expectation was violated by a noun or vice versa (20 stimulus pairs); and 3. Endings where a verb expectation was violated by a noun or vice versa (22 stimulus pairs). During final analysis, these 3 types of syntactic errors were combined into a single linguistic division (LD) in order to compare the overall response to sentences with both semantic and syntactic errors against the response to sentences with only semantic errors.
Language comprehension in NT individuals is highly influenced by communicative cues such as prosody, especially when sentence meaning relies on syntactic prosodic cues (Cutler et al., 1997; Dahan, 2015; Frazier et al., 2006; Thorson, 2018). The intention of this manuscript was to create a stimulus set that could be used for all populations equally. Therefore, this stimulus set was constructed without prosody to ensure that NT individuals would not have an advantage in language comprehension over populations known to have difficulty with communication or prosody, such as those with ASD (DePape et al., 2012; Eigsti et al., 2012; Martzoukou et al., 2017; McCann et al., 2007; O'Connor, 2012; Wang and Tsao, 2015) or schizophrenia (Leitman et al., 2007, 2005). To do so, individual words from the word list were recorded from a female speaker instructed to voice the words with minimal inflection, stress, and intonation (i.e., in a monotonous, non-prosodic manner). Words were then compiled into complete sentences using Audacity (Version 3.0.0; Audacity® software is copyright © 1999-2021 Audacity Team, https://www.audacityteam.org/). These artificially-compiled sentences were manually adjusted to have similar pitch frequency between each word within a sentence and between all sentences. Concurrently, time gaps were added between words so that all sentences would have similar pacing and so that researchers would be able to trigger discretely on each word. Both the artificial timing and frequency add to the robotic nature of the stimuli. A future initiative will add prosodic versions of these sentences to this public stimulus set so that researchers can explore more communicative aspects of language.
2.3 Procedure
Participants were fitted with a 128-electrode cap (BioSemi B.V., Amsterdam, the Netherlands) and seated in a sound-attenuating, electrically shielded booth (Industrial Acoustics Company, The Bronx, NY) with a computer monitor (Acer Predator Z35 Curved HD, Acer Inc.) and a standard keyboard (Dell Inc.). The task was created with Presentation® Software (Version 18.0, Neurobehavioral Systems, Inc., Berkeley, CA). The task was first explained to the participant during the consent process and then again before the experimental session. Individuals were asked to refrain from excessive movement and to focus on a fixation cross throughout the task in order to reduce movement artifacts. The experimental session began by explaining the task for a third time. All instructions were presented both visually on the screen and auditorily through the headphones (Sennheiser electronic GmbH & Co. KG, USA). Instructions were followed by two practice trials, which were the same for every participant. Feedback about a participant's response was given only during practice trials and not during experimental trials. Trials were presented as follows: 1. A fixation cross was on the screen while an auditory sentence stimulus was presented through headphones; 2. A two-second pause; and 3. A question (presented both visually and auditorily) asked the participant whether the sentence ended as expected; subjects responded with a right or left arrow key when sentences ended as expected (congruent) or unexpectedly (incongruent), respectively, to end the trial. A two-second delay was inserted between a subject's response and the start of the next sentence. A total of 442 stimuli were presented to participants in the same order. This was done to ensure that every participant had the same experience throughout the task for every sentence. Two of these stimuli (one congruent and one incongruent) were used for the practice trials; the remaining 440 were used for the experiment. Stimuli were separated into 11 blocks with optional breaks between each block. Participants could continue to the next block by pressing the spacebar. After elimination, the responses to 400 of the 440 stimuli contributed to the analysis of this experiment.
2.4 Data preprocessing
Data were digitized online at a rate of 512 Hz (DC to 150 Hz pass-band) and referenced to the common mode sense (CMS) active electrode. EEG data were preprocessed and analyzed offline using in-house scripts leveraging EEGLAB functions (Delorme & Makeig, 2004). Data were filtered using a Chebyshev II spectral filter with a pass-band of 0.1-45 Hz. Channels were rejected automatically if the recorded data from an electrode exceeded 3 standard deviations from the mean variance and amplitude of all electrodes; rejected channels were interpolated using EEGLAB spherical interpolation. Data were then re-referenced to the common average. Prior to analysis, the time from the beginning of a sentence to the onset of the last word was measured for each stimulus using Praat (PRAAT v. 6.1, University of Amsterdam, the Netherlands). These measures were used to adjust the time stamp of each stimulus, so that the data could be aligned to the onset of the last word (i.e., 0 ms) rather than the beginning of the sentence. For all participants, epochs from -200 to 1000 ms were created using a baseline of the 200 ms interval before the onset of the last word. Trials were rejected automatically if they exceeded an artifact rejection threshold of 250 µV or contained amplitudes greater than two standard deviations from the mean amplitude across all channels. Grand average ERP waveforms were generated by first averaging the trials per condition per electrode, and then averaging across participants.
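A sketch of the channel-rejection rule under our reading of it (the in-house scripts in the "code" folder are authoritative; variable names here are illustrative):

    chanVar = var(EEG.data, 0, 2);                         % variance of each channel over time
    chanAmp = mean(abs(EEG.data), 2);                      % mean absolute amplitude per channel
    isBad = abs(chanVar - mean(chanVar)) > 3*std(chanVar) | ...
            abs(chanAmp - mean(chanAmp)) > 3*std(chanAmp); % > 3 SD from the across-channel mean
    EEG = pop_interp(EEG, find(isBad), 'spherical');       % EEGLAB spherical interpolation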
2.5. Statistical analysis
JASP (JASP Team [2020], Version 0.12.2) was used for statistical analyses. Three midline electrodes (Fz, Cz, and Pz) were chosen a priori for investigation (Lau et al., 2008; Luck, 2005; Osterhout and Holcomb, 1992). Other electrodes (F7, T7, and P7) were investigated post hoc. For every participant, these selected electrodes were assessed for an effect of condition using a repeated-measures ANOVA (rmANOVA) at four time points of interest (250 ms, 400 ms, 600 ms, and 700 ms). Amplitude values for these electrodes were acquired by averaging the amplitudes across a 10 ms time window centered at the time point of interest. Additional rmANOVAs were conducted to assess for main effects of CP, order, linguistic division, and time. F-scores and p-values for a main effect of condition at the midline electrodes are shown in Table 2. Other main effects for midline electrodes are shown in Table 3. All main effects for electrodes F7, T7, and P7 are in Table 4.
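For example, the 10 ms window average at 400 ms can be computed from an epoch spanning -200 to 1000 ms as follows (a sketch with illustrative names; erpWave is a placeholder for a 1 x 615 subject-average waveform):

    fs = 512;
    winMs = 400 + [-5 5];                         % 10 ms window centered at 400 ms
    idx = round((winMs + 200) * fs / 1000) + 1;   % convert ms to epoch sample indices
    amp400 = mean(erpWave(idx(1):idx(2)));        % mean amplitude within the window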
Topography plot statistics were generated using the FieldTrip toolbox (Oostenveld et al., 2011) for MATLAB and displayed using the EEGLAB toolbox. A group level cluster-based permutation test was conducted using two-tailed, independent sample t-statistics with a critical alpha-level of 0.05. This test applied the Monte-Carlo method to estimate significance probability, the triangulation method of the neighbours function for spatial clustering, and a multiple-comparison correction. Single sample clusters were combined using “maxsum” and a 5% two-sided cutoff criterion was applied to both positive and negative clusters. Topography statistics are presented as the average significance over a 10 ms time window centered at the time point of interest (Figure S2).
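A hedged sketch of that configuration using standard FieldTrip fields (the timelock structures tlCong/tlIncon, the electrode definition elec, and the randomization count are placeholders the user must supply):

    cfg = [];
    cfg.method           = 'montecarlo';        % Monte-Carlo significance estimation
    cfg.statistic        = 'indepsamplesT';     % two-tailed independent-samples t-statistics
    cfg.correctm         = 'cluster';           % cluster-based multiple-comparison correction
    cfg.clusterstatistic = 'maxsum';            % combine single-sample clusters by "maxsum"
    cfg.clusteralpha     = 0.05;                % critical alpha-level for cluster formation
    cfg.tail             = 0;                   % two-tailed test
    cfg.alpha            = 0.025;               % 5% two-sided cutoff (2.5% per tail)
    cfg.numrandomization = 1000;                % placeholder: count not stated in the text
    nb = []; nb.method = 'triangulation'; nb.elec = elec;
    cfg.neighbours = ft_prepare_neighbours(nb); % triangulation-based spatial clustering
    cfg.design = [ones(1, numel(tlCong)) 2*ones(1, numel(tlIncon))];
    cfg.ivar   = 1;                             % row of cfg.design coding condition
    stat = ft_timelockstatistics(cfg, tlCong{:}, tlIncon{:});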
2.6 Cloze probability
To further characterize the stimulus set, a REDCap survey was employed to test the CP of all sentence endings. Each sentence in the set was presented with the final word missing (blank), and participants were required to fill in the blank with the first single word that came to mind. If participants could not think of an answer, they were encouraged to guess rather than leave a blank. Non-answers were not counted towards CP scores; participants were removed from the survey data if they answered fewer than 10 of the 221 questions; and participants were removed if their percent correct was more than three standard deviations from the mean. After elimination, the survey used the responses from 134 individuals to assess the CP of each sentence. The majority of the stimuli had greater than 80% CP. The CP distribution of the sentences is shown in Figure S3.
2.7 Data Availability
This stimulus set (https://doi.org/10.5061/dryad.9ghx3ffkg) and the supporting datasets (https://doi.org/10.5061/dryad.6wwpzgmx4) are available through Dryad for the scientific community to use freely in their experiments. The stimulus set provides the auditory files for all 442 stimuli and a stimulus parameter file that includes stimulus information such as duration, target word onset, derivative divisions (i.e. CP, order, linguistic error, and time), and most importantly, the written form of each stimulus so that semantic comprehension via reading can be investigated. Cloze probability survey answer and result files are also within the stimulus set download.
The dataset download provides the 24 datasets in BIDS format, following the guidelines of Pernet et al. (2019), and all the aforementioned stimulus set files. Participant information is also detailed in the dataset (.tsv and .json). The full dataset includes unfiltered EEG data (.bdf), corresponding event files, and channel rejection files for each participant (.tsv), as well as recording information, electrode positioning, and event file information (.tsv and/or .json). We additionally provide the preprocessed ERP derivatives for this study (.mat), the corresponding trial rejection information per derivative for each participant (.tsv), and the filtering parameters (.json). Refer to the README.txt files in both the dataset and the stimulus set in order to use them appropriately. Use of this dataset or stimulus set, or presentation of examples from this stimulus set, should include a citation to this paper.
2.8. Code Availability
The code generated for the analysis of these datasets as well as the Presentation® code is available through Zenodo via Dryad (https://doi.org/10.5061/dryad.6wwpzgmx4). The provided code was utilized to create the preprocessed ERP derivatives as well as figure components.
In order to use this dataset appropriately, please refer to the README.txt file, which describes all the folders and files within the N400 BIDS dataset, including the stimulus files. If this dataset is used, please cite the paper and this dataset. If you are looking for just the stimulus set used in this paper, you can find it here: https://doi.org/10.5061/dryad.9ghx3ffkg.