Data from: Neural tracking of syllabic and phonemic time scale
Abstract
Dynamical theories of speech processing propose that the auditory cortex parses acoustic information in parallel at the syllabic and phonemic time scales. A paradigm was developed to independently manipulate both linguistic time scales, and intracranial recordings were acquired from eleven epileptic patients listening to French sentences. Our results indicate that (i) syllabic and phonemic time scales are both reflected in the acoustic spectral flux; (ii) during comprehension, the auditory cortex tracks the syllabic time scale in the theta range, while neural activity in the alpha-beta range phase locks to the phonemic time scale; (iii) these neural dynamics occur simultaneously and share a joint spatial location; (iv) the spectral flux embeds two time scales —in the theta and low-beta ranges— across 17 natural languages. These findings help understand how the human brain extracts acoustic information from the continuous speech signal at multiple time scales simultaneously, a prerequisite for subsequent linguistic processing.
README: The human auditory cortex concurrently tracks syllabic and phonemic time scales via acoustic spectral flux.
Jérémy Giroud, Agnès Trébuchon, Manuel Mercier, Matthew H. Davis, Benjamin Morillon
MRC Cognition and Brain Sciences Unit, University of Cambridge, UK
Aix Marseille Université, INSERM, INS, Institut de Neurosciences des Systèmes, Marseille, France
APHM, Clinical Neurophysiology, Timone Hospital, Marseille, France
Corresponding authors: jeremy.giroud@gmail.com ; bnmorillon@gmail.com
https://doi.org/10.5061/dryad.t4b8gtj9r
Description of the data and file structure
This dataset contains the data needed to reproduce the figures in the manuscript entitled "The human auditory cortex concurrently tracks syllabic and phonemic time scales via acoustic spectral flux."
The dataset contains the following folders that relate to the different analyses performed in the manuscript:
- acoustics: the acoustic representations of the stimuli used in the experiment
- anatomy: the fsaverage subject used to plot sEEG data in fsaverage space
- behaviour: repetition accuracy scores from the participants
- coherence: coherence values obtained between different acoustic representations of the speech signal and both the broadband and high-frequency activity (HFa) signals extracted from the bipolar sEEG recordings
- coherence classification: data to reproduce the SVM classification
- corpus: frequency representations of both the envelope and the spectral flux extracted from the different speech and music corpora
- HFA_broadband: data necessary to reproduce the comparison between the two types of sEEG signal
- RSR: data related to the rate-specific response
- TRF: data from the linear encoding models for each patient
The data are organised according to the different analyses performed, and this organisation is mirrored by the Python code in the GitHub repository at https://github.com/jeremygrd/Neural-tracking-of-syllabic-and-phonemic-time-scale
Methods
The dataset contains intracranial stereotactic electroencephalography (sEEG) data from 11 epileptic patients. Data were acquired at La Timone Hospital in Marseille.
Participants
Stereotactic electroencephalography (sEEG) recordings were obtained from 11 patients (5 women, mean age 32.8 years, range [18-53]) with refractory epilepsy, using intracerebral electrodes implanted as part of the standard presurgical evaluation process (Table S1). Electrode placements were based solely on clinical requirements. All patients were native French speakers. All patients gave informed consent, and the experiment reported here was approved by the Institutional Review Board of the French Institute of Health (IRB00003888).
Audio features extraction
We extracted 10 unidimensional auditory features from the auditory stimuli using the open-source Python library Surfboard.
Spectral flux is a measure of how quickly the power spectrum of a signal is changing, calculated by comparing the power spectrum of one frame against the power spectrum of the previous frame. It is computed as the L2-norm (the Euclidean distance) between the two normalized spectra. Thus, the spectral flux depends neither on overall power (since the spectra are normalized) nor on phase (since only the magnitudes are compared). The spectral flux can be used to determine the timbre of an audio signal.
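As a concrete illustration of this definition, a minimal sketch (the frame and hop lengths are arbitrary choices, not necessarily those used by Surfboard):

```python
import numpy as np
from scipy.signal import stft

def spectral_flux(waveform, sr, frame_len=1024, hop=512):
    """L2 distance between consecutive normalized magnitude spectra."""
    # Short-time Fourier transform -> magnitude spectrogram (n_freqs, n_frames)
    _, _, Z = stft(waveform, fs=sr, nperseg=frame_len, noverlap=frame_len - hop)
    mag = np.abs(Z)
    # Normalize each frame so the measure does not depend on overall power
    mag /= mag.sum(axis=0, keepdims=True) + 1e-12
    # Euclidean distance between each frame and the previous one
    return np.linalg.norm(np.diff(mag, axis=1), axis=0)
```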
Spectral slope describes how quickly the amplitude spectrum of a signal tails off (negative slope) towards the high frequencies, calculated using a linear regression. It estimates the exponent of the 1/f-like distribution of the amplitude spectrum and reflects the pattern of aperiodic power across frequencies.
Spectral centroid is the center of gravity (the barycentre) of the amplitude spectrum. It is computed by treating the spectrum as a distribution whose values are the frequencies and whose probabilities are the normalized amplitudes. It indicates where the center of mass of the spectrum is located.
Spectral spread describes the average deviation of the normalized amplitude spectrum around its centroid. It is commonly associated with the bandwidth of the signal.
Spectral skewness represents the asymmetry of the spectrum around its centroid. It is computed from the third order moment.
RMS describes the average signal amplitude, computed as the square root of the arithmetic mean of the signal squared.
F0 contour represents the changes of the fundamental frequency over time in the course of an utterance.
Mel-frequency cepstral coefficients (MFCC) are a representation of the short-term power spectrum of a sound, based on a linear cosine transform of the log-Mel power spectrum. Frequencies are represented on the Mel scale, which approximates the human auditory system's response more closely than the linearly spaced frequency bands used in the normal spectrum.
Amplitude envelope corresponds to the slow overall amplitude fluctuations of the signal over time. This feature was computed with custom MATLAB scripts developed for speech signal analyses (107). The sound signal was decomposed into 32 narrow frequency bands using a cochlear model, and the absolute value of the Hilbert transform was computed for each of these narrowband signals. The broadband temporal amplitude envelope resulted from the summation of these bandpass-filtered signals.
Envelope derivative corresponds to the positive part of the first derivative of the amplitude envelope.
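The amplitude envelope and its derivative were computed outside Surfboard; a simplified sketch of both is given below (log-spaced Butterworth band-pass filters stand in for the cochlear model of the custom MATLAB code cited above, and the band limits are illustrative choices):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def broadband_envelope(waveform, sr, n_bands=32, fmin=100., fmax=8000.):
    """Simplified broadband amplitude envelope: band-pass filter into 32
    narrow bands, take the band-wise Hilbert envelope, and sum.
    (Butterworth filters approximate the cochlear filterbank used in the paper.)
    """
    edges = np.geomspace(fmin, fmax, n_bands + 1)           # log-spaced band edges
    env = np.zeros(len(waveform))
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        env += np.abs(hilbert(sosfiltfilt(sos, waveform)))  # band-wise envelope
    return env

def envelope_derivative(envelope, sr):
    """Positive part of the first derivative of the amplitude envelope."""
    return np.clip(np.diff(envelope, prepend=envelope[0]) * sr, 0., None)
```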
All 10 features were computed from the audio waveforms sampled at 44.1 kHz. This resulted in time series of equal duration to the original audio waveforms. They were downsampled to 512 Hz to reduce computational load. Then, a spectral decomposition was performed by applying a Morlet wavelet transform to the acoustic time series using the MNE-Python function time_frequency.tfr_morlet (n_cycles=5.5). This approach was used to ensure that the spectral decompositions were similar between the acoustic and neural time series (see below). Finally, the resulting time-frequency representations were averaged over time to obtain the power spectrum of the signal between 1 and 50 Hz (linear scale, 1 Hz resolution).
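A minimal sketch of this step, using the array interface mne.time_frequency.tfr_array_morlet (the text cites tfr_morlet; the n_cycles value and the 1-50 Hz grid follow the text, the remaining choices are ours):

```python
import numpy as np
from mne.time_frequency import tfr_array_morlet

def feature_power_spectrum(feature_ts, sfreq=512.):
    """Morlet decomposition of one acoustic feature time series, averaged
    over time to yield its 1-50 Hz power spectrum (1 Hz resolution)."""
    freqs = np.arange(1., 51.)                   # linear scale, 1 Hz steps
    data = feature_ts[np.newaxis, np.newaxis]    # (n_epochs, n_channels, n_times)
    power = tfr_array_morlet(data, sfreq=sfreq, freqs=freqs,
                             n_cycles=5.5, output="power")
    return power[0, 0].mean(axis=-1)             # time-average -> (n_freqs,)
```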
Stimulus-brain coherence
To quantify the relation between acoustic and neural time courses, we computed the stimulus-brain phase coherence. Coherence was computed for each condition, between each neural frequency of each speech-responsive sEEG channel (from each electrode of each patient) and each acoustic feature, according to the following steps. The acoustic feature time series and sEEG signals were resampled at 256 Hz. For sEEG signals, a time-frequency decomposition was performed by estimating the complex-valued wavelet transform of the neural signals in the 1-120 Hz frequency range (linear scale, 0.5 Hz resolution) using a family of Morlet wavelets (n_cycles=5.5), with zero-padding to minimize boundary effects. The phase information was extracted from these complex-valued neural signals. Finally, the phase coherence was computed over time and trials of the same condition (i.e., concatenated time courses), between the raw acoustic and the spectrally decomposed neural signals.
Note that the acoustic signal was not spectrally decomposed, in order to estimate how the main acoustic dynamics are tracked by each neural frequency, thereby capturing non-linear relationships.
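A minimal sketch of a coherence estimate consistent with this description (the exact implementation is in the linked repository; the normalization below is one standard definition and is an assumption on our part):

```python
import numpy as np

def stimulus_brain_coherence(acoustic, neural_complex):
    """Coherence between a raw acoustic time course and the complex
    Morlet-decomposed signal of one sEEG channel.

    acoustic       : (n_times,) real-valued feature time series
    neural_complex : (n_freqs, n_times) complex wavelet coefficients
    Returns one coherence value per neural frequency.
    """
    cross = neural_complex @ acoustic                       # sum_t z_f(t) * a(t)
    norm = np.sqrt((np.abs(neural_complex) ** 2).sum(axis=1) * np.sum(acoustic ** 2))
    return np.abs(cross) / norm
```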
To estimate statistical significance, we used a surrogate approach with random pairings between audio and brain signals. That is, we computed the coherence for 500 random pairings of acoustic and neural data (within the same experimental condition), ensuring that no acoustic feature was paired with its original neural signal. These permutations were then averaged for each channel, providing an estimate of the baseline coherence values expected by chance.
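A sketch of the surrogate baseline, reusing the stimulus_brain_coherence function above (for simplicity it averages coherence across shuffled pairings and truncates mismatched trial lengths, rather than concatenating time courses as described above):

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_coherence(acoustic_trials, neural_trials, n_perm=500):
    """Chance-level coherence from random acoustic/neural pairings within a
    condition, never reproducing the original pairing."""
    n_trials = len(acoustic_trials)
    null = []
    for _ in range(n_perm):
        perm = rng.permutation(n_trials)
        while np.any(perm == np.arange(n_trials)):       # forbid original pairs
            perm = rng.permutation(n_trials)
        coh = []
        for i, j in enumerate(perm):
            n = min(len(acoustic_trials[i]), neural_trials[j].shape[-1])
            coh.append(stimulus_brain_coherence(acoustic_trials[i][:n],
                                                neural_trials[j][..., :n]))
        null.append(np.mean(coh, axis=0))
    return np.mean(null, axis=0)                         # baseline per frequency
```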
Classification of linguistic time scales
A nested 4-fold cross-validation procedure was used to train/test/validate a support vector machine (SVM) with a Radial Basis Function (RBF) kernel to predict the labeled sentences from the power spectrum of the acoustic signal or the stimulus-brain coherence spectrum. The 315 sentences were labeled according to their syllabic (3 classes) or phonemic (8 classes) time scales (Fig. 1a; see above). As one of the phonemic classes (18 Hz) contains more examples than the others, we applied a stratification procedure: the class_weight parameter of the SVC function was set to "balanced" to correct for the unbalanced number of examples per class, and each condition was equally represented by randomly picking, for each, a number of examples equal to the maximum available across all conditions. Within the inner loop, a grid search using cross-validation (StratifiedKFold=4) was performed to find the hyperparameter combination that allows the classifier to best predict test data. The values of the penalty parameter C ranged from 0.5 to 1000 (5 values, log scale), and γ ranged from 0.01 to 1000 (6 values, log scale). In the outer loop (StratifiedKFold=4), the model was re-trained on the training set with the optimized hyperparameters (C, γ) and then tested on the held-out test set. This procedure was repeated 10 times, shifting the block used for testing each time. Such a procedure effectively uses a series of train/validation/test splits to avoid overfitting. The entire procedure was implemented using the open-source Python library Scikit-learn.
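A minimal scikit-learn sketch of this nested procedure (the hyperparameter grids and fold counts follow the text; shuffling and random seeds are our own choices):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Hyperparameter grids from the text: C in [0.5, 1000] (5 values, log scale),
# gamma in [0.01, 1000] (6 values, log scale).
param_grid = {"C": np.logspace(np.log10(0.5), np.log10(1000), 5),
              "gamma": np.logspace(np.log10(0.01), np.log10(1000), 6)}

def nested_svm_accuracy(X, y, n_repeats=10):
    """Nested 4-fold cross-validation of an RBF-kernel SVM.

    X : (n_sentences, n_frequencies) power or coherence spectra
    y : syllabic or phonemic time-scale labels
    """
    scores = []
    for rep in range(n_repeats):                      # repeat, shifting the folds
        inner = StratifiedKFold(n_splits=4, shuffle=True, random_state=rep)
        outer = StratifiedKFold(n_splits=4, shuffle=True, random_state=rep)
        clf = GridSearchCV(SVC(kernel="rbf", class_weight="balanced"),
                           param_grid, cv=inner)      # inner loop: grid search
        scores.append(cross_val_score(clf, X, y, cv=outer).mean())  # outer loop
    return float(np.mean(scores))
```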
For the classification based on acoustic features, the multivariate factor in the classification analyses corresponded to the power spectrum of each sentence, estimated between 1-50 Hz (linear scale, 1 Hz resolution; see above) for each of the 10 acoustic features (spectral flux, spectral slope, spectral centroid, spectral spread, spectral skewness, RMS, F0 contour, MFCC, amplitude envelope, and envelope derivative).
For the classification based on stimulus-brain coherence results, we applied the same approach, except that the multivariate factor in the classification analyses corresponded to the coherence spectrum of all patients (n=11), estimated between 1-120 Hz (linear scale, 0.5 Hz resolution; see above), averaged across all speech-responsive sEEG channels per patient (see above).
Rate-specific response (RSR)
To estimate the relation between (syllabic or phonemic) linguistic time scales and stimulus-brain coherence dynamics (estimated with either the envelope or spectral flux), we calculated an index that quantifies the rate-specificity of the coherence responses (RSR).
For instance, the RSR at 3 Hz is estimated by taking the 3 Hz coherence value in the 3 Hz syllabic level and contrasting it with the 3 Hz coherence values obtained in the other syllabic levels (6 and 9 Hz), as follows:
RSR = Coh(f = 3, r = 3) - ( Coh(f = 3, r = 6) + Coh(f = 3, r = 9) ) / 2
where f is the frequency at which the coherence value was determined and r is the syllabic time scale (both in Hz). An RSR larger than 0 reflects a response that captures the 3 Hz syllabic rate. For the syllabic time scale, rate-specific responses at 3, 6 and 9 Hz were averaged. For the phonemic time scale, rate-specific responses at 6, 7.5, 9, 12, 15, 18, 22.5 and 27 Hz were averaged.
A surrogate distribution of RSR values (10,000) was constructed to provide an estimate of the RSR values expected by chance. This was done by selecting random coherence values (from 1-120 Hz, excluding the frequency of stimulation). Experimentally observed RSR values were deemed significant if higher than the 95th percentile of the surrogate distribution.
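A minimal sketch of the RSR computation, assuming coherence spectra stored per stimulation rate (variable names are illustrative):

```python
import numpy as np

def rate_specific_response(coh, freqs, target, others):
    """RSR at a target rate: coherence at the stimulated frequency in the
    target-rate condition minus the mean coherence at that same frequency
    in the other rates of the same time scale.

    coh    : dict mapping stimulation rate (Hz) -> coherence spectrum (n_freqs,)
    freqs  : (n_freqs,) frequency axis of the coherence spectra
    target : rate of interest (e.g. 3 for the 3 Hz syllabic level)
    others : remaining rates of that time scale (e.g. [6, 9])
    """
    idx = int(np.argmin(np.abs(freqs - target)))   # frequency bin closest to the rate
    return coh[target][idx] - np.mean([coh[r][idx] for r in others])

# e.g. syllabic RSR = mean of the 3, 6 and 9 Hz rate-specific responses:
# rsr_syll = np.mean([rate_specific_response(coh, freqs, t,
#                                            [r for r in (3, 6, 9) if r != t])
#                     for t in (3, 6, 9)])
```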
Linear encoding models
We estimated temporal response functions (TRF) using the spyEEG toolbox, which allows the use of multiple predictors to explain the time course of neural activity. We built three different linear encoding models, with the following acoustic predictors: amplitude envelope alone, spectral flux alone, and both combined. TRFs were estimated per model (n=3) for each experimental condition (n=9), sEEG speech-responsive channel (n=347; across n=11 patients) and neural signal type (broadband and HFa). While estimating the TRFs, a fivefold cross-validation procedure was used to avoid overfitting and to find the optimal regularization parameter (lambda), exploring values ranging from 10^-6 to 10^6. The optimal lambda was the regularization value leading to the highest Pearson's correlation between predicted and observed sEEG data. TRFs were estimated with a lag window from -2 to 2 seconds. Both neural data and acoustic features were sampled at 100 Hz, z-scored, and filtered between 0.1 and 10 Hz. To determine whether the resulting models could predict neural responses better than chance, and to evaluate how well each feature of interest (envelope, spectral flux, or both combined) captured neural activity, we compared, for each sEEG channel, the model's predictive power against null feature models. Null models were created for the envelope, the spectral flux, and both features combined, by randomly shuffling the respective feature vectors. During cross-validation, the models attempted to predict neural responses using these permuted features. This step was repeated 10 times per feature, and the resulting null predictive values were averaged. To assess whether prediction performances were higher for models trained on real features than for null models trained on shuffled features, we statistically compared their prediction performance using a linear mixed model with random intercepts for patients and channels.
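The TRFs were estimated with spyEEG, whose exact interface is not reproduced here; the sketch below illustrates an equivalent ridge-regularized TRF estimation using MNE-Python's ReceptiveField, with the lag window, sampling rate and lambda range taken from the text (the cross-validation over contiguous time blocks is our simplification):

```python
import numpy as np
from sklearn.model_selection import KFold
from mne.decoding import ReceptiveField

def fit_trf(features, neural, sfreq=100., alphas=np.logspace(-6, 6, 13)):
    """Ridge-regularized TRF with lags from -2 to +2 s.

    features : (n_times, n_features) predictors (envelope and/or spectral flux)
    neural   : (n_times,) one sEEG channel (z-scored, 0.1-10 Hz band-passed)
    Fivefold cross-validation over contiguous time blocks selects the lambda
    giving the highest correlation between predicted and observed activity.
    """
    y = neural[:, np.newaxis]
    cv = KFold(n_splits=5)
    mean_r = []
    for alpha in alphas:                                 # lambda grid, 1e-6 .. 1e6
        fold_r = []
        for train, test in cv.split(features):
            rf = ReceptiveField(tmin=-2., tmax=2., sfreq=sfreq,
                                estimator=alpha, scoring="corrcoef")
            rf.fit(features[train], y[train])
            fold_r.append(np.mean(rf.score(features[test], y[test])))
        mean_r.append(np.mean(fold_r))
    best_alpha = alphas[int(np.argmax(mean_r))]          # optimal lambda
    return ReceptiveField(tmin=-2., tmax=2., sfreq=sfreq,
                          estimator=best_alpha, scoring="corrcoef").fit(features, y)
```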
Sounds corpora analyses
We analyzed audio recordings from three corpora of naturalistic sounds: speech, music and environmental sounds.
The speech corpus contains 17 different languages retrieved from www.faithcomesbyhearing.com. It consists of various verses of the New Testament of the Bible in Arabic, English, Spanish, Basque, Finnish, French, Hindi, Armenian, Japanese, Korean, Dutch, Russian, Swedish, Tamil, Thai, Vietnamese and Mandarin Chinese. We selected the specific versions of the recordings that did not contain any sound effects but only plain speech. We then selected, per language, the 50% of audio recordings with the highest signal-to-noise ratio.
The music corpus comes from the “GTZAN Dataset - Music Genre Classification”; “the MNIST of sounds”, a collection of 10 musical genres comprising blues, classical, country, disco, hiphop, jazz, metal, pop, reggae, and rock. Each genre contains 100 audio files, each with a length of 30 seconds, sampled at 22.05 kHz. To obtain datasets by genre comparable in size to those used for speech, we retained the 95% of stimuli with the highest signal-to-noise ratio in each genre.
The corpus of environmental sounds comes from the "Making Sense of Sound" classification challenge, containing a broad range of sounds from everyday life. This corpus is originally composed of 5 categories: Nature, Human, Music, Effects and Urban, with 300 sounds in each category. We removed the Music and Human groups to focus only on environmental sounds, and analyzed the 900 remaining sounds. All samples were 5 seconds long, sampled at 44.1 kHz, and peak-normalized. To obtain datasets by category comparable in size to those used for speech and music, we retained the 95% of stimuli with the highest signal-to-noise ratio in each category.
The audio files of each corpus were converted from multi-channel to single channel and resampled to 16 kHz. Each file was then cut into a 3-second-long segment, and the power spectra of both the acoustic amplitude envelope and the spectral flux were extracted (as above, see 'Audio features extraction'; 1-50 Hz, linear scale, 1 Hz resolution). To alleviate the effect of the 1/f trend present in the power spectrum and ease the peak detection procedure, each modulation value was multiplied by 0.85 times the square root of the frequency at which it was estimated.
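A minimal sketch of this 1/f-compensation step (the frequency axis is assumed to be the 1-50 Hz grid described above):

```python
import numpy as np

def whiten_modulation_spectrum(power, freqs):
    """Attenuate the 1/f trend before peak detection by scaling each
    modulation value by 0.85 * sqrt(frequency), as described above."""
    return power * 0.85 * np.sqrt(freqs)

# freqs = np.arange(1., 51.)        # 1-50 Hz, 1 Hz resolution
# whitened = whiten_modulation_spectrum(envelope_power, freqs)
```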