## Dataset from "Neural event segmentation of continuous experience in human infants"

Yates, T. S., Skalaban, L. J., Ellis, C. T., Bracher, A. J., Baldassano, C., & Turk-Browne, N. B. (2022). Neural event segmentation of continuous experience in human infants. *Proceedings of the National Academy of Sciences*.

This directory contains the raw and preprocessed movie-watching data and event segmentation files used for the analyses in the manuscript. To analyze these data, refer to the [infant\_neuropipe repository](https://github.com/ntblab/infant_neuropipe/tree/EventSeg/). In particular, the notebook [Event\_Segmentation.ipynb](https://github.com/ntblab/infant_neuropipe/tree/EventSeg/scripts/EventSeg/Event_Segmentation.ipynb) uses these files to recreate the figures reported in the paper. For this purpose, the contents of this directory are expected to be in a folder called `data/EventSeg`. The notebook can also be adapted to explore other analyses, which in some cases will create new files.

Aeronaut is a 3-minute segment of a short film entitled "Soar" created by Alyce Tzue ([https://vimeo.com/148198462](https://vimeo.com/148198462)). The film was downloaded from YouTube in Fall 2017, and iMovie was used to trim its length. In the video, a girl is looking at airplane blueprints when a miniature boy crashes his flying machine onto her workbench. The pilot appears frightened at first, but the girl helps him fix the machine. After a few failed attempts, a blueprint flies into the girl's shoes, which they use to finally launch the flying machine into the air to join a flotilla of other ships drifting away. In the night sky, the pilot opens his suitcase, revealing a diamond star, and tosses it into the sky. The pilot then looks down at Earth and signals to the girl, who looks up as the night sky fills with stars. (Segments: 0:40 -- 2:30, 3:16 -- 4:26)

Mickey is a 2-minute and 20-second segment of a popular cartoon ([https://www.youtube.com/watch?v=hHCt2c_H0Ic](https://www.youtube.com/watch?v=hHCt2c_H0Ic)). The video was downloaded from YouTube in Spring 2016. In this video, a surprise party is thrown, with characters dancing and playing the piano while one character makes an exploding cake in the kitchen. (Segments: 0:06 -- 2:26)

In the functional imaging data provided, infants and adults watched these clips without audio. The scan sequences are as follows:

>MPRAGE: TR = 2300 ms, TI = 900 ms, flip angle = 9 degrees, iPAT = 2, matrix = 256 x 256, slices = 176, resolution = 1.00 mm isotropic

>PETRA: TR1 = 3.32 ms, TR2 = 2250 ms, TE = 0.07 ms, flip angle = 6 degrees, matrix = 320 x 320, slices = 320, resolution = 0.94 mm isotropic, radial slices = 30,000

>SPACE: TR = 3200 ms, TE = 563 ms, flip angle = 120 degrees, matrix = 192 x 192, slices = 176, resolution = 1 mm isotropic

>T2* gradient-echo EPI: TR = 2000 ms, TE = 30 ms, flip angle = 71 degrees, matrix = 64 x 64, slices = 34, resolution = 3 mm isotropic, interleaved slice acquisition

All files with the suffix '.nii.gz' or '.nii' can be opened using the freely available [FSL](https://fsl.fmrib.ox.ac.uk/fsl/fslwiki) (FMRIB Software Library) or the open-source software [FreeSurfer](https://surfer.nmr.mgh.harvard.edu/).
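These files can also be read programmatically; below is a minimal Python sketch using the nibabel package (the package choice and the example file name are assumptions of this sketch, not requirements of the dataset):

```python
# Minimal sketch: inspect one NIfTI file from this dataset with nibabel.
# The participant/run in the path is hypothetical; substitute a real file.
import nibabel as nib

img = nib.load("data/EventSeg/Aeronaut/preprocessed_standard/linear_alignment/s0001_1_1_functional01.nii.gz")
data = img.get_fdata()           # for functional runs: x, y, z, TR
print(data.shape)
print(img.header.get_zooms())    # voxel size in mm (and TR in s for 4D images)
```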
### File/directory descriptions:

**Aeronaut**

This folder contains raw and preprocessed infant and adult movie-watching data, which can be used for a variety of analyses.

Participant-specific files are named with the participant name at the start. Infant participant names consist of three parts: sXXXX is the unique family ID, the \_X that follows is the sibling ID (counting up from the first child in the family to participate), and the final \_X is the session number. Hence, s0001\_2\_4 would be the 4th session for the 2nd sibling in family s0001. These infant participant IDs are consistent across datasets from the NTB Lab. Adult participant files are simply named based on the subject number and the study of acquisition: the prefix "mov" is used for the Aeronaut dataset and the prefix "mickey" for the Mickey dataset. Run numbers indicate the nth run that was retained in that participant's session. A letter after the run number (e.g., functional03a) indicates a pseudorun: other data from that run have been removed because they pertained to another task not reported here. All data within a run are continuous; no interleaved time points were removed.
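As an illustration of this naming scheme, here is a small hypothetical Python helper (not part of the repository):

```python
# Hypothetical helper: split an infant participant ID such as "s0001_2_4"
# into its family ID, sibling ID, and session number, per the scheme above.
def parse_infant_id(participant_id: str) -> dict:
    family, sibling, session = participant_id.split("_")
    return {
        "family_id": family,         # e.g., "s0001" (unique family)
        "sibling_id": int(sibling),  # e.g., 2 (2nd child to participate)
        "session": int(session),     # e.g., 4 (4th session)
    }

print(parse_infant_id("s0001_2_4"))
# {'family_id': 's0001', 'sibling_id': 2, 'session': 4}
```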
> **adult\_participants.csv**: Summary table of participant information, including: the participant ID, participant age (in years), participant sex (male or female), the location of the scan, the session number, the total number of TRs collected (num\_TR), the proportion of TRs that were usable after motion exclusion (prop\_TR\_motion), the proportion of TRs that were usable after eye-tracking exclusion (prop\_TR\_eye), the proportion of frames that were coded the same between gaze coders (eye\_reliability; left blank if there was only one gaze coder), and the number of gaze coders (coder\_num).

> **anatomicals**: Anatomical images used for alignment. Facial information has been stripped for anonymity. These were collected using the PETRA sequence (for infants) or the MPRAGE sequence (for adults) defined above. In some cases, more than one scan was averaged to improve quality.

> **eye\_confounds**: Text files with 1s for TRs that should be excluded because of eye closure.

> **infant\_participants.csv**: Summary table of participant information, including: the participant ID, participant age (in months), participant sex (male or female), the location of the scan, the session number, the total number of TRs collected (num\_TR), the proportion of TRs that were usable after motion exclusion (prop\_TR\_motion), the proportion of TRs that were usable after eye-tracking exclusion (prop\_TR\_eye), the proportion of frames that were coded the same between gaze coders (eye\_reliability), and the number of gaze coders (coder\_num). Eye-tracking data are missing for one participant.

> **motion\_confounds**: Text files with 1s for TRs that should be excluded because of excessive motion (>3 mm translational motion).

> **preprocessed\_native**: Contains a folder called linear\_alignment with nifti files of preprocessed functional data during movie watching. All functional images were linearly aligned to native anatomical space.

> **preprocessed\_standard**: Contains nifti files of preprocessed functional data during movie watching that have been aligned to standard MNI space, either linearly with manual adjustment (subfolder linear\_alignment) or nonlinearly with ANTs (subfolder nonlinear\_alignment).

> **raw\_nifti**: Raw functional data for each run in which movie task data were collected. If another task, not reported here, was completed in the same run, a pseudo-run was created in which the TRs corresponding to that task were sliced out and separated.

> **raw\_timing**: Timing information for the start of each block and event for each participant. In each file, the first column is the onset of the event or block, the second column is its duration, and the third column is the weight. (Note that all participants but s5037\_1\_1 saw the movie only once in the session.)

> > **run\_burn\_in.txt** and **run\_burn\_in\_adults.txt**: Files with the subject name, functional run, and number of burn-in TRs for that run (3 by default, but may differ).

> **transformation\_mats**: The 4x4 affine transformation matrices (in .mat format) used to align the data. One type of file aligns each functional run in raw\_nifti to highres (files with \_highres, one per run). The other type aligns highres to standard (files with \_highres2standard).

> **transformation\_ANTs**: Contains ANTs folders for each participant. These were created by run\_ANTs\_highres2standard.sh and were used to create the nonlinear registration to infant standard and the linear registration to adult MNI standard. Note that for adults, infant\_standard2standard.mat is an identity matrix, meaning that alignment to 'infant\_standard' vs. 'standard' is identical; for adults, all references to 'infant\_standard' actually refer to adult MNI space.

> > **example\_func2highres.nii.gz**: functional image of the centroid TR (the TR that minimizes the Euclidean distance to the other TRs) aligned to highres anatomical space

> > **example\_func2infant\_standard.nii.gz**: functional image of the centroid TR aligned to infant standard space

> > **example\_func2standard.nii.gz**: functional image of the centroid TR aligned to adult MNI standard space

> > **example\_func.nii.gz**: functional image of the centroid TR in its native 3 mm space

> > **fs\_alignment.mat**: transformation matrix that aligns fs\_vol.nii.gz to highres\_brain.nii.gz (6 degrees of freedom)

> > **fs\_brain.nii.gz**: FreeSurfer-generated highres anatomical image, rotated and masked to show only brain voxels

> > **fs\_vol.nii.gz**: FreeSurfer-generated highres anatomical image in 1 mm space

> > **highres2infant\_standard\_0GenericAffine.mat**: transformation matrix used to move from highres to infant standard space

> > **highres2infant\_standard\_1InverseWarp.nii.gz**: warp file used by ANTs to move from infant standard space to highres

> > **highres2infant\_standard\_1Warp\_3mm.nii.gz**: warp file used by ANTs to move from highres to infant standard space (while maintaining 3 mm functional resolution)

> > **highres2infant\_standard\_1Warp.nii.gz**: warp file used by ANTs to move from highres to infant standard space

> > **highres2infant\_standard\_InverseWarped.nii.gz**: infant standard image aligned to highres space via ANTs

> > **highres2infant\_standard\_Warped.nii.gz**: highres anatomical image aligned to infant standard space via ANTs

> > **highres2standard.nii.gz**: highres anatomical image aligned to adult MNI standard space

> > **infant\_standard2standard.mat**: linear transformation matrix between infant standard and adult MNI standard space

> > **highres\_brain.nii.gz**: highres anatomical image masked to show only brain voxels

> > **infant\_standard.nii.gz**: infant standard image, chosen based on the child's age

> > **mask.nii.gz**: mask to facilitate anatomical alignment to standard, manually edited from FreeSurfer output
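The .mat files in transformation\_mats are plain-text 4x4 affines in FSL's convention, so they can be inspected directly; a minimal sketch (file name hypothetical):

```python
# Minimal sketch: read an FSL-style 4x4 affine stored as plain text and
# apply it to a point in homogeneous coordinates. File name is hypothetical.
import numpy as np

affine = np.loadtxt("data/EventSeg/Aeronaut/transformation_mats/s0001_1_1_functional01_highres.mat")
print(affine.shape)  # (4, 4)

point = np.array([32.0, 32.0, 17.0, 1.0])  # homogeneous coordinate
print(affine @ point)                      # mapped into the target space
```

Note that FSL applies these matrices in its scaled-mm coordinate convention rather than to raw voxel indices, so for resampling images it is safer to use FLIRT with -applyxfm than to apply the matrix by hand.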
**Mickey**

This folder contains the same contents as the Aeronaut data files, but because the movie was often played twice to infant participants (and always twice to adults), there are more timing files and sometimes more nifti files per subject.

**EventSeg**

Files in this folder are the outputs of the event segmentation analyses and are necessary for reproducing the figures in the paper. *Note: Unless otherwise indicated, files reflect results from the default (linear) alignment method used in the manuscript.*

> **Aeronaut**

> > **adult\_participants.csv**: Identical to the file of the same name in the Aeronaut folder described above.

> > **adults\_wholebrain\_data.npy** and **adults\_wholebrain\_data\_nonlinear.npy**: Files containing the whole-brain, preprocessed data of adults watching the movie. The data shape is: number of TRs x number of brain voxels x number of subjects.

> > **behavioral\_boundary\_events.npy**: This file contains the TR timepoints designated as event boundaries after thresholding at 36% the percentage of behavioral participants who indicated an event boundary, retaining only the boundaries that were robust to shifts due to response time. The TRs in this file have already been shifted to account for the hemodynamic response (a 2-TR, or 4-second, shift).

> > **behavioral\_boundary\_keypress\_all\_subjects.npy**: This file contains the key press responses of the 22 behavioral participants who indicated the presence of event boundaries during the movie. Values of 1 mean that the subject indicated an event boundary within 4 seconds of that TR.
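These .npy files load directly with NumPy; a minimal sketch (the leading path assumes the layout described at the top of this README):

```python
# Minimal sketch: load a whole-brain array and confirm its layout, which
# the description above gives as (TRs, brain voxels, subjects).
import numpy as np

data = np.load("data/EventSeg/EventSeg/Aeronaut/adults_wholebrain_data.npy")
n_trs, n_voxels, n_subjects = data.shape
print(n_trs, n_voxels, n_subjects)

# Illustration only: a group-average timecourse per voxel, using nanmean
# in case excluded TRs are stored as NaN (an assumption of this sketch).
group_mean = np.nanmean(data, axis=2)  # shape: (TRs, voxels)
```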
> > **eventseg\_human\_bounds**: This folder contains the whole-brain searchlight results showing the within- vs. across-boundary correlations using behavioral boundaries. Files with the suffix "\_alt" were created using a more conservative analysis approach, and files with the suffix "\_distance" were created using a more continuous approach (both detailed in the Supplementary Information of the manuscript).

> > > Within- vs. across-boundary correlation maps (created by HumanBounds\_Searchlight.py): These files contain the within- vs. across-boundary correlations for each searchlight center and are named ${age\_group}\_sub\_${sub\_number}\_humanbounds.nii.gz (e.g., infants\_sub\_0\_humanbounds.nii.gz).

> > > Outputs of bootstrap resampling of the human boundary analysis (HumanBounds\_Bootstrapping.py): These files contain the z-statistics comparing the bootstrapped distribution of within- vs. across-boundary correlations to zero and are named ${age\_group}\_avg\_zscores.nii.gz (e.g., infants\_avg\_zscores.nii.gz).

> > > The average within- vs. across-boundary correlation map, unthresholded, is created by the Event\_Segmentation.ipynb notebook and saved here (e.g., infants\_avg\_bounds.nii.gz).

> > **eventseg\_optk**: This folder contains the outputs from estimating the optimal number of events for each ROI, as well as outputs from the inner- and outer-loop iterations of the event segmentation analysis.

> > > Log-likelihood values of event models in an ROI across all participants in a given age group (created by FindOptK\_ROI.py): These files contain the log-likelihood values for each inner-loop iteration and given number of events and are named ${age\_group}\_${roi}\_${num\_events}\_events\_cb\_method\_loglik\_${k\_split}.npy (e.g., infants\_Precuneus\_standard\_0\_events\_cb\_method\_loglik\_2.npy).

> > > Log-likelihood values of event models in an ROI across all but one participant in a given age group (created by FindOptK\_ROI.py with a held-out participant number as input): These files contain the log-likelihood values for each inner-loop iteration and given number of events and are named ${age\_group}\_${roi}\_${num\_events}\_events\_leftout\_${leftout\_sub}\_loglik.npy (e.g., infants\_Precuneus\_standard\_2\_events\_leftout\_0\_loglik.npy).

> > > Outputs from applying the optimal number of events from the inner-loop log-likelihoods in an ROI while holding out one participant in a given age group (created by FindOptK\_Outer.py): These files contain the optimal K value from the inner loop, the log-likelihood of the held-out participant's actual data, and the z-statistic comparing the actual log-likelihood with a permutation distribution; they are named ${age\_group}\_${roi}\_relsub\_${leftout\_sub}\_bestk\_loglik.npy (e.g., infants\_Precuneus\_standard\_relsub\_0\_bestk\_loglik.npy).

> > **eventseg\_searchlights** and **eventseg\_searchlights\_nonlinear**: These folders contain the outputs from estimating the optimal number of events in a searchlight across the brain (created by FindOptK\_Searchlight.py), using either the default linear alignment method or nonlinear alignment, respectively.

> > > Log-likelihood values of event models across searchlights: These files contain the log-likelihood values for a given inner-loop iteration and event number within an age group and are named ${age\_group}\_all\_events\_${num\_events}\_iter\_${iter\_num}\_lls.nii.gz (e.g., infants\_all\_events\_2\_iter\_0\_lls.nii.gz).

> > > Average log-likelihood values across all iterations: These files contain the log-likelihood values averaged across inner-loop iterations and are named ${age\_group}\_all\_events\_${num\_events}\_average\_lls.nii.gz (e.g., infants\_all\_events\_2\_average\_lls.nii.gz).
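The ${...} templates above expand mechanically, so output files can be gathered by pattern; a hypothetical sketch:

```python
# Hypothetical sketch: expand the naming template above to collect the
# average searchlight log-likelihood maps for one age group.
import glob

age_group = "infants"  # or "adults"
pattern = (
    "data/EventSeg/EventSeg/Aeronaut/eventseg_searchlights/"
    f"{age_group}_all_events_*_average_lls.nii.gz"
)
for path in sorted(glob.glob(pattern)):
    print(path)
```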
> > **infant\_participants.csv**: Identical to the file of the same name in the Aeronaut folder described above. Eye-tracking data are missing for one participant.

> > **infants\_wholebrain\_data.npy** and **infants\_wholebrain\_data\_nonlinear.npy**: Files containing the whole-brain, preprocessed data of infants watching the movie. The data shape is: number of TRs x number of brain voxels x number of subjects.

> > **infants\_gaze\_subgroup\_wholebrain\_data.npy**, **infants\_gaze\_highmedian\_wholebrain\_data.npy**, and **infants\_gaze\_lowmedian\_wholebrain\_data.npy**: Same as above, but for the 4 infants who looked at the movie the whole time, the 12 infants with the most looking to the movie, and the 12 infants with the least looking to the movie, respectively.

> > **intersect\_mask\_standard\_firstview\_all.nii.gz**: The intersection of the brain masks of all infant and adult subjects who watched the Aeronaut movie.

> > **plots**: Where the notebook (Event\_Segmentation.ipynb) stores the plots created in the analysis (files with the suffix '\_nonlinear' correspond to nonlinear alignment). Specifically:

> > > 'across\_groups\_ll\_all\_rois.svg' shows the z-statistic values for model log-likelihood vs. permuted chance for each ROI and all group combinations of training and testing the model (e.g., adult fit when trained on adults; infant fit when trained on adults).

> > > 'adult\_age\_hist.svg' and 'age\_hist.svg' show the age distribution of subjects, in years for adults and in months for infants.

> > > '${age\_group}\_across\_roi\_iscs' shows the across-movie (Aeronaut vs. Mickey) intersubject correlation values averaged across voxels for each ROI, with a dark line showing the actual intersubject correlation values.

> > > '${age\_group}\_human\_bounds\_hmm\_rois.svg' shows the correlation between the continuous measure of model boundaries and adult behavioral boundaries vs. chance for each ROI.

> > > '${age\_group}\_human\_bounds\_rois\_alt.svg' shows the within- vs. across-boundary pattern similarity correlation values when anchoring to the same timepoint, across all ROIs.

> > > '${age\_group}\_human\_bounds\_rois.svg' shows the within- vs. across-boundary pattern similarity correlation values when using all possible pairs of timepoints, across all ROIs.

> > > '${age\_group}\_isc\_vs\_time\_permuted.svg' shows time-permuted intersubject correlation values vs. actual intersubject correlation values for each ROI.

> > > '${age\_group}\_nested\_results.svg' shows the z-statistic values for model log-likelihood vs. permuted chance for each ROI when models were trained and tested on the same age group.

> > > '${age\_group}\_roi\_iscs.svg' shows the actual intersubject correlation values averaged across voxels for each ROI.

> > > 'behavioral\_boundaries\_subject\_timecourse.svg' shows the timepoints at which each behavioral participant indicated an event boundary; a second version visualizes the same boundaries with the proportion of participants who pressed a key on the y-axis.

> > > '${ROI}\_rsa\_${age\_group}.svg' shows the timepoint-by-timepoint pattern similarity in the given ROI and age group, with model boundaries marked in red.

> > > 'example\_continuous\_behav\_corr.svg' shows an example of how timepoint pattern similarity and absolute distance to the nearest event boundary align.

> > > 'example\_hmm\_behavior.svg' shows an example of how continuous behavioral boundaries and continuous model boundaries align.

> > > 'infants\_gaze\_loglik\_corrs.svg' shows the log-likelihood values for different optimal numbers of events for infant subgroups based on looking behavior.

> > > 'isc\_${age\_group}.svg' shows the whole-brain intersubject correlation values thresholded at a correlation of 0.10.
> > > 'kvals\_${age\_group}.svg' shows the whole-brain optimal numbers of event boundaries, thresholded at an intersubject correlation of 0.10.

> > > 'peri-event\_stim\_${age\_group}.svg' shows timepoint pattern similarity approaching and receding from behavioral event boundaries for each ROI.

> > > 'searchlight\_wva\_bounds\_${age\_group}.svg' and 'searchlight\_wva\_bounds\_${age\_group}\_alt.svg' show the whole-brain within- vs. across-boundary pattern similarity values for different approaches to relating brain and behavior.

> > > 'searchlight\_wva\_bounds\_distance\_${age\_group}.svg' shows the whole-brain correlation between timepoint pattern similarity and distance to event boundaries. Files with the prefix 'searchlight\_zstat' show the z-statistics from bootstrap resampling of the corresponding files (e.g., searchlight\_zstat\_bounds\_infants draws from searchlight\_wva\_bounds\_infants).

> > > 'wva\_bounds\_distance\_${age\_group}.svg' shows the average correlation between timepoint pattern similarity and distance to the event boundary across ROIs.

> **Mickey**

> Follows the same file structure as Aeronaut, but without any files related to behavioral event boundaries, nonlinear alignment, or supplemental analyses.

> **ROIs**

> Contains the ROIs generated from the Harvard-Oxford probabilistic atlas (0% probability threshold) in early visual cortex (EVC), lateral occipital cortex (LOC), angular gyrus (AG), precuneus, early auditory cortex (EAC), and the hippocampus, as well as functionally defined parcellations derived from resting-state data (Shirer et al., 2012) for medial prefrontal cortex (mPFC) and posterior cingulate cortex (PCC).

### Replicating analyses

The scripts in the infant\_neuropipe repository can be used to run the analyses reported in the paper. The Event\_Segmentation.ipynb notebook can regenerate the figures, and scripts in that directory can rerun the analyses; refer to the notebook and the analysis README for more direction.

Questions about the data or analyses can be directed to Tristan Yates, tristan.yates@yale.edu.