Vocal communication is tied to interpersonal arousal coupling in caregiver-infant dyads
Abstract
It has been argued that a necessary condition for the emergence of speech in humans is the ability to vocalise irrespective of underlying affective states, but when and how this happens during development remains unclear. To examine this, we used wearable microphones and autonomic sensors to collect multimodal naturalistic datasets from 12-month-olds and their caregivers. We observed that, across the day, clusters of vocalisations occur during elevated infant and caregiver arousal. This relationship is stronger in infants than caregivers: caregivers' vocalisations show greater decoupling from their own states of arousal, and their vocal production is more influenced by the infant’s arousal than their own. Different types of vocalisation elicit different patterns of change across the dyad. Cries occur following reduced infant arousal stability and lead to increased child-caregiver arousal coupling, and decreased infant arousal. Speech-like vocalisations also occur at elevated arousal, but lead to longer-lasting increases in arousal, and elicit more parental verbal responses. Our results suggest that 12-month-old infants’ vocalisations are strongly contingent on their arousal state (for both cries and speech-like vocalisations), whereas adults’ vocalisations are more flexibly tied to their own arousal; that cries and speech-like vocalisations alter the intra-dyadic dynamics of arousal in different ways, which may be an important factor driving speech development; and that this selection mechanism, which drives vocal development, is anchored in our stress physiology.
Methods
Experimental participant details
The project was approved by the Research Ethics Committee at the University of East London (Approval number: EXP 1617 04). Informed consent, and intent to publish, were obtained in the usual manner. Participants were recruited from the London, Essex, Hertfordshire and Cambridge regions of the UK. In total, 91 infant-caregiver dyads were recruited to participate in the study, and usable autonomic data were recorded from 82 of them. Of these, usable paired autonomic data (from both caregiver and child) were obtained from 74. Further details, including exclusion criteria and demographic details on the sample, are given in Appendix 1 section 1.1. The sample size was selected following power calculations presented in the original funding application ES/N017560/1. Of note, we excluded families in which the primary day-time care was performed by the male caregiver because the numbers were insufficient to provide an adequately gender-matched sample. All participating caregivers were, therefore, female. Participants received £30 in gift vouchers as a token of gratitude for participation, split over two visits.
Experimental method details
Participating caregivers were invited to select a day during which they would be spending the entire day with their child but which was otherwise, as far as possible, typical for them and their child. The researcher visited the participants’ homes in the morning (c. 7:30–10am) to fit the equipment, and returned later (c. 4–7pm) to pick it up. The mean (std) recording time per day was 7.3 (1.4) hours.
The equipment consisted of two wearable layers, for both infant and caregiver. For the infant, a specially designed baby-grow was worn next to the skin, which contained a built-in Electrocardiogram (ECG) recording device (recording at 250Hz), accelerometer (30Hz), Global Positioning System (GPS) (1Hz), and microphone (11.6kHz). A T-shirt, worn on top of the device, contained a pocket to hold the microphone and a miniature video camera (a commercially available Narrative Clip 2 camera). For the caregiver, a specially designed chest strap was also worn next to the skin, containing the same equipment. A cardigan, worn as a top layer, contained the microphone and video camera. The clothes were comfortable when worn and, other than a request to keep the equipment dry, participants were encouraged to behave exactly as they would do on a normal day.
At the start and end of each recording session, before the devices were inserted into the clothes worn by the participants, the researchers synchronised the two devices by holding them on top of one another and moving them sharply from side to side, once per second for 10 consecutive seconds. Post hoc, trained coders identified the timings of these movements in the accelerometer data from each device independently. This information was used to synchronise the two recording devices.
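As an illustration of how the coded movement timings could be turned into a clock correction, the following is a minimal sketch in Python (not the authors' Matlab scripts). It assumes that each coded movement has a timestamp in seconds on each device's own clock, and that the relationship between the two clocks is well described by a constant offset plus a linear drift. Both the model and the function names `fit_clock_map` and `to_clock_a` are assumptions made for this example.

```python
def fit_clock_map(times_b, times_a):
    """Fit t_a = slope * t_b + intercept by ordinary least squares.

    times_b, times_a: timestamps (seconds) of the same synchronisation
    movements as coded on device B and device A respectively.
    ASSUMPTION: a linear (offset + drift) clock model is adequate.
    """
    n = len(times_b)
    if n != len(times_a) or n < 2:
        raise ValueError("need at least two matched movement timings")
    mean_b = sum(times_b) / n
    mean_a = sum(times_a) / n
    sxx = sum((b - mean_b) ** 2 for b in times_b)
    sxy = sum((b - mean_b) * (a - mean_a) for b, a in zip(times_b, times_a))
    if sxx == 0:
        raise ValueError("movement timings on device B are all identical")
    slope = sxy / sxx
    intercept = mean_a - slope * mean_b
    return slope, intercept


def to_clock_a(t_b, slope, intercept):
    """Convert a device-B timestamp onto device A's clock."""
    return slope * t_b + intercept
```

Using the movements from both the start and the end of the session constrains the drift term; with the start-of-session movements alone, only the offset would be well determined.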
Quantification and statistical analysis
Autonomic data parsing and calculation of the autonomic composite measure. Further details on the parsing of the heart rate, heart rate variability, and actigraphy are given here: https://tinyurl.com/yckzfxf8. Here we present our motivation for collapsing these three measures into a single composite measure of autonomic arousal.
Home/Awake coding. Our preliminary analyses suggested that infants tended to be strapped-in to either a buggy or car seat for much of the time that they were outdoors, which strongly influenced their autonomic data. For this reason, all of the analyses presented in the paper only include data segments in which the dyad was at home and the infant was awake. A description of how these segments were identified is given in Appendix 1 (section 1.7). Following these exclusions, the mean (std) total amount of data available per dyad was 3.7 (1.7) hours, corresponding to 221.5 (102.4) 60-second epochs per dyad.
Vocal coding. The microphone recorded a 5-second snapshot of the auditory environment every 60 seconds. Post hoc, trained coders identified samples in which the infant or caregiver was vocalising, and the following codings were applied. For each coding scheme, consistency of rating between coders was achieved through discussions and joint coding sessions based on an ersatz dataset, before the actual dataset was coded. All coders were blind to study design and hypothesised study outcome.
Importantly, analyses conducted on a separate, continuous dataset (see Appendix 1, section S10) suggest that the temporal structure of our vocalisations was maintained despite this ‘sparse sampling’ approach. Furthermore, our analyses examine how arousal changes relative to observed vocalisations, and any arousal changes that we do observe time-locked to vocalisations would be weakened (not strengthened) by the sparse sampling of the vocalisation data, because missing vocalisations reduces statistical power rather than increasing it.
Infant data. i) vocalisation type. A morphological coding scheme was applied with the following categories: cry, laugh, squeal, growl, quasi-resonant vowel, fully-resonant vowel, marginal syllable, canonical syllable. Overall, 29% of vocalisations were cries; 1% laughs; 1% squeals; 3% growls; 18% quasi-resonant vowels; 18% fully-resonant vowels; 6% marginal syllables; 23% canonical syllables. For analyses presented in the main text these were collapsed into cries and speech-like vocalisations, which included the following non-cry categories: quasi-resonant vowel; fully-resonant vowel; marginal syllable; canonical syllable. Laughs, squeals and growls were excluded due to rarity. ii) vocal affect was coded on a three-point scale: negative (fussy and difficult), neutral, or positive (happy and engaged). In order to assess inter-rater reliability, 11% of the sample was double coded; Cohen’s kappa was 0.70, which is considered substantial agreement. iii) vocal intensity was coded on a three-point scale: low emotional intensity, neutral, or high emotional intensity.
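For readers who want to reproduce an inter-rater reliability figure from double-coded samples, the following is a minimal Python sketch of the standard (unweighted) Cohen's kappa. It is illustrative only and is not the authors' Matlab code; the function name `cohens_kappa` and the example labels are assumptions.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Unweighted Cohen's kappa for two coders rating the same samples."""
    n = len(coder_a)
    if n == 0 or n != len(coder_b):
        raise ValueError("both coders must rate the same non-empty sample")
    # Proportion of samples on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's category frequencies.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (observed - expected) / (1.0 - expected)


# Hypothetical three-point affect codes from two coders.
example_a = ["neg", "neg", "neu", "pos", "pos", "neu"]
example_b = ["neg", "neu", "neu", "pos", "pos", "pos"]
example_kappa = cohens_kappa(example_a, example_b)
```

Kappa corrects raw agreement for the agreement expected if both coders assigned categories independently at their own base rates, which matters here because the category frequencies are far from uniform.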
Adult data. i) vocalisation type. A trained coder listened to vocalisations one by one and categorised them into the following categories: Imperative, Question, Praise, Singing, Imitation of Baby Vocalisation, Laughter, Reassurance, Sighing, Storytelling. These were then further collapsed into four supraordinate categories: Positive (Singing, Laughter); Stimulating (Question); Intrusive/negative affect (Imperative, Sighing); Sensitive (Praise, Imitation of Baby Vocalisation, Reassurance, Storytelling). Overall, 14% of vocalisations were Positive; 30% were Stimulating; 41% were Intrusive; 15% were Sensitive. In addition, ii) vocal affect and iii) vocal intensity were coded in the same way as for the infant data. In order to assess inter-rater reliability, 24% of the sample was double coded; Cohen’s kappa was 0.60, which is considered acceptable.
Physical positions while vocalising. We also ascertained the physical position of our participants while vocalising (Appendix 1 section 1.8).
Permutation-based temporal clustering analyses. To estimate the significance of time-series relationships, a permutation-based temporal clustering approach was used. This procedure, which is adapted from neuroimaging, allows us to estimate the probability of temporally contiguous relationships being observed in our results, a fact that standard approaches to correcting for multiple comparisons fail to account for. See further details in Appendix 1 section 1.9.
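The exact procedure is specified in Appendix 1 section 1.9. As a rough illustration of the general idea only, the following Python sketch implements one common variant of a cluster-based permutation test: a one-sample t statistic per time point, clusters formed from contiguous same-sign supra-threshold time points, cluster mass as the test statistic, and a null distribution built by randomly sign-flipping each participant's time series. The cluster-forming threshold of 2.0, the sign-flip scheme, and all function names are assumptions for this example and may differ from the authors' implementation.

```python
import math
import random


def t_values(rows):
    """One-sample t statistic at each time point (rows = participants x time)."""
    n = len(rows)
    out = []
    for col in zip(*rows):
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        out.append(mean / math.sqrt(var / n) if var > 0 else 0.0)
    return out


def max_cluster_mass(tvals, threshold):
    """Largest summed |t| over a run of contiguous same-sign supra-threshold points."""
    best = run = 0.0
    prev_sign = 0
    for t in tvals:
        sign = (t > threshold) - (t < -threshold)
        if sign == 0:
            run = 0.0
        elif sign == prev_sign:
            run += abs(t)
        else:
            run = abs(t)
        prev_sign = sign
        best = max(best, run)
    return best


def cluster_permutation_p(rows, threshold=2.0, n_perm=1000, seed=0):
    """Return (observed max cluster mass, permutation p value).

    ASSUMPTION: rows hold per-participant difference scores, so that under
    the null hypothesis each participant's series is symmetric about zero.
    """
    rng = random.Random(seed)
    observed = max_cluster_mass(t_values(rows), threshold)
    exceed = 0
    for _ in range(n_perm):
        flipped = []
        for row in rows:
            s = rng.choice((-1.0, 1.0))
            flipped.append([x * s for x in row])
        if max_cluster_mass(t_values(flipped), threshold) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Because only the maximum cluster mass from each permutation enters the null distribution, the resulting p value accounts for the temporal contiguity of effects across the whole time series in a single test, rather than correcting each time point separately.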
ROC analyses. We employed a signal detection framework based on the Receiver Operating Characteristic (ROC). This quantifies, epoch by epoch, the degree to which arousal levels predict the timings of vocalisations relative to the timings of randomly sampled comparison samples. See the Results section and reference 67 for more details.
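To make the ROC logic concrete, the following Python sketch computes the area under the ROC curve (AUC) as the probability that arousal at a vocalisation epoch exceeds arousal at a comparison epoch, and repeats this at several lags around the events. It is a sketch under stated assumptions rather than the authors' code: the pairwise (Mann-Whitney) formulation of AUC, the lag convention, and the names `roc_auc` and `auc_by_lag` are all choices made for this example.

```python
def roc_auc(event_values, comparison_values):
    """AUC as P(event > comparison), counting ties as one half."""
    if not event_values or not comparison_values:
        raise ValueError("both groups must contain at least one value")
    wins = 0.0
    for e in event_values:
        for c in comparison_values:
            if e > c:
                wins += 1.0
            elif e == c:
                wins += 0.5
    return wins / (len(event_values) * len(comparison_values))


def auc_by_lag(arousal, event_epochs, comparison_epochs, lags):
    """AUC at each lag (in epochs) relative to event and comparison epochs.

    ASSUMPTION: arousal is a list with one value per epoch; epochs that
    fall outside the recording after shifting by the lag are dropped.
    """
    def values(epochs, lag):
        shifted = [e + lag for e in epochs]
        return [arousal[i] for i in shifted if 0 <= i < len(arousal)]

    return {
        lag: roc_auc(values(event_epochs, lag), values(comparison_epochs, lag))
        for lag in lags
    }
```

An AUC of 0.5 at a given lag indicates that arousal at that lag does not discriminate vocalisation epochs from comparison epochs; values above 0.5 indicate higher arousal around vocalisations.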
Arousal stability. Arousal stability was measured by calculating the auto-correlation in infant and caregiver arousal, considered separately. The auto-correlation was calculated using the Matlab function nanautocorr.m, written by Fabio Oriani. Only the first lag term was reported as previous analyses have shown that autocorrelation data show a strong first order autoregressive tendency.
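As an illustration of a lag-1 autocorrelation that tolerates missing data, the following Python sketch correlates each epoch with the next one, using only pairs in which both values are present. This is not a port of nanautocorr.m, whose handling of missing values and normalisation may differ; the pairwise-complete Pearson approach and the function name `lag1_autocorr` are assumptions.

```python
import math


def lag1_autocorr(series):
    """Pearson correlation between x[t] and x[t+1], skipping NaN pairs."""
    pairs = [
        (a, b)
        for a, b in zip(series, series[1:])
        if not (math.isnan(a) or math.isnan(b))
    ]
    if len(pairs) < 3:
        return math.nan  # too few complete pairs to estimate a correlation
    xs, ys = zip(*pairs)
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0 or syy == 0:
        return math.nan  # constant series: correlation undefined
    return sxy / math.sqrt(sxx * syy)
```

Higher values indicate that arousal at one epoch strongly predicts arousal at the next, that is, a more stable arousal state.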
Arousal coupling. Arousal coupling was measured by calculating the zero-lag cross-correlation between infant and caregiver arousal. The cross-correlation was calculated by first applying a linear detrend to each measure independently and then calculating the Spearman’s correlation between the infant and caregiver arousal data within that window.
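The two steps described above (linear detrend, then Spearman's correlation) can be sketched in Python as follows. This is an illustrative reimplementation under assumptions, not the authors' Matlab code: it assumes complete data within the window, uses average ranks for ties, and all function names are hypothetical.

```python
import math


def linear_detrend(values):
    """Subtract the least-squares straight line from a series."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    stt = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values)) / stt
    return [v - (mean_v + slope * (t - mean_t)) for t, v in enumerate(values)]


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(xs, ys):
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def arousal_coupling(infant, caregiver):
    """Zero-lag Spearman correlation of the two detrended series.

    ASSUMPTION: both series cover the same window and contain no NaNs.
    """
    if len(infant) != len(caregiver) or len(infant) < 3:
        raise ValueError("series must have equal length of at least 3")
    return pearson(
        average_ranks(linear_detrend(infant)),
        average_ranks(linear_detrend(caregiver)),
    )
```

Detrending each series first prevents a shared slow drift within the window from being counted as coupling; the rank-based correlation then reduces the influence of outlying epochs.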
Moving window analyses. To estimate how stability and coupling changed relative to vocalisations, we used a moving window analysis (see Figure 8). Arousal data were downsampled to 1-minute epochs (approximately 0.017 Hz), which was the sampling frequency of our microphone data. The size of the moving window was set arbitrarily at 10 epochs, with a shift of 5 epochs between windows. We excerpted the stability and coupling values around each individual vocalisation, and averaged these across all vocalisations.
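The windowing and event-locked averaging can be sketched in Python as follows, using the 10-epoch window and 5-epoch shift stated above. The sketch is illustrative and not the authors' code. In particular, the rule used to align a vocalisation with a window (the window whose start is the nearest multiple of the shift at or before the vocalisation epoch) is an assumption, as are the function names.

```python
import math


def window_starts(n_epochs, size=10, step=5):
    """Start indices of all complete windows."""
    return list(range(0, n_epochs - size + 1, step))


def windowed_stat(series, stat, size=10, step=5):
    """Apply stat (e.g. a stability or coupling function) to each window."""
    return [stat(series[s:s + size]) for s in window_starts(len(series), size, step)]


def event_locked_average(window_values, event_epochs, offsets, step=5):
    """Average windowed values at window offsets relative to each event.

    ASSUMPTION: an event at epoch e is anchored to window index e // step.
    Offsets are in units of windows (e.g. -2 .. +2); windows outside the
    recording, or with NaN values, are skipped.
    """
    sums = {k: 0.0 for k in offsets}
    counts = {k: 0 for k in offsets}
    for e in event_epochs:
        anchor = e // step
        for k in offsets:
            i = anchor + k
            if 0 <= i < len(window_values) and not math.isnan(window_values[i]):
                sums[k] += window_values[i]
                counts[k] += 1
    return {k: (sums[k] / counts[k] if counts[k] else math.nan) for k in offsets}
```

Because consecutive windows overlap by half their length, values at adjacent offsets are not independent, which is one reason a cluster-based test is a natural choice for assessing change over time.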
Control analysis. Participant by participant, for each vocalisation that was observed, a random ‘non-vocalisation’ moment was selected as a moment during the day when the dyad was at home and the infant was awake but no vocalisation occurred. The same moving window analysis described above was then repeated to examine change relative to this ‘non-vocalisation event’. The same procedure was repeated 1000 times and the results averaged. Observed and control data were compared using the permutation-based temporal clustering analyses described above.
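A minimal Python sketch of this resampling procedure for a single participant is given below. It is illustrative only: sampling control epochs with replacement, the representation of the home/awake mask as a list of booleans, and the name `averaged_control` are assumptions, and `analysis` stands in for whichever event-locked analysis is being repeated.

```python
import random


def averaged_control(analysis, eligible, vocal_epochs, n_repeats=1000, seed=0):
    """Average an event-locked analysis over random non-vocalisation epochs.

    analysis:     function mapping a list of epochs to a list of values
    eligible:     per-epoch booleans (True = at home and infant awake)
    vocal_epochs: epochs at which a vocalisation was observed
    ASSUMPTION: control epochs are drawn with replacement, one per
    observed vocalisation, from eligible epochs without a vocalisation.
    """
    if n_repeats < 1:
        raise ValueError("n_repeats must be at least 1")
    rng = random.Random(seed)
    vocal = set(vocal_epochs)
    pool = [e for e, ok in enumerate(eligible) if ok and e not in vocal]
    if not pool:
        raise ValueError("no eligible non-vocalisation epochs")
    total = None
    for _ in range(n_repeats):
        control = [rng.choice(pool) for _ in vocal_epochs]
        result = analysis(control)
        if total is None:
            total = list(result)
        else:
            total = [a + b for a, b in zip(total, result)]
    return [t / n_repeats for t in total]
```

Matching the number of control epochs to the number of observed vocalisations for each participant keeps the amount of data entering the observed and control averages comparable.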
Appendices available here: https://doi.org/10.31234/osf.io/gmfk7
Usage notes
These data files and associated processing scripts are designed to be run in Matlab R2022a. Only the Statistics and Machine Learning Toolbox is required. Details on open-source alternatives to Matlab are given here: https://opensource.com/alternatives/matlab.