Dryad

Domestic dogs (Canis familiaris) recognise meaningful content in monotonous streams of read speech

Cite this dataset

Root-Gutteridge, Holly; Korzeniowska, Anna; Ratcliffe, Victoria; Reby, David (2024). Domestic dogs (Canis familiaris) recognise meaningful content in monotonous streams of read speech [Dataset]. Dryad. https://doi.org/10.5061/dryad.stqjq2c1s

Abstract

Domestic dogs (Canis familiaris) can recognize basic phonemic information from human speech and respond to commands. Commands are typically presented in isolation with exaggerated prosody known as dog-directed speech (DDS) register. Here, we investigate whether dogs can spontaneously identify meaningful phonemic content in a stream of putatively irrelevant speech spoken in monotonous prosody, without congruent prosodic cues.

To test this ability, dogs were played recordings of their owners reading a meaningless text in which we inserted a short meaningful or meaningless phrase, either read with unchanged reading prosody or with an exaggerated DDS prosody. We measured the occurrence and duration of dogs’ gaze at their owners.

We found that, while dogs were more likely to detect and respond to inserts that contained meaningful phrases spoken with DDS prosody, they were still able to detect these meaningful inserts spoken in a neutral reading prosody. Dogs detected and responded to meaningless control phrases in DDS as frequently as to meaningful content in neutral reading prosody, but less often than to meaningful content in DDS.

This suggests that, while DDS prosody facilitates the detection of meaningful content in human speech by capturing dogs’ attention, dogs are nevertheless capable of spontaneously recognizing meaningful phonemic content within an unexaggerated stream of speech.

README: Domestic dogs (Canis familiaris) recognise meaningful content in monotonous streams of read speech

https://doi.org/10.5061/dryad.stqjq2c1s

Holly Root-Gutteridge, University of Sussex, hollyrg@googlemail.com

Data were collected between 2018 and 2022 in the HAVOC lab at the University of Sussex Falmer campus. Data were analysed using Praat speech analysis software and Sportscode Gamebreaker video analysis software.

Keywords: dog-directed speech, heterospecific communication, speech recognition, human-animal communication, Canis familiaris, speech prosody

Rainbow_Passage_Texts.docx documents the excerpts from the standard psychology text “the rainbow passage” used for the speech recordings played to the dogs. From Fairbanks, G. (1960). Voice and articulation drillbook, 2nd edn. New York: Harper & Row. pp. 124-139.

ESM Tables 3-8.docx contains the full results from the generalised linear mixed model analyses of the data performed in SPSS. Column headings for most tables: Fixed effect – fixed effect of the model; F – F-value of the fixed effect; df1 – degrees of freedom 1; df2 – degrees of freedom 2; p-value – p-value of the fixed effect.

Dog_Demographics_including_names_2024.xlsx documents the demographic details for each participant dog as stated by their owner. Study – lists participation in the 4 studies; Age – age in years; Dog_ID – dog’s name; Breed – dog breed as stated by owner; Sex – dog’s sex; Size – size group, with 1 being smallest and 3 largest; Owned by – the sex of the owner who brought the dog in: male, female, or a male-female couple.

Reactions_of_dogs_to_voices_Final.xlsx documents the duration of attention and binary attention to each trial. These metrics were calculated in Excel from the output of Sportscode Gamebreaker video analysis from the onset of attention to target phrase to end of attention.

Final_Data_-_Meaningful_Words_All_results..sav documents the same data in SPSS .sav format.

Column headings: Study – lists participation in the 4 studies; Age – age in years; Dog_ID – dog’s name; Breed – dog breed as stated by owner; Sex – dog’s sex; Size – size group, with 1 being smallest and 3 largest; Trial – number of trial; Target_Phrase – combination of voice and phrase as a single code; Voice – speech register used; Phrase – meaningful or control phrase; Position – whether the dog was facing the window or door; Owner_sex – self-identified gender of owner; BlindDuration – duration of response as calculated in Sportscode Gamebreaker; ResponseBinary – binary response coded as 0 (no response) or 1 (response); ResponseBinaryS – binary response as Yes or No.

Speech registers: NRP – neutral reading prosody. DDS – dog-directed speech.
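Using the column headings above, per-condition response rates can be tabulated directly from the trial data. The sketch below is purely illustrative (the function name and the assumption that the .xlsx sheet has been exported to CSV are ours, not part of the deposited workflow, which used SPSS):

```python
import csv
from collections import defaultdict

def response_rates(rows):
    """Proportion of trials with ResponseBinary == 1 per Voice x Phrase cell.

    `rows` is an iterable of dicts with at least the keys 'Voice',
    'Phrase', and 'ResponseBinary' (0/1), matching the column headings
    described above.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        key = (row["Voice"], row["Phrase"])
        totals[key] += 1
        hits[key] += int(row["ResponseBinary"])
    return {key: hits[key] / totals[key] for key in totals}

# Hypothetical usage, assuming a CSV export of the deposited sheet:
# with open("Reactions_of_dogs_to_voices_Final.csv", newline="") as f:
#     print(response_rates(csv.DictReader(f)))
```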

Voice_reports_control_vs_meaningful_phrases.xlsx contains the measurements for each of the voices comparing control and target phrases for each trial, measured in Praat. The standard human speech settings were used in Praat, and metrics were calculated using the Voice Report function for the selected target phrases and exported manually into Excel.

Column headings: Dog_ID – dog’s name; Breed – dog breed as stated by owner; Trial – type of voice from trial; Insert time – onset time of the phrase in the speech; Duration – duration of the target phrase (s); Median – median value of fundamental frequency (F0); Mean – mean value of F0; StDev – standard deviation of F0; Min – minimum value of F0 (Hz); Max – maximum value of F0; Voice breaks – number of voice breaks; Range – range of F0; CoV – coefficient of variation of F0; Owner_sex – sex of owner speaking on recording; Owner_Dog – combination of the owner’s sex and dog’s name to identify the files where dogs heard 2 owners.

Voice_reports_for_10_speakers.xlsx contains the measurements for each of the voices comparing the inserted phrases with a standard portion of the surrounding speech for each trial, for 10 randomly selected speakers, measured in Praat. This was done using the Voice Report function in Praat with standard settings and manually exported to Excel.

Column headings: Sex – speaker’s sex; Speaker ID – randomly assigned ID number; File – corresponding voice recording file; Meanf0 – mean fundamental frequency (f0) of spoken phrase; f0StdDev – standard deviation of f0; minf0 – minimum f0; Maxf0 – maximum f0; f0 range – range of f0; CofV – coefficient of variation of f0; Trial – phrase presented (neutral reading prosody control or meaningful content); Phrase_type – whether the value is for the target phrase or the surrounding speech.

Data were generated as follows:

Stimuli

Seventy owners were recorded reading aloud one of three short (15-20 second) passages from the standard psychology text “the rainbow passage” (Fairbanks, 1960), with the test phrases produced after 7-12 seconds as part of the text.

Voice recordings were made on a Zoom H4N-Pro handheld recorder (Zoom) in a sound-proof booth on campus at University of Sussex. To avoid habituating the dogs to the speech, owners were recorded reading the passages without the dog present in the recording booth and were asked to imagine they were speaking to the dog. Owners were asked to produce the target phrases in a) their normal reading voice prosody (NRP) and b) dog-directed speech prosody (DDS).

All the voice recordings were clipped and aligned using the sound software Audacity (Mazzoni and Dannenberg, 2015), and the amplitude was normalized to -9 dB.
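The -9 dB normalization amounts to scaling each waveform so that its peak sits at -9 dBFS. Audacity performs this internally; the sketch below is only an illustration of the arithmetic (the function name and float-sample representation are our assumptions):

```python
import math

def normalize_peak(samples, target_dbfs=-9.0):
    """Scale a waveform (floats in [-1, 1]) so its peak equals target_dbfs.

    The required gain in dB is target_dbfs - 20*log10(current peak);
    here it is applied as a linear factor.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = 10 ** (target_dbfs / 20) / peak  # linear gain factor
    return [s * gain for s in samples]
```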

Participants

Fifty-three privately owned dogs were recruited through Facebook adverts, flyers, and personal contacts, and tested in a designated testing room on campus at the University of Sussex. Data from 49 dogs were retained. The dogs were presented with 2-8 read-speech samples from their owners and were video recorded. Their behavioural reactions were coded in Sportscode Gamebreaker and exported to Excel.

Sharing/Access information

The data were generated for this study.

Methods

Stimuli

Seventy owners were recorded reading aloud one of three short (15-20 second) passages from the standard psychology text “the rainbow passage” (Fairbanks, 1960), with the test phrases produced after 7-12 seconds as part of the text. The non-meaningful (control) phrases were “[Alfie / Bertie], pass me a coffee!” and the meaningful phrase was “[Dog’s name], come on then!”, chosen because these words had the highest frequency of use by English-speaking owners during interactions with their dogs and were therefore likely to be meaningful to all dogs (Mitchell and Edmonson, 1999). The duration of the target phrases was between 0.7 s and 2.5 s (mean = 1.4 s, std. dev. = 0.2 s), depending on the speaker’s natural talking speed and the dog’s name (e.g., “Badger” takes longer to say than “Max”). In total, three different extracts of the same length were used, and the phrases were included within the sentences, i.e., “There is, according to legend, a boiling pot of gold at one end. People look, but no one ever finds it. When a man looks for something beyond his reach, his friends say, [Bertie, pass me a coffee] / [Dog’s name, come on then], he is looking for the pot of gold at the end of the rainbow. Throughout the centuries people have explained the rainbow in various ways.” (See ESM for extracts 2 and 3.) The time it took the owners to reach the inserted phrase depended on the speed of their natural speech (mean = 8.7 s, std. dev. = 1.2 s) but was consistent across readings by the same individual.

The choice of extract was randomised, but if the dog had a name too similar to Alfie or Bertie, the other name was chosen as the control (e.g., the participant dogs Betty and Beans heard Alfie, not Bertie, in their control phrase). For each dog, the same extract was used for all conditions. Voice recordings were made on a Zoom H4N-Pro handheld recorder (Zoom) in a sound-proof booth on campus at the University of Sussex. To avoid habituating the dogs to the speech, owners were recorded reading the passages without the dog present in the recording booth and were asked to imagine they were speaking to the dog. Owners were asked to produce the target phrases in a) their normal reading voice prosody (NRP) and b) dog-directed speech prosody (DDS). There was an expectation that the DDS speech would show increased pitch and range compared to NRP and that this would be more interesting to the dogs (Lesch et al., 2019). Thus, two recordings were made for the Pilot Study: DDS-meaningful and DDS-control; four recordings were created by each owner for the main studies: NRP-meaningful, NRP-control, DDS-meaningful, and DDS-control.

All the voice recordings were clipped and aligned using the sound software Audacity (Mazzoni and Dannenberg, 2015), and the amplitude was normalized to -9 dB. Mean and coefficient of variation of fundamental frequency (foCV = (fo standard deviation / fo mean) * 100) were measured in Praat (Boersma and Weenink, 2009). foCV provides a standardised measure of fo variability independent of fo height that takes perception into account (i.e., a modulation of 10 Hz around 100 Hz is perceptually equivalent to a modulation of 100 Hz around 1,000 Hz). Values are presented in Table 2. Within sexes, mean fo differed significantly between target phrases (female: F3,108 = 68.3, p < 0.001; male: F3,43 = 43.0, p < 0.001). However, mean fo did not differ significantly for control vs meaningful phrases within the DDS and NRP registers (LMM: p > 0.05 for all), but DDS and NRP phrases did differ significantly from each other (p < 0.001 for all). For male owners, the coefficient of variation did not differ in any comparison (p > 0.25 for all pairwise comparisons; F3,43 = 0.6, p = 0.592 overall). For female owners, the coefficient of variation did differ significantly overall (F3,127 = 5.1, p = 0.002), but only for DDS-meaningful compared to all others (p < 0.010 for all DDS-meaningful pairwise comparisons); all other pairwise comparisons were non-significant at p > 0.2. Thus, DDS speech differed from NRP speech, but phrases within speech registers were not significantly different, except for the female coefficient of variation.
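The foCV formula above can be written out directly. A minimal sketch (the function name is ours; the deposited values came from Praat's Voice Report output, and whether that uses the sample or population standard deviation is not specified here, so the sample form is an assumption):

```python
import statistics

def fo_cv(f0_values_hz):
    """Coefficient of variation of fo: (fo standard deviation / fo mean) * 100.

    Uses the sample standard deviation; an assumption for illustration.
    """
    return statistics.stdev(f0_values_hz) / statistics.mean(f0_values_hz) * 100
```

Because the numerator and denominator scale together, `fo_cv([90, 100, 110])` equals `fo_cv([900, 1000, 1100])`, which is the sense in which foCV is independent of fo height.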

Linear mixed models were applied to a subsample of a) 5 female and b) 5 male voices and were used to confirm that the mean fundamental frequency and coefficient of variation of the inserted phrases presented in NRP did not differ significantly from that of the rest of the read speech. (Mean fundamental frequency LMMs: female - F1,14 = 2.9, p = 0.108; male – F1,14 = 0.6, p = 0.449. Coefficient of variation LMMs: female - F1,14 = 2.4, p = 0.141; male – F1,14 = 2.9, p = 0.111.)

Table 1 Mean and standard deviation of mean and coefficient of variation of fundamental frequency for a) the target phrases produced by all speakers and b) the target phrase and entire speech of 10 speakers.

Participants

Fifty-three privately owned dogs were recruited through Facebook adverts, flyers, and personal contacts, and tested in a designated testing room on campus at the University of Sussex. A total of 57 owners (17 male, 40 female) participated, with a maximum of 3 dogs per owner. Trials were discarded if the dog was distracted by non-stimulus sounds or events, e.g., background noise (n = 1), the dog was barking continuously (n = 1), or if the dog moved out of camera shot (n = 2). We retained data from 49 dogs (24 females and 25 males), from 39 breeds and cross-breeds, aged between 9 months and 12 years old (mean = 4.1 years, SD = 2.9 years), in our analyses (see ESM Table 1 for details, following the format suggested by Volsche et al. (2023)).

Protocol

Dogs were introduced to the room and given up to 20 minutes to freely explore and habituate to the space. Once they were considered to be relaxed, the trials began. No dogs appeared to be stressed either before or during the trials.

During all trials, the owners wore noise-cancelling headphones (TaoTronics) and listened to music while seated in a chair at 90 degrees to the dog (Figure 1), with their back to the dog, and were instructed not to turn to look at the dog. A single Behringer Europort MPA40BT-PRO speaker was set on a tripod behind the owner’s head and set to conversational volume (approx. 65 dB measured at the dog’s position). The experimenter stood out of the dog’s sight line and played the stimuli from an Apple MacBook Pro. The dogs were held on a loose lead by the handler and allowed some freedom of movement. While the handler was consistently one of two researchers, their familiarity to the dog varied from “completely unfamiliar” to “a person the dog had met on more than one occasion but did not have a close relationship with”, if the dog had participated in a previous study or belonged to a friend of the researchers.

The dogs were positioned either to the left or the right of the speaker, and this position was cross-balanced across dogs within studies, with half to the left and half to the right. The dogs’ reactions were filmed on a Sony FDR-AX100 camcorder (Sony) on a tripod positioned approximately 1.5-2m from the dogs’ starting position. Trial interval depended on the dogs’ disposition. If the dog was calm, trial interval was less than 2 minutes, but if the dog was restless or distracted, a short break of a few minutes was provided, and the dog was sometimes taken out of the room and returned.

As some owners brought more than one dog and some dogs heard more than one owner, we considered each pairing of owner and dog to be a unique dyad, and thus the unit of comparison was dyad not owner or dog.

Figure 1 Experimental set-up in testing room at University of Sussex with the speaker positioned to the dog’s left. In half of the trials, this arrangement was reversed with the speaker positioned to the dog’s right. The owner was seated facing away from the dog wearing headphones and listening to music while the dog was positioned behind their chair and held on a loose lead by a handler. The speaker was positioned behind the owner’s head to simulate them speaking.

Whether the dogs gazed at their owner or not in the 10 s period following the inserted phrase was used as the broadest metric of attention, while duration of gaze was used as the index of attention. None of the dogs were looking at or fully oriented towards their owner immediately prior to the onset of the target phrase; had any been, that trial would have been dropped. The trial ended 10 s after the onset of the inserted phrase.
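The two attention metrics reduce to a simple computation over coded gaze intervals. This is a hypothetical sketch of that reduction (the interval representation and function name are assumptions; the actual coding was done in Sportscode Gamebreaker):

```python
def gaze_metrics(gaze_intervals, window_s=10.0):
    """Binary gaze response and total gaze duration within the trial window.

    `gaze_intervals` are (start, end) times in seconds relative to the
    onset of the inserted phrase. Gaze time is clipped to [0, window_s],
    so duration is capped at the 10 s window described above.
    """
    duration = 0.0
    for start, end in gaze_intervals:
        clipped = min(end, window_s) - max(start, 0.0)
        if clipped > 0:
            duration += clipped
    return {"ResponseBinary": 1 if duration > 0 else 0,
            "BlindDuration": duration}
```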

Pilot Study: The effect of meaning on dogs’ responses to content presented in dog-directed speech (DDS) prosody

The pilot study was designed to test whether dogs responded differently to inserts containing meaningful phrases vs. meaningless control phrases, in both cases spoken with dog-directed speech (DDS) prosody. If they did not respond to the DDS presentation of speech, it was felt to be unlikely that they would do so to NRP speech, and that a new protocol would be required. Twenty-two dogs were tested, and 40 trials from 20 dogs were retained, with 2 dogs removed because they moved out of camera view during the stimulus. All owners included in this study were female. Each dog was presented with a recording of their female owner reading the text twice, once inserting the meaningful phrase and once inserting the control phrase. The order of presentation of the meaningful and control phrase recordings was cross-balanced across dogs.

Study 1 prosody: The impacts of prosody and content on response

To better explore the effects of prosody and content, the pilot protocol was repeated with a total of 43 owner-dog dyads and all four speech conditions, adding NRP-meaningful and NRP-control to the DDS versions. The dogs heard all four speech conditions in pseudo-randomised presentation, cross-balanced across dogs. A total of 172 trials were retained (13 dogs heard 8 trials, with 4 trials from their male owner and 4 trials from their female owner, but one of these dogs moved out of shot).

Study 2 gender: The effects of gender on dogs’ responses to content and prosody

During initial data collection, it was noted that some of the dogs appeared to be more responsive to the male owner’s NRP speech than their female owner’s NRP speech. Therefore, we decided to explore the potential effects of speaker gender on their responses. Thus, we tested whether dogs hearing both their male and female owners would respond differently to them across all four conditions of meaning and prosody, with an expectation that NRP from male owners could elicit more or stronger responses than female NRP due to the smaller differences between male NRP and DDS.

Each of the 13 dogs heard a total of 8 trials, 4 from each owner. To avoid a possible effect of learning on responses to the target phrases, as the same text passage was used throughout, the NRP trials were always played first for each owner, with control and meaningful phrase presentation cross-balanced within the DDS conditions. Both owners were present in the room, but the non-participant (e.g., the male owner while the female owner was “talking” to the dog) was kept out of view to prevent any “Clever Hans” effect influencing the results.

One dog was removed from the dataset because he moved out of camera shot while reacting to his owners’ voices. One dog (Emma, terrier) had been previously tested in the pilot study with a gap of several months between tests, but all other dogs experienced this as a novel presentation and it was expected that Emma would not retain her memories of the pilot study or be primed by them. Thus, 96 trials were retained from 12 dogs in total, with each dog hearing a total of 8 trials, including all four speech presentations from both their male and female owners.

All eight trials were performed on the same day, and between-trial intervals varied from a few minutes to more than 20 minutes depending on the behaviour of the dog, e.g., engagement in other activities like sniffing or investigating the area. We counterbalanced the presentation of male and female owners’ speech, but each dog heard all four trials from each owner as an undivided block (e.g., male owner trials x 4 then female owner trials x 4, but not male owner x 2 then female x 2, etc.). The dogs heard the same order of presentation for both male and female owners (e.g., either order 1 or order 2) to avoid order effects on their responsiveness.
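The blocked counterbalancing described above can be expressed as a small scheduling rule. This sketch is illustrative only (the function name, the alternation-by-dog rule, and the exact contents of the two within-owner orders are assumptions; the protocol specifies only that NRP trials came first and that each owner's block was undivided):

```python
# Two fixed within-owner trial orders; NRP always first, per the protocol.
# The particular sequences shown are assumed for illustration.
ORDER_1 = ["NRP-control", "NRP-meaningful", "DDS-control", "DDS-meaningful"]
ORDER_2 = ["NRP-meaningful", "NRP-control", "DDS-meaningful", "DDS-control"]

def trial_schedule(dog_index, within_order):
    """Eight-trial schedule for one dog: each owner's four trials kept as
    an undivided block, owner order alternating across dogs, and the same
    within-owner order reused for both owners."""
    owners = ["male", "female"] if dog_index % 2 == 0 else ["female", "male"]
    return [f"{owner}:{trial}" for owner in owners for trial in within_order]
```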

Behavioural analysis

Prior to analysis, the videos of the trials were edited in iMovie (Apple Inc.) so that each file presented a single trial with a sound effect replacing the target phrase. All videos were blind coded in Sportscode Gamebreaker 11 (Sportstec, Warriewood, NSW, Australia) by HRG and 25% were second-coded by ATK. Response was defined as the dog directing its gaze towards the owner. The binary gaze response following presentation of the target phrase and duration of response were recorded for each trial. The duration of reaction was capped at 10 seconds, the maximum length of speech after the target phrase was produced. Inter-observer rating agreement was measured for binary gaze and duration using Cronbach’s alpha. 
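For two coders rating the same trials, Cronbach's alpha reduces to a short formula over the paired ratings. A minimal sketch of that formula (the function name is ours, and the agreement statistics were actually computed from the exported Sportscode data, not with this code):

```python
import statistics

def cronbach_alpha(ratings_by_coder):
    """Cronbach's alpha: k/(k-1) * (1 - sum of per-coder variances / variance of totals).

    `ratings_by_coder` is a list of k equal-length rating lists (here
    k = 2 coders), one value per trial. Uses sample variances.
    """
    k = len(ratings_by_coder)
    item_vars = sum(statistics.variance(r) for r in ratings_by_coder)
    totals = [sum(vals) for vals in zip(*ratings_by_coder)]
    return k / (k - 1) * (1 - item_vars / statistics.variance(totals))
```

When the two coders agree on every trial, the variance of the totals is four times each coder's variance, so alpha comes out as exactly 1.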

Usage notes

There are no missing values.

Funding

Biotechnology and Biological Sciences Research Council, Award: BB/P00170X/1