Influence of behavioural contingency on developmental song learning in young zebra finches (Taeniopygia guttata) tutored by a robot-bird
Data files
Jan 15, 2026 version files: 40.08 KB
- masterfile_ProcRoySoc.xlsx (39.26 KB)
- README.md (818 B)
Abstract
In humans and other animals, social robots can serve as effective tutors for learning new skills. Young oscines learn their song by imitating conspecific adults. In a previous study, we demonstrated that a robotic bird can be as effective as a live tutor in training a young zebra finch (Taeniopygia guttata) to imitate a song model. Here we take this further by investigating the role of behavioural contingency in developmental song learning and in shaping the birds’ engagement with the robot. Two groups of young male finches were exposed to a robotic tutor under contingent (CON) or non-contingent (NCON) conditions. In the CON-group, the robot produced a call in response to a call emitted by the bird. When the bird perched nearby, the robot oriented toward it and broadcast a song. While song imitation was slightly better in the CON-group, the difference was not statistically significant. However, birds in the CON group spent more time near the robot and interacted with it more frequently compared to NCON-birds. These findings highlight the importance of behavioural contingency in social robotics and offer novel insights into the use of robotic agents in studies with non-human animals.
Dataset DOI: 10.5061/dryad.sxksn03f4
Description of the data and file structure
File: masterfile_ProcRoySoc.xlsx
Description: Two sheets in the Excel file: data (variables) and metadata (description of variables)
Code/software
Chopin_toolbox version 1.3 (available at https://github.com/Stereo-Boy/chopin_toolbox)
Access information
Other publicly accessible locations of the data:
- Chopin_toolbox version 1.3 (available at https://github.com/Stereo-Boy/chopin_toolbox)
Data was derived from the following sources:
- none
2. Materials and methods
2.1. Subjects
We used 24 male zebra finches that hatched in the animal facilities of the University Paris Nanterre (France). After hatching, young zebra finches spent 14 days with their parents and siblings in housing cages (LD 14:10, T = 20–23 °C). At 14 dph (days post hatch), mothers and offspring were transferred into an isolation room to avoid any exposure to adult male song. At 35 dph, young males were isolated in individual cages (46x23x27 cm) in sound-proof chambers (85x65x60 cm). Sound-proof chambers were equipped with fans providing a low airflow and OSRAM DULUX lights on an automatic 14:10 light:dark schedule. Each cage contained three perches, and we placed a round mirror (diameter: 10 cm) above one of them to reduce the impact of social isolation. Birds had ad libitum access to water, food, sand and cuttlebones. Once a week this diet was supplemented with fresh vegetables. All procedures followed the European regulations on animal experimentation and were approved by the Darwin Ethics Committee of the French Ministry for National Education, Higher Education and Research (authorization #206412019051415231534).
2.2. Experimental procedure
After 2 days of habituation to the sound-proof chamber, each young finch was transferred to a cage containing MANDABOT for 1 h a day, 5 days a week, over 5 consecutive weeks. After each exposure, the young finch was returned to his housing cage. On day 1 of exposure, a grid placed in the middle of the cage separated the robot from the pupil. Based on the results of other studies [i], we had used this procedure in a previous experiment with a live tutor23 to prevent possible agonistic behaviours, and we used it again with the robot to standardize our song-tutoring procedures. On day 2, the grid was left in place only for the first 30 minutes of the session; we then removed it to permit physical interactions between the young finch and the robot for the following 30 minutes. From day 3 to the end of the experiment, the pupil and the robot could physically interact for the whole hour. The experiment ended at 100 dph, when song is usually crystallized. All birds were then transferred to aviaries with conspecifics.
2.2.1. MANDABOT, the robotic finch
For a detailed description of MANDABOT, see ref. 23. Briefly, the robot was fixed on a grey plastic box (16x8.5x8 cm) and had the size, shape and plumage colours of an adult male zebra finch. It was controlled via a custom program running on a BeagleBone Black Wireless connected to a Dell Latitude 5300 computer running Microsoft Windows 10. It was equipped with a speaker (AS01808MR from PUI Audio) to broadcast pre-recorded zebra finch songs and calls collected in a previous experiment during exposure of young finches to a live tutor23 (see below). The robot was placed in a cage of similar size to the one used for group housing, which contained two perches, with food and water.
2.2.2. Acoustic stimuli
The vocal repertoire of a zebra finch is composed of 10 calls produced by both sexes in different contexts (contact, parent–young communication, alarm and aggression) and song produced only by males[ii],[iii]. The zebra finch song is hierarchically organized: songs are produced in bouts (5–30 s), each bout usually starting with short sounds called introductory notes followed by one or several motifs, a motif being a stereotyped sequence of song syllables separated by silent intervals (0.5–1.5 s). Each syllable is composed of one or several notes or elements, each element representing a different vocal gesture. Successive motifs in a bout can sometimes be separated by call-like syllables called connectors[iv].
Calls and songs from adult males (n = 12) exposed to a young finch were used to create the sound stimuli. We selected vocalizations recorded in this context because a previous study reported that adult finches change some features of their song in the presence of a juvenile male24. For each of the 12 adult tutors, we selected 20 songs and 50 contact calls ('tet' calls). Tet calls are short, soft calls used in close-range social contexts26. Using Avisoft SASLab Pro, we applied a high-pass filter at 420 Hz and a volume maximization of 90% to all vocalisations.
2.2.3. Experimental groups
Contingent group: in the contingent group (CON), the robotic zebra finch produced sounds (songs and calls) in response to the pupil's behaviour. On each side of the robot there was a perch (5 cm) connected to load cells (CZL639HD from Phidgets). When the pupil landed on a perch, the robot turned in its direction and randomly moved its head or broadcast one of the 20 songs from its assigned tutor's repertoire. When the pupil left the perch, the body of the robot turned back to the centre, in a neutral position. To prevent excessive exposure to tutor song, which is known to impair song imitation [v], we implemented a set of scenarios in which the robot turned toward the juvenile with or without song broadcast. Specifically, 10 scenarios included song and 100 did not. At each perch landing, one of these scenarios was randomly selected, so that a landing always triggered an orienting movement but not necessarily a song playback. Moreover, to avoid repeated triggering while the bird remained on the perch, the robot could initiate a new scenario only once the bird had left and returned. We also developed a closed-loop program to mimic the vocal interactions between two finches. Such bird–machine vocal interactions have already been successfully applied in oscine songbirds, including zebra finches [vi],[vii]. Sounds from the experimental cage were continuously recorded using a Behringer C-2 microphone connected to a PreSonus AudioBox 1818VSL (16 bits, 44.1 kHz). When the system detected a vocal sound produced by the pupil (either a call or a song syllable), one pre-recorded call among the 50 tet calls from the database of the specific tutor was broadcast through the speaker. Song production by the subject did not elicit song playback; however, call production by the subject did trigger call playback. Given that songs include call-like elements, it is possible that song production by the pupil occasionally activated the robot's call playback.
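The contingent logic just described — random scenario selection at each perch landing, suppression of re-triggers while the bird stays perched, and call-triggered tet-call playback — can be sketched as follows. This is a minimal illustration, not the actual BeagleBone control program; all names (`ContingentRobot`, `SCENARIOS`, the action strings) are hypothetical.

```python
import random

# Hypothetical scenario pool: 110 scenarios, of which 10 include a song
# playback; every landing triggers an orienting movement either way.
SCENARIOS = ["orient+song"] * 10 + ["orient_only"] * 100

class ContingentRobot:
    def __init__(self, rng=None):
        self.rng = rng or random.Random()
        self.bird_on_perch = False

    def on_perch_landing(self):
        """Called when a load cell detects the pupil on a side perch.
        A new scenario starts only if the bird had left the perch first."""
        if self.bird_on_perch:           # ignore re-triggers while perched
            return None
        self.bird_on_perch = True
        return self.rng.choice(SCENARIOS)  # song in 10/110 of the cases

    def on_perch_leaving(self):
        """Bird left the perch: robot returns to neutral, re-arm trigger."""
        self.bird_on_perch = False

    def on_vocalisation_detected(self):
        """Any detected pupil vocalisation (call, or call-like song element)
        triggers a tet-call playback; songs never trigger song playback."""
        return "play_tet_call"
```

In the yoked NCON condition, the same architecture would simply replay a logged script of actions instead of reacting to the sensor events.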
During each exposure, the robot was programmed to perform random movements to make its behaviour appear more natural and to prevent the young birds from being startled when it moved or sang, as observed during pre-tests.
The closed-loop program ran for 60 minutes per session for each pupil. At the end of each loop, the computer generated a script containing the full sequence of movements and tutor-specific vocalisations produced by MANDABOT. For each pupil, 25 scripts were thus produced, corresponding to the 25 daily exposure sessions to the contingent robot. These scripts were then used in a yoked paradigm for the non-contingent group (NCON, see below).
For the first day of exposure, a 60 min closed-loop was launched. The grid prevented physical interactions but MANDABOT and the pupil could interact vocally via the vocal loop. To ensure that the pupil was exposed to a minimum quantity of robot’s songs even if he did not interact with the robot later during the experiment, the robot was programmed to broadcast 10 song bouts on day 1.
As explained previously, during the second day of exposure, the grid was removed after 30 min which allowed the pupil to interact with the robot for the last 30 min. The robot was programmed to broadcast one song bout during the first 30 min to ensure that the pupil heard the robot singing even if he did not land on perches later.
Except for day 1 (complete session) and the beginning of day 2, when a grid prevented physical interactions between the bird and the robot and the bird was passively exposed to a minimum amount of song, the number of songs heard by the pupil depended on the number of times it landed on the perches on either side of the robot.
Non-contingent group: following the same procedure, 12 young male finches were exposed to MANDABOT programmed in a non-contingent mode (NCON). Each pupil was exposed to the robot running the 25 daily scripts recorded in the closed-loop interaction context between the robot and a pupil in the contingent group (n = 12 sets of 25 daily scripts, each bird being exposed to one set). Therefore, each pupil was exposed to the same sequence of movements and sounds produced by the robot as in the contingent group, but there was no contingency between its own behaviour and that of the robot (yoked condition).
2.3. Behavioural analysis
2.3.1. Sound recording and analysis
During the whole experiment, sounds were recorded continuously using SAP 2011 (Sound Analysis Pro software [viii]). Each cage was equipped with a Behringer C-2 microphone connected to a PreSonus AudioBox 1818VSL (24 bits, 96 kHz) controlled by a Dell OptiPlex GX620 PC running Windows 7. During recording, SAP detects and saves individual songs into separate files, whilst mostly discarding isolated calls and cage noises.
Song segmentation: we selected sound files containing songs produced at 100 dph. Using Goldwave (v6.36), we applied a high pass filter at 420 Hz and a volume maximization of 90% to sound files. Then, songs were segmented into syllables using SAP 2011 (see below) until we selected at least 300 syllables for each bird. For each syllable, SAP 2011 extracted 14 different parameters: duration, amplitude, pitch, frequency modulation (FM), squared amplitude modulation (AM2), Wiener entropy, goodness of pitch, mean frequency and the variance in pitch, FM, entropy, pitch goodness, mean frequency and AM30,[ix] (see SAP 2011 user manual: http://soundanalysispro.com/manual-1).
For each bird, the table containing the results of this analysis was loaded into the SongSeq software[x] to classify the syllables into different clusters (syllable types). Once syllables were clustered into syllable types, the data were further processed using a custom-made program written in MATLAB (R2017a) to create one *.wav file per syllable and to sort the resulting sound files into folders, each folder containing all sounds belonging to the same cluster. We checked that the clustering was appropriate by visually inspecting spectrograms and, when necessary, listening to the sounds using Sound Explorer (René Jansen, University of Amsterdam). If needed, syllables were re-assigned to the appropriate cluster.
In order to analyse sequences of syllables within a bout, we first had to set a value for the maximal gap between successive bouts. Based on previous analyses, we set this value at 200 ms. Using a custom-made program written in MATLAB, we obtained for each song bout: 1) the total number of syllables; 2) the number of different syllable types; 3) the number of different transitions; 4) the bout duration; and three scores used to assess the sequential organization of the song bout, namely 5) song linearity, 6) song consistency and 7) song stereotypy [xi]. The linearity score represents how often syllables are sung in a specific order. In our study we calculated a slightly modified linearity score, as follows: linearity = (number of different syllables – 1) / number of different syllable-to-syllable transitions [xii],[xiii]. The consistency score represents how often specific variations of the song bout occur. In other words, it does not represent how syllables are ordered but how often a particular sequence of syllables is sung. It was calculated following the equation:
Consistency = ∑ [T(d)/T(a)]/N;
where T(d) is the count of the most frequent (dominant) transition for each syllable type, T(a) is the total number of transitions for that syllable type, and N is the total number of syllable types in the bout34,35. The overall stereotypy of the song is represented by the stereotypy score, which is the mean of the linearity and consistency scores:
Stereotypy = (linearity + consistency)/2.
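As a minimal sketch, the three scores above can be computed from a bout's sequence of syllable-type labels as follows. It assumes, as in the modified formula given above, that the linearity denominator counts distinct syllable-to-syllable transitions; the function name and input format are hypothetical.

```python
from collections import Counter

def song_scores(seq):
    """Compute (linearity, consistency, stereotypy) for one song bout,
    given a list of syllable-type labels, e.g. ['a', 'b', 'c', 'a', 'b', 'c']."""
    transitions = list(zip(seq, seq[1:]))
    types = set(seq)
    # Modified linearity: (number of syllable types - 1)
    #                     / number of distinct syllable-to-syllable transitions
    linearity = (len(types) - 1) / len(set(transitions))
    # Consistency: for each syllable type, dominant-transition count T(d)
    # over all outgoing transitions T(a); summed and divided by N types.
    ratios = []
    for s in types:
        outgoing = Counter(t for t in transitions if t[0] == s)
        if outgoing:
            ratios.append(max(outgoing.values()) / sum(outgoing.values()))
    consistency = sum(ratios) / len(types)
    stereotypy = (linearity + consistency) / 2
    return linearity, consistency, stereotypy
```

For the perfectly stereotyped bout `['a', 'b', 'c', 'a', 'b', 'c']` this gives a linearity of 2/3, a consistency of 1 and a stereotypy of 5/6.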
Similarity score: imitation of the tutor song was quantified using an automated procedure implemented in SAP31 that parametrically quantifies the similarity between songs. The similarity score is the percentage of the tutor's sound material included in the final similarity sections.
First, portions of songs were selected after visual inspection of the spectrograms, by cutting the sound at the onset and offset of the selected sequence using Sound Explorer (René Jansen, Amsterdam). Second, regions of high similarity between the segments of the pupil and the model songs were identified, and the results were aggregated into a global measure of acoustic similarity and sequence similarity. In asymmetric comparisons, the most similar sound elements of two sequences (tutor song and pupil song) are compared, independently of their position within the sequence31. The smallest unit of comparison is a 9.26 ms long sound interval (FFT window). Each interval is characterized by measures of five acoustic features: pitch, FM, amplitude modulation (AM), Wiener entropy, and pitch goodness. SAP calculates the Euclidean distance between all interval pairs from the two songs, over the course of the motif, and determines a p-value for each interval pair. This p-value is derived from the cumulative distribution of Euclidean distances across 250 000 sound interval pairs, obtained from 25 random pairs of zebra finch songs. Neighbouring intervals that pass the p-threshold (p = 0.05 in this study, the SAP default) form larger similarity segments (70 ms). Asymmetric similarity measurements judge how good the copy is with reference to the song model: the model is loaded as “sound 1” and the copies as “sound 2” in the batch module of the SAP software31. Therefore, in our study, song models (tutor songs) were loaded as “sound 1” and pupil songs as “sound 2”. In summary, the amount of sound from the tutor song that is included in the similarity segments represents the similarity score; it thus reflects how much of the tutor's song material is found in the pupil song31.
This procedure was repeated 100 times, comparing 10 different exemplars of the tutor song with 10 different exemplars of each pupil’s song. The mean value of these 100 comparisons was used for statistical analysis.
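The core of SAP's asymmetric comparison — checking, for each short tutor interval, whether some pupil interval is acoustically close enough — can be illustrated with a simplified sketch. SAP itself converts Euclidean distances to p-values against the reference distribution of 250 000 interval pairs; here a fixed distance threshold stands in for that calibrated p < 0.05 criterion, and the function name and feature format are hypothetical.

```python
import math

def interval_similarity(tutor_feats, pupil_feats, threshold):
    """Simplified sketch of a SAP-style asymmetric comparison.
    tutor_feats / pupil_feats: lists of 5-dim feature vectors
    (pitch, FM, AM, Wiener entropy, pitch goodness), one per 9.26 ms interval.
    A tutor interval counts as 'covered' if at least one pupil interval lies
    within `threshold` Euclidean distance (standing in for SAP's calibrated
    p < 0.05 criterion)."""
    covered = 0
    for t in tutor_feats:
        for p in pupil_feats:
            if math.dist(t, p) <= threshold:
                covered += 1
                break
    # Asymmetric score: % of the tutor's material matched in the pupil song
    return 100.0 * covered / len(tutor_feats)
```

Averaging such scores over the 10 × 10 tutor–pupil exemplar pairs gives the per-bird value used in the statistical analysis.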
2.3.2. Video recording and analysis
All sessions were video recorded using a Logitech C920 webcam connected to ContaCam 4.9.9 software on HP ProBook 650 G1 computer running on Windows 7.
Due to time constraints, only 6 of the 25 videos recorded for each pupil were analysed, using the software BORIS v.7.9.16. Days 1 and 2 of exposure were not analysed because a grid separated the pupil and the robot, preventing physical interactions for the whole session and for half of the session, respectively. We coded the videos of day 3 (the first time the bird could interact with the robot for 60 min), of the first day of each following week (days 6, 11, 16, 21), and of the last day of the experiment (day 25).
We created an ethogram to describe interactions between the pupil and the robot, focusing our analysis on behaviours oriented towards the robot. We used custom-coded keystrokes in BORIS to quantitatively score these behaviours from the videos (see supplementary table 1). Of the fifteen behaviours coded, five (platform, close vicinity, clumping, preening, sleeping) were analysed for their duration because they occurred continuously for long periods of time (on the order of seconds). Pecking at the different body parts of the robot was coded as occurrences, as each peck lasted less than one second.
Video analysis was performed by several experimenters. During training with selected videos, we measured inter-rater reliability (Cohen's kappa coefficient [xiv]) and checked that κ exceeded 0.9 before proceeding.
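Cohen's kappa corrects raw inter-rater agreement for the agreement expected by chance. A minimal sketch for two raters' categorical codings of the same events (the function name is hypothetical):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters' categorical codings of the same events.
    kappa = (p_observed - p_expected) / (1 - p_expected); undefined when the
    chance agreement p_expected equals 1."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: proportion of events coded identically
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal category frequencies
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)
```

κ = 1 indicates perfect agreement and κ = 0 agreement no better than chance; the 0.9 criterion used here is well above the conventional "almost perfect" band.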
2.4. Statistical analysis
Statistical analyses were conducted in RStudio v. 1.4.1103 and MATLAB R2017a. We checked the normality of data distributions using Kolmogorov–Smirnov tests. The significance level was p < 0.05. We performed nonparametric tests, Generalized Linear Models (GLM) or Generalized Linear Mixed-Effects models (GLME) with normal or non-normal distributions. All models included an intercept. We used the Akaike Information Criterion (AIC) when comparing GLMs and GLMEs, and the AICc (AIC corrected for small samples) when comparing GLMs only. The whole analysis pipeline (data, code and an html readout of the results with step-by-step interpretation) is available on OSF: https://osf.io/bwqd6/overview.
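For reference, the two selection criteria can be written out explicitly. A small sketch using the standard formulas, with k parameters, n observations and maximized log-likelihood ln L:

```python
def aic(log_lik, k):
    """Akaike Information Criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_lik

def aicc(log_lik, k, n):
    """Small-sample correction: AICc = AIC + 2k(k + 1) / (n - k - 1),
    recommended when n / k is small (here, 24 birds)."""
    return aic(log_lik, k) + 2 * k * (k + 1) / (n - k - 1)
```

The correction term shrinks toward zero as n grows, so AICc and AIC rank models identically for large samples but AICc penalizes extra parameters more heavily at this study's sample size.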
2.4.1. Song similarity: importance of the behavioural contingency
First, we applied a full model with a nonlinear structure. We expected song similarity to be strongly influenced by the number of songs to which each bird was exposed, following an inverted-U relationship, as reported in a previous study23. To account for this expected pattern, the statistical models included both linear and quadratic terms for the number of songs. We therefore used GLMEs with tutor as a potential random-effect variable (groups were paired by tutor), and contingency, a linear and a quadratic factor for the number of broadcast songs, along with their interactions, as candidate fixed-effect variables. We followed the general procedure described in the Chopin_toolbox version 1.3 (available at https://github.com/Stereo-Boy/chopin_toolbox) to isolate the best models. The data distribution did not differ from normal (Kolmogorov–Smirnov test, KS = 0.15, p = 0.6369), so the models were fitted assuming a normal distribution and an identity link. Given that there were 24 data points, we could explore 2–3 factors jointly. We compared models with different combinations of factors and interactions (n = 41 models). The residuals of the best models were asymmetric despite not deviating from normality. Since the random-effect factor was not part of the best models, we ran simpler GLMs, allowing us to check for outliers potentially explaining the asymmetry (n = 25 models).
We also checked whether birds that exhibited a strong interest in the robot also exhibited a higher similarity score. In particular, we looked at the time budget in the vicinity of the robot (time spent on the platform, close to the robot, or clumping, i.e. in physical contact with the robot, which typically indicates affiliative social behaviour25), at the number of calls directed toward the robot, and at the number of beak hits to different body parts of MANDABOT during the experiment. For the number of beak hits and the time spent in proximity of the robot, we averaged the available totals (measured on days 3, 6, 11, 16, 21 and 25). For the number of calls, we summed the daily totals. We started our analysis with a GLME with tutor as a random-effect factor and the other factors mentioned as fixed-effect candidates, and compared the models (n = 385 models). The residuals were again asymmetric and the tutor factor absent from the best model, so we instead ran GLMs, which allowed us to identify one bird as an outlier (Cook's distance > 1). Removing that outlier (and its pair) resulted in symmetric distributions of residuals.
2.4.2. Other acoustic parameters
Data distributions were not significantly different from normal (bout duration: KS = 0.14, p = 0.7211; number of syllables: KS = 0.07, p = 0.9982; song consistency: KS = 0.18, p = 0.3646; song linearity: KS = 0.14, p = 0.7154; song stereotypy: KS = 0.15, p = 0.5873). We compared GLMs to predict bout duration (n = 41 models), the number of syllables (n = 25 models), song consistency (n = 7 models), song linearity (n = 25 models) and song stereotypy (n = 25 models).
2.4.3. Time spent in the close proximity of MANDABOT
As durations on the platform encompassed durations in close vicinity, which in turn included durations in clumping, we preferred to use durations in a mutually exclusive way. Exclusive platform durations were obtained by subtracting close-vicinity durations from platform durations; exclusive close-vicinity durations were obtained by subtracting clumping durations from close-vicinity durations.
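Since the three categories are nested (platform ⊇ close vicinity ⊇ clumping), the conversion to mutually exclusive durations amounts to two subtractions. As a sketch (function and key names hypothetical):

```python
def exclusive_durations(platform, close, clumping):
    """Convert nested durations (platform >= close vicinity >= clumping)
    into mutually exclusive categories, all in the same time unit."""
    return {
        "platform_only": platform - close,   # on platform, not in close vicinity
        "close_only": close - clumping,      # close vicinity, not clumping
        "clumping": clumping,                # physical contact, kept as-is
    }
```

For example, 100 s on the platform of which 60 s in close vicinity and 20 s clumping yields 40 s platform-only, 40 s close-only and 20 s clumping.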
First, we assessed the non-parametric group difference using a Wilcoxon–Mann–Whitney test before applying a full model with a nonlinear structure. Normality was checked using Kolmogorov–Smirnov tests. Data did not differ from a normal distribution for time spent on the platform (CON: KS = 0.08, p = 0.6961; NCON: KS = 0.12, p = 0.2205) or for time spent in the close vicinity of MANDABOT (CON: KS = 0.09, p = 0.6109; NCON: KS = 0.09, p = 0.5804). Data were not normally distributed for clumping (CON: KS = 0.36, p < 0.001; NCON: KS = 0.44, p < 0.001). For the time spent on the platform, we compared GLMEs with tutor as a random-effect factor (n = 15 models), assuming normal distributions with a log link. For the time spent in close vicinity, we compared GLMEs with tutor as a random-effect factor (n = 15 models), assuming normal distributions with an identity link. For the time spent clumping, we compared GLMEs with tutor as a random-effect factor (n = 63 models), assuming Poisson distributions with a log link.
2.4.4. Number of beak hits to MANDABOT
First, we assessed the non-parametric group difference in the number of beak hits using a Wilcoxon–Mann–Whitney test before applying a full model with a nonlinear structure. Normality was checked using Kolmogorov–Smirnov tests; data were not normally distributed in the non-contingent group (contingent group: KS = 0.11, p = 0.3558; non-contingent group: KS = 0.22, p = 0.0017). We therefore assumed Poisson distributions, given that the data represent event counts, before running GLMEs. We compared GLMEs with tutor as a potential random-effect factor, and contingency and the day on which the number of beak hits was measured, along with their interactions, as candidate fixed-effect factors (n = 63 models).
2.4.5. Other changes in behavioral interactions with MANDABOT during the experiment
We also investigated whether two measures changed over time, capitalizing on data collected on different days (days 1 to 25): the number of calls and the number of times the bird landed on the perches near the robot. We assumed Poisson distributions, given that the data represent event counts. For each measure, we compared GLMEs with tutor as a potential random-effect factor and linear and quadratic components representing the day as candidate fixed-effect factors (n = 7 models).
[i] Chen Y, Matheson LE, Sakata JT. 2016. Mechanisms underlying the social enhancement of vocal learning in songbirds. PNAS 113, 6641-6646 (https://doi.org/10.1073/pnas.1522306113)
[ii] Zann RA. 1996. The Zebra Finch: A Synthesis of Field and Laboratory Studies. Oxford University Press.
[iii] Elie JE, Theunissen FE. 2016. The vocal repertoire of the domesticated zebra finch: a data-driven approach to decipher the information-bearing acoustic features of communication signals. Anim. Cog. 19, 285-315 (https://doi.org/10.1007/s10071-015-0933-6)
[iv] Bruno JH, Tchernichovski O. 2019. Regularities in zebra finch song beyond the repeated motif. Behav. Proc. 163, 53-59 (https://doi.org/10.1016/j.beproc.2017.11.001)
[v] Tchernichovski O, Lints T, Mitra PP, Nottebohm F. 1999. Vocal imitation in zebra finches is inversely related to model abundance. PNAS 96, 12901-12904 (https://doi.org/10.1073/pnas.96.22.12901)
[vi] Lerch A, Roy P, Pachet F, Nagle L. 2011. Closed-loop bird–computer interactions: a new method to study the role of bird calls. Anim. Cog. 14, 203-211. (http://dx.doi.org/10.1007/s10071-010-0353-6)
[vii] Benichov JI, Benezra SE, Vallentin D, Globerson E, Long MA, Tchernichovski O. 2016. The forebrain song system mediates predictive call timing in female and male zebra finches. Curr. Biol. 26, 309-318 (https://doi.org/10.1016/j.cub.2015.12.037)
[viii] Tchernichovski O, Lints TJ, Derégnaucourt S, Cimenser A, Mitra PP. 2004. Studying the song development process: rationale and methods. Ann. NY Acad. Sci. 1016, 348-363 (https://doi.org/10.1196/annals.1298.031)
[ix] Tchernichovski O, Nottebohm F, Ho CE, Pesaran B, Mitra PP. 2000. A procedure for an automated measurement of song similarity. Anim. Behav. 59, 1167-1176 (https://doi.org/10.1006/anbe.1999.1416)
[x] Daou A, Johnson F, Wu W, Bertram R. 2012. A computational tool for automated large-scale analysis and measurement of bird-song syntax. J. Neurosci. Meth. 210, 147-160 (https://doi.org/10.1016/j.jneumeth.2012.07.020)
[xi] Scharff C, Nottebohm F. 1991. A comparative study of the behavioral deficits following lesions of various parts of the zebra finch song system: implications for vocal learning. J. Neurosci. 11, 2896-2913 (https://doi.org/10.1523/JNEUROSCI.11-09-02896.1991).
[xii] Iyengar S, Bottjer SW 2002. The role of auditory experience in the formation of neural circuits underlying vocal learning in zebra finches. J. Neurosci. 22, 946-958 (https://doi.org/10.1523/JNEUROSCI.22-03-00946.2002).
[xiii] Kao MH, Brainard MS. 2006. Lesions of an avian basal ganglia circuit prevent context-dependent changes to song variability. J. Neurophysiol. 96, 1441-1455 (https://doi.org/10.1152/jn.01138.2005).
[xiv] Cohen J. 1960. A coefficient of agreement for nominal scales. Educ. Psychol. Measure 20, 37–46 (https://doi.org/10.1177/001316446002000104)
