Data from: Unsupervised discovery of family specific vocal usage in the Mongolian gerbil
Data files

Oct 18, 2024 version (9.87 GB)
- 2020_07_20_15_29_37_785571_merged.wav (1.20 GB)
- 2020_07_22_15_52_33_369348_merged.wav (1.20 GB)
- 2020_07_23_20_42_27_891232_merged.wav (1.20 GB)
- 2020_07_30_00_34_28_393297_merged.wav (1.20 GB)
- 2020_09_23_18_02_45_235282_merged.wav (3.60 GB)
- README.md (2.98 KB)
- umap_outline_c245_v2.npy (1.17 KB)
- umap_outline_c245.npy (1.14 KB)
- vocalization_df.csv (988.64 MB)
- vocalization_df.feather (479.48 MB)

Nov 20, 2024 version (10.08 GB)
- 2020_07_20_15_29_37_785571_merged.wav (1.20 GB)
- 2020_07_22_15_52_33_369348_merged.wav (1.20 GB)
- 2020_07_23_20_42_27_891232_merged.wav (1.20 GB)
- 2020_07_30_00_34_28_393297_merged.wav (1.20 GB)
- 2020_09_23_18_02_45_235282_merged.wav (3.60 GB)
- checkpoint_050.tar (209.21 MB)
- README.md (4.32 KB)
- umap_outline_c245_v2.npy (1.17 KB)
- vocalization_df.csv (988.64 MB)
- vocalization_df.feather (479.48 MB)
Abstract
In nature, animal vocalizations can provide crucial information about identity, including kinship and hierarchy. However, lab-based vocal behavior is typically studied during brief interactions between animals with no prior social relationship, and under environmental conditions with limited ethological relevance. Here, we address this gap by establishing long-term acoustic recordings from Mongolian gerbil families, a core social group that uses an array of sonic and ultrasonic vocalizations. Three separate gerbil families were transferred to an enlarged environment and continuous 20-day audio recordings were obtained. Using a variational autoencoder (VAE) to quantify 583,237 vocalizations, we show that gerbils exhibit a more elaborate vocal repertoire than has been previously reported and that vocal repertoire usage differs significantly by family. By performing Gaussian mixture model clustering on the VAE latent space, we show that families preferentially use characteristic sets of vocal clusters and that these usage preferences remain stable over weeks. Furthermore, gerbils displayed family-specific transitions between vocal clusters. Since gerbils live naturally as extended families in complex underground burrows that are adjacent to other families, these results suggest the presence of a vocal dialect which could be exploited by animals to represent kinship. These findings position the Mongolian gerbil as a compelling animal model to study the neural basis of vocal communication and demonstrate the potential for using unsupervised machine learning with uninterrupted acoustic recordings to gain insights into naturalistic animal behavior.
Continuous 20-day audio recordings from three separate gerbil families were obtained. We leveraged deep learning-based unsupervised analysis of 583,237 vocalizations to show that gerbils exhibit a more elaborate vocal repertoire than has been previously reported, and that vocal repertoire usage differs significantly by family.
This dataset contains a processed data file (vocalization_df) with all the data necessary to replicate the main findings of the associated paper. Also included is a variational autoencoder (VAE) pre-trained on all vocalizations (checkpoint_050.tar). Finally, raw audio files containing vocalizations are included so that users can explore some of the raw data.
Analysis notebooks and figure generating code which use the data in this repo are on GitHub: https://github.com/ralphpeterson/gerbil-vocal-dialects/.
Description of the data and file structure
vocalization_df
This is the main data table used for analysis. I recommend using the Python library pandas to load the vocalization_df.feather file (see the loading sketch after the field list).
The file has the following fields:
“timestamp” : timestamp (datetime object) of vocalization
“dt_start” : time (s) of vocalization since beginning of experiment
“onset” : onset time (s) of vocalization from the start of the associated “audio_filename” file
“offset” : offset time (s) of vocalization from the start of the associated “audio_filename” file
“hour_of_day” : hour of day that the vocalization occurred
“date” : date that the vocalization occurred
“spectral_flatness” : spectral flatness of the raw audio between onset-offset (https://librosa.org/doc/main/generated/librosa.feature.spectral_flatness.html)
“latent_mean_umap” : (x,y) coordinates of vocalization in UMAP embedding of VAE latent space
“latent_means” : 32D latent space from VAE
“audio_filename” : audio file that the vocalization came from
“cohort” : the cohort (family) that the vocalization came from (‘c2’, ‘c4’, or ‘c5’)
“cohort_int” : the cohort (family) that the vocalization came from (0, 1, 2)
“z_70” : cluster assignment of vocalization from Gaussian Mixture Model with k=70
“prob_z_70” : probability of the vocalization's assignment to each of the k=70 clusters (‘z_70’ = argmax(‘prob_z_70’))
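A minimal loading sketch (this assumes pandas with pyarrow installed for feather support; the per-family cluster-usage summary is just an illustration, not the paper's analysis code):

import pandas as pd

# Load the processed vocalization table (pyarrow is needed for feather support).
df = pd.read_feather("vocalization_df.feather")

# Illustrative summary: fraction of each family's vocalizations assigned to each GMM cluster.
cluster_usage = df.groupby("cohort")["z_70"].value_counts(normalize=True)
print(cluster_usage.head())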
wav files
Raw audio containing vocalizations used in figure1-vocalization-segmenting.ipynb and figure1-audio-segmenting-example.ipynb
2020_07_22_15_52_33_369348_merged.wav
2020_07_23_20_42_27_891232_merged.wav
Raw audio containing vocalizations used in figure4-bout-examples.ipynb
2020_07_20_15_29_37_785571_merged.wav
2020_07_30_00_34_28_393297_merged.wav
2020_09_23_18_02_45_235282_merged.wav
I recommend using scipy.io.wavfile.read() to read the wav files. Alternatively, use Ocenaudio (freeware with a nice UI) or Audacity to explore them.
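For example (the filename and time window below are placeholders):

from scipy.io import wavfile

# Read one of the merged recordings; fs should be 125000 Hz.
fs, audio = wavfile.read("2020_07_22_15_52_33_369348_merged.wav")

# Pull out a short snippet (times are arbitrary) for plotting or listening.
clip = audio[int(10.0 * fs):int(10.5 * fs)]
print(fs, audio.shape, clip.shape)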
umap_outline_c245.npy
Outer contour of UMAP probability density plots used in figure3-family-usage-differences.ipynb and figuresS3-pup-removal.ipynb. Use numpy.load() to load the npy file.
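For example (matplotlib is just one convenient way to view the contour; the (N, 2) layout is my assumption about the array shape):

import numpy as np
import matplotlib.pyplot as plt

outline = np.load("umap_outline_c245.npy")  # assumed to be an (N, 2) array of UMAP (x, y) points
plt.plot(outline[:, 0], outline[:, 1])
plt.show()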
checkpoint_050.tar
Pre-trained vocalization VAE. Load the model and use it to generate reconstructions from latents or generate latents from spectrograms. See figure2-reconstructions.ipynb for example usage.
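A minimal inspection sketch (this assumes the file is a standard PyTorch checkpoint saved with torch.save; see the notebook and the GitHub repo for the actual model class and loading code):

import torch

# Load the checkpoint onto the CPU and inspect what was saved.
checkpoint = torch.load("checkpoint_050.tar", map_location="cpu")
print(list(checkpoint.keys()))  # exact keys depend on how the training state was saved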
extra data
The following raw and pre-processed data referenced in the analysis code (https://github.com/ralphpeterson/gerbil-vocal-dialects) are not currently uploaded to Dryad due to storage limitations.
There are over 15 TB of raw and processed data available for download via Google Drive. Please email me (ralph.emilio.peterson@gmail.com) and I will be happy to share them with you.
Raw
cohort2_combined_audio
cohort4_combined_audio
cohort5_combined_audio
Pre-processed (onsets/offsets of every vocalization)
cohort2_segments
cohort4_segments
cohort5_segments
Spectrograms (pre-processed for use with the VAE)
cohort2_specs
cohort4_specs
cohort5_specs
Sharing/access Information
Links to other publicly accessible locations of the data:
N/A
Was data derived from another source?
No
Signals from four ultrasonic microphones (Avisoft CM16/CMPA48AAF-5V) were recorded synchronously using a National Instruments multifunction data acquisition device (PCI-6143) connected via BNC to a National Instruments terminal block (BNC-2110). Recording was controlled with custom Python scripts using the NI-DAQmx library (https://github.com/ni/nidaqmx-python), which wrote samples to disk at a 125 kHz sampling rate. In total, 13.084 TB of raw audio data were acquired across the three families. For further analyses, the four-channel microphone signals were averaged to create a single-channel, high-fidelity audio signal.
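A minimal acquisition sketch along these lines (the device name, channel string, and per-read buffer size are placeholders; this is not the actual acquisition script):

import numpy as np
import nidaqmx
from nidaqmx.constants import AcquisitionType

FS = 125_000  # audio sample rate (Hz)

with nidaqmx.Task() as task:
    # Four synchronized microphone channels on a hypothetical device "Dev1".
    task.ai_channels.add_ai_voltage_chan("Dev1/ai0:3")
    task.timing.cfg_samp_clk_timing(rate=FS, sample_mode=AcquisitionType.CONTINUOUS)
    # Read one second of samples per call; in practice this runs in a loop that writes to disk.
    samples = np.asarray(task.read(number_of_samples_per_channel=FS))
    # Average the four channels into a single high-fidelity signal, as in the analyses.
    mono = samples.mean(axis=0)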
Audio was segmented by amplitude thresholding using the AVA Python package (https://github.com/pearsonlab/autoencoded-vocal-analysis). First, sound amplitude traces were calculated by computing spectrograms from the raw audio and summing each column of the spectrogram. The “get_onsets_offsets” function, which performs the segmenting, requires a number of parameters that affect segmenting performance. The following values were tuned via an interactive procedure, which verified that the segmenting could detect low-amplitude vocalizations and capture the individual vocal units apparent by eye:
seg_params = {
    'min_freq': 500, # minimum frequency (Hz)
    'max_freq': 62500, # maximum frequency (Hz)
    'nperseg': 512, # FFT window length (samples)
    'noverlap': 256, # FFT window overlap (samples)
    'spec_min_val': -8, # minimum STFT log-modulus
    'spec_max_val': -7.25, # maximum STFT log-modulus
    'fs': 125000, # audio sample rate (Hz)
    'th_1': 2, # segmenting threshold 1
    'th_2': 5, # segmenting threshold 2
    'th_3': 2, # segmenting threshold 3
    'min_dur': 0.03, # minimum syllable duration (s)
    'max_dur': 0.3, # maximum syllable duration (s)
    'smoothing_timescale': 0.007, # amplitude smoothing timescale (s)
    'softmax': False, # apply softmax to the frequency bins to calculate amplitude
    'temperature': 0.5, # softmax temperature parameter
    'algorithm': get_onsets_offsets, # segmenting function from the AVA package
}
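A usage sketch (the import path and return signature reflect my reading of the AVA package; the filename is a placeholder):

from scipy.io import wavfile
from ava.segmenting.amplitude_segmentation import get_onsets_offsets

fs, audio = wavfile.read("2020_07_22_15_52_33_369348_merged.wav")
# Returns onset and offset times (in seconds) of detected sound events.
onsets, offsets = get_onsets_offsets(audio, seg_params)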
Sound events are detected when the amplitude exceeds 'th_3', and a sound offset occurs at a subsequent local minimum (i.e., an amplitude less than 'th_2' or 'th_1'). The maximum and minimum syllable durations were selected based on published duration ranges of gerbil vocalizations (Ter-Mikaelian et al., 2012; Kobayasi & Riquimaroux, 2012).
We computed the spectral flatness of each detected sound event using the Python package librosa (https://github.com/librosa). Consistent with prior literature (Castellucci et al., 2016), we used a threshold on spectral flatness to separate putative vocal and non-vocal sounds. This threshold was determined empirically by estimating the false-positive vocalization rate for groups of randomly sampled vocalizations. For each candidate threshold, 100 randomly sampled vocalization spectrograms with spectral flatness below that threshold were assembled into 10x10 grids and visually inspected for false positives (e.g., non-vocal sounds). This procedure was repeated 10 times for spectral flatness thresholds of 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, and 0.4. We quantified the false-positive vocalization rate for each threshold value and selected 0.3, which had a 5.5 +/- 1.96% false-positive rate.
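A sketch of the spectral flatness check (the onset/offset values and filename are placeholders, the STFT parameters mirror the segmenting parameters above, and averaging flatness over frames is my assumption):

import numpy as np
import librosa
from scipy.io import wavfile

fs, audio = wavfile.read("2020_07_22_15_52_33_369348_merged.wav")
onset, offset = 10.00, 10.15  # placeholder onset/offset times (s) of a detected sound event
clip = audio[int(onset * fs):int(offset * fs)].astype(np.float32)
flatness = librosa.feature.spectral_flatness(y=clip, n_fft=512, hop_length=256)
is_vocal = flatness.mean() < 0.3  # threshold selected by the procedure described above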
Application and version used: Python >= 3.9