F0 estimation for bioacoustics: A benchmark/training dataset of non-human vocalisations with annotated frequency contours
Data files
May 08, 2025 version files 27.86 GB
canids.zip (1.31 GB)
disk-winged_bats.zip (52.63 MB)
dolphins.zip (260.64 MB)
F0_predictions.zip (16.36 GB)
hummingbirds.zip (401.60 MB)
La_Palma_chaffinches.zip (87.73 MB)
lions.zip (2.38 MB)
little_owls.zip (19.59 MB)
long-billed_hermits.zip (2.80 MB)
monk_parakeets.zip (5.08 MB)
orangutans.zip (345.40 MB)
README.md (5.40 KB)
Reunion_grey_white_eyes.zip (307.55 MB)
rodents.zip (8.64 GB)
spotted_hyenas.zip (57.85 MB)
Abstract
The fundamental frequency (F0) is a key parameter for characterising structures in vertebrate vocalisations, for instance defining vocal repertoires and their variations at different biological scales (e.g., population dialects, individual signatures). However, estimating F0 manually is too laborious, and automating it is complex. Despite significant advancements in automatic F0 estimation for speech and music, similar progress in bioacoustics has been limited.
To address this gap, we compile and publish a benchmark dataset of over 250,000 calls from 13 taxa, each paired with ground-truth F0 values: each call is associated with a series of time x frequency points delineating its frequency contour. These vocalisations range from high to low SNR and from infrasound to ultrasound, span high to low harmonicity, and some include non-linear phenomena.
This dataset can be used to train supervised and/or self-supervised models to estimate F0 (similar to CREPE or PESTO, for instance). The provided ground truth also allows evaluating and comparing different algorithms on these signals (see the associated manuscript for a first benchmark and baseline). Pretrained models, along with scripts to train or evaluate models on this dataset, are available in a separate GitHub repository.
https://doi.org/10.5061/dryad.prr4xgxw8
This dataset compiles acoustic recordings for more than 250,000 non-human vocalisations of 13 different taxa. Each vocalisation is represented by a waveform file and an associated ground-truth F0 contour. The associated journal publication characterises the data distribution of each taxon and reports a first benchmark of F0 estimation algorithms, including neural networks specifically fine-tuned for the task.
Description of the data and file structure
Each folder in this dataset corresponds to one taxon, for which vocalisations are stored as acoustic waveforms along with F0 annotations; the exception is the F0_predictions folder described below. Files are usually stored in a single folder per taxon, except for canids, where files are sorted by species, and disk-winged bats, where files are sorted into "inquiry" and "response" calls.
To browse all samples easily, users can find a file_list.txt file at the root of each folder, listing all sound files for the given taxon. Each sound file stores the mono waveform of one vocalisation, surrounded by silent padding. In the same folder, each sound file has a .csv file with the same base name containing the F0 ground-truth annotations (a list of F0 values in Hz along with timestamps in seconds). Time windows with no F0 ground truth are considered silent or unvoiced.
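A minimal sketch of reading one such annotation table with pandas. The column names ("time", "f0") are assumptions for illustration; check the actual header of the .csv files in each taxon folder. A CSV string stands in for a real file here.

```python
# Sketch: parse one F0 ground-truth table (timestamps in seconds, F0 in Hz).
# Header names "time" and "f0" are assumptions; verify against the real files.
import io
import pandas as pd

# Stand-in for the contents of one annotation .csv file.
csv_text = """time,f0
0.100,612.5
0.110,618.3
0.120,624.0
"""
contour = pd.read_csv(io.StringIO(csv_text))

# The matching waveform shares the base name, e.g. with librosa:
#   waveform, sr = librosa.load("call_001.wav", sr=None)

duration_voiced = contour["time"].iloc[-1] - contour["time"].iloc[0]
mean_f0 = contour["f0"].mean()
```

Frames absent from the table are unvoiced by definition, so gaps in the time column carry information and should not be interpolated over blindly.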
Each sound file also has a corresponding table in the F0_predictions folder, with the same base name and the _preds.csv suffix. These tables do not contain original raw data but report the different analyses presented in the associated publication. They contain the columns time (resampled time position in seconds), annot (resampled F0 ground truth in Hz), harmonicity, salience and SHR (unitless ratios; see the associated publication for a detailed description), along with confidence values and F0 predictions (in Hz) for each of the benchmarked algorithms: praat, p-YIN, crepe (TensorFlow implementation), torch-crepe (tcrepe), torch-crepe trained on different taxa (tcrepe_ftoth), torch-crepe trained on the same taxon (tcrepe_ftsp), torch-crepe trained on the evaluation taxon with Viterbi smoothing (tcrepe_ftspV), basic-pitch (basic), PESTO not fine-tuned (pesto), and PESTO fine-tuned (pesto_ft). In addition to the 13 taxa present in this dataset, F0 predictions are also available for bottlenose dolphins, whose waveforms could not be published but can be requested from the original authors (see access information).
Code/software
.wav and .csv files can be opened with the software of choice. In Python, packages such as librosa and pandas are recommended.
The necessary code to reproduce the journal publication’s experiments with this data is available in this repository. This includes testing different algorithms and models on this dataset, fine-tuning models on specific data subsets, or running pretrained models for F0 inference on new data.
Access information
Data was derived from the following sources:
- dolphins: Roch MA, Scott Brandes T, Patel B, Barkley Y, Baumann-Pickering S, Soldevilla MS. Automated extraction of odontocete whistle contours. The Journal of the Acoustical Society of America. 2011;130(4):2212–2223.
- lions: Wijers M, Trethowan P, Du Preez B, Chamaillé-Jammes S, Loveridge AJ, Macdonald DW, et al. Vocal discrimination of African lions and its potential for collar-free tracking. Bioacoustics. 2021;30(5):575–593.
- spotted hyenas: Lehmann KD, Jensen FH, Gersick AS, Strandburg-Peshkin A, Holekamp KE. Long-distance vocalizations of spotted hyenas contain individual, but not group, signatures. Proceedings of the Royal Society B. 2022;289(1979):20220548.
- monk parakeets: Smith-Vidaurre G, Perez-Marrufo V, Wright TF. Simpler signatures post-invasion; 2021. Available from: https://figshare.com/articles/dataset/Simpler_signatures_post-invasion/14811636.
- monk parakeets: Smith-Vidaurre G, Perez-Marrufo V, Hobson EA, Salinas-Melgoza A, Wright TF. Smith-Vidaurre et al 2023 IdentityInformationEncoding; 2023. Available from: https://figshare.com/articles/dataset/Smith-Vidaurre_et_al_2023_IdentityInformationEncoding/22582099.
- rodents: Warren MR, Campbell D, Borie AM, Ford IV CL, Dharani AM, Young LJ, et al. Maturation of Social-Vocal Communication in Prairie Vole (Microtus ochrogaster) Pups. Frontiers in Behavioral Neuroscience. 2022;15:814200.
- rodents: Liu RC, Miller KD, Merzenich MM, Schreiner CE. Acoustic variability and distinguishability among mouse ultrasound vocalizations. The Journal of the Acoustical Society of America. 2003;114(6):3412–3422.
- bottlenose dolphins: Sayigh LS, Janik VM, Jensen FH, Scott MD, Tyack PL, Wells RS. The Sarasota Dolphin Whistle Database: A unique long-term resource for understanding dolphin communication. Frontiers in Marine Science. 2022;9:923046.