Animal acoustic identification, denoising, and source separation using generative adversarial networks
Data files
Aug 18, 2025 version files (425.99 MB):
- README.md (9.63 KB)
- Spectrogram_GAN.zip (425.98 MB)
Abstract
Soundscapes contain rich ecological information, offering insights into both biodiversity and ecosystem dynamics. However, the sheer volume of data produced by passive acoustic monitoring presents significant challenges for scalable analysis and ecological interpretation. While convolutional neural networks (CNNs) have advanced species classification in bioacoustics, they often struggle to localize acoustic targets within acoustic space and to quantify soundscape characteristics.
In this study, we propose a novel spectrogram-to-spectrogram translation framework based on generative adversarial networks (GANs) to isolate and quantify acoustic sources within soundscape recordings. Our method is trained on paired spectrogram images: original full-spectrogram representations and target spectrogram representations containing only the vocalizations corresponding to specific sound labels. This design enables the model to learn source-specific mappings and perform both species-level and community-level separation of acoustic components in soundscape recordings.
We developed and evaluated two GAN-based models: a species-level GAN targeting eight avian species, and a community-level GAN distinguishing among avian, insect, and anthropogenic sound sources. The models were trained and tested using soundscape recordings collected from the Yaoluoping National Nature Reserve, eastern China. The species-level model achieved a mean F1 score of 0.76 for pixel-wise detection, while the community-level model reached 0.79 across categories. In addition to precise temporal-spectral localization, our approach captures sources’ acoustic occupancy and frequency distribution patterns, offering deeper ecological insight. Compared to baseline CNN classifiers, our model achieved a mean F1 score of 0.97, demonstrating comparable classification performance to ResNet50 (0.95) and VGG16 (0.98) across multiple species. Our GAN approach for extracting sound sources also significantly outperformed conventional methods in denoising and source separation, as indicated by lower image-level mean squared error.
These results demonstrate the utility of GANs in advancing ecoacoustic analyses and biodiversity monitoring. By enabling robust source separation and fine-resolution signal mapping, the proposed approach contributes a scalable and transferable tool for soundscape quantification.
Dataset DOI: 10.5061/dryad.vhhmgqp6k
Description
This compressed archive (Spectrogram_GAN.zip) contains code, data, and analysis outputs related to the manuscript "Animal Acoustic Identification, Denoising, and Source Separation Using Generative Adversarial Networks". It includes scripts for transforming audio into spectrograms, preparing training pairs, and training/testing generative models for spectrogram reconstruction. The materials support both species-level and community-level acoustic analysis and quantitative evaluations.
Data and File Structure
The repository contains the compressed package Spectrogram_GAN.zip, which includes the following files and directories:
Python Scripts
job01_wav_to_spec.py: Performs a Fourier transform on each audio file and generates the corresponding spectrogram.
Input: wav01.wav → Output: spec01.png
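The transform step can be sketched with a numpy-only short-time FFT. This is an illustrative re-implementation, not the script itself: the actual job01_wav_to_spec.py may use different window, FFT-size, hop, and scaling parameters, and the function name `wav_to_spectrogram` is an assumption.

```python
import numpy as np

def wav_to_spectrogram(signal, n_fft=512, hop=256):
    """Compute a log-magnitude spectrogram via a Hann-windowed short-time FFT."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Keep only the non-negative frequency bins (n_fft // 2 + 1 rows).
    spec = np.abs(np.fft.rfft(frames, axis=1)).T
    # Log scaling compresses the dynamic range for export as an image.
    return 20 * np.log10(spec + 1e-10)

# Example: one second of a 1 kHz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = wav_to_spectrogram(np.sin(2 * np.pi * 1000 * t))
print(spec.shape)  # (257, 61): frequency bins x time frames
```

The resulting 2-D array is then scaled to 0–255 and written out as a PNG, e.g. spec01.png.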
job02_merge_img.py: Merges each original spectrogram with its color-labeled target spectrogram to create paired images for GAN training.
Input: spec01.png + label01.png → Output: merge01.png
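Pix2pix-style paired training typically expects the input and target placed side by side in a single image. A minimal sketch of this merging step, assuming both images are arrays of identical shape (the real script's exact layout and image-IO details may differ):

```python
import numpy as np

def merge_pair(spec_img, label_img):
    """Place the original spectrogram and its color-labeled target side
    by side, the layout expected by pix2pix-style paired training."""
    if spec_img.shape != label_img.shape:
        raise ValueError("paired images must share height, width, and channels")
    return np.concatenate([spec_img, label_img], axis=1)

# Toy 256x256 RGB pair merged into one 256x512 training image.
spec = np.zeros((256, 256, 3), dtype=np.uint8)       # stand-in spectrogram
label = np.full((256, 256, 3), 255, dtype=np.uint8)  # stand-in label image
merged = merge_pair(spec, label)
print(merged.shape)  # (256, 512, 3)
```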
job03_train.py: Trains the GAN model, with options for species-level or community-level training.
Output models are saved in out_gan_mdl.
job04_test.py: Tests the trained GAN models and writes output images to out_img.
job05_specific_spec.py: Applies the color-labeled image as a mask to extract a purified spectrogram containing only the target species.
Input: spec01.png + label01.png → Output: bird-only-spec01.png
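The masking idea can be sketched as follows: keep only those spectrogram pixels whose label-image color matches the target species' label color, and zero out everything else. The function name, the color-distance tolerance, and the specific colors are illustrative assumptions, not taken from the script.

```python
import numpy as np

def extract_target(spec_img, label_img, target_color, tol=30):
    """Zero out every spectrogram pixel whose label color is not
    within `tol` (L1 distance) of the target species' label color."""
    dist = np.abs(label_img.astype(int) - np.array(target_color)).sum(axis=-1)
    mask = dist <= tol
    out = np.zeros_like(spec_img)
    out[mask] = spec_img[mask]
    return out

rng = np.random.default_rng(0)
spec = rng.integers(1, 255, (64, 64, 3), dtype=np.uint8)  # stand-in spectrogram
label = np.zeros((64, 64, 3), dtype=np.uint8)
label[:, :32] = (255, 0, 0)  # left half labeled red = target species region
out = extract_target(spec, label, target_color=(255, 0, 0))
```

Only the left (red-labeled) half of the spectrogram survives; the unlabeled right half is zeroed.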
requirements.txt: Lists the Python libraries required to run the code.
Directories
sample: Example inputs and outputs from job01_wav_to_spec.py, job02_merge_img.py, and job05_specific_spec.py.
input_img: Contains four sub-directories with paired training and testing data (train_community_samples, train_species_samples, test_community_samples, test_species_samples). Each image pair consists of an original spectrogram and its corresponding color-labeled image, representing either species-level or community-level acoustic signals. These pairs were generated by first converting audio recordings into spectrograms using job01_wav_to_spec.py, and then merging them with annotated label images using job02_merge_img.py. The resulting images serve as input data for training and testing the GAN models.
out_gan_mdl: Stores the trained GAN models output by job03_train.py, saved in PyTorch .pth format. Two models are included in this folder: community_net_G.pth (community-level model) and species_net_G.pth (species-level model). The .pth files can be loaded for model testing with job04_test.py.
out_img: Output images from GAN model testing (job04_test.py).
util: Supporting code used by job03_train.py and job04_test.py.
data: Quantitative evaluation results from the ecoacoustic analyses. Each file corresponds to a specific task or evaluation method:
(1) data01_ssim_lpips.xlsx: Excel file containing community-level and species-level image quality evaluations using SSIM and LPIPS.
(2) data02_frequency_quantification.xlsx: Excel file containing frequency quantification of acoustic signals at both species and community levels.
(3) data03_time_acoustic_space_quantification.xlsx: Excel file containing temporal and acoustic space occupation of species and communities.
(4) data04_denoising.xlsx: Excel file containing performance results of different audio denoising methods, including GAN, spectral subtraction, Wiener filtering, and a no-denoise baseline.
(5) data05_source_separation.xlsx: Excel file containing performance results of source separation methods, including GAN and NMF.
Variable Descriptions
data01_ssim_lpips.xlsx
epoch: Training epoch number of the GAN model. (dimensionless)
community ssim: SSIM for evaluating image structural similarity quality of the generated outputs at the community level. (dimensionless)
community lpips: LPIPS score for evaluating perceptual image patch similarity of the generated outputs at the community level. (dimensionless)
species ssim: SSIM for evaluating image structural similarity quality of the generated outputs at the species level. (dimensionless)
species lpips: LPIPS score for evaluating perceptual image patch similarity of the generated outputs at the species level. (dimensionless)
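As a reference for how SSIM compares two spectrogram images, here is a single-window ("global") version of the SSIM formula in numpy. Note this is a simplification for illustration: the standard metric (e.g. as implemented in scikit-image or used for data01) averages this statistic over small local windows rather than computing it once over the whole image.

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM combining luminance, contrast, and structure.
    Simplified: standard SSIM averages this over local windows."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(1)
a = rng.uniform(0, 255, (128, 128))
print(global_ssim(a, a))          # identical images score 1.0
print(global_ssim(a, 255.0 - a))  # an inverted image scores far lower
```

LPIPS, by contrast, cannot be written in a few lines: it compares deep-network feature activations and requires a pretrained model (e.g. the `lpips` Python package).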
data02_frequency_quantification.xlsx
frequency(Hz): Frequency of the sound signal, measured in Hz. (Hz)
Jungle Nightjar (Caprimulgus indicus): One of the eight bird species in the frequency quantification analysis, with values representing the species' acoustic amplitude at different frequencies. (dimensionless)
Eurasian Jay (Garrulus glandarius): One of the eight bird species in the frequency quantification analysis, with values representing the species' acoustic amplitude at different frequencies. (dimensionless)
Koklass Pheasant (Pucrasia macrolopha): One of the eight bird species in the frequency quantification analysis, with values representing the species' acoustic amplitude at different frequencies. (dimensionless)
Oriental Scops Owl (Otus sunia): One of the eight bird species in the frequency quantification analysis, with values representing the species' acoustic amplitude at different frequencies. (dimensionless)
Lesser Cuckoo (Cuculus poliocephalus): One of the eight bird species in the frequency quantification analysis, with values representing the species' acoustic amplitude at different frequencies. (dimensionless)
Brownish-flanked Bush Warbler (Horornis fortipes): One of the eight bird species in the frequency quantification analysis, with values representing the species' acoustic amplitude at different frequencies. (dimensionless)
Alström's Warbler (Phylloscopus soror): One of the eight bird species in the frequency quantification analysis, with values representing the species' acoustic amplitude at different frequencies. (dimensionless)
Hartert's Leaf Warbler (Phylloscopus goodsoni): One of the eight bird species in the frequency quantification analysis, with values representing the species' acoustic amplitude at different frequencies. (dimensionless)
bird: Community category in the community-level frequency quantification analysis, with values representing the community's acoustic amplitude at different frequencies. (dimensionless)
insect: Community category in the community-level frequency quantification analysis, with values representing the community's acoustic amplitude at different frequencies. (dimensionless)
human: Community category in the community-level frequency quantification analysis, with values representing the community's acoustic amplitude at different frequencies. (dimensionless)
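A plausible way such per-frequency amplitude columns are produced is by averaging a source-separated spectrogram across its time frames, leaving one amplitude per frequency bin. This is a sketch under assumptions: the paper's exact aggregation, sample rate, and FFT size are not specified here, and the function name is hypothetical.

```python
import numpy as np

def frequency_profile(spec, sr=32000):
    """Average amplitude per frequency bin across all time frames,
    yielding one value per frequency row of the spectrogram."""
    n_bins = spec.shape[0]
    freqs = np.linspace(0, sr / 2, n_bins)  # rfft bins span 0..Nyquist
    return freqs, spec.mean(axis=1)

# Toy spectrogram with energy concentrated in one frequency band.
spec = np.zeros((257, 100))
spec[80:90, :] = 1.0
freqs, profile = frequency_profile(spec)
peak_hz = freqs[np.argmax(profile)]
```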
data03_time_acoustic_space_quantification.xlsx
community/species: Category indicating either a community type (e.g., bird, insect, human) or a species name included in the analysis. (category)
time percentage: Proportion of total recording time occupied by the sounds of a given community or species, expressed as a percentage. (%)
acoustic space percentage: Proportion of the total acoustic space occupied by the sounds of a given community or species, expressed as a percentage. (%)
data04_denoising.xlsx
species: Species category included in the denoising analysis. (category)
GAN: MSE between the denoised spectrogram processed by the GAN algorithm and the real audio spectrogram. (dimensionless)
spectral subtraction: MSE between the denoised spectrogram processed by the spectral subtraction method and the real audio spectrogram. (dimensionless)
Wiener filtering: MSE between the denoised spectrogram processed by Wiener filtering and the real audio spectrogram. (dimensionless)
no denoise: MSE between the non-denoised spectrogram and the real audio spectrogram. (dimensionless)
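The image-level MSE reported in data04 (and data05) compares each processed spectrogram against the clean reference spectrogram, with lower values indicating better denoising or separation. A minimal sketch of that metric:

```python
import numpy as np

def spectrogram_mse(processed, reference):
    """Image-level mean squared error between a processed (denoised or
    separated) spectrogram and the clean reference spectrogram."""
    p = np.asarray(processed, dtype=float)
    r = np.asarray(reference, dtype=float)
    return float(np.mean((p - r) ** 2))

clean = np.zeros((128, 128))
noisy = clean + 10.0  # toy spectrogram with a uniform noise offset
half = clean + 5.0    # toy partially denoised spectrogram
print(spectrogram_mse(noisy, clean))  # 100.0
print(spectrogram_mse(half, clean))   # 25.0 (lower = better)
```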
data05_source_separation.xlsx
community: Community category included in the source separation analysis. (category)
GAN: MSE between the audio source separated by the GAN and the real audio source. (dimensionless)
NMF: MSE between the audio source separated by NMF and the real audio source. (dimensionless)
Abbreviations
GAN: Generative Adversarial Network. A generative model framework that trains two neural networks, a generator and a discriminator, in opposition to each other to produce realistic synthetic data.
SSIM: Structural Similarity Index Measure. An image quality assessment metric that measures the similarity between two images based on luminance, contrast, and structural information.
LPIPS: Learned Perceptual Image Patch Similarity. A perceptual similarity metric that compares image patches using deep neural network features learned from human judgments of visual similarity.
NMF: Non-negative Matrix Factorization. A matrix decomposition technique that factorizes a non-negative data matrix into two lower-rank non-negative matrices.
MSE: Mean Squared Error. A statistical measure that calculates the average of the squared differences between predicted and actual values.
Code Citation
The core training and testing code (in job03_train.py, job04_test.py, and the util folder) is based on Isola, P., Zhu, J.-Y., Zhou, T., & Efros, A. A. (2017). Image-to-Image Translation with Conditional Adversarial Networks. CVPR 2017.
Code/software
Python is required to run the scripts; they were developed with Python 3.8.
