Data and code to run birdnet-discovery, a pipeline for signal discovery and training dataset creation using BirdNET embeddings, including example data from acoustic ARUs in Northern Alaska
Data files
Mar 24, 2025 version files (9.20 GB):
- data_code.zip (9.20 GB)
- README.md (12.79 KB)
Abstract
In recent years, deep learning has become a popular solution for processing large ecological monitoring datasets. This rise in use has resulted in global classification models for a variety of data types and taxa, such as BirdNET, which classifies vocalizations of more than 6,000 avian species from acoustic data. These global models can be useful pre-trained models for transfer learning, allowing researchers to more easily develop classifiers specialized to their datasets. However, the development of such models hinges on the availability of comprehensive, high-quality training data, which can be difficult to acquire, produce, and use. We present a novel pipeline for creating training data from a large and unlabeled dataset with minimal human oversight. We used this pipeline and BirdNET as our base model to develop a transfer-learning-based model, ArcticSoundsNET, using acoustic monitoring data from 205 sites across Alaska’s Arctic Coastal Plain. We compared performance of ArcticSoundsNET with that of BirdNET to evaluate the effectiveness of our pipeline and success of the new model. We found that the ability of ArcticSoundsNET to detect and classify avian vocalizations in our data exceeded that of BirdNET by several orders of magnitude (AUC = 0.299 for ArcticSoundsNET, AUC < 0.001 for BirdNET). Importantly, our method for developing a training dataset is widely applicable for ecologists who do not have large amounts of labeled data, facilitating the creation of task-specific classification models. Developing such models is an essential step in using large acoustic datasets to support ecological conservation of critical species and habitats.
Dataset DOI: 10.5061/dryad.jh9w0vtnr
Description of the data and file structure
This dataset contains raw audio files and associated processed data to train, test, and evaluate ArcticSoundsNET, a neural-network-based classifier for avian vocalizations built using BirdNET as a base model. All audio data included in this dataset were collected using autonomous recording units (AudioMoth recorders) at plots across Alaska’s Arctic Coastal Plain in June-August of 2021-2023. Specifics of data collection (exact plot locations and dates) can be found in the manuscript associated with this dataset, which is available open-access. In addition, this dataset contains a comprehensive how-to document (HOWTO.md) which explains our pipeline for training data creation (birdnet-discovery) used for ArcticSoundsNET and in the manuscript associated with this dataset. All code associated with this process is also described and included.
Files and variables
File: data_code.zip
Description: This zip file contains all data and code included in this dataset. All folders and files are fully described below. Files are listed first, and then folders, in alphabetical order.
File: cluster_typeAssign.csv
Description: Example CSV used to assign a final label to data clusters when creating training data (for details, see the HOWTO document included in this dataset). This includes the following variables:
Folder: Folder path where clusters are saved
Cluster #: Number of cluster that is being given a new label
Label: Final label for all data segments in this cluster
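The assignment file is a plain CSV with the three variables above. A minimal sketch of reading it into a lookup table, using only the standard library; the folder path, cluster numbers, and labels here are hypothetical example values, not actual contents of cluster_typeAssign.csv:

```python
import csv
import io

# Hypothetical contents of a cluster_typeAssign-style CSV
example = """Folder,Cluster #,Label
clusters/TLSA_2023_AL54,0,bird
clusters/TLSA_2023_AL54,1,rain
clusters/TLSA_2023_AL54,2,background
"""

# Map (folder, cluster number) -> final label for every cluster
assignments = {}
for row in csv.DictReader(io.StringIO(example)):
    assignments[(row["Folder"], int(row["Cluster #"]))] = row["Label"]

print(assignments[("clusters/TLSA_2023_AL54", 1)])  # rain
```

In the real pipeline, a lookup like this is what lets cluster_assign_trainingLabels.py sort audio clips into labeled training folders.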
File: HOWTO.md
Description: Markdown file explaining the process of training data creation described in the manuscript associated with this dataset. This details the practical use of all code and data files included in this dataset.
File: specCodes.csv
Description: Species code file for use with the evaluation code included in this dataset. This includes the following variables:
tag: ArcticSoundsNET or BirdNET classification tag
vernacular name: Common name for the tag in question
scientific name: Scientific name for the tag in question
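A species code file with these columns is straightforward to turn into a tag-to-name lookup for evaluation output. A minimal sketch, where the example rows (tags and species) are hypothetical and may differ from the actual contents of specCodes.csv:

```python
import csv
import io

# Hypothetical rows in the style of specCodes.csv
example = """tag,vernacular name,scientific name
WIPT,Willow Ptarmigan,Lagopus lagopus
PESA,Pectoral Sandpiper,Calidris melanotos
"""

# Build a tag -> common name lookup for labeling evaluation results
tag_to_common = {
    row["tag"]: row["vernacular name"]
    for row in csv.DictReader(io.StringIO(example))
}

print(tag_to_common["WIPT"])  # Willow Ptarmigan
```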
File: species_list_for_BirdNET.txt
Description: Text file of BirdNET species codes for all common bird species in our area of Alaska (determined by experts; see the manuscript for details). Used to run base BirdNET on our ground truth data.
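For reference, BirdNET-Analyzer species lists are plain text with one species per line in `Scientific name_Common name` form. A hypothetical excerpt in that format (the authoritative contents are in species_list_for_BirdNET.txt itself):

```text
Lagopus lagopus_Willow Ptarmigan
Calidris melanotos_Pectoral Sandpiper
Calcarius lapponicus_Lapland Longspur
```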
Folder: asnet_binary_labels
Description: Classification model (asnet binary, binary BirdNET in the manuscript associated with this dataset) output label files for all ground truth audio data files used to evaluate model performance in our study. CSV files in this folder are named according to the corresponding audio file (WAV file in the ground_truth folder). Each file contains the following variables:
Start (s): start in seconds for each data segment in the audio file
End (s): end in seconds for each data segment
Scientific name: scientific name (if applicable) for label for that data segment
Common name: common name for label given to each data segment
Confidence: model confidence in the label given to each data segment
File: file name of the associated .WAV file that is being classified
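The label CSVs in this folder (and in asnet_final_labels and base_bnet_labels, which share the same layout) can be filtered by the Confidence column before evaluation. A minimal sketch with hypothetical rows; the species, times, and 0.5 threshold are illustrative only:

```python
import csv
import io

# Hypothetical rows in the label-CSV format described above
example = """Start (s),End (s),Scientific name,Common name,Confidence,File
0.0,3.0,Lagopus lagopus,Willow Ptarmigan,0.91,20230615_140000.WAV
3.0,6.0,,Background,0.55,20230615_140000.WAV
6.0,9.0,Calidris melanotos,Pectoral Sandpiper,0.32,20230615_140000.WAV
"""

# Keep only detections at or above a chosen confidence threshold
threshold = 0.5
detections = [
    row for row in csv.DictReader(io.StringIO(example))
    if float(row["Confidence"]) >= threshold
]

print([d["Common name"] for d in detections])
# ['Willow Ptarmigan', 'Background']
```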
Folder: asnet_binary_training_data
Description: Raw audio data (.WAV files) used to train asnet binary (binary BirdNET in the associated manuscript). These are sorted into ‘bird’, ‘rain’, and ‘background’ examples.
Folder: asnet_final_labels
Description: Final ArcticSoundsNET output label files for all ground truth audio data files used to evaluate model performance in our study. CSV files in this folder are named according to the corresponding audio file (WAV file in the ground_truth folder). Each file contains the following variables:
Start (s): start in seconds for each data segment in the audio file
End (s): end in seconds for each data segment
Scientific name: scientific name (if applicable) for label for that data segment
Common name: common name for label given to each data segment
Confidence: model confidence in the label given to each data segment
File: file name of the associated .WAV file that is being classified
Folder: asnet_final_training_data
Description: Raw audio data (.WAV files) used to train the final ArcticSoundsNET model. These are sorted into examples according to the 44 ArcticSoundsNET classes fully described in the manuscript associated with this dataset.
Folder: base_bnet_labels
Description: Base BirdNET model output label files for all ground truth audio data files used to evaluate model performance in our study. CSV files in this folder are named according to the corresponding audio file (WAV file in the ground_truth folder). Each file contains the following variables:
Start (s): start in seconds for each data segment in the audio file
End (s): end in seconds for each data segment
Scientific name: scientific name (if applicable) for label for that data segment
Common name: common name for label given to each data segment
Confidence: model confidence in the label given to each data segment
File: file name of the associated .WAV file that is being classified
Folder: ground_truth
Description: This folder contains the following:
-wavs: Ground truth audio data (.WAV) files used to evaluate the classification performance of BirdNET, asnet binary (binary BirdNET), and ArcticSoundsNET.
-allManuals_combined.csv: Final csv of vocalization labels in the ground truth dataset, as determined by ornithology experts (for details, see the manuscript associated with this dataset). This file contains the following variables:
File name: Name of the ground truth file a given data segment is from
Start time: start time in seconds for the data segment
End time: end time in seconds for the data segment
annotSAS: annotation from first manual reviewer (i.e., species’ code for all species present in this data segment)
annotJP: annotation from second manual reviewer
annotMZ: annotation from third manual reviewer
annotMB: annotation from fourth manual reviewer
annotAll: final annotation for the data segment, determined by compiling annotations from all reviewers
Species codes used for annotations correspond to classes in ArcticSoundsNET. ‘nan’ is used in this file to denote data segments in which one or more reviewers determined no species’ vocalizations to be present.
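The per-reviewer columns (annotSAS, annotJP, annotMZ, annotMB) are compiled into annotAll; the actual combination rule is defined in the manuscript and implemented in perf_combineManuals.py. As a hypothetical sketch only, the version below takes the union of species codes across reviewers, assumes multiple codes in one cell are ';'-separated (an assumption, not documented above), and treats 'nan' as "no species present":

```python
# Hypothetical combination rule: union of species codes across reviewers.
# 'nan' (or an empty cell) means a reviewer found no vocalizations; the
# ';' separator for multi-species cells is an assumption for illustration.
def combine_annotations(reviews):
    codes = set()
    for r in reviews:
        if r and r != "nan":
            codes.update(code.strip() for code in r.split(";"))
    return sorted(codes) if codes else ["nan"]

print(combine_annotations(["WIPT", "nan", "WIPT;PESA", "WIPT"]))
# ['PESA', 'WIPT']
```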
-manuals_all_labels_clean.csv: Original CSV file of species’ vocalizations identified by manual reviewers in the ground truth dataset, compiled across all reviewers. This file contains the following variables:
File name: Name of the ground truth file a given data segment is from
Start time: start time in seconds for the data segment
End time: end time in seconds for the data segment
Common name: common name for label given to each data segment (species codes correspond to classes in ArcticSoundsNET)
Folder: code
Description: This folder contains all code associated with this study, which is described in the code section below.
Folder: models
Description: Folder containing TensorFlow Lite models (.tflite), parameter files (Params.CSV), and label files (Labels.txt) for the final ArcticSoundsNET (asnet final) and binary BirdNET (asnet binary) models. These files are required to run asnet final or asnet binary using BirdNET command line code (see HOW-TO document for details on running BirdNET in command line). These files should not be edited.
Folder: shortDeps
Description: This folder contains example data for use with the HOW-TO document included in this submission. This includes the following:
wavs: This folder contains .WAV files from a single plot in the Teshekpuk Lake Special Area (TLSA_2023_AL54). Files are named according to the date and time (UTC) during which they were recorded.
asnet_binary_output: Folder containing example output from asnet binary, including:
TLSA_2023_AL54: Binary BirdNET model output label files for the TLSA 2023 AL54 raw audio files. CSV files in this folder are named according to the corresponding WAV file. Each file contains the following variables:
Start (s): start in seconds for each data segment in the audio file
End (s): end in seconds for each data segment
Scientific name: scientific name (if applicable) for label for that data segment
Common name: common name for label given to each data segment
Confidence: model confidence in the label given to each data segment
File: file name of the associated .WAV file that is being classified
This folder also contains embeddings.txt files for each .WAV file. Details on the use of these embeddings files can be found in the HOWTO document included in this dataset.
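To feed these embeddings into clustering, each file must be parsed into per-segment vectors. A minimal sketch, assuming each line holds a segment start, segment end, and the embedding as comma-separated floats (check your own files; the exact layout can vary by BirdNET-Analyzer version, and the example lines below are hypothetical):

```python
# Parse a BirdNET-style embeddings.txt into (start, end, vector) tuples.
# Line layout (tab-separated start/end, comma-separated values) is an
# assumption for illustration; verify against your embeddings files.
def parse_embeddings(lines):
    segments = []
    for line in lines:
        start, end, values = line.strip().split("\t")
        vec = [float(v) for v in values.split(",")]
        segments.append((float(start), float(end), vec))
    return segments

example = ["0.0\t3.0\t0.12,0.05,0.33", "3.0\t6.0\t0.02,0.41,0.09"]
segs = parse_embeddings(example)
print(len(segs), len(segs[0][2]))  # 2 3
```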
clusters: This folder contains clustered data segments from the TLSA 2023 AL54 deployment. These clusters were created by using the model embeddings from asnet_binary_output/TLSA_2023_AL54 and the HDBSCAN clustering method (see the HOWTO for details). Specifically, this folder includes images of the data segments put in each cluster, and .txt files for each cluster that include the file name and start time (in seconds) for every data segment put into the cluster.
Code/software
All code associated with this dataset was developed and run in Python 3.11 in the Spyder IDE (v 5.5.1).
Each file and its purpose is described briefly below. All files also internally include comprehensive comments and details on required inputs and final outputs. Files are listed in alphabetical order. For details on how to use this code to develop your own training dataset from unlabeled audio files, please see the associated manuscript and HOWTO file included in this submission.
cluster_assign_trainingLabels.py: This script uses clustering output (see manuscript and HOWTO file for details) in conjunction with cluster_typeAssign.csv to truncate raw files into audio clips that are sorted according to the labels provided and stored in the specified output folder. An example of the output from this script is provided at ~/shortDeps/asnet_binary_output/new_trainingData.
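The core truncation step — cutting a labeled segment out of a raw recording — can be illustrated with the standard library alone. This is a simplified sketch, not the actual cluster_assign_trainingLabels.py implementation:

```python
import wave

def extract_clip(in_path, out_path, start_s, end_s):
    """Copy the [start_s, end_s) span of a WAV file into a new WAV file."""
    with wave.open(in_path, "rb") as src:
        rate = src.getframerate()
        src.setpos(int(start_s * rate))                    # seek to clip start
        frames = src.readframes(int((end_s - start_s) * rate))
        with wave.open(out_path, "wb") as dst:
            dst.setparams(src.getparams())                 # same format as source
            dst.writeframes(frames)                        # header fixed on close
```

The real script pairs a call like this with the cluster_typeAssign.csv labels to decide which output subfolder each clip lands in.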
cluster_pruneFiles_to_shortDeps.py: This script subsets a larger folder of raw audio files to include only files within a specified time period (for our study, 6-9 AM local time). An example of this short dataset from one of our Arctic plots is included in this repository (~/shortDeps/TLSA_2023_AL54). See additional details in the HOWTO document and manuscript associated with this dataset.
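The time-window test at the heart of this subsetting step can be sketched from the filenames alone. This assumes AudioMoth-style YYYYMMDD_HHMMSS.WAV names (as in the shortDeps/wavs folder, which is named by UTC time) and that any UTC-to-local conversion has already been applied; both are simplifications of the actual script:

```python
from datetime import datetime

# Keep a file only if its timestamp falls in [start_hour, end_hour).
# Filename pattern YYYYMMDD_HHMMSS.WAV is an assumption for illustration.
def in_window(filename, start_hour=6, end_hour=9):
    stamp = datetime.strptime(filename.rsplit(".", 1)[0], "%Y%m%d_%H%M%S")
    return start_hour <= stamp.hour < end_hour

files = ["20230615_053000.WAV", "20230615_063000.WAV", "20230615_083000.WAV"]
print([f for f in files if in_window(f)])
# ['20230615_063000.WAV', '20230615_083000.WAV']
```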
cluster_run_HDBSCAN.py: Script to cluster embeddings from BirdNET (or a similar classification model) using the HDBSCAN clustering method (see associated manuscript and HOWTO file for more details). This script requires as input:
- Path to your embeddings files
- Path to your audio files (.WAV or .wav, though modification for other audio input types should be relatively simple!)
and will output:
- Spectrogram images of data put into each cluster
- Text files that list the audio file, start time, and cluster label for all data segments in a given cluster, which can be used to extract audio segments for manual review and/or training data development
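The grouping behind that text-file output can be sketched with the standard library, assuming cluster labels have already been computed from the embedding vectors (e.g., by HDBSCAN, which marks noise points with label -1). The segments and labels below are hypothetical; the real script also renders spectrogram images per cluster:

```python
from collections import defaultdict

def group_segments(segments, labels):
    """Group (file name, start time) segments by cluster id, dropping noise.

    segments: list of (file_name, start_s) tuples
    labels:   parallel list of cluster ids; -1 means HDBSCAN noise
    """
    clusters = defaultdict(list)
    for seg, label in zip(segments, labels):
        if label != -1:  # noise points get no cluster file
            clusters[label].append(seg)
    return clusters

segs = [("a.WAV", 0.0), ("a.WAV", 3.0), ("b.WAV", 0.0)]
labels = [0, -1, 0]
print(dict(group_segments(segs, labels)))
# {0: [('a.WAV', 0.0), ('b.WAV', 0.0)]}
```

Each resulting group would then be written to its own per-cluster text file listing file names and start times, ready for manual review or clip extraction.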
perf_combineManuals.py: Performance script used to convert manuals_all_labels_clean.csv to final manual labels used in model evaluations (allManuals_combined.csv)
perf_fig3.py: Code to produce figure 3 in the manuscript associated with this submission
perf_forAUC.py: Performance script used to accumulate performance data that is input into perf_table1_metrics.py.
perf_from_asnetBinary.py: Performance script used to evaluate asnet binary performance on ground truth files
perf_table1_metrics.py: Performance code used to produce Table 1 in the manuscript associated with this dataset.
perf_table2_metrics.py: Performance code used to produce all data in Table 2 in the manuscript associated with this dataset (and total number of detections in Table 1).
Access information
Other publicly accessible locations of the data:
- A portion of this dataset and all code associated with this dataset can also be found on GitHub at: https://github.com/MZiegenhorn/birdnet-discovery
Questions, concerns, or feedback about this dataset can be directed to Morgan Ziegenhorn at maziegenhorn36@gmail.com or directly via the GitHub repository (above) associated with this dataset.
Details on data collection can be found in the manuscript associated with this dataset.