Data from: Towards automated ethogramming: Cognitively-inspired event segmentation for streaming wildlife video monitoring

Cite this dataset

Mounir, Ramy et al. (2023). Data from: Towards automated ethogramming: Cognitively-inspired event segmentation for streaming wildlife video monitoring [Dataset]. Dryad. https://doi.org/10.5061/dryad.kh18932bb

Abstract

Our dataset, Nest Monitoring of the Kagu, consists of around ten days (253 hours) of continuous monitoring sampled at 25 frames per second. The dataset aims to facilitate computer vision research related to event detection and localization. We fully annotated the entire dataset (23M frames) with spatial localization labels in the form of tight bounding boxes. Additionally, we provide temporal event segmentation labels for five distinct bird activities: Feeding, Pushing leaves, Throwing leaves, Walk-In, and Walk-Out. The feeding event represents the period of time when the birds feed the chick. The nest-building events (pushing/throwing leaves) occur when the birds work on the nest during incubation. Pushing leaves is a nest-building behavior during which the birds form a crater by pushing leaves with their legs toward the edges of the nest while sitting on it. Throwing leaves is another nest-building behavior during which the birds throw leaves with their bills toward the nest, usually while outside the nest. Walk-In and Walk-Out events represent the transitions from an empty nest to incubation or brooding, and vice versa. We also provide five additional labels based on time of day and lighting conditions: Day, Night, Sunrise, Sunset, and Shadows. In our manuscript, we provide a baseline approach that detects events and spatially localizes the bird in each frame using an attention mechanism. Our approach does not require any labels and uses a predictive deep learning architecture inspired by cognitive psychology studies, specifically Event Segmentation Theory (EST). We split the dataset such that the first two days are used for validation and the last eight days for performance evaluation.
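
For illustration only, the Python sketch below shows one way to reproduce the validation/evaluation split described above (first two days for validation, remaining days for evaluation, at 25 fps) from per-frame annotations. The CSV file name, the frame_index column, and the assumption of a single continuous frame numbering are hypothetical, not part of the released data format.

    # Minimal sketch (not the authors' tooling): partitioning per-frame
    # annotations into validation/evaluation sets by frame index, assuming
    # a hypothetical CSV with one row per frame and a `frame_index` column.
    import csv

    FPS = 25
    FRAMES_PER_DAY = FPS * 60 * 60 * 24      # 2,160,000 frames per 24 h of video
    VALIDATION_DAYS = 2                      # first two days -> validation
    VAL_CUTOFF = VALIDATION_DAYS * FRAMES_PER_DAY

    def split_annotations(csv_path):
        """Return (validation_rows, evaluation_rows) based on frame index."""
        validation, evaluation = [], []
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                frame = int(row["frame_index"])   # hypothetical column name
                (validation if frame < VAL_CUTOFF else evaluation).append(row)
        return validation, evaluation

    # Example usage (file name is illustrative only):
    # val_rows, eval_rows = split_annotations("kagu_annotations.csv")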

Methods

The video monitoring system consisted of a commercial infrared-illuminator surveillance camera (Sony 1/3″ CCD image sensor) and an electret mini microphone with a built-in SMD amplifier (Henri Electronic, Germany), connected to a recording device via a 6.4-mm multicore cable. The transmission cable consisted of a 3-mm coaxial cable for the video signal, a 2.2-mm coaxial cable for the audio signal, and two 2-mm (0.75 mm²) cables to power the camera and microphone. We powered the systems with 25-kg deep-cycle lead-acid batteries with a storage capacity of 100 Ah. We used both Archos™ 504 DVRs (with 80 GB hard drives) and Archos 700 DVRs (with 100 GB hard drives). All cameras were equipped with 12 infrared light-emitting diodes (LEDs) for night vision.

We manually annotated the dataset with temporal events, time-of-day/lighting conditions, and spatial bounding boxes without relying on any object detection or tracking algorithms. The temporal annotations were initially created by experts who study the behavior of the Kagu and were later refined to improve the precision of the temporal boundaries. Additional labels, such as lighting conditions, were added during the refinement process. The spatial bounding box annotations of 23M frames were created manually using professional video editing software (DaVinci Resolve). We attempted to use available data annotation tools, but they could not handle the scale of our video (10 days of continuous monitoring). We therefore used video editing software, which allowed us to annotate and export bounding box masks as videos. The masks were then post-processed to convert the annotations from binary mask frames to bounding box coordinates for storage. Notably, the video editing software allowed us to linearly interpolate between keyframes of the bounding box annotations, which saved time and effort when the bird's motion was linear. Both temporal and spatial annotations were verified by two volunteer graduate students. The process of creating spatial and temporal annotations took approximately two months.
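
As an illustration of the post-processing step described above, the Python sketch below converts a binary mask frame to tight bounding box coordinates and linearly interpolates boxes between two annotated keyframes. The array layout and function names are illustrative assumptions, not the exact pipeline used for the released annotations.

    # Minimal sketch of the described post-processing: mask frame -> tight
    # bounding box, plus linear interpolation between two keyframe boxes.
    import numpy as np

    def mask_to_bbox(mask):
        """Return (x_min, y_min, x_max, y_max) for a binary mask, or None if empty."""
        ys, xs = np.nonzero(mask)            # row/column indices of foreground pixels
        if xs.size == 0:
            return None
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

    def interpolate_bbox(bbox_a, bbox_b, t):
        """Linearly interpolate two boxes; t=0 gives bbox_a, t=1 gives bbox_b."""
        return tuple((1.0 - t) * a + t * b for a, b in zip(bbox_a, bbox_b))

    # Example: box for the frame midway between two keyframes
    # mid_box = interpolate_bbox((10, 20, 50, 80), (30, 40, 70, 100), t=0.5)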

Funding

National Science Center, Poland, Award: NCN 2011/01/M/NZ8/03344

National Science Center, Poland, Award: NCN 2018/29/B/NZ8/02312

National Science Foundation, Award: IIS 1956050