Robot Control Gestures (RoCoG)

Cite this dataset

de Melo, Celso et al. (2020). Robot Control Gestures (RoCoG) [Dataset]. Dryad.


Building successful collaboration between humans and robots requires efficient, effective, and natural communication. This dataset supports the study of RGB-based deep learning models for controlling robots through gestures (e.g., “follow me”). To address the challenge of collecting high-quality annotated data from human subjects, we explored the use of synthetic data for this domain. The dataset includes real videos of human subjects performing gestures and synthetic videos generated with our custom simulator. It can serve as a benchmark for studying how ML models for activity perception can be improved with synthetic data.

Reference: de Melo C, Rothrock B, Gurram P, Ulutan O, Manjunath BS (2020) Vision-based gesture recognition in human-robot teams using synthetic data. In Proc. IROS 2020.


For effective human-robot interaction, the gestures need to have clear meaning, be easy to interpret, and have intuitive shape and motion profiles. To accomplish this, we selected standard gestures from the US Army Field Manual, which describes efficient, effective, and tried-and-tested gestures that are appropriate for various types of operating environments. Specifically, we consider seven gestures: Move in reverse, instructs the robot to move back in the opposite direction; Halt, stops the robot; Attention, instructs the robot to halt its current operation and pay attention to the human; Advance, instructs the robot to move towards its target position in the context of the ongoing mission; Follow me, instructs the robot to follow the human; Rally, instructs the robot to assemble at a designated location; and Move forward, instructs the robot to move forward.

The human dataset consists of recordings of 14 subjects (4 females, 10 males). Subjects performed each gesture twice, once for each of eight camera orientations (0º, 45º, ..., 315º). Some gestures can only be performed with one repetition (halt, advance), whereas others can have multiple repetitions (e.g., move in reverse); in the latter case, we instructed subjects to perform the gestures with as many repetitions as felt natural to them. The videos were recorded in open environments over four different sessions. The procedure for the data collection was approved by the US Army Research Laboratory IRB, and the subjects gave informed consent to share the data. The length of each gesture performance varied from 2 to 5 seconds, and 1,574 video segments of gestures were collected in total. The video frames were manually annotated using custom tools we developed. The frames before and after the gesture performance were labelled 'Idle'. Note that, since the duration of the actual gesture - i.e., non-idle motion - varied per subject and gesture type, the dataset includes a comparable, but not equal, number of frames for each gesture.

To synthesize the gestures, we built a virtual human simulator using a commercial game engine, namely Unity. The 3D models for the character bodies were retrieved from Mixamo, the 3D models for the face were generated on FaceGen, and the characters were assembled using 3ds Max. The character bodies were already rigged and ready for animation. We created four characters representative of the domains we were interested in: male in civilian and camouflage uniforms, and female in civilian and camouflage uniforms. Each character can be changed to reflect a Caucasian, African-American, and East Indian skin color. The simulator also supports two different body shapes: thin and thick. The seven gestures were animated using standard skeleton-animation techniques. Three animations, using the human data as reference, were created for each gesture. The simulator supports performance of the gestures with an arbitrary number of repetitions and at arbitrary speeds. The characters were also endowed with subtle random motion for the body. The background environments were retrieved from the Ultimate PBR Terrain Collection available at the Unity Asset Store. Finally, the simulator supports arbitrary camera orientations and lighting conditions.

The synthetic dataset was generated by systematically varying the aforementioned parameters. In total, 117,504 videos were synthesized, with durations ranging from 3 to 5 seconds. To generate the dataset, we ran several instances of Unity, across multiple machines, over the course of two days. The labels for these videos were generated automatically, without any need for manual annotation.

Usage notes

The following files are available with the dataset:

  • ... (26.2 GB): Raw videos of the human subjects performing the gestures, with annotations
  • ..., rocog_human_frames.z02 (18.7 GB): Frames for the human data used for training and testing. Each folder also has annotations for gesture (label.bin), orientation (orientation.bin), and the number of times the gesture is repeated (repetitions.bin)
  • ..., rocog_synth_frames.z09 (~85.0 GB): Frames for the synthetic data used for training and testing. Each folder also has annotations for gesture (label.bin), orientation (orientation.bin), and the number of times the gesture is repeated (repetitions.bin)

The labels are saved as packed binary arrays (written with Python's struct module). Each file contains one entry per frame in the corresponding directory. Here is sample Python code to read these files:

import glob
import os
import struct

# Directory containing the extracted frames for one gesture performance
frames_dir = 'FemaleCivilian\\10_Advance_11_1_2019_19_6_29'

# Count the frame images; label.bin holds one entry per frame
frame_count = len(glob.glob(os.path.join(frames_dir, '*[0-9].png')))

# Little-endian int32 array, one entry per frame
fmt_str = '<{}i'.format(frame_count)
with open(os.path.join(frames_dir, 'label.bin'), 'rb') as f:
    labels = list(struct.unpack(fmt_str, f.read()))
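Assuming orientation.bin and repetitions.bin follow the same one-int32-per-frame layout as label.bin (the description above states each file has one entry per frame), the reading logic can be factored into a small reusable helper. This is a sketch; the directory and label values in the demo below are fabricated, not taken from the dataset:

```python
import os
import struct
import tempfile

def read_bin(frames_dir, name, frame_count):
    """Read a per-frame annotation file (little-endian int32 array)."""
    with open(os.path.join(frames_dir, name), 'rb') as f:
        return list(struct.unpack('<{}i'.format(frame_count), f.read()))

# Demo with fabricated data: write a fake label.bin and read it back.
demo_labels = [0, 0, 4, 4, 4, 0]  # hypothetical per-frame labels
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, 'label.bin'), 'wb') as f:
        f.write(struct.pack('<{}i'.format(len(demo_labels)), *demo_labels))
    print(read_bin(d, 'label.bin', len(demo_labels)))  # → [0, 0, 4, 4, 4, 0]
```

The same `read_bin` call would then be reused with 'orientation.bin' and 'repetitions.bin' on a real frames directory.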

The gesture labels are as follows: 0, advance; 1, attention; 2, rally; 3, move forward; 4, halt; 5, follow me; 6, move in reverse.
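To turn the raw integer labels into gesture names and per-class frame counts, a small helper might look like the following. Note that the dataset also contains 'Idle' frames whose integer id is not stated above, so the mapping covers only the seven gestures and reports any other id as-is:

```python
from collections import Counter

# Mapping from integer label to gesture name, per the list above.
GESTURES = {
    0: 'advance',
    1: 'attention',
    2: 'rally',
    3: 'move forward',
    4: 'halt',
    5: 'follow me',
    6: 'move in reverse',
}

def summarize(labels):
    """Count frames per gesture; ids outside the mapping (e.g. Idle) are kept numeric."""
    return {GESTURES.get(k, 'label {}'.format(k)): v
            for k, v in Counter(labels).items()}

# Example with fabricated per-frame labels:
print(summarize([0, 0, 4, 4, 4, 6]))  # → {'advance': 2, 'halt': 3, 'move in reverse': 1}
```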