Data and trained models for: Human-robot facial co-expression
Abstract
Large language models are enabling rapid progress in robotic verbal communication, but nonverbal communication is not keeping pace. Physical humanoid robots struggle to express and communicate using facial movement, relying primarily on voice. The challenge is twofold: First, the actuation of an expressively versatile robotic face is mechanically challenging. A second challenge is knowing what expressions to generate so that they appear natural, timely, and genuine. Here we propose that both barriers can be alleviated by training a robot to anticipate future facial expressions and execute them simultaneously with a human. Whereas delayed facial mimicry looks disingenuous, facial co-expression feels more genuine, since it requires correctly inferring the human's emotional state for timely execution. We find that a robot can learn to predict a forthcoming smile about 839 milliseconds before the human smiles and, using a learned inverse kinematic facial self-model, co-express the smile simultaneously with the human. We demonstrate this ability using a robot face comprising 26 degrees of freedom. We believe that the ability to co-express simultaneous facial expressions could improve human-robot interaction.
README: Dataset for Paper "Human-Robot Facial Co-expression"
Overview
This dataset accompanies the research on human-robot facial co-expression, aiming to enhance nonverbal interaction by training robots to anticipate and simultaneously execute human facial expressions. Our study proposes a method where robots can learn to predict forthcoming human facial expressions and execute them in real time, thereby making the interaction feel more genuine and natural.
https://doi.org/10.5061/dryad.gxd2547t7
Description of the data and file structure
The dataset is organized into several zip files, each containing different components essential for replicating our study's results or for use in related research projects:
- pred_training_data.zip: Contains the data used for training the predictive model. This dataset is crucial for developing models that predict human facial expressions based on input frames.
- pred_model.zip: Contains the trained models that predict target human facial expressions. These models are trained on the data from pred_training_data.zip.
- inverse_model.zip: Includes trained models for generating motor commands from facial landmark inputs. These models are vital for translating predicted facial expressions into actionable motor commands for the robot's facial actuators.
- Coexpression.zip: Houses the software necessary for data analysis and reproducing the results presented in our paper. It includes code for both the predictive and inverse modeling processes, along with utilities for evaluating model performance.
- 0_input.npy: An example input data file showcasing the structure expected by the predictive model. This NumPy array file has a shape of (9, 2, 113), representing 9 sequential frames, each frame consisting of 2D coordinates for 113 facial landmarks. This sample data can be used as a reference for the format and scale of input data necessary for model training and inference (see the loading sketch after this list).
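As a quick sanity check, the example input can be loaded and inspected with NumPy. The file name 0_input.npy comes from this dataset; everything else in the snippet is illustrative.

```python
import numpy as np

# Load the example input shipped with this dataset.
sample = np.load("0_input.npy")

# Expected shape: (9, 2, 113) -> 9 sequential frames, each holding
# (x, y) coordinates for 113 facial landmarks.
print(sample.shape)

# View the first frame as a (113, 2) array of (x, y) landmark points.
first_frame = sample[0].T
print(first_frame[:5])  # first five landmarks
```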
The training data within pred_training_data.zip is structured to facilitate easy loading into machine learning frameworks. It consists of input facial landmark sequences paired with the corresponding target facial landmarks; one possible way to wrap such pairs is sketched below.
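The following is a minimal sketch, not the dataset's actual layout: the array file names and the target shape are placeholders, and only the (9, 2, 113) input format is taken from the documentation above.

```python
import numpy as np
import torch
from torch.utils.data import Dataset


class LandmarkPairDataset(Dataset):
    """Wraps input landmark sequences and their target landmarks as tensors.

    The file names below are hypothetical placeholders; substitute the
    arrays actually found inside pred_training_data.zip.
    """

    def __init__(self, input_path="inputs.npy", target_path="targets.npy"):
        self.inputs = np.load(input_path).astype(np.float32)    # e.g., (N, 9, 2, 113)
        self.targets = np.load(target_path).astype(np.float32)  # e.g., (N, 2, 113)

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return torch.from_numpy(self.inputs[idx]), torch.from_numpy(self.targets[idx])
```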
The deep neural networks in pred_model.zip and inverse_model.zip are implemented in PyTorch. Details on the model architecture and parameters are included within each zip file.
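As a rough sketch of how a trained predictor could be applied to the example input, assuming the checkpoint stores a complete PyTorch module (the file path is a placeholder; consult the contents of pred_model.zip for the actual names and loading procedure):

```python
import numpy as np
import torch

# Placeholder path; replace with the actual checkpoint from pred_model.zip.
# If the checkpoint stores only a state_dict, instantiate the architecture
# described in the zip file first and call load_state_dict instead.
model = torch.load("pred_model/model.pt", map_location="cpu")
model.eval()

# Use the provided example as a single-sample batch: (1, 9, 2, 113).
x = torch.from_numpy(np.load("0_input.npy")).float().unsqueeze(0)

with torch.no_grad():
    predicted_landmarks = model(x)  # predicted target facial landmarks

print(predicted_landmarks.shape)
```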
The software provided in Coexpression.zip includes scripts for processing data, training models, and evaluating the system's performance.
Citation
Please cite our work if you use this dataset or the associated models/software in your research:
Hu, Yuhang, et al. "Human-Robot Facial Co-expression." Science Robotics (March 2024).
Methods
During the data collection phase, the robot generated symmetrical facial expressions, which we expected to cover most situations while also keeping the model size down. We used an Intel RealSense D435i to capture RGB images and cropped them to 480 × 320. We logged each motor command value together with the corresponding robot image to form a single data pair, without any human labeling.
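For orientation, the logging loop might look roughly like the sketch below. It assumes the pyrealsense2 Python bindings, treats the crop as a 480 (wide) × 320 (tall) center region of the color stream, and uses a hypothetical sample_symmetric_command() stand-in for the routine that generated the symmetrical expressions; apart from the camera model and crop size, these details are illustrative assumptions.

```python
import numpy as np
import pyrealsense2 as rs


def sample_symmetric_command(n_motors=26):
    # Hypothetical placeholder: the real command-generation policy is not part
    # of this sketch; it only mirrors the 26-DoF command format mentioned above.
    half = np.random.uniform(0.0, 1.0, size=n_motors // 2)
    return np.concatenate([half, half])  # symmetric left/right motor values


def center_crop(image, width=480, height=320):
    h, w = image.shape[:2]
    y0, x0 = (h - height) // 2, (w - width) // 2
    return image[y0:y0 + height, x0:x0 + width]


pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
pipeline.start(config)

pairs = []  # (motor command, cropped robot image) pairs, no human labeling
try:
    for _ in range(100):                      # number of samples is illustrative
        command = sample_symmetric_command()
        # ... send `command` to the robot and wait for the face to settle ...
        frames = pipeline.wait_for_frames()
        color = np.asanyarray(frames.get_color_frame().get_data())
        pairs.append((command, center_crop(color)))
finally:
    pipeline.stop()
```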