Learning realistic lip motions for humanoid face robots
Data files
Jan 07, 2026 version files (193.99 MB total)
- README.md (1.46 KB)
- real_data.zip (93.71 MB)
- synthesized_data.zip (84.26 MB)
- synthesized_videos_for_eval.zip (16.03 MB)
Abstract
Lip motion carries outsized importance in human communication, capturing nearly half of our visual attention during conversation. Yet anthropomorphic robots often fail to achieve lip-audio synchronization, resulting in clumsy and lifeless lip behaviors. Two fundamental barriers underlie this challenge. First, robotic lips typically lack the mechanical complexity required to reproduce nuanced human mouth movements; second, existing synchronization methods depend on manually predefined movements and rules, restricting adaptability and realism. Here, we present a humanoid robot face designed to overcome these limitations, featuring soft silicone lips actuated by a ten-degree-of-freedom (10-DoF) mechanism. To achieve lip synchronization without predefined movements, we use a self-supervised learning pipeline based on a Variational Autoencoder (VAE) combined with a Facial Action Transformer, enabling the robot to autonomously infer realistic lip trajectories directly from speech audio. Our experimental results suggest that this method outperforms simple heuristics, such as amplitude-based baselines, in achieving visually coherent lip-audio synchronization. Furthermore, the learned synchronization generalizes across multiple linguistic contexts, enabling robot speech articulation in ten languages unseen during training.
https://doi.org/10.5061/dryad.j6q573nrc
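For orientation, the sketch below illustrates the kind of audio-to-lip-motion pipeline the abstract describes: a small VAE that embeds mouth observations in a latent space, and a transformer that maps speech-audio features to sequences of lip latents. It is a minimal PyTorch sketch under assumed names and dimensions (LipVAE, FacialActionTransformer, 80-dimensional audio features, a 32-dimensional latent); it is not the implementation released with this dataset.

```python
# Illustrative only: a small VAE that embeds mouth observations in a latent
# space, plus a transformer that maps speech-audio features to sequences of
# lip latents. Names, layer sizes, and feature dimensions are assumptions.
import torch
import torch.nn as nn


class LipVAE(nn.Module):
    """Encodes flattened mouth images into a low-dimensional lip latent."""

    def __init__(self, obs_dim=64 * 64, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, obs_dim)
        )

    def forward(self, x):
        h = self.encoder(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.decoder(z), mu, logvar


class FacialActionTransformer(nn.Module):
    """Maps a sequence of audio features to a sequence of lip latents."""

    def __init__(self, audio_dim=80, latent_dim=32, d_model=128):
        super().__init__()
        self.in_proj = nn.Linear(audio_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.out_proj = nn.Linear(d_model, latent_dim)

    def forward(self, audio_feats):               # (batch, time, audio_dim)
        h = self.backbone(self.in_proj(audio_feats))
        return self.out_proj(h)                   # (batch, time, latent_dim)


# Usage sketch: 2 s of 80-dim audio features at 50 frames per second.
recon, mu, logvar = LipVAE()(torch.rand(1, 64 * 64))
lip_latents = FacialActionTransformer()(torch.randn(1, 100, 80))  # -> (1, 100, 32)
```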
Description of the data and file structure
This repository contains datasets used in the experiments and evaluations presented in the paper.
1. real_data.zip
This archive contains real robot data collected during experiments.
- Actions: Robot action commands or control signals executed during data collection.
- Images: Corresponding visual observations captured by the robot (e.g., RGB images).
These data are used to train the Facial Action Transformer (FAT) model in our work.
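A quick way to check how many action and image files the archive contains, before extracting it, is the small sketch below. The assumption that action and image entries are distinguishable by name is based only on the description above, not on a documented layout.

```python
import zipfile

# Minimal inspection sketch for real_data.zip. The idea that entries can be
# grouped by "action"/"image" substrings in their paths is an assumption
# based on the description above.
with zipfile.ZipFile("real_data.zip") as zf:
    entries = [n for n in zf.namelist() if not n.endswith("/")]

actions = sorted(n for n in entries if "action" in n.lower())
images = sorted(n for n in entries if "image" in n.lower())
print(f"{len(actions)} action files, {len(images)} image files")
```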
2. synthesized_data.zip
This archive contains synthesized videos of the robot performing lip-sync motions.
- The videos are generated to simulate realistic robot facial movements synchronized with speech.
- These synthesized lip-sync videos are used as input to the FAT model.
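Before the videos can serve as model input, their frames (and, separately, their audio tracks) need to be decoded. The sketch below shows one way to read frames with OpenCV; the file path is a hypothetical placeholder, and the actual names inside synthesized_data.zip may differ.

```python
import cv2

# Illustrative sketch: read frames from one synthesized lip-sync video so
# they can be preprocessed for a FAT-style model. The path is a placeholder.
cap = cv2.VideoCapture("synthesized_data/example_lipsync.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))  # grayscale mouth frames
cap.release()
print(f"read {len(frames)} frames at {fps:.1f} fps")
```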
3. synthesized_videos_for_eval.zip
This archive contains fully synthesized robot video samples used for the evaluations reported in the paper.
- All visual and audio content in these videos is generated entirely by software.
- No real human subjects, human recordings, or human-derived audiovisual data are included.
- The videos were presented to participants during evaluation as synthetic stimuli to assess perceptual quality and motion realism, not to analyze or collect human subject data.
