Learning realistic lip motions for humanoid face robots

Data files

Version: Jan 07, 2026 (193.99 MB)


Abstract

Lip motion is of outsized importance in human communication, capturing nearly half of our visual attention during conversation. Yet anthropomorphic robots often fail to achieve lip-audio synchronization, resulting in clumsy and lifeless lip behaviors. Two fundamental barriers underlie this challenge: first, robotic lips typically lack the mechanical complexity required to reproduce nuanced human mouth movements; second, existing synchronization methods depend on manually predefined movements and rules, restricting adaptability and realism. Here, we present a humanoid robot face designed to overcome these limitations, featuring soft silicone lips actuated by a ten-degree-of-freedom (10-DoF) mechanism. To achieve lip synchronization without predefined movements, we use a self-supervised learning pipeline that combines a Variational Autoencoder (VAE) with a Facial Action Transformer, enabling the robot to autonomously infer realistic lip trajectories directly from speech audio. Our experimental results suggest that this method achieves more visually coherent lip-audio synchronization than simple amplitude-based heuristics. Furthermore, the learned synchronization generalizes across linguistic contexts, enabling robot speech articulation in ten languages unseen during training.
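The abstract names the pipeline's two components but not their internals, so the following is only a minimal illustrative sketch of one plausible reading: a VAE that learns a compact latent space of lip poses (here taken to be the 10-DoF actuator commands), and a transformer that maps per-frame audio features to trajectories in that latent space. All dimensions, layer sizes, and feature choices (latent size, mel-bin count, model depth) are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

N_DOF = 10        # lip actuator degrees of freedom (from the abstract)
LATENT = 8        # assumed latent dimension
AUDIO_FEAT = 80   # assumed per-frame audio feature size (e.g., mel bins)

class LipVAE(nn.Module):
    """VAE over single-frame lip poses (10-DoF actuator commands)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(N_DOF, 64), nn.ReLU())
        self.mu = nn.Linear(64, LATENT)
        self.logvar = nn.Linear(64, LATENT)
        self.dec = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(),
                                 nn.Linear(64, N_DOF))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(z), mu, logvar

class FacialActionTransformer(nn.Module):
    """Maps a sequence of audio features to a latent lip-pose trajectory."""
    def __init__(self, n_layers=2, n_heads=4, d_model=128):
        super().__init__()
        self.proj_in = nn.Linear(AUDIO_FEAT, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.proj_out = nn.Linear(d_model, LATENT)

    def forward(self, audio):  # audio: (batch, frames, AUDIO_FEAT)
        return self.proj_out(self.encoder(self.proj_in(audio)))

# Inference: audio frames -> latent trajectory -> decoded DoF commands.
vae, fat = LipVAE(), FacialActionTransformer()
audio = torch.randn(1, 50, AUDIO_FEAT)   # 50 frames of dummy audio features
with torch.no_grad():
    z_traj = fat(audio)                  # (1, 50, LATENT)
    dof_commands = vae.dec(z_traj)       # (1, 50, N_DOF) actuator targets
print(dof_commands.shape)

One motivation for this kind of two-stage design, consistent with the abstract's framing, is that the transformer predicts in a low-dimensional latent space learned self-supervised from lip poses rather than in raw actuator coordinates, which can yield smoother, more physically plausible trajectories without manually predefined movements.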