Hierarchical Cross-Modal Alignment for Controllable Human Motion Synthesis: A Geometric Deep Learning Framework
DOI: https://doi.org/10.53469/jrse.2025.07(07).08

Keywords: Cross-modal alignment, Motion generation, Nested modeling, Semantic representation, Skeletal sequence, Language-driven motion, Motion structure constraint

Abstract
The study addresses the problem of human motion synthesis in the absence of motion capture data. A new paradigm is introduced for motion generation based on cross-modal nested alignment. The method includes a multi-scale semantic alignment module that models natural language prompts and skeletal motion sequences in a nested manner at both the local and global levels. In addition, spatio-temporal structural priors are incorporated to improve motion continuity and semantic accuracy. On the HumanML3D and T2M-Gen datasets, the proposed method improves the motion coverage metric by 12.1%, reduces motion smoothness error by 17.3%, and decreases the average inter-frame drift error by 13.5%. Compared with current mainstream models, it shows greater robustness when handling complex semantic prompts and generating long motion sequences. This study offers a new approach to motion generation driven by cross-modal alignment.
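The nested local/global alignment described in the abstract can be illustrated with a small sketch. The code below is a hypothetical NumPy implementation, not the authors' actual method: it assumes per-phrase text embeddings and per-segment motion embeddings, contrasts their mean-pooled global vectors across a batch with a symmetric InfoNCE-style loss, and adds a local term that encourages each phrase to match at least one motion segment within its paired sequence. All function names, the temperature, and the local/global weighting are illustrative assumptions.

```python
import numpy as np

def _cosine_sim(a, b):
    # Row-wise cosine similarity matrix between two sets of vectors.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def _softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nested_alignment_loss(text_local, motion_local, temperature=0.07, w_local=0.5):
    """Hypothetical two-level alignment loss.

    text_local:   (B, Nt, D) per-phrase text embeddings
    motion_local: (B, Nm, D) per-segment motion embeddings
    """
    B = text_local.shape[0]

    # Global level: mean-pool each modality and contrast across the batch,
    # so each text matches its own motion sequence (diagonal of the logits).
    tg = text_local.mean(axis=1)                     # (B, D)
    mg = motion_local.mean(axis=1)                   # (B, D)
    logits = _cosine_sim(tg, mg) / temperature       # (B, B)
    idx = np.arange(B)

    def xent(lg):
        # Cross-entropy with the diagonal as the positive pair.
        return -np.log(_softmax(lg, axis=1)[idx, idx] + 1e-9).mean()

    global_loss = 0.5 * (xent(logits) + xent(logits.T))

    # Local level: within each paired sequence, each phrase should attend
    # strongly to its best-matching motion segment.
    local_terms = []
    for b in range(B):
        s = _cosine_sim(text_local[b], motion_local[b]) / temperature  # (Nt, Nm)
        local_terms.append(-np.log(_softmax(s, axis=1).max(axis=1) + 1e-9).mean())
    local_loss = float(np.mean(local_terms))

    return (1.0 - w_local) * global_loss + w_local * local_loss
```

In this sketch the global term enforces sentence-to-sequence correspondence while the local term enforces phrase-to-segment correspondence, giving the nested structure the abstract refers to; a real system would learn the embeddings jointly rather than receive them precomputed.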
Copyright (c) 2025 Mingyou Zeng

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.