Hierarchical Cross-Modal Alignment for Controllable Human Motion Synthesis: A Geometric Deep Learning Framework
DOI: https://doi.org/10.53469/jrse.2025.07(07).08

Keywords: Cross-modal alignment, Motion generation, Nested modeling, Semantic representation, Skeletal sequence, Language-driven motion, Motion structure constraint

Abstract
The study addresses the problem of human motion synthesis in the absence of motion capture data. A new paradigm is introduced for motion generation based on cross-modal nested alignment. The method includes a multi-scale semantic alignment module that models natural language prompts and skeletal motion sequences in a nested manner at both the local and global levels. In addition, spatio-temporal structural priors are incorporated to improve motion continuity and semantic accuracy. On the HumanML3D and T2M-Gen datasets, the proposed method improves the motion coverage metric by 12.1%, reduces motion smoothness error by 17.3%, and decreases the average inter-frame drift error by 13.5%. Compared with current mainstream models, it shows greater robustness when handling complex semantic prompts and generating long motion sequences. This study offers a new approach to motion generation driven by cross-modal alignment.
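The nested local/global alignment described in the abstract can be illustrated with a small sketch. The code below is a hypothetical NumPy implementation, not the authors' actual method: it assumes per-phrase text embeddings and per-segment motion embeddings, contrasts their mean-pooled global vectors across a batch with a symmetric InfoNCE-style loss, and adds a local term that encourages each phrase to match at least one motion segment within its paired sequence. All function names, the temperature, and the local/global weighting are illustrative assumptions.

```python
import numpy as np

def _cosine_sim(a, b):
    # Row-wise cosine similarity matrix between two sets of vectors.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def _softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nested_alignment_loss(text_local, motion_local, temperature=0.07, w_local=0.5):
    """Hypothetical two-level alignment loss.

    text_local:   (B, Nt, D) per-phrase text embeddings
    motion_local: (B, Nm, D) per-segment motion embeddings
    """
    B = text_local.shape[0]

    # Global level: mean-pool each modality and contrast across the batch,
    # so each text matches its own motion sequence (diagonal of the logits).
    tg = text_local.mean(axis=1)                     # (B, D)
    mg = motion_local.mean(axis=1)                   # (B, D)
    logits = _cosine_sim(tg, mg) / temperature       # (B, B)
    idx = np.arange(B)

    def xent(lg):
        # Cross-entropy with the diagonal as the positive pair.
        return -np.log(_softmax(lg, axis=1)[idx, idx] + 1e-9).mean()

    global_loss = 0.5 * (xent(logits) + xent(logits.T))

    # Local level: within each paired sequence, each phrase should attend
    # strongly to its best-matching motion segment.
    local_terms = []
    for b in range(B):
        s = _cosine_sim(text_local[b], motion_local[b]) / temperature  # (Nt, Nm)
        local_terms.append(-np.log(_softmax(s, axis=1).max(axis=1) + 1e-9).mean())
    local_loss = float(np.mean(local_terms))

    return (1.0 - w_local) * global_loss + w_local * local_loss
```

In this sketch the global term enforces sentence-to-sequence correspondence while the local term enforces phrase-to-segment correspondence, giving the nested structure the abstract refers to; a real system would learn the embeddings jointly rather than receive them precomputed.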
Copyright (c) 2025 Mingyou Zeng

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.