A Multimodal Fusion Framework for Controllable Human Motion Synthesis: Integrating Cross-Modal Conditioning with Diffusion-Based Generative Modeling

Authors

  • Chunming Zhao, Department of Computer Sciences, Sichuan University, Chengdu, Sichuan, China

DOI:

https://doi.org/10.53469/jrse.2025.07(07).09

Keywords:

NA

Abstract

This paper proposes UniMotion, a unified framework for generalized human motion generation that supports multimodal inputs spanning text, image, and audio. A unified prompt encoder maps these diverse inputs into a common cross-modal semantic space, and a two-stage motion decoder progressively generates detailed skeleton sequences. A multimodal alignment loss is further introduced to strengthen consistency modeling across prompts. UniMotion surpasses baseline methods by margins of 7.3% in semantic generalization assessments and 8.9% in prompt consistency evaluations, and it maintains 92.4% motion stability and logical coherence under random switching of multimodal prompts, demonstrating strong practicality and scalability. This work broadens the application potential of multimodal generative models in human motion modeling.
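To make the two core ideas from the abstract concrete, the sketch below illustrates a unified prompt encoder that projects text, image, and audio features into one shared semantic space, together with a contrastive-style multimodal alignment loss. This is a minimal illustration, not the authors' released implementation: the module names, feature dimensions, and the InfoNCE-style formulation of the alignment loss are assumptions made for the example.

```python
# Illustrative sketch only (assumed design, not UniMotion's actual code):
# a unified prompt encoder plus a symmetric contrastive alignment loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnifiedPromptEncoder(nn.Module):
    """Projects per-modality features into a shared d-dimensional space."""

    def __init__(self, text_dim=768, image_dim=1024, audio_dim=512, shared_dim=256):
        super().__init__()
        self.proj = nn.ModuleDict({
            "text": nn.Linear(text_dim, shared_dim),
            "image": nn.Linear(image_dim, shared_dim),
            "audio": nn.Linear(audio_dim, shared_dim),
        })

    def forward(self, features: torch.Tensor, modality: str) -> torch.Tensor:
        # L2-normalize so all modalities lie on the same unit hypersphere.
        return F.normalize(self.proj[modality](features), dim=-1)


def multimodal_alignment_loss(z_a, z_b, temperature: float = 0.07):
    """Symmetric InfoNCE-style loss pulling paired prompts from two modalities together."""
    logits = z_a @ z_b.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    encoder = UnifiedPromptEncoder()
    text_feat = torch.randn(8, 768)    # e.g. pooled text-encoder outputs (assumed)
    audio_feat = torch.randn(8, 512)   # e.g. pooled audio-encoder outputs (assumed)
    z_text = encoder(text_feat, "text")
    z_audio = encoder(audio_feat, "audio")
    print(multimodal_alignment_loss(z_text, z_audio).item())
```

In this reading, the shared embedding produced by the encoder would condition the two-stage motion decoder, while the alignment loss encourages prompts from different modalities that describe the same motion to land close together in the shared space.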

Published

2025-07-31

How to Cite

Zhao, C. (2025). A Multimodal Fusion Framework for Controllable Human Motion Synthesis: Integrating Cross-Modal Conditioning with Diffusion-Based Generative Modeling. Journal of Research in Science and Engineering, 7(7), 38–42. https://doi.org/10.53469/jrse.2025.07(07).09

Issue

Vol. 7 No. 7 (2025)

Section

Articles