Abstract
Text-to-motion generation has recently attracted increasing attention in the research community, with potential applications in animation, virtual reality, robotics, and human–computer interaction. Diffusion and autoregressive models are two popular, parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework named Coordinates-based Dual-constrained Autoregressive Motion Generation (CDAMD). Taking coordinates as input, CDAMD follows the autoregressive paradigm and leverages a diffusion-based multi-layer perceptron to enhance the fidelity of predicted motions. A Dual-Constrained Causal Mask is designed for autoregressive generation, with motion tokens serving as priors and concatenated with the textual encodings. Since there is little prior work on coordinate-based motion synthesis, we establish benchmarks for both text-to-motion generation and motion editing. Our approach achieves state-of-the-art performance in both fidelity and semantic faithfulness on these benchmarks.
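At a high level, the autoregressive paradigm described above, sampling one motion token at a time conditioned on the text and the tokens generated so far, then refining the result, can be sketched as follows. This is a minimal illustration only: `predict_logits` and `refine` are hypothetical stand-ins for the model's next-token head and its diffusion-MLP refinement stage, not the actual interfaces of CDAMD.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(text_emb, num_steps, predict_logits, refine):
    """Illustrative autoregressive loop: sample one motion token per step
    conditioned on the text embedding and previously generated tokens,
    then pass the sequence through a refinement stage (a stand-in here
    for a diffusion-based MLP head)."""
    tokens = []
    for _ in range(num_steps):
        # Next-token distribution given text and motion-token history.
        logits = predict_logits(text_emb, tokens)
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return refine(tokens)
```

With a dummy uniform next-token head over an 8-entry codebook and identity refinement, `generate(None, 5, ...)` returns five token indices in `[0, 8)`.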
Methodology
Architecture illustration of CDAMD. (a) The Hybrid Motion Encoder encodes the raw motion sequence into a compact, fine-grained latent space. (b) The CDAMD model learns to autoregressively predict the next tokens, conditioned on text embeddings from CLIP and compressed motion tokens from an RVQ-VAE. (c) Dual-Constrained Causal Attention (DCCA) enforces both temporal and conditional causality, ensuring that motion generation proceeds autoregressively while preserving semantic conditioning.
Experiments
Quantitative Results:
Qualitative Results:
Video Presentation
Text to Motion 1: "A person is crouched down and walking around sneakily."
CoAD (ours)
MoMask
BAMM
Text to Motion 2: "A person steps back and sits down, then stands back up again and walks forward."
CoAD (ours)
MoMask
BAMM
Text to Motion 3: "A person walks forward, raises his left arm in front of him, then lowers his arm and walks backwards."
CoAD (ours)
MoMask
BAMM
Text to Motion 4: "The person was laying down and then they got up."
CoAD (ours)
MoMask
BAMM
BibTeX
@article{YourPaperKey2024,
title={Your Paper Title Here},
author={First Author and Second Author and Third Author},
journal={Conference/Journal Name},
year={2024},
url={https://your-domain.com/your-project-page}
}