Abstract
Text-to-motion generation has recently attracted increasing attention in the research community, with potential applications in animation, virtual reality, robotics, and human–computer interaction. Diffusion and autoregressive models are two popular, parallel research directions for text-to-motion generation. However, diffusion models often suffer from error amplification during noise prediction, while autoregressive models exhibit mode collapse due to motion discretization. To address these limitations, we propose a flexible, high-fidelity, and semantically faithful text-to-motion framework named Coordinates-based Dual-constrained Autoregressive Motion Generation (CDAMD). Taking coordinates as input, CDAMD follows the autoregressive paradigm and leverages a diffusion-based multi-layer perceptron to enhance the fidelity of predicted motions. A Dual-Constrained Causal Mask is designed for autoregressive generation, with motion tokens serving as priors and concatenated with the textual encodings. Since there is little prior work on coordinate-based motion synthesis, we establish benchmarks for both text-to-motion generation and motion editing. Our approach achieves state-of-the-art performance in both fidelity and semantic faithfulness on these benchmarks.
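At a high level, the autoregressive paradigm described above, sampling one motion token at a time conditioned on the text and the tokens generated so far, then refining the result, can be sketched as follows. This is a minimal illustration only: `predict_logits` and `refine` are hypothetical stand-ins for the model's next-token head and its diffusion-MLP refinement stage, not the actual interfaces of CDAMD.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(text_emb, num_steps, predict_logits, refine):
    """Illustrative autoregressive loop: sample one motion token per step
    conditioned on the text embedding and previously generated tokens,
    then pass the sequence through a refinement stage (a stand-in here
    for a diffusion-based MLP head)."""
    tokens = []
    for _ in range(num_steps):
        # Next-token distribution given text and motion-token history.
        logits = predict_logits(text_emb, tokens)
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return refine(tokens)
```

With a dummy uniform next-token head over an 8-entry codebook and identity refinement, `generate(None, 5, ...)` returns five token indices in `[0, 8)`.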
Methodology
Architecture illustration of CDAMD. (a) The Hybrid Motion Encoder encodes the raw motion sequence into a compact, fine-grained latent space. (b) The CDAMD model learns to autoregressively predict the next tokens, conditioned on text embeddings from CLIP and compressed motion tokens from an RVQ-VAE. (c) Dual-Constrained Causal Attention (DCCA) enforces both temporal and conditional causality, ensuring that motion generation proceeds autoregressively while preserving semantic conditioning.
Experiments
Quantitative Results:
Qualitative Results:
Video Presentation
Text to Motion 1: "A person is crouched down and walking around sneakily."
CoAD (ours)
MoMask
BAMM
Text to Motion 2: "A person steps back and sits down, then stands back up again and walks forward."
CoAD (ours)
MoMask
BAMM
Text to Motion 3: "A person walks forward, raises his left arm in front of him, then lowers his arm and walks backwards."
CoAD (ours)
MoMask
BAMM
Text to Motion 4: "The person was laying down and then they got up."
CoAD (ours)
MoMask
BAMM
BibTeX
@article{YourPaperKey2024,
title={Your Paper Title Here},
author={First Author and Second Author and Third Author},
journal={Conference/Journal Name},
year={2024},
url={https://your-domain.com/your-project-page}
}