
Abstract

Human motion prediction is a classic problem in computer vision and graphics, and predicting diverse future human motions has a wide range of practical applications. To tackle this problem, this study proposes predicting the diversity of future human motion using conditional denoising diffusion probabilistic models combined with the kinematics of human joints. First, the observed and predicted sequences are integrated into the same sample space using a mask mechanism, and Gaussian noise is gradually injected into the predicted sequence, following a cosine noise scheduler, to destroy the sequence structure. Subsequently, a spatial-temporal feature extractor and a channel enhancement module form a denoiser that learns the temporal dynamic evolution of the sample and the potential correlations between nodes during the diffusion process, in order to predict the noise and restore the sample information. The proposed method was evaluated on the Human3.6M and HumanEva-I datasets, and the experimental results show that it is competitive with previous methods in diversity prediction.
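The masking and noise-injection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the cosine scheduler takes the commonly used form from improved DDPMs (offset `s = 0.008`), that the mask is a binary tensor over frames, and that poses are stored as a `(frames, joints, 3)` array. All function names, sequence lengths, and the joint count below are hypothetical choices for illustration.

```python
import numpy as np


def cosine_alpha_bar(num_steps, s=0.008):
    """Cumulative signal level alpha_bar_t for t = 0..num_steps.

    Assumed form: cos^2 schedule with a small offset s, normalised so
    that alpha_bar[0] == 1 (no noise at step 0).
    """
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]


def masked_forward_diffusion(x0, mask, t, alpha_bar, rng):
    """Noise the whole sample to step t, then restore the observed frames.

    mask == 1 marks observed frames (kept clean),
    mask == 0 marks frames to be predicted (noised).
    """
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise
    return mask * x0 + (1.0 - mask) * x_t, noise


# Toy example with hypothetical sizes: 25 observed and 100 predicted frames.
rng = np.random.default_rng(0)
obs_len, pred_len, num_joints = 25, 100, 17
x0 = rng.standard_normal((obs_len + pred_len, num_joints, 3))

# The mask broadcasts over joints and coordinates.
mask = np.zeros((obs_len + pred_len, 1, 1))
mask[:obs_len] = 1.0

alpha_bar = cosine_alpha_bar(num_steps=100)
x_t, noise = masked_forward_diffusion(x0, mask, t=50, alpha_bar=alpha_bar, rng=rng)
```

In this sketch the observed frames pass through unchanged at every step, so the observed and predicted parts live in one sample while only the predicted part is progressively destroyed. A denoiser would then be trained to recover `noise` from `x_t`; its architecture is not specified in the abstract in enough detail to sketch here.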

Keywords

DDPM, self-attention mechanism, transformer, mask matrix, GCN



An Effective Conditional Transformer-Based Diffusion Model for Three-Dimensional Human Motion Prediction
