Human pose estimation models struggle to use contextual information in images or video frames. Text-to-pose generation models, meanwhile, are trained on limited data and often fail to generate accurate poses for novel prompts. PoseGPT is a multimodal language model that comprehends 3D human pose in addition to processing image and text data. It handles speculative pose generation, such as producing the pose a person might adopt when tired, and reasoning-based pose estimation, such as estimating the pose of an individual wearing eyeglasses.
Paper link: https://arxiv.org/pdf/2311.18836.pdf
Project page: https://yfeng95.github.io/posegpt/
Table of contents:
00:00 Introduction
07:23 Architecture
11:58 LoRA
19:12 Data Construction
22:41 Speculative Pose Generation (SPG)
25:50 Reasoning-based Pose Estimation (RPE)
Icon made by Freepik from flaticon.com