CVPR 2011 (2011)
Providence, RI
June 20, 2011 to June 25, 2011
ISBN: 978-1-4577-0394-2
pp: 1281-1288
B. Sapp, Univ. of Pennsylvania, Philadelphia, PA, USA
D. Weiss, Univ. of Pennsylvania, Philadelphia, PA, USA
B. Taskar, Univ. of Pennsylvania, Philadelphia, PA, USA
We address the problem of articulated human pose estimation in videos using an ensemble of tractable models with rich appearance, shape, contour and motion cues. In previous articulated pose estimation work on unconstrained videos, using temporal coupling of limb positions has made little to no difference in performance over parsing frames individually. One crucial reason for this is that joint parsing of multiple articulated parts over time involves intractable inference and learning problems, and previous work has resorted to approximate inference and simplified models. We overcome these computational and modeling limitations using an ensemble of tractable submodels which couple locations of body joints within and across frames using expressive cues. Each submodel is responsible for tracking a single joint through time (e.g., left elbow) and also models the spatial arrangement of all joints in a single frame. Because of the tree structure of each submodel, we can perform efficient exact inference and use rich temporal features that depend on image appearance, e.g., color tracking and optical flow contours. We propose and experimentally investigate a hierarchy of submodel combination methods, and we find that a highly efficient max-marginal combination method outperforms much slower (by orders of magnitude) approximate inference using dual decomposition. We apply our pose model on a new video dataset of highly varied and articulated poses from TV shows. We show significant quantitative and qualitative improvements over state-of-the-art single-frame pose estimation approaches.
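The abstract's two central ideas, exact inference on tree-structured submodels and combination of submodels through max-marginals, can be illustrated with a toy sketch. This is not the authors' implementation: it assumes each submodel is a simple chain (the simplest tree) over discrete states, that all submodels score the same variables, and that combination means summing per-variable max-marginals and taking an independent argmax. The function names `chain_max_marginals` and `combine_by_max_marginals` and the score tables are hypothetical.

```python
def chain_max_marginals(unary, pairwise):
    """Exact max-marginals on a chain via max-sum forward and backward passes.

    unary[t][k]    : score of state k at position t (T positions, K states)
    pairwise[j][k] : score of moving from state j to state k (shared by all steps)
    Returns mm where mm[t][k] is the best total score of any full assignment
    that puts state k at position t.
    """
    T, K = len(unary), len(unary[0])
    fwd = [[0.0] * K for _ in range(T)]  # best score of a prefix ending in (t, k)
    bwd = [[0.0] * K for _ in range(T)]  # best score of the suffix after (t, k)

    fwd[0] = list(unary[0])
    for t in range(1, T):
        for k in range(K):
            fwd[t][k] = unary[t][k] + max(
                fwd[t - 1][j] + pairwise[j][k] for j in range(K)
            )

    for t in range(T - 2, -1, -1):
        for k in range(K):
            bwd[t][k] = max(
                pairwise[k][j] + unary[t + 1][j] + bwd[t + 1][j] for j in range(K)
            )

    return [[fwd[t][k] + bwd[t][k] for k in range(K)] for t in range(T)]


def combine_by_max_marginals(per_model_mm):
    """Assumed combination rule: sum each variable's max-marginals across
    submodels, then pick the highest-scoring state for each variable
    independently."""
    T, K = len(per_model_mm[0]), len(per_model_mm[0][0])
    decoded = []
    for t in range(T):
        totals = [sum(mm[t][k] for mm in per_model_mm) for k in range(K)]
        decoded.append(max(range(K), key=totals.__getitem__))
    return decoded


# Two hypothetical submodels over the same 3 variables with 2 states each.
smooth = [[1, 0], [0, 1]]  # rewards keeping the same state between neighbors
mm_a = chain_max_marginals([[1, 0], [0, 0], [0, 2]], smooth)
mm_b = chain_max_marginals([[0, 3], [0, 0], [0, 0]], smooth)
print(combine_by_max_marginals([mm_a, mm_b]))
```

Each submodel needs only one forward and one backward pass, so the combination step costs little more than running the submodels themselves. That is the kind of efficiency gap relative to iterative schemes such as dual decomposition that the abstract refers to.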
single-frame pose estimation, stretchable model, human motion parsing, articulated human pose estimation problem, tractable model ensemble, motion cues, unconstrained videos, temporal coupling, joint multiple articulated parts parsing, intractable inference, learning problem, approximate inference, tree structure, temporal features, image appearance, color tracking, optical flow contours, max-marginal combination method, dual decomposition, video dataset

B. Sapp, D. Weiss and B. Taskar, "Parsing human motion with stretchable models," CVPR 2011 (CVPR), Providence, RI, 2011, pp. 1281-1288.