Explicit Modeling of Human-Object Interactions in Realistic Videos
April 2013 (vol. 35 no. 4)
pp. 835-848
A. Prest, Comput. Vision Lab., ETH Zurich, Zurich, Switzerland
V. Ferrari, IPAB Inst., Univ. of Edinburgh, Edinburgh, UK
C. Schmid, LEAR Team, INRIA Rhone-Alpes, St. Ismier, France
We introduce an approach for learning human actions as interactions between persons and objects in realistic videos. Previous work typically represents actions with low-level features such as image gradients or optical flow. In contrast, we explicitly localize in space and track over time both the object and the person, and represent an action as the trajectory of the object w.r.t. the person position. Our approach relies on state-of-the-art techniques for human detection [32], object detection [10], and tracking [39]. We show that this results in human and object tracks of sufficient quality to model and localize human-object interactions in realistic videos. Our human-object interaction features capture the relative trajectory of the object w.r.t. the human. Experimental results on the Coffee and Cigarettes dataset [25], the video dataset of [19], and the Rochester Daily Activities dataset [29] show that 1) our explicit human-object model is an informative cue for action recognition; and 2) it is complementary to traditional low-level descriptors such as 3D-HOG [23] extracted over human tracks. We show that combining our human-object interaction features with 3D-HOG improves over their individual performance as well as over the state of the art [23], [29].
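The core representation described above, the trajectory of the object relative to the person, can be illustrated with a minimal sketch. The abstract does not specify the exact feature definition, so the class names, the normalization by the human box size, and the helper function below are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Box:
    """Hypothetical per-frame detection: box center (x, y) plus width and height."""
    x: float
    y: float
    w: float
    h: float

def relative_trajectory(human_track: List[Box],
                        obj_track: List[Box]) -> List[Tuple[float, float]]:
    """Per-frame position of the tracked object relative to the tracked person.

    Normalizing by the human box dimensions (an assumed choice) makes the
    feature invariant to the person's scale in the image.
    """
    feats = []
    for h, o in zip(human_track, obj_track):
        dx = (o.x - h.x) / h.w  # horizontal offset in human-widths
        dy = (o.y - h.y) / h.h  # vertical offset in human-heights
        feats.append((dx, dy))
    return feats
```

A sequence of such (dx, dy) pairs over the video then forms the interaction descriptor that a classifier can consume alongside low-level descriptors such as 3D-HOG.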
Index Terms:
video signal processing, gesture recognition, human computer interaction, image sequences, learning (artificial intelligence), object detection, object tracking, realistic images, solid modelling, human-object interaction features, explicit modeling, human-object interactions, realistic videos, human action learning, low-level features, image gradients, optical flow, object trajectory, person position, human detection, relative trajectory, Coffee and Cigarettes dataset, Rochester Daily Activities dataset, explicit human-object model, action recognition, low-level descriptors, 3D-HOG, human tracks, target tracking, feature extraction, video analysis, human-object interaction
A. Prest, V. Ferrari, C. Schmid, "Explicit Modeling of Human-Object Interactions in Realistic Videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 4, pp. 835-848, April 2013, doi:10.1109/TPAMI.2012.175