2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017)
Honolulu, Hawaii, USA
July 21, 2017 to July 26, 2017
In this paper, we propose a novel deep end-to-end network that automatically learns spatial-temporal fusion features for video-based person re-identification. Specifically, the proposed network consists of a CNN and an RNN that jointly learn the spatial and temporal features of input image sequences. The network is optimized with the siamese and softmax losses simultaneously, pulling instances of the same person closer together and pushing instances of different persons apart. The network is trained separately on full-body and part-body image sequences to learn complementary representations from holistic and local perspectives. Combining the two yields more discriminative features that benefit person re-identification. Experiments conducted on the PRID-2011, iLIDS-VID and MARS datasets show that the proposed method performs favorably against existing approaches.
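The joint objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption made for illustration: the siamese term is taken to be a standard contrastive loss on Euclidean distance, the softmax term is ordinary cross-entropy over identity logits, the two are summed with equal weight, and the names `contrastive_loss`, `softmax_cross_entropy`, `joint_loss` and the `margin` value are hypothetical. The feature vectors stand in for the sequence-level outputs of the CNN and RNN.

```python
import math


def contrastive_loss(f_a, f_b, same_identity, margin=2.0):
    """Siamese (contrastive) term on two sequence-level feature vectors.

    Assumed form: same-person pairs are penalized by squared distance,
    different-person pairs only while they are closer than `margin`.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_a, f_b)))
    if same_identity:
        return 0.5 * dist ** 2                    # pull same-person pairs together
    return 0.5 * max(0.0, margin - dist) ** 2     # push different persons apart


def softmax_cross_entropy(logits, label):
    """Identity-classification term: negative log softmax probability of `label`."""
    m = max(logits)                               # subtract max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def joint_loss(f_a, f_b, logits_a, logits_b, label_a, label_b, margin=2.0):
    """Equal-weight sum of the siamese term and both softmax terms (an assumption)."""
    same = label_a == label_b
    return (contrastive_loss(f_a, f_b, same, margin)
            + softmax_cross_entropy(logits_a, label_a)
            + softmax_cross_entropy(logits_b, label_b))
```

In a full pipeline, one such loss would be computed for the full-body stream and one for the part-body stream, with the resulting features combined at test time; how the paper weights the terms and fuses the two streams is not stated in the abstract.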
Feature extraction, Image sequences, Data mining, Fuses, Visualization, Lighting, Recurrent neural networks
L. Chen, H. Yang, J. Zhu, Q. Zhou, S. Wu and Z. Gao, "Deep Spatial-Temporal Fusion Network for Video-Based Person Re-identification," 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, Hawaii, USA, 2017, pp. 1478-1485.