Abstract—We consider the automated recognition of human actions in surveillance videos. Most current methods build classifiers based on complex handcrafted features computed from the raw inputs. Convolutional neural networks (CNNs) are a type of deep model that can act directly on the raw inputs. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition. This model extracts features from both the spatial and the temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation combines information from all channels. To further boost the performance, we propose regularizing the outputs with high-level features and combining the predictions of a variety of different models. We apply the developed models to recognize human actions in the real-world environment of airport surveillance videos, and they achieve superior performance in comparison to baseline methods.
Recognizing human actions in the real-world environment finds applications in a variety of domains including intelligent video surveillance, customer attributes, and shopping behavior analysis. However, accurate recognition of actions is a highly challenging task due to cluttered backgrounds, occlusions, and viewpoint variations, etc. [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11]. Most of the current approaches [12], [13], [14], [15], [16] make certain assumptions (e.g., small scale and viewpoint changes) about the circumstances under which the video was taken. However, such assumptions seldom hold in the real-world environment. In addition, most of the methods follow a two-step approach in which the first step computes features from raw video frames and the second step learns classifiers based on the obtained features. In real-world scenarios, it is rarely known what features are important for the task at hand since the choice of features is highly problem-dependent. Especially for human action recognition, different action classes may appear dramatically different in terms of their appearances and motion patterns.
Deep learning models [17], [18], [19], [20], [21] are a class of machines that can learn a hierarchy of features by building high-level features from low-level ones. Such learning machines can be trained using either supervised or unsupervised approaches, and the resulting systems have been shown to yield competitive performance in visual object recognition [17], [19], [22], [23], [24], human action recognition [25], [26], [27], natural language processing [28], audio classification [29], brain-computer interaction [30], human tracking [31], image restoration [32], denoising [33], and segmentation tasks [34]. Convolutional neural networks (CNNs) [17] are a type of deep model in which trainable filters and local neighborhood pooling operations are applied alternately on the raw input images, resulting in a hierarchy of increasingly complex features. It has been shown that, when trained with appropriate regularization [35], [36], [37], CNNs can achieve superior performance on visual object recognition tasks. In addition, CNNs have been shown to be invariant to certain variations such as pose, lighting, and surrounding clutter [38].
As a class of deep models for feature construction, CNNs have been primarily applied on 2D images. In this paper, we explore the use of CNNs for human action recognition in videos. A simple approach in this direction is to treat video frames as still images and apply CNNs to recognize actions at the individual frame level. Indeed, this approach has been used to analyze the videos of developing embryos [39]. However, such an approach does not consider the motion information encoded in multiple contiguous frames. To effectively incorporate the motion information in video analysis, we propose to perform 3D convolution in the convolutional layers of CNNs so that discriminative features along both the spatial and the temporal dimensions are captured. We show that, by applying multiple distinct convolutional operations at the same location on the input, multiple types of features can be extracted. Based on the proposed 3D convolution, a variety of 3D CNN architectures can be devised to analyze video data. We develop a 3D CNN architecture that generates multiple channels of information from adjacent video frames and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining information from all channels. To further boost the performance of 3D CNN models, we propose to augment the models with auxiliary outputs computed as high-level motion features and integrate the outputs of a variety of different architectures in making predictions.
We evaluated the developed 3D CNN model on the TREC Video Retrieval Evaluation (TRECVID) data, which consist of surveillance video data recorded at London Gatwick Airport. We constructed a multimodule event detection system, which includes the 3D CNN as a major module, and participated in three tasks of the TRECVID 2009 Evaluation for Surveillance Event Detection [25]. Our system achieved the best performance on all three participating action categories (i.e., CellToEar, ObjectPut, and Pointing). To provide an independent evaluation of the 3D CNN model, we report its performance on the TRECVID 2008 development set in this paper. We also present results on the KTH data as published performance for this data is available. Our experiments show that the developed 3D CNN model outperforms other baseline methods on the TRECVID data, and it achieves competitive performance on the KTH data, demonstrating that the 3D CNN model is more effective for real-world environments such as those captured in the TRECVID data. The experiments also validate that the 3D CNN model significantly outperforms the frame-based 2D CNN for most tasks.
The key contributions of this work can be summarized as follows:
• We propose to apply the 3D convolution operation to extract spatial and temporal features from video data for action recognition. These 3D feature extractors operate in both the spatial and the temporal dimensions, thus capturing motion information in video streams.
• We develop a 3D convolutional neural network architecture based on the 3D convolution feature extractors. This CNN architecture generates multiple channels of information from adjacent video frames and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining information from all channels.
• We propose to regularize the 3D CNN models by augmenting the models with auxiliary outputs computed as high-level motion features. We further propose to boost the performance of 3D CNN models by combining the outputs of a variety of different architectures.
• We evaluate the 3D CNN models on the TRECVID 2008 development set in comparison with baseline methods and alternative architectures. Experimental results show that the proposed models significantly outperform the 2D CNN architecture and other baseline methods.
The rest of this paper is organized as follows: We describe the 3D convolution operation and the 3D CNN architecture employed in our TRECVID action recognition system in Section 2. Some related work for action recognition is discussed in Section 3. The experimental results on the TRECVID and KTH data are reported in Section 4. We conclude in Section 5 with discussions.
2. 3D Convolutional Neural Networks
In 2D CNNs, 2D convolution is performed at the convolutional layers to extract features from local neighborhoods on feature maps in the previous layer. Then an additive bias is applied and the result is passed through a sigmoid function. Formally, the value of a unit at position $(x, y)$ in the $j$th feature map in the $i$th layer, denoted as $v_{ij}^{xy}$, is given by

$$v_{ij}^{xy} = \tanh\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} w_{ijm}^{pq} \, v_{(i-1)m}^{(x+p)(y+q)} \right),$$

where $\tanh(\cdot)$ is the hyperbolic tangent function, $b_{ij}$ is the bias for this feature map, $m$ indexes over the set of feature maps in the $(i-1)$th layer connected to the current feature map, $w_{ijm}^{pq}$ is the value at the position $(p, q)$ of the kernel connected to the $m$th feature map, and $P_i$ and $Q_i$ are the height and width of the kernel, respectively. In the subsampling layers, the resolution of the feature maps is reduced by pooling over local neighborhoods on the feature maps in the previous layer, thereby enhancing the invariance to distortions on the inputs. A CNN architecture can be constructed by stacking multiple layers of convolution and subsampling in an alternating fashion. The parameters of the CNN, such as the bias $b_{ij}$ and the kernel weight $w_{ijm}^{pq}$, are usually learned using either supervised or unsupervised approaches [17], [22].
2.1 3D Convolution
In 2D CNNs, convolutions are applied on the 2D feature maps to compute features from the spatial dimensions only. When applied to video analysis problems, it is desirable to capture the motion information encoded in multiple contiguous frames. To this end, we propose to perform 3D convolutions in the convolution stages of CNNs to compute features from both spatial and temporal dimensions. The 3D convolution is achieved by convolving a 3D kernel with the cube formed by stacking multiple contiguous frames together. By this construction, the feature maps in the convolution layer are connected to multiple contiguous frames in the previous layer, thereby capturing motion information. Formally, the value at position $(x, y, z)$ on the $j$th feature map in the $i$th layer is given by

$$v_{ij}^{xyz} = \tanh\left( b_{ij} + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1} w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \right),$$

where $R_i$ is the size of the 3D kernel along the temporal dimension, and $w_{ijm}^{pqr}$ is the $(p, q, r)$th value of the kernel connected to the $m$th feature map in the previous layer. A comparison of 2D and 3D convolutions is given in Fig. 1.
Fig. 1. Comparison of (a) 2D and (b) 3D convolutions. In (b) the size of the convolution kernel in the temporal dimension is 3 and the sets of connections are color-coded so that the shared weights are in the same color. In 3D convolution, the same 3D kernel is applied to overlapping 3D cubes in the input video to extract motion features.
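As an illustration of the 3D convolution defined above, the following is a minimal pure-Python sketch, not the C++ implementation used in the paper. The function names `conv3d_unit` and `conv3d` and the nested-list data layout are assumptions made for this example; only valid (unpadded) positions are computed, and a single bias is shared by the output feature map.

```python
import math


def conv3d_unit(prev_maps, kernels, bias, x, y, z):
    """Value of one output unit at spatial position (x, y) and time z.

    prev_maps[m][z][x][y]: feature maps (frame cubes) of the previous layer.
    kernels[m][r][p][q]:   one 3D kernel per connected previous-layer map.
    """
    total = bias
    for fmap, kernel in zip(prev_maps, kernels):       # sum over m
        for r, k_slice in enumerate(kernel):           # temporal offset r
            for p, k_row in enumerate(k_slice):        # height offset p
                for q, weight in enumerate(k_row):     # width offset q
                    total += weight * fmap[z + r][x + p][y + q]
    return math.tanh(total)


def conv3d(prev_maps, kernels, bias):
    """Apply the same kernels at every valid position of the input cube."""
    frames = len(prev_maps[0])
    height = len(prev_maps[0][0])
    width = len(prev_maps[0][0][0])
    k_t = len(kernels[0])
    k_h = len(kernels[0][0])
    k_w = len(kernels[0][0][0])
    return [[[conv3d_unit(prev_maps, kernels, bias, x, y, z)
              for y in range(width - k_w + 1)]
             for x in range(height - k_h + 1)]
            for z in range(frames - k_t + 1)]
```

Because the kernel spans several frames, each output slice depends on multiple contiguous input frames, which is what lets the feature map respond to motion. A cube of 7 frames convolved with a kernel of temporal size 3 yields 5 output slices.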
Note that a 3D convolutional kernel can only extract one type of feature from the frame cube since the kernel weights are replicated across the entire cube. A general design principle of CNNs is that the number of feature maps should be increased in later layers by generating multiple types of features from the same set of lower level feature maps. Similarly to the case of 2D convolution, this can be achieved by applying multiple 3D convolutions with distinct kernels to the same location in the previous layer (Fig. 2).
Fig. 2. Extraction of multiple features from contiguous frames. Multiple 3D convolutions can be applied to contiguous frames to extract multiple features. As in Fig. 1 , the sets of connections are color-coded so that the shared weights are in the same color. Note that all six sets of connections do not share weights, resulting in two different feature maps on the right.
2.2 A 3D CNN Architecture
Based on the 3D convolution described above, a variety of CNN architectures can be devised. In the following, we describe a 3D CNN architecture that we have developed for human action recognition on the TRECVID data set. In this architecture, shown in Fig. 3, we consider seven frames of size $60 \times 40$ centered on the current frame as inputs to the 3D CNN model. We first apply a set of hardwired kernels to generate multiple channels of information from the input frames. This results in 33 feature maps in the second layer in five different channels denoted by gray, gradient-x, gradient-y, optflow-x, and optflow-y. The gray channel contains the gray pixel values of the seven input frames. The feature maps in the gradient-x and gradient-y channels are obtained by computing gradients along the horizontal and vertical directions, respectively, on each of the seven input frames, and the optflow-x and optflow-y channels contain the optical flow fields along the horizontal and vertical directions, respectively, computed from adjacent input frames. This hardwired layer is employed to encode our prior knowledge on features, and this scheme usually leads to better performance as compared to the random initialization.
Fig. 3. A 3D CNN architecture for human action recognition. This architecture consists of one hardwired layer, three convolution layers, two subsampling layers, and one full connection layer. Detailed descriptions are given in the text.
We then apply 3D convolutions with a kernel size of $7 \times 7 \times 3$ ($7 \times 7$ in the spatial dimension and 3 in the temporal dimension) on each of the five channels separately. To increase the number of feature maps, two sets of different convolutions are applied at each location, resulting in two sets of feature maps in the C2 layer each consisting of 23 feature maps. In the subsequent subsampling layer S3, we apply $2 \times 2$ subsampling on each of the feature maps in the C2 layer, which leads to the same number of feature maps with a reduced spatial resolution. The next convolution layer C4 is obtained by applying 3D convolution with a kernel size of $7 \times 6 \times 3$ on each of the five channels in the two sets of feature maps separately. To increase the number of feature maps, we apply three convolutions with different kernels at each location, leading to six distinct sets of feature maps in the C4 layer, each containing 13 feature maps. The next layer S5 is obtained by applying $3 \times 3$ subsampling on each feature map in the C4 layer, which leads to the same number of feature maps with a reduced spatial resolution. At this stage, the size of the temporal dimension is already relatively small (3 for gray, gradient-x, gradient-y, and 2 for optflow-x and optflow-y), so we perform convolution only in the spatial dimension at this layer. The size of the convolution kernel used is $7 \times 4$ so that the sizes of the output feature maps are reduced to $1 \times 1$. The C6 layer consists of 128 feature maps of size $1 \times 1$, and each of them is connected to all 78 feature maps in the S5 layer.
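The feature-map counts quoted above (33, 23, 13, 78, and 128) can be checked with a short bookkeeping sketch. The spatial sizes used here (a $60 \times 40$ input, $7 \times 7 \times 3$ and $7 \times 6 \times 3$ convolution kernels, $2 \times 2$ and $3 \times 3$ subsampling) are treated as assumptions about the architecture in Fig. 3, and the names `LAYERS` and `trace_3d_cnn` are hypothetical.

```python
# (name, kind, size, number of distinct kernel sets applied at each location)
LAYERS = [
    ("C2", "conv", (7, 7, 3), 2),
    ("S3", "pool", (2, 2), None),
    ("C4", "conv", (7, 6, 3), 3),
    ("S5", "pool", (3, 3), None),
]


def trace_3d_cnn(height=60, width=40, n_frames=7):
    """Return (layer, total feature maps, spatial size) for each layer."""
    # Temporal length per channel: gray, gradient-x, gradient-y use every
    # frame; optflow-x and optflow-y are computed between adjacent frames.
    temporal = [n_frames] * 3 + [n_frames - 1] * 2
    sets = 1
    trace = [("H1", sum(temporal), (height, width))]
    for name, kind, size, multiplier in LAYERS:
        if kind == "conv":
            k_h, k_w, k_t = size
            height, width = height - k_h + 1, width - k_w + 1
            temporal = [t - k_t + 1 for t in temporal]
            sets *= multiplier
        else:
            p_h, p_w = size
            height, width = height // p_h, width // p_w
        trace.append((name, sets * sum(temporal), (height, width)))
    # C6: 2D convolution whose kernel covers the whole remaining map,
    # producing 128 maps of size 1 x 1 (the 128D feature vector).
    trace.append(("C6", 128, (height - height + 1, width - width + 1)))
    return trace


if __name__ == "__main__":
    for layer in trace_3d_cnn():
        print(layer)
```

Under these assumptions the hardwired layer has 7 + 7 + 7 + 6 + 6 = 33 maps, each C2 set has 5 + 5 + 5 + 4 + 4 = 23 maps, and each C4 set has 3 + 3 + 3 + 2 + 2 = 13 maps, so S5 holds 6 × 13 = 78 maps.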
After the multiple layers of convolution and subsampling, the seven input frames have been converted into a 128D feature vector capturing the motion information in the input frames. The output layer consists of the same number of units as the number of actions, and each unit is fully connected to each of the 128 units in the C6 layer. In this design, we essentially apply a linear classifier on the 128D feature vector for action classification. All the trainable parameters in this model are initialized randomly and trained by the online error back-propagation algorithm as described in [17]. We have designed and evaluated other 3D CNN architectures that combine multiple channels of information at different stages, and our results show that this architecture gives the best performance.
2.3 Model Regularization
The inputs to 3D CNN models are limited to a small number of contiguous video frames due to the increased number of trainable parameters as the size of the input window increases. On the other hand, many human actions span a number of frames. Hence, it is desirable to encode high-level motion information into the 3D CNN models. To this end, we propose computing motion features from a large number of frames and regularizing the 3D CNN models by using these motion features as auxiliary outputs (Fig. 4). Similar ideas have been used in image classification tasks [35], [36], [37], but their effectiveness in action recognition is not clear. In particular, for each training action we generate a feature vector encoding the long-term action information beyond the information contained in the input frame cube to the CNN. We then encourage the CNN to learn a feature vector close to this feature. This is achieved by connecting a number of auxiliary output units to the last hidden layer of the CNN and clamping the computed feature vectors on the auxiliary units during training. This will encourage the hidden layer information to be close to the high-level motion feature. More details on this scheme can be found in [35], [36], and [37]. In the experiments, we use the bag-of-words features constructed from dense SIFT descriptors [40] computed on raw gray images and motion edge history images (MEHI) [41] as auxiliary features. Results show that such a regularization scheme leads to consistent performance improvements.
Fig. 4. The regularized 3D CNN architecture.
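The training objective implied by this scheme can be sketched as a weighted sum of two loss terms. The squared-error form of each term and the function name `regularized_loss` are assumptions for illustration; the weights (1 for the action classes, 0.005 for the auxiliary outputs) are the values reported in Section 2.5.

```python
def squared_error(outputs, targets):
    """Sum of squared differences; an assumed form of the per-head loss."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets))


def regularized_loss(class_out, class_target, aux_out, aux_target,
                     class_weight=1.0, aux_weight=0.005):
    """Weighted sum of the action-class loss and the auxiliary-output loss.

    aux_target is the high-level motion feature clamped on the auxiliary
    units during training; it is not used at prediction time.
    """
    class_loss = squared_error(class_out, class_target)
    aux_loss = squared_error(aux_out, aux_target)
    return class_weight * class_loss + aux_weight * aux_loss
```

With such a small auxiliary weight, the auxiliary term nudges the shared hidden layer toward the long-term motion feature without dominating the classification objective.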
2.4 Model Combination
Based on the 3D convolution operations, a variety of 3D CNN architectures can be designed. Among the architectures considered in this paper, the one introduced in Section 2.2 yields the best performance on the TRECVID data. However, this may not be the case for other data sets. The selection of optimal architecture for a problem is challenging since this depends on the specific applications. An alternative approach is to construct multiple models and combine the outputs of these models for making predictions [42], [43], [44]. This scheme has also been used in combining traditional neural networks [45]. However, the effect of model combination in the context of convolutional neural networks for action recognition has not been investigated. In this paper, we propose constructing multiple 3D CNN models with different architectures, hence capturing potentially complementary information from the inputs. In the prediction phase, the input is given to each model and the outputs of these models are then combined. Experimental results demonstrate that this model combination scheme is very effective in boosting the performance of 3D CNN models on action recognition tasks.
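The text does not state how the model outputs are combined, so the sketch below assumes a plain average of per-class scores; this is one common choice, not necessarily the rule used in the paper. The helper `incremental_combinations` mirrors the evaluation protocol of Section 4.1, where models are added in order of decreasing individual performance. All names are hypothetical.

```python
def combine_outputs(model_outputs):
    """Average the per-class scores produced by several models."""
    n_models = len(model_outputs)
    return [sum(scores) / n_models for scores in zip(*model_outputs)]


def incremental_combinations(outputs_by_model, individual_scores):
    """Combine models one at a time, best individual performer first.

    outputs_by_model:  {model name: list of per-class scores}
    individual_scores: {model name: individual performance, higher is better}
    Returns a list of (names used, combined scores) pairs.
    """
    order = sorted(outputs_by_model,
                   key=lambda name: individual_scores[name],
                   reverse=True)
    results = []
    for k in range(1, len(order) + 1):
        used = order[:k]
        combined = combine_outputs([outputs_by_model[name] for name in used])
        results.append((used, combined))
    return results
```

Averaging keeps the combined score on the same scale as the individual outputs, so the same decision threshold can be reused when sweeping false positive rates.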
2.5 Model Implementation
The 3D CNN models are implemented in C++ as part of NEC's human action recognition system [25]. The implementation details are based on those of the original CNN as described in [17] and [46]. All the subsampling layers apply max sampling as described in [47]. The overall loss function used to train the regularized models is a weighted summation of the loss functions induced by the true action classes and the auxiliary outputs. The weight for the true action classes is set to 1 and that for the auxiliary outputs is set to 0.005 empirically. All the model parameters are randomly initialized as in [17] and [46] and are trained using the stochastic diagonal Levenberg-Marquardt method [17], [46]. In this method, a learning rate is computed for each parameter using the diagonal terms of an estimate of the Gauss-Newton approximation to the Hessian matrix on 1,000 randomly sampled training instances.
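A minimal sketch of the per-parameter learning rate used by the stochastic diagonal Levenberg-Marquardt method follows. It uses the usual form $\eta_k = \epsilon / (\mu + h_{kk})$, where $h_{kk}$ is the averaged diagonal curvature estimate for parameter $k$. The values of `global_rate` ($\epsilon$) and `mu` are illustrative assumptions; they are not reported in this paper.

```python
def per_parameter_learning_rates(hessian_diag_samples,
                                 global_rate=0.01, mu=0.02):
    """Learning rate for each parameter from averaged curvature estimates.

    hessian_diag_samples: one list per sampled training instance, holding
    that instance's estimate of the diagonal Gauss-Newton terms (one
    non-negative value per parameter).
    mu keeps the rate bounded when the estimated curvature is near zero.
    """
    n_samples = len(hessian_diag_samples)
    mean_diag = [sum(column) / n_samples
                 for column in zip(*hessian_diag_samples)]
    return [global_rate / (mu + h) for h in mean_diag]
```

Parameters with high estimated curvature receive small steps and flat directions receive larger ones, which is why the estimate only needs a modest sample (1,000 instances here) rather than the full training set.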
3. Related Work
CNNs belong to the class of biologically inspired models for visual recognition, and some other variants have also been developed within this family. Motivated by the organization of visual cortex, a similar model, called HMAX [48], has been developed for visual object recognition. In the HMAX model, a hierarchy of increasingly complex features is constructed by alternating applications of template matching and max pooling. In particular, at the S1 layer a still input image is first analyzed by an array of Gabor filters at multiple orientations and scales. The C1 layer is then obtained by pooling local neighborhoods on the S1 maps, leading to increased invariance to distortions on the input. The S2 maps are obtained by comparing C1 maps with an array of templates which were generated randomly from C1 maps in the training phase. The final feature representation in C2 is obtained by performing global max pooling over each of the S2 maps.
The original HMAX model is designed to analyze 2D images. In [16], this model has been extended to recognizing actions in video data. In particular, the Gabor filters in the S1 layer of the HMAX model have been replaced with some gradient and space-time modules to capture motion information. In addition, some modifications to HMAX, proposed in [49], have been incorporated into the model. A major difference between CNN and HMAX-based models is that CNNs are fully trainable systems in which all the parameters are adjusted based on training data, while all modules in HMAX consist of hard-coded parameters.
In speech and handwriting recognition, time-delay neural networks have been developed to extract temporal features [50]. In [51], a modified CNN architecture has been developed to extract features from video data. In addition to recognition tasks, CNNs have also been used in 3D image restoration problems [32].
4. Experiments
We focus on the TRECVID 2008 data to evaluate the developed 3D CNN models for action recognition in surveillance videos. Meanwhile, we also perform experiments on the KTH data [13] to compare with previous methods.
4.1 Action Recognition on TRECVID Data
The TRECVID 2008 development data set consists of 49 hours of video captured at London Gatwick Airport using five different cameras with a resolution of $720 \times 576$ at 25 fps. The videos recorded by camera number 4 are excluded as few events occurred in this scene. In the current experiments, we focus on the recognition of three action classes (CellToEar, ObjectPut, and Pointing). Each action is classified in the one-against-rest manner, and a large number of negative samples were generated from actions that are not in these three classes. This data set was captured on five days (20071101, 20071106, 20071107, 20071108, and 20071112), and the statistics of the data used in our experiments are summarized in Table 1. Multiple 3D CNN models are evaluated in this experiment, including the one described in Fig. 3.
Table 1. The Number of Samples on the Five Dates Extracted from the TRECVID 2008 Development Data Set
As the videos were recorded in real-world environments, and each frame contains multiple humans, we apply a human detector and a detection-driven tracker to locate human heads. The detailed procedure for tracking is described in [52], and some sample results are shown in Fig. 5. Based on the detection and tracking results, a bounding box for each human that performs an action was computed. The procedure to crop the bounding box from the head tracking results is illustrated in Fig. 6. The multiple frames required by the 3D CNN model are obtained by extracting bounding boxes at the same position from consecutive frames before and after the current frame, leading to a cube containing the action. The temporal dimension of the cube is set to 7 in our experiments as it has been shown that 5-7 frames are enough to achieve a performance similar to the one obtainable with the entire video sequence [53]. The frames were extracted with a step size of 2. That is, suppose the current frame is numbered 0; we extract a bounding box at the same position from frames numbered $-6, -4, -2, 0, 2, 4,$ and $6$. The patch inside the bounding box on each frame is scaled to $60 \times 40$ pixels.
Fig. 5. Sample human detection and tracking results from camera numbers 1, 2, 3, and 5 (left to right).
Fig. 6. Illustration of the procedure to crop the bounding box from the head tracking results.
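The frame sampling described above (seven frames centered on the current frame, taken with a step size of 2) can be sketched as follows. The function names and the `(top, left, height, width)` bounding-box convention are assumptions for illustration; rescaling of the cropped patches is omitted.

```python
def cube_frame_indices(current, n_frames=7, step=2):
    """Indices of the frames that form the cube around the current frame."""
    half = n_frames // 2
    return [current + step * k for k in range(-half, half + 1)]


def extract_cube(video, current, box, n_frames=7, step=2):
    """Crop the same bounding box from each sampled frame.

    video: list of frames, each a list of pixel rows.
    box:   (top, left, height, width), fixed across all sampled frames.
    The caller must ensure every sampled index lies inside the video.
    """
    top, left, height, width = box
    cube = []
    for index in cube_frame_indices(current, n_frames, step):
        frame = video[index]
        cube.append([row[left:left + width]
                     for row in frame[top:top + height]])
    return cube
```

At 25 fps, a step size of 2 makes the seven-frame cube span 13 frames of video, roughly half a second, while keeping the input to the network small.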
To evaluate the effectiveness of the 3D CNN model, we report the results of the frame-based 2D CNN model. In addition, we compare the 3D CNN model with four other methods which build state-of-the-art spatial pyramid matching (SPM) features from local features computed on dense grid or spatiotemporal interest points (STIPs). For these methods, we construct SPM features based on local invariant features computed from each image cube as used in 3D CNN. Then a one-against-all linear SVM is learned for each action class. For dense features, we extract SIFT descriptors [40] from raw gray images or motion edge history images [41]. Local features on raw gray images preserve the appearance information, while MEHI is concerned with the shape and motion patterns. The dense SIFT descriptors are calculated every 6 pixels from local image patches. For features based on STIPs, we employ the temporally integrated spatial response (TISR) method [54], which has shown promising performance on action recognition. The local features are softly quantized (each local feature can be assigned to multiple codebook words) using a 512-word codebook. To exploit the spatial layout information, we employ the spatial pyramid matching method [55] to partition the candidate region into $2 \times 2$ and $3 \times 4$ cells and concatenate their features. The dimensionality of the entire feature vector is $512 \times (2 \times 2 + 3 \times 4) = 8{,}192$. We denote the method based on gray images as $\text{SPM}^{\text{cube}}_{\text{gray}}$, the one based on MEHI as $\text{SPM}^{\text{cube}}_{\text{MEHI}}$, and the one based on TISR as $\text{SPM}^{\text{cube}}_{\text{TISR}}$. We also concatenate the $\text{SPM}^{\text{cube}}_{\text{gray}}$ and $\text{SPM}^{\text{cube}}_{\text{MEHI}}$ feature vectors into a single vector, leading to the 16,384D feature representation denoted as $\text{SPM}^{\text{cube}}_{\text{gray+MEHI}}$.
In the first set of experiments, we report the performance of the 3D CNN architecture described in Fig. 3, as this model achieved the best performance. This architecture is denoted as $\text{3D-CNN}^{s}_{332}$ since the five channels are convolved separately (the superscript $s$), the first two convolutional layers use 3D convolution, and the last convolutional layer uses 2D convolution (the subscript 332). We also report the performance of the regularized 3D CNN model based on $\text{3D-CNN}^{s}_{332}$. In this model, denoted as $\text{3D-RCNN}^{s}_{332}$, the auxiliary outputs are obtained by applying PCA to reduce the dimensionality of the 8,192D gray-image and MEHI SPM features to 150 dimensions each and then concatenating them into a 300D feature vector.
We report the fivefold cross-validation results in which the data for a single day are used as a fold. The performance measures we used are precision, recall, and area under the ROC curve (AUC) at multiple values of false positive rates (FPR). The performance of the seven methods is summarized in Table 2, and the average performance over all action classes is plotted in Fig. 7. We can observe from these results that the 3D CNN models outperform the frame-based 2D CNN model, $\text{SPM}^{\text{cube}}_{\text{gray}}$, and $\text{SPM}^{\text{cube}}_{\text{MEHI}}$ significantly on the action classes CellToEar and ObjectPut in all cases. For the action class Pointing, the 3D CNN model achieves slightly worse performance than the other three methods. Concatenation of the $\text{SPM}^{\text{cube}}_{\text{gray}}$ and $\text{SPM}^{\text{cube}}_{\text{MEHI}}$ features yields improved performance over individual features, but the performance is still lower than that of the 3D CNN models. We can also observe that our models outperform the method based on the spatiotemporal feature TISR. Overall, the 3D CNN models outperform other methods consistently, as can be seen from the average performance in Fig. 7. In addition, the regularized model yields higher performance than the one without regularization in all cases. Although the improvement by the regularized model is not significant, the following experiments show that significant performance improvements can be obtained by combining the two models.
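The evaluation protocol can be sketched in two parts: building one fold per recording day, and reading precision and recall at a given false positive rate. This is an illustrative reconstruction under stated assumptions, not the scoring code of the TRECVID evaluation. The function names and the `(day, features, label)` sample layout are hypothetical, and the second function assumes at least one positive and one negative sample.

```python
def day_folds(samples):
    """Yield (held-out day, training samples, test samples), one per day.

    samples: list of (day, features, label) tuples.
    """
    days = sorted({day for day, _, _ in samples})
    for held_out in days:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


def precision_recall_at_fpr(scores, labels, max_fpr):
    """Precision and recall at the loosest threshold with FPR <= max_fpr.

    scores: classifier outputs (higher means more likely positive).
    labels: 1 for the target action, 0 for everything else.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    true_pos = false_pos = 0
    best = (0.0, 0.0)
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        if false_pos / n_neg > max_fpr:
            break
        best = (true_pos / (true_pos + false_pos), true_pos / n_pos)
    return best
```

Holding out a whole day at a time keeps samples from the same recording session out of both training and test sets, which gives a more conservative estimate than random splitting.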
Table 2. Performance of the Seven Methods under Multiple False Positive Rates
The AUC scores are scaled by a constant factor for ease of presentation. The highest performance in each case is shown in bold face.
Fig. 7. Average performance comparison of the seven methods under different false positive rates. The AUC scores at FPR = 0.1 percent and 1 percent are each rescaled by a constant factor for better visualization.
To evaluate the effectiveness of model combination in the context of CNN for action recognition, we develop the three alternative 3D CNN architectures described below:
• $\text{3D-CNN}^{m}_{332}$ denotes the architecture in which the different channels are "mixed," the first two convolutional layers use 3D convolution, and the last layer uses 2D convolution. "Mixed" means that the channels of the same type (i.e., gradient-x and gradient-y, and optflow-x and optflow-y) are convolved separately, but they are connected to the same set of feature planes in the first convolutional layer. In the second convolutional layer, all five channels are connected to the same set of feature planes. In contrast, for models with superscript $s$, all five channels are connected to separate feature planes in all layers.
• $\text{3D-CNN}^{m}_{322}$ denotes a model similar to $\text{3D-CNN}^{m}_{332}$, but only the first convolutional layer uses 3D convolution and the other two layers use 2D convolution.
• $\text{3D-CNN}^{m}_{222}$ denotes a model similar to $\text{3D-CNN}^{m}_{332}$, but all three convolutional layers use 2D convolution.
The average performance of these three models, along with that of $\text{3D-CNN}^{s}_{332}$, is plotted in Fig. 8. We can observe that the performance of these three alternative architectures is lower than that of $\text{3D-CNN}^{s}_{332}$. However, we show in the following that combination of these models can lead to significant performance improvement.
Fig. 8. Average performance comparison of the four different 3D CNN architectures under different false positive rates. The AUC scores at FPR = 0.1 percent and 1 percent are each rescaled by a constant factor for better visualization.
To evaluate the effectiveness of model combination, we tuned each of the five models ($\text{3D-CNN}^{s}_{332}$, $\text{3D-RCNN}^{s}_{332}$, $\text{3D-CNN}^{m}_{332}$, $\text{3D-CNN}^{m}_{322}$, and $\text{3D-CNN}^{m}_{222}$) individually and then combined their outputs to make predictions. The models are sorted in decreasing order of individual performance and combined incrementally from the first to the last. The reason for doing this is that we expect individual models with high performance to lead to more significant improvements when they are combined. We report the combined performance for each combination in Table 3 and plot the average performance in Fig. 9. We can observe that combination of models leads to significant performance improvements in almost all cases; the single exception leads to only a slight performance degradation. This shows that the different architectures encode complementary information, and combination of these models effectively integrates this information even though the performance of some of the individual models is low. Figs. 10, 11, and 12 show some sample actions in each of the three classes that are classified correctly and incorrectly by the combined model. It can be observed that most of the misclassified actions are hard to recognize even for humans.
Table 3. Performance of Different Combinations of the 3D CNN Models
In this table, numbers 1 through 5 represent the five models in decreasing order of individual performance. The highest performance in each case is shown in bold face.
Fig. 9. Performance of different combinations of the 3D CNN architectures. The AUC scores at FPR = 0.1 and 1 percent are each rescaled by a constant factor for better visualization. See the caption of Table 3 and the text for detailed explanations.
Fig. 10. Sample actions in the CellToEar class. The top row shows actions that are correctly recognized by the combined 3D CNN model, while the bottom row shows those that are misclassified by the model.
Fig. 11. Sample actions in the ObjectPut class. The top row shows actions that are correctly recognized by the combined 3D CNN model, while the bottom row shows those that are misclassified by the model.
Fig. 12. Sample actions in the Pointing class. The top row shows actions that are correctly recognized by the combined 3D CNN model, while the bottom row shows those that are misclassified by the model.
To highlight the performance improvements over our previous result in [26], we report the best performance achieved by the methods in [26] and that of the new methods proposed in this paper in Table 4. We can observe that our new methods in this paper improve over the previous results significantly in all cases.
Table 4. Comparison of the Best Performance Achieved by Our Previous Methods in [26] (ICML Models) with the New Methods Proposed in This Paper (New Models)
4.2 Action Recognition on the KTH Data
We evaluate the 3D CNN model on the KTH data [13], which consist of six action classes performed by 25 subjects. To follow the setup in the HMAX model, we use a 9-frame cube as input and extract foreground as in [16]. To reduce the memory requirement, the resolutions of the input frames are reduced to $80 \times 60$ in our experiments as compared to $160 \times 120$ used in [16]. We use a similar 3D CNN architecture as in Fig. 3, with the sizes of kernels and the number of feature maps in each layer modified to consider the $80 \times 60 \times 9$ inputs. In particular, the three convolutional layers use kernels of sizes $9 \times 7$, $7 \times 7$, and $6 \times 4$, respectively, and the two subsampling layers use kernels of size $3 \times 3$. By using this setting, the $80 \times 60 \times 9$ inputs are converted into 128D feature vectors. The final layer consists of 6 units corresponding to the six classes.
As in [ 16 ], we use the data for 16 randomly selected subjects for training and the data for the other nine subjects for testing. Majority voting is used to produce labels for a video sequence based on the predictions for individual frames. The recognition performance averaged across five random trials is reported in Table 5 along with published results in the literature. The 3D CNN model achieves an overall accuracy of 90.2 percent as compared with 91.7 percent achieved by the HMAX model. Note that the HMAX model uses handcrafted features computed from raw images with fourfold higher resolution. Also, some of the methods in Table 5 used different training/test splits of the data.
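The evaluation protocol described above (a random subject-level split followed by majority voting over per-frame predictions) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `split_subjects` and `majority_vote`, the seed handling, and the tie-breaking rule are assumptions made for the example.

```python
import random
from collections import Counter


def split_subjects(subject_ids, n_train=16, seed=0):
    """Randomly assign n_train subjects to training and the rest to testing.

    Splitting by subject (not by video) keeps all clips of one person on
    the same side of the split. The seed parameter is a hypothetical
    convenience for repeating the random trials.
    """
    rng = random.Random(seed)
    shuffled = sorted(subject_ids)
    rng.shuffle(shuffled)
    return set(shuffled[:n_train]), set(shuffled[n_train:])


def majority_vote(frame_predictions):
    """Return the most frequent per-frame label for one video sequence.

    Ties are resolved in favor of the label encountered first, which is
    an assumption; the text does not specify a tie-breaking rule.
    """
    return Counter(frame_predictions).most_common(1)[0][0]


# 25 KTH subjects: 16 for training, the remaining nine for testing.
train_subjects, test_subjects = split_subjects(range(1, 26), n_train=16, seed=0)

# Per-frame predictions for one sequence are reduced to a single label.
video_label = majority_vote(["walking", "jogging", "walking", "running"])
print(video_label)  # walking
```

Repeating the split with five different seeds and averaging the resulting accuracies would mirror the five random trials mentioned in the text.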
Table 5. Action Recognition Accuracies in Percentage on the KTH Data
Note that we use the same training/test split as [ 16 ], whereas the other methods use different splits.
5. Conclusions and Discussions
We developed 3D CNN models for action recognition in this paper. These models construct features from both spatial and temporal dimensions by performing 3D convolutions. The developed deep architecture generates multiple channels of information from adjacent input frames and performs convolution and subsampling separately in each channel. The final feature representation is obtained by combining information from all channels. We developed model regularization and combination schemes to further boost the model performance. We evaluated the 3D CNN models on the TRECVID and the KTH data sets. Results show that the 3D CNN model outperforms the compared methods on the TRECVID data, while it achieves competitive performance on the KTH data, demonstrating its superior performance in real-world environments.
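To make the central operation concrete, a single 3D convolution over a stack of adjacent frames can be sketched as below. This is a simplified illustration under stated assumptions: it handles one channel and one kernel, omits the bias and nonlinearity, uses the cross-correlation form common in CNNs, and the name `conv3d_valid` and the array sizes are hypothetical rather than taken from the paper's architecture.

```python
import numpy as np


def conv3d_valid(volume, kernel):
    """'Valid' 3D convolution (cross-correlation form) of a frame stack.

    volume: array of shape (T, H, W), i.e. T adjacent frames of size H x W.
    kernel: array of shape (t, h, w) spanning t frames and an h x w patch.
    Each output value sums over space and time, so it depends on motion
    across t consecutive frames as well as on spatial appearance.
    """
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for z in range(out.shape[0]):
        for y in range(out.shape[1]):
            for x in range(out.shape[2]):
                patch = volume[z:z + t, y:y + h, x:x + w]
                out[z, y, x] = np.sum(patch * kernel)
    return out


# Hypothetical sizes: a 9-frame cube of 8 x 8 patches and a 3 x 3 x 3 kernel.
frames = np.ones((9, 8, 8))
kernel = np.ones((3, 3, 3))
features = conv3d_valid(frames, kernel)
print(features.shape)  # (7, 6, 6)
```

With a temporal kernel extent of 1, the same routine reduces to an ordinary 2D convolution applied to each frame independently, which is the case the 3D model generalizes.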
In this paper, we considered the CNN model for action recognition. There are also other deep architectures, such as the deep belief networks [ 19 ], [ 23 ], which achieve promising performance on object recognition tasks. It would be interesting to extend such models for action recognition. The developed 3D CNN model was trained using a supervised algorithm in this paper, and it requires a large number of labeled samples. Prior studies show that the number of labeled samples can be significantly reduced when such a model is pretrained using unsupervised algorithms [ 22 ]. We will explore the unsupervised training of 3D CNN models in the future.
• S. Ji is with the Department of Computer Science, Old Dominion University, Suite 3300, 4700 Elkhorn Avenue, Norfolk, VA 23529-0162. E-mail: firstname.lastname@example.org.
• W. Xu is with Facebook, Inc., 1601 Willow Road, Menlo Park, CA 94304. E-mail: email@example.com.
• M. Yang is with NEC Laboratories America, Inc., 10080 North Wolfe Road, SW3-350, Cupertino, CA 95014. E-mail: firstname.lastname@example.org.
• K. Yu is with Baidu Inc., Baidu Building, Shangdi 10th Street, Haidian District, Beijing 100085, China. E-mail: email@example.com.
Manuscript received 13 Apr. 2011; revised 28 Oct. 2011; accepted 17 Feb. 2012; published online 28 Feb. 2012.
Recommended for acceptance by C. Bregler.
For information on obtaining reprints of this article, please send e-mail to: firstname.lastname@example.org, and reference IEEECS Log Number TPAMI-2011-04-0227.
Digital Object Identifier no. 10.1109/TPAMI.2012.59.
S. Ji received the PhD degree in computer science from Arizona State University, Tempe, Arizona, in 2010. Currently, he is working as an assistant professor in the Department of Computer Science at Old Dominion University (ODU), Norfolk, Virginia. His research interests include machine learning, data mining, and bioinformatics. He received the Outstanding PhD Student Award from Arizona State University in 2010 and the Early Career Distinguished Research Award from ODU's College of Sciences in 2012.
W. Xu received the BS degree from Tsinghua University, Beijing, China, in 1998 and the MS degree from Carnegie Mellon University (CMU), Pittsburgh, in 2000. From 1998 to 2001, he was a research assistant in the Language Technology Institute at CMU. In 2001, he joined NEC Laboratories America working on intelligent video analysis. He has been a research scientist at Facebook since November 2009. His research interests include computer vision, image and video understanding, machine learning, and data mining.
M. Yang received the BE and ME degrees in electronic engineering from Tsinghua University, Beijing, China, in 2001 and 2004, respectively, and the PhD degree in electrical and computer engineering from Northwestern University, Evanston, Illinois, in June 2008. From 2004 to 2008, he was a research assistant in the computer vision group of Northwestern University. After his graduation, he joined NEC Laboratories America, Cupertino, California, where he is currently a research staff member. His research interests include computer vision, machine learning, video communication, large-scale image retrieval, and intelligent multimedia content analysis. He is a member of the IEEE.
K. Yu received the PhD degree in computer science from the University of Munich in 2004. He is now the director of the Multimedia Department at Baidu. This work was done when he was the head of the Media Analytics Department at NEC Laboratories America, where he managed an R&D division working on image recognition, multimedia search, video surveillance, sensor mining, and human-computer interaction. He has published more than 70 papers in top-tier conferences and journals in the area of machine learning, data mining, and computer vision. He has served as area chair for top-tier machine learning conferences, e.g., ICML and NIPS, and taught an AI class at Stanford University as a visiting faculty member. Before joining NEC, he was a senior research scientist at Siemens. He is a member of the IEEE.