A Fair Comparison Should Be Based on the Same Protocol--Comments on "Trainable Convolution Filters and Their Application to Face Recognition"
MARCH 2014 (Vol. 36, No. 3) pp. 622-623
0162-8828/14/$31.00 © 2014 IEEE

Published by the IEEE Computer Society
Liang Chen, Senior Member, IEEE

Abstract—We comment on a paper describing an image classification approach called the Volterra kernel classifier, which is called Volterrafaces when applied to face recognition. Its performance was evaluated through experiments on face recognition databases. We find that the comparisons with the state of the art on three of these databases were based on unfair experimental settings. We regenerate the results under the standard protocol for these three data sets; they show that Volterrafaces achieves state-of-the-art performance on only one database.

1. Introduction
In this critique, we comment on “Trainable Convolution Filters and Their Application to Face Recognition” by Kumar et al. [1], which presents an image classification approach called the Volterra kernel classifier, referred to as Volterrafaces when applied to face recognition.
The advantage of the Volterra kernel classifier was demonstrated solely through experiments with Volterrafaces on face recognition. Kumar et al. [1] reported that Volterrafaces “consistently outperforms various state-of-the-art methods in the same category” (the last sentence of the Abstract of [1]). A total of five data sets were used in the experiments: Yale A, CMU PIE, Extended Yale B, MERL Dome, and CMU Multi-PIE. Because the experimental setting on CMU Multi-PIE used in [1] had never appeared in other published papers, and MERL Dome had never been used in any papers other than those by Kumar et al., we only comment on the experiments on Yale A, CMU PIE, and Extended Yale B, for which [1] used the cropped versions prepared by Deng Cai et al. [2], available via Cai's website, originally at UIUC and now at http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html.
The results of Volterrafaces with Linear and Quadratic kernels, and of Boosted Volterrafaces, are reported in [1]. The paper by Kumar et al. [1] is an expanded version of [3]; indeed, all the experimental results of [3] are included in [1]. The source code for Volterrafaces with Linear and Quadratic kernels is available via Kumar's website, http://people.seas.harvard.edu/~rkkumar/#code_volterra, and via the MATLAB Central File Exchange, http://www.mathworks.com/matlabcentral/fileexchange/28027-volterrafaces-face-recognition-system. Boosted Volterrafaces is much more complicated than Linear or Quadratic Volterrafaces and has no program code available; it involves 500 randomly chosen binary masks, which make reimplementation and evaluation more difficult than for Linear and Quadratic Volterrafaces. Considering that the performance of Boosted Volterrafaces depends on the performances of the Linear and Quadratic variants, in this paper we only comment on the results of [1] with Linear and Quadratic kernels.
2. Protocols: Standard or Specific
The protocols of the experiments on Cai et al.'s versions of these three databases are stated clearly in the papers by Cai et al. (e.g., [4], [5]): assuming a data set contains $n$ subjects and each subject $S_i$ ( $i=1, 2, \cdots, n$ ) has $k_i$ images, a random split for a $t$ -train experiment consists of a random selection of $t$ images of each subject for training, with the rest used for testing. Note that the selections of $t$ images for different subjects are independent. As indicated in [4], 50 such random splits, which ensure that the experiments are perfectly repeatable, were made downloadable from Cai's website.
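As a rough illustration only (this is a sketch in Python, not Cai et al.'s actual code; in practice the precomputed splits should be downloaded from Cai's website; the array labels, giving the subject identity of each image, is an assumed input), the standard protocol amounts to the following:

    import numpy as np

    def standard_split(labels, t, rng):
        """Standard protocol: for each subject, draw t training images at random,
        independently of the draws made for every other subject."""
        train_idx, test_idx = [], []
        for subject in np.unique(labels):
            idx = np.flatnonzero(labels == subject)          # all images of this subject
            chosen = rng.choice(idx, size=t, replace=False)  # independent per-subject draw
            train_idx.extend(chosen)
            test_idx.extend(np.setdiff1d(idx, chosen))
        return np.array(train_idx), np.array(test_idx)

    # Example: 50 repeatable t-train splits (fixed seed for repeatability)
    # labels = ...  # subject identity of each image, loaded from the data set
    # rng = np.random.default_rng(0)
    # splits = [standard_split(labels, t=5, rng=rng) for _ in range(50)]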
This protocol has become the standard for all published experimental results, as far as we know, on these data sets, including all those cited and compared in [1], which are also shown in Tables 1, 2, and 3 of this comments paper: S-LDA [4], UVF [6], TANMM [7], OLAP [5], Eigenfaces [5], Fisherface [5], Laplacianfaces [5], KLPPSI [8], KRR [9], ORO [10], SR [11], MLASSO [12], RAD [11], CTA [13], PCA [13] (listed as Eigenfaces in [1, Table 5]), LDA [13] (listed as Fisherfaces in [1]), and LPP [13] (listed as Laplacianfaces in [1, Table 5]).

Table 1. Yale A: Recognition Error Rates



Table 2. CMU PIE: Recognition Error Rates



Table 3. Extended Yale B: Recognition Error Rates



It was reported in [1] that for the experiments on Yale A, CMU PIE, and Extended Yale B, for each $t$ ( $t=2,3,\cdots,9$ for Yale A; $t=2,3,4,5,10,20,30,40$ for PIE and Yale B) representing the training-set size per subject, the “experiment was repeated 10 times for 10 random choices of the training set” [1, Section 6.3.2]. It was not clearly specified whether the choice of $t$ images from one subject is independent of that of another subject. We have examined the demo code, volterrafacesDemo.m, from Kumar's website mentioned above, which was written for the experiments on Extended Yale B, together with YaleB32n.mat, the data matrix for that experiment. The demo code and the data matrix clearly indicate that the random selections of $t$ images from different subjects are not independent. The Extended Yale B database has 38 subjects, most of which have 64 images, but seven subjects have a few images missing; consistent with [1, Table 1], YaleB32n adds enough copies of the last images of these subjects so that every subject has 64 images (therefore, there are a total of 2,432 images). Then, in volterrafacesDemo.m, when $t$ images of each subject are chosen for training, they share the same indexes: the program first randomly generates $t$ image indexes, and then the images with these $t$ indexes are chosen for training from each subject $S_i$ ( $i=1,2,\cdots, n$ ). We also notice that, with only a few exceptions, images of different subjects with the same index visually have similar poses and illumination conditions. (Three figures, listing all the images of all subjects in these three databases, are included as a supplementary file, which can be found in the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.187.) Therefore, it is easy to see that an experiment under this specific protocol will likely achieve a better recognition rate than one under the standard protocol.
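To make the difference concrete, the selection logic we observed in volterrafacesDemo.m can be paraphrased as follows (a sketch in Python rather than the original MATLAB code; the assumption that images are stored subject by subject is ours): a single set of $t$ indexes is drawn once and then reused for every subject, so the training images of all subjects share essentially the same poses and illumination conditions.

    import numpy as np

    def shared_index_split(n_subjects, images_per_subject, t, rng):
        """Specific protocol inferred from volterrafacesDemo.m: one set of t image
        indexes is drawn once and reused for every subject, so the selections
        for different subjects are not independent."""
        shared = rng.choice(images_per_subject, size=t, replace=False)  # drawn only once
        rest = np.setdiff1d(np.arange(images_per_subject), shared)
        train_idx, test_idx = [], []
        for s in range(n_subjects):
            base = s * images_per_subject    # images assumed stored subject by subject
            train_idx.extend(base + shared)  # the same t indexes for every subject
            test_idx.extend(base + rest)
        return np.array(train_idx), np.array(test_idx)

    # Extended Yale B as padded in YaleB32n.mat (38 subjects, 64 images each):
    # rng = np.random.default_rng(0)
    # train_idx, test_idx = shared_index_split(38, 64, t=5, rng=rng)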
3. The Experimental Results under the Standard Protocol
We have carried out the experiments of Volterrafaces with Linear and Quadratic kernels on the Yale A, CMU PIE, and Extended Yale B data sets using the standard protocol. To make sure that all the experimental results are strictly repeatable, we use the 50 random splits available at Cai's site for all of the experiments except the following: for Yale A, when the training set includes 9 images per subject, and for CMU PIE and Extended Yale B, when the training set includes 2, 3, or 4 images per subject. For these exceptional cases, since the random splits are not available at Cai's site, we generate 50 random splits using the standard protocol. We should note that, as can be seen in Tables 1, 2, and 3, very few results have been reported by other researchers in these exceptional cases.
The results of our experiments, using Linear and Quadratic Volterrafaces on Yale A, CMU PIE, and Extended Yale B under the standard protocol with the parameters specified in [1, Tables 3, 4, and 5], are shown in Tables 1, 2, and 3. For each approach (Linear or Quadratic Volterrafaces), the first row gives the error rates and the second row gives the standard deviations. All the results of other approaches that were cited in [1] are also shown in these tables, as are the results of Volterrafaces under the specific protocol that were reported in [1].
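The numbers in each table cell were aggregated in the usual way; a minimal sketch, assuming error_rates holds one error rate per random split (the classifier itself is not shown):

    import numpy as np

    def summarize(error_rates):
        """Mean error rate (first row of each cell) and standard deviation
        (second row), computed over the 50 random splits."""
        e = np.asarray(error_rates, dtype=float)
        return e.mean(), e.std()

    # error_rates = [evaluate(train_idx, test_idx) for (train_idx, test_idx) in splits]
    # mean_err, std_err = summarize(error_rates)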
We should note that all the results on Yale A obtained with the other approaches cited in [1] are results on Yale A images of size $32 \times 32$ pixels, while the Yale A images used in the experiments of [1] are of size $64 \times 64$. It is generally understood that lower-resolution images normally result in lower recognition rates. From this perspective, comparing the results of Volterrafaces on $64 \times 64$ images with other published results on $32 \times 32$ images is itself not fair,1 although this does not affect the main conclusion of this comments paper, as shown in Section 4.
4. Conclusion
The protocols of the experiments in [1] on Yale A, CMU PIE, and Extended Yale B are not standard; therefore, their results should not be compared with other published results obtained under the standard protocol.
We can easily see that, using the standard protocol, among these three data sets, Linear and Quadratic Volterrafaces perform better than the “state-of-the-art” approaches only on Extended Yale B; they cannot be claimed to outperform the “state-of-the-art” methods on CMU PIE and Yale A. For Yale A in particular, even though the comparison in Table 1 is biased in favor of Linear and Quadratic Volterrafaces (see Section 3), their performance is still well below the “state of the art.”

ACKNOWLEDGMENTS

The author is supported by an innovation grant from Wenzhou University as an adjunct professor; this research is also supported by an NSERC Discovery Grant (Canada).

    The author is with the College of Mathematics & Information Science, Wenzhou University (adjunct), Zhejiang, China, and with the Computer Science Department, University of Northern British Columbia, BC, Canada. E-mail: lchen@ieee.org.

Manuscript received 28 Nov. 2012; revised 19 Aug. 2013; accepted 15 Sept. 2013; published online 30 Sept. 2013.

Recommended for acceptance by M. Pantic.

For information on obtaining reprints of this article, please send e-mail to: tpami@computer.org, and reference IEEECS Log Number TPAMI-2012-11-0941.

Digital Object Identifier no. 10.1109/TPAMI.2013.187.

1. We understand the difficulty of finding published results on the $64 \times 64$ version of Yale A. Here is an example: in [14], the S-LDA approach on the $64 \times 64$ version of Yale A, with 2, 5, and 8 training images per subject, has average error rates of 28.43, 10.56, and 6.84 percent, respectively, which are much lower than those of S-LDA on the $32 \times 32$ version of Yale A cited in [1] and also included in Table 1 here.

REFERENCES

1. R. Kumar, A. Banerjee, B. Vemuri, and H. Pfister, “Trainable Convolution Filters and Their Application to Face Recognition,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 34, no. 7, July 2012.
2. D. Cai, X. He, and J. Han, “SRDA: An Efficient Algorithm for Large Scale Discriminant Analysis,” IEEE Trans. Knowledge and Data Eng., vol. 20, no. 1, Jan. 2008.
3. R. Kumar, A. Banerjee, and B. Vemuri, “Volterrafaces: Discriminant Analysis Using Volterra Kernels,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2009.
4. D. Cai, X. He, Y. Hu, J. Han, and T. Huang, “Learning a Spatially Smooth Subspace for Face Recognition,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2007.
5. D. Cai, X. He, J. Han, and H.-J. Zhang, “Orthogonal Laplacianfaces for Face Recognition,” IEEE Trans. Image Processing, vol. 15, no. 11, Nov. 2006.
6. H. Shan and G.W. Cottrell, “Looking Around the Backyard Helps to Recognize Faces and Digits,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2008.
7. F. Wang and C. Zhang, “Feature Extraction by Maximizing the Average Neighborhood Margin,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2007.
8. S. An, W. Liu, and S. Venkatesh, “Exploiting Side Information in Locality Preserving Projection,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2008.
9. S. An, W. Liu, and S. Venkatesh, “Face Recognition Using Kernel Ridge Regression,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2007.
10. G. Hua, P. Viola, and S. Drucker, “Face Recognition Using Discriminatively Trained Orthogonal Rank One Tensor Projections,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2007.
11. D. Cai, X. He, and J. Han, “Spectral Regression for Efficient Regularized Subspace Learning,” Proc. IEEE Int'l Conf. Computer Vision, Oct. 2007.
12. D.-S. Pham and S. Venkatesh, “Robust Learning of Discriminative Projection for Multicategory Classification on the Stiefel Manifold,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2008.
13. Y. Fu and T. Huang, “Image Classification Using Correlation Tensor Analysis,” IEEE Trans. Image Processing, vol. 17, no. 2, Feb. 2008.
14. L. Chen and N. Tokuda, “A Unified Framework for Improving the Accuracy of All Holistic Face Identification Algorithms,” Artificial Intelligence Rev., vol. 33, 2010.