Semisupervised Learning of Classifiers: Theory, Algorithms, and Their Application to Human-Computer Interaction
December 2004 (vol. 26 no. 12)
pp. 1553-1567
Automatic classification is one of the basic tasks required in any pattern recognition and human-computer interaction application. In this paper, we discuss training probabilistic classifiers with labeled and unlabeled data. We provide a new analysis that shows under what conditions unlabeled data can be used in learning to improve classification performance. We also show that, if the conditions are violated, using unlabeled data can be detrimental to classification performance. We discuss the implications of this analysis for a specific type of probabilistic classifier, Bayesian networks, and propose a new structure learning algorithm that can utilize unlabeled data to improve classification. Finally, we show how the resulting algorithms are successfully employed in two applications related to human-computer interaction and pattern recognition: facial expression recognition and face detection.
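
To make the idea of training a probabilistic classifier with labeled and unlabeled data concrete, the following is a minimal sketch of maximum-likelihood estimation via EM for a two-class generative model. It is not the structure learning algorithm proposed in the paper: the one-dimensional Gaussian class-conditional densities, the function names (`gaussian_pdf`, `semisupervised_em`, `classify`), and the variance floor are all assumptions chosen for illustration. The sketch also assumes at least one labeled example per class.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def semisupervised_em(labeled, unlabeled, n_iter=50):
    """Fit class priors, means, and variances from labeled and unlabeled data.

    labeled:   list of (x, y) pairs with y in {0, 1}
    unlabeled: list of x values with unknown class
    Returns {class: [prior, mean, variance]}.
    """
    classes = (0, 1)
    params = {}
    # Initialise from the labeled data alone (illustrative choice).
    for c in classes:
        xs = [x for x, y in labeled if y == c]
        mean = sum(xs) / len(xs)
        var = max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-3)
        params[c] = [len(xs) / len(labeled), mean, var]

    n_total = len(labeled) + len(unlabeled)
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each unlabeled point.
        resp = []
        for x in unlabeled:
            joint = [params[c][0] * gaussian_pdf(x, params[c][1], params[c][2])
                     for c in classes]
            total = sum(joint) or 1e-300  # guard against numerical underflow
            resp.append([j / total for j in joint])

        # M-step: labeled points carry weight 0 or 1, unlabeled points
        # carry their posterior weight from the E-step.
        for c in classes:
            pairs = [(x, 1.0 if y == c else 0.0) for x, y in labeled]
            pairs += [(x, r[c]) for x, r in zip(unlabeled, resp)]
            weight = sum(w for _, w in pairs)
            mean = sum(w * x for x, w in pairs) / weight
            var = max(sum(w * (x - mean) ** 2 for x, w in pairs) / weight, 1e-3)
            params[c] = [weight / n_total, mean, var]
    return params


def classify(x, params):
    """Return the class with the largest prior-weighted likelihood."""
    return max(params, key=lambda c: params[c][0]
               * gaussian_pdf(x, params[c][1], params[c][2]))
```

In this sketch the unlabeled points influence the parameter estimates only through the assumed Gaussian form, so how much they help or hurt depends on how well that assumed model matches the data.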

Index Terms:
Semisupervised learning, generative models, facial expression recognition, face detection, unlabeled data, Bayesian network classifiers.
Ira Cohen, Fabio G. Cozman, Nicu Sebe, Marcelo C. Cirelo, Thomas S. Huang, "Semisupervised Learning of Classifiers: Theory, Algorithms, and Their Application to Human-Computer Interaction," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 12, pp. 1553-1567, Dec. 2004, doi:10.1109/TPAMI.2004.127