Simultaneous Feature Selection and Clustering Using Mixture Models
September 2004 (vol. 26, no. 9), pp. 1154-1166
Clustering is a common unsupervised learning technique used to discover group structure in a set of data. Although many clustering algorithms exist, the important issue of feature selection, that is, which attributes of the data the clustering algorithm should use, is rarely addressed. Feature selection for clustering is difficult because, unlike in supervised learning, there are no class labels for the data and, thus, no obvious criteria to guide the search. Another important problem in clustering is determining the number of clusters, which both affects and is affected by feature selection. In this paper, we propose the concept of feature saliency and introduce an expectation-maximization (EM) algorithm to estimate it in the context of mixture-based clustering. The incorporation of a minimum message length (MML) model selection criterion drives the saliency of irrelevant features toward zero, which corresponds to performing feature selection. The criterion and algorithm are then extended to simultaneously estimate the feature saliencies and the number of clusters.
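The feature-saliency idea described in the abstract can be sketched as an EM loop: each feature d gets a saliency rho_d, and that feature's value is modeled as a rho_d-weighted mixture of a component-specific ("relevant") Gaussian and a common ("irrelevant") Gaussian shared by all components. The sketch below is a minimal, maximum-likelihood-only NumPy illustration under simplifying assumptions (diagonal Gaussians, quantile-based initialization); it omits the paper's MML penalty, which is what actually drives the saliencies of irrelevant features to zero, and all variable names are ours, not the paper's notation.

```python
import numpy as np

def normal_pdf(y, mean, var):
    """Univariate Gaussian density, broadcasting over arrays."""
    return np.exp(-0.5 * (y - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def feature_saliency_em(Y, K, n_iter=50):
    """ML-only EM for a feature-saliency Gaussian mixture (no MML prior).

    Returns the estimated saliencies rho (one per feature) and the
    log-likelihood trace, which EM guarantees is non-decreasing.
    """
    N, D = Y.shape
    alpha = np.full(K, 1.0 / K)                      # mixing weights
    # Initialize component means at spread-out quantiles of the data.
    mu = np.quantile(Y, (np.arange(K) + 1.0) / (K + 1.0), axis=0)  # (K, D)
    s = np.tile(Y.var(axis=0), (K, 1))               # component variances (K, D)
    m = Y.mean(axis=0)                               # common "irrelevant" mean (D,)
    v = Y.var(axis=0)                                # common variance (D,)
    rho = np.full(D, 0.5)                            # feature saliencies
    logliks = []
    for _ in range(n_iter):
        # E-step: per-feature relevant vs. irrelevant density parts.
        a = rho * normal_pdf(Y[:, None, :], mu, s)        # (N, K, D) relevant
        b = (1.0 - rho) * normal_pdf(Y[:, None, :], m, v) # (N, 1, D) irrelevant
        mix = a + b                                        # (N, K, D)
        comp = alpha * np.prod(mix, axis=2)                # (N, K)
        logliks.append(np.log(comp.sum(axis=1) + 1e-300).sum())
        w = comp / (comp.sum(axis=1, keepdims=True) + 1e-300)  # responsibilities
        u = w[:, :, None] * a / (mix + 1e-300)             # "feature relevant" part
        t = w[:, :, None] - u                               # "feature irrelevant" part
        # M-step: weighted updates for all parameters.
        alpha = w.sum(axis=0) / N
        su = u.sum(axis=0) + 1e-300                         # (K, D)
        mu = (u * Y[:, None, :]).sum(axis=0) / su
        s = (u * (Y[:, None, :] - mu) ** 2).sum(axis=0) / su + 1e-6
        st = t.sum(axis=(0, 1)) + 1e-300                    # (D,)
        m = (t * Y[:, None, :]).sum(axis=(0, 1)) / st
        v = (t * (Y[:, None, :] - m) ** 2).sum(axis=(0, 1)) / st + 1e-6
        rho = np.clip(u.sum(axis=(0, 1)) / N, 1e-3, 1.0 - 1e-3)
    return rho, logliks

if __name__ == "__main__":
    # Two clusters separated only along feature 0; feature 1 is pure noise.
    rng = np.random.default_rng(1)
    sep = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)])
    noise = rng.normal(0.0, 1.0, 400)
    Y = np.column_stack([sep, noise])
    rho, logliks = feature_saliency_em(Y, K=2)
    # Expect the informative feature's saliency to exceed the noise feature's.
    print("saliencies:", rho)
```

Without the MML term, the saliency of a pure-noise feature merely stays low rather than being pushed exactly to zero; pruning features and components, as in the paper, requires the penalized objective.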


Index Terms:
Feature selection, clustering, unsupervised learning, mixture models, minimum message length, EM algorithm.
Martin H.C. Law, Mário A.T. Figueiredo, Anil K. Jain, "Simultaneous Feature Selection and Clustering Using Mixture Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 9, pp. 1154-1166, Sept. 2004, doi:10.1109/TPAMI.2004.71