Issue No. 12 - December 2009 (vol. 21)
pp. 1649-1664
Nizar Bouguila, Concordia University, Montreal
ABSTRACT
In this paper, we consider the problem of unsupervised discrete feature selection/weighting. Discrete data are an important component in many data mining, machine learning, image processing, and computer vision applications, yet much of the published work on unsupervised feature selection has concentrated on continuous data. We propose a probabilistic approach that assigns relevance weights to discrete features, which are treated as random variables modeled by finite discrete mixtures. The choice of finite mixture models is justified by their flexibility, which has led to their widespread use across many domains. For learning the model, we consider both Bayesian and information-theoretic approaches through stochastic complexity. Experimental results illustrate the feasibility and merits of our approach on a difficult problem, namely clustering and recognizing visual concepts in different image data sets. The proposed approach is also successfully applied to text clustering.
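The abstract describes MAP learning of finite discrete (multinomial) mixtures with per-feature relevance weights. As a minimal sketch of the underlying mixture machinery only, and not of the paper's full model (the feature-weighting terms and the stochastic-complexity model-selection criterion are omitted), the Python below runs EM for a multinomial mixture over count vectors with Dirichlet-prior, MAP-style smoothing. The function name em_multinomial_mixture and the prior strength alpha are illustrative assumptions, not taken from the paper.

import numpy as np

def em_multinomial_mixture(X, K, alpha=2.0, n_iter=100, seed=0):
    """EM for a finite multinomial mixture over count data X (N x D).

    MAP-style smoothing: a symmetric Dirichlet(alpha) prior on each
    component's multinomial parameters adds (alpha - 1) pseudo-counts in
    the M-step (a hypothetical choice for this sketch).
    """
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)                      # mixing weights
    theta = rng.dirichlet(np.ones(D), size=K)     # per-component multinomials (K x D)

    for _ in range(n_iter):
        # E-step: responsibilities from log-domain multinomial likelihoods
        log_lik = X @ np.log(theta).T + np.log(pi)        # (N, K)
        log_lik -= log_lik.max(axis=1, keepdims=True)     # stabilize before exp
        resp = np.exp(log_lik)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: MAP updates with Dirichlet pseudo-counts
        Nk = resp.sum(axis=0)
        pi = (Nk + 1.0) / (N + K)                 # lightly smoothed mixing weights
        counts = resp.T @ X + (alpha - 1.0)       # (K, D)
        theta = counts / counts.sum(axis=1, keepdims=True)

    return pi, theta, resp

# Example (hypothetical data): 200 count vectors over a 50-symbol vocabulary.
X = np.random.default_rng(1).poisson(3.0, size=(200, 50))
pi, theta, resp = em_multinomial_mixture(X, K=3)
labels = resp.argmax(axis=1)

The log-domain E-step and pseudo-count M-step are standard; smoothing both the mixing weights and the component parameters keeps the updates well defined when a component receives few responsibilities.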
INDEX TERMS
Discrete data, finite mixture models, multinomial, Dirichlet prior, feature weighting/selection, MAP, stochastic complexity, Fisher kernel, image databases, text clustering.
CITATION
Nizar Bouguila, "A Model-Based Approach for Discrete Data Clustering and Feature Weighting Using MAP and Stochastic Complexity", IEEE Transactions on Knowledge & Data Engineering, vol.21, no. 12, pp. 1649-1664, December 2009, doi:10.1109/TKDE.2009.42