Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data
April-June 2005 (vol. 2 no. 2)
pp. 83-101

Abstract: This paper presents an attribute clustering method that groups genes based on their interdependence so as to mine meaningful patterns from gene expression data. It can be used for gene grouping, selection, and classification. Partitioning a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. By clustering attributes, the search dimension of a data mining algorithm is reduced. This reduction is especially important for mining gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are developed and optimized to scale with the number of tuples rather than the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case the likelihood of reporting patterns that arise merely by chance, and are therefore irrelevant, becomes rather high. For these reasons, gene grouping and selection are important preprocessing steps that many data mining algorithms need in order to be effective on gene expression data. This paper defines the problem of attribute clustering and introduces a methodology for solving it. The proposed method groups interdependent attributes into clusters by optimizing a criterion function derived from an information measure that reflects the interdependence between attributes. Applying the algorithm to gene expression data reveals meaningful clusters of genes. Grouping genes by attribute interdependence within each group helps to capture different aspects of gene association patterns in each group. Significant genes selected from each group then contain useful information for gene expression classification and identification.
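The idea of grouping attributes by an information-based interdependence measure can be illustrated with a minimal sketch. The abstract does not state the criterion function, so the code below assumes one plausible choice: mutual information divided by joint entropy, computed on already-discretized expression levels, with a k-modes-style loop in which each attribute joins the cluster "mode" it depends on most. The function names, the seeding rule, and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete symbols."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def interdependence(a, b):
    """Assumed measure: mutual information of a and b over their joint entropy."""
    h_joint = entropy(list(zip(a, b)))
    if h_joint == 0:
        return 0.0  # both attributes are constant, so nothing can be shared
    mutual_info = entropy(a) + entropy(b) - h_joint
    return mutual_info / h_joint


def cluster_attributes(columns, k, max_iter=20):
    """Group attributes (columns) into k clusters around representative modes."""
    p = len(columns)
    # Pairwise interdependence between every pair of attributes.
    r = [[interdependence(columns[i], columns[j]) for j in range(p)]
         for i in range(p)]

    def assign(modes):
        # Each attribute joins the cluster whose mode it depends on most.
        return [max(range(k), key=lambda c: r[i][modes[c]]) for i in range(p)]

    modes = list(range(k))  # assumption: the first k attributes seed the clusters
    for _ in range(max_iter):
        labels = assign(modes)
        new_modes = []
        for c in range(k):
            members = [i for i in range(p) if labels[i] == c]
            if not members:
                new_modes.append(modes[c])  # keep the old mode of an empty cluster
                continue
            # New mode: the member with the largest summed interdependence
            # with the other members of its cluster.
            new_modes.append(max(
                members,
                key=lambda i: sum(r[i][j] for j in members if j != i)))
        if new_modes == modes:
            break
        modes = new_modes
    return assign(modes), modes


# Toy data: four "genes" observed over six samples, already discretized.
g1 = [0, 0, 1, 1, 2, 2]
g2 = [1, 1, 2, 2, 0, 0]  # a relabeling of g1, hence fully dependent on it
g3 = [0, 1, 0, 1, 0, 1]
g4 = [1, 0, 1, 0, 1, 0]  # the complement of g3
labels, modes = cluster_attributes([g1, g3, g2, g4], k=2)
print(labels, modes)
```

In this toy input, g1 and g2 carry the same information, as do g3 and g4, so the sketch places each pair in its own cluster. Real expression data would first need discretization, which this sketch leaves out.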
To evaluate the performance of the proposed approach, we applied it to two well-known gene expression data sets and compared our results with those obtained by other methods. Our experiments show that the proposed method is able to find meaningful clusters of genes. Selecting a subset of genes that have high multiple-interdependence with others within their clusters yields significant classification information. Thus, a small pool of selected genes can be used to build classifiers with a very high classification rate. From the pool, gene expressions of different categories can be identified.
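The selection step can be sketched as follows, assuming a precomputed symmetric matrix of pairwise interdependence scores and a cluster label for each gene. Here "multiple-interdependence" is approximated as the sum of a gene's scores with the other members of its cluster. That reading, along with the function name and the example matrix, is an assumption made for illustration and is not the paper's stated definition.

```python
def select_top_genes(r, labels, per_cluster=1):
    """Pick, per cluster, the genes most interdependent with their cluster mates.

    r      -- square matrix (list of lists) of pairwise interdependence scores
    labels -- cluster label for each gene, aligned with the rows of r
    """
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    selected = {}
    for label, members in clusters.items():
        # Rank members by summed interdependence with the rest of the cluster.
        ranked = sorted(
            members,
            key=lambda i: sum(r[i][j] for j in members if j != i),
            reverse=True)
        selected[label] = ranked[:per_cluster]
    return selected


# Hypothetical scores for five genes: genes 0-2 form one cluster, 3-4 another.
r = [
    [1.0, 0.8, 0.6, 0.1, 0.0],
    [0.8, 1.0, 0.7, 0.0, 0.1],
    [0.6, 0.7, 1.0, 0.1, 0.1],
    [0.1, 0.0, 0.1, 1.0, 0.9],
    [0.0, 0.1, 0.1, 0.9, 1.0],
]
selected = select_top_genes(r, labels=[0, 0, 0, 1, 1], per_cluster=1)
print(selected)
```

The selected indices form the small pool of genes that would then be passed to a classifier of choice. Raising `per_cluster` trades a larger pool for more coverage of each cluster.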

Index Terms:
Data mining, attribute clustering, gene selection, gene expression classification, microarray analysis.
Citation:
Wai-Ho Au, Keith C.C. Chan, Andrew K.C. Wong, Yang Wang, "Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 2, pp. 83-101, April-June 2005, doi:10.1109/TCBB.2005.17