Issue No.01 - January (2012 vol.24)
pp: 127-140
Pradipta Maji, Indian Statistical Institute, Kolkata
ABSTRACT
Microarray technology is an important biotechnological tool that records the expression levels of thousands of genes simultaneously across a number of different samples. An important application of microarray gene expression data in functional genomics is to classify samples according to their gene expression profiles. Among the large number of genes present in gene expression data, only a small fraction is effective for performing a given diagnostic test. Hence, one of the major tasks with gene expression data is to find groups of coregulated genes whose collective expression is strongly associated with the sample categories or response variables. In this regard, a new supervised attribute clustering algorithm is proposed to find such groups of genes. It directly incorporates the information of sample categories into the attribute clustering process. A new quantitative measure, based on mutual information, is introduced that uses the information of sample categories to measure the similarity between attributes. The proposed algorithm uses this measure to assess the similarity between attributes, whereby redundancy among the attributes is removed. The clusters are then refined incrementally based on sample categories. The performance of the proposed algorithm is compared with that of existing supervised and unsupervised gene clustering and gene selection algorithms, based on the class separability index and the predictive accuracy of the naive Bayes classifier, the K-nearest neighbor rule, and the support vector machine, on three cancer and two arthritis microarray data sets. The biological significance of the generated clusters is interpreted using the gene ontology. An important finding is that the proposed supervised attribute clustering algorithm is effective for identifying biologically significant gene clusters with excellent predictive capability.
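The abstract does not give the form of the new similarity measure, so the following is only a minimal sketch of the standard mutual information on which it is said to be based, not the proposed measure or algorithm. It estimates mutual information from counts, once between a discretized gene and the sample categories (relevance) and once between two genes (a possible redundancy signal). The function names `mutual_information` and `binarize`, the mean-threshold discretization, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Return I(X;Y) in bits for two equal-length sequences of discrete symbols."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("sequences must be non-empty and of equal length")
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum p(x,y) * log2(p(x,y) / (p(x) p(y))) over the observed pairs only.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def binarize(values):
    """Assumed discretization: 1 if above the gene's mean expression, else 0."""
    mean = sum(values) / len(values)
    return [1 if v > mean else 0 for v in values]


# Hypothetical toy data: three genes measured on four samples, two classes.
expression = {
    "gene_a": [0.1, 0.2, 0.9, 1.1],
    "gene_b": [0.5, 0.9, 0.4, 1.0],
    "gene_c": [0.2, 0.1, 1.3, 0.8],
}
labels = [0, 0, 1, 1]

discrete = {gene: binarize(values) for gene, values in expression.items()}

# Relevance of each gene to the sample categories.
relevance = {gene: mutual_information(d, labels) for gene, d in discrete.items()}

# Gene-gene similarity; a high value suggests the two genes are redundant.
similarity_ac = mutual_information(discrete["gene_a"], discrete["gene_c"])

# Genes ordered from most to least relevant to the class labels.
ranked = sorted(relevance, key=relevance.get, reverse=True)
```

In this toy case `gene_a` and `gene_c` have identical discretized profiles, so a clustering step driven by such a similarity would tend to group them, while `gene_b` shares no information with the labels and ranks last.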
INDEX TERMS
Microarray analysis, attribute clustering, gene selection, mutual information, classification.
CITATION
Pradipta Maji, "Mutual Information-Based Supervised Attribute Clustering for Microarray Sample Classification", IEEE Transactions on Knowledge & Data Engineering, vol. 24, no. 1, pp. 127-140, January 2012, doi:10.1109/TKDE.2010.210