Data-Dependent Kernel Machines for Microarray Data Classification
October-December 2007 (vol. 4 no. 4)
pp. 583-595
One important application of gene expression analysis is to classify tissue samples according to their gene expression levels. Gene expression data are typically characterized by high dimensionality and small sample size, which makes the classification task challenging. In this paper, we present a data-dependent kernel for microarray data classification. The kernel function is engineered so that the class separability of the training data is maximized. A bootstrapping-based resampling scheme is introduced to reduce possible training bias. The effectiveness of this adaptive kernel for microarray data classification is illustrated with a k-Nearest Neighbor (KNN) classifier. Our experimental study shows that the data-dependent kernel leads to a significant improvement in the accuracy of KNN classifiers. Furthermore, this kernel-based KNN scheme is shown to be competitive with, if not better than, more sophisticated classifiers such as Support Vector Machines (SVMs) and Uncorrelated Linear Discriminant Analysis (ULDA) for classifying gene expression data.
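
The abstract does not give the form of the kernel, so the following is only a minimal sketch of how a kernel-based KNN classifier of this kind could look. It assumes a conformal form k(x, y) = q(x) q(y) k0(x, y) with a Gaussian base kernel k0, in the style of Amari and Wu's kernel modification, and it uses the standard kernel-induced distance d^2(x, y) = k(x, x) - 2 k(x, y) + k(y, y). The function names, the parameters gamma and gamma1, the centers, and the coefficient vector alpha are all hypothetical; in particular, alpha is taken as given here, whereas the paper describes choosing the kernel to maximize class separability.

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=0.5):
    """Base kernel k0(x, y) = exp(-gamma * ||x - y||^2) for all row pairs."""
    sq = (X ** 2).sum(1)[:, None] - 2.0 * X @ Y.T + (Y ** 2).sum(1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))

def conformal_factor(X, centers, alpha, gamma1=0.5):
    """Assumed factor q(x) = alpha[0] + sum_i alpha[i + 1] * k1(x, centers[i])."""
    return alpha[0] + gaussian_kernel(X, centers, gamma1) @ alpha[1:]

def kernel_knn_predict(X_train, y_train, X_test, centers, alpha, k=3):
    """KNN using the distance induced by k(x, y) = q(x) q(y) k0(x, y).

    y_train must hold non-negative integer class labels.
    """
    q_tr = conformal_factor(X_train, centers, alpha)
    q_te = conformal_factor(X_test, centers, alpha)
    K_cross = q_te[:, None] * gaussian_kernel(X_test, X_train) * q_tr[None, :]
    # For a Gaussian base kernel k0(x, x) = 1, so k(x, x) = q(x)^2.
    d2 = q_te[:, None] ** 2 - 2.0 * K_cross + q_tr[None, :] ** 2
    nearest = np.argsort(d2, axis=1)[:, :k]   # indices of the k closest training samples
    votes = y_train[nearest]                  # their class labels
    return np.array([np.bincount(row).argmax() for row in votes])
```

With alpha set to [1, 0, ..., 0] the factor q is constant and the sketch reduces to ordinary KNN under a Gaussian kernel distance; nonzero coefficients rescale the feature space around the chosen centers, which is the degree of freedom a separability criterion would tune.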

Index Terms:
Microarray data analysis, cancer classification, kernel machines, kernel optimization, bootstrapping resampling
Huilin Xiong, Ya Zhang, Xue-Wen Chen, "Data-Dependent Kernel Machines for Microarray Data Classification," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, no. 4, pp. 583-595, Oct.-Dec. 2007, doi:10.1109/tcbb.2007.1048