Using Uncorrelated Discriminant Analysis for Tissue Classification with Gene Expression Data
October-December 2004 (vol. 1 no. 4)
pp. 181-190

Abstract: The classification of tissue samples based on gene expression data is an important problem in the medical diagnosis of diseases such as cancer. In gene expression data, the number of genes is usually very high (in the thousands) compared to the number of data samples (in the tens or low hundreds); that is, the data dimension is large compared to the number of data points (such data are said to be undersampled). To cope with the performance and accuracy problems associated with high dimensionality, it is common to apply a preprocessing step that transforms the data to a space of significantly lower dimension with limited loss of the information present in the original data. Linear Discriminant Analysis (LDA) is a well-known technique for dimension reduction and feature extraction, but it is not applicable to undersampled data because the matrices in the underlying representation become singular. This paper presents a dimension reduction and feature extraction scheme, called Uncorrelated Linear Discriminant Analysis (ULDA), for undersampled problems and illustrates its utility on gene expression data. ULDA employs the Generalized Singular Value Decomposition to handle undersampled data, and the features it produces in the transformed space are uncorrelated, which makes it attractive for gene expression data. The properties of ULDA are established rigorously, and extensive experimental results on gene expression data are presented to illustrate its effectiveness in classifying tissue samples. These results provide a comparative study of various state-of-the-art classification methods on well-known gene expression data sets.
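
The idea of projecting undersampled data onto uncorrelated discriminant features can be illustrated with a minimal NumPy sketch. It is not the paper's algorithm as published: it assumes a formulation in which two ordinary SVDs jointly diagonalize the total and between-class scatter, standing in for the Generalized Singular Value Decomposition step. The names `ulda_fit`, `tol`, `Ht`, and `Hb` are hypothetical and chosen only for this illustration.

```python
import numpy as np

def ulda_fit(X, y, tol=1e-10):
    """Sketch of an uncorrelated LDA projection for undersampled data.

    X: (n_samples, n_genes) expression matrix; y: class labels.
    Returns (G, mean) where G has shape (n_genes, q).
    Assumption: SVD-based simultaneous diagonalization, not the
    paper's exact GSVD procedure.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    mean = X.mean(axis=0)

    # Ht: centered samples as columns, so total scatter St = Ht @ Ht.T
    Ht = (X - mean).T
    # Hb: one column per class, sqrt(n_i) * (class centroid - global centroid)
    Hb = np.column_stack([
        np.sqrt(np.sum(y == c)) * (X[y == c].mean(axis=0) - mean)
        for c in classes
    ])

    # Thin SVD of Ht; keep only directions with nonzero singular values,
    # which sidesteps the singularity of St when n_genes >> n_samples.
    U, s, _ = np.linalg.svd(Ht, full_matrices=False)
    t = int(np.sum(s > tol * s[0]))
    U, s = U[:, :t], s[:t]

    # Between-class structure expressed in the whitened subspace.
    B = (U.T @ Hb) / s[:, None]
    P, sb, _ = np.linalg.svd(B, full_matrices=False)
    q = int(np.sum(sb > tol * sb[0]))  # at most (number of classes - 1)

    # G satisfies G.T @ St @ G = I, so the extracted features are uncorrelated.
    G = (U / s) @ P[:, :q]
    return G, mean

# Synthetic undersampled data: 30 samples, 200 "genes", 3 classes.
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], 10)
X_demo = rng.normal(size=(30, 200)) + y_demo[:, None] * 0.5
G_demo, mean_demo = ulda_fit(X_demo, y_demo)
Z_demo = (X_demo - mean_demo) @ G_demo  # reduced features, one row per sample
```

On the training data, the Gram matrix `Z_demo.T @ Z_demo` is the identity up to rounding, which is the sense in which the extracted features are uncorrelated. A classifier such as nearest neighbor could then be applied to the reduced features.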

Index Terms:
Microarray data analysis, discriminant analysis, generalized singular value decomposition, classification.
Jieping Ye, Tao Li, Tao Xiong, Ravi Janardan, "Using Uncorrelated Discriminant Analysis for Tissue Classification with Gene Expression Data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 1, no. 4, pp. 181-190, Oct.-Dec. 2004, doi:10.1109/TCBB.2004.45