Issue No.03 - July-September (2010 vol.7)
pp: 537-549
Xiaoxu Han, Eastern Michigan University, Ypsilanti
As a well-established feature selection algorithm, principal component analysis (PCA) is often combined with state-of-the-art classification algorithms to identify cancer molecular patterns in microarray data. However, the algorithm's global feature selection mechanism prevents it from effectively capturing the latent data structures in high-dimensional data. In this study, we investigate the benefit of adding nonnegativity constraints to PCA and develop a nonnegative principal component analysis (NPCA) algorithm to overcome the global nature of PCA. A novel classification algorithm, NPCA-SVM, is proposed for microarray data pattern discovery. We report strong classification results from the NPCA-SVM algorithm on five benchmark microarray data sets by direct comparison with other related algorithms. We also prove mathematically, and interpret biologically, that an SVM or PCA-SVM learning machine under a Gaussian kernel will inevitably overfit microarray data. In addition, we demonstrate that nonnegative principal component analysis can effectively capture meaningful biomarkers.
Biomarker discovery, classification, feature selection, overfitting.
Xiaoxu Han, "Nonnegative Principal Component Analysis for Cancer Molecular Pattern Discovery", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 7, no. 3, pp. 537-549, July-September 2010, doi:10.1109/TCBB.2009.36
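The abstract does not spell out how the nonnegative components are computed, so the following is only an illustrative sketch of the general idea, not the paper's NPCA algorithm. It assumes a simple projected gradient ascent on the variance objective, with each loading vector clipped to nonnegative entries and renormalized, and a deflation step between components. The function name `nonnegative_pca`, its parameters, and the synthetic gamma-distributed "expression" matrix are all hypothetical choices made for this example.

```python
import numpy as np

def nonnegative_pca(X, n_components=3, n_iter=500, seed=0):
    """Illustrative nonnegative PCA (an assumption, not the paper's method).

    X has shape (n_samples, n_features). Returns W with shape
    (n_features, n_components); every column is unit-norm with entries >= 0.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                    # center each feature (gene)
    C = Xc.T @ Xc / (X.shape[0] - 1)           # sample covariance matrix
    W = np.zeros((X.shape[1], n_components))
    for k in range(n_components):
        # Step size taken from the spectral norm of the current matrix.
        step = 1.0 / max(np.linalg.norm(C, 2), 1e-12)
        w = np.abs(rng.standard_normal(X.shape[1]))   # nonnegative start
        w /= np.linalg.norm(w)
        for _ in range(n_iter):
            # Ascent step on w^T C w (up to a constant factor),
            # then projection onto the nonnegative orthant.
            w = np.maximum(w + step * (C @ w), 0.0)
            norm = np.linalg.norm(w)
            if norm == 0.0:                    # degenerate case: stop early
                break
            w /= norm                          # back onto the unit sphere
        W[:, k] = w
        # Deflation: remove the variance captured along w before the next pass.
        C = C - (w @ C @ w) * np.outer(w, w)
    return W

# Synthetic stand-in for a microarray matrix: 40 samples, 200 genes.
rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, scale=1.0, size=(40, 200))

W = nonnegative_pca(X, n_components=3)
Z = (X - X.mean(axis=0)) @ W                   # low-dimensional features
```

In a pipeline of the NPCA-SVM kind, the projected features `Z` would then be passed to an SVM classifier (for example scikit-learn's `SVC`) in place of the raw expression values. Because the loadings are nonnegative, the genes with the largest entries in each column of `W` can be read directly as candidate contributors to that component.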