Issue No.01 - January-February (2011 vol.8)
pp: 246-252
Joaquim F. Pinto da Costa, University of Porto, Porto
Hugo Alonso, University of Aveiro, Aveiro
Luís Roque, ISEP, Porto
This work has two parts. The first introduces new developments in Principal Component Analysis (PCA); the second presents a new method for selecting variables (genes, in our application). Our focus is on problems where the values taken by each variable do not all have the same importance and where the data may be contaminated with noise and contain outliers, as is the case with microarray data. The usual PCA is not appropriate for this kind of problem. In this context, we propose using a new correlation coefficient as an alternative to Pearson's, which leads to a so-called weighted PCA (WPCA). To illustrate the features of our WPCA and compare it with the usual PCA, we analyze gene expression data sets. In the second part, we propose a new PCA-based algorithm that iteratively selects the most important genes in a microarray data set. We show that this algorithm produces better results when our WPCA is used instead of the usual PCA. Furthermore, using Support Vector Machines, we show that it can compete with the Significance Analysis of Microarrays algorithm.
Correlation, principal component analysis, support vector machines, microarray data, gene selection.
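The abstract's central idea, running PCA on a correlation matrix other than Pearson's, can be illustrated with a minimal sketch. The abstract does not define the authors' coefficient, so the code below substitutes a standard observation-weighted Pearson correlation purely as a stand-in; the function names, the weights `w`, and the synthetic data are all assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def weighted_corr_matrix(X, w):
    """Observation-weighted Pearson correlation between the columns of X.

    Stand-in only: this is NOT the coefficient proposed in the paper.
    X has shape (n_samples, n_variables); w holds one non-negative
    weight per sample.
    """
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                      # normalize weights to sum to 1
    mean = w @ X                         # weighted column means
    Xc = X - mean                        # center each column
    cov = (Xc * w[:, None]).T @ Xc       # weighted covariance matrix
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)        # scale to unit diagonal


def pca_from_corr(R):
    """Eigen-decompose a symmetric correlation matrix, largest eigenvalue first."""
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


# Synthetic data (hypothetical): 50 samples, 5 independent variables,
# with one gross outlier in the first sample.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[0] = 25.0

# Usual PCA: Pearson correlation matrix.
vals_pearson, _ = pca_from_corr(np.corrcoef(X, rowvar=False))

# Weighted variant: down-weight the outlying sample (weights chosen by hand).
w = np.ones(50)
w[0] = 1e-3
vals_weighted, _ = pca_from_corr(weighted_corr_matrix(X, w))

print("leading eigenvalue, Pearson :", vals_pearson[0])
print("leading eigenvalue, weighted:", vals_weighted[0])
```

In this toy setting the single outlier induces spurious correlation among otherwise independent variables, which inflates the leading eigenvalue under Pearson's coefficient; down-weighting that sample reduces the effect. How the paper actually assigns importance to values is not stated in the abstract.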
Joaquim F. Pinto da Costa, Hugo Alonso, Luís Roque, "A Weighted Principal Component Analysis and Its Application to Gene Expression Data", IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 1, pp. 246-252, January-February 2011, doi:10.1109/TCBB.2009.61