Semisupervised Learning for Molecular Profiling
April-June 2005 (vol. 2 no. 2)
pp. 110-118

Abstract: Class prediction and feature selection are two learning tasks that are tightly coupled in the search for molecular profiles from microarray data. Researchers have become aware of how easy it is to incur a selection bias effect, and complex validation setups are required to avoid overly optimistic estimates of the predictive accuracy of the models and incorrect gene selections. This paper describes a semisupervised pattern discovery approach that uses the by-products of complete validation studies on experimental setups for gene profiling. In particular, we introduce the study of how single samples respond to the gene selection process induced by typical supervised learning tasks in microarray studies; we call these response patterns sample-tracking profiles. We derive sample-tracking profiles as the aggregated off-training evaluation of SVM models of increasing gene panel sizes. Genes are ranked by E-RFE, an entropy-based variant of recursive feature elimination for support vector machines (RFE-SVM). A Dynamic Time Warping (DTW) algorithm is then applied to define a metric between sample-tracking profiles. Unsupervised clustering based on the DTW metric automates the discovery of outliers and of subtypes of different molecular profiles. Applications are described on synthetic data and in two gene expression studies.
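
To make the DTW step concrete, the following is a minimal sketch of a dynamic-programming DTW distance applied to a few profiles. It is not the authors' implementation: the absolute-difference local cost, the unconstrained warping window, the function names `dtw_distance` and `pairwise_dtw`, and the toy profile values are all assumptions made for illustration. Each toy profile is imagined as one sample's off-training error rate across increasing gene panel sizes.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cumulative cost of the best monotone alignment between a and b.

    Assumption: local cost is |a_i - b_j| and no warping-window
    constraint is applied; the paper may use different choices.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def pairwise_dtw(profiles: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of DTW distances between all pairs of profiles."""
    k = len(profiles)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(profiles[i], profiles[j])
    return dist


# Hypothetical sample-tracking profiles (illustrative values only).
profiles = [
    [0.0, 0.0, 0.1, 0.0],  # almost always classified correctly
    [0.0, 0.1, 0.0, 0.0],  # same shape, shifted by one panel size
    [1.0, 0.9, 1.0, 1.0],  # almost always misclassified: outlier candidate
]
D = pairwise_dtw(profiles)
```

A distance matrix such as `D` could then be passed to any clustering routine that accepts precomputed dissimilarities. In this toy case the first two profiles are close under warping, while the third stands apart from both.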

Index Terms:
Machine learning, data mining, classifier design and evaluation, feature evaluation and selection, pattern analysis, clustering, similarity measures, biology and genetics, bioinformatics databases.
Cesare Furlanello, Maria Serafini, Stefano Merler, Giuseppe Jurman, "Semisupervised Learning for Molecular Profiling," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 2, no. 2, pp. 110-118, April-June 2005, doi:10.1109/TCBB.2005.28