Peakbin Selection in Mass Spectrometry Data Using a Consensus Approach with Estimation of Distribution Algorithms
May/June 2011 (vol. 8 no. 3)
pp. 760-774
Rubén Armañanzas, Universidad Politécnica de Madrid, Madrid
Yvan Saeys, Ghent University, Ghent
Iñaki Inza, University of the Basque Country, San Sebastián
Miguel García-Torres, Pablo de Olavide University, Seville
Concha Bielza, Universidad Politécnica de Madrid, Madrid
Yves van de Peer, Ghent University, Ghent
Pedro Larrañaga, Universidad Politécnica de Madrid, Madrid
Progress is continuously being made in the quest for stable biomarkers linked to complex diseases. Mass spectrometers are one of the devices used to tackle this problem, but the data profiles they produce are noisy and unstable. In these profiles, biomarkers are detected as signal regions (peaks) where control and disease samples behave differently. Mass spectrometry (MS) data generally contain a limited number of samples described by a high number of features. In this work, we present a novel class of evolutionary algorithms, estimation of distribution algorithms (EDAs), as an efficient peak selector in this MS domain. There is a trade-off between the reliability of the detected biomarkers and the low number of samples available for analysis. For this reason, we introduce a consensus approach, built upon the classical EDA scheme, that improves the stability and robustness of the final set of relevant peaks. An entire data workflow is designed to yield unbiased results. Four publicly available MS data sets (two MALDI-TOF and two SELDI-TOF) are analyzed. The results are compared to the original works, and a new plot (the peak frequential plot) for graphically inspecting the relevant peaks is introduced. A complete online supplementary page includes extended information and results, in addition to Matlab scripts and references.
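
The consensus idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which EDA variant, fitness function, or consensus rule was used, so the sketch assumes a simple univariate EDA (UMDA) over binary peak masks, a hypothetical toy fitness in place of a classifier-based score, and a hypothetical rule that keeps peaks selected in at least 80 percent of independent runs. All function names and parameter values are illustrative.

```python
import random

def umda_select(fitness, n_features, rng, pop_size=60, n_keep=20, generations=30):
    """Univariate EDA (UMDA) over binary masks; returns the best mask found."""
    probs = [0.5] * n_features  # independent Bernoulli model, one per feature
    best_mask, best_score = None, float("-inf")
    for _ in range(generations):
        # Sample a population of candidate masks from the current model.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        population.sort(key=fitness, reverse=True)
        top_score = fitness(population[0])
        if top_score > best_score:
            best_mask, best_score = population[0], top_score
        # Re-estimate each marginal from the best n_keep individuals.
        selected = population[:n_keep]
        for j in range(n_features):
            freq = sum(ind[j] for ind in selected) / n_keep
            probs[j] = min(max(freq, 0.05), 0.95)  # clamp to keep some diversity
    return best_mask

def consensus_peaks(fitness, n_features, n_runs=10, threshold=0.8, seed=0):
    """Run the EDA several times and keep features chosen in most runs."""
    counts = [0] * n_features
    for run in range(n_runs):
        rng = random.Random(seed + run)  # a different seed for each run
        mask = umda_select(fitness, n_features, rng)
        for j, bit in enumerate(mask):
            counts[j] += bit
    freqs = [c / n_runs for c in counts]  # per-feature selection frequency
    chosen = [j for j, f in enumerate(freqs) if f >= threshold]
    return chosen, freqs

# Hypothetical toy problem: 12 candidate peaks, of which 3 are informative.
INFORMATIVE = {2, 5, 7}

def toy_fitness(mask):
    # Reward informative peaks, penalize every other selected peak.
    hits = sum(bit for j, bit in enumerate(mask) if j in INFORMATIVE)
    extras = sum(bit for j, bit in enumerate(mask) if j not in INFORMATIVE)
    return 2 * hits - extras

selected, frequencies = consensus_peaks(toy_fitness, n_features=12)
```

In a real MS workflow the toy fitness would be replaced by an estimate of classification performance on the selected peaks, and the per-feature frequencies are the kind of quantity a frequency-style plot of relevant peaks could display.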

Index Terms:
Mass spectrometry, EDA, feature selection, biomarker discovery.
Rubén Armañanzas, Yvan Saeys, Iñaki Inza, Miguel García-Torres, Concha Bielza, Yves van de Peer, Pedro Larrañaga, "Peakbin Selection in Mass Spectrometry Data Using a Consensus Approach with Estimation of Distribution Algorithms," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 8, no. 3, pp. 760-774, May-June 2011, doi:10.1109/TCBB.2010.18