Issue No.04 - April (2012 vol.24)

pp: 634-648

Xin Zhang, New York University, New York

Dennis Shasha, New York University, New York

Yang Song, New Jersey Institute of Technology and UMDNJ-New Jersey Medical School Cancer Center, Newark

Jason T.L. Wang, New Jersey Institute of Technology, Newark

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2010.238

ABSTRACT

We study a data mining problem concerning elastic peak detection in 2D liquid chromatography-mass spectrometry (LC-MS) data. These data can be modeled as time series, in which the X-axis represents time points and the Y-axis represents intensity values. A peak occurs in a set of 2D LC-MS data when the sum of the intensity values in a sliding time window exceeds a user-determined threshold. The elastic peak detection problem is to locate all peaks across multiple window sizes of interest in the data set. We propose a new data structure, called a Shifted Aggregation Tree, or AggTree for short, and use it to find the different peaks. Our method, called PeakID, solves the elastic peak detection problem in 2D LC-MS data, yielding neither false positives nor false negatives. The method works by first constructing an AggTree in a bottom-up manner from the given data set, and then searching the AggTree for the peaks in a top-down manner. We describe a state-space algorithm for finding the topology and structure of an efficient AggTree to be used by PeakID. Our experimental results demonstrate the superiority of the proposed method over other methods on both synthetic and real-world data.
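To make the problem statement concrete, the following is a minimal brute-force sketch of elastic peak detection as the abstract defines it: report every sliding window, across several window sizes, whose intensity sum exceeds a threshold. It is not PeakID and does not build an AggTree; it is only the naive baseline that such a structure is meant to speed up. The function name `elastic_peaks` and the `thresholds` mapping (one threshold per window size) are assumptions made for illustration, not taken from the paper.

```python
from itertools import accumulate


def elastic_peaks(intensities, thresholds):
    """Return (start_index, window_size) pairs whose window sum exceeds the threshold.

    intensities: sequence of intensity values, one per time point.
    thresholds:  dict mapping window size -> threshold (hypothetical interface).
    """
    # prefix[i] holds the sum of intensities[0:i], so any window sum is one subtraction.
    prefix = [0] + list(accumulate(intensities))
    peaks = []
    for window, threshold in sorted(thresholds.items()):
        # Slide the window over every valid start position.
        for start in range(len(intensities) - window + 1):
            window_sum = prefix[start + window] - prefix[start]
            if window_sum > threshold:
                peaks.append((start, window))
    return peaks


series = [1, 2, 9, 8, 1, 0, 7]
found = elastic_peaks(series, {1: 8, 2: 15, 3: 17})
print(found)
```

Because every window of every size is checked exactly, this baseline also yields neither false positives nor false negatives, but its cost grows with the number of window sizes times the series length.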

INDEX TERMS

Knowledge discovery from LC-MS data, time series data mining, bioinformatics, computational proteomics, algorithms and data structures.

CITATION

Xin Zhang, Dennis Shasha, Yang Song, Jason T.L. Wang, "Fast Elastic Peak Detection for Mass Spectrometry Data Mining",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 4, pp. 634-648, April 2012, doi:10.1109/TKDE.2010.238