A Probabilistic Model for Mining Labeled Ordered Trees: Capturing Patterns in Carbohydrate Sugar Chains
August 2005 (vol. 17 no. 8)
pp. 1051-1064
Glycans, or carbohydrate sugar chains, which play a number of important roles in the development and functioning of multicellular organisms, can be regarded as labeled ordered trees. A recent increase in the documentation of glycan structures, especially in the form of database curation, has made mining glycans important for the understanding of living cells. We propose a probabilistic model for mining labeled ordered trees, and we further present an efficient learning algorithm for this model, based on an EM algorithm. The time and space complexities of this algorithm are rather favorable, falling within the practical limits set by a variety of existing probabilistic models, including stochastic context-free grammars. Experimental results have shown that, in a supervised problem setting, the proposed method outperformed five competing methods by a statistically significant margin in all cases. We further applied the proposed method to aligning multiple glycan trees, and we detected biologically significant common subtrees in these alignments, in which the trees are automatically classified into subtypes already known in glycobiology. Extended abstracts of parts of the work presented in this paper have appeared in [35], [4], and [3].
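
To make the term "labeled ordered tree" concrete, the following is a minimal sketch of such a data structure, not the authors' model or code. The names `Node` and `preorder` are hypothetical, and the monosaccharide abbreviations are used only as illustrative labels; the toy tree is not claimed to be a real glycan.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a labeled ordered tree (hypothetical helper type).

    `label` would be a monosaccharide name for a glycan; `children`
    is a list, so the left-to-right order of siblings is part of the tree.
    """
    label: str
    children: List["Node"] = field(default_factory=list)


def preorder(node: Node) -> List[str]:
    """Return the labels in preorder, visiting siblings left to right."""
    labels = [node.label]
    for child in node.children:
        labels.extend(preorder(child))
    return labels


# Toy trees (illustrative labels only): same labels, different sibling order.
tree = Node("Man", [Node("Gal"), Node("Fuc")])
swapped = Node("Man", [Node("Fuc"), Node("Gal")])

print(preorder(tree))
print(tree == swapped)  # dataclass equality compares the child lists in order
```

Because the children are stored in a list, two trees with the same labels but a different sibling order compare as different, which is the property that distinguishes ordered from unordered trees.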

[1] N. Abe and H. Mamitsuka, “Predicting Protein Secondary Structure Using Stochastic Tree Grammars,” Machine Learning, vol. 29, pp. 275-301, 1997.
[2] S. Abiteboul, P. Buneman, and D. Suciu, Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, 2000.
[3] K.F. Aoki, N. Ueda, A. Yamaguchi, T. Akutsu, M. Kanehisa, and H. Mamitsuka, “Managing and Analyzing Carbohydrate Data,” ACM SIGMOD Record, vol. 33, no. 2, 2004.
[4] K.F. Aoki, N. Ueda, A. Yamaguchi, M. Kanehisa, T. Akutsu, and H. Mamitsuka, “Application of a New Probabilistic Model for Recognizing Complex Patterns in Glycans,” Proc. 12th Int'l Conf. Intelligent Systems for Molecular Biology, 2004.
[5] K.F. Aoki, A. Yamaguchi, N. Ueda, T. Akutsu, H. Mamitsuka, S. Goto, and M. Kanehisa, “KCaM (KEGG Carbohydrate Matcher): A Software Tool for Analyzing the Structures of Carbohydrate Sugar Chains,” Nucleic Acids Research, vol. 32, (Web Server issue), pp. W267-W272, 2004.
[6] T. Asai, H. Arimura, K. Abe, S. Kawasoe, and S. Arikawa, “Online Algorithms for Mining Semistructured Data Streams,” Proc. Second Int'l Conf. Data Mining, pp. 158-174, 2002.
[7] J.K. Baker, “Trainable Grammars for Speech Recognition,” Speech Comm. Papers Presented at the 97th Meeting of ASA, pp. 547-550, 1979.
[8] L. Baum and T. Petrie, “Statistical Inference for Probabilistic Functions of Finite State Markov Chains,” Annals of Math. Statistics, vol. 37, no. 6, pp. 1554-1563, 1966.
[9] C. Bertozzi and L. Kiessling, “Chemical Glycobiology,” Science, vol. 291, pp. 2357-2364, 2001.
[10] S. Brooks, M. Dwek, and U. Schumacher, Functional and Molecular Glycobiology. BIOS Scientific, 2002.
[11] I.V. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, “Model-Based Clustering and Visualization of Navigation Patterns on a Web Site,” Data Mining and Knowledge Discovery, vol. 7, no. 4, pp. 399-424, 2003.
[12] G. Chang, G. Patel, L. Relihan, and J.T.L. Wang, “A Graphical Environment for Change Detection in Structured Documents,” Proc. 21st Int'l Computer Software and Applications Conf., pp. 536-541, 1997.
[13] G. Cong, L. Yi, B. Liu, and K. Wang, “Discovering Frequent Substructures from Hierarchical Semistructured Data,” Proc. Second SIAM Int'l Conf. Data Mining (SDM '02), 2002.
[14] M.S. Crouse, R.D. Nowak, and R.G. Baraniuk, “Wavelet-Based Statistical Signal Processing Using Hidden Markov Models,” IEEE Trans. Signal Processing, vol. 46, no. 4, pp. 886-902, 1998.
[15] A. Dempster, N. Laird, and D. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm,” J. Royal Statistical Soc., Series B, vol. 39, pp. 1-38, 1977.
[16] M. Diligenti, P. Frasconi, and M. Gori, “Hidden Tree Markov Models for Document Image Classification,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 4, pp. 519-523, 2003.
[17] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge Univ. Press, 1998.
[18] S. Fine, Y. Singer, and N. Tishby, “Hierarchical Hidden Markov Model: Analysis and Applications,” Machine Learning, vol. 32, no. 1, pp. 41-62, 1998.
[19] T. Gärtner, P.A. Flach, and S. Wrobel, “On Graph Kernels: Hardness Results and Efficient Alternatives,” Proc. 16th Ann. Conf. Learning Theory, pp. 129-143, 2003.
[20] D.J. Hand and R.J. Till, “A Simple Generalisation of the Area under the ROC Curve for Multiple Class Classification Problems,” Machine Learning, vol. 45, pp. 171-186, 2001.
[21] T. Horváth, T. Gärtner, and S. Wrobel, “Cyclic Pattern Kernels for Predictive Graph Mining,” Proc. 10th ACM SIGKDD, pp. 158-167, 2004.
[22] T.S. Jaakkola and D. Haussler, “Exploiting Generative Models in Discriminative Classifiers,” Proc. Conf. Neural Information Processing Systems (NIPS 11), pp. 487-493, 1999.
[23] M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori, “The KEGG Resource for Deciphering the Genome,” Nucleic Acids Research, vol. 32, pp. D277-D280, 2004.
[24] H. Kashima and T. Koyanagi, “Kernels for Semistructured Data,” Proc. 19th Int'l Conf. Machine Learning, pp. 411-417, 2002.
[25] H. Kashima, K. Tsuda, and A. Inokuchi, “Marginalized Kernels between Labeled Graphs,” Proc. 20th Int'l Conf. Machine Learning, pp. 321-328, 2003.
[26] K. Lari and S.J. Young, “The Estimation of Stochastic Context-Free Grammars Using the Inside-Outside Algorithm,” Computer Speech and Language, vol. 4, pp. 35-56, 1990.
[27] S.L. Lauritzen and D.J. Spiegelhalter, “Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems (with Discussion),” J. Royal Statistical Soc., Series B, vol. 50, no. 2, pp. 157-224, 1988.
[28] P. Mahé, N. Ueda, T. Akutsu, J.-L. Perret, and J.-P. Vert, “Extensions of Marginalized Graph Kernels,” Proc. 21st Int'l Conf. Machine Learning, pp. 552-559, 2004.
[29] G.J. McLachlan and T. Krishnan, The EM Algorithm and Extensions. Wiley-Interscience, 1996.
[30] G.J. McLachlan and D. Peel, Finite Mixture Models. John Wiley & Sons, 2000.
[31] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, 1988.
[32] L. Rabiner and S. Juang, Fundamentals of Speech Recognition. Prentice Hall, 1993.
[33] A. Termier, M.C. Rousset, and M. Sebag, “Treefinder: A First Step towards XML Data Mining,” Proc. Second IEEE Int'l Conf. Data Mining, pp. 450-457, 2002.
[34] K. Tsuda, T. Kin, and K. Asai, “Marginalized Kernels for Biological Sequences,” Bioinformatics, vol. 18 (Supplement 1), pp. 268-275, 2002.
[35] N. Ueda, K.F. Aoki, and H. Mamitsuka, “A General Probabilistic Framework for Mining Labeled Ordered Trees,” Proc. SIAM Int'l Conf. Data Mining, pp. 357-368, 2004.
[36] N. Ueda and T. Sato, “Simplified Training Algorithms for Hierarchical Hidden Markov Models,” Proc. Fourth Int'l Conf. Discovery Science, pp. 401-415, 2001.
[37] A. Varki, R. Cummings, J. Esko, and H. Freeze, Essentials of Glycobiology. Cold Spring Harbor Laboratory Press, 1999.
[38] M. Zaki and C. Aggarwal, “An Effective Structural Classifier for XML Data,” Proc. Ninth ACM SIGKDD, pp. 316-325, 2003.

Index Terms:
Biology and genetics, machine learning, data mining, mining methods and algorithms.
Nobuhisa Ueda, Kiyoko F. Aoki-Kinoshita, Atsuko Yamaguchi, Tatsuya Akutsu, Hiroshi Mamitsuka, "A Probabilistic Model for Mining Labeled Ordered Trees: Capturing Patterns in Carbohydrate Sugar Chains," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, pp. 1051-1064, Aug. 2005, doi:10.1109/TKDE.2005.117