Learning Local Languages and Their Application to DNA Sequence Analysis
October 1998 (vol. 20 no. 10)
pp. 1067-1079

Abstract: This paper concerns an efficient algorithm for learning in the limit, from positive data, a special class of regular languages called strictly locally testable languages, and its application to identifying the protein α-chain region in amino acid sequences. First, we present a linear-time algorithm that, given a strictly locally testable language, learns (identifies) its deterministic finite-state automaton in the limit from positive data only. This provides a practical and efficient method for learning a specific concept domain of sequence analysis. We then describe several experimental results obtained with this learning algorithm. Following a theoretical observation which strongly suggests that certain types of amino acid sequences can be expressed by a locally testable language, we apply the learning algorithm to identifying the protein α-chain region in amino acid sequences for hemoglobin. The experiments show an overall rate of correct identification of 95 percent for positive data and 96 percent for negative data.
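To make the idea of learning from positive data concrete, the following is a minimal Python sketch of a learner for strictly k-testable languages. It is not the algorithm of the paper, whose details are not given in the abstract. It relies on the standard characterization of such a language by its allowed prefixes and suffixes of length k-1 and its allowed substrings of length k. The names `KTestableModel`, `update`, and `accepts` are hypothetical, and memorizing strings shorter than k verbatim is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class KTestableModel:
    """Illustrative learner for strictly k-testable languages (assumption:
    not the paper's algorithm, only the standard prefix/suffix/factor view)."""
    k: int
    prefixes: set = field(default_factory=set)  # allowed prefixes of length k-1
    suffixes: set = field(default_factory=set)  # allowed suffixes of length k-1
    factors: set = field(default_factory=set)   # allowed substrings of length k
    short: set = field(default_factory=set)     # positive examples shorter than k

    def __post_init__(self):
        if self.k < 2:
            raise ValueError("this sketch assumes k >= 2")

    def update(self, s: str) -> None:
        """Incorporate one positive example; work is linear in len(s) for fixed k."""
        k = self.k
        if len(s) < k:
            # Simplifying assumption: short strings are memorized as-is.
            self.short.add(s)
            return
        self.prefixes.add(s[:k - 1])
        self.suffixes.add(s[len(s) - (k - 1):])
        for i in range(len(s) - k + 1):
            self.factors.add(s[i:i + k])

    def accepts(self, s: str) -> bool:
        """Check membership in the language described by the current hypothesis."""
        k = self.k
        if len(s) < k:
            return s in self.short
        return (
            s[:k - 1] in self.prefixes
            and s[len(s) - (k - 1):] in self.suffixes
            and all(s[i:i + k] in self.factors for i in range(len(s) - k + 1))
        )


if __name__ == "__main__":
    model = KTestableModel(k=2)
    for example in ["ab", "abab", "ababab"]:  # toy positive data, not from the paper
        model.update(example)
    print(model.accepts("abababab"))  # generalizes beyond the examples seen
    print(model.accepts("abba"))      # contains the unseen substring "bb"
```

Because the hypothesis only grows as new positive examples arrive, it stops changing once every allowed prefix, suffix, and substring of the target language has been seen, which is the sense in which such a learner identifies the language in the limit.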

Index Terms:
Local languages, deterministic automata, hemoglobin α-chain, DNA sequence analysis, machine learning.
Citation:
Takashi Yokomori, Satoshi Kobayashi, "Learning Local Languages and Their Application to DNA Sequence Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1067-1079, Oct. 1998, doi:10.1109/34.722617