Learning Local Languages and Their Application to DNA Sequence Analysis
October 1998 (vol. 20 no. 10)
pp. 1067-1079

Abstract: This paper concerns an efficient algorithm for learning in the limit, from positive data, a special class of regular languages called strictly locally testable languages, and its application to identifying the protein α-chain region in amino acid sequences. First, we present a linear-time algorithm that, given a strictly locally testable language, learns (identifies) its deterministic finite state automaton in the limit from positive data only. This provides a practical and efficient method for learning a specific concept domain of sequence analysis. We then describe several experimental results obtained with this learning algorithm. Following a theoretical observation which strongly suggests that certain types of amino acid sequences can be expressed by a locally testable language, we apply the learning algorithm to identifying the protein α-chain region in amino acid sequences for hemoglobin. Experimental scores show an overall success rate of 95 percent correct identification for positive data and 96 percent for negative data.
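To make the idea of learning a strictly locally testable language from positive data concrete, here is a minimal sketch in Python. It is a textbook-style illustration of strictly k-testable inference, not the paper's linear-time automaton construction: the names `KTestableModel`, `learn_k_testable`, and `accepts` are hypothetical, the window size `k` is assumed to be known in advance, and the sample strings are made-up toy data rather than hemoglobin sequences. The learner simply records which length-(k-1) prefixes, length-(k-1) suffixes, and length-k substrings occur in the positive examples, and accepts any string built only from those pieces.

```python
from typing import FrozenSet, Iterable, NamedTuple


class KTestableModel(NamedTuple):
    """Hypothetical container for what the learner has observed."""
    k: int
    prefixes: FrozenSet[str]  # length k-1 prefixes of the samples
    suffixes: FrozenSet[str]  # length k-1 suffixes of the samples
    factors: FrozenSet[str]   # length k substrings of the samples
    short: FrozenSet[str]     # samples shorter than k, stored verbatim


def learn_k_testable(samples: Iterable[str], k: int) -> KTestableModel:
    """Collect the local pieces seen in positive examples (k is assumed known)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    prefixes, suffixes, factors, short = set(), set(), set(), set()
    for w in samples:
        if len(w) < k:
            short.add(w)
            continue
        prefixes.add(w[:k - 1])
        suffixes.add(w[len(w) - (k - 1):])
        for i in range(len(w) - k + 1):
            factors.add(w[i:i + k])
    return KTestableModel(k, frozenset(prefixes), frozenset(suffixes),
                          frozenset(factors), frozenset(short))


def accepts(model: KTestableModel, w: str) -> bool:
    """Accept w if every local piece of w was seen during learning."""
    k = model.k
    if len(w) < k:
        return w in model.short
    return (
        w[:k - 1] in model.prefixes
        and w[len(w) - (k - 1):] in model.suffixes
        and all(w[i:i + k] in model.factors for i in range(len(w) - k + 1))
    )


# Toy usage: two positive examples over the alphabet {a, b}, with k = 2.
model = learn_k_testable(["abab", "ababab"], k=2)
print(accepts(model, "abababab"))  # generalizes to a longer string of the same shape
print(accepts(model, "abba"))      # rejected: the substring "bb" was never observed
```

Each new positive example can only add pieces to the four sets, so the hypothesis grows monotonically and stops changing once every piece of the target language has been seen. This is the sense in which such a learner converges "in the limit" from positive data alone.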

Index Terms:
Local languages, deterministic automata, hemoglobin α-chain, DNA sequence analysis, machine learning.
Takashi Yokomori, Satoshi Kobayashi, "Learning Local Languages and Their Application to DNA Sequence Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1067-1079, Oct. 1998, doi:10.1109/34.722617