
Takashi Yokomori, Satoshi Kobayashi, "Learning Local Languages and Their Application to DNA Sequence Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 10, pp. 1067-1079, October 1998. doi: 10.1109/34.722617
Keywords: Local languages, deterministic automata, hemoglobin α-chain, DNA sequence analysis, machine learning.
Abstract: This paper concerns an efficient algorithm for learning in the limit, from positive data, a special class of regular languages called strictly locally testable languages, and its application to identifying the protein α-chain region in amino acid sequences. First, we present a linear-time algorithm that, given a strictly locally testable language, learns (identifies) its deterministic finite-state automaton in the limit from positive data only. This provides a practical and efficient method for learning a specific concept domain of sequence analysis. We then describe several experimental results obtained with this learning algorithm. Following a theoretical observation which strongly suggests that certain types of amino acid sequences can be expressed by a locally testable language, we apply the learning algorithm to identifying the protein α-chain region in amino acid sequences for hemoglobin. Experimental scores show an overall success rate of 95 percent correct identification for positive data and 96 percent for negative data.
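To make the idea of learning a strictly locally testable language from positive data concrete, the following is a minimal sketch of the standard k-factor construction: record the allowed prefixes, suffixes, and length-k substrings of the examples, and accept exactly the strings built from them. It is not the paper's algorithm, which builds a deterministic finite-state automaton and is not described in the abstract. The names `KLocalModel` and `learn`, the choice of k, and the toy strings are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class KLocalModel:
    """Hypothetical k-factor model of a strictly locally testable language."""
    k: int
    prefixes: set = field(default_factory=set)  # length k-1 prefixes seen
    suffixes: set = field(default_factory=set)  # length k-1 suffixes seen
    factors: set = field(default_factory=set)   # length k substrings seen
    short: set = field(default_factory=set)     # examples shorter than k

    def update(self, s: str) -> None:
        """Fold one positive example into the model."""
        k = self.k
        if len(s) < k:
            self.short.add(s)
            return
        self.prefixes.add(s[:k - 1])
        self.suffixes.add(s[len(s) - k + 1:])
        for i in range(len(s) - k + 1):
            self.factors.add(s[i:i + k])

    def accepts(self, s: str) -> bool:
        """Accept s iff its prefix, suffix, and every k-factor were observed."""
        k = self.k
        if len(s) < k:
            return s in self.short
        return (
            s[:k - 1] in self.prefixes
            and s[len(s) - k + 1:] in self.suffixes
            and all(s[i:i + k] in self.factors for i in range(len(s) - k + 1))
        )


def learn(samples, k: int) -> KLocalModel:
    """Build a model from positive examples only (assumes k >= 1)."""
    model = KLocalModel(k)
    for s in samples:
        model.update(s)
    return model


# Toy usage with made-up strings (not real sequence data).
model = learn(["ab", "abab"], k=2)
print(model.accepts("ababab"))  # generalizes beyond the examples
print(model.accepts("aab"))     # contains the unseen factor "aa"
```

Each example is processed in time proportional to its length, and the hypothesis only grows as new positive examples arrive, which is the sense in which such a learner converges in the limit without negative data.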