Natural Language Grammatical Inference with Recurrent Neural Networks
January/February 2000 (vol. 12 no. 1)
pp. 126-140

Abstract—This paper examines the inductive inference of a complex grammar with neural networks. Specifically, the task considered is that of training a network to classify natural language sentences as grammatical or ungrammatical, thereby exhibiting the same kind of discriminatory power provided by the Principles and Parameters linguistic framework, or Government-and-Binding theory. Neural networks are trained, without the division into learned vs. innate components assumed by Chomsky, in an attempt to produce the same judgments as native speakers on sharply grammatical/ungrammatical data. How a recurrent neural network could possess linguistic capability is discussed, along with the properties of various common recurrent neural network architectures. The problem exhibits training behavior that is often not present with smaller grammars, and training was initially difficult. However, after implementing several techniques aimed at improving the convergence of the gradient-descent backpropagation-through-time training algorithm, significant learning was possible. It was found that certain architectures are better able to learn an appropriate grammar. The operation of the networks and their training is analyzed. Finally, the extraction of rules in the form of deterministic finite-state automata is investigated.
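The training setup summarized above is, at its core, a recurrent network trained by gradient-descent backpropagation through time (BPTT) to map a sentence, represented as a sequence of word or word-class tokens, to a grammatical/ungrammatical judgment. The following is a minimal illustrative sketch of that kind of setup, not the authors' implementation: an Elman-style network in which the one-hot token encoding, layer sizes, learning rate, and toy data are all assumptions made for the example.

# Illustrative sketch only (assumed encoding, sizes, and toy data): an
# Elman-style recurrent network trained with BPTT to label a token
# sequence as grammatical (1) or ungrammatical (0).
import numpy as np

rng = np.random.default_rng(0)
V, H = 8, 12                      # vocabulary size, hidden units (assumed)
Wxh = rng.normal(0, 0.1, (H, V))  # input-to-hidden weights
Whh = rng.normal(0, 0.1, (H, H))  # recurrent hidden-to-hidden weights
bh  = np.zeros(H)
wo  = rng.normal(0, 0.1, H)       # hidden-to-output weights
bo  = 0.0

def forward(tokens):
    """Run the sequence, keeping the hidden states needed for BPTT."""
    hs = [np.zeros(H)]
    for t in tokens:
        x = np.zeros(V); x[t] = 1.0                # one-hot word encoding
        hs.append(np.tanh(Wxh @ x + Whh @ hs[-1] + bh))
    p = 1.0 / (1.0 + np.exp(-(wo @ hs[-1] + bo)))  # P(grammatical)
    return hs, p

def bptt_step(tokens, target, lr=0.1):
    """One gradient-descent update via backpropagation through time."""
    global bo
    hs, p = forward(tokens)
    dWxh, dWhh, dbh = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(bh)
    dz = p - target                       # d(cross-entropy)/d(pre-sigmoid)
    dwo, dbo = dz * hs[-1], dz
    dh = dz * wo                          # gradient flowing into the final state
    for t in reversed(range(len(tokens))):
        x = np.zeros(V); x[tokens[t]] = 1.0
        da = dh * (1.0 - hs[t + 1] ** 2)  # back through tanh
        dWxh += np.outer(da, x)
        dWhh += np.outer(da, hs[t])
        dbh  += da
        dh = Whh.T @ da                   # propagate one step back in time
    for W, dW in ((Wxh, dWxh), (Whh, dWhh), (bh, dbh), (wo, dwo)):
        W -= lr * dW                      # in-place weight update
    bo -= lr * dbo
    return p

# Toy usage: sequences of word-class indices with 0/1 grammaticality labels.
data = [([1, 2, 3], 1), ([3, 2, 1], 0), ([1, 2, 2, 3], 1), ([2, 1], 0)]
for epoch in range(200):
    for tokens, label in data:
        bptt_step(tokens, label)
print([round(forward(t)[1], 2) for t, _ in data])

The paper itself compares several recurrent architectures and applies additional techniques to improve convergence; the sketch shows only the basic sequence-classification and BPTT machinery that those experiments build on.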

[1] R.B. Allen, “Sequential Connectionist Networks for Answering Simple Questions about a Microworld,” Fifth Ann. Proc. Cognitive Science Soc., pp. 489-495, 1983.
[2] E. Barnard and E.C. Botha, “Back-Propagation Uses Prior Information Efficiently,” IEEE Trans. Neural Networks, vol. 4, no. 5, pp. 794-802, Sept. 1993.
[3] E. Barnard and D. Casasent, “A Comparison between Criterion Functions for Linear Classifiers, with an Application to Neural Nets,” IEEE Trans. Systems, Man, and Cybernetics, vol. 19, no. 5, pp. 1,030-1,041, 1989.
[4] E.B. Baum and F. Wilczek, “Supervised Learning of Probability Distributions by Neural Networks,” Neural Information Processing Systems, D.Z. Anderson, ed., pp. 52-61, New York: Am. Inst. of Physics, 1988.
[5] M.P. Casey, “The Dynamics of Discrete-Time Computation, with Application to Recurrent Neural Networks and Finite State Machine Extraction,” Neural Computation, vol. 8, no. 6, pp. 1,135-1,178, 1996.
[6] N.A. Chomsky, “Three Models for the Description of Language,” IRE Trans. Information Theory, vol. 2, pp. 113-124, 1956.
[7] N.A. Chomsky, Lectures on Government and Binding. Foris Publications, 1981.
[8] N.A. Chomsky, Knowledge of Language: Its Nature, Origin, and Use. Praeger, 1986.
[9] A. Cleeremans, D. Servan-Schreiber, and J.L. McClelland, “Finite State Automata and Simple Recurrent Networks,” Neural Computation, vol. 1, no. 3, pp. 372-381, 1989.
[10] C. Darken and J.E. Moody, “Note on Learning Rate Schedules for Stochastic Optimization,” Advances in Neural Information Processing Systems, R.P. Lippmann, J.E. Moody, and D.S. Touretzky, eds., vol. 3, pp. 832-838, San Mateo, Calif.: Morgan Kaufmann, 1991.
[11] C. Darken and J.E. Moody, “Towards Faster Stochastic Gradient Search,” Neural Information Processing Systems 4, pp. 1,009-1,016, San Mateo, Calif.: Morgan Kaufmann, 1992.
[12] J.L. Elman, “Structured Representations and Connectionist Models,” Sixth Ann. Proc. Cognitive Science Soc., pp. 17-25, 1984.
[13] J.L. Elman, “Distributed Representations, Simple Recurrent Networks, and Grammatical Structure,” Machine Learning, vol. 7, pp. 195–225, 1991.
[14] P. Frasconi and M. Gori, “Computational Capabilities of Local-Feedback Recurrent Networks Acting as Finite-State Machines,” IEEE Trans. Neural Networks, vol. 7, no. 6, pp. 1,521-1,524, 1996.
[15] P. Frasconi, M. Gori, M. Maggini, and G. Soda, “Unified Integration of Explicit Knowledge and Learning by Example in Recurrent Networks,” IEEE Trans. Knowledge and Data Eng., vol. 7, no. 2, pp. 340–346, Apr. 1995.
[16] P. Frasconi, M. Gori, and G. Soda, “Local Feedback Multi-Layered Networks,” Neural Computation, vol. 4, no. 1, pp. 120-130, 1992.
[17] K.S. Fu, Syntactic Pattern Recognition and Applications. Englewood Cliffs, N.J.: Prentice Hall, 1982.
[18] M. Gasser and C. Lee, “Networks That Learn Phonology,” technical report, Computer Science Dept., Indiana Univ., 1990.
[19] C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee, “Learning and Extracting Finite State Automata with Second-Order Recurrent Neural Networks,” Neural Computation, vol. 4, no. 3, pp. 393-405, 1992.
[20] C.L. Giles, C.B. Miller, D. Chen, G.Z. Sun, H.H. Chen, and Y.C. Lee, “Extracting and Learning an Unknown Grammar with Recurrent Neural Networks,” Advances in Neural Information Processing Systems 4, J.E. Moody, S.J. Hanson, and R.P Lippmann, eds., pp. 317-324, San Mateo, Calif.: Morgan Kaufmann, 1992.
[21] C.L. Giles, G.Z. Sun, H.H. Chen, Y.C. Lee, and D. Chen, “Higher Order Recurrent Networks and Grammatical Inference,” Advances in Neural Information Processing Systems 2, D.S. Touretzky, ed., pp. 380-387, San Mateo, Calif.: Morgan Kaufmann, 1990.
[22] M. Hare, “The Role of Similarity in Hungarian Vowel Harmony: A Connectionist Account,” Technical Report CRL 9004, Center for Research in Language, Univ. of California, San Diego, 1990.
[23] M. Hare, D. Corina, and G.W. Cottrell, “Connectionist Perspective on Prosodic Structure,” Technical Report CRL Newsletter, vol. 3, no. 2, Center for Research in Language, Univ. of California, San Diego, 1989.
[24] C.L. Harris and J.L. Elman, “Representing Variable Information with Simple Recurrent Networks,” Sixth Ann. Proc. Cognitive Science Soc., pp. 635-642, 1984.
[25] M.H. Harrison, Introduction to Formal Language Theory. Reading, Mass.: Addison-Wesley, 1978.
[26] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan College Press, New York, 1994.
[27] J. Hertz, A. Krogh, and R.G. Palmer, Introduction to the Theory of Neural Computation. Addison-Wesley, 1991.
[28] J. Hopcroft and J. Ullman, Introduction to Automata Theory, Languages and Computation, pp. 22-24. Addison-Wesley, 1979.
[29] J. Hopfield, “Learning Algorithms and Probability Distributions in Feed-Forward and Feed-Back Networks,” Proc. Nat'l Academy of Science, vol. 84, pp. 8,429-8,433, 1987.
[30] B.G. Horne and C.L. Giles, “An Experimental Comparison of Recurrent Neural Networks,” Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky, and T. Leen, eds., pp. 697-704, MIT Press, 1995.
[31] L. Ingber, “Very Fast Simulated Re-Annealing,” Math. Computer Modelling, vol. 12, pp. 967-973, 1989.
[32] L. Ingber, “Adaptive Simulated Annealing (ASA),” technical report, Lester Ingber Research, McLean, Va., 1993.
[33] M.I. Jordan, “Attractor Dynamics and Parallelism in a Connectionist Sequential Machine,” Proc. Ninth Ann. Conf. Cognitive Science Soc., pp. 531-546, 1986.
[34] M.I. Jordan, “Serial Order: A Parallel Distributed Processing Approach,” Technical Report ICS Report 8604, Inst. for Cognitive Science, Univ. of California, San Diego, May 1986.
[35] S. Kirkpatrick and G.B. Sorkin, “Simulated Annealing,” The Handbook of Brain Theory and Neural Networks, M.A. Arbib, ed., pp. 876-878, Cambridge, Mass.: MIT Press, 1995.
[36] S. Kullback, Information Theory and Statistics. New York: Wiley, 1959.
[37] H. Lasnik and J. Uriagereka, A Course in GB Syntax: Lectures on Binding and Empty Categories. Cambridge, Mass.: MIT Press, 1988.
[38] S. Lawrence, S. Fong, and C.L. Giles, “Natural Language Grammatical Inference: A Comparison of Recurrent Neural Networks and Machine Learning Methods,” Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, S. Wermter, E. Riloff, and G. Scheler, eds., 1996.
[39] Y. Le Cun, “Efficient Learning and Second Order Methods,” Tutorial presented at Neural Information Processing Systems 5, 1993.
[40] L.R. Leerink and M. Jabri, “Learning the Past Tense of English Verbs Using Recurrent Neural Networks,” Proc. Australian Conf. Neural Networks, P. Bartlett, A. Burkitt, and R. Williamson, eds., pp. 222-226, 1996.
[41] B. MacWhinney, J. Leinbach, R. Taraban, and J. McDonald, “Language Learning: Cues or Rules?” J. Memory and Language, vol. 28, pp. 255-277, 1989.
[42] R. Miikkulainen and M. Dyer, “Encoding Input/Output Representations in Connectionist Cognitive Systems,” Proc. 1988 Connectionist Models Summer School, D.S. Touretzky, G.E. Hinton, and T.J. Sejnowski, eds., pp. 188-195, 1989.
[43] M.C. Mozer, “A Focused Backpropagation Algorithm for Temporal Pattern Recognition,” Complex Systems, vol. 3, no. 4, pp. 349-381, Aug. 1989.
[44] K.S. Narendra and K. Parthasarathy, “Identification and Control of Dynamical Systems Using Neural Networks,” IEEE Trans. Neural Networks, vol. 1, no. 1, pp. 4-27, Mar. 1990.
[45] C.W. Omlin and C.L. Giles, “Constructing Deterministic Finite-State Automata in Recurrent Neural Networks,” J. ACM, vol. 43, no. 6, pp. 937–972, 1996.
[46] C.W. Omlin and C.L. Giles, “Extraction of Rules from Discrete-Time Recurrent Neural Networks,” Neural Networks, vol. 9, no. 1, pp. 41-52, 1996.
[47] C.W. Omlin and C.L. Giles, “Rule Revision with Recurrent Neural Networks,” IEEE Trans. Knowledge and Data Eng., vol. 8, no. 1, pp. 183-188, 1996.
[48] F. Pereira and Y. Schabes, “Inside-Outside Re-Estimation from Partially Bracketed Corpora,” Proc. 30th Ann. Meeting ACL, pp. 128-135, 1992.
[49] D.M. Pesetsky, “Paths and Categories,” PhD thesis, MIT, 1982.
[50] J. Pollack, “The Induction of Dynamical Recognizers,” Machine Learning, vol. 7, nos. 2/3, pp. 227-252, 1991.
[51] D.E. Rumelhart and J.L. McClelland, “On Learning the Past Tenses of English Verbs,” Parallel Distributed Processing; Volume 2: Psychological and Biological Models, J.L. McClelland, D.E. Rumelhart, and the PDP Research Group, eds., Cambridge, Mass.: MIT Press, pp. 216–271, 1986.
[52] J.W. Shavlik, “A Framework for Combining Symbolic and Neural Learning,” Machine Learning, vol. 14, no. 3, pp. 321-331, 1994.
[53] H.T. Siegelmann, “Computation Beyond the Turing Limit,” Science, vol. 268, pp. 545-548, 1995.
[54] H.T. Siegelmann, B.G. Horne, and C.L. Giles, “Computational Capabilities of Recurrent NARX Neural Networks,” IEEE Trans. Systems, Man and Cybernetics-Part B, vol. 27, no. 2, p. 208, 1997.
[55] H.T. Siegelmann and E.D. Sontag, “On the Computational Power of Neural Nets,” J. Computer and System Sciences, vol. 50, no. 1, pp. 132-150, 1995.
[56] P. Simard, M.B. Ottaway, and D.H. Ballard, “Analysis of Recurrent Backpropagation,” Proc. 1988 Connectionist Models Summer School, D. Touretzky, G. Hinton, and T. Sejnowski, eds., pp. 103-112, 1989.
[57] S.A. Solla, E. Levin, and M. Fleisher, “Accelerated Learning in Layered Neural Networks,” Complex Systems, vol. 2, pp. 625-639, 1988.
[58] M.F. St. John and J.L. McClelland, “Learning and Applying Contextual Constraints in Sentence Comprehension,” Artificial Intelligence, vol. 46, pp. 217–257, 1990.
[59] A. Stolcke, “Learning Feature-Based Semantics with Simple Recurrent Networks,” Technical Report TR-90-015, Int'l Computer Science Inst., Berkeley, Calif., Apr. 1990.
[60] M. Tomita, “Dynamic Construction of Finite-State Automata from Examples Using Hill-Climbing,” Proc. Fourth Ann. Cognitive Science Conf., pp. 105-108, 1982.
[61] D.S. Touretzky, “Rules and Maps in Connectionist Symbol Processing,” Technical Report CMU-CS-89-158, Dept. of Computer Science, Carnegie Mellon Univ., Pittsburgh, Pa., 1989.
[62] D.S. Touretzky, “Towards a Connectionist Phonology: The 'Many Maps' Approach to Sequence Manipulation,” Proc. 11th Annual Conf. Cognitive Science Soc., pp. 188-195, 1989.
[63] A.C. Tsoi and A.D. Back, “Locally Recurrent Globally Feedforward Networks: A Critical Review of Architectures,” IEEE Trans. Neural Networks, vol. 5, no. 2, pp. 229-239, 1994.
[64] R.L. Watrous and G.M. Kuhn, “Induction of Finite State Languages Using Second-Order Recurrent Networks,” Advances in Neural Information Processing Systems 4, J.E. Moody, S.J. Hanson, and R.P Lippmann, eds., pp. 309-316, San Mateo, Calif.: Morgan Kaufmann, 1992.
[65] R.L. Watrous and G.M. Kuhn, “Induction of Finite-State Languages Using Second-Order Recurrent Networks,” Neural Computation, vol. 4, no. 3, p. 406, 1992.
[66] R.J. Williams and J. Peng, “An Efficient Gradient-Based Algorithm for On-Line Training of Recurrent Network Trajectories,” Neural Computation, vol. 2, no. 4, pp. 490-501, 1990.
[67] R.J. Williams and D. Zipser, “A Learning Algorithm for Continually Running Fully Recurrent Neural Networks,” Neural Computation, vol. 1, no. 2, pp. 270-280, 1989.
[68] Z. Zeng, R.M. Goodman, and P. Smyth, “Learning Finite State Machines with Self-Clustering Recurrent Networks,” Neural Computation, vol. 5, no. 6, pp. 976-990, 1993.
[69] S. Lawrence, I. Burns, A.D. Back, A.C. Tsoi, and C.L. Giles, “Neural Network Classification and Unequal Prior Classes,” Tricks of the Trade, G. Orr, K.-R. Müller, and R. Caruana, eds., pp. 299-314. Springer-Verlag, 1998.

Index Terms:
Recurrent neural networks, natural language processing, grammatical inference, government-and-binding theory, gradient descent, simulated annealing, principles-and-parameters framework, automata extraction.
Citation:
Steve Lawrence, C. Lee Giles, Sandiway Fong, "Natural Language Grammatical Inference with Recurrent Neural Networks," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 1, pp. 126-140, Jan.-Feb. 2000, doi:10.1109/69.842255