Generalization Ability of Folding Networks
March/April 2001 (vol. 13 no. 2)
pp. 196-206

Abstract—The information-theoretical learnability of folding networks, a very successful approach for processing tree-structured inputs, is examined. We find bounds on the VC, pseudo-, and fat-shattering dimension of folding networks with various activation functions. As a consequence, valid generalization of folding networks can be guaranteed. However, distribution-independent bounds on the generalization error cannot exist in principle. We propose two approaches that take the specific distribution into account and allow us to derive explicit bounds on the deviation of the empirical error from the real error of a learning algorithm: the first approach requires the probability of large trees to be limited a priori, and the second approach deals with situations where the maximum input height in a concrete learning example is restricted.
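The core idea behind folding networks can be illustrated in a few lines: one shared encoding network is applied recursively, bottom-up, to a labeled tree, compressing each node's label together with the fixed-size codes of its subtrees into a single fixed-size state; the root code is then mapped to the output. The following is a minimal sketch of that recursive encoding, not the article's own implementation; all names, dimensions, and the tanh activation are illustrative assumptions.

```python
import numpy as np

# Folding-network sketch: trees are (label, children) pairs, encoded by
#   h(v) = tanh(W_label @ label(v) + sum_i W_child[i] @ h(child_i(v)) + b)
# with one shared set of weights for every node, and a linear readout at the root.
rng = np.random.default_rng(0)
LABEL_DIM, STATE_DIM, K = 3, 4, 2  # label size, state size, maximum arity (assumed)

W_label = rng.normal(size=(STATE_DIM, LABEL_DIM)) * 0.1
W_child = rng.normal(size=(K, STATE_DIM, STATE_DIM)) * 0.1
b = np.zeros(STATE_DIM)
w_out = rng.normal(size=STATE_DIM) * 0.1

def encode(tree):
    """Recursively fold a (label, children) tree into a fixed-size state vector."""
    label, children = tree
    s = W_label @ np.asarray(label, dtype=float) + b
    for i, child in enumerate(children[:K]):
        s = s + W_child[i] @ encode(child)      # same weights at every node
    return np.tanh(s)                           # sigmoidal activation

def predict(tree):
    """Scalar output computed from the root encoding."""
    return float(w_out @ encode(tree))

# A small example tree of height 2 with three-dimensional node labels.
leaf = ([1.0, 0.0, 0.0], [])
tree = ([0.0, 1.0, 0.0], [leaf, ([0.0, 0.0, 1.0], [leaf, leaf])])
y = predict(tree)  # one real-valued prediction for the whole structured input
```

Because the encoding weights are reused at every node, the number of parameters is independent of the tree's size; the input height only controls how often the map is composed, which is exactly the quantity the article's distribution-dependent bounds restrict.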

[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler, “Scale-Sensitive Dimensions, Uniform Convergence, and Learnability,” J. ACM, vol. 44, no. 4, 1997.
[2] M. Anthony and J. Shawe-Taylor, “A Sufficient Condition for Polynomial Distribution-Dependent Learnability,” Discrete Applied Math., vol. 77, 1997.
[3] M. Anthony, “Probabilistic Analysis of Learning in Artificial Neural Networks: The PAC Model and Its Variants,” Neural Computing Surveys, vol. 1, 1997.
[4] P.L. Bartlett, “For Valid Generalization, the Size of the Weights is More Important Than the Size of the Network,” IEEE Trans. Information Theory, vol. 44, no. 2, 1998.
[5] Y. Bengio, P. Simard, and P. Frasconi, “Learning Long-Term Dependencies with Gradient Descent is Difficult,” IEEE Trans. Neural Networks, vol. 5, no. 2, 1994.
[6] Y. Bengio and P. Frasconi, “Credit Assignment through Time: Alternatives to Backpropagation,” Advances in Neural Information Processing Systems, J. Cowan, G. Tesauro, and J. Alspector, eds., vol. 5, 1994.
[7] F. Costa, P. Frasconi, and G. Soda, “A Topological Transformation for Hidden Recursive Models,” European Symp. Artificial Neural Networks, M. Verleysen, ed., 1999.
[8] B. Dasgupta and E.D. Sontag, “Sample Complexity for Learning Recurrent Perceptron Mappings,” IEEE Trans. Information Theory, vol. 42, 1996.
[9] P. Frasconi, M. Gori, and A. Sperduti, “A General Framework for Adaptive Processing of Data Structures,” IEEE Trans. Neural Networks, vol. 9, no. 5, pp. 768–786, 1998.
[10] C.L. Giles and M. Gori, eds., Adaptive Processing of Sequences and Data Structures. Springer, 1998.
[11] C. Goller and A. Küchler, “Learning Task-Dependent Distributed Representations by Backpropagation through Structure,” Proc. IEEE Conf. Neural Networks, 1996.
[12] M. Gori, M. Mozer, A.C. Tsoi, and R.L. Watrous, Neurocomputing, special issue on recurrent neural networks for sequence processing, vol. 15, nos. 3 and 4, 1997.
[13] L. Gurvits and P. Koiran, “Approximation and Learning of Convex Superpositions,” J. Computer and System Sciences, vol. 55, no. 1, 1997.
[14] B. Hammer, “On the Learnability of Recursive Data,” Math. of Control, Signals, and Systems, vol. 12, 1999.
[15] B. Hammer, “On the Generalization of Elman Networks,” Artificial Neural Networks—ICANN '97, W. Gerstner, A. Germond, M. Hasler, and J.-D. Nicaud, eds., 1997.
[16] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, 1997.
[17] M. Karpinski and A. Macintyre, “Polynomial Bounds for the VC Dimension of Sigmoidal Neural Networks,” J. Computer and System Sciences, vol. 54, 1997.
[18] M.J. Kearns and R.E. Schapire, “Efficient Distribution-Free Learning of Probabilistic Concepts,” J. Computer and System Sciences, vol. 48, 1994.
[19] P. Koiran and E.D. Sontag, “Vapnik-Chervonenkis Dimension of Recurrent Neural Networks,” Discrete Applied Math., vol. 86, no. 1, 1998.
[20] A. Küchler, “On the Correspondence Between Neural Folding Architectures and Tree Automata,” tech. report, Univ. of Ulm, 1998.
[21] A. Küchler and C. Goller, “Inductive Learning in Symbolic Domains Using Structure-Driven Neural Networks,” KI–96: Advances in Artificial Intelligence, G. Görz and S. Hölldobler, eds., 1996.
[22] M. Mozer, “Neural Net Architectures for Temporal Sequence Processing,” Predicting the Future and Understanding the Past, A. Weigend and N. Gershenfeld, eds., 1993.
[23] C.W. Omlin and C.L. Giles, “Constructing Deterministic Finite-State Automata in Recurrent Neural Networks,” J. ACM, vol. 43, no. 6, pp. 937–972, 1996.
[24] J.B. Pollack, “Recursive Distributed Representations,” Artificial Intelligence, vol. 46, nos. 1-2, pp. 77–106, 1990.
[25] T. Schmitt and C. Goller, “Relating Chemical Structure to Activity with the Structure Processing Neural Folding Architecture,” Eng. Applications of Neural Networks, 1998.
[26] S. Schulz, A. Küchler, and C. Goller, “Some Experiments on the Applicability of Folding Architectures to Guide Theorem Proving,” Proc. 10th Int'l Florida Artificial Intelligence Research Soc. (FLAIRS) Conf., 1997.
[27] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony, “Structural Risk Minimization over Data-Dependent Hierarchies,” IEEE Trans. Information Theory, vol. 44, no. 5, pp. 1926–1940, 1998.
[28] A. Sperduti, “Labeling RAAM,” Connection Science, vol. 6, no. 4, 1994.
[29] M. Vidyasagar, A Theory of Learning and Generalization. Springer, 1997.
[30] P. Werbos, The Roots of Backpropagation. Wiley, 1994.

Index Terms:
Recurrent neural networks, folding networks, computational learning theory, VC dimension, UCED property, luckiness function.
Barbara Hammer, "Generalization Ability of Folding Networks," IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 2, pp. 196-206, March-April 2001, doi:10.1109/69.917560