Issue No.08 - Aug. (2013 vol.35)
pp: 1902-1914
Ian J. Goodfellow , Université de Montréal, Montréal
Aaron Courville , Université de Montréal, Montréal
Yoshua Bengio , Université de Montréal, Montréal
We describe the use of two spike-and-slab models for modeling real-valued data, with an emphasis on their applications to object recognition. The first model, which we call spike-and-slab sparse coding (S3C), is a preexisting model for which we introduce a faster approximate inference algorithm. We introduce a deep variant of S3C, which we call the partially directed deep Boltzmann machine (PD-DBM), and extend our S3C inference algorithm for use on this model. We describe learning procedures for each. We demonstrate that our inference procedure for S3C enables scaling the model to unprecedentedly large problem sizes, and demonstrate that using S3C as a feature extractor results in very good object recognition performance, particularly when the number of labeled examples is low. We show that the PD-DBM generates better samples than its shallow counterpart, and that unlike DBMs or DBNs, the PD-DBM may be trained successfully without greedy layerwise training.
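The generative process of a spike-and-slab sparse coding model like the one the abstract describes can be sketched in a few lines: each latent unit pairs a binary "spike" gate with a real-valued Gaussian "slab" coefficient, and only the units whose spike is on contribute to the visible vector. The sketch below is a minimal illustration under assumed parameter names (`W`, `b`, `mu`, `sigma_s`, `sigma_v` are hypothetical, not from the paper); it shows ancestral sampling only, not the variational inference procedure the paper contributes.

```python
import numpy as np

def sample_s3c(W, b, mu, sigma_s, sigma_v, rng):
    """Draw one visible vector from a spike-and-slab sparse coding model.

    Each latent unit i has a binary spike h_i ~ Bernoulli(sigmoid(b_i))
    and a real-valued slab s_i ~ N(mu_i, sigma_s_i^2); the visible vector
    is then v ~ N(W (h * s), sigma_v^2 I).
    """
    h = rng.random(b.shape) < 1.0 / (1.0 + np.exp(-b))  # spike variables (gates)
    s = rng.normal(mu, sigma_s)                         # slab variables (magnitudes)
    mean_v = W @ (h * s)                                # only gated-on units contribute
    v = rng.normal(mean_v, sigma_v)                     # Gaussian visible units
    return v, h, s

# Toy dimensions: 5 visible units, 8 latent spike-and-slab pairs.
rng = np.random.default_rng(0)
D, N = 5, 8
W = rng.normal(0.0, 0.1, size=(D, N))
v, h, s = sample_s3c(W, b=np.full(N, -2.0), mu=np.zeros(N),
                     sigma_s=np.ones(N), sigma_v=0.1, rng=rng)
```

Setting the spike biases `b` negative (here -2.0) makes most gates inactive, which is what produces the sparse codes that make these models useful as feature extractors.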
Encoding, feature extraction, data models, training, approximation methods, vectors, slabs, computer vision, neural nets, pattern recognition
Ian J. Goodfellow, Aaron Courville, Yoshua Bengio, "Scaling Up Spike-and-Slab Models for Unsupervised Feature Learning", IEEE Transactions on Pattern Analysis & Machine Intelligence, vol.35, no. 8, pp. 1902-1914, Aug. 2013, doi:10.1109/TPAMI.2012.273
[1] Y. Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, "Greedy Layer-Wise Training of Deep Networks," Proc. Advances in Neural Information Processing Systems 19, B. Schölkopf, J. Platt, and T. Hoffman, eds., pp. 153-160, 2007.
[3] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, "Theano: A CPU and GPU Math Expression Compiler," Proc. Python for Scientific Computing Conf., 2010.
[4] A. Coates and A.Y. Ng, "The Importance of Encoding versus Training with Sparse Coding and Vector Quantization," Proc. Int'l Conf. Machine Learning, 2011.
[5] A. Coates, H. Lee, and A.Y. Ng, "An Analysis of Single-Layer Networks in Unsupervised Feature Learning," Proc. 14th Int'l Conf. Artificial Intelligence and Statistics, 2011.
[6] A. Courville, J. Bergstra, and Y. Bengio, "A Spike and Slab Restricted Boltzmann Machine," Proc. 14th Int'l Conf. Artificial Intelligence and Statistics, 2011.
[7] A. Courville, J. Bergstra, and Y. Bengio, "Unsupervised Models of Images by Spike-and-Slab RBMs," Proc. 28th Int'l Conf. Machine Learning, 2011.
[8] L. Deng, M. Seltzer, D. Yu, A. Acero, A. Mohamed, and G. Hinton, "Binary Coding of Speech Spectrograms Using a Deep Auto-Encoder," Proc. Interspeech '10, 2010.
[9] G. Desjardins, A.C. Courville, and Y. Bengio, "On Training Deep Boltzmann Machines," CoRR, abs/1203.4416, 2012.
[10] S. Douglas, S.-I. Amari, and S.-Y. Kung, "On Gradient Adaptation with Unit-Norm Constraints," IEEE Trans. Signal Processing, vol. 48, no. 6, pp. 1843-1847, June 2000.
[11] P. Garrigues and B. Olshausen, "Learning Horizontal Connections in a Sparse Coding Model of Natural Images," Proc. Neural Information Processing Systems, pp. 505-512, 2008.
[12] G.E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence," Technical Report GCNU TR 2000-004, Gatsby Unit, Univ. College London, 2000.
[13] G.E. Hinton, "A Practical Guide to Training Restricted Boltzmann Machines," Technical Report UTML TR 2010-003, Dept. of Computer Science, Univ. of Toronto, 2010.
[14] G.E. Hinton, S. Osindero, and Y. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, pp. 1527-1554, 2006.
[15] A. Hyvärinen, J. Hurri, and P.O. Hoyer, Natural Image Statistics: A Probabilistic Approach to Early Computational Vision. Springer-Verlag, 2009.
[16] Y. Jia and C. Huang, "Beyond Spatial Pyramids: Receptive Field Learning for Pooled Image Features," Proc. Neural Information Processing Systems Workshop Deep Learning and Unsupervised Feature Learning, 2011.
[17] K. Kavukcuoglu, P. Sermanet, Y.-L. Boureau, K. Gregor, M. Mathieu, and Y. LeCun, "Learning Convolutional Feature Hierarchies for Visual Recognition," Proc. Neural Information Processing Systems, 2010.
[18] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[19] A. Krizhevsky and G. Hinton, "Learning Multiple Layers of Features from Tiny Images," Technical Report, Univ. of Toronto, 2009.
[20] Q.V. Le, A. Karpenko, J. Ngiam, and A.Y. Ng, "ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning," Proc. Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, eds., pp. 1017-1025, 2011.
[21] Q.V. Le, M. Ranzato, R. Salakhutdinov, A. Ng, and J. Tenenbaum, Proc. Neural Information Processing Systems Workshop Challenges in Learning Hierarchical Models: Transfer Learning and Optimization, 2011.
[22] N. Le Roux and Y. Bengio, "Representational Power of Restricted Boltzmann Machines and Deep Belief Networks," Neural Computation, vol. 20, no. 6, pp. 1631-1649, 2008.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278-2324, Nov. 1998.
[24] J. Lücke and A.-S. Sheikh, "A Closed-Form EM Algorithm for Sparse Coding," arXiv:1105.2493, 2011.
[25] T.J. Mitchell and J.J. Beauchamp, "Bayesian Variable Selection in Linear Regression," J. Am. Statistical Assoc., vol. 83, no. 404, pp. 1023-1032, 1988.
[26] S. Mohamed, K. Heller, and Z. Ghahramani, "Bayesian and l1 Approaches to Sparse Unsupervised Learning," Proc. Int'l Conf. Machine Learning, 2012.
[27] G. Montavon and K.-R. Müller, "Learning Feature Hierarchies with Centered Deep Boltzmann Machines," CoRR, abs/1203.4416, 2012.
[28] B.A. Olshausen and D.J. Field, "Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1?" Vision Research, vol. 37, pp. 3311-3325, 1997.
[29] B. Pearlmutter, "Fast Exact Multiplication by the Hessian," Neural Computation, vol. 6, no. 1, pp. 147-160, 1994.
[30] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng, "Self-Taught Learning: Transfer Learning from Unlabeled Data," Proc. Int'l Conf. Machine Learning, Z. Ghahramani, ed., pp. 759-766, 2007.
[31] R. Salakhutdinov and G. Hinton, "Deep Boltzmann Machines," Proc. Int'l Conf. Artificial Intelligence and Statistics, 2009.
[32] L.K. Saul and M.I. Jordan, "Exploiting Tractable Substructures in Intractable Networks," Proc. Advances in Neural Information Processing Systems, 1996.
[33] N.N. Schraudolph, "Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent," Neural Computation, vol. 14, no. 7, pp. 1723-1738, 2002.
[34] P. Smolensky, "Information Processing in Dynamical Systems: Foundations of Harmony Theory," Parallel Distributed Processing, vol. 1, chapter 6, D.E. Rumelhart and J.L. McClelland, eds., pp. 194-281, MIT Press, 1986.
[35] T. Tieleman, "Training Restricted Boltzmann Machines Using Approximations to the Likelihood Gradient," Proc. 25th Int'l Conf. Machine Learning, W.W. Cohen, A. McCallum, and S.T. Roweis, eds., pp. 1064-1071, 2008.
[36] M.K. Titsias and M. Lázaro-Gredilla, "Spike and Slab Variational Inference for Multi-Task and Multiple Kernel Learning," Proc. Advances in Neural Information Processing Systems, 2011.
[37] D. Titterington, A. Smith, and U. Makov, Statistical Analysis of Finite Mixture Distributions. Wiley, 1985.
[38] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders," Proc. Int'l Conf. Machine Learning, 2008.
[39] D. Warde-Farley, I. Goodfellow, P. Lamblin, G. Desjardins, F. Bastien, and Y. Bengio, "Pylearn2," 2011.
[40] L. Younes, "On the Convergence of Markovian Stochastic Algorithms with Rapidly Decreasing Ergodicity Rates," Stochastics and Stochastic Reports, pp. 177-228, 1998.
[41] K. Yu, Y. Lin, and J. Lafferty, "Learning Image Representations from the Pixel Level via Hierarchical Sparse Coding," Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011.
[42] M. Zeiler, G. Taylor, and R. Fergus, "Adaptive Deconvolutional Networks for Mid and High Level Feature Learning," Proc. Int'l Conf. Machine Learning, 2011.
[43] M. Zhou, H. Chen, J.W. Paisley, L. Ren, G. Sapiro, and L. Carin, "Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations," Proc. Advances in Neural Information Processing Systems, pp. 2295-2303, 2009.