Issue No.05 - May (2012 vol.24)

pp: 952-960

Alexander Ihler, University of California, Irvine

David Newman, University of California, Irvine

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.29

ABSTRACT

Latent Dirichlet allocation (LDA) is a popular algorithm for discovering semantic structure in large collections of text or other data. Although its complexity is linear in the data size, its use on increasingly massive collections has created considerable interest in parallel implementations. “Approximate distributed” LDA, or AD-LDA, approximates the popular collapsed Gibbs sampling algorithm for LDA models while running on a distributed architecture. Although this algorithm often appears to perform well in practice, its quality is not well understood theoretically or easily assessed on new data. In this work, we theoretically justify the approximation, and modify AD-LDA to track an error bound on performance. Specifically, we upper bound the probability of making a sampling error at each step of the algorithm (compared to an exact, sequential Gibbs sampler), given the samples drawn thus far. We show empirically that our bound is sufficiently tight to give a meaningful and intuitive measure of approximation error in AD-LDA, allowing the user to track the tradeoff between accuracy and efficiency while executing in parallel.
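As a rough illustration of the approximation the abstract refers to, the following is a minimal sketch of one AD-LDA iteration as it is commonly described: each processor runs a collapsed Gibbs sweep over its own documents against a private copy of the word-topic counts, and the copies are merged afterwards. It simulates the processors sequentially and does not implement the error bound developed in the paper. All function names, the data layout, and the hyperparameters `alpha` and `beta` are assumptions made for this example.

```python
import random


def init_counts(docs, K, V, rng):
    """Randomly assign a topic to every token and build the count tables."""
    z = [[rng.randrange(K) for _ in words] for words in docs]
    n_dk = [[0] * K for _ in docs]        # document-topic counts
    n_wk = [[0] * K for _ in range(V)]    # word-topic counts
    n_k = [0] * K                         # tokens per topic
    for d, words in enumerate(docs):
        for i, w in enumerate(words):
            k = z[d][i]
            n_dk[d][k] += 1
            n_wk[w][k] += 1
            n_k[k] += 1
    return z, n_dk, n_wk, n_k


def gibbs_sweep(part, z, n_dk, n_wk, n_k, alpha, beta, V, rng):
    """One collapsed Gibbs sweep over the (doc_id, words) pairs in `part`."""
    K = len(n_k)
    for d, words in part:
        for i, w in enumerate(words):
            k = z[d][i]
            # Remove the token's current assignment from the counts.
            n_dk[d][k] -= 1
            n_wk[w][k] -= 1
            n_k[k] -= 1
            # Unnormalized conditional over topics for this token.
            weights = [
                (n_dk[d][t] + alpha) * (n_wk[w][t] + beta) / (n_k[t] + V * beta)
                for t in range(K)
            ]
            k = rng.choices(range(K), weights=weights)[0]
            z[d][i] = k
            n_dk[d][k] += 1
            n_wk[w][k] += 1
            n_k[k] += 1


def adlda_iteration(partitions, z, n_dk, n_wk, n_k, alpha, beta, V, rng):
    """One AD-LDA iteration; partitions are processed one after another here."""
    K = len(n_k)
    local_tables = []
    for part in partitions:
        # Each "processor" samples against a stale private copy of the
        # global word-topic counts; this is the source of the approximation.
        local_wk = [row[:] for row in n_wk]
        local_k = n_k[:]
        gibbs_sweep(part, z, n_dk, local_wk, local_k, alpha, beta, V, rng)
        local_tables.append(local_wk)
    # Merge: add every processor's change relative to the old global counts.
    merged = [
        [n_wk[w][t] + sum(lt[w][t] - n_wk[w][t] for lt in local_tables)
         for t in range(K)]
        for w in range(V)
    ]
    n_wk[:] = merged
    n_k[:] = [sum(n_wk[w][t] for w in range(V)) for t in range(K)]
```

Because documents are partitioned disjointly, the document-topic counts never conflict between processors; only the shared word-topic counts are stale during a sweep, and the merge step restores counts that agree with the current topic assignments.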

INDEX TERMS

Data mining, topic model, parallel processing, error analysis.

CITATION

Alexander Ihler, David Newman, "Understanding Errors in Approximate Distributed Latent Dirichlet Allocation",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 24, no. 5, pp. 952-960, May 2012, doi:10.1109/TKDE.2011.29