Exemplar-based Visualization of Large Document Corpus (InfoVis2009-1115)
November/December 2009 (vol. 15 no. 6)
pp. 1161-1168
Yanhua Chen, Wayne State University, Detroit, MI
Lijun Wang, Wayne State University, Detroit, MI
Ming Dong, Wayne State University, Detroit, MI
Jing Hua, Wayne State University, Detroit, MI
With the rapid growth of the World Wide Web and electronic information services, text corpora are becoming available online at an incredible rate. By displaying text data in a logical layout (e.g., color graphs), text visualization presents a direct way to observe the documents as well as understand the relationships between them. In this paper, we propose a novel technique, Exemplar-based Visualization (EV), to visualize an extremely large text corpus. Capitalizing on recent advances in matrix approximation and decomposition, EV presents a probabilistic multidimensional projection model in the low-rank text subspace with a sound objective function. The probabilistic proportion of each document over the topics is obtained through iterative optimization and embedded into a low-dimensional space using parameter embedding. By selecting representative exemplars, we obtain a compact approximation of the data. This makes the visualization highly efficient and flexible. In addition, the selected exemplars neatly summarize the entire data set and greatly reduce the cognitive overload in the visualization, leading to an easier interpretation of large text corpora. Empirically, we demonstrate the superior performance of EV through extensive experiments performed on publicly available text data sets.
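As a rough illustration of the pipeline the abstract outlines (low-rank decomposition, per-document topic proportions, exemplar selection), the following is a minimal sketch and not the authors' algorithm. It assumes plain non-negative matrix factorization with multiplicative updates as a stand-in for the paper's decomposition, and it assumes the exemplar of a topic is simply the document with the largest proportion on that topic. The function names `nmf`, `topic_proportions`, and `pick_exemplars`, and the toy matrix `X`, are hypothetical. The low-dimensional embedding step is omitted.

```python
import numpy as np

def nmf(X, k, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative (docs x terms) matrix as X ~ W @ H.

    Uses standard multiplicative updates for the Frobenius objective;
    this is an assumed stand-in, not the decomposition used in the paper.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps   # document-topic weights
    H = rng.random((k, m)) + eps   # topic-term weights
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

def topic_proportions(W, eps=1e-9):
    """Normalize each document's topic weights so they sum to one."""
    return W / (W.sum(axis=1, keepdims=True) + eps)

def pick_exemplars(P, per_topic=1):
    """For each topic, return indices of the documents with the largest
    proportion on that topic (assumed exemplar criterion)."""
    order = np.argsort(-P, axis=0)      # shape (n_docs, n_topics)
    return order[:per_topic].T          # shape (n_topics, per_topic)

# Toy term-count matrix: 6 documents, 6 terms, two loosely separated groups.
X = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [1, 2, 3, 0, 0, 0],
    [0, 0, 0, 3, 2, 1],
    [0, 0, 0, 2, 3, 1],
    [0, 0, 0, 1, 2, 3],
], dtype=float)

W, H = nmf(X, k=2)
P = topic_proportions(W)
exemplars = pick_exemplars(P, per_topic=1)
print(P.round(2))
print(exemplars)
```

Each row of `P` can be read as a document's proportions over the topics, and only the rows listed in `exemplars` would need to be drawn to summarize the collection under these assumptions.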

Index Terms:
Exemplar, large-scale document visualization, multidimensional projection.
Citation:
Yanhua Chen, Lijun Wang, Ming Dong, Jing Hua, "Exemplar-based Visualization of Large Document Corpus (InfoVis2009-1115)," IEEE Transactions on Visualization and Computer Graphics, vol. 15, no. 6, pp. 1161-1168, Nov.-Dec. 2009, doi:10.1109/TVCG.2009.140