Issue No. 6 - November/December 2009 (vol. 15), pp. 1161-1168
Yanhua Chen , Wayne State University, Detroit, MI
Lijun Wang , Wayne State University, Detroit, MI
Ming Dong , Wayne State University, Detroit, MI
Jing Hua , Wayne State University, Detroit, MI
ABSTRACT
With the rapid growth of the World Wide Web and electronic information services, text corpora are becoming available on-line at an incredible rate. By displaying text data in a logical layout (e.g., color graphs), text visualization provides a direct way to observe documents as well as to understand the relationships between them. In this paper, we propose a novel technique, Exemplar-based Visualization (EV), to visualize an extremely large text corpus. Capitalizing on recent advances in matrix approximation and decomposition, EV presents a probabilistic multidimensional projection model in the low-rank text subspace with a sound objective function. The probability of each document with respect to the topics is obtained through iterative optimization and embedded into a low-dimensional space using parameter embedding. By selecting representative exemplars, we obtain a compact approximation of the data, which makes the visualization highly efficient and flexible. In addition, the selected exemplars neatly summarize the entire data set and greatly reduce the cognitive overload in the visualization, leading to an easier interpretation of large text corpora. Empirically, we demonstrate the superior performance of EV through extensive experiments on publicly available text data sets.
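To make the described pipeline concrete, the following is a minimal sketch (Python with NumPy) of the kind of workflow the abstract outlines; it is not the authors' EV implementation. Plain NMF multiplicative updates stand in for the paper's probabilistic low-rank model, PCA stands in for parameter embedding, exemplars are simply taken as the highest-probability document per topic, and the term-document matrix is random toy data.

    # Sketch only (not the authors' EV algorithm): estimate document-topic
    # proportions with NMF multiplicative updates, pick one exemplar document
    # per topic, and project documents to 2-D. PCA substitutes here for the
    # paper's parameter-embedding step; X is a random toy matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    n_docs, n_terms, n_topics = 200, 500, 5

    X = rng.random((n_docs, n_terms))             # toy term-frequency matrix (docs x terms)

    # NMF: X ~ W H, W holds per-document topic weights, H holds per-topic term weights.
    W = rng.random((n_docs, n_topics))
    H = rng.random((n_topics, n_terms))
    for _ in range(200):                          # iterative multiplicative updates
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)

    P = W / (W.sum(axis=1, keepdims=True) + 1e-9) # document-topic probabilities

    # Exemplars: the document with the highest probability for each topic.
    exemplars = P.argmax(axis=0)

    # 2-D layout of the topic-probability vectors (PCA in place of parameter embedding).
    P_centered = P - P.mean(axis=0)
    _, _, Vt = np.linalg.svd(P_centered, full_matrices=False)
    coords = P_centered @ Vt[:2].T                # (n_docs, 2) positions for plotting

    print("exemplar document per topic:", exemplars)
    print("first five 2-D coordinates:\n", coords[:5])

In practice, X would be a tf-idf term-document matrix, the 2-D coordinates would drive the visual layout, and the exemplar documents would be labeled as compact summaries of their topics.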
INDEX TERMS
Exemplar, large-scale document visualization, multidimensional projection.
CITATION
Yanhua Chen, Lijun Wang, Ming Dong, Jing Hua, "Exemplar-based Visualization of Large Document Corpus", IEEE Transactions on Visualization & Computer Graphics, vol. 15, no. 6, pp. 1161-1168, November/December 2009, doi:10.1109/TVCG.2009.140