Reducing the Loss of Information through Annealing Text Distortion
July 2011 (vol. 23 no. 7)
pp. 1090-1102
Ana Granados, Universidad Autonoma de Madrid, Madrid
Manuel Cebrián, Massachusetts Institute of Technology, Cambridge
David Camacho, Universidad Autonoma de Madrid, Madrid
Francisco de Borja Rodríguez, Universidad Autonoma de Madrid, Madrid
Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show that progressively removing words, in such a way that the complexity of a document is slowly reduced, helps compression-based text clustering and improves its accuracy. In fact, we show that the clustering obtained from nondistorted text can be improved upon by means of annealing text distortion. The experimental results are consistent across different data sets and across compression algorithms belonging to the most important compression families: Lempel-Ziv, Statistical, and Block-Sorting.
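
As a rough illustration of the two ideas the abstract combines, the sketch below computes the normalized compression distance (NCD) named in the Index Terms, using its usual definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. It also applies one simple form of word removal. The function names `ncd` and `distort`, the choice of Python standard-library compressors, and the replacement of removed words with asterisks are assumptions made for this example and should not be read as the paper's exact procedure. The standard library has no compressor from the statistical (PPM) family, so only block-sorting (bz2) and Lempel-Ziv (zlib, lzma) stand-ins appear here.

```python
import bz2
import lzma
import re
import zlib


def ncd(x: bytes, y: bytes, compress=bz2.compress) -> float:
    """Normalized compression distance between two byte strings.

    `compress` is any function mapping bytes to compressed bytes;
    bz2.compress, zlib.compress and lzma.compress all fit.
    """
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))  # compressed size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


def distort(text: str, words: set) -> str:
    """Mask every word found in `words` (lowercase) with asterisks.

    Hypothetical distortion scheme: the masked word keeps its length,
    so the layout of the text is preserved while its content is removed.
    """
    def mask(match):
        word = match.group(0)
        return "*" * len(word) if word.lower() in words else word

    return re.sub(r"[A-Za-z]+", mask, text)


if __name__ == "__main__":
    doc_a = "the cat sat on the mat and the dog sat on the rug " * 20
    doc_b = "compression distances are used in data mining tasks " * 20
    removed = {"the", "and", "on", "in", "are"}  # illustrative word list

    for name, fn in [("bz2", bz2.compress),
                     ("zlib", zlib.compress),
                     ("lzma", lzma.compress)]:
        plain = ncd(doc_a.encode(), doc_b.encode(), fn)
        masked = ncd(distort(doc_a, removed).encode(),
                     distort(doc_b, removed).encode(), fn)
        print(f"{name}: NCD plain={plain:.3f} distorted={masked:.3f}")
```

A clustering experiment in this style would compute `ncd` for every pair of documents, before and after distortion, and pass the resulting distance matrix to a clustering algorithm. Which words to remove, and how many, is the experimental variable the abstract describes.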

Index Terms:
Information distortion, data compression, normalized compression distance, clustering by compression, Kolmogorov complexity.
Ana Granados, Manuel Cebrián, David Camacho, Francisco de Borja Rodríguez, "Reducing the Loss of Information through Annealing Text Distortion," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 7, pp. 1090-1102, July 2011, doi:10.1109/TKDE.2010.173