Issue No. 07 - July (2011 vol. 23)
ISSN: 1041-4347
pp: 1090-1102
David Camacho, Universidad Autonoma de Madrid, Madrid
Ana Granados, Universidad Autonoma de Madrid, Madrid
Francisco de Borja Rodríguez, Universidad Autonoma de Madrid, Madrid
Manuel Cebrián, Massachusetts Institute of Technology, Cambridge
ABSTRACT
Compression distances have been widely used in knowledge discovery and data mining. They are parameter-free, widely applicable, and very effective in several domains. However, little has been done to interpret their results or to explain their behavior. In this paper, we take a step toward understanding compression distances by performing an experimental evaluation of the impact of several kinds of information distortion on compression-based text clustering. We show that progressively removing words, in such a way that the complexity of a document is slowly reduced, improves the accuracy of compression-based text clustering. In fact, we show how the clustering obtained from nondistorted text can be improved by means of annealing text distortion. The experimental results are consistent across different data sets and across compression algorithms belonging to the most important compression families: Lempel-Ziv, statistical, and block-sorting.
INDEX TERMS
Information distortion, data compression, normalized compression distance, clustering by compression, Kolmogorov complexity.
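For readers unfamiliar with the normalized compression distance named in the index terms, the following is a minimal sketch of its standard definition, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The choice of Python's zlib (Lempel-Ziv family) and bz2 (block-sorting family) as compressors is an assumption made for illustration, and the word-removal helper is a hypothetical stand-in for a distortion step, not the authors' procedure.

```python
import bz2
import zlib
from collections import Counter


def ncd(x: bytes, y: bytes, compress=zlib.compress) -> float:
    """Normalized compression distance under the given compressor.

    `compress` is any function mapping bytes to compressed bytes,
    e.g. zlib.compress (Lempel-Ziv) or bz2.compress (block-sorting).
    """
    cx = len(compress(x))
    cy = len(compress(y))
    cxy = len(compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def remove_most_frequent_words(text: str, n_words: int) -> str:
    """Hypothetical distortion (an assumption, not the paper's method):
    drop every occurrence of the n_words most frequent words."""
    words = text.split()
    frequent = {w for w, _ in Counter(words).most_common(n_words)}
    return " ".join(w for w in words if w not in frequent)


if __name__ == "__main__":
    doc_a = b"the quick brown fox jumps over the lazy dog " * 50
    doc_b = bytes(range(256)) * 10
    # A document should be closer to itself than to unrelated data.
    print(ncd(doc_a, doc_a), ncd(doc_a, doc_b))
    print(ncd(doc_a, doc_b, compress=bz2.compress))
```

A pairwise matrix of such distances could then be passed to any standard clustering method; which clustering algorithm the paper uses is not stated in this abstract.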
CITATION
David Camacho, Ana Granados, Francisco de Borja Rodríguez, Manuel Cebrián, "Reducing the Loss of Information through Annealing Text Distortion", IEEE Transactions on Knowledge & Data Engineering, vol. 23, no. 7, pp. 1090-1102, July 2011, doi:10.1109/TKDE.2010.173