2008 10th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing
Clustering Heterogeneous Web Data using Clustering by Compression. Cluster Validity
Timisoara, Romania
September 26-September 29
ISBN: 978-0-7695-3523-4
The expansive growth of the Internet has produced a vast quantity of data that is unstructured compared to the contents of a conventional database. Applying clustering to the World Wide Web is essential for extracting structured information from this sea of data. In this paper, we test the results of a new clustering technique, clustering by compression, when applied to heterogeneous sets of data. The clustering by compression procedure is based on a parameter-free, universal similarity distance, the normalized compression distance (NCD), computed from the lengths of compressed data files (singly and in pairwise concatenation). To validate the results, we calculate several cluster quality indices. If the values obtained indicate a high-quality clustering, we plan to include the clustering by compression technique in a framework for clustering heterogeneous web objects.
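As an illustration of how the NCD can be computed from compressed lengths, the following is a minimal sketch using the standard definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length in bytes. The choice of zlib as the compressor and the helper names `compressed_len`, `ncd`, and `ncd_matrix` are assumptions made for this example; the abstract does not state which compressor or implementation the authors used.

```python
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Uses the compressed length of each input alone and of their
    concatenation, as described in the abstract.
    """
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def ncd_matrix(items: list[bytes]) -> list[list[float]]:
    """Pairwise distance matrix; the diagonal is fixed at 0 and the matrix
    is filled symmetrically from the upper triangle (a simplification)."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = ncd(items[i], items[j])
    return matrix
```

The resulting matrix can be passed to any distance-based clustering method. With a real compressor the values are only approximations of the ideal distance, so NCD(x, x) is typically small but not exactly zero.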
Index Terms:
clustering, heterogeneous data, cluster validity
Citation:
Alexandra Cernian, Dorin Carstoiu, Adriana Olteanu, "Clustering Heterogeneous Web Data using Clustering by Compression. Cluster Validity," synasc, pp.123-126, 2008 10th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing, 2008