Issue No. 10 - Oct. 2012 (vol. 24)
pp: 1848-1861
Odysseas Papapetrou, Technical University of Crete, Chania
Wolf Siberski, L3S Research Center, Hannover
Norbert Fuhr, University of Duisburg-Essen, Duisburg
ABSTRACT
Text clustering is an established technique for improving quality in information retrieval, for both centralized and distributed environments. However, traditional text clustering algorithms fail to scale in highly distributed environments, such as peer-to-peer networks. Our algorithm for peer-to-peer clustering achieves high scalability by using a probabilistic approach for assigning documents to clusters. It enables a peer to compare each of its documents with only a few selected clusters, without significant loss of clustering quality. The algorithm offers probabilistic guarantees for the correctness of each document assignment to a cluster. Extensive experimental evaluation with up to 1 million peers and 1 million documents demonstrates the scalability and effectiveness of the algorithm.
INDEX TERMS
Clustering algorithms, Peer to peer computing, Probabilistic logic, Frequency estimation, Indexing, Computational modeling, Text clustering, Distributed clustering
CITATION
Odysseas Papapetrou, Wolf Siberski, Norbert Fuhr, "Decentralized Probabilistic Text Clustering", IEEE Transactions on Knowledge & Data Engineering, vol.24, no. 10, pp. 1848-1861, Oct. 2012, doi:10.1109/TKDE.2011.120