
Issue No.02 - Feb. (2013 vol.25)

pp: 325-336

George Kollios, Boston University, Boston

Michalis Potamias, Smart Deals, Groupon

Evimaria Terzi, Boston University, Boston

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2011.243

ABSTRACT

We study the problem of clustering probabilistic graphs. Similar to the problem of clustering standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction (PPI) networks and discovering groups of users in affiliation networks. We extend the edit-distance-based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.
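The objective sketched in the abstract can be illustrated with a small example. The following is a minimal sketch, not the authors' implementation: it assumes each node pair (u, v) carries an existence probability, computes the expected number of edge edits needed to turn a sampled graph into a given cluster graph (by linearity of expectation), and applies a generic pivot-style correlation-clustering heuristic. The function names, the toy graph, and the 0.5 threshold are all illustrative assumptions.

```python
import random
from itertools import combinations


def expected_edit_distance(nodes, prob, clusters):
    """Expected number of edge edits to reach the cluster graph.

    prob maps frozenset({u, v}) to an edge probability (assumed input format);
    pairs that are absent from prob are treated as probability 0.
    """
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    cost = 0.0
    for u, v in combinations(nodes, 2):
        p = prob.get(frozenset((u, v)), 0.0)
        if label[u] == label[v]:
            cost += 1.0 - p  # edge must be added when it is missing
        else:
            cost += p        # edge must be removed when it is present
    return cost


def pivot_clustering(nodes, prob, rng):
    """Pivot heuristic: group a random pivot with nodes linked at p >= 0.5."""
    remaining = list(nodes)
    clusters = []
    while remaining:
        pivot = rng.choice(remaining)
        cluster = [pivot] + [
            v for v in remaining
            if v != pivot and prob.get(frozenset((pivot, v)), 0.0) >= 0.5
        ]
        clusters.append(cluster)
        remaining = [v for v in remaining if v not in cluster]
    return clusters


# Toy probabilistic graph (hypothetical data).
nodes = ["a", "b", "c", "d"]
prob = {
    frozenset(("a", "b")): 0.9,
    frozenset(("c", "d")): 0.8,
    frozenset(("b", "c")): 0.1,
}

found = pivot_clustering(nodes, prob, random.Random(0))
print(found, round(expected_edit_distance(nodes, prob, found), 3))
```

Note that the number of clusters is not passed in; it falls out of the pivot procedure, which mirrors the parameter-free property the abstract describes.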

INDEX TERMS

Probabilistic logic, Clustering algorithms, Approximation algorithms, Partitioning algorithms, Proteins, Data mining, Approximation methods, probabilistic databases, Uncertain data, probabilistic graphs, clustering algorithms

CITATION

George Kollios, Michalis Potamias, Evimaria Terzi, "Clustering Large Probabilistic Graphs",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 25, no. 2, pp. 325-336, Feb. 2013, doi:10.1109/TKDE.2011.243