Issue No.04 - April (2011 vol.23)
pp: 481-495
Marco Saerens, Université catholique de Louvain, Belgium
Luh Yen, Université catholique de Louvain, Belgium
ABSTRACT
This work introduces a link analysis procedure for discovering relationships in a relational database or a graph, generalizing both simple and multiple correspondence analysis. It is based on a random walk model through the database defining a Markov chain having as many states as elements in the database. Suppose we are interested in analyzing the relationships between some elements (or records) contained in two different tables of the relational database. To this end, in a first step, a reduced, much smaller, Markov chain containing only the elements of interest and preserving the main characteristics of the initial chain, is extracted by stochastic complementation [41]. This reduced chain is then analyzed by projecting jointly the elements of interest in the diffusion map subspace [42] and visualizing the results. This two-step procedure reduces to simple correspondence analysis when only two tables are defined, and to multiple correspondence analysis when the database takes the form of a simple star-schema. On the other hand, a kernel version of the diffusion map distance, generalizing the basic diffusion map distance to directed graphs, is also introduced and the links with spectral clustering are discussed. Several data sets are analyzed by using the proposed methodology, showing the usefulness of the technique for extracting relationships in relational databases or graphs.
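The two-step procedure described in the abstract can be illustrated with a minimal sketch. It assumes an undirected toy graph (so the random walk is reversible and the eigendecomposition is real) and an aperiodic reduced chain; the function names, the toy graph, and the choice of NumPy are illustrative assumptions, not the authors' implementation. The kernel version for directed graphs mentioned in the abstract is not covered here.

```python
import numpy as np

def stochastic_complement(P, keep):
    """Reduce a row-stochastic matrix P to the states listed in `keep`.

    Uses the standard block formula S = P11 + P12 (I - P22)^{-1} P21,
    where block 1 holds the kept states and block 2 the removed ones.
    """
    keep = np.asarray(keep)
    rest = np.setdiff1d(np.arange(P.shape[0]), keep)
    P11 = P[np.ix_(keep, keep)]
    P12 = P[np.ix_(keep, rest)]
    P21 = P[np.ix_(rest, keep)]
    P22 = P[np.ix_(rest, rest)]
    return P11 + P12 @ np.linalg.solve(np.eye(len(rest)) - P22, P21)

def diffusion_map(S, pi, t=1, dim=1):
    """Diffusion-map coordinates of a reversible chain S with stationary law pi."""
    d = np.sqrt(pi)
    M = (d[:, None] * S) / d[None, :]      # Pi^{1/2} S Pi^{-1/2}, symmetric if reversible
    M = (M + M.T) / 2                      # remove floating-point asymmetry
    vals, vecs = np.linalg.eigh(M)
    order = np.argsort(-np.abs(vals))      # largest modulus first; eigenvalue 1 comes first
    vals, vecs = vals[order], vecs[:, order]
    psi = vecs / d[:, None]                # right eigenvectors of S
    # Skip the trivial constant eigenvector and scale by eigenvalue^t.
    return psi[:, 1:dim + 1] * vals[1:dim + 1] ** t

# Hypothetical toy graph: nodes 0,1 (first table), 2,3 (intermediate), 4,5 (second table).
edges = [(0, 2), (1, 3), (2, 4), (3, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

deg = A.sum(axis=1)
P = A / deg[:, None]                       # random-walk transition matrix
pi = deg / deg.sum()                       # stationary distribution of the full chain

keep = [0, 1, 4, 5]                        # elements of interest from the two tables
S = stochastic_complement(P, keep)         # reduced chain on the kept states
pi_S = pi[keep] / pi[keep].sum()           # stationary distribution of the reduced chain
emb = diffusion_map(S, pi_S, t=1, dim=1)   # one-dimensional joint embedding
print(S)
print(emb)
```

In this toy graph, nodes 0 and 4 share the intermediate node 2, while nodes 1 and 5 share node 3, so each pair is expected to land on the same coordinate, with the two pairs on opposite sides of the origin.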
INDEX TERMS
Graph mining, link analysis, kernel on a graph, diffusion map, correspondence analysis, dimensionality reduction, statistical relational learning.
CITATION
Marco Saerens, Luh Yen, "A Link Analysis Extension of Correspondence Analysis for Mining Relational Databases", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 4, pp. 481-495, April 2011, doi:10.1109/TKDE.2010.142
REFERENCES
[1] P. Baldi, P. Frasconi, and P. Smyth, Modeling the Internet and the Web: Probabilistic Methods and Algorithms. John Wiley & Sons, 2003.
[2] M. Belkin and P. Niyogi, "Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering," Advances in Neural Information Processing Systems, vol. 14, pp. 585-591, MIT Press, 2001.
[3] M. Belkin and P. Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation," Neural Computation, vol. 15, pp. 1373-1396, 2003.
[4] C. Blake, E. Keogh, and C. Merz, "UCI Repository of Machine Learning Databases," Univ. California, Dept. of Information and Computer Science, http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[5] J. Blasius, M. Greenacre, P. Groenen, and M. van de Velden, "Special Issue on Correspondence Analysis and Related Methods," Computational Statistics and Data Analysis, vol. 53, no. 8, pp. 3103-3106, 2009.
[6] I. Borg and P. Groenen, Modern Multidimensional Scaling: Theory and Applications. Springer, 1997.
[7] P. Bremaud, Markov Chains: Gibbs Fields, Monte Carlo Simulation, and Queues. Springer-Verlag, 1999.
[8] P. Carrington, J. Scott, and S. Wasserman, Models and Methods in Social Network Analysis. Cambridge Univ. Press, 2006.
[9] S. Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data. Elsevier Science, 2003.
[10] F.R. Chung, Spectral Graph Theory. Am. Math. Soc., 1997.
[11] D.J. Cook and L.B. Holder, Mining Graph Data. Wiley and Sons, 2006.
[12] T. Cox and M. Cox, Multidimensional Scaling, second ed. Chapman and Hall, 2001.
[13] N. Cressie, Statistics for Spatial Data. Wiley, 1991.
[14] P. Demartines and J. Herault, "Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets," IEEE Trans. Neural Networks, vol. 8, no. 1, pp. 148-154, Jan. 1997.
[15] C. Ding, "Spectral Clustering," Tutorial presented at the 16th European Conf. Machine Learning (ECML '05), 2005.
[16] P. Domingos, "Prospects and Challenges for Multi-Relational Data Mining," ACM SIGKDD Explorations Newsletter, vol. 5, no. 1, pp. 80-83, 2003.
[17] F. Fouss, A. Pirotte, J.-M. Renders, and M. Saerens, "Random-Walk Computation of Similarities between Nodes of a Graph, with Application to Collaborative Recommendation," IEEE Trans. Knowledge and Data Eng., vol. 19, no. 3, pp. 355-369, Mar. 2007.
[18] F. Fouss, J.-M. Renders, and M. Saerens, "Links between Kleinberg's Hubs and Authorities, Correspondence Analysis and Markov Chains," Proc. Third IEEE Int'l Conf. Data Mining (ICDM), pp. 521-524, 2003.
[19] F. Fouss, L. Yen, A. Pirotte, and M. Saerens, "An Experimental Investigation of Graph Kernels on a Collaborative Recommendation Task," Proc. Sixth Int'l Conf. Data Mining (ICDM '06), pp. 863-868, 2006.
[20] F. Geerts, H. Mannila, and E. Terzi, "Relational Link-Based Ranking," Proc. 30th Very Large Data Bases Conf. (VLDB), pp. 552-563, 2004.
[21] X. Geng, D.-C. Zhan, and Z.-H. Zhou, "Supervised Nonlinear Dimensionality Reduction for Visualization and Classification," IEEE Trans. Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 35, no. 6, pp. 1098-1107, Dec. 2005.
[22] Introduction to Statistical Relational Learning, L. Getoor and B. Taskar, eds. MIT Press, 2007.
[23] J. Gower and D. Hand, Biplots. Chapman & Hall, 1995.
[24] M.J. Greenacre, Theory and Applications of Correspondence Analysis. Academic Press, 1984.
[25] A. Greenbaum, Iterative Methods for Solving Linear Systems. Soc. for Industrial and Applied Math., 1997.
[26] K.M. Hall, "An R-Dimensional Quadratic Placement Algorithm," Management Science, vol. 17, no. 8, pp. 219-229, 1970.
[27] D.A. Harville, Matrix Algebra from a Statistician's Perspective. Springer-Verlag, 1997.
[28] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second ed. Springer-Verlag, 2009.
[29] H. Hwang, W. Dhillon, and Y. Takane, "An Extension of Multiple Correspondence Analysis for Identifying Heterogeneous Subgroups of Respondents," Psychometrika, vol. 71, no. 1, pp. 161-171, 2006.
[30] A. Izenman, Modern Multivariate Statistical Techniques: Regression, Classification, and Manifold Learning. Springer, 2008.
[31] R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, sixth ed. Prentice Hall, 2007.
[32] R. Kimball and M. Ross, The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. John Wiley & Sons, 2002.
[33] J.M. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," J. ACM, vol. 46, no. 5, pp. 604-632, 1999.
[34] P. Kroonenberg and M. Greenacre, "Correspondence Analysis," Encyclopedia of Statistical Sciences, S. Kotz, ed., second ed., pp. 1394-1403, John Wiley & Sons, 2006.
[35] S. Lafon and A.B. Lee, "Diffusion Maps and Coarse-Graining: A Unified Framework for Dimensionality Reduction, Graph Partitioning, and Data Set Parameterization," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 9, pp. 1393-1403, Sept. 2006.
[36] A.N. Langville and C.D. Meyer, Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton Univ. Press, 2006.
[37] J. Lee and M. Verleysen, Nonlinear Dimensionality Reduction. Springer, 2007.
[38] D. Lusseau, K. Schneider, O. Boisseau, P. Haase, E. Slooten, and S. Dawson, "The Bottlenose Dolphin Community of Doubtful Sound Features a Large Proportion of Long-Lasting Associations. Can Geographic Isolation Explain This Unique Trait?" Behavioral Ecology and Sociobiology, vol. 54, no. 4, pp. 396-405, 2003.
[39] S.A. Macskassy and F. Provost, "Classification in Networked Data: A Toolkit and a Univariate Case Study," J. Machine Learning Research, vol. 8, pp. 935-983, 2007.
[40] K.V. Mardia, J.T. Kent, and J.M. Bibby, Multivariate Analysis. Academic Press, 1979.
[41] C.D. Meyer, "Stochastic Complementation, Uncoupling Markov Chains, and the Theory of Nearly Reducible Systems," SIAM Rev., vol. 31, no. 2, pp. 240-272, 1989.
[42] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, "Diffusion Maps, Spectral Clustering and Eigenfunctions of Fokker-Planck Operators," Advances in Neural Information Processing Systems, vol. 18, pp. 955-962, MIT Press, 2005.
[43] B. Nadler, S. Lafon, R. Coifman, and I. Kevrekidis, "Diffusion Maps, Spectral Clustering and Reaction Coordinate of Dynamical Systems," Applied and Computational Harmonic Analysis, vol. 21, pp. 113-127, 2006.
[44] M. Newman and M. Girvan, "Finding and Evaluating Community Structure in Networks," Physical Rev. E, vol. 69, p. 026113, 2004.
[45] A.Y. Ng, M.I. Jordan, and Y. Weiss, "On Spectral Clustering: Analysis and an Algorithm," Advances in Neural Information Processing Systems, vol. 14, pp. 849-856, MIT Press, 2001.
[46] P. Pons and M. Latapy, "Computing Communities in Large Networks Using Random Walks," Proc. Int'l Symp. Computer and Information Sciences (ISCIS '05), pp. 284-293, 2005.
[47] P. Pons and M. Latapy, "Computing Communities in Large Networks Using Random Walks," J. Graph Algorithms and Applications, vol. 10, no. 2, pp. 191-218, 2006.
[48] S. Ross, Stochastic Processes, second ed. Wiley, 1996.
[49] Y. Saad, Iterative Methods for Sparse Linear Systems, second ed. Soc. for Industrial and Applied Math., 2003.
[50] M. Saerens and F. Fouss, "HITS Is Principal Component Analysis," Proc. 2005 IEEE/WIC/ACM Int'l Joint Conf. Web Intelligence, pp. 782-785, 2005.
[51] M. Saerens, F. Fouss, L. Yen, and P. Dupont, "The Principal Components Analysis of a Graph, and Its Relationships to Spectral Clustering," Proc. 15th European Conf. Machine Learning (ECML '04), pp. 371-383, 2004.
[52] J.W. Sammon, "A Nonlinear Mapping for Data Structure Analysis," IEEE Trans. Computers, vol. C-18, no. 5, pp. 401-409, May 1969.
[53] O. Schabenberger and C. Gotway, Statistical Methods for Spatial Data Analysis. Chapman & Hall, 2005.
[54] B. Schölkopf and A. Smola, Learning with Kernels. MIT Press, 2002.
[55] B. Schölkopf, A. Smola, and K. Müller, "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[56] R. Sedgewick, Algorithms in C. Addison-Wesley, 1990.
[57] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge Univ. Press, 2004.
[58] J. Shi and J. Malik, "Normalized Cuts and Image Segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, Aug. 2000.
[59] D. Spielman and S.-H. Teng, "Nearly-Linear Time Algorithms for Preconditioning and Solving Symmetric, Diagonally Dominant Linear Systems," arXiv, http://arxiv.org/abs/cs/0607105, 2007.
[60] W.J. Stewart, Introduction to the Numerical Solution of Markov Chains. Princeton Univ. Press, 1994.
[61] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A Global Geometric Framework for Nonlinear Dimensionality Reduction," Science, vol. 290, pp. 2319-2323, 2000.
[62] M. Tenenhaus and F. Young, "An Analysis and Synthesis of Multiple Correspondence Analysis, Optimal Scaling, Dual Scaling, Homogeneity Analysis and Other Methods for Quantifying Categorical Multivariate Data," Psychometrika, vol. 50, no. 1, pp. 91-119, 1985.
[63] M. Thelwall, Link Analysis: An Information Science Approach. Elsevier, 2004.
[64] L. Trefethen and D. Bau, Numerical Linear Algebra. Soc. for Industrial and Applied Math., 1997.
[65] U. von Luxburg, "A Tutorial on Spectral Clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[66] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge Univ. Press, 1994.
[67] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, "Graph Embedding and Extensions: A General Framework for Dimensionality Reduction," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 29, no. 1, pp. 40-51, Jan. 2007.
[68] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens, "Graph Nodes Clustering Based on the Commute-Time Kernel," Proc. 11th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD '07), 2007.
[69] L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens, "Graph Nodes Clustering with the Sigmoid Commute-Time Kernel: A Comparative Study," Data and Knowledge Eng., vol. 68, pp. 338-361, 2009.
[70] D.M. Young and R.T. Gregory, A Survey of Numerical Mathematics. Dover Publications, 1988.
[71] W.W. Zachary, "An Information Flow Model for Conflict and Fission in Small Groups," J. Anthropological Research, vol. 33, pp. 452-473, 1977.
[72] H. Zha, X. He, C.H.Q. Ding, M. Gu, and H.D. Simon, "Bipartite Graph Partitioning and Data Clustering," Proc. ACM 10th Int'l Conf. Information and Knowledge Management (CIKM '01), pp. 25-32, 2001.
[73] X. Zhu and A. Goldberg, Introduction to Semi-Supervised Learning. Morgan & Claypool Publishers, 2009.
[74] J.Y. Zien, M.D. Schlag, and P.K. Chan, "Multilevel Spectral Hypergraph Partitioning with Arbitrary Vertex Sizes," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 18, no. 9, pp. 1389-1399, 1999.