Issue No.10 - October (2008 vol.20)
pp: 1348-1362
Pauli Miettinen, University of Helsinki
Taneli Mielikäinen, Nokia Research Center Palo Alto, Palo Alto
Aristides Gionis, Yahoo, Barcelona
Gautam Das, University of Texas at Arlington, Arlington
Heikki Mannila, University of Helsinki and Helsinki University of Technology, Helsinki
ABSTRACT
Matrix decomposition methods represent a data matrix as a product of two factor matrices: one containing basis vectors that represent meaningful concepts in the data, and another describing how the observed data can be expressed as combinations of the basis vectors. Decomposition methods have been studied extensively, but many methods return real-valued matrices. Interpreting real-valued factor matrices is hard if the original data is Boolean. In this paper, we describe a matrix decomposition formulation for Boolean data, the Discrete Basis Problem. The problem seeks a Boolean decomposition of a binary matrix, thus allowing the user to easily interpret the basis vectors. We also describe a variation of the problem, the Discrete Basis Partitioning Problem. We show that both problems are NP-hard. For the Discrete Basis Problem, we give a simple greedy algorithm; for the Discrete Basis Partitioning Problem, we show how it can be solved using existing methods. We present experimental results for the greedy algorithm and compare it against other well-known methods. Our algorithm gives intuitive basis vectors, but its reconstruction error is usually larger than that of the real-valued methods. We discuss the reasons for this behavior.
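To make the notion of a Boolean decomposition concrete, the following minimal sketch multiplies two 0/1 factor matrices under Boolean arithmetic (where 1 + 1 = 1) and counts the entries in which the product disagrees with the data. The names `boolean_product`, `reconstruction_error`, `S` (usage matrix), and `B` (basis matrix) are hypothetical, and measuring error as the number of mismatched entries is an assumption made for illustration; the abstract does not fix these details, and this is not the greedy algorithm the paper describes.

```python
def boolean_product(S, B):
    """Boolean matrix product of S (n x k) and B (k x m), both lists of 0/1 rows.

    Entry (i, j) is 1 if some l has S[i][l] == 1 and B[l][j] == 1, else 0.
    """
    k = len(B)
    m = len(B[0])
    return [
        [int(any(row[l] and B[l][j] for l in range(k))) for j in range(m)]
        for row in S
    ]


def reconstruction_error(C, S, B):
    """Number of entries where C differs from the Boolean product of S and B.

    Assumption: error is counted entry by entry (not taken from the abstract).
    """
    P = boolean_product(S, B)
    return sum(
        c != p
        for c_row, p_row in zip(C, P)
        for c, p in zip(c_row, p_row)
    )


# Toy data: 3 observations over 3 attributes.
C = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
# Two Boolean basis vectors (rows of B).
B = [
    [1, 1, 0],
    [0, 1, 1],
]
# Which basis vectors each observation uses (rows of S).
S = [
    [1, 0],
    [1, 1],
    [0, 1],
]

print(boolean_product(S, B))
print(reconstruction_error(C, S, B))
```

In this toy case the second observation uses both basis vectors. Under ordinary arithmetic its middle entry would become 2, whereas the Boolean product keeps it at 1, so the factors stay binary and each basis vector can be read directly as a set of attributes.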
INDEX TERMS
Mining methods and algorithms; Clustering, classification, and association rules; Text mining
CITATION
Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, Heikki Mannila, "The Discrete Basis Problem", IEEE Transactions on Knowledge & Data Engineering, vol.20, no. 10, pp. 1348-1362, October 2008, doi:10.1109/TKDE.2008.53