
Issue No.10 - October (2008 vol.20)

pp: 1348-1362

Pauli Miettinen, University of Helsinki

Taneli Mielikäinen, Nokia Research Center Palo Alto, Palo Alto

Aristides Gionis, Yahoo, Barcelona

Gautam Das, University of Texas at Arlington, Arlington

Heikki Mannila, University of Helsinki and Helsinki University of Technology, Helsinki

DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/TKDE.2008.53

ABSTRACT

Matrix decomposition methods represent a data matrix as a product of two factor matrices: one containing basis vectors that represent meaningful concepts in the data, and another describing how the observed data can be expressed as combinations of the basis vectors. Decomposition methods have been studied extensively, but many methods return real-valued matrices. Interpreting real-valued factor matrices is hard if the original data is Boolean. In this paper, we describe a matrix decomposition formulation for Boolean data, the Discrete Basis Problem. The problem seeks a Boolean decomposition of a binary matrix, thus allowing the user to easily interpret the basis vectors. We also describe a variation of the problem, the Discrete Basis Partitioning Problem. We show that both problems are NP-hard. For the Discrete Basis Problem, we give a simple greedy algorithm; for the Discrete Basis Partitioning Problem, we show how it can be solved using existing methods. We present experimental results for the greedy algorithm and compare it against other well-known methods. Our algorithm gives intuitive basis vectors, but its reconstruction error is usually larger than that of the real-valued methods. We discuss the reasons for this behavior.
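To make the idea of a Boolean decomposition concrete, the following minimal Python sketch computes a Boolean matrix product and counts the entries where it disagrees with a binary data matrix. It is only an illustration under stated assumptions: the product is taken to be the usual Boolean one (OR of ANDs), the error is taken to be the number of mismatched entries, and the names `boolean_product`, `reconstruction_error`, `C`, `S`, and `B` are hypothetical. It is not the greedy algorithm from the paper, and it does not search for a decomposition; it only evaluates a given one.

```python
def boolean_product(S, B):
    """Boolean matrix product (assumed definition): entry (i, j) is 1
    if some l has S[i][l] == 1 and B[l][j] == 1, otherwise 0."""
    k = len(B)       # number of basis vectors
    d = len(B[0])    # number of columns in the data
    return [
        [int(any(row[l] and B[l][j] for l in range(k))) for j in range(d)]
        for row in S
    ]


def reconstruction_error(C, S, B):
    """Count the entries where C differs from the Boolean product of S and B
    (an assumed error measure for this illustration)."""
    approx = boolean_product(S, B)
    return sum(
        c != a
        for c_row, a_row in zip(C, approx)
        for c, a in zip(c_row, a_row)
    )


# Hypothetical toy data: two Boolean basis vectors over four columns.
B = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
# Usage matrix: which basis vectors each data row combines.
S = [[1, 0],
     [0, 1],
     [1, 1]]
# Observed binary data; the last row is not covered exactly.
C = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 1, 1, 0]]

print(reconstruction_error(C, S, B))  # prints 1
```

Because both factors are 0/1 matrices, each basis vector can be read directly as a set of columns, which is the interpretability advantage the abstract refers to. The single mismatch in the last row shows how a Boolean decomposition can leave residual error where a real-valued one could fit the entry more closely.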

INDEX TERMS

Mining methods and algorithms, Clustering, classification, and association rules, Text mining

CITATION

Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, Heikki Mannila, "The Discrete Basis Problem",

*IEEE Transactions on Knowledge & Data Engineering*, vol. 20, no. 10, pp. 1348-1362, October 2008, doi:10.1109/TKDE.2008.53