Issue No.03 - March (2013 vol.25)
pp: 690-703
Byron J. Gao, Texas State University - San Marcos, San Marcos
Martin Ester, Simon Fraser University, Burnaby
Hui Xiong, Rutgers, the State University of New Jersey, Newark
Jin-Yi Cai, University of Wisconsin - Madison, Madison
Oliver Schulte, Simon Fraser University, Burnaby
In this paper, we introduce and study the minimum consistent subset cover (MCSC) problem. Given a finite ground set X and a constraint t, find the minimum number of consistent subsets that cover X, where a subset of X is consistent if it satisfies t. The MCSC problem generalizes the traditional set covering problem and has minimum clique partition (MCP), a dual problem of graph coloring, as an instance. Many common data mining tasks in rule learning, clustering, and pattern mining can be formulated as MCSC instances. In particular, we discuss the minimum rule set (MRS) problem, which minimizes the model complexity of decision rules; the converse k-clustering problem, which minimizes the number of clusters; and the pattern summarization problem, which minimizes the number of patterns. Our proposed generic algorithm, CAG, is directly applicable to any of these MCSC instances. CAG starts by constructing a maximal optimal partial solution, then performs an example-driven specific-to-general search on a dynamically maintained bipartite assignment graph to simultaneously learn a small set of consistent subsets that covers the ground set.
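To make the problem statement concrete, the following is a minimal Python sketch of the MCSC setting, using minimum clique partition as the instance: the ground set is a set of vertices, and a subset is consistent when it forms a clique. The heuristic shown is a simple greedy grow-and-cover procedure chosen for illustration only. It is not the CAG algorithm from the paper and carries no optimality guarantee. The names `greedy_consistent_cover`, `is_clique`, and the example graph are hypothetical, and the sketch assumes every singleton subset is consistent.

```python
from itertools import combinations
from typing import Callable, Hashable, Iterable, List, Set


def greedy_consistent_cover(
    ground_set: Iterable[Hashable],
    is_consistent: Callable[[Set[Hashable]], bool],
) -> List[Set[Hashable]]:
    """Cover the ground set with consistent subsets using a greedy heuristic.

    Illustrative only: this is NOT the paper's CAG algorithm. It assumes
    that every singleton {x} satisfies the constraint.
    """
    uncovered = list(ground_set)  # a list keeps the visiting order deterministic
    cover: List[Set[Hashable]] = []
    while uncovered:
        # Seed a new subset with the first uncovered element.
        subset = {uncovered.pop(0)}
        remaining = []
        for x in uncovered:
            # Absorb x only if the enlarged subset still satisfies the constraint.
            if is_consistent(subset | {x}):
                subset.add(x)
            else:
                remaining.append(x)
        uncovered = remaining
        cover.append(subset)
    return cover


# Hypothetical MCP instance: a triangle on {1, 2, 3} plus an edge between 4 and 5.
edges = {frozenset(pair) for pair in [(1, 2), (2, 3), (1, 3), (4, 5)]}


def is_clique(subset: Set[Hashable]) -> bool:
    """Constraint t for MCP: every pair of vertices in the subset is adjacent."""
    return all(frozenset(pair) in edges for pair in combinations(subset, 2))


cover = greedy_consistent_cover([1, 2, 3, 4, 5], is_clique)
print(len(cover), cover)
```

On this small graph the greedy pass returns two cliques, {1, 2, 3} and {4, 5}, which happens to be the minimum. In general the result depends on the visiting order and may use more subsets than necessary. Swapping in a different `is_consistent` predicate (for example, "all examples in the subset can be covered by one rule") gives other MCSC instances under the same interface.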
Data mining, Pattern recognition, Complexity theory, Minimization, Decision trees, Graph coloring, Clustering algorithms, pattern summarization, Minimum consistent subset cover, set covering, graph coloring, minimum clique partition, minimum star partition, minimum rule set, converse k-clustering
Byron J. Gao, Martin Ester, Hui Xiong, Jin-Yi Cai, Oliver Schulte, "The Minimum Consistent Subset Cover Problem: A Minimization View of Data Mining", IEEE Transactions on Knowledge & Data Engineering, vol.25, no. 3, pp. 690-703, March 2013, doi:10.1109/TKDE.2011.260