Issue No.05 - May (2011 vol.23)
pp: 683-698
Wenfei Fan, University of Edinburgh, Edinburgh
Floris Geerts, University of Edinburgh, Edinburgh
Jianzhong Li, Harbin Institute of Technology, Harbin
Ming Xiong, Bell Laboratories, Murray Hill
This paper investigates the discovery of conditional functional dependencies (CFDs). CFDs are a recent extension of functional dependencies (FDs) by supporting patterns of semantically related constants, and can be used as rules for cleaning relational data. However, finding quality CFDs is an expensive process that involves intensive manual effort. To effectively identify data cleaning rules, we develop techniques for discovering CFDs from relations. Already hard for traditional FDs, the discovery problem is more difficult for CFDs. Indeed, mining patterns in CFDs introduces new challenges. We provide three methods for CFD discovery. The first, referred to as CFDMiner, is based on techniques for mining closed item sets, and is used to discover constant CFDs, namely, CFDs with constant patterns only. Constant CFDs are particularly important for object identification, which is essential to data cleaning and data integration. The other two algorithms are developed for discovering general CFDs. One algorithm, referred to as CTANE, is a levelwise algorithm that extends TANE, a well-known algorithm for mining FDs. The other, referred to as FastCFD, is based on the depth-first approach used in FastFD, a method for discovering FDs. It leverages closed-item-set mining to reduce the search space. As verified by our experimental study, CFDMiner can be multiple orders of magnitude faster than CTANE and FastCFD for constant CFD discovery. CTANE works well when a given relation is large, but it does not scale well with the arity of the relation. FastCFD is far more efficient than CTANE when the arity of the relation is large; better still, leveraging optimization based on closed-item-set mining, FastCFD also scales well with the size of the relation. These algorithms provide a set of cleaning-rule discovery tools for users to choose for different applications.
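To make the notion of a CFD concrete, the following is a minimal sketch of how a single CFD could be checked against a small relation. It is not one of the paper's discovery algorithms (CFDMiner, CTANE, FastCFD); it only illustrates the standard semantics of a CFD (X -> A, tp), where the pattern tuple tp holds constants or an unnamed variable "_". The relation, the attribute names (CC, AC, city), and the function names are hypothetical and chosen purely for illustration.

```python
from itertools import combinations

WILDCARD = "_"  # unnamed variable in a pattern tuple (assumed notation)


def matches(value, pattern):
    """A value matches a pattern entry if the entry is a wildcard or the same constant."""
    return pattern == WILDCARD or value == pattern


def violations(rows, lhs, rhs, pattern):
    """Return violations of the CFD (lhs -> rhs, pattern) in rows (a list of dicts).

    A 1-tuple (i,) means row i matches the LHS pattern but not the RHS pattern.
    A 2-tuple (i, j) means rows i and j agree on lhs but differ on rhs.
    """
    # Only tuples whose LHS values match the pattern are constrained by the CFD.
    scope = [(i, r) for i, r in enumerate(rows)
             if all(matches(r[a], pattern[a]) for a in lhs)]
    found = []
    # Single-tuple violations: the RHS value does not match the RHS pattern.
    for i, r in scope:
        if not matches(r[rhs], pattern[rhs]):
            found.append((i,))
    # Pairwise violations: the embedded FD lhs -> rhs fails inside the scope.
    for (i, r), (j, s) in combinations(scope, 2):
        if all(r[a] == s[a] for a in lhs) and r[rhs] != s[rhs]:
            found.append((i, j))
    return found


# Hypothetical customer relation: country code, area code, city.
rows = [
    {"CC": "44", "AC": "131", "city": "Edinburgh"},
    {"CC": "44", "AC": "131", "city": "London"},
    {"CC": "01", "AC": "908", "city": "Murray Hill"},
    {"CC": "01", "AC": "908", "city": "Murray Hill"},
]

# Constant CFD: every entry of the pattern tuple is a constant.
constant_cfd = {"CC": "44", "AC": "131", "city": "Edinburgh"}
print(violations(rows, ["CC", "AC"], "city", constant_cfd))  # [(1,), (0, 1)]

# General CFD: the FD [CC, AC] -> city is required only where CC = "01".
general_cfd = {"CC": "01", "AC": "_", "city": "_"}
print(violations(rows, ["CC", "AC"], "city", general_cfd))  # []
```

In this sketch a constant CFD can be violated by a single tuple, which is why such rules are convenient for cleaning, whereas a pattern with wildcards on the right-hand side can only be violated by a pair of tuples. Discovery is the inverse and much harder task: finding the patterns and attribute sets for which this check returns no violations, subject to conditions such as minimality and support.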
Integrity, conditional functional dependency, functional dependency, free item set, closed item set.
Wenfei Fan, Floris Geerts, Jianzhong Li, Ming Xiong, "Discovering Conditional Functional Dependencies", IEEE Transactions on Knowledge & Data Engineering, vol.23, no. 5, pp. 683-698, May 2011, doi:10.1109/TKDE.2010.154