Issue No.03 - March (2014 vol.26)
pp: 682-697
Ma'ayan Dror, Ben-Gurion University of the Negev, Beer-Sheva
Asaf Shabtai, Ben-Gurion University of the Negev, Beer-Sheva
Lior Rokach, Ben-Gurion University of the Negev, Beer-Sheva
Yuval Elovici, Ben-Gurion University of the Negev, Beer-Sheva
ABSTRACT
One-to-many data linkage is an essential task in many domains, yet only a handful of prior publications have addressed it. Furthermore, while data linkage is traditionally performed among entities of the same type, there is also a clear need for techniques that link matching entities of different types. In this paper, we propose a new one-to-many data linkage method that links entities of different natures. The proposed method is based on a one-class clustering tree (OCCT) that characterizes the entities that should be linked together. The tree is built so that it is easy to understand and to transform into association rules: the inner nodes consist only of features describing the first set of entities, while the leaves represent features of their matching entities from the second data set. We propose four splitting criteria and two pruning methods that can be used for inducing the OCCT. The method was evaluated using data sets from three different domains. The results affirm the effectiveness of the proposed method and show that the OCCT yields better precision and recall than a C4.5 decision tree-based linkage method, with the difference statistically significant in most cases.
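The tree structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`build_tree`, `match_score`), the Jaccard-overlap split score, the frequency-table leaves, and the toy customer/product data are all assumptions made for illustration. The abstract does not specify the four splitting criteria or the two pruning methods, so the sketch uses one stand-in criterion and no pruning. It keeps only the structural property stated above: inner nodes test features of the first data set, and leaves summarize features of the matching records from the second.

```python
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def build_tree(pairs, a_attrs, min_size=2):
    """Grow a tree from known matching pairs (a_record, b_record).

    Inner nodes split only on attributes of the first table (A);
    leaves hold value frequencies of the matching records from B.
    """
    best = None
    if len(pairs) >= min_size:
        for attr in a_attrs:
            groups = defaultdict(list)
            for a, b in pairs:
                groups[a[attr]].append((a, b))
            if len(groups) < 2:
                continue  # attribute does not separate the pairs
            # Stand-in criterion (assumption): prefer the split whose
            # branches share the fewest B records with each other.
            b_sets = [{tuple(sorted(b.items())) for _, b in g}
                      for g in groups.values()]
            score = max(jaccard(x, y)
                        for i, x in enumerate(b_sets)
                        for y in b_sets[i + 1:])
            if best is None or score < best[0]:
                best = (score, attr, groups)
    if best is None:
        # Leaf: per-attribute value counts of the B side.
        counts = defaultdict(Counter)
        for _, b in pairs:
            for key, value in b.items():
                counts[key][value] += 1
        return {"leaf": {k: dict(c) for k, c in counts.items()},
                "n": len(pairs)}
    _, attr, groups = best
    rest = [x for x in a_attrs if x != attr]
    return {"split": attr,
            "children": {v: build_tree(g, rest, min_size)
                         for v, g in groups.items()}}


def match_score(tree, a, b):
    """Follow A's attributes to a leaf, then score B against that leaf.

    The product of per-attribute frequencies is an illustrative choice.
    """
    node = tree
    while "split" in node:
        node = node["children"].get(a[node["split"]])
        if node is None:
            return 0.0  # unseen A value: no evidence for a link
    score = 1.0
    for key, counts in node["leaf"].items():
        score *= counts.get(b[key], 0) / node["n"]
    return score


# Hypothetical toy data: customers (A) linked to purchased products (B).
pairs = [
    ({"city": "Paris"}, {"cat": "wine"}),
    ({"city": "Paris"}, {"cat": "wine"}),
    ({"city": "Paris"}, {"cat": "cheese"}),
    ({"city": "Oslo"}, {"cat": "skis"}),
    ({"city": "Oslo"}, {"cat": "skis"}),
]
tree = build_tree(pairs, ["city"])
paris_wine = match_score(tree, {"city": "Paris"}, {"cat": "wine"})
oslo_wine = match_score(tree, {"city": "Oslo"}, {"cat": "wine"})
```

Because each root-to-leaf path tests only attributes of the first table and ends in a description of the second, a path reads directly as a rule of the form "if city = Paris, then matching products are mostly wine". This is the rule-like readability the abstract refers to.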
INDEX TERMS
Couplings, Decision trees, Vegetation, Training, Classification algorithms, Numerical models, Buildings, decision tree induction, Clustering, classification, data matching
CITATION
Ma'ayan Dror, Asaf Shabtai, Lior Rokach, Yuval Elovici, "OCCT: A One-Class Clustering Tree for Implementing One-to-Many Data Linkage", IEEE Transactions on Knowledge & Data Engineering, vol.26, no. 3, pp. 682-697, March 2014, doi:10.1109/TKDE.2013.23