2013 IEEE 29th International Conference on Data Engineering (ICDE) (2013)
Brisbane, Australia
Apr. 8, 2013 to Apr. 12, 2013
ISSN: 1063-6382
ISBN: 978-1-4673-4909-3
pp: 1231-1241
S. Chaturvedi , IBM Res. - India, New Delhi, India
K. H. Prasad , IBM Res. - India, New Delhi, India
T. A. Faruquie , IBM Res. - India, New Delhi, India
B. S. Chawda , IBM Res. - India, New Delhi, India
L. V. Subramaniam , IBM Res. - India, New Delhi, India
R. Krishnapuram , IBM Res. - India, New Delhi, India
ABSTRACT
Data quality is a perennial problem for many enterprise data assets. To improve data quality, businesses often employ rule-based data standardization systems in which domain experts code rules for handling important and prevalent patterns. Finding these patterns is laborious and time-consuming, particularly for noisy or highly specialized datasets. It is also subjective, since the result depends on the people identifying the patterns. In this paper we present a tool that automatically mines patterns to improve the efficiency and effectiveness of these data standardization systems. Domain and knowledge experts then use the automatically extracted patterns for rule writing. We use a greedy algorithm to extract patterns that result in maximal coverage of the data. We further group the extracted patterns such that each group represents patterns that capture similar domain knowledge. We propose a similarity measure that uses input pattern semantics to group these patterns. We demonstrate the effectiveness of our method for standardization tasks on three real-world datasets.
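The abstract does not spell out the greedy procedure, so the following is only an illustrative sketch of the textbook greedy maximum-coverage heuristic, not the authors' algorithm. It assumes each candidate pattern is already mapped to the set of record IDs it matches; the function name `greedy_max_coverage`, the budget `k`, and the toy address-like patterns are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def greedy_max_coverage(pattern_to_records: Dict[str, Set[int]], k: int) -> List[str]:
    """Pick up to k patterns, each time taking the pattern that covers
    the most records not yet covered by earlier picks."""
    covered: Set[int] = set()
    chosen: List[str] = []
    for _ in range(k):
        best: Optional[str] = None
        best_gain = 0
        # sorted() makes tie-breaking deterministic (alphabetical order wins).
        for pattern in sorted(pattern_to_records):
            if pattern in chosen:
                continue
            gain = len(pattern_to_records[pattern] - covered)
            if gain > best_gain:
                best, best_gain = pattern, gain
        if best is None:  # no remaining pattern adds new coverage
            break
        chosen.append(best)
        covered |= pattern_to_records[best]
    return chosen


# Hypothetical toy data: token-class patterns and the record IDs they match.
example = {
    "NUM STREET": {0, 1, 2, 3},
    "NUM STREET CITY": {2, 3, 4},
    "CITY ZIP": {4, 5},
    "ZIP": {5},
}
selected = greedy_max_coverage(example, k=2)
```

On this toy input the second pick is "CITY ZIP" rather than the larger "NUM STREET CITY", because only the marginal gain over already covered records counts.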
INDEX TERMS
Semantics, Writing, Buildings, Data mining, Noise measurement, Standards
CITATION

S. Chaturvedi, K. H. Prasad, T. A. Faruquie, B. S. Chawda, L. V. Subramaniam and R. Krishnapuram, "Automating pattern discovery for rule based data standardization systems," 2013 IEEE 29th International Conference on Data Engineering (ICDE), Brisbane, QLD, 2013, pp. 1231-1241.
doi:10.1109/ICDE.2013.6544912