A Scalable Approach to Integrating Heterogeneous Aggregate Views of Distributed Databases
January/February 2003 (vol. 15 no. 1)
pp. 232-235

Abstract: Aggregate views are commonly used to summarize information held in very large databases, such as those encountered in data warehousing, large-scale transaction management, and statistical databases. Such applications often involve distributed databases that have developed independently and may therefore exhibit incompatibility, heterogeneity, and data inconsistency. We are concerned here with the integration of aggregates that have heterogeneous classification schemes, where local ontologies, in the form of such classification schemes, may be mapped onto a common ontology. In previous work, we developed a method for integrating such aggregates; that method is efficient, but it cannot handle the innate data inconsistencies that are likely to arise when a large number of databases are being integrated. In this paper, we develop an approach that can handle data inconsistencies and is thus inherently much more scalable. In our new approach, we first construct a dynamic shared ontology by analyzing the correspondence graph that relates the heterogeneous classification schemes; the aggregates are then derived by minimizing the Kullback-Leibler information divergence using the EM (Expectation-Maximization) algorithm. Thus, we can assess whether global queries on such aggregates are answerable, partially answerable, or unanswerable before computing the aggregates themselves.
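To make the integration step concrete, the following is a minimal sketch of the textbook EM iteration for grouped multinomial counts. It is not the authors' implementation. It assumes that each local class has already been mapped to a set of categories in a common ontology, and that maximizing the multinomial likelihood stands in for minimizing the Kullback-Leibler divergence between observed and fitted class proportions (the two are equivalent for this model). The function name `em_integrate`, the table layout, and the example counts are all hypothetical.

```python
def em_integrate(tables, n_categories, iterations=200):
    """Estimate common-ontology category proportions from coarse aggregates.

    tables: one list per database; each entry is (members, count), where
    members is the set of common-ontology category indices covered by a
    local class and count is the aggregate reported for that class.
    (Hypothetical layout chosen for illustration.)
    """
    pi = [1.0 / n_categories] * n_categories  # uniform starting point
    total = sum(count for table in tables for _, count in table)
    for _ in range(iterations):
        expected = [0.0] * n_categories
        for table in tables:
            for members, count in table:
                mass = sum(pi[i] for i in members)
                if mass == 0.0:
                    continue  # class currently has no probability mass
                # E-step: share the class count among its member categories
                # in proportion to the current estimates.
                for i in members:
                    expected[i] += count * pi[i] / mass
        # M-step: renormalize the expected counts into proportions.
        pi = [e / total for e in expected]
    return pi


# Two databases classify the same three categories differently.
tables = [
    [({0}, 30), ({1, 2}, 70)],  # database A merges categories 1 and 2
    [({0, 1}, 60), ({2}, 40)],  # database B merges categories 0 and 1
]
print(em_integrate(tables, 3))  # approximately [0.3, 0.3, 0.4]
```

In this toy case the two tables are mutually consistent, so the estimate reproduces both sets of reported proportions exactly. When the tables disagree, the same iteration settles on a compromise that maximizes the combined likelihood.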

Index Terms:
Distributed databases, aggregate views, dynamic ontologies, Kullback-Leibler information divergence.
Sally McClean, Bryan Scotney, Kieran Greer, "A Scalable Approach to Integrating Heterogeneous Aggregate Views of Distributed Databases," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 1, pp. 232-235, Jan.-Feb. 2003, doi:10.1109/TKDE.2003.1161592