Warehouse Creation-A Potential Roadblock to Data Warehousing
January/February 1999 (vol. 11 no. 1)
pp. 118-126

Abstract: Data warehousing is gaining in popularity as organizations realize the benefits of being able to perform sophisticated analyses of their data. Recent years have seen the introduction of a number of data-warehousing engines, from both established database vendors and new players. The engines themselves are relatively easy to use and come with a good set of end-user tools. However, there is one key stumbling block to the rapid development of data warehouses: warehouse population. Specifically, problems arise in populating a warehouse with existing data because that data exhibits various types of heterogeneity. Given the lack of good tools, this task has generally been performed by system integrators, such as software consulting organizations, which have developed in-house tools and processes for it. The general conclusion is that the task has proven to be labor-intensive, error-prone, and generally frustrating, leading a number of warehousing projects to be abandoned midway through development. The picture is not as grim as it appears, though. The problems being encountered in warehouse creation are very similar to those encountered in data integration, which have been studied for about two decades. Still, not all problems relevant to warehouse creation have been solved, and a number of research issues remain. The principal goal of this paper is to identify the common issues in data integration and data-warehouse creation. We hope this will lead 1) developers of warehouse creation tools to examine and, where appropriate, incorporate the techniques developed for data integration, and 2) researchers in both the data integration and the data warehousing communities to address the open research issues in this important area.

Index Terms:
Data warehouse, entity identification, attribute value conflict, data mining, data integration.
Jaideep Srivastava, Ping-Yao Chen, "Warehouse Creation-A Potential Roadblock to Data Warehousing," IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 1, pp. 118-126, Jan.-Feb. 1999, doi:10.1109/69.755620