State-Space Optimization of ETL Workflows
October 2005 (vol. 17 no. 10)
pp. 1404-1419
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for extracting data from several sources, cleansing and customizing it, and inserting it into a data warehouse. In this paper, we delve into the logical optimization of ETL processes, modeling it as a state-space search problem. We consider each ETL workflow as a state and construct the state space through a set of correct state transitions. Moreover, we provide one exhaustive algorithm and two heuristic algorithms that aim to minimize the execution cost of an ETL workflow. The heuristic algorithm with greedy characteristics significantly outperforms the other two algorithms for a large set of experimental cases.
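
The state-space formulation in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm or cost model. It assumes a purely linear workflow, a hypothetical `Activity` type with a per-row cost and a selectivity, a single transition (swapping two adjacent activities), and that every swap preserves correctness, which a real ETL optimizer would have to verify. Under those assumptions, an exhaustive search and a greedy hill-climbing search might look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    """Hypothetical ETL activity; fields are assumptions for illustration."""
    name: str
    cost_per_row: float   # assumed processing cost per input row
    selectivity: float    # assumed fraction of rows passed downstream


def workflow_cost(workflow, input_rows):
    """Assumed cost model: each activity pays cost_per_row for every row it receives."""
    total, rows = 0.0, float(input_rows)
    for activity in workflow:
        total += rows * activity.cost_per_row
        rows *= activity.selectivity
    return total


def neighbors(workflow):
    """States reachable by one transition: swap two adjacent activities.
    Assumption: every swap is semantically valid in this toy setting."""
    for i in range(len(workflow) - 1):
        swapped = list(workflow)
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        yield tuple(swapped)


def exhaustive_search(start, input_rows):
    """Visit every state reachable from start and keep the cheapest one."""
    seen, frontier, best = {start}, [start], start
    while frontier:
        state = frontier.pop()
        if workflow_cost(state, input_rows) < workflow_cost(best, input_rows):
            best = state
        for nxt in neighbors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return best


def greedy_search(start, input_rows):
    """Hill climbing: move to the cheapest neighbor until none improves the cost."""
    current = start
    while True:
        candidate = min(neighbors(current),
                        key=lambda w: workflow_cost(w, input_rows),
                        default=current)
        if workflow_cost(candidate, input_rows) >= workflow_cost(current, input_rows):
            return current
        current = candidate


# Usage: three made-up activities in a deliberately poor initial order.
start = (Activity("convert", 2.0, 1.0),
         Activity("dedupe", 3.0, 0.5),
         Activity("filter", 1.0, 0.1))
best = exhaustive_search(start, 1000)
print([a.name for a in best], workflow_cost(best, 1000))
```

The exhaustive variant explores every ordering reachable through swaps, so its state space grows factorially with the number of activities. The greedy variant evaluates only the immediate neighbors at each step, which is far cheaper but can stop at a local minimum in general.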


Index Terms:
Database management, database integration, data warehouse and repository, workflow management, heterogeneous databases.
Citation:
Alkis Simitsis, Panos Vassiliadis, Timos Sellis, "State-Space Optimization of ETL Workflows," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 10, pp. 1404-1419, Oct. 2005, doi:10.1109/TKDE.2005.169