Issue No.07 - July (2008 vol.20)
pp: 940-955
ABSTRACT
This paper surveys the most relevant research on combining Data Warehouse (DW) and Web data. It studies the XML technologies that are currently being used to integrate, store, query, and retrieve Web data, and their application to DWs. The paper reviews different distributed DW architectures and the use of XML languages as an integration tool in these systems. It also introduces the problem of dealing with semi-structured data in a DW. It studies Web data repositories, the design of multidimensional databases for XML data sources, and the XML extensions of On-Line Analytical Processing techniques. The paper addresses the application of information retrieval technology in a DW to exploit text-rich document collections. The authors hope that the paper will help to discover the main limitations and opportunities offered by combining the DW and Web fields, as well as to identify open research lines.
INDEX TERMS
Data warehouse and repository, XML/XSL/RDF
CITATION
Juan Manuel Pérez, Rafael Berlanga, María José Aramburu, Torben Bach Pedersen, "Integrating Data Warehouses with Web Data: A Survey", IEEE Transactions on Knowledge & Data Engineering, vol. 20, no. 7, pp. 940-955, July 2008, doi:10.1109/TKDE.2007.190746