Data Mining for XML Query-Answering Support
Aug. 2012 (vol. 24 no. 8)
pp. 1393-1407
Mirjana Mazuran, Politecnico di Milano, Milano
Elisa Quintarelli, Politecnico di Milano, Milano
Letizia Tanca, Politecnico di Milano, Milano
Extracting information from semistructured documents is a very hard task, and is going to become more and more critical as the amount of digital information available on the Internet grows. Indeed, documents are often so large that the data set returned as answer to a query may be too big to convey interpretable knowledge. In this paper, we describe an approach based on Tree-Based Association Rules (TARs): mined rules, which provide approximate, intensional information on both the structure and the contents of Extensible Markup Language (XML) documents, and can be stored in XML format as well. This mined knowledge is later used to provide: 1) a concise idea—the gist—of both the structure and the content of the XML document and 2) quick, approximate answers to queries. In this paper, we focus on the second feature. A prototype system and experimental results demonstrate the effectiveness of the approach.
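
As a rough illustration of how a mined rule can stand in for a full query result, the sketch below computes the support and confidence of a single rule over a toy XML document. It assumes the usual association-rule definitions of support and confidence and a flat record structure. It is not the TAR mining algorithm of the paper, and the document, the `rule_stats` helper, and the tag names are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Toy document (hypothetical data, not taken from the paper).
DOC = """
<library>
  <book><author>Eco</author><genre>novel</genre></book>
  <book><author>Eco</author><genre>novel</genre></book>
  <book><author>Eco</author><genre>essay</genre></book>
  <book><author>Calvino</author><genre>novel</genre></book>
</library>
"""


def rule_stats(root, record_tag, body, head):
    """Return (support, confidence) of the rule body => head.

    body and head are (child_tag, text) pairs. A record matches a pair
    when it has a direct child with that tag and that text.
    This flat matching is an assumption made for the sketch; real TARs
    relate subtrees, not single child elements.
    """
    records = root.findall(record_tag)

    def matches(record, pair):
        tag, text = pair
        return any(child.text == text for child in record.findall(tag))

    n_body = sum(1 for r in records if matches(r, body))
    n_both = sum(1 for r in records if matches(r, body) and matches(r, head))

    # Support: fraction of all records satisfying both body and head.
    support = n_both / len(records) if records else 0.0
    # Confidence: fraction of body-matching records that also match the head.
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


root = ET.fromstring(DOC)
support, confidence = rule_stats(
    root, "book", ("author", "Eco"), ("genre", "novel")
)
print(f"support={support:.2f} confidence={confidence:.2f}")
# prints: support=0.50 confidence=0.67
```

Under these assumptions, a query such as "which genres does Eco write?" could be answered approximately from the stored rule alone ("novel, with confidence 0.67") without scanning the original document, which is the kind of quick, intensional answer the abstract describes.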

Index Terms:
XML, approximate query-answering, data mining, intensional information, succinct answers.
Citation:
Mirjana Mazuran, Elisa Quintarelli, Letizia Tanca, "Data Mining for XML Query-Answering Support," IEEE Transactions on Knowledge and Data Engineering, vol. 24, no. 8, pp. 1393-1407, Aug. 2012, doi:10.1109/TKDE.2011.80