An Efficient and Scalable Algorithm for Clustering XML Documents by Structure
January 2004 (vol. 16 no. 1)
pp. 82-96

Abstract: With the standardization of XML as an information exchange language over the Internet, a huge amount of information is now formatted as XML documents. To analyze this information efficiently, a popular practice is to decompose the XML documents and store them in relational tables. However, query processing then becomes expensive because, in many cases, an excessive number of joins is required to recover information from the fragmented data. If a collection consists of documents with different structures (for example, documents that come from different DTDs), mining clusters in the documents could alleviate the fragmentation problem. We propose a hierarchical algorithm (S-GRACE) for clustering XML documents based on structural information in the data. We introduce the notion of a structure graph (s-graph), which supports a computationally efficient distance metric defined between documents and sets of documents. This simple metric yields a new clustering algorithm that is efficient and effective compared with other approaches based on tree-edit distance. Experiments on real data show that our algorithm can discover clusters not easily identified by manual inspection.
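
The abstract does not define the s-graph or the distance metric, so the following is only a minimal sketch under stated assumptions. It assumes an s-graph is the set of parent-to-child element-name edges occurring in a document (or in any document of a set), and that the distance is one minus the number of shared edges divided by the size of the larger edge set. The function names `s_graph`, `merge_s_graphs`, and `s_graph_distance` are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET


def s_graph(xml_text):
    """Collect the (parent_tag, child_tag) edges of one XML document.

    Assumption: repeated or nested occurrences of the same edge count once,
    so the result is a set rather than a tree.
    """
    root = ET.fromstring(xml_text)
    edges = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node:
            edges.add((node.tag, child.tag))
            stack.append(child)
    return edges


def merge_s_graphs(graphs):
    """Assumed s-graph of a set of documents: the union of their edge sets."""
    merged = set()
    for edges in graphs:
        merged |= edges
    return merged


def s_graph_distance(edges_a, edges_b):
    """Assumed metric: 1 - |common edges| / max(|edges_a|, |edges_b|)."""
    larger = max(len(edges_a), len(edges_b))
    if larger == 0:
        return 0.0  # two edgeless documents are treated as identical
    return 1.0 - len(edges_a & edges_b) / larger


book_a = "<book><title/><author><name/></author></book>"
book_b = "<book><title/><author><name/></author><author><name/></author><year/></book>"
news = "<article><headline/><body><p/></body></article>"

print(s_graph_distance(s_graph(book_a), s_graph(book_b)))  # 0.25
print(s_graph_distance(s_graph(book_a), s_graph(news)))    # 1.0
```

Under these assumptions, the two book documents share three of at most four edges and so lie close together, while the news document shares none with them. Computing the distance needs only set operations on edge lists, which is much cheaper than computing a tree-edit distance between the full document trees.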

Index Terms:
Data mining, clustering, XML, semistructured data, query processing.
Wang Lian, David Wai-lok Cheung, Nikos Mamoulis, Siu-Ming Yiu, "An Efficient and Scalable Algorithm for Clustering XML Documents by Structure," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 1, pp. 82-96, Jan. 2004, doi:10.1109/TKDE.2004.1264824