Discovering Structural Association of Semistructured Data
May/June 2000 (vol. 12 no. 3)
pp. 353-371

Abstract: Many semistructured objects are similarly, though not identically, structured. We study the problem of discovering “typical” substructures of a collection of semistructured objects. The discovered structures can serve as: 1) a “table of contents” for gaining general information about a source, 2) a road map for browsing and querying information sources, 3) a basis for clustering documents, 4) partial schemas for providing standard database access methods, and 5) an indication of users' or customers' interests and browsing patterns. The structural features of semistructured data affect the discovery task in nontrivial ways, and traditional data mining frameworks are inapplicable. We define this discovery problem and propose a solution.

Index Terms:
Association rule, database, data mining, knowledge discovery, semistructured data, web mining.
Citation:
Ke Wang, Huiqing Liu, "Discovering Structural Association of Semistructured Data," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 353-371, May-June 2000, doi:10.1109/69.846290