Discovering Structural Association of Semistructured Data
May/June 2000 (vol. 12 no. 3)
pp. 353-371

Abstract: Many semistructured objects are similarly, though not identically, structured. We study the problem of discovering “typical” substructures of a collection of semistructured objects. The discovered structures can serve as: 1) a “table of contents” that gives a general overview of a source, 2) a road map for browsing and querying information sources, 3) a basis for clustering documents, 4) partial schemas for providing standard database access methods, and 5) a description of users' or customers' interests and browsing patterns. The structural features of semistructured data affect the discovery task in nontrivial ways, and traditional data mining frameworks are inapplicable. We define this discovery problem and propose a solution.

[1] S. Abiteboul, “Querying Semi-Structured Data,” Proc. Int'l Conf. Data Engineering, 1997.
[2] R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules Between Sets of Items in Large Databases,” Proc. 1993 ACM-SIGMOD Int'l Conf. Management of Data, pp. 207–216, May 1993.
[3] R. Agrawal and R. Srikant, “Fast Algorithms for Mining Association Rules,” Proc. 1994 Int'l Conf. Very Large Data Bases, pp. 487–499, Sept. 1994.
[4] P. Buneman, S. Davidson, G. Hillebrand, and D. Suciu, “A Query Language and Optimization Techniques for Unstructured Data,” Proc. SIGMOD, pp. 505–516, 1996.
[5] D. Konopnicki and O. Shmueli, “W3QS: A Query System for the World-Wide Web,” Proc. Int'l Conf. Very Large Data Bases, pp. 54–65, 1995.
[6] S.B. Huffman and C. Baudin, “Toward Structured Retrieval in Semi-Structured Information Spaces,” Proc. Int'l Joint Conf. Artificial Intelligence, pp. 751–756, 1997.
[7] A.O. Mendelzon, G.A. Mihaila, and T. Milo, “Querying the World Wide Web,” Proc. Fourth Int'l Conf. Parallel and Distributed Information Systems, 1996.
[8] S. Nestorov, S. Abiteboul, and R. Motwani, “Inferring Structure in Semistructured Data,” Proc. Workshop Management of Semistructured Data, pp. 42–48, May 1997. See [15].
[9] S. Nestorov, J. Ullman, J. Wiener, and S. Chawathe, “Representative Objects: Concise Representations of Semistructured, Hierarchical Data,” Proc. Int'l Conf. Data Engineering, 1997.
[10] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom, “Object Exchange Across Heterogeneous Information Sources,” Proc. Int'l Conf. Data Engineering, pp. 251–260, 1995.
[11] S.W. Reyner, “An Analysis of a Good Algorithm for the Subtree Problem,” SIAM J. Computing, vol. 6, no. 4, Dec. 1977.
[12] D.Y. Seo, D.H. Lee, K.M. Lee, and J.Y. Lee, “Discovery of Schema Information from a Forest of Selectively Labeled Ordered Trees,” Proc. Workshop Management of Semistructured Data, pp. 54–59, May 1997. See [15].
[13] K. Wang and H.Q. Liu, “Schema Discovery from Semistructured Data,” Proc. Int'l Conf. Knowledge Discovery and Data Mining, pp. 271–274, Aug. 1997.
[14] K. Wang and H.Q. Liu, “Discovering Typical Structures of Documents: A Road Map Approach,” Proc. ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 146–154, Aug. 1998.
[15] The Workshop on Management of Semistructured Data, 1997. / workshop-papers.html.

Index Terms:
Association rule, database, data mining, knowledge discovery, semistructured data, web mining.
Ke Wang, Huiqing Liu, "Discovering Structural Association of Semistructured Data," IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 353-371, May-June 2000, doi:10.1109/69.846290