Finding Patterns in Three-Dimensional Graphs: Algorithms and Applications to Scientific Data Mining
July/August 2002 (vol. 14 no. 4)
pp. 731-749

This paper presents a method for finding patterns in 3D graphs. Each node in a graph is an undecomposable, or atomic, unit and has a label. Edges are links between the atomic units. Patterns are rigid substructures that may occur in a graph after allowing for an arbitrary number of whole-structure rotations and translations, as well as a small, user-specified number of edit operations in the patterns or in the graph. (When a pattern appears in a graph only after the graph has been modified, we call that appearance an “approximate occurrence.”) The edit operations are relabeling a node, deleting a node, and inserting a node. The proposed method is based on the geometric hashing technique: it hashes node-triplets of the graphs into a 3D table and compresses the label-triplets in the table. To demonstrate the utility of our algorithms, we discuss two applications in scientific data mining. First, we apply the method to locate frequently occurring motifs in two families of proteins, pertaining to RNA-directed DNA Polymerase and Thymidylate Synthase, and use the motifs to classify the proteins. Then, we apply the method to cluster chemical compounds pertaining to aromatics, bicyclic alkanes, and photosynthesis. Experimental results indicate that the algorithms perform well and achieve high recall and precision rates for both classification and clustering.
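
The idea of hashing node-triplets into a 3D table can be illustrated with a minimal Python sketch. It is not the paper's algorithm. It assumes that each triplet is indexed by its three sorted pairwise distances, which do not change under rotation or translation, and that the distances are quantized with a hypothetical `bin_width` parameter. The function names `triplet_key`, `build_table`, and `count_votes` are invented for illustration. The sketch handles exact occurrences only, and it omits both the edit operations and the label-triplet compression described in the abstract.

```python
from collections import defaultdict
from itertools import combinations
import math


def triplet_key(p, q, r, bin_width=0.5):
    """Map three 3D points to a 3D table index.

    The index is built from the sorted pairwise distances, so it is
    unchanged by rotating or translating the whole structure.
    (Assumed scheme for illustration, not the paper's hash function.)
    """
    dists = sorted((math.dist(p, q), math.dist(p, r), math.dist(q, r)))
    return tuple(int(d / bin_width) for d in dists)


def build_table(nodes, bin_width=0.5):
    """Hash every node-triplet of a graph into the table.

    nodes: list of (label, (x, y, z)) pairs.
    Returns a dict mapping a 3D index to the label-triplets stored there.
    """
    table = defaultdict(list)
    for (la, pa), (lb, pb), (lc, pc) in combinations(nodes, 3):
        key = triplet_key(pa, pb, pc, bin_width)
        table[key].append(tuple(sorted((la, lb, lc))))
    return table


def count_votes(table, pattern_nodes, bin_width=0.5):
    """Count pattern triplets that hit a table entry with matching labels."""
    votes = 0
    for (la, pa), (lb, pb), (lc, pc) in combinations(pattern_nodes, 3):
        key = triplet_key(pa, pb, pc, bin_width)
        labels = tuple(sorted((la, lb, lc)))
        votes += table.get(key, []).count(labels)
    return votes


# A toy graph with four labeled nodes.
graph = [("C", (0, 0, 0)), ("N", (1, 0, 0)), ("O", (0, 2, 0)), ("C", (0, 0, 3))]

# The first three nodes, rotated 90 degrees about the z-axis and translated.
pattern = [("C", (10, -3, 2)), ("N", (10, -2, 2)), ("O", (8, -3, 2))]

# The same geometry with one node relabeled, so the labels no longer match.
relabeled = [("C", (10, -3, 2)), ("S", (10, -2, 2)), ("O", (8, -3, 2))]

table = build_table(graph)
votes_pattern = count_votes(table, pattern)
votes_relabeled = count_votes(table, relabeled)
```

In this sketch a pattern that differs from the graph by a single relabeled node receives no votes. Tolerating a bounded number of edit operations, as the paper's method does, would require a more permissive matching step than this exact lookup.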

Index Terms:
KDD, classification and clustering, data mining, geometric hashing, structural pattern discovery, biochemistry, medicine.
Citation:
Xiong Wang, Jason T.L. Wang, Dennis Shasha, Bruce A. Shapiro, Isidore Rigoutsos, Kaizhong Zhang, "Finding Patterns in Three-Dimensional Graphs: Algorithms and Applications to Scientific Data Mining," IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 4, pp. 731-749, July-Aug. 2002, doi:10.1109/TKDE.2002.1019211