Finding Patterns on Protein Surfaces: Algorithms and Applications to Protein Classification
August 2005 (vol. 17 no. 8)
pp. 1065-1078
Xiong Wang, IEEE
A successful application of data mining to bioinformatics is protein classification. A number of techniques have been developed to classify proteins according to important features in their sequences, secondary structures, or three-dimensional structures. In this paper, we introduce a novel approach to protein classification based on significant patterns discovered on the surface of a protein. We define a notion called the $\alpha$-surface. We discuss the geometric properties of the $\alpha$-surface and present an algorithm that calculates the $\alpha$-surface from a finite set of points in $\mathbb{R}^{3}$. We apply the algorithm to extract the $\alpha$-surface of a protein and use a pattern discovery algorithm to find frequently occurring patterns on the surfaces. The pattern discovery algorithm uses a new index structure called the $\Delta\mathrm{B}^{+}$ tree. We use these patterns to classify the proteins. While most existing techniques focus on the binary classification problem, we apply our approach to classifying three families of proteins. Experimental results show that the proposed approach performs well.

Index Terms: KDD, classification, data mining, structural pattern discovery, biochemistry, medicine.
Xiong Wang, "Finding Patterns on Protein Surfaces: Algorithms and Applications to Protein Classification," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 8, pp. 1065-1078, Aug. 2005, doi:10.1109/TKDE.2005.126