Clustering 100,000 Protein Structure Decoys in Minutes
May-June 2012 (vol. 9 no. 3)
pp. 765-773
Dongbo Bu, Inst. of Comput. Technol., Beijing, China
Shuai Cheng Li, Dept. of Comput. Sci., City Univ. of Hong Kong, Kowloon, China
Ming Li, Sch. of Comput. Sci., Univ. of Waterloo, Waterloo, ON, Canada
Ab initio protein structure prediction methods first generate large sets of candidate structural conformations (called decoys) and then select the most representative decoys through clustering. Classical clustering methods are inefficient because they compute distances between all pairs of decoys, and thus become infeasible when the number of decoys is large. In addition, existing clustering approaches suffer from arbitrariness in choosing the distance threshold for proteins within a cluster: a small threshold leads to many small clusters, while a large threshold merges several independent clusters into one. In this paper, we propose an efficient clustering method based on fast estimation of cluster centroids and efficient pruning of the rotation space. The number of clusters is detected automatically using information distance criteria. The method is implemented in a freely downloadable package named ONION. Experimental results on benchmark data sets suggest that ONION is 14 times faster than existing tools, and that, compared to SPICKER's selections, ONION obtains better selections for 31 targets and worse selections for 19 targets. On an average PC, ONION can cluster 100,000 decoys in around 12 minutes.
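
The two problems with classical clustering named in the abstract, the pairwise distance cost and the sensitivity to the distance threshold, can be illustrated with a minimal sketch. This is not ONION's algorithm. The functions `pairwise_count` and `threshold_clusters` and the toy 1-D points are hypothetical names and data introduced only for illustration; real decoys would be compared with a structural distance such as RMSD after superposition.

```python
from itertools import combinations


def pairwise_count(n):
    """Number of unordered pairs among n decoys: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def threshold_clusters(points, threshold):
    """Single-linkage clustering: link any two points within `threshold`.

    `points` are toy 1-D values standing in for decoys (an assumption made
    for illustration); every pair is compared, as in classical methods.
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the cluster representative,
        # shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if abs(points[i] - points[j]) <= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(groups.values())


toy_decoys = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
cluster_counts = {
    t: len(threshold_clusters(toy_decoys, t)) for t in (0.05, 0.5, 5.0)
}
```

In this toy setting, a threshold of 0.05 leaves every point in its own cluster, 0.5 yields three clusters, and 5.0 merges everything into one, while `pairwise_count(100_000)` shows that an all-pairs approach would need roughly five billion distance evaluations for 100,000 decoys.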

Index Terms:
software packages, ab initio calculations, biology computing, molecular biophysics, molecular configurations, proteins, benchmark data sets, protein structure decoy clustering, ab initio protein structure prediction method, structural conformations, classical clustering method, cluster centroids, pruning rotation space, information distance criteria, ONION package, Proteins, Clustering algorithms, Approximation methods, Bioinformatics, Computational biology, Density functional theory, Three dimensional displays, clustering, Protein structure, decoy selection
Dongbo Bu, Shuai Cheng Li, Ming Li, "Clustering 100,000 Protein Structure Decoys in Minutes," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 3, pp. 765-773, May-June 2012, doi:10.1109/TCBB.2011.142