Statistical Characterization of Protein Ensembles
January-March 2008 (vol. 5 no. 1)
pp. 42-55
When accounting for structural fluctuations or measurement errors, a single rigid structure may not be sufficient to represent a protein. One approach to this problem is to represent the possible conformations as a discrete set of observed conformations, called an ensemble. In this work, we follow a different, richer approach: we introduce a framework for estimating probability density functions in very high dimensions and apply it to represent ensembles of folded proteins. The proposed approach combines techniques such as kernel density estimation, maximum likelihood, cross-validation, and bootstrapping. We present the underlying theoretical and computational framework and apply it to artificial data and to protein ensembles obtained from molecular dynamics simulations. We compare the results with those obtained experimentally, illustrating the potential and advantages of this representation.
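
As an illustration of how the techniques named above can fit together, the following is a minimal sketch and not the method of the paper: it fits an isotropic Gaussian kernel density estimate, selects the bandwidth by maximizing the leave-one-out cross-validated log-likelihood, and uses the bootstrap to attach a confidence interval to the mean held-out log-likelihood. The Euclidean Gaussian kernel, the synthetic data, and all function names (`log_density`, `loo_log_likelihood`, `select_bandwidth`, `bootstrap_mean_ci`) are assumptions made for this example.

```python
import numpy as np


def _log_kernel_matrix(query, train, h):
    """Log of the isotropic Gaussian kernel between every query/train pair."""
    d = train.shape[1]
    sq = np.sum((query[:, None, :] - train[None, :, :]) ** 2, axis=-1)
    return -0.5 * sq / h**2 - 0.5 * d * np.log(2 * np.pi * h**2)


def _log_mean_exp(log_k, count):
    """Row-wise log(sum(exp(log_k)) / count), computed stably."""
    m = log_k.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_k - m).sum(axis=1)) - np.log(count)


def log_density(train, query, h):
    """Log of the KDE built on `train`, evaluated at each row of `query`."""
    return _log_mean_exp(_log_kernel_matrix(query, train, h), train.shape[0])


def loo_log_likelihood(x, h):
    """Mean leave-one-out log-likelihood of the KDE with bandwidth h."""
    log_k = _log_kernel_matrix(x, x, h)
    np.fill_diagonal(log_k, -np.inf)  # a point never contributes to its own estimate
    return _log_mean_exp(log_k, x.shape[0] - 1).mean()


def select_bandwidth(x, candidates):
    """Return the candidate bandwidth with the highest leave-one-out score."""
    scores = np.array([loo_log_likelihood(x, h) for h in candidates])
    return float(candidates[int(np.argmax(scores))]), scores


def bootstrap_mean_ci(values, n_boot, rng, level=0.95):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    n = len(values)
    idx = rng.integers(0, n, size=(n_boot, n))  # resample with replacement
    means = values[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [alpha, 1.0 - alpha])
    return float(lo), float(hi)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for conformational data: 3-D standard normal samples.
    train = rng.standard_normal((200, 3))
    test = rng.standard_normal((100, 3))
    h, _ = select_bandwidth(train, np.geomspace(0.05, 5.0, 20))
    held_out = log_density(train, test, h)
    print("bandwidth:", h)
    print("mean held-out log-likelihood:", held_out.mean())
    print("95% bootstrap CI:", bootstrap_mean_ci(held_out, 500, rng))
```

The bootstrap is applied to the held-out log-densities rather than to the training set because resampling the training points with replacement creates exact duplicates, which would bias a leave-one-out score toward arbitrarily small bandwidths.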

Index Terms:
protein ensembles, density estimation, Bayesian networks, graphical models, maximum likelihood, cross-validation, bootstrapping
Diego Rother, Guillermo Sapiro, Vijay Pande, "Statistical Characterization of Protein Ensembles," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 5, no. 1, pp. 42-55, Jan.-March 2008, doi:10.1109/TCBB.2007.1061