
15
Bayesian Statistics in Molecular and
Structural Biology
Roland L. Dunbrack, Jr.
Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania
I. INTRODUCTION
Much of computational biophysics and biochemistry is aimed at making predictions of protein structure, dynamics, and function. Most prediction methods are at least in part knowledge-based rather than being derived entirely from the principles of physics. For instance, in comparative modeling of protein structure, each step in the process—from homolog identification and sequence–structure alignment to loop and side-chain modeling—is dominated by information derived from the protein sequence and structure databases (see Chapter 14). In molecular dynamics simulations, the potential energy function is based partly on conformational analysis of known peptide and protein structures and thermodynamic data (see Chapter 2).
The biophysical and biochemical data we have available are complex and of variable quality and density. We have sequences from many different kinds of organisms and sequences for proteins that are expressed in very different environments in a single organism or even a single cell. Some sequence families are very large, and some have only one known member. We have structures from many protein families, from NMR spectroscopy and from X-ray crystallography, some of high resolution and some not. These structures can be analyzed on the level of bond lengths and angles, or dihedral angles, and interatomic distances, or in terms of secondary, tertiary, and quaternary structure. Some structural features are very common, such as α-helices, and some are relatively rare, such as valine residues with backbone dihedral φ > 0°.
The amount of data is also increasing. The nonredundant protein sequence database available from GenBank now contains over 500,000 amino acid sequences, and there are at least 30 completed genomes from all three kingdoms of life. The number of unique sequences in the Protein Data Bank of experimentally determined structures is now over 3000 [1]. The number of known protein folds is at least 400 [2–4]. In the next few years, the databanks will continue to grow exponentially as the Drosophila, Arabidopsis, corn, mouse, and human genomes are completed. Several institutions are planning projects to determine as many protein structures as possible in target genomes, such as yeast, Mycoplasma genitalium, and E. coli.
To gain the most predictive utility as well as conceptual understanding from the sequence and structure data available, careful statistical analysis will be required. The statistical methods needed must be robust to the variation in amounts and quality of data in different protein families and for structural features. They must be updatable as new data become available. And they should help us generate as much understanding of the determinants of protein sequence, structure, dynamics, and functional relationships as possible.
In recent years, Bayesian statistics has come to the forefront of research among professional statisticians because of its analytical power for complex models and its conceptual simplicity. In the natural and social sciences, Bayesian methods have also attracted significant attention, in fields including genetics [5], epidemiology [6,7], medicine [8], high energy physics [9], astrophysics [10,11], hydrology [12], archaeology [13], and economics [14]. In molecular and structural biology, Bayesian statistics has been used in sequence alignment [15], remote homolog detection [16,17], threading [18,19], NMR spectroscopy [20–24], X-ray structure determination [25–27], and side-chain conformational analysis [28]. Its counterpart, frequentist statistics, has in turn lost ground. To see why, we need to examine their basic conceptual frameworks. In the next section, I compare the Bayesian and frequentist viewpoints and discuss the reasons Bayesian methods are superior in both their conceptual components and their practical aspects. After that, I describe some important aspects of Bayesian statistics required for its application to protein sequence and structural data analysis. In the last section, I review several applications of Bayesian inference in molecular and structural biology to demonstrate its utility and conceptual simplicity. A useful introduction to Bayesian methods and their applications in machine learning and molecular biology can be found in the book by Baldi and Brunak [29].
II. BAYESIAN STATISTICS
A. Bayesian Probability Theory
The goal of any statistical analysis is inference concerning whether on the basis of available data, some hypothesis about the natural world is true. The hypothesis may consist of the value of some parameter or parameters, such as a physical constant or the exact proportion of an allelic variant in a human population, or the hypothesis may be a qualitative statement, such as ‘‘This protein adopts an α/β barrel fold’’ or ‘‘I am currently in Philadelphia.’’ The parameters or hypothesis can be unobservable or as yet unobserved. How the data arise from the parameters is called the model for the system under study and may include estimates of experimental error as well as our best understanding of the physical process of the system.
Probability in Bayesian inference is interpreted as the degree of belief in the truth of a statement. The belief must be predicated on whatever knowledge of the system we possess. That is, probability is always conditional, p(X|I), where X is a hypothesis, a statement, the result of an experiment, etc., and I is any information we have on the system. Bayesian probability statements are constructed to be consistent with common sense. This can often be expressed in terms of a fair bet. As an example, I might say that ‘‘the probability that it will rain tomorrow is 75%.’’ This can be expressed as a bet: ‘‘I will bet $3 that it will rain tomorrow, if you give me $4 if it does and nothing if it does not.’’ (If I bet $3 on 4 such days, I have spent $12; I expect to win back $4 on 3 of those days, or $12).

At the same time, I would not bet $3 on no rain in return for $4 if it does not rain. This behavior would be inconsistent, since if I did both simultaneously I would bet $6 for a certain return of only $4. Consistent betting would lead me to bet $1 on no rain in return for $4. It can be shown that for consistent betting behavior, only certain rules of probability are allowed, as follows.
There are two central rules of probability theory on which Bayesian inference is based [30]:
1. The sum rule: p(A|I) + p(Ā|I) = 1
2. The product rule: p(A, B|I) = p(A|B, I)p(B|I) = p(B|A, I)p(A|I)
The first rule states that the probability of A plus the probability of not-A (Ā) is equal to 1. The second rule states that the probability for the occurrence of two events is related to the probability of one of the events occurring multiplied by the conditional probability of the other event given the occurrence of the first event. We can drop the notation of conditioning on I as long as it is understood implicitly that all probabilities are conditional on the information we possess about the system. Dropping the I, we have the usual expression of Bayes’ rule,
p(A, B) = p(A|B)p(B) = p(B|A)p(A)    (1)
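The product rule can be checked numerically on any joint distribution. The short Python sketch below uses exact rational arithmetic and a small joint distribution invented for this illustration (it is not from the text) to verify both factorizations in Eq. (1).

```python
from fractions import Fraction

# Hypothetical joint distribution over two binary events A and B,
# chosen only for illustration; the four probabilities sum to 1.
p_joint = {
    (True, True):  Fraction(1, 8),
    (True, False): Fraction(3, 8),
    (False, True): Fraction(1, 4),
    (False, False): Fraction(1, 4),
}

def marginal(index, value):
    """Marginal probability, e.g. p(A) = sum over B of p(A, B)."""
    return sum(p for outcome, p in p_joint.items() if outcome[index] == value)

p_a = marginal(0, True)                     # p(A)
p_b = marginal(1, True)                     # p(B)
p_a_given_b = p_joint[(True, True)] / p_b   # p(A|B) = p(A, B)/p(B)
p_b_given_a = p_joint[(True, True)] / p_a   # p(B|A) = p(A, B)/p(A)

# Eq. (1): p(A, B) = p(A|B)p(B) = p(B|A)p(A)
assert p_a_given_b * p_b == p_joint[(True, True)]
assert p_b_given_a * p_a == p_joint[(True, True)]
```

Using `Fraction` keeps the check exact, so the two factorizations agree identically rather than merely to floating-point tolerance.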
For Bayesian inference, we are seeking the probability of a hypothesis H given the data D. This probability is denoted p(H|D). It is very likely that we will want to compare different hypotheses, so we may want to compare p(H1|D) with p(H2|D). Because it is difficult to write down an expression for p(H|D) directly, we use Bayes’ rule to invert p(D|H) and obtain an expression for p(H|D):
p(H|D) = p(D|H)p(H) / p(D)    (2)
In this expression, p(H) is referred to as the prior probability of the hypothesis H. It is used to express any information we may have about the probability that the hypothesis H is true before we consider the new data D. p(D|H) is the likelihood of the data given that the hypothesis H is true. It describes our view of how the data arise from whatever H says about the state of nature, including uncertainties in measurement and any physical theory we might have that relates the data to the hypothesis. p(D) is the marginal distribution of the data D, and because it is a constant with respect to the parameters it is frequently considered only as a normalization factor in Eq. (2), so that p(H|D) ∝ p(D|H)p(H). If we have a set of hypotheses that are exclusive and exhaustive, i.e., one and only one must be true, then
p(D) = ∑i p(D|Hi)p(Hi)    (2a)
p(H|D) is the posterior distribution, which is, after all, what we are after. It gives the probability of the hypothesis after we consider the available data and our prior knowledge. With the normalization provided by the expression for p(D), for an exhaustive set of hypotheses we have ∑i p(Hi|D) = 1, which is what we would expect from the sum rule axiom described above.
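Equations (2) and (2a) translate directly into a short computation: multiply each prior by its likelihood and divide by p(D). The Python sketch below is illustrative only; the priors and likelihoods are hypothetical numbers chosen for this example, not values from the text.

```python
def posterior(priors, likelihoods):
    """Posterior p(Hi|D) over an exclusive, exhaustive set of hypotheses,
    via Eq. (2), with p(D) computed by the sum in Eq. (2a)."""
    p_data = sum(l * p for l, p in zip(likelihoods, priors))  # p(D) = sum_i p(D|Hi)p(Hi)
    return [l * p / p_data for l, p in zip(likelihoods, priors)]

# Hypothetical three-hypothesis example (numbers are for illustration only)
priors = [0.5, 0.3, 0.2]        # p(Hi): beliefs before seeing the data
likelihoods = [0.1, 0.4, 0.8]   # p(D|Hi): how well each hypothesis explains the data
post = posterior(priors, likelihoods)
```

Note how H3, despite the smallest prior, ends up most probable because it explains the data best; the posteriors sum to 1 by construction, as the sum rule requires.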

As an example of likelihoods and prior and posterior probabilities, we give the following example borrowed from Gardner [31].* The chairman of a statistics department has decided to grant tenure to one of three junior faculty members, Dr. A, Dr. B, or Dr. C. Assistant professor A decides to ask the department’s administrative assistant, Mr. Smith, if he knows who is being given tenure. Mr. Smith decides to have fun with Dr. A and says that he won’t tell her who is being given tenure. Instead, he will tell her which of Dr. B and Dr. C is going to be denied tenure. Mr. Smith does not yet know who is and who is not getting tenure and tells Dr. A to come back the next day. In the meantime, he decides that if A is getting tenure he will flip a coin and will tell A that B is not getting tenure if the coin shows heads, and that C is not getting tenure if it shows tails. If B or C is getting tenure, he will tell A that either C or B, respectively, is not getting tenure.
Dr. A comes back the next day, and Mr. Smith tells A that C is not getting tenure. A then figures that her chances of tenure have now risen to 50%. Mr. Smith believes he has not in fact changed A’s knowledge concerning her tenure prospects. Who is correct?
For prior probabilities, if HA is the statement ‘‘A gets tenure’’ and likewise for HB and HC, we have prior probabilities p(HA) = p(HB) = p(HC) = 1/3. We can evaluate the likelihood of S, that Mr. Smith will say ‘‘C is not getting tenure,’’ if HA, HB, or HC is true:
p(S|HA) = 0.5;  p(S|HB) = 1;  p(S|HC) = 0
So the posterior probability that A will get tenure based on Mr. Smith’s statement is
p(HA|S) = p(S|HA)p(HA) / ∑r=A,B,C p(S|Hr)p(Hr)    (3)

= [(1/2)(1/3)] / [(1/2)(1/3) + (1)(1/3) + (0)(1/3)] = 1/3
Mr. Smith has not in fact changed A’s knowledge, because her prior and posterior probabilities of getting tenure are both 1/3. Mr. Smith has, however, changed A’s knowledge of B’s prospects of tenure, which are now 2/3. Another way to think about this problem is that before Mr. Smith has told A anything, the probability of B or C getting tenure was 2/3. After his statement, the same 2/3 total probability applies to B and C, but now C’s probability of tenure is 0 and B’s has therefore risen to 2/3. A’s posterior probability is unchanged.
B. Bayesian Parameter Estimation
Most often the hypothesis H concerns the value of a continuous parameter, which is denoted θ. The data D are also usually observed values of some physical quantity (temperature, mass, dihedral angle, etc.) denoted y, usually a vector. y may be a continuous variable, but quite often it may be a discrete integer variable representing the counts of some event occurring, such as the number of heads in a sequence of coin flips. The expression for the posterior distribution for the parameter θ given the data y is now given as
* The original story concerned three prisoners to be executed, one of whom is pardoned.