From: Becker O.M., MacKerell A.D., Roux B., Watanabe M. (eds.), Computational Biochemistry and Biophysics.

15

Bayesian Statistics in Molecular and Structural Biology

Roland L. Dunbrack, Jr.

Institute for Cancer Research, Fox Chase Cancer Center, Philadelphia, Pennsylvania

I. INTRODUCTION

Much of computational biophysics and biochemistry is aimed at making predictions of protein structure, dynamics, and function. Most prediction methods are at least in part knowledge-based rather than being derived entirely from the principles of physics. For instance, in comparative modeling of protein structure, each step in the process—from homolog identification and sequence–structure alignment to loop and side-chain modeling—is dominated by information derived from the protein sequence and structure databases (see Chapter 14). In molecular dynamics simulations, the potential energy function is based partly on conformational analysis of known peptide and protein structures and thermodynamic data (see Chapter 2).

The biophysical and biochemical data we have available are complex and of variable quality and density. We have sequences from many different kinds of organisms and sequences for proteins that are expressed in very different environments in a single organism or even a single cell. Some sequence families are very large, and some have only one known member. We have structures from many protein families, from NMR spectroscopy and from X-ray crystallography, some of high resolution and some not. These structures can be analyzed on the level of bond lengths, bond angles, dihedral angles, and interatomic distances, or in terms of secondary, tertiary, and quaternary structure. Some structural features are very common, such as α-helices, and some are relatively rare, such as valine residues with backbone dihedral φ > 0°.

The amount of data is also increasing. The nonredundant protein sequence database available from GenBank now contains over 500,000 amino acid sequences, and there are at least 30 completed genomes from all three kingdoms of life. The number of unique sequences in the Protein Data Bank of experimentally determined structures is now over 3000 [1]. The number of known protein folds is at least 400 [2–4]. In the next few years, the databanks will continue to grow exponentially as the Drosophila, Arabidopsis, corn, mouse, and human genomes are completed. Several institutions are planning projects to determine as many protein structures as possible in target genomes, such as yeast, Mycoplasma genitalium, and E. coli.


To gain the most predictive utility as well as conceptual understanding from the sequence and structure data available, careful statistical analysis will be required. The statistical methods needed must be robust to the variation in amounts and quality of data in different protein families and for different structural features. They must be updatable as new data become available. And they should help us generate as much understanding of the determinants of protein sequence, structure, dynamics, and functional relationships as possible.

In recent years, Bayesian statistics has come to the forefront of research among professional statisticians because of its analytical power for complex models and its conceptual simplicity. In the natural and social sciences, Bayesian methods have also attracted significant attention in fields including genetics [5], epidemiology [6,7], medicine [8], high energy physics [9], astrophysics [10,11], hydrology [12], archaeology [13], and economics [14]. Bayesian statistics has been used in molecular and structural biology in sequence alignment [15], remote homolog detection [16,17], threading [18,19], NMR spectroscopy [20–24], X-ray structure determination [25–27], and side-chain conformational analysis [28]. Its counterpart, frequentist statistics, has in turn lost ground. To see why, we need to examine their basic conceptual frameworks. In the next section, I compare the Bayesian and frequentist viewpoints and discuss the reasons Bayesian methods are superior in both their conceptual components and their practical aspects. After that, I describe some important aspects of Bayesian statistics required for its application to protein sequence and structural data analysis. In the last section, I review several applications of Bayesian inference in molecular and structural biology to demonstrate its utility and conceptual simplicity. A useful introduction to Bayesian methods and their applications in machine learning and molecular biology can be found in the book by Baldi and Brunak [29].

II. BAYESIAN STATISTICS

A. Bayesian Probability Theory

The goal of any statistical analysis is inference concerning whether, on the basis of available data, some hypothesis about the natural world is true. The hypothesis may consist of the value of some parameter or parameters, such as a physical constant or the exact proportion of an allelic variant in a human population, or the hypothesis may be a qualitative statement, such as ''This protein adopts an α/β barrel fold'' or ''I am currently in Philadelphia.'' The parameters or hypothesis can be unobservable or as yet unobserved. How the data arise from the parameters is called the model for the system under study and may include estimates of experimental error as well as our best understanding of the physical process of the system.

Probability in Bayesian inference is interpreted as the degree of belief in the truth of a statement. The belief must be predicated on whatever knowledge of the system we possess. That is, probability is always conditional, p(X|I), where X is a hypothesis, a statement, the result of an experiment, etc., and I is any information we have on the system. Bayesian probability statements are constructed to be consistent with common sense. This can often be expressed in terms of a fair bet. As an example, I might say that ‘‘the probability that it will rain tomorrow is 75%.’’ This can be expressed as a bet: ‘‘I will bet $3 that it will rain tomorrow, if you give me $4 if it does and nothing if it does not.’’ (If I bet $3 on 4 such days, I have spent $12; I expect to win back $4 on 3 of those days, or $12).


At the same time, I would not bet $3 on no rain in return for $4 if it does not rain. This behavior would be inconsistent, since if I did both simultaneously I would bet $6 for a certain return of only $4. Consistent betting would lead me to bet $1 on no rain in return for $4. It can be shown that for consistent betting behavior, only certain rules of probability are allowed, as follows.
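The fair-bet reading of probability above can be checked with a few lines of arithmetic. The following sketch uses the stakes and payouts from the rain example; the helper function name is hypothetical:

```python
# Fair-bet interpretation of probability: a stake s on an event with total
# payout w is fair when s = p * w, i.e., the expected gain is zero.
def implied_probability(stake, payout):
    """Probability implied by a fair bet: stake / payout."""
    return stake / payout

p_rain = implied_probability(3.0, 4.0)     # bet $3 to receive $4 if it rains
p_no_rain = implied_probability(1.0, 4.0)  # bet $1 to receive $4 if it does not

# Consistency (no sure loss from betting both sides): the two implied
# probabilities sum to 1, as the sum rule requires.
assert abs(p_rain + p_no_rain - 1.0) < 1e-12
```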

There are two central rules of probability theory on which Bayesian inference is based [30]:

1. The sum rule: p(A|I) + p(Ā|I) = 1

2. The product rule: p(A, B|I) = p(A|B, I)p(B|I) = p(B|A, I)p(A|I)

The first rule states that the probability of A plus the probability of not-A (denoted Ā) is equal to 1. The second rule states that the probability for the occurrence of two events is related to the probability of one of the events occurring multiplied by the conditional probability of the other event given the occurrence of the first event. We can drop the notation of conditioning on I as long as it is understood implicitly that all probabilities are conditional on the information we possess about the system. Dropping the I, we have the usual expression of Bayes' rule,

p(A, B) = p(A|B)p(B) = p(B|A)p(A)

(1)
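The sum and product rules can be verified numerically on any joint distribution over two events. A minimal sketch (the joint probabilities below are arbitrary illustrative numbers, not from the text):

```python
# A toy joint distribution p(A, B) over two binary events.
joint = {
    (True, True): 0.30, (True, False): 0.20,
    (False, True): 0.10, (False, False): 0.40,
}

def marginal_A(a):
    """p(A = a), summing the joint over B."""
    return sum(p for (x, y), p in joint.items() if x == a)

def marginal_B(b):
    """p(B = b), summing the joint over A."""
    return sum(p for (x, y), p in joint.items() if y == b)

def conditional_A_given_B(a, b):
    """p(A = a | B = b) = p(A = a, B = b) / p(B = b)."""
    return joint[(a, b)] / marginal_B(b)

# Sum rule: p(A) + p(not-A) = 1
assert abs(marginal_A(True) + marginal_A(False) - 1.0) < 1e-12

# Product rule: p(A, B) = p(A|B) p(B)
a, b = True, True
assert abs(joint[(a, b)] - conditional_A_given_B(a, b) * marginal_B(b)) < 1e-12
```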

For Bayesian inference, we are seeking the probability of a hypothesis H given the data D. This probability is denoted p(H|D). It is very likely that we will want to compare different hypotheses, so we may want to compare p(H1|D) with p(H2|D). Because it is difficult to write down an expression for p(H|D) directly, we use Bayes' rule to invert p(D|H) and obtain an expression for p(H|D):

p(H|D) = p(D|H)p(H)/p(D)

(2)

In this expression, p(H) is referred to as the prior probability of the hypothesis H. It is used to express any information we may have about the probability that the hypothesis H is true before we consider the new data D. p(D|H) is the likelihood of the data given that the hypothesis H is true. It describes our view of how the data arise from whatever H says about the state of nature, including uncertainties in measurement and any physical theory we might have that relates the data to the hypothesis. p(D) is the marginal distribution of the data D, and because it is a constant with respect to the parameters it is frequently considered only as a normalization factor in Eq. (2), so that p(H|D) ∝ p(D|H)p(H). If we have a set of hypotheses that are exclusive and exhaustive, i.e., one and only one must be true, then

p(D) = ∑i p(D|Hi)p(Hi)

(2a)

p(H|D) is the posterior distribution, which is, after all, what we are after. It gives the probability of the hypothesis after we consider the available data and our prior knowledge. With the normalization provided by the expression for p(D), for an exhaustive set of hypotheses we have ∑i p(Hi|D) = 1, which is what we would expect from the sum rule axiom described above.
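Equations (2) and (2a) translate directly into code for a discrete, exhaustive set of hypotheses. A minimal sketch (the priors and likelihoods are illustrative numbers only, not from the text):

```python
# Posterior over an exclusive, exhaustive set of hypotheses, Eq. (2),
# with p(D) computed by marginalization as in Eq. (2a).
def posterior(priors, likelihoods):
    """priors[i] = p(H_i); likelihoods[i] = p(D|H_i). Returns p(H_i|D)."""
    evidence = sum(l * p for l, p in zip(likelihoods, priors))  # p(D)
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

post = posterior(priors=[0.5, 0.3, 0.2], likelihoods=[0.1, 0.4, 0.8])

# The posteriors sum to 1, as the sum rule requires.
assert abs(sum(post) - 1.0) < 1e-12
```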


As an example of likelihoods and prior and posterior probabilities, we give the following example borrowed from Gardner [31].* The chairman of a statistics department has decided to grant tenure to one of three junior faculty members, Dr. A, Dr. B, or Dr. C. Assistant Professor A decides to ask the department's administrative assistant, Mr. Smith, if he knows who is being given tenure. Mr. Smith decides to have fun with Dr. A and says that he won't tell her who is being given tenure. Instead, he will tell her which of Dr. B and Dr. C is going to be denied tenure. Mr. Smith does not yet know who is and who is not getting tenure and tells Dr. A to come back the next day. In the meantime, he decides that if A is getting tenure he will flip a coin and will tell A that B is not getting tenure if the coin shows heads, and that C is not getting tenure if it shows tails. If B or C is getting tenure, he will tell A that C or B, respectively, is not getting tenure.

Dr. A comes back the next day, and Mr. Smith tells A that C is not getting tenure. A then figures that her chances of tenure have now risen to 50%. Mr. Smith believes he has not in fact changed A’s knowledge concerning her tenure prospects. Who is correct?

For prior probabilities, if HA is the statement ''A gets tenure'' and likewise for HB and HC, we have prior probabilities p(HA) = p(HB) = p(HC) = 1/3. We can evaluate the likelihood of S, that Mr. Smith will say ''C is not getting tenure,'' if HA, HB, or HC is true:

p(S|HA) = 1/2; p(S|HB) = 1; p(S|HC) = 0

So the posterior probability that A will get tenure based on Mr. Smith’s statement is

p(HA|S) = p(S|HA)p(HA) / ∑r=A,B,C p(S|Hr)p(Hr)

(3)

= [(1/2)(1/3)] / [(1/2)(1/3) + (1)(1/3) + (0)(1/3)] = 1/3

Mr. Smith has not in fact changed A’s knowledge, because her prior and posterior probabilities of getting tenure are both 1/3. Mr. Smith has, however, changed A’s knowledge of B’s prospects of tenure, which are now 2/3. Another way to think about this problem is that before Mr. Smith has told A anything, the probability of B or C getting tenure was 2/3. After his statement, the same 2/3 total probability applies to B and C, but now C’s probability of tenure is 0 and B’s has therefore risen to 2/3. A’s posterior probability is unchanged.
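The posterior in Eq. (3) can also be checked by Monte Carlo simulation of Mr. Smith's protocol. A sketch (the function name, seed, and trial count are arbitrary):

```python
import random

# Simulate the tenure example: the chairman's choice is uniform over
# A, B, C; Mr. Smith follows his coin-flip protocol. We estimate
# p(A gets tenure | Smith says "C is not getting tenure").
def simulate(n_trials, seed=0):
    rng = random.Random(seed)
    says_not_c = 0
    a_tenured_given_not_c = 0
    for _ in range(n_trials):
        winner = rng.choice("ABC")
        if winner == "A":
            said = rng.choice("BC")  # coin flip between naming B and C
        elif winner == "B":
            said = "C"               # must name C
        else:
            said = "B"               # must name B
        if said == "C":
            says_not_c += 1
            if winner == "A":
                a_tenured_given_not_c += 1
    return a_tenured_given_not_c / says_not_c

estimate = simulate(100_000)  # close to 1/3, matching Eq. (3)
```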

B. Bayesian Parameter Estimation

Most often the hypothesis H concerns the value of a continuous parameter, which is denoted θ. The data D are also usually observed values of some physical quantity (temperature, mass, dihedral angle, etc.) denoted y, usually a vector. y may be a continuous variable, but quite often it may be a discrete integer variable representing the counts of some event occurring, such as the number of heads in a sequence of coin flips. The expression for the posterior distribution for the parameter θ given the data y is now given as

* The original story concerned three prisoners to be executed, one of whom is pardoned.