
and data sample means. The weight of the prior variance is represented by the degrees of freedom, ν_0, while the weight of the data variance is n − 1.

F. Simulation via Markov Chain Monte Carlo Methods

In practice, it may not be possible to use conjugate prior and likelihood functions that result in analytical posterior distributions, or the distributions may be so complicated that the posterior cannot be calculated as a function of the entire parameter space. In either case, statistical inference can proceed only if random values of the parameters can be drawn from the full posterior distribution:

p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int_\Theta p(y \mid \theta)\, p(\theta)\, d\theta}    (18)

We can also calculate expected values for any function of the parameters:

E[f(\theta) \mid y] = \frac{\int_\Theta f(\theta)\, p(y \mid \theta)\, p(\theta)\, d\theta}{\int_\Theta p(y \mid \theta)\, p(\theta)\, d\theta}    (19)

If we could draw directly from the posterior distribution, then we could plot p(θ|y) from a histogram of the draws on θ. Similarly, we could calculate the expectation value of any function of the parameters by making random draws of θ from the posterior distribution and calculating

E[f(\theta)] \approx \frac{1}{n} \sum_{t=1}^{n} f(\theta_t)    (20)
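As a minimal illustration of Eq. (20) (not from the text), the snippet below, assuming NumPy and a stand-in standard-normal "posterior," estimates E[f(θ)] by averaging f over random draws; in practice the draws would come from the MCMC schemes described next.

```python
# Minimal sketch of Eq. (20): estimate E[f(theta)] from posterior draws.
# The standard-normal "posterior" is a placeholder so the example runs;
# real draws would come from an MCMC sampler.
import numpy as np

rng = np.random.default_rng(0)
draws = rng.normal(loc=0.0, scale=1.0, size=100_000)  # stand-in draws of theta

f = lambda theta: theta**2       # any function of the parameters
estimate = f(draws).mean()       # (1/n) * sum_t f(theta_t), Eq. (20)
print(estimate)                  # close to E[theta^2] = 1 for this toy posterior
```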

In some cases, we may not be able to draw directly from the posterior distribution. The difficulty lies in calculating the denominator of Eq. (18), the marginal data distribution p(y). But usually we can evaluate the ratio of the probabilities of two values for the parameters, p(θ_t | y)/p(θ_u | y), because the denominator in Eq. (18) cancels out in the ratio. The Markov chain Monte Carlo method [40] proceeds by generating draws from some distribution of the parameters, referred to as the proposal distribution, such that the new draw depends only on the value of the old draw, i.e., some function q(θ_t | θ_{t−1}). We accept the new draw with probability

\pi(\theta_t \mid \theta_{t-1}) = \min\!\left(1, \frac{p(\theta_t \mid y)\, q(\theta_{t-1} \mid \theta_t)}{p(\theta_{t-1} \mid y)\, q(\theta_t \mid \theta_{t-1})}\right)    (21)

and otherwise we set θ_t = θ_{t−1}. This is the Metropolis–Hastings method, first proposed by Metropolis and Ulam [45] in the context of equation of state calculations [46] and further developed by Hastings [47]. This scheme can be shown to result in a stationary distribution that asymptotically approaches the posterior distribution.
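To make the acceptance rule concrete, here is a hedged sketch of a random-walk Metropolis–Hastings sampler in Python/NumPy. The toy log-posterior, step size, and chain length are illustrative assumptions; note that only the unnormalized posterior is needed, since p(y) cancels in the ratio. The Gaussian random-walk proposal is symmetric, so the q ratio in Eq. (21) equals 1 (the Metropolis special case discussed next).

```python
# Illustrative random-walk Metropolis-Hastings sampler (a sketch, not the
# chapter's code). The target is a toy standard normal, specified only up to
# its normalization constant.
import numpy as np

rng = np.random.default_rng(1)

def log_unnorm_posterior(theta):
    """log p(theta | y) up to an additive constant (toy standard normal)."""
    return -0.5 * theta**2

def metropolis_hastings(n_draws, step=1.0, theta0=0.0):
    theta = theta0
    chain = np.empty(n_draws)
    for t in range(n_draws):
        proposal = theta + step * rng.normal()          # symmetric proposal q
        log_ratio = log_unnorm_posterior(proposal) - log_unnorm_posterior(theta)
        if np.log(rng.uniform()) < log_ratio:           # accept with prob min(1, ratio)
            theta = proposal
        chain[t] = theta                                # else keep theta_{t-1}, Eq. (21)
    return chain

chain = metropolis_hastings(50_000)
print(chain.mean(), chain.var())   # roughly 0 and 1 for the toy target
```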

Several variations of this method go under different names. The Metropolis algorithm uses only symmetrical proposal distributions, such that q(θ_t | θ_{t−1}) = q(θ_{t−1} | θ_t). The expression for π(θ_t | θ_{t−1}) then reduces to

\pi(\theta_t \mid \theta_{t-1}) = \min\!\left(1, \frac{p(\theta_t \mid y)}{p(\theta_{t-1} \mid y)}\right)    (22)

This is the form that chemists and physicists are most accustomed to. The probabilities are calculated from the Boltzmann equation and the energy difference between state t and state t − 1. Because we are using a ratio of probabilities, the normalization factor, i.e., the partition function, drops out of the equation. Another variant when θ is multidimensional (which it usually is) is to update one component at a time. We define θ_{t,−i} = {θ_{t,1}, θ_{t,2}, . . . , θ_{t,i−1}, θ_{t−1,i+1}, . . . , θ_{t−1,m}}, where m is the number of components in θ. So θ_{t,−i} contains all of the components except the ith; the components that precede the ith component have been updated in step t, while the components that follow have not yet been updated. The m components are updated one at a time with this probability:

\pi(\theta_{t,i} \mid \theta_{t,-i}) = \min\!\left(1, \frac{p(\theta_{t,i} \mid y, \theta_{t,-i})\, q(\theta_{t-1,i} \mid \theta_{t,i}, \theta_{t,-i})}{p(\theta_{t-1,i} \mid y, \theta_{t,-i})\, q(\theta_{t,i} \mid \theta_{t-1,i}, \theta_{t,-i})}\right)    (23)

If draws can be made from the posterior distribution for each component conditional on values for the others, i.e., from p(θ_{t,i} | y, θ_{t,−i}), then this conditional posterior distribution can be used as the proposal distribution. In this case, the probability in Eq. (23) is always 1, and all draws are accepted. This is referred to as Gibbs sampling and is the most common form of MCMC used in statistical analysis.
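A minimal sketch of Gibbs sampling (illustrative, not from the chapter): for a toy bivariate normal "posterior" with correlation ρ, each full conditional is itself normal, so each component is drawn directly from its conditional and every draw is accepted.

```python
# Gibbs sampling sketch for a toy bivariate normal target with correlation rho.
import numpy as np

rng = np.random.default_rng(2)
rho, n_draws = 0.8, 20_000
theta = np.zeros(2)
chain = np.empty((n_draws, 2))

for t in range(n_draws):
    # Each full conditional p(theta_i | y, theta_{-i}) is normal for this target,
    # so drawing from it directly makes the acceptance probability in Eq. (23) equal to 1.
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho**2))
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho**2))
    chain[t] = theta

print(np.corrcoef(chain.T)[0, 1])   # close to 0.8
```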

G. Mixture Models

Mixture models have come up frequently in Bayesian statistical analysis in molecular and structural biology [16,28] as described below, so a description is useful here. Mixture models can be used when simple forms such as the exponential or Dirichlet function alone do not describe the data well. This is usually the case for a multimodal data distribution (as might be evident from a histogram of the data), when clearly a single Gaussian function will not suffice. A mixture is a sum of simple forms for the likelihood:

p(y) = \sum_{i=1}^{n} q_i\, p(y \mid \theta_i)    (24)

where \sum_{i=1}^{n} q_i = 1 for the n components of the mixture. For instance, if the terms in Eq. (24) are normal, then each term is of the form (for a single data point y_j)

p(y_j \mid \theta_i) = \frac{1}{\sqrt{2\pi}\, \sigma_i} \exp\!\left(-\frac{(y_j - \mu_i)^2}{2\sigma_i^2}\right)    (25)

so each θ_i = {µ_i, σ_i^2}.
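As a small worked example of Eqs. (24) and (25), assuming NumPy and hypothetical parameter values, the mixture density at a point is just the q-weighted sum of the component normal densities:

```python
# Evaluate a two-component normal mixture density, Eqs. (24)-(25).
# The weights, means, and variances below are hypothetical.
import numpy as np

def normal_pdf(y, mu, sigma2):
    """Eq. (25): normal density for a single data point."""
    return np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def mixture_pdf(y, q, mu, sigma2):
    """Eq. (24): weighted sum of component densities; the q's must sum to 1."""
    return sum(qi * normal_pdf(y, mi, si) for qi, mi, si in zip(q, mu, sigma2))

q, mu, sigma2 = [0.3, 0.7], [-2.0, 1.0], [1.0, 0.5]
print(mixture_pdf(0.0, q, mu, sigma2))
```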

Maximum likelihood methods used in classical statistics are not valid for estimating the θ's or the q's. Bayesian methods have become possible only with the development of the Gibbs sampling methods described above, because forming the likelihood for a full data set entails the product of many sums of the form of Eq. (24):

p(\{y_1, y_2, \ldots, y_N\}) = \prod_{j=1}^{N} \sum_{i=1}^{n} q_i\, p(y_j \mid \theta_i)    (26)


Because we are dealing with count data and proportions for the values qi, the appropriate conjugate prior distribution for the q’s is the Dirichlet distribution,

p(q_1, q_2, \ldots, q_k) = \mathrm{Dirichlet}(\alpha_1, \alpha_2, \ldots, \alpha_k)

where the α’s are prior counts for the components of the mixture. A simplification is to associate each data point with a single component, usually the component with the nearest location (i.e., µi). In this case, it is necessary to associate with each data point yj a variable cj that denotes the component that yj belongs to. These variables cj are unknown and are therefore called ‘‘missing data.’’ Equation (26) now simplifies to

 

p(\{y_1, y_2, \ldots, y_N\}) = \prod_{j=1}^{N} p(y_j \mid \theta_{c_j})    (27)

A straightforward Gibbs sampling strategy when the number of components is known (or fixed) is as follows [48].

Step 1. From a histogram of the data, partition the data into n components, each roughly corresponding to a mode of the data distribution. This defines the c_j. Set the parameters for prior distributions on the θ parameters that are conjugate to the likelihoods. For the normal distribution the priors are defined in Eq. (15), so the full prior for the n components is

p(\theta_1, \theta_2, \ldots, \theta_n) = \prod_{i=1}^{n} N(\mu_{0i}, \sigma_{0i}^2/\kappa_{0i})\, \mathrm{Inv}\text{-}\chi^2(\nu_{0i}, \sigma_{0i}^2)    (28)

The prior hyperparameters, µ_{0i}, etc., can be estimated from the data assigned to each component. First define N_i = \sum_{j=1}^{N} I(c_j = i), where I(c_j = i) = 1 if c_j = i and 0 otherwise. Then, for instance, the prior hyperparameters for the mean values are defined by

\mu_{0i} = \frac{1}{N_i} \sum_{j=1}^{N} I(c_j = i)\, y_j    (29)

The parameters of the Dirichlet prior for the q's should be proportional to the counts for each component in this preliminary data analysis. So we now have a collection of prior parameters {θ_{0i} = (µ_{0i}, κ_{0i}, σ_{0i}^2, ν_{0i})}, a preliminary assignment of each data point to a component, {c_j}, and therefore the preliminary number of data points for each component, {N_i}.

Step 2. Draw a value for each θ_i = {µ_i, σ_i^2} from the normal posterior distribution for the N_i data points with average ȳ_i,

p(\theta_i \mid \{y_i\}) = N(\mu_{N_i}, \sigma_{N_i}^2/\kappa_{N_i})\, \mathrm{Inv}\text{-}\chi^2(\nu_{N_i}, \sigma_{N_i}^2)    (30)

where [as in Eq. (17)]

 

 

 

 

\mu_{N_i} = \frac{1}{\kappa_{N_i}} \left(\kappa_{0i}\, \mu_{0i} + N_i\, \bar{y}_i\right)    (31a)

\kappa_{N_i} = \kappa_{0i} + N_i, \qquad \nu_{N_i} = \nu_{0i} + N_i    (31b)


\nu_{N_i}\, \sigma_{N_i}^2 = \nu_{0i}\, \sigma_{0i}^2 + (N_i - 1)\, s_i^2 + \frac{N_i\, \kappa_{0i}}{\kappa_{N_i}} \left(\bar{y}_i - \mu_{0i}\right)^2    (31c)

s_i^2 = \frac{1}{N_i - 1} \sum_{k:\, c_k = i} (y_k - \bar{y}_i)^2    (31d)

Draw (q_1, q_2, . . . , q_n) from Dirichlet(α_1 + N_1, α_2 + N_2, . . . , α_n + N_n), which is the posterior distribution with prior counts α_i and data counts N_i.

Step 3. Reset the c_j by drawing a random number u_j between 0 and 1 for each c_j, and set c_j to i if

 

\frac{1}{Z} \sum_{i'=1}^{i-1} q_{i'}\, p(y_j \mid \theta_{i'}) < u_j \le \frac{1}{Z} \sum_{i'=1}^{i} q_{i'}\, p(y_j \mid \theta_{i'})    (32)

where Z = \sum_{i=1}^{n} q_i\, p(y_j \mid \theta_i) is the normalization factor.

Step 4. Sum up the N_i and calculate the averages ȳ_i from the data and the values of c_j.

Step 5. Repeat steps 2–4 until convergence.
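Below is a minimal, self-contained sketch of Steps 1–5 for a one-dimensional, two-component normal mixture, assuming NumPy. The synthetic data, the crude median-based initial partition, the weak prior settings, and the fixed number of sweeps are illustrative assumptions rather than part of the algorithm as stated, and label switching and convergence checks are ignored.

```python
# A sketch of Steps 1-5 for a 1-D, two-component normal mixture (assumptions:
# synthetic data, median-based initial split, weak priors kappa_0i = nu_0i = 1,
# a fixed number of sweeps, no convergence monitoring or label-switching fix).
import numpy as np

rng = np.random.default_rng(4)

# Synthetic bimodal data so the example runs end to end.
y = np.concatenate([rng.normal(-2.0, 0.7, 150), rng.normal(3.0, 1.0, 250)])
N, n_comp, n_sweeps = len(y), 2, 200

def normal_pdf(x, m, v):
    return np.exp(-(x - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

# Step 1: initial partition and prior hyperparameters from the assigned data.
c = (y > np.median(y)).astype(int)                    # crude initial assignment c_j
counts = np.bincount(c, minlength=n_comp)
mu0 = np.array([y[c == i].mean() for i in range(n_comp)])        # Eq. (29)
s20 = np.array([y[c == i].var(ddof=1) for i in range(n_comp)])
kap0, nu0 = np.ones(n_comp), np.ones(n_comp)          # weak prior weights (assumption)
alpha = counts / counts.sum() * n_comp                # prior counts ~ preliminary proportions
mu, sig2 = mu0.copy(), s20.copy()

for sweep in range(n_sweeps):
    # Step 2: draw q and each (mu_i, sigma2_i) from the conditional posteriors.
    q = rng.dirichlet(alpha + counts)                 # Dirichlet(alpha_i + N_i)
    for i in range(n_comp):
        yi = y[c == i]
        Ni = len(yi)
        ybar = yi.mean() if Ni > 0 else mu0[i]
        s2 = yi.var(ddof=1) if Ni > 1 else 0.0                      # Eq. (31d)
        kapN, nuN = kap0[i] + Ni, nu0[i] + Ni                       # Eq. (31b)
        muN = (kap0[i] * mu0[i] + Ni * ybar) / kapN                 # Eq. (31a)
        nuN_s2 = (nu0[i] * s20[i] + (Ni - 1) * s2
                  + Ni * kap0[i] / kapN * (ybar - mu0[i])**2)       # Eq. (31c)
        sig2[i] = nuN_s2 / rng.chisquare(nuN)         # scaled inverse-chi^2 draw
        mu[i] = rng.normal(muN, np.sqrt(sig2[i] / kapN))            # Eq. (30)
    # Step 3: reassign each data point by inverse-CDF sampling, Eq. (32).
    like = q * normal_pdf(y[:, None], mu, sig2)       # N x n_comp unnormalized weights
    prob = like / like.sum(axis=1, keepdims=True)
    u = rng.uniform(size=N)
    c = (np.cumsum(prob, axis=1) < u[:, None]).sum(axis=1)
    c = np.minimum(c, n_comp - 1)                     # guard against rounding at the edge
    # Step 4: update the counts; Step 5 is the sweep loop itself.
    counts = np.bincount(c, minlength=n_comp)

print(q, mu, sig2)   # should roughly recover the two simulated clusters
```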

The number of components necessary can usually be judged from the data, but the appropriateness of a particular value of n can also be assessed by comparing different values of n and calculating the entropy distance, or Kullback–Leibler divergence,

ED(g, h) = \int g(x) \ln \frac{g(x)}{h(x)}\, dx    (33)

where, for instance, g might be a three-component model and h might be a two-component model. If ED(g, h) > 0, then the model g is better than the model h.
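As an illustrative Monte Carlo estimate of Eq. (33) (not from the text), assuming NumPy and two hypothetical fitted mixtures, g with three components and h with two: draw x from g and average ln[g(x)/h(x)].

```python
# Monte Carlo estimate of the entropy distance ED(g, h) in Eq. (33).
# The two mixture models g and h below are hypothetical fits.
import numpy as np

rng = np.random.default_rng(5)

def normal_pdf(x, m, v):
    return np.exp(-(x - m)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def g(x):  # hypothetical three-component model
    return 0.3 * normal_pdf(x, -2, 1) + 0.5 * normal_pdf(x, 1, 0.5) + 0.2 * normal_pdf(x, 4, 1)

def h(x):  # hypothetical two-component model
    return 0.4 * normal_pdf(x, -2, 1) + 0.6 * normal_pdf(x, 1.5, 1.5)

# Draw x ~ g; ED(g, h) is then approximately the average of ln[g(x)/h(x)].
comps = rng.choice(3, size=200_000, p=[0.3, 0.5, 0.2])
means, variances = np.array([-2.0, 1.0, 4.0]), np.array([1.0, 0.5, 1.0])
x = rng.normal(means[comps], np.sqrt(variances[comps]))
print(np.mean(np.log(g(x) / h(x))))   # positive: information lost by using h instead of g
```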

H. Explanatory Variables

There is some confusion in using Bayes' rule on what are sometimes called explanatory variables. As an example, we can try to use Bayesian statistics to derive the probabilities of each secondary structure type for each amino acid type, that is, p(µ|r), where µ is α, β, or γ (for coil) secondary structure and r is one of the 20 amino acids. It is tempting to write p(µ|r) = p(r|µ)p(µ)/p(r) using Bayes' rule. This expression is, of course, correct and can be used on PDB data to relate these probabilities. But this is not Bayesian statistics, which relates parameters that represent underlying properties to (limited) data that are manifestations of those parameters in some way. In this case, the parameters we are after are θ_µ(r) = p(µ|r). The data from the PDB are in the form of counts for y_µ(r), the number of amino acids of type r in the PDB that have secondary structure µ. There are 60 such numbers (20 amino acid types × 3 secondary structure types). We then have for each amino acid type a Bayesian expression for the posterior distribution for the values of θ_µ(r):

p(\theta(r) \mid y(r)) \propto p(y(r) \mid \theta(r))\, p(\theta(r))    (34)

where θ and y are vectors of three components, α, β, and γ. The prior is a Dirichlet distribution with some number of counts for the three secondary structure types for amino acid type r, i.e., Dirichlet(n_α(r), n_β(r), n_γ(r)). We could choose the three n_µ(r) to be equal to some small number, say 10. Or we could set them equal to 100 p_µ, where p_µ is the