
# Information Theory / Cover T.M., Thomas J.A. Elements of Information Theory. 2006, 748p

PROBLEMS 405

(a) Let the expected log likelihood be

$$E_{\theta_0} l_\theta(X^n) = \int \ln\!\left(\prod_{i=1}^{n} f_\theta(x_i)\right) \prod_{i=1}^{n} f_{\theta_0}(x_i)\, dx^n,$$

and show that

$$E_{\theta_0}(l(X^n)) = \bigl(-h(f_{\theta_0}) - D(f_{\theta_0}\|f_\theta)\bigr)\, n.$$

(b) Show that the maximum over θ of the expected log likelihood is achieved by θ = θ0.

11.18 Large deviations. Let $X_1, X_2, \ldots$ be i.i.d. random variables drawn according to the geometric distribution

$$\Pr\{X = k\} = p^{k-1}(1-p), \qquad k = 1, 2, \ldots.$$

Find good estimates (to first order in the exponent) of:

(a) $\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \alpha\right\}$.

(b) $\Pr\left\{X_1 = k \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} X_i \ge \alpha\right\}$.

(c) Evaluate parts (a) and (b) for $p = \frac{1}{2}$, $\alpha = 4$.
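A Sanov/Cramér-type calculation suggests the first-order exponent in part (a) is $D(\text{geometric with mean }\alpha \,\|\, \text{geometric}(p))$. The sketch below assumes that form of the minimizer (parameter $p^* = 1 - 1/\alpha$, the standard tilted-family argument, not the text's worked answer) and cross-checks the closed-form divergence against direct summation for part (c)'s numbers.

```python
import math

def kl_geometric(p_star, p):
    """D(geom(p_star) || geom(p)) in nats, closed form.

    geom(p): Pr{X = k} = p^(k-1) (1-p), k = 1, 2, ...; mean 1/(1-p)."""
    mean_star = 1.0 / (1.0 - p_star)
    return (mean_star - 1.0) * math.log(p_star / p) \
        + math.log((1.0 - p_star) / (1.0 - p))

def kl_geometric_series(p_star, p, terms=500):
    """The same divergence summed term by term, as a cross-check."""
    total = 0.0
    for k in range(1, terms + 1):
        q1 = p_star ** (k - 1) * (1.0 - p_star)
        q0 = p ** (k - 1) * (1.0 - p)
        total += q1 * math.log(q1 / q0)
    return total

# Part (c): p = 1/2, alpha = 4.  The assumed minimizer is the geometric
# distribution with mean 4, i.e. parameter p_star = 1 - 1/4 = 3/4.
exponent_nats = kl_geometric(0.75, 0.5)
```

For these numbers the exponent works out to $3\ln\frac{3}{2} - \ln 2 \approx 0.523$ nats per sample, so the probability in (a) decays like $e^{-0.523\,n}$ to first order, under the assumptions above.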

11.19 Another expression for Fisher information. Use integration by parts to show that

$$J(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2} \ln f_\theta(x)\right].$$
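For a quick numerical check of this identity, take the normal location family $f_\theta = N(\theta, 1)$ (an illustrative choice, for which $J(\theta) = 1$): the score-squared definition of Fisher information and the second-derivative form above can both be evaluated by quadrature and compared.

```python
import math

def normal_pdf(x, theta):
    """Density of N(theta, 1)."""
    return math.exp(-(x - theta) ** 2 / 2.0) / math.sqrt(2 * math.pi)

def fisher_score_form(theta, h=1e-5, lo=-10.0, hi=10.0, steps=4000):
    """J(theta) = E[(d/dtheta ln f_theta(X))^2], central differences + midpoint rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        score = (math.log(normal_pdf(x, theta + h))
                 - math.log(normal_pdf(x, theta - h))) / (2 * h)
        total += score ** 2 * normal_pdf(x, theta) * dx
    return total

def fisher_curvature_form(theta, h=1e-4, lo=-10.0, hi=10.0, steps=4000):
    """J(theta) = -E[d^2/dtheta^2 ln f_theta(X)], the integration-by-parts form."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        d2 = (math.log(normal_pdf(x, theta + h))
              - 2 * math.log(normal_pdf(x, theta))
              + math.log(normal_pdf(x, theta - h))) / h ** 2
        total -= d2 * normal_pdf(x, theta) * dx
    return total

J_score = fisher_score_form(0.0)      # both should be ~1 for N(theta, 1)
J_curv = fisher_curvature_form(0.0)
```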

11.20 Stirling's approximation. Derive a weak form of Stirling's approximation for factorials; that is, show that

$$\left(\frac{n}{e}\right)^n \le n! \le n\left(\frac{n}{e}\right)^n \tag{11.327}$$

using the approximation of integrals by sums. Justify the following steps:

$$\ln(n!) = \sum_{i=2}^{n-1} \ln(i) + \ln(n) \le \int_{2}^{n} \ln x \, dx + \ln n = \cdots \tag{11.328}$$

406 INFORMATION THEORY AND STATISTICS

and

$$\ln(n!) = \sum_{i=1}^{n} \ln(i) \ge \int_{0}^{n} \ln x \, dx = \cdots. \tag{11.329}$$
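Both bounds are easy to check numerically in the log domain with `math.lgamma`. One caveat worth knowing: as printed, the upper bound $n! \le n(n/e)^n$ is loose for very small $n$ (at $n = 6$, $n! = 720$ while $n(n/e)^n \approx 694$), but it holds from $n = 7$ onward; the sketch below tests that range.

```python
import math

def log_factorial(n):
    """ln(n!) via the log-gamma function."""
    return math.lgamma(n + 1)

def weak_stirling_holds(n):
    """Check (n/e)^n <= n! <= n (n/e)^n, comparing natural logarithms."""
    lower = n * (math.log(n) - 1.0)                 # ln (n/e)^n
    upper = math.log(n) + n * (math.log(n) - 1.0)   # ln n(n/e)^n
    return lower <= log_factorial(n) <= upper
```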

11.21 Asymptotic value of $\binom{n}{k}$. Use the simple approximation of Problem 11.20 to show that if $0 \le p \le 1$ and $k = \lfloor np \rfloor$ (i.e., $k$ is the largest integer less than or equal to $np$), then

$$\lim_{n\to\infty} \frac{1}{n}\log\binom{n}{k} = -p\log p - (1-p)\log(1-p) = H(p). \tag{11.330}$$

Now let $p_i$, $i = 1, \ldots, m$, be a probability distribution on $m$ symbols (i.e., $p_i \ge 0$ and $\sum_i p_i = 1$). What is the limiting value of

$$\frac{1}{n}\log \binom{n}{np_1\ np_2\ \cdots\ np_{m-1}\ \ n - \sum_{j=1}^{m-1} np_j} = \frac{1}{n}\log\frac{n!}{(np_1)!\,(np_2)!\cdots(np_{m-1})!\,\bigl(n - \sum_{j=1}^{m-1} np_j\bigr)!}\;? \tag{11.331}$$
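The limiting value in (11.331) should be the entropy $H(p_1, \ldots, p_m)$, by the same bounds as in Problem 11.20. A numeric sketch (the distribution below is an arbitrary choice) compares the normalized log multinomial coefficient at large $n$ with $H$:

```python
import math

def log2_multinomial(counts):
    """log2 of n!/(n_1! ... n_m!) via lgamma, where n = sum of counts."""
    n = sum(counts)
    val = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return val / math.log(2)

def entropy_bits(p):
    """H(p) in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def normalized_log_multinomial(p, n):
    """(1/n) log2 of the coefficient with counts floor(n p_j) and the
    remainder n - sum(...) in the last slot, as in (11.331)."""
    counts = [math.floor(n * pj) for pj in p[:-1]]
    counts.append(n - sum(counts))
    return log2_multinomial(counts) / n

p = (0.5, 0.3, 0.2)
H = entropy_bits(p)
approx = normalized_log_multinomial(p, 100000)   # should be close to H
```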

11.22 Running difference. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim Q_1(x)$, and $Y_1, Y_2, \ldots, Y_n$ be i.i.d. $\sim Q_2(y)$. Let $X^n$ and $Y^n$ be independent. Find an expression for

$$\Pr\left\{\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} Y_i \ge nt\right\}$$

good to first order in the exponent. Again, this answer can be left in parametric form.

11.23 Large likelihoods. Let $X_1, X_2, \ldots$ be i.i.d. $\sim Q(x)$, $x \in \{1, 2, \ldots, m\}$. Let $P(x)$ be some other probability mass function. We

form the log likelihood ratio

$$\frac{1}{n}\log\frac{P^n(X_1, X_2, \ldots, X_n)}{Q^n(X_1, X_2, \ldots, X_n)} = \frac{1}{n}\sum_{i=1}^{n}\log\frac{P(X_i)}{Q(X_i)}$$

of the sequence $X^n$ and ask for the probability that it exceeds a certain threshold. Specifically, find (to first order in the exponent)

$$Q^n\left\{\frac{1}{n}\log\frac{P(X_1, X_2, \ldots, X_n)}{Q(X_1, X_2, \ldots, X_n)} > 0\right\}.$$

There may be an undetermined parameter in the answer.
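By Sanov's theorem, the exponent here should be $\min D(P'\|Q)$ over distributions $P'$ satisfying $\sum_x P'(x)\log\frac{P(x)}{Q(x)} \ge 0$ (the set on which the empirical log likelihood ratio is positive). A brute-force sketch over binary types, with $P$ and $Q$ chosen arbitrarily for illustration:

```python
import math

def kl_bits(p, q):
    """D(p||q) in bits for distributions given as tuples."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sanov_exponent(P, Q, grid=100000):
    """min D(P'||Q) over binary P' with E_{P'} log2(P/Q) >= 0,
    scanned over a fine grid of candidate types."""
    llr = [math.log2(P[i] / Q[i]) for i in range(2)]
    best = float("inf")
    for k in range(grid + 1):
        a = k / grid
        if a * llr[0] + (1 - a) * llr[1] >= 0:
            best = min(best, kl_bits((a, 1 - a), Q))
    return best

P = (0.9, 0.1)
Q = (0.5, 0.5)
exponent_bits = sanov_exponent(P, Q)   # Q^n{...} ~ 2^(-n * exponent_bits)
```

The minimizer sits on the boundary $\sum_x P'(x)\log(P(x)/Q(x)) = 0$ (here $P' \approx (0.732, 0.268)$, giving an exponent of about 0.162 bit), and it never exceeds $D(P\|Q)$, since $P$ itself is feasible.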


11.24 Fisher information for mixtures. Let $f_1(x)$ and $f_0(x)$ be two given probability densities. Let $Z$ be Bernoulli($\theta$), where $\theta$ is unknown. Let $X \sim f_1(x)$ if $Z = 1$ and $X \sim f_0(x)$ if $Z = 0$.

(a) Find the density $f_\theta(x)$ of the observed $X$.

(b) Find the Fisher information $J(\theta)$.

(c) What is the Cramér-Rao lower bound on the mean-squared error of an unbiased estimate of $\theta$?

(d) Can you exhibit an unbiased estimator of $\theta$?

11.25 Bent coins. Let $\{X_i\}$ be i.i.d. $\sim Q$, where

$$Q(k) = \Pr(X_i = k) = \binom{m}{k} q^k (1-q)^{m-k} \qquad \text{for } k = 0, 1, 2, \ldots, m.$$

Thus, the $X_i$'s are i.i.d. $\sim$ Binomial($m, q$). Show that as $n \to \infty$,

$$\Pr\left\{X_1 = k \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} X_i \ge \alpha\right\} \to P(k),$$

where $P$ is Binomial($m, \lambda$) (i.e., $P(k) = \binom{m}{k}\lambda^k(1-\lambda)^{m-k}$ for some $\lambda \in [0, 1]$). Find $\lambda$.

11.26 Conditional limiting distribution

(a) Find the exact value of

$$\Pr\left\{X_1 = 1 \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{4}\right\} \tag{11.332}$$

if $X_1, X_2, \ldots$ are i.i.d. $\sim$ Bernoulli($\tfrac{2}{3}$) and $n$ is a multiple of 4.

(b) Now let $X_i \in \{-1, 0, 1\}$, and let $X_1, X_2, \ldots$ be i.i.d. uniform over $\{-1, 0, +1\}$. Find the limit of

$$\Pr\left\{X_1 = +1 \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} X_i^2 = \frac{1}{2}\right\} \tag{11.333}$$

for n = 2k, k → ∞.
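Part (a) can be checked by exact enumeration: conditioned on the sum, an i.i.d. Bernoulli sequence is uniform over all sequences of that type, so the conditional probability is exactly $k/n = \frac{1}{4}$ whatever the bias. A small exact-arithmetic sketch ($n = 8$ is an assumed small case, not from the text):

```python
from itertools import product
from fractions import Fraction

def conditional_prob_first_is_one(n, k, p):
    """Pr{X1 = 1 | sum_i Xi = k} for i.i.d. Bernoulli(p), by exact enumeration."""
    num = Fraction(0)
    den = Fraction(0)
    for seq in product((0, 1), repeat=n):
        if sum(seq) != k:
            continue
        w = Fraction(1)  # probability weight of this sequence
        for x in seq:
            w *= p if x == 1 else 1 - p
        den += w
        if seq[0] == 1:
            num += w
    return num / den

# n = 8, sum fixed at n/4 = 2 ones, bias 2/3: the answer is exactly 1/4.
val = conditional_prob_first_is_one(8, 2, Fraction(2, 3))
```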


11.27 Variational inequality. Verify for positive random variables $X$ that

$$\log E_P(X) = \sup_Q \Bigl[\, E_Q(\log X) - D(Q\|P) \,\Bigr], \tag{11.334}$$

where $E_P(X) = \sum_x x P(x)$ and $D(Q\|P) = \sum_x Q(x)\log\frac{Q(x)}{P(x)}$, and the supremum is over all $Q(x) \ge 0$, $\sum_x Q(x) = 1$. It is enough to extremize

$$J(Q) = E_Q \ln X - D(Q\|P) + \lambda\left(\sum_x Q(x) - 1\right).$$
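A standard way to verify (11.334) numerically: the supremum is attained by the tilted distribution $Q^*(x) \propto P(x)\,x$, for which $E_{Q^*}(\log X) - D(Q^*\|P)$ collapses to $\log E_P(X)$ exactly. A sketch (the alphabet and values below are arbitrary choices):

```python
import math

def objective(Q, P, xvals):
    """E_Q(log X) - D(Q||P), the functional being supremized in (11.334)."""
    total = 0.0
    for q, p, x in zip(Q, P, xvals):
        if q > 0:
            total += q * math.log(x) - q * math.log(q / p)
    return total

P = (0.2, 0.5, 0.3)
xvals = (1.0, 4.0, 9.0)

lhs = math.log(sum(p * x for p, x in zip(P, xvals)))     # log E_P(X)
Z = sum(p * x for p, x in zip(P, xvals))
Qstar = tuple(p * x / Z for p, x in zip(P, xvals))       # tilted optimizer
sup_val = objective(Qstar, P, xvals)
```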

11.28 Type constraints

(a) Find constraints on the type $P_{X^n}$ such that the sample variance satisfies $\overline{X_n^2} - (\overline{X}_n)^2 \le \alpha$, where $\overline{X_n^2} = \frac{1}{n}\sum_{i=1}^{n} X_i^2$ and $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.

(b) Find the exponent in the probability $Q^n\left(\overline{X_n^2} - (\overline{X}_n)^2 \le \alpha\right)$.

You can leave the answer in parametric form.

11.29 Uniform distribution on the simplex. Which of these methods will generate a sample from the uniform distribution on the simplex $\{x \in \mathbb{R}^n : x_i \ge 0,\ \sum_{i=1}^{n} x_i = 1\}$?

(a) Let $Y_i$ be i.i.d. uniform [0, 1], with $X_i = Y_i / \sum_{j=1}^{n} Y_j$.

(b) Let $Y_i$ be i.i.d. exponentially distributed $\sim \lambda e^{-\lambda y}$, $y \ge 0$, with $X_i = Y_i / \sum_{j=1}^{n} Y_j$.

(c) (Break stick into $n$ parts) Let $Y_1, Y_2, \ldots, Y_{n-1}$ be i.i.d. uniform [0, 1], and let $X_i$ be the length of the $i$th interval.
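Methods (b) and (c) both yield the uniform distribution (normalized i.i.d. exponentials are Dirichlet(1, ..., 1), and uniform spacings are exchangeable), under which $\Pr\{X_1 > t\} = (1-t)^{n-1}$. A Monte Carlo sketch of that marginal (method (a) does not pass this test and is left to the reader):

```python
import random

def sample_simplex_exponential(n, rng):
    """Method (b): normalize i.i.d. Exp(1) variables."""
    y = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(y)
    return [v / s for v in y]

def sample_simplex_breakstick(n, rng):
    """Method (c): break [0,1] at n-1 i.i.d. uniform points; take the spacings."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    pts = [0.0] + cuts + [1.0]
    return [pts[i + 1] - pts[i] for i in range(n)]

def tail_freq_x1(sampler, n, t, trials, seed):
    """Estimate Pr{X1 > t}; uniform on the simplex gives (1-t)^(n-1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if sampler(n, rng)[0] > t)
    return hits / trials

# n = 3, t = 0.5: the uniform-simplex value is (1 - 0.5)^2 = 0.25.
freq_exp = tail_freq_x1(sample_simplex_exponential, 3, 0.5, 40000, seed=1)
freq_stick = tail_freq_x1(sample_simplex_breakstick, 3, 0.5, 40000, seed=2)
```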

HISTORICAL NOTES

The method of types evolved from notions of strong typicality; some of the ideas were used by Wolfowitz to prove channel capacity theorems. The method was fully developed by Csiszár and Körner, who derived the main theorems of information theory from this viewpoint. The method of types described in Section 11.1 follows the development in Csiszár and Körner. The L1 lower bound on relative entropy is due to Csiszár, Kullback, and Kemperman. Sanov's theorem was generalized by Csiszár using the method of types.

CHAPTER 12

MAXIMUM ENTROPY

The temperature of a gas corresponds to the average kinetic energy of the molecules in the gas. What can we say about the distribution of velocities in the gas at a given temperature? We know from physics that this distribution is the maximum entropy distribution under the temperature constraint, otherwise known as the Maxwell–Boltzmann distribution. The maximum entropy distribution corresponds to the macrostate (as indexed by the empirical distribution) that has the most microstates (the individual gas velocities). Implicit in the use of maximum entropy methods in physics is a sort of AEP which says that all microstates are equally probable.

12.1 MAXIMUM ENTROPY DISTRIBUTIONS

Consider the following problem: Maximize the entropy h(f ) over all probability densities f satisfying

1. $f(x) \ge 0$, with equality outside the support set $S$,
2. $\int_S f(x)\, dx = 1$,
3. $\int_S f(x)\, r_i(x)\, dx = \alpha_i$ for $1 \le i \le m$.   (12.1)

Thus, $f$ is a density on support set $S$ meeting certain moment constraints $\alpha_1, \alpha_2, \ldots, \alpha_m$.

Approach 1 (Calculus) The differential entropy $h(f)$ is a concave function over a convex set. We form the functional

$$J(f) = -\int f \ln f + \lambda_0 \int f + \sum_{i=1}^{m} \lambda_i \int f r_i \tag{12.2}$$

and “differentiate” with respect to f (x), the xth component of f , to obtain

$$\frac{\partial J}{\partial f(x)} = -\ln f(x) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x). \tag{12.3}$$

Elements of Information Theory, Second Edition, by Thomas M. Cover and Joy A. Thomas. Copyright 2006 John Wiley & Sons, Inc.



Setting this equal to zero, we obtain the form of the maximizing density

$$f(x) = e^{\lambda_0 - 1 + \sum_{i=1}^{m} \lambda_i r_i(x)}, \qquad x \in S, \tag{12.4}$$

where $\lambda_0, \lambda_1, \ldots, \lambda_m$ are chosen so that $f$ satisfies the constraints.

The approach using calculus only suggests the form of the density that maximizes the entropy. To prove that this is indeed the maximum, we can take the second variation. It is simpler to use the information inequality

D(g||f ) ≥ 0.

Approach 2 (Information inequality) If $g$ satisfies (12.1) and if $f$ is of the form (12.4), then $0 \le D(g\|f) = -h(g) + h(f)$. Thus $h(g) \le h(f)$ for all $g$ satisfying the constraints. We prove this in the following theorem.

Theorem 12.1.1 (Maximum entropy distribution) Let $f(x) = f_\lambda(x) = e^{\lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)}$, $x \in S$, where $\lambda_0, \ldots, \lambda_m$ are chosen so that $f$ satisfies (12.1). Then $f$ uniquely maximizes $h(f)$ over all probability densities $f$ satisfying constraints (12.1).

Proof: Let $g$ satisfy the constraints (12.1). Then

$$
\begin{aligned}
h(g) &= -\int_S g \ln g && (12.5)\\
&= -\int_S g \ln \frac{g}{f} f && (12.6)\\
&= -D(g\|f) - \int_S g \ln f && (12.7)\\
&\stackrel{(a)}{\le} -\int_S g \ln f && (12.8)\\
&\stackrel{(b)}{=} -\int_S g \Bigl(\lambda_0 + \sum_i \lambda_i r_i\Bigr) && (12.9)\\
&\stackrel{(c)}{=} -\int_S f \Bigl(\lambda_0 + \sum_i \lambda_i r_i\Bigr) && (12.10)\\
&= -\int_S f \ln f && (12.11)\\
&= h(f), && (12.12)
\end{aligned}
$$

where (a) follows from the nonnegativity of relative entropy, (b) follows from the definition of $f$, and (c) follows from the fact that both $f$ and $g$ satisfy the constraints. Note that equality holds in (a) if and only

if g(x) = f (x) for all x, except for a set of measure 0, thus proving uniqueness.

The same approach holds for discrete entropies and for multivariate distributions.

12.2 EXAMPLES

Example 12.2.1 (One-dimensional gas with a temperature constraint) Let the constraints be $EX = 0$ and $EX^2 = \sigma^2$. Then the form of the maximizing distribution is

$$f(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}. \tag{12.13}$$

To find the appropriate constants, we first recognize that this distribution has the same form as a normal distribution. Hence, the density that satisfies the constraints and also maximizes the entropy is the $N(0, \sigma^2)$ distribution:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}. \tag{12.14}$$

Example 12.2.2 (Dice, no constraints) Let $S = \{1, 2, 3, 4, 5, 6\}$. The distribution that maximizes the entropy is the uniform distribution, $p(x) = \frac{1}{6}$ for $x \in S$.

Example 12.2.3 (Dice, with $EX = \sum i p_i = \alpha$) This important example was used by Boltzmann. Suppose that $n$ dice are thrown on the table and we are told that the total number of spots showing is $n\alpha$. What proportion of the dice are showing face $i$, $i = 1, 2, \ldots, 6$?

One way of going about this is to count the number of ways that $n$ dice can fall so that $n_i$ dice show face $i$. There are $\binom{n}{n_1, n_2, \ldots, n_6}$ such ways. This is a macrostate indexed by $(n_1, n_2, \ldots, n_6)$ corresponding to $\binom{n}{n_1, n_2, \ldots, n_6}$ microstates, each having probability $\frac{1}{6^n}$. To find the most probable macrostate, we wish to maximize $\binom{n}{n_1, n_2, \ldots, n_6}$ under the constraint observed on the total number of spots,

$$\sum_{i=1}^{6} i\, n_i = n\alpha. \tag{12.15}$$

Using a crude Stirling's approximation, $n! \approx \left(\frac{n}{e}\right)^n$, we find that

$$\binom{n}{n_1, n_2, \ldots, n_6} \approx \frac{\left(\frac{n}{e}\right)^n}{\prod_{i=1}^{6} \left(\frac{n_i}{e}\right)^{n_i}} \tag{12.16}$$
$$\binom{n}{n_1, n_2, \ldots, n_6} = \prod_{i=1}^{6}\left(\frac{n}{n_i}\right)^{n_i} \tag{12.17}$$
$$= e^{n H\left(\frac{n_1}{n}, \frac{n_2}{n}, \ldots, \frac{n_6}{n}\right)}. \tag{12.18}$$

Thus, maximizing $\binom{n}{n_1, n_2, \ldots, n_6}$ under the constraint (12.15) is almost equivalent to maximizing $H(p_1, p_2, \ldots, p_6)$ under the constraint $\sum_i i p_i = \alpha$. Using Theorem 12.1.1 under this constraint, we find the maximum entropy probability mass function to be

$$p_i = \frac{e^{\lambda i}}{\sum_{i=1}^{6} e^{\lambda i}}, \tag{12.19}$$

where $\lambda$ is chosen so that $\sum_i i p_i = \alpha$. Thus, the most probable macrostate is $(np_1, np_2, \ldots, np_6)$, and we expect to find $n_i = np_i$ dice showing face $i$.
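The tilt $\lambda$ in (12.19) can be found by one-dimensional search: the mean $\sum_i i\,p_i$ is strictly increasing in $\lambda$, so bisection works for any $\alpha \in (1, 6)$. A sketch ($\alpha = 4$ is an arbitrary test value):

```python
import math

def maxent_dice(alpha, lo=-10.0, hi=10.0, iters=200):
    """Return the maximum entropy pmf p_i proportional to exp(lambda * i)
    on faces {1,...,6} with mean alpha, solving for lambda by bisection."""
    def mean(lam):
        w = [math.exp(lam * i) for i in range(1, 7)]
        z = sum(w)
        return sum(i * wi for i, wi in zip(range(1, 7), w)) / z
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean(mid) < alpha:   # mean is increasing in lambda
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(lam * i) for i in range(1, 7)]
    z = sum(w)
    return [wi / z for wi in w]

p4 = maxent_dice(4.0)   # tilted toward the high faces
```

With $\alpha = 3.5$ the constraint is inert and the solution collapses back to the uniform distribution, as in Example 12.2.2.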

In Chapter 11 we show that the reasoning and the approximations are essentially correct. In fact, we show that not only is the maximum entropy macrostate the most likely, but it also contains almost all of the probability. Specifically, for rational $\alpha$,

$$\Pr\left\{\left|\frac{N_i}{n} - p_i\right| < \epsilon,\ i = 1, 2, \ldots, 6 \,\middle|\, \sum_{i=1}^{n} X_i = n\alpha\right\} \to 1 \tag{12.20}$$

as $n \to \infty$ along the subsequence such that $n\alpha$ is an integer.

Example 12.2.4 Let S = [a, b], with no other constraints. Then the maximum entropy distribution is the uniform distribution over this range.

Example 12.2.5 $S = [0, \infty)$ and $EX = \mu$. Then the entropy-maximizing distribution is

$$f(x) = \frac{1}{\mu}\, e^{-\frac{x}{\mu}}, \qquad x \ge 0. \tag{12.21}$$

This problem has a physical interpretation. Consider the distribution of the height $X$ of molecules in the atmosphere. The average potential energy of the molecules is fixed, and the gas tends to the distribution that has the maximum entropy subject to the constraint that $E(mgX)$ is fixed. This is the exponential distribution with density $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$. The density of the atmosphere does indeed have this distribution.


Example 12.2.6 $S = (-\infty, \infty)$, and $EX = \mu$. Here the maximum entropy is infinite, and there is no maximum entropy distribution. (Consider normal distributions with larger and larger variances.)

Example 12.2.7 $S = (-\infty, \infty)$, $EX = \alpha_1$, and $EX^2 = \alpha_2$. The maximum entropy distribution is $N(\alpha_1, \alpha_2 - \alpha_1^2)$.

Example 12.2.8 $S = \mathbb{R}^n$, $EX_i X_j = K_{ij}$, $1 \le i, j \le n$. This is a multivariate example, but the same analysis holds and the maximum entropy

density is of the form

$$f(x) = e^{\lambda_0 + \sum_{i,j} \lambda_{ij} x_i x_j}. \tag{12.22}$$

Since the exponent is a quadratic form, it is clear by inspection that the density is a multivariate normal with zero mean. Since we have to satisfy the second moment constraints, we must have a multivariate normal with covariance Kij , and hence the density is

$$f(x) = \frac{1}{\left(\sqrt{2\pi}\right)^n |K|^{1/2}}\, e^{-\frac{1}{2} x^T K^{-1} x}, \tag{12.23}$$

which has an entropy

$$h(N_n(0, K)) = \frac{1}{2}\log\bigl((2\pi e)^n |K|\bigr), \tag{12.24}$$

as derived in Chapter 8.
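(12.24) is easy to sanity-check against the scalar formula $h(N(0, \sigma^2)) = \frac{1}{2}\log 2\pi e \sigma^2$: for a diagonal $K$ the joint entropy must equal the sum of the marginal entropies, and introducing correlation can only lower it. A $2 \times 2$ sketch (covariance values chosen arbitrarily):

```python
import math

def mvn_entropy_bits(K):
    """h(N_2(0, K)) = (1/2) log2((2 pi e)^2 |K|) for a 2x2 covariance K."""
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    return 0.5 * math.log2((2 * math.pi * math.e) ** 2 * det)

def normal_entropy_bits(var):
    """h(N(0, var)) = (1/2) log2(2 pi e var)."""
    return 0.5 * math.log2(2 * math.pi * math.e * var)
```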

Example 12.2.9 Suppose that we have the same constraints as in Example 12.2.8, but $EX_i X_j = K_{ij}$ only for some restricted set of $(i, j) \in A$. For example, we might know $K_{ij}$ only for $i = j \pm 2$. Then by comparing (12.22) and (12.23), we can conclude that $(K^{-1})_{ij} = 0$ for $(i, j) \in A^c$ (i.e., the entries in the inverse of the covariance matrix are 0 when $(i, j)$ is outside the constraint set).

12.3 ANOMALOUS MAXIMUM ENTROPY PROBLEM

We have proved that the maximum entropy distribution subject to the

constraints

$$\int_S h_i(x) f(x)\, dx = \alpha_i \tag{12.25}$$

is of the form

$$f(x) = e^{\lambda_0 + \sum_i \lambda_i h_i(x)} \tag{12.26}$$

if λ0, λ1, . . . , λp satisfying the constraints (12.25) exist.


We now consider a tricky problem in which the λi cannot be chosen to satisfy the constraints. Nonetheless, the “maximum” entropy can be found. We consider the following problem: Maximize the entropy subject

to the constraints

$$\int_{-\infty}^{\infty} f(x)\, dx = 1, \tag{12.27}$$
$$\int_{-\infty}^{\infty} x f(x)\, dx = \alpha_1, \tag{12.28}$$
$$\int_{-\infty}^{\infty} x^2 f(x)\, dx = \alpha_2, \tag{12.29}$$
$$\int_{-\infty}^{\infty} x^3 f(x)\, dx = \alpha_3. \tag{12.30}$$

Here, the maximum entropy distribution, if it exists, must be of the form

$$f(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3}. \tag{12.31}$$

But if $\lambda_3$ is nonzero, $\int_{-\infty}^{\infty} f = \infty$ and the density cannot be normalized. So $\lambda_3$ must be 0. But then we have four equations and only three variables, so that in general it is not possible to choose the appropriate constants. The method seems to have failed in this case.

The reason for the apparent failure is simple: The entropy has a least upper bound under these constraints, but it is not possible to attain it. Consider the corresponding problem with only first and second moment constraints. In this case, the results of Example 12.2.1 show that the entropy-maximizing distribution is the normal with the appropriate moments. With the additional third moment constraint, the maximum entropy cannot be higher. Is it possible to achieve this value?

We cannot achieve it, but we can come arbitrarily close. Consider a normal distribution with a small "wiggle" at a very high value of $x$. The moments of the new distribution are almost the same as those of the old one, the biggest change being in the third moment. We can bring the first and second moments back to their original values by adding new wiggles to balance out the changes caused by the first. By choosing the position of the wiggles, we can get any value of the third moment without reducing the entropy significantly below that of the associated normal. Using this method, we can come arbitrarily close to the upper bound for the maximum entropy distribution. We conclude that

$$\sup h(f) = h\bigl(N(0, \alpha_2 - \alpha_1^2)\bigr) = \frac{1}{2}\ln 2\pi e\,(\alpha_2 - \alpha_1^2). \tag{12.32}$$
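The wiggle argument can be made concrete by grid integration: mixing a tiny mass $\varepsilon$ at a distant point $M$ into $N(0, 1)$ moves the third moment by roughly $\varepsilon M^3$ while changing the differential entropy by only $O(\varepsilon \log \frac{1}{\varepsilon})$. The sketch below tracks just the third moment and the entropy (the full argument also re-balances the first two moments); all parameters are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_stats(eps, M, s=0.05, lo=-12.0, steps=120000):
    """Differential entropy (nats) and third moment of
    f = (1 - eps) N(0,1) + eps N(M, s^2), by midpoint-rule integration."""
    hi = M + 12.0
    dx = (hi - lo) / steps
    h = 0.0
    m3 = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        f = (1 - eps) * normal_pdf(x, 0.0, 1.0) + eps * normal_pdf(x, M, s)
        if f > 0.0:
            h -= f * math.log(f) * dx
        m3 += x ** 3 * f * dx
    return h, m3

h_normal = 0.5 * math.log(2 * math.pi * math.e)  # h(N(0,1)), about 1.419 nats

# A wiggle of mass 1e-4 at M = 30 buys a third moment of roughly
# 1e-4 * 30^3 = 2.7 while barely moving the entropy.
h_mix, m3_mix = mixture_stats(1e-4, 30.0)
```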