
(a) Let the expected log likelihood be

 

E_{θ0} l_θ(X^n) = ∫ ln ( ∏_{i=1}^n f_θ(x_i) ) ∏_{i=1}^n f_{θ0}(x_i) dx^n,

 

and show that

E_{θ0}(l(X^n)) = (−h(f_{θ0}) − D(f_{θ0}||f_θ)) n.

(b) Show that the maximum over θ of the expected log likelihood is achieved by θ = θ0.

11.18 Large deviations. Let X1, X2, . . . be i.i.d. random variables drawn according to the geometric distribution

Pr{X = k} = p^{k−1}(1 − p),   k = 1, 2, . . . .

Find good estimates (to first order in the exponent) of:

(a) Pr{ (1/n) Σ_{i=1}^n X_i ≥ α }.

(b) Pr{ X_1 = k | (1/n) Σ_{i=1}^n X_i ≥ α }.

(c) Evaluate parts (a) and (b) for p = 1/2, α = 4.

11.19 Another expression for Fisher information. Use integration by parts to show that

J(θ) = −E [ (∂²/∂θ²) ln f_θ(x) ].
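The identity can also be probed numerically before it is proved. The sketch below is my own illustration (not part of the problem): it uses the normal location family N(θ, 1) as a stand-in, estimates both E[(∂/∂θ ln f_θ(X))²] and −E[(∂²/∂θ²) ln f_θ(X)] by Monte Carlo with finite differences in θ, and both come out close to the true Fisher information J(θ) = 1.

```python
# Monte Carlo check that E[(d/dθ ln f_θ(X))²] ≈ −E[(d²/dθ²) ln f_θ(X)]
# for the (illustrative) normal location family N(θ, 1); both should be ≈ 1.
import numpy as np

rng = np.random.default_rng(0)
theta, eps, n = 0.7, 1e-4, 200_000

def log_f(x, t):
    # log density of N(t, 1)
    return -0.5 * (x - t) ** 2 - 0.5 * np.log(2 * np.pi)

x = rng.normal(theta, 1.0, size=n)   # X ~ f_θ

# finite differences in θ give the score and its derivative at θ
score = (log_f(x, theta + eps) - log_f(x, theta - eps)) / (2 * eps)
second = (log_f(x, theta + eps) - 2 * log_f(x, theta) + log_f(x, theta - eps)) / eps**2

print("E[(∂/∂θ ln f)²]   ≈", np.mean(score**2))
print("−E[(∂²/∂θ²) ln f] ≈", -np.mean(second))
```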

11.20 Stirling's approximation. Derive a weak form of Stirling's approximation for factorials; that is, show that

(n/e)^n ≤ n! ≤ n (n/e)^n     (11.327)

using the approximation of integrals by sums. Justify the following steps:

ln(n!) = Σ_{i=2}^{n−1} ln(i) + ln(n) ≤ ∫_2^n ln x dx + ln n = · · ·     (11.328)


and

ln(n!) = Σ_{i=1}^n ln(i) ≥ ∫_0^n ln x dx = · · · .     (11.329)
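A quick numerical look at (11.327), as a sanity check rather than a derivation (the values of n below are arbitrary; for very small n the stated upper bound has not yet taken hold, so moderately large n are used):

```python
# Compare the logarithms of the three quantities in (11.327): n ln(n/e), ln n!, ln n + n ln(n/e).
import math

for n in (10, 50, 200, 1000):
    lower = n * (math.log(n) - 1)                  # ln (n/e)^n
    exact = math.lgamma(n + 1)                     # ln n!
    upper = math.log(n) + n * (math.log(n) - 1)    # ln n(n/e)^n
    print(f"n={n:5d}   {lower:12.3f} <= {exact:12.3f} <= {upper:12.3f}")
```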

 

 

 

11.21 Asymptotic value of (n choose k). Use the simple approximation of Problem 11.20 to show that if 0 ≤ p ≤ 1 and k = ⌊np⌋ (i.e., k is the largest integer less than or equal to np), then

lim_{n→∞} (1/n) log (n choose k) = −p log p − (1 − p) log(1 − p) = H(p).     (11.330)

Now let p_i, i = 1, . . . , m, be a probability distribution on m symbols (i.e., p_i ≥ 0 and Σ_i p_i = 1). What is the limiting value of

(1/n) log ( n choose ⌊np_1⌋, ⌊np_2⌋, . . . , ⌊np_{m−1}⌋, n − Σ_{j=1}^{m−1} ⌊np_j⌋ )
    = (1/n) log [ n! / ( ⌊np_1⌋! ⌊np_2⌋! · · · ⌊np_{m−1}⌋! ( n − Σ_{j=1}^{m−1} ⌊np_j⌋ )! ) ] ?     (11.331)
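A numerical check of the two-symbol limit (11.330), not a solution to the multinomial question above: with logs taken base 2 to match H(p), (1/n) log C(n, ⌊np⌋) creeps toward H(p) as n grows. The value p = 0.3 is an arbitrary choice.

```python
# Check that (1/n) log2 C(n, floor(np)) approaches H(p) (logs base 2).
import math

def H(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

p = 0.3
for n in (10, 100, 1000, 10_000, 100_000):
    k = math.floor(n * p)
    log2_binom = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(2)
    print(f"n={n:6d}   (1/n) log2 C(n,k) = {log2_binom / n:.4f}   H(p) = {H(p):.4f}")
```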

 

 

11.22 Running difference. Let X1, X2, . . . , Xn be i.i.d. ∼ Q1(x), and Y1, Y2, . . . , Yn be i.i.d. ∼ Q2(y). Let X^n and Y^n be independent. Find an expression for Pr{ Σ_{i=1}^n X_i − Σ_{i=1}^n Y_i ≥ nt } good to first order in the exponent. Again, this answer can be left in parametric form.

 

 

 

 

 

 

 

 

11.23 Large likelihoods. Let X1, X2, . . . be i.i.d. ∼ Q(x), x ∈ {1, 2, . . . , m}. Let P(x) be some other probability mass function. We form the log likelihood ratio

(1/n) log [ P^n(X_1, X_2, . . . , X_n) / Q^n(X_1, X_2, . . . , X_n) ] = (1/n) Σ_{i=1}^n log [ P(X_i) / Q(X_i) ]

of the sequence Xn and ask for the probability that it exceeds a certain threshold. Specifically, find (to first order in the exponent)

Q^n { (1/n) log [ P(X_1, X_2, . . . , X_n) / Q(X_1, X_2, . . . , X_n) ] > 0 }.

There may be an undetermined parameter in the answer.


11.24 Fisher information for mixtures. Let f_1(x) and f_0(x) be two given probability densities. Let Z be Bernoulli(θ), where θ is unknown. Let X ∼ f_1(x) if Z = 1 and X ∼ f_0(x) if Z = 0.

(a) Find the density f_θ(x) of the observed X.

(b) Find the Fisher information J(θ).

(c) What is the Cramér–Rao lower bound on the mean-squared error of an unbiased estimate of θ?

(d) Can you exhibit an unbiased estimator of θ?

11.25 Bent coins. Let {X_i} be i.i.d. ∼ Q, where

Q(k) = Pr(X_i = k) = (m choose k) q^k (1 − q)^{m−k}   for k = 0, 1, 2, . . . , m.

Thus, the X_i's are i.i.d. ∼ Binomial(m, q). Show that, as n → ∞,

Pr{ X_1 = k | (1/n) Σ_{i=1}^n X_i ≥ α } → P*(k),

where P* is Binomial(m, λ) (i.e., P*(k) = (m choose k) λ^k (1 − λ)^{m−k} for some λ ∈ [0, 1]). Find λ.

11.26 Conditional limiting distribution

 

 

 

 

 

 

 

 

 

 

 

 

(a) Find the exact value of

Pr{ X_1 = 1 | (1/n) Σ_{i=1}^n X_i = 1/4 }     (11.332)

if X1, X2, . . . are Bernoulli(2/3) and n is a multiple of 4.

(b) Now let X_i ∈ {−1, 0, 1}, and let X1, X2, . . . be i.i.d. uniform over {−1, 0, +1}. Find the limit of

Pr{ X_1 = +1 | (1/n) Σ_{i=1}^n X_i² = 1/2 }     (11.333)

for n = 2k, k → ∞.


11.27 Variational inequality. Verify for positive random variables X that

log E_P(X) = sup_Q [ E_Q(log X) − D(Q||P) ],     (11.334)

where E_P(X) = Σ_x x P(x) and D(Q||P) = Σ_x Q(x) log [Q(x)/P(x)], and the supremum is over all Q(x) ≥ 0, Σ_x Q(x) = 1. It is enough to extremize J(Q) = E_Q ln X − D(Q||P) + λ(Σ_x Q(x) − 1).
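Before extremizing J(Q) analytically, the inequality can be probed numerically. The sketch below (my own illustration, natural logs throughout, with a small alphabet and values chosen arbitrarily) draws many random distributions Q and confirms that E_Q(log X) − D(Q||P) never exceeds log E_P(X), while the best random Q comes close to it.

```python
# Empirical check of (11.334) on a small alphabet: for many random Q,
# E_Q[log X] − D(Q||P) stays below log E_P[X] and the best Q comes close to it.
import numpy as np

rng = np.random.default_rng(1)
m = 5
x = rng.uniform(0.5, 3.0, size=m)        # the positive values X takes on each symbol
P = rng.dirichlet(np.ones(m))            # a fixed distribution P

lhs = np.log(P @ x)                      # log E_P[X]

best = -np.inf
for _ in range(20_000):
    Q = rng.dirichlet(np.ones(m))
    rhs = Q @ np.log(x) - np.sum(Q * np.log(Q / P))   # E_Q[log X] − D(Q||P)
    best = max(best, rhs)

print("log E_P[X]                =", lhs)
print("best E_Q[log X] − D(Q||P) =", best)            # should not exceed the line above
```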

11.28 Type constraints

(a) Find constraints on the type P_{X^n} such that the sample variance X̄²_n − (X̄_n)² ≤ α, where X̄²_n = (1/n) Σ_{i=1}^n X_i² and X̄_n = (1/n) Σ_{i=1}^n X_i.

(b) Find the exponent in the probability Q^n( X̄²_n − (X̄_n)² ≤ α ).

You can leave the answer in parametric form.

11.29 Uniform distribution on the simplex. Which of these methods will generate a sample from the uniform distribution on the simplex {x ∈ R^n : x_i ≥ 0, Σ_{i=1}^n x_i = 1}?

(a) Let Y_i be i.i.d. uniform [0, 1], with X_i = Y_i / Σ_{j=1}^n Y_j.

(b) Let Y_i be i.i.d. exponentially distributed ∼ λe^{−λy}, y ≥ 0, with X_i = Y_i / Σ_{j=1}^n Y_j.

(c) (Break a stick into n parts.) Let Y_1, Y_2, . . . , Y_{n−1} be i.i.d. uniform [0, 1], and let X_i be the length of the ith interval.
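One way to build intuition (though not a proof, and it does not by itself settle which constructions are uniform) is to simulate all three methods and compare simple summary statistics; the sketch below does this for n = 3 with arbitrary sample sizes.

```python
# Generate points in the simplex by methods (a), (b), (c) for n = 3 and compare summaries.
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 200_000

def method_a():
    y = rng.uniform(0.0, 1.0, size=(N, n))
    return y / y.sum(axis=1, keepdims=True)

def method_b():
    y = rng.exponential(1.0, size=(N, n))       # the rate λ cancels after normalization
    return y / y.sum(axis=1, keepdims=True)

def method_c():
    cuts = np.sort(rng.uniform(0.0, 1.0, size=(N, n - 1)), axis=1)
    edges = np.hstack([np.zeros((N, 1)), cuts, np.ones((N, 1))])
    return np.diff(edges, axis=1)               # lengths of the n stick pieces

for name, s in (("(a)", method_a()), ("(b)", method_b()), ("(c)", method_c())):
    print(name, "Var(X1) =", round(float(s[:, 0].var()), 4),
          "  P(max coordinate > 0.8) =", round(float((s.max(axis=1) > 0.8).mean()), 4))
```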

HISTORICAL NOTES

The method of types evolved from notions of strong typicality; some of the ideas were used by Wolfowitz [566] to prove channel capacity theorems. The method was fully developed by Csiszár and Körner [149], who derived the main theorems of information theory from this viewpoint. The method of types described in Section 11.1 follows the development in Csiszár and Körner. The L1 lower bound on relative entropy is due to Csiszár [138], Kullback [336], and Kemperman [309]. Sanov's theorem [455] was generalized by Csiszár [141] using the method of types.

CHAPTER 12

MAXIMUM ENTROPY

The temperature of a gas corresponds to the average kinetic energy of the molecules in the gas. What can we say about the distribution of velocities in the gas at a given temperature? We know from physics that this distribution is the maximum entropy distribution under the temperature constraint, otherwise known as the Maxwell–Boltzmann distribution. The maximum entropy distribution corresponds to the macrostate (as indexed by the empirical distribution) that has the most microstates (the individual gas velocities). Implicit in the use of maximum entropy methods in physics is a sort of AEP which says that all microstates are equally probable.

12.1 MAXIMUM ENTROPY DISTRIBUTIONS

Consider the following problem: Maximize the entropy h(f ) over all probability densities f satisfying

1. f(x) ≥ 0, with equality outside the support set S,

2. ∫_S f(x) dx = 1,

3. ∫_S f(x) r_i(x) dx = α_i for 1 ≤ i ≤ m.     (12.1)

Thus, f is a density on support set S meeting certain moment constraints α1, α2, . . . , αm.

Approach 1 (Calculus) The differential entropy h(f) is a concave function over a convex set. We form the functional

J(f) = −∫ f ln f + λ_0 ∫ f + Σ_{i=1}^m λ_i ∫ f r_i     (12.2)

and “differentiate” with respect to f (x), the xth component of f , to obtain

 

 

∂J/∂f(x) = −ln f(x) − 1 + λ_0 + Σ_{i=1}^m λ_i r_i(x).     (12.3)


Setting this equal to zero, we obtain the form of the maximizing density

f(x) = e^{λ_0 − 1 + Σ_{i=1}^m λ_i r_i(x)},   x ∈ S,     (12.4)

where λ_0, λ_1, . . . , λ_m are chosen so that f satisfies the constraints.

 

The approach using calculus only suggests the form of the density that maximizes the entropy. To prove that this is indeed the maximum, we can take the second variation. It is simpler to use the information inequality

D(g||f ) ≥ 0.

Approach 2 (Information inequality) If g satisfies (12.1) and if f is of the form (12.4), then 0 ≤ D(g||f) = −h(g) + h(f). Thus h(g) ≤ h(f) for all g satisfying the constraints. We prove this in the following theorem.

Theorem 12.1.1 (Maximum entropy distribution)

Let f*(x) = f_λ(x) = e^{λ_0 + Σ_{i=1}^m λ_i r_i(x)}, x ∈ S, where λ_0, . . . , λ_m are chosen so that f* satisfies (12.1). Then f* uniquely maximizes h(f) over all probability densities f satisfying constraints (12.1).

Proof: Let g satisfy the constraints (12.1). Then

 

h(g) = −∫_S g ln g     (12.5)

 = −∫_S g ln ( (g/f*) f* )     (12.6)

 = −D(g||f*) − ∫_S g ln f*     (12.7)

 ≤(a) −∫_S g ln f*     (12.8)

 =(b) −∫_S g (λ_0 + Σ_i λ_i r_i)     (12.9)

 =(c) −∫_S f* (λ_0 + Σ_i λ_i r_i)     (12.10)

 = −∫_S f* ln f*     (12.11)

 = h(f*),     (12.12)

where (a) follows from the nonnegativity of relative entropy, (b) follows from the definition of f*, and (c) follows from the fact that both f* and g satisfy the constraints. Note that equality holds in (a) if and only if g(x) = f*(x) for all x, except for a set of measure 0, thus proving uniqueness.

The same approach holds for discrete entropies and for multivariate distributions.
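As a computational companion to Approach 1, the sketch below (my own illustration, not from the text) finds a maximum entropy distribution on a discretized support by choosing the multipliers λ_i numerically: it minimizes the convex log-partition dual so that the exponential-family form (12.4) meets the moment constraints. The support S = [0, 10] and the single constraint EX = 2 are arbitrary choices.

```python
# Grid sketch of Theorem 12.1.1: f(x) ∝ exp(Σ_i λ_i r_i(x)) with λ chosen by
# minimizing the dual ψ(λ) = log ∫_S exp(Σ_i λ_i r_i(x)) dx − Σ_i λ_i α_i.
import numpy as np
from scipy.optimize import minimize

x = np.linspace(0.0, 10.0, 2001)           # discretized support S = [0, 10]
dx = x[1] - x[0]
r = [lambda t: t]                          # constraint functions r_i(x)
alpha = np.array([2.0])                    # target moments α_i (here E X = 2)
R = np.stack([ri(x) for ri in r])          # shape (m, number of grid points)

def dual(lam):
    logw = lam @ R                         # Σ_i λ_i r_i(x) on the grid
    a = logw.max()                         # stabilize the log-sum-exp
    return a + np.log(np.sum(np.exp(logw - a)) * dx) - lam @ alpha

lam = minimize(dual, x0=np.zeros(len(alpha))).x
w = np.exp(lam @ R)
f = w / (w.sum() * dx)                     # normalized maximum entropy density on the grid

print("λ =", lam)                          # close to −1/2: an exponential-type density with mean 2
print("E X under f ≈", float(np.sum(x * f) * dx))
```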

12.2 EXAMPLES

Example 12.2.1 (One-dimensional gas with a temperature constraint) Let the constraints be EX = 0 and EX² = σ². Then the form of the maximizing distribution is

f(x) = e^{λ_0 + λ_1 x + λ_2 x²}.     (12.13)

To find the appropriate constants, we first recognize that this distribution has the same form as a normal distribution. Hence, the density that satisfies the constraints and also maximizes the entropy is the N(0, σ²) distribution:

 

f(x) = (1/√(2πσ²)) e^{−x²/(2σ²)}.     (12.14)
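As a small numerical illustration of the theorem in this example, the closed-form differential entropies of two arbitrary competitors with the same mean 0 and variance σ² (a Laplace and a uniform density) fall below that of the normal:

```python
# Differential entropies (nats) of three zero-mean densities with variance σ²;
# Theorem 12.1.1 says the normal must come out largest.
import math

sigma2 = 1.0
h_normal  = 0.5 * math.log(2 * math.pi * math.e * sigma2)
h_laplace = 1.0 + math.log(2 * math.sqrt(sigma2 / 2))   # Laplace(b) with 2b² = σ²
h_uniform = math.log(2 * math.sqrt(3 * sigma2))         # Uniform[−a, a] with a²/3 = σ²

print("normal :", round(h_normal, 4))
print("laplace:", round(h_laplace, 4))
print("uniform:", round(h_uniform, 4))
```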

Example 12.2.2 (Dice, no constraints) Let S = {1, 2, 3, 4, 5, 6}. The distribution that maximizes the entropy is the uniform distribution, p(x) = 1/6 for x ∈ S.

 

 

 

 

 

 

 

Example 12.2.3 (Dice, with EX = Σ_i i p_i = α) This important example was used by Boltzmann. Suppose that n dice are thrown on the table and we are told that the total number of spots showing is nα. What proportion of the dice are showing face i, i = 1, 2, . . . , 6?

One way of going about this is to count the number of ways that n dice can fall so that n_i dice show face i. There are (n choose n_1, n_2, . . . , n_6) such ways. This is a macrostate indexed by (n_1, n_2, . . . , n_6) corresponding to (n choose n_1, n_2, . . . , n_6) microstates, each having probability 1/6^n. To find the most probable macrostate, we wish to maximize (n choose n_1, n_2, . . . , n_6) under the constraint observed on the total number of spots,

Σ_{i=1}^6 i n_i = nα.     (12.15)

Using a crude Stirling's approximation, n! ≈ (n/e)^n, we find that

 

 

 

 

 

 

 

(n choose n_1, n_2, . . . , n_6) ≈ (n/e)^n / ∏_{i=1}^6 (n_i/e)^{n_i}     (12.16)


 

 

 

= ∏_{i=1}^6 (n/n_i)^{n_i}     (12.17)

= e^{nH(n_1/n, n_2/n, . . . , n_6/n)}.     (12.18)

Thus, maximizing (n choose n_1, n_2, . . . , n_6) under the constraint (12.15) is almost equivalent to maximizing H(p_1, p_2, . . . , p_6) under the constraint Σ_i i p_i = α. Using Theorem 12.1.1 under this constraint, we find the maximum entropy probability mass function to be

 

 

 

 

p_i = e^{λi} / Σ_{i=1}^6 e^{λi},     (12.19)

where λ is chosen so that Σ_i i p_i = α. Thus, the most probable macrostate is (np_1, np_2, . . . , np_6), and we expect to find n_i = np_i dice showing face i.
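In practice λ in (12.19) is found by one-dimensional root finding. The sketch below does this for an arbitrary illustrative value α = 4.5 (the observed average number of spots per die):

```python
# Solve for λ in (12.19) so that the maximum entropy die has average face value α.
import numpy as np
from scipy.optimize import brentq

faces = np.arange(1, 7)
alpha = 4.5                                   # illustrative observed average number of spots

def mean_spots(lam):
    w = np.exp(lam * faces)
    p = w / w.sum()
    return float(p @ faces)

lam = brentq(lambda l: mean_spots(l) - alpha, -10.0, 10.0)   # the mean is increasing in λ
p = np.exp(lam * faces)
p /= p.sum()
print("λ   =", round(lam, 4))
print("p_i =", np.round(p, 4))                # expected proportions of dice showing each face
```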

In Chapter 11 we show that the reasoning and the approximations are essentially correct. In fact, we show that not only is the maximum entropy macrostate the most likely, but it also contains almost all of the probability. Specifically, for rational α,

Pr( |N_i/n − p_i| < ε, i = 1, 2, . . . , 6 | Σ_{i=1}^n X_i = nα ) → 1,     (12.20)
as n → ∞ along the subsequence such that nα is an integer.

Example 12.2.4 Let S = [a, b], with no other constraints. Then the maximum entropy distribution is the uniform distribution over this range.

Example 12.2.5 S = [0, ∞) and EX = µ. Then the entropy-maximizing distribution is

f(x) = (1/µ) e^{−x/µ},   x ≥ 0.     (12.21)

This problem has a physical interpretation. Consider the distribution of the height X of molecules in the atmosphere. The average potential energy of the molecules is fixed, and the gas tends to the distribution that has the maximum entropy subject to the constraint that E(mgX) is fixed. This is the exponential distribution with density f(x) = λe^{−λx}, x ≥ 0. The density of the atmosphere does indeed have this distribution.
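As in Example 12.2.1, a quick numerical comparison supports the claim; the half-normal and a gamma density are arbitrary competitors on [0, ∞) with the same mean µ, and their closed-form differential entropies (via scipy) come out below that of the exponential:

```python
# Differential entropies (nats) of three densities on [0, ∞) with the same mean µ;
# the exponential should come out on top.
import math
from scipy import stats

mu = 2.0
candidates = {
    "exponential": stats.expon(scale=mu),                              # mean = scale
    "half-normal": stats.halfnorm(scale=mu * math.sqrt(math.pi / 2)),  # mean = scale·sqrt(2/π)
    "gamma(2)":    stats.gamma(a=2, scale=mu / 2),                     # mean = a·scale
}
for name, dist in candidates.items():
    print(f"{name:12s} mean = {float(dist.mean()):.3f}   h = {float(dist.entropy()):.4f}")
```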


Example 12.2.6 S = (−∞, ∞), and EX = µ. Here the maximum entropy is infinite, and there is no maximum entropy distribution. (Consider normal distributions with larger and larger variances.)

Example 12.2.7 S = (−∞, ∞), EX = α_1, and EX² = α_2. The maximum entropy distribution is N(α_1, α_2 − α_1²).

Example 12.2.8 S = R^n, EX_iX_j = K_ij, 1 ≤ i, j ≤ n. This is a multivariate example, but the same analysis holds and the maximum entropy

density is of the form

f(x) = e^{λ_0 + Σ_{i,j} λ_{ij} x_i x_j}.     (12.22)

Since the exponent is a quadratic form, it is clear by inspection that the density is a multivariate normal with zero mean. Since we have to satisfy the second moment constraints, we must have a multivariate normal with covariance Kij , and hence the density is

f(x) = ( 1 / ( (√(2π))^n |K|^{1/2} ) ) e^{−(1/2) x^T K^{−1} x},     (12.23)

which has an entropy

 

 

 

 

 

 

h(N_n(0, K)) = (1/2) log (2πe)^n |K|,     (12.24)

as derived in Chapter 8.

Example 12.2.9 Suppose that we have the same constraints as in Example 12.2.8, but EX_iX_j = K_ij only for some restricted set of (i, j) ∈ A. For example, we might know only K_ij for i = j ± 2. Then by comparing (12.22) and (12.23), we can conclude that (K^{−1})_ij = 0 for (i, j) ∈ A^c (i.e., the entries in the inverse of the covariance matrix are 0 when (i, j) is outside the constraint set).
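This conclusion can be seen in a small experiment of my own (a sketch, not from the text): fix the entries K_ij on a banded constraint set A, maximize log |K| over the remaining free entries (by (12.24) this is the same as maximizing the entropy of a zero-mean Gaussian), and check that the inverse of the optimizer is numerically zero off A. The 4 × 4 example and the band |i − j| ≤ 1 are arbitrary.

```python
# Maximum entropy completion of a partially specified covariance: fix K_ij on the
# band |i − j| ≤ 1, maximize log det K over the free entries, then check that the
# inverse of the optimizer is (numerically) zero off the band.
import numpy as np
from scipy.optimize import minimize

n = 4
rng = np.random.default_rng(0)
B = rng.normal(size=(n, n))
K_full = B @ B.T + n * np.eye(n)                                        # some positive definite matrix
A    = [(i, j) for i in range(n) for j in range(i, n) if j - i <= 1]    # constrained entries
free = [(i, j) for i in range(n) for j in range(i, n) if j - i > 1]     # entries to optimize over

def build(v):
    K = np.zeros((n, n))
    for (i, j) in A:
        K[i, j] = K[j, i] = K_full[i, j]
    for (i, j), val in zip(free, v):
        K[i, j] = K[j, i] = val
    return K

def neg_logdet(v):
    sign, logdet = np.linalg.slogdet(build(v))
    return -logdet if sign > 0 else 1e6          # reject non-positive-definite completions

v0 = np.array([K_full[i, j] for (i, j) in free])
res = minimize(neg_logdet, v0, method="Nelder-Mead",
               options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 5000})
J = np.linalg.inv(build(res.x))
print("off-band entries of K^{-1}:", [round(float(J[i, j]), 4) for (i, j) in free])  # ≈ 0
```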

12.3 ANOMALOUS MAXIMUM ENTROPY PROBLEM

We have proved that the maximum entropy distribution subject to the constraints

∫_S h_i(x) f(x) dx = α_i     (12.25)

is of the form

f(x) = e^{λ_0 + Σ_i λ_i h_i(x)}     (12.26)

if λ_0, λ_1, . . . , λ_p satisfying the constraints (12.25) exist.


We now consider a tricky problem in which the λi cannot be chosen to satisfy the constraints. Nonetheless, the “maximum” entropy can be found. We consider the following problem: Maximize the entropy subject

to the constraints

 

∫_{−∞}^{∞} f(x) dx = 1,     (12.27)

∫_{−∞}^{∞} x f(x) dx = α_1,     (12.28)

∫_{−∞}^{∞} x² f(x) dx = α_2,     (12.29)

∫_{−∞}^{∞} x³ f(x) dx = α_3.     (12.30)

Here, the maximum entropy distribution, if it exists, must be of the form

f(x) = e^{λ_0 + λ_1 x + λ_2 x² + λ_3 x³}.     (12.31)

But if λ_3 is nonzero, ∫_{−∞}^{∞} f = ∞ and the density cannot be normalized. So λ_3 must be 0. But then we have four equations and only three variables, so that in general it is not possible to choose the appropriate constants. The method seems to have failed in this case.

The reason for the apparent failure is simple: The entropy has a least upper bound under these constraints, but it is not possible to attain it. Consider the corresponding problem with only first and second moment constraints. In this case, the results of Example 12.2.1 show that the entropy-maximizing distribution is the normal with the appropriate moments. With the additional third moment constraint, the maximum entropy cannot be higher. Is it possible to achieve this value?

We cannot achieve it, but we can come arbitrarily close. Consider a normal distribution with a small “wiggle” at a very high value of x. The moments of the new distribution are almost the same as those of the old one, the biggest change being in the third moment. We can bring the first and second moments back to their original values by adding new wiggles to balance out the changes caused by the first. By choosing the position of the wiggles, we can get any value of the third moment without reducing the entropy significantly below that of the associated normal. Using this method, we can come arbitrarily close to the upper bound for the maximum entropy distribution. We conclude that

sup h(f) = h(N(0, α_2 − α_1²)) = (1/2) ln 2πe(α_2 − α_1²).     (12.32)
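The wiggle argument can be made concrete numerically. In the sketch below (my own illustration, with α_1 = 0, α_2 = 1, and a target third moment of 1), a standard normal is mixed with a tiny bump placed far out at x_0 = (1/ε)^{1/3}; as ε shrinks, the first two moments return to (0, 1), the third moment stays near 1, and the differential entropy approaches h(N(0, 1)) = (1/2) ln 2πe ≈ 1.4189.

```python
# f_ε = (1 − ε) N(0, 1) + ε N(x0, 0.1) with x0 = (1/ε)^(1/3): the third moment stays
# near 1 while the first two moments and the entropy approach those of N(0, 1).
import numpy as np

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    x0 = (1.0 / eps) ** (1.0 / 3.0)
    x = np.linspace(-15.0, x0 + 5.0, 400_000)
    f = (1 - eps) * normal_pdf(x, 0.0, 1.0) + eps * normal_pdf(x, x0, 0.1)
    m1 = np.trapz(x * f, x)
    m2 = np.trapz(x**2 * f, x)
    m3 = np.trapz(x**3 * f, x)
    integrand = np.zeros_like(f)
    pos = f > 0
    integrand[pos] = f[pos] * np.log(f[pos])
    h = -np.trapz(integrand, x)
    print(f"ε={eps:.0e}  m1={m1:+.3f}  m2={m2:.3f}  m3={m3:.3f}  h={h:.4f}")

print("h(N(0,1)) =", round(0.5 * np.log(2 * np.pi * np.e), 4))
```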