
# Information Theory / Cover T.M., Thomas J.A. Elements of Information Theory. 2006, 748p

PROBLEMS 405

(a) Let the expected log likelihood be

$$E_{\theta_0} l_\theta(X^n) = \int \ln\!\left(\prod_{i=1}^{n} f_\theta(x_i)\right) \prod_{i=1}^{n} f_{\theta_0}(x_i)\, dx^n,$$

and show that

$$E_{\theta_0}(l(X^n)) = \bigl(-h(f_{\theta_0}) - D(f_{\theta_0}\|f_\theta)\bigr)\, n.$$

(b) Show that the maximum over θ of the expected log likelihood is achieved by θ = θ0.

11.18 Large deviations. Let $X_1, X_2, \ldots$ be i.i.d. random variables drawn according to the geometric distribution

$$\Pr\{X = k\} = p^{k-1}(1-p), \qquad k = 1, 2, \ldots.$$

Find good estimates (to first order in the exponent) of:

(a) $\Pr\left\{\frac{1}{n}\sum_{i=1}^{n} X_i \ge \alpha\right\}$.

(b) $\Pr\left\{X_1 = k \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} X_i \ge \alpha\right\}$.

(c) Evaluate parts (a) and (b) for $p = \frac{1}{2}$, $\alpha = 4$.
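A Sanov/Cramér-type calculation suggests the first-order exponent in part (a) is $D(\text{geometric with mean }\alpha \,\|\, \text{geometric}(p))$. The sketch below assumes that form of the minimizer (parameter $p^* = 1 - 1/\alpha$, the standard tilted-family argument, not the text's worked answer) and cross-checks the closed-form divergence against direct summation for part (c)'s numbers.

```python
import math

def kl_geometric(p_star, p):
    """D(geom(p_star) || geom(p)) in nats, closed form.

    geom(p): Pr{X = k} = p^(k-1) (1-p), k = 1, 2, ...; mean 1/(1-p)."""
    mean_star = 1.0 / (1.0 - p_star)
    return (mean_star - 1.0) * math.log(p_star / p) \
        + math.log((1.0 - p_star) / (1.0 - p))

def kl_geometric_series(p_star, p, terms=500):
    """The same divergence summed term by term, as a cross-check."""
    total = 0.0
    for k in range(1, terms + 1):
        q1 = p_star ** (k - 1) * (1.0 - p_star)
        q0 = p ** (k - 1) * (1.0 - p)
        total += q1 * math.log(q1 / q0)
    return total

# Part (c): p = 1/2, alpha = 4.  The assumed minimizer is the geometric
# distribution with mean 4, i.e. parameter p_star = 1 - 1/4 = 3/4.
exponent_nats = kl_geometric(0.75, 0.5)
```

For these numbers the exponent works out to $3\ln\frac{3}{2} - \ln 2 \approx 0.523$ nats per sample, so the probability in (a) decays like $e^{-0.523\,n}$ to first order, under the assumptions above.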

11.19 Another expression for Fisher information. Use integration by parts to show that

$$J(\theta) = -E\left[\frac{\partial^2}{\partial \theta^2} \ln f_\theta(x)\right].$$
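For a quick numerical check of this identity, take the normal location family $f_\theta = N(\theta, 1)$ (an illustrative choice, for which $J(\theta) = 1$): the score-squared definition of Fisher information and the second-derivative form above can both be evaluated by quadrature and compared.

```python
import math

def normal_pdf(x, theta):
    """Density of N(theta, 1)."""
    return math.exp(-(x - theta) ** 2 / 2.0) / math.sqrt(2 * math.pi)

def fisher_score_form(theta, h=1e-5, lo=-10.0, hi=10.0, steps=4000):
    """J(theta) = E[(d/dtheta ln f_theta(X))^2], central differences + midpoint rule."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        score = (math.log(normal_pdf(x, theta + h))
                 - math.log(normal_pdf(x, theta - h))) / (2 * h)
        total += score ** 2 * normal_pdf(x, theta) * dx
    return total

def fisher_curvature_form(theta, h=1e-4, lo=-10.0, hi=10.0, steps=4000):
    """J(theta) = -E[d^2/dtheta^2 ln f_theta(X)], the integration-by-parts form."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        d2 = (math.log(normal_pdf(x, theta + h))
              - 2 * math.log(normal_pdf(x, theta))
              + math.log(normal_pdf(x, theta - h))) / h ** 2
        total -= d2 * normal_pdf(x, theta) * dx
    return total

J_score = fisher_score_form(0.0)      # both should be ~1 for N(theta, 1)
J_curv = fisher_curvature_form(0.0)
```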

11.20 Stirling's approximation. Derive a weak form of Stirling's approximation for factorials; that is, show that

$$\left(\frac{n}{e}\right)^n \le n! \le n\left(\frac{n}{e}\right)^n \tag{11.327}$$

using the approximation of integrals by sums. Justify the following steps:

$$\ln(n!) = \sum_{i=2}^{n-1} \ln(i) + \ln(n) \le \int_{2}^{n} \ln x \, dx + \ln n = \cdots \tag{11.328}$$

406 INFORMATION THEORY AND STATISTICS

and

$$\ln(n!) = \sum_{i=1}^{n} \ln(i) \ge \int_{0}^{n} \ln x \, dx = \cdots. \tag{11.329}$$
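Both bounds are easy to check numerically in the log domain with `math.lgamma`. One caveat worth knowing: as printed, the upper bound $n! \le n(n/e)^n$ is loose for very small $n$ (at $n = 6$, $n! = 720$ while $n(n/e)^n \approx 694$), but it holds from $n = 7$ onward; the sketch below tests that range.

```python
import math

def log_factorial(n):
    """ln(n!) via the log-gamma function."""
    return math.lgamma(n + 1)

def weak_stirling_holds(n):
    """Check (n/e)^n <= n! <= n (n/e)^n, comparing natural logarithms."""
    lower = n * (math.log(n) - 1.0)                 # ln (n/e)^n
    upper = math.log(n) + n * (math.log(n) - 1.0)   # ln n(n/e)^n
    return lower <= log_factorial(n) <= upper
```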

11.21 Asymptotic value of $\binom{n}{k}$. Use the simple approximation of Problem 11.20 to show that if $0 \le p \le 1$ and $k = \lfloor np \rfloor$ (i.e., $k$ is the largest integer less than or equal to $np$), then

$$\lim_{n\to\infty} \frac{1}{n}\log\binom{n}{k} = -p\log p - (1-p)\log(1-p) = H(p). \tag{11.330}$$

Now let $p_i$, $i = 1, \ldots, m$, be a probability distribution on $m$ symbols (i.e., $p_i \ge 0$ and $\sum_i p_i = 1$). What is the limiting value of

$$\frac{1}{n}\log \binom{n}{np_1\ np_2\ \cdots\ np_{m-1}\ \ n - \sum_{j=1}^{m-1} np_j} = \frac{1}{n}\log\frac{n!}{(np_1)!\,(np_2)!\cdots(np_{m-1})!\,\bigl(n - \sum_{j=1}^{m-1} np_j\bigr)!}\;? \tag{11.331}$$
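The limiting value in (11.331) should be the entropy $H(p_1, \ldots, p_m)$, by the same bounds as in Problem 11.20. A numeric sketch (the distribution below is an arbitrary choice) compares the normalized log multinomial coefficient at large $n$ with $H$:

```python
import math

def log2_multinomial(counts):
    """log2 of n!/(n_1! ... n_m!) via lgamma, where n = sum of counts."""
    n = sum(counts)
    val = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in counts)
    return val / math.log(2)

def entropy_bits(p):
    """H(p) in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def normalized_log_multinomial(p, n):
    """(1/n) log2 of the coefficient with counts floor(n p_j) and the
    remainder n - sum(...) in the last slot, as in (11.331)."""
    counts = [math.floor(n * pj) for pj in p[:-1]]
    counts.append(n - sum(counts))
    return log2_multinomial(counts) / n

p = (0.5, 0.3, 0.2)
H = entropy_bits(p)
approx = normalized_log_multinomial(p, 100000)   # should be close to H
```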

11.22 Running difference. Let $X_1, X_2, \ldots, X_n$ be i.i.d. $\sim Q_1(x)$, and $Y_1, Y_2, \ldots, Y_n$ be i.i.d. $\sim Q_2(y)$. Let $X^n$ and $Y^n$ be independent. Find an expression for

$$\Pr\left\{\sum_{i=1}^{n} X_i - \sum_{i=1}^{n} Y_i \ge nt\right\}$$

good to first order in the exponent. Again, this answer can be left in parametric form.

11.23 Large likelihoods. Let $X_1, X_2, \ldots$ be i.i.d. $\sim Q(x)$, $x \in \{1, 2, \ldots, m\}$. Let $P(x)$ be some other probability mass function. We

form the log likelihood ratio

$$\frac{1}{n}\log\frac{P^n(X_1, X_2, \ldots, X_n)}{Q^n(X_1, X_2, \ldots, X_n)} = \frac{1}{n}\sum_{i=1}^{n}\log\frac{P(X_i)}{Q(X_i)}$$

of the sequence $X^n$ and ask for the probability that it exceeds a certain threshold. Specifically, find (to first order in the exponent)

$$Q^n\left\{\frac{1}{n}\log\frac{P(X_1, X_2, \ldots, X_n)}{Q(X_1, X_2, \ldots, X_n)} > 0\right\}.$$

There may be an undetermined parameter in the answer.
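By Sanov's theorem, the exponent here should be $\min D(P'\|Q)$ over distributions $P'$ satisfying $\sum_x P'(x)\log\frac{P(x)}{Q(x)} \ge 0$ (the set on which the empirical log likelihood ratio is positive). A brute-force sketch over binary types, with $P$ and $Q$ chosen arbitrarily for illustration:

```python
import math

def kl_bits(p, q):
    """D(p||q) in bits for distributions given as tuples."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sanov_exponent(P, Q, grid=100000):
    """min D(P'||Q) over binary P' with E_{P'} log2(P/Q) >= 0,
    scanned over a fine grid of candidate types."""
    llr = [math.log2(P[i] / Q[i]) for i in range(2)]
    best = float("inf")
    for k in range(grid + 1):
        a = k / grid
        if a * llr[0] + (1 - a) * llr[1] >= 0:
            best = min(best, kl_bits((a, 1 - a), Q))
    return best

P = (0.9, 0.1)
Q = (0.5, 0.5)
exponent_bits = sanov_exponent(P, Q)   # Q^n{...} ~ 2^(-n * exponent_bits)
```

The minimizer sits on the boundary $\sum_x P'(x)\log(P(x)/Q(x)) = 0$ (here $P' \approx (0.732, 0.268)$, giving an exponent of about 0.162 bit), and it never exceeds $D(P\|Q)$, since $P$ itself is feasible.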


11.24 Fisher information for mixtures. Let $f_1(x)$ and $f_0(x)$ be two given probability densities. Let $Z$ be Bernoulli($\theta$), where $\theta$ is unknown. Let $X \sim f_1(x)$ if $Z = 1$ and $X \sim f_0(x)$ if $Z = 0$.

(a) Find the density $f_\theta(x)$ of the observed $X$.

(b) Find the Fisher information $J(\theta)$.

(c) What is the Cramér-Rao lower bound on the mean-squared error of an unbiased estimate of $\theta$?

(d) Can you exhibit an unbiased estimator of $\theta$?

11.25 Bent coins. Let $\{X_i\}$ be i.i.d. $\sim Q$, where

$$Q(k) = \Pr(X_i = k) = \binom{m}{k} q^k (1-q)^{m-k} \qquad \text{for } k = 0, 1, 2, \ldots, m.$$

Thus, the $X_i$'s are i.i.d. $\sim$ Binomial($m, q$). Show that as $n \to \infty$,

$$\Pr\left\{X_1 = k \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} X_i \ge \alpha\right\} \to P(k),$$

where $P$ is Binomial($m, \lambda$) (i.e., $P(k) = \binom{m}{k}\lambda^k(1-\lambda)^{m-k}$ for some $\lambda \in [0, 1]$). Find $\lambda$.

11.26 Conditional limiting distribution

(a) Find the exact value of

$$\Pr\left\{X_1 = 1 \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} X_i = \frac{1}{4}\right\} \tag{11.332}$$

if $X_1, X_2, \ldots$ are i.i.d. $\sim$ Bernoulli($\tfrac{2}{3}$) and $n$ is a multiple of 4.

(b) Now let $X_i \in \{-1, 0, 1\}$, and let $X_1, X_2, \ldots$ be i.i.d. uniform over $\{-1, 0, +1\}$. Find the limit of

$$\Pr\left\{X_1 = +1 \,\middle|\, \frac{1}{n}\sum_{i=1}^{n} X_i^2 = \frac{1}{2}\right\} \tag{11.333}$$

for n = 2k, k → ∞.
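Part (a) can be checked by exact enumeration: conditioned on the sum, an i.i.d. Bernoulli sequence is uniform over all sequences of that type, so the conditional probability is exactly $k/n = \frac{1}{4}$ whatever the bias. A small exact-arithmetic sketch ($n = 8$ is an assumed small case, not from the text):

```python
from itertools import product
from fractions import Fraction

def conditional_prob_first_is_one(n, k, p):
    """Pr{X1 = 1 | sum_i Xi = k} for i.i.d. Bernoulli(p), by exact enumeration."""
    num = Fraction(0)
    den = Fraction(0)
    for seq in product((0, 1), repeat=n):
        if sum(seq) != k:
            continue
        w = Fraction(1)  # probability weight of this sequence
        for x in seq:
            w *= p if x == 1 else 1 - p
        den += w
        if seq[0] == 1:
            num += w
    return num / den

# n = 8, sum fixed at n/4 = 2 ones, bias 2/3: the answer is exactly 1/4.
val = conditional_prob_first_is_one(8, 2, Fraction(2, 3))
```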


11.27 Variational inequality. Verify for positive random variables $X$ that

$$\log E_P(X) = \sup_Q \Bigl[\, E_Q(\log X) - D(Q\|P) \,\Bigr], \tag{11.334}$$

where $E_P(X) = \sum_x x P(x)$ and $D(Q\|P) = \sum_x Q(x)\log\frac{Q(x)}{P(x)}$, and the supremum is over all $Q(x) \ge 0$, $\sum_x Q(x) = 1$. It is enough to extremize

$$J(Q) = E_Q \ln X - D(Q\|P) + \lambda\left(\sum_x Q(x) - 1\right).$$
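A standard way to verify (11.334) numerically: the supremum is attained by the tilted distribution $Q^*(x) \propto P(x)\,x$, for which $E_{Q^*}(\log X) - D(Q^*\|P)$ collapses to $\log E_P(X)$ exactly. A sketch (the alphabet and values below are arbitrary choices):

```python
import math

def objective(Q, P, xvals):
    """E_Q(log X) - D(Q||P), the functional being supremized in (11.334)."""
    total = 0.0
    for q, p, x in zip(Q, P, xvals):
        if q > 0:
            total += q * math.log(x) - q * math.log(q / p)
    return total

P = (0.2, 0.5, 0.3)
xvals = (1.0, 4.0, 9.0)

lhs = math.log(sum(p * x for p, x in zip(P, xvals)))     # log E_P(X)
Z = sum(p * x for p, x in zip(P, xvals))
Qstar = tuple(p * x / Z for p, x in zip(P, xvals))       # tilted optimizer
sup_val = objective(Qstar, P, xvals)
```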

11.28 Type constraints

(a) Find constraints on the type $P_{X^n}$ such that the sample variance satisfies $\overline{X_n^2} - (\overline{X}_n)^2 \le \alpha$, where $\overline{X_n^2} = \frac{1}{n}\sum_{i=1}^{n} X_i^2$ and $\overline{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$.

(b) Find the exponent in the probability $Q^n\left(\overline{X_n^2} - (\overline{X}_n)^2 \le \alpha\right)$.

You can leave the answer in parametric form.

11.29 Uniform distribution on the simplex. Which of these methods will generate a sample from the uniform distribution on the simplex $\{x \in \mathbb{R}^n : x_i \ge 0,\ \sum_{i=1}^{n} x_i = 1\}$?

(a) Let $Y_i$ be i.i.d. uniform [0, 1], with $X_i = Y_i / \sum_{j=1}^{n} Y_j$.

(b) Let $Y_i$ be i.i.d. exponentially distributed $\sim \lambda e^{-\lambda y}$, $y \ge 0$, with $X_i = Y_i / \sum_{j=1}^{n} Y_j$.

(c) (Break stick into $n$ parts) Let $Y_1, Y_2, \ldots, Y_{n-1}$ be i.i.d. uniform [0, 1], and let $X_i$ be the length of the $i$th interval.
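Methods (b) and (c) both yield the uniform distribution (normalized i.i.d. exponentials are Dirichlet(1, ..., 1), and uniform spacings are exchangeable), under which $\Pr\{X_1 > t\} = (1-t)^{n-1}$. A Monte Carlo sketch of that marginal (method (a) does not pass this test and is left to the reader):

```python
import random

def sample_simplex_exponential(n, rng):
    """Method (b): normalize i.i.d. Exp(1) variables."""
    y = [rng.expovariate(1.0) for _ in range(n)]
    s = sum(y)
    return [v / s for v in y]

def sample_simplex_breakstick(n, rng):
    """Method (c): break [0,1] at n-1 i.i.d. uniform points; take the spacings."""
    cuts = sorted(rng.random() for _ in range(n - 1))
    pts = [0.0] + cuts + [1.0]
    return [pts[i + 1] - pts[i] for i in range(n)]

def tail_freq_x1(sampler, n, t, trials, seed):
    """Estimate Pr{X1 > t}; uniform on the simplex gives (1-t)^(n-1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(trials) if sampler(n, rng)[0] > t)
    return hits / trials

# n = 3, t = 0.5: the uniform-simplex value is (1 - 0.5)^2 = 0.25.
freq_exp = tail_freq_x1(sample_simplex_exponential, 3, 0.5, 40000, seed=1)
freq_stick = tail_freq_x1(sample_simplex_breakstick, 3, 0.5, 40000, seed=2)
```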

HISTORICAL NOTES

The method of types evolved from notions of strong typicality; some of the ideas were used by Wolfowitz to prove channel capacity theorems. The method was fully developed by Csiszár and Körner, who derived the main theorems of information theory from this viewpoint. The method of types described in Section 11.1 follows the development in Csiszár and Körner. The L1 lower bound on relative entropy is due to Csiszár, Kullback, and Kemperman. Sanov's theorem was generalized by Csiszár using the method of types.

CHAPTER 12

MAXIMUM ENTROPY

The temperature of a gas corresponds to the average kinetic energy of the molecules in the gas. What can we say about the distribution of velocities in the gas at a given temperature? We know from physics that this distribution is the maximum entropy distribution under the temperature constraint, otherwise known as the Maxwell–Boltzmann distribution. The maximum entropy distribution corresponds to the macrostate (as indexed by the empirical distribution) that has the most microstates (the individual gas velocities). Implicit in the use of maximum entropy methods in physics is a sort of AEP which says that all microstates are equally probable.

12.1 MAXIMUM ENTROPY DISTRIBUTIONS

Consider the following problem: Maximize the entropy h(f ) over all probability densities f satisfying

1. $f(x) \ge 0$, with equality outside the support set $S$,
2. $\int_S f(x)\, dx = 1$,
3. $\int_S f(x)\, r_i(x)\, dx = \alpha_i$ for $1 \le i \le m$.   (12.1)

Thus, $f$ is a density on support set $S$ meeting certain moment constraints $\alpha_1, \alpha_2, \ldots, \alpha_m$.

Approach 1 (Calculus) The differential entropy $h(f)$ is a concave function over a convex set. We form the functional

$$J(f) = -\int f \ln f + \lambda_0 \int f + \sum_{i=1}^{m} \lambda_i \int f r_i \tag{12.2}$$

and “differentiate” with respect to f (x), the xth component of f , to obtain

$$\frac{\partial J}{\partial f(x)} = -\ln f(x) - 1 + \lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x). \tag{12.3}$$

Elements of Information Theory, Second Edition, by Thomas M. Cover and Joy A. Thomas. Copyright 2006 John Wiley & Sons, Inc.



Setting this equal to zero, we obtain the form of the maximizing density

$$f(x) = e^{\lambda_0 - 1 + \sum_{i=1}^{m} \lambda_i r_i(x)}, \qquad x \in S, \tag{12.4}$$

where $\lambda_0, \lambda_1, \ldots, \lambda_m$ are chosen so that $f$ satisfies the constraints.

The approach using calculus only suggests the form of the density that maximizes the entropy. To prove that this is indeed the maximum, we can take the second variation. It is simpler to use the information inequality

D(g||f ) ≥ 0.

Approach 2 (Information inequality) If $g$ satisfies (12.1) and if $f$ is of the form (12.4), then $0 \le D(g\|f) = -h(g) + h(f)$. Thus $h(g) \le h(f)$ for all $g$ satisfying the constraints. We prove this in the following theorem.

Theorem 12.1.1 (Maximum entropy distribution) Let $f(x) = f_\lambda(x) = e^{\lambda_0 + \sum_{i=1}^{m} \lambda_i r_i(x)}$, $x \in S$, where $\lambda_0, \ldots, \lambda_m$ are chosen so that $f$ satisfies (12.1). Then $f$ uniquely maximizes $h(f)$ over all probability densities $f$ satisfying constraints (12.1).

Proof: Let $g$ satisfy the constraints (12.1). Then

$$
\begin{aligned}
h(g) &= -\int_S g \ln g && (12.5)\\
&= -\int_S g \ln \frac{g}{f} f && (12.6)\\
&= -D(g\|f) - \int_S g \ln f && (12.7)\\
&\stackrel{(a)}{\le} -\int_S g \ln f && (12.8)\\
&\stackrel{(b)}{=} -\int_S g \Bigl(\lambda_0 + \sum_i \lambda_i r_i\Bigr) && (12.9)\\
&\stackrel{(c)}{=} -\int_S f \Bigl(\lambda_0 + \sum_i \lambda_i r_i\Bigr) && (12.10)\\
&= -\int_S f \ln f && (12.11)\\
&= h(f), && (12.12)
\end{aligned}
$$

where (a) follows from the nonnegativity of relative entropy, (b) follows from the definition of $f$, and (c) follows from the fact that both $f$ and $g$ satisfy the constraints. Note that equality holds in (a) if and only

if g(x) = f (x) for all x, except for a set of measure 0, thus proving uniqueness.

The same approach holds for discrete entropies and for multivariate distributions.

12.2 EXAMPLES

Example 12.2.1 (One-dimensional gas with a temperature constraint) Let the constraints be $EX = 0$ and $EX^2 = \sigma^2$. Then the form of the maximizing distribution is

$$f(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2}. \tag{12.13}$$

To find the appropriate constants, we first recognize that this distribution has the same form as a normal distribution. Hence, the density that satisfies the constraints and also maximizes the entropy is the $N(0, \sigma^2)$ distribution:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{x^2}{2\sigma^2}}. \tag{12.14}$$

Example 12.2.2 (Dice, no constraints) Let $S = \{1, 2, 3, 4, 5, 6\}$. The distribution that maximizes the entropy is the uniform distribution, $p(x) = \frac{1}{6}$ for $x \in S$.

Example 12.2.3 (Dice, with $EX = \sum i p_i = \alpha$) This important example was used by Boltzmann. Suppose that $n$ dice are thrown on the table and we are told that the total number of spots showing is $n\alpha$. What proportion of the dice are showing face $i$, $i = 1, 2, \ldots, 6$?

One way of going about this is to count the number of ways that $n$ dice can fall so that $n_i$ dice show face $i$. There are $\binom{n}{n_1, n_2, \ldots, n_6}$ such ways. This is a macrostate indexed by $(n_1, n_2, \ldots, n_6)$ corresponding to $\binom{n}{n_1, n_2, \ldots, n_6}$ microstates, each having probability $\frac{1}{6^n}$. To find the most probable macrostate, we wish to maximize $\binom{n}{n_1, n_2, \ldots, n_6}$ under the constraint observed on the total number of spots,

$$\sum_{i=1}^{6} i\, n_i = n\alpha. \tag{12.15}$$

Using a crude Stirling's approximation, $n! \approx \left(\frac{n}{e}\right)^n$, we find that

$$\binom{n}{n_1, n_2, \ldots, n_6} \approx \frac{\left(\frac{n}{e}\right)^n}{\prod_{i=1}^{6} \left(\frac{n_i}{e}\right)^{n_i}} \tag{12.16}$$
$$\binom{n}{n_1, n_2, \ldots, n_6} = \prod_{i=1}^{6}\left(\frac{n}{n_i}\right)^{n_i} \tag{12.17}$$
$$= e^{n H\left(\frac{n_1}{n}, \frac{n_2}{n}, \ldots, \frac{n_6}{n}\right)}. \tag{12.18}$$

Thus, maximizing $\binom{n}{n_1, n_2, \ldots, n_6}$ under the constraint (12.15) is almost equivalent to maximizing $H(p_1, p_2, \ldots, p_6)$ under the constraint $\sum_i i p_i = \alpha$. Using Theorem 12.1.1 under this constraint, we find the maximum entropy probability mass function to be

$$p_i = \frac{e^{\lambda i}}{\sum_{i=1}^{6} e^{\lambda i}}, \tag{12.19}$$

where $\lambda$ is chosen so that $\sum_i i p_i = \alpha$. Thus, the most probable macrostate is $(np_1, np_2, \ldots, np_6)$, and we expect to find $n_i = np_i$ dice showing face $i$.
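The tilt $\lambda$ in (12.19) can be found by one-dimensional search: the mean $\sum_i i\,p_i$ is strictly increasing in $\lambda$, so bisection works for any $\alpha \in (1, 6)$. A sketch ($\alpha = 4$ is an arbitrary test value):

```python
import math

def maxent_dice(alpha, lo=-10.0, hi=10.0, iters=200):
    """Return the maximum entropy pmf p_i proportional to exp(lambda * i)
    on faces {1,...,6} with mean alpha, solving for lambda by bisection."""
    def mean(lam):
        w = [math.exp(lam * i) for i in range(1, 7)]
        z = sum(w)
        return sum(i * wi for i, wi in zip(range(1, 7), w)) / z
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean(mid) < alpha:   # mean is increasing in lambda
            lo = mid
        else:
            hi = mid
    lam = (lo + hi) / 2.0
    w = [math.exp(lam * i) for i in range(1, 7)]
    z = sum(w)
    return [wi / z for wi in w]

p4 = maxent_dice(4.0)   # tilted toward the high faces
```

With $\alpha = 3.5$ the constraint is inert and the solution collapses back to the uniform distribution, as in Example 12.2.2.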

In Chapter 11 we show that the reasoning and the approximations are essentially correct. In fact, we show that not only is the maximum entropy macrostate the most likely, but it also contains almost all of the probability. Specifically, for rational $\alpha$,

$$\Pr\left\{\left|\frac{N_i}{n} - p_i\right| < \epsilon,\ i = 1, 2, \ldots, 6 \,\middle|\, \sum_{i=1}^{n} X_i = n\alpha\right\} \to 1 \tag{12.20}$$

as $n \to \infty$ along the subsequence such that $n\alpha$ is an integer.

Example 12.2.4 Let S = [a, b], with no other constraints. Then the maximum entropy distribution is the uniform distribution over this range.

Example 12.2.5 $S = [0, \infty)$ and $EX = \mu$. Then the entropy-maximizing distribution is

$$f(x) = \frac{1}{\mu}\, e^{-\frac{x}{\mu}}, \qquad x \ge 0. \tag{12.21}$$

This problem has a physical interpretation. Consider the distribution of the height $X$ of molecules in the atmosphere. The average potential energy of the molecules is fixed, and the gas tends to the distribution that has the maximum entropy subject to the constraint that $E(mgX)$ is fixed. This is the exponential distribution with density $f(x) = \lambda e^{-\lambda x}$, $x \ge 0$. The density of the atmosphere does indeed have this distribution.


Example 12.2.6 $S = (-\infty, \infty)$, and $EX = \mu$. Here the maximum entropy is infinite, and there is no maximum entropy distribution. (Consider normal distributions with larger and larger variances.)

Example 12.2.7 $S = (-\infty, \infty)$, $EX = \alpha_1$, and $EX^2 = \alpha_2$. The maximum entropy distribution is $N(\alpha_1, \alpha_2 - \alpha_1^2)$.

Example 12.2.8 $S = \mathbb{R}^n$, $EX_i X_j = K_{ij}$, $1 \le i, j \le n$. This is a multivariate example, but the same analysis holds and the maximum entropy

density is of the form

$$f(x) = e^{\lambda_0 + \sum_{i,j} \lambda_{ij} x_i x_j}. \tag{12.22}$$

Since the exponent is a quadratic form, it is clear by inspection that the density is a multivariate normal with zero mean. Since we have to satisfy the second moment constraints, we must have a multivariate normal with covariance Kij , and hence the density is

$$f(x) = \frac{1}{\left(\sqrt{2\pi}\right)^n |K|^{1/2}}\, e^{-\frac{1}{2} x^T K^{-1} x}, \tag{12.23}$$

which has an entropy

$$h(N_n(0, K)) = \frac{1}{2}\log\bigl((2\pi e)^n |K|\bigr), \tag{12.24}$$

as derived in Chapter 8.
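(12.24) is easy to sanity-check against the scalar formula $h(N(0, \sigma^2)) = \frac{1}{2}\log 2\pi e \sigma^2$: for a diagonal $K$ the joint entropy must equal the sum of the marginal entropies, and introducing correlation can only lower it. A $2 \times 2$ sketch (covariance values chosen arbitrarily):

```python
import math

def mvn_entropy_bits(K):
    """h(N_2(0, K)) = (1/2) log2((2 pi e)^2 |K|) for a 2x2 covariance K."""
    det = K[0][0] * K[1][1] - K[0][1] * K[1][0]
    return 0.5 * math.log2((2 * math.pi * math.e) ** 2 * det)

def normal_entropy_bits(var):
    """h(N(0, var)) = (1/2) log2(2 pi e var)."""
    return 0.5 * math.log2(2 * math.pi * math.e * var)
```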

Example 12.2.9 Suppose that we have the same constraints as in Example 12.2.8, but $EX_i X_j = K_{ij}$ only for some restricted set of $(i, j) \in A$. For example, we might know $K_{ij}$ only for $i = j \pm 2$. Then by comparing (12.22) and (12.23), we can conclude that $(K^{-1})_{ij} = 0$ for $(i, j) \in A^c$ (i.e., the entries in the inverse of the covariance matrix are 0 when $(i, j)$ is outside the constraint set).

12.3 ANOMALOUS MAXIMUM ENTROPY PROBLEM

We have proved that the maximum entropy distribution subject to the

constraints

$$\int_S h_i(x) f(x)\, dx = \alpha_i \tag{12.25}$$

is of the form

$$f(x) = e^{\lambda_0 + \sum_i \lambda_i h_i(x)} \tag{12.26}$$

if λ0, λ1, . . . , λp satisfying the constraints (12.25) exist.


We now consider a tricky problem in which the λi cannot be chosen to satisfy the constraints. Nonetheless, the “maximum” entropy can be found. We consider the following problem: Maximize the entropy subject

to the constraints

$$\int_{-\infty}^{\infty} f(x)\, dx = 1, \tag{12.27}$$
$$\int_{-\infty}^{\infty} x f(x)\, dx = \alpha_1, \tag{12.28}$$
$$\int_{-\infty}^{\infty} x^2 f(x)\, dx = \alpha_2, \tag{12.29}$$
$$\int_{-\infty}^{\infty} x^3 f(x)\, dx = \alpha_3. \tag{12.30}$$

Here, the maximum entropy distribution, if it exists, must be of the form

$$f(x) = e^{\lambda_0 + \lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3}. \tag{12.31}$$

But if $\lambda_3$ is nonzero, $\int_{-\infty}^{\infty} f = \infty$ and the density cannot be normalized. So $\lambda_3$ must be 0. But then we have four equations and only three variables, so that in general it is not possible to choose the appropriate constants. The method seems to have failed in this case.

The reason for the apparent failure is simple: The entropy has a least upper bound under these constraints, but it is not possible to attain it. Consider the corresponding problem with only first and second moment constraints. In this case, the results of Example 12.2.1 show that the entropy-maximizing distribution is the normal with the appropriate moments. With the additional third moment constraint, the maximum entropy cannot be higher. Is it possible to achieve this value?

We cannot achieve it, but we can come arbitrarily close. Consider a normal distribution with a small "wiggle" at a very high value of $x$. The moments of the new distribution are almost the same as those of the old one, the biggest change being in the third moment. We can bring the first and second moments back to their original values by adding new wiggles to balance out the changes caused by the first. By choosing the position of the wiggles, we can get any value of the third moment without reducing the entropy significantly below that of the associated normal. Using this method, we can come arbitrarily close to the upper bound for the maximum entropy distribution. We conclude that

$$\sup h(f) = h\bigl(N(0, \alpha_2 - \alpha_1^2)\bigr) = \frac{1}{2}\ln 2\pi e\,(\alpha_2 - \alpha_1^2). \tag{12.32}$$
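The wiggle argument can be made concrete by grid integration: mixing a tiny mass $\varepsilon$ at a distant point $M$ into $N(0, 1)$ moves the third moment by roughly $\varepsilon M^3$ while changing the differential entropy by only $O(\varepsilon \log \frac{1}{\varepsilon})$. The sketch below tracks just the third moment and the entropy (the full argument also re-balances the first two moments); all parameters are illustrative.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def mixture_stats(eps, M, s=0.05, lo=-12.0, steps=120000):
    """Differential entropy (nats) and third moment of
    f = (1 - eps) N(0,1) + eps N(M, s^2), by midpoint-rule integration."""
    hi = M + 12.0
    dx = (hi - lo) / steps
    h = 0.0
    m3 = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        f = (1 - eps) * normal_pdf(x, 0.0, 1.0) + eps * normal_pdf(x, M, s)
        if f > 0.0:
            h -= f * math.log(f) * dx
        m3 += x ** 3 * f * dx
    return h, m3

h_normal = 0.5 * math.log(2 * math.pi * math.e)  # h(N(0,1)), about 1.419 nats

# A wiggle of mass 1e-4 at M = 30 buys a third moment of roughly
# 1e-4 * 30^3 = 2.7 while barely moving the entropy.
h_mix, m3_mix = mixture_stats(1e-4, 30.0)
```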