Information Theory / Cover T.M., Thomas J.A. Elements of Information Theory. 2006, 748 p.
    ≤ Σ_{x∈X} −r(x) log r(x)    (17.30)

    = ||p − q||₁ Σ_{x∈X} −(r(x)/||p − q||₁) log((r(x)/||p − q||₁) ||p − q||₁)    (17.31)

    = −||p − q||₁ log ||p − q||₁ + ||p − q||₁ H(r(x)/||p − q||₁)    (17.32)

    ≤ −||p − q||₁ log ||p − q||₁ + ||p − q||₁ log |X|,    (17.33)
where (17.30) follows from (17.27). |
|
|
|
|
|
|
|
|
|
|
Finally, relative entropy is stronger than the L1 norm in the following sense:

Lemma 17.3.2 (Lemma 11.6.1)

    D(p1||p2) ≥ (1/(2 ln 2)) ||p1 − p2||₁².    (17.34)
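As a quick numerical check (our own addition, with hypothetical helper names), the bound can be verified directly for any fixed pair of distributions; D is computed in bits here, which is what the 1/(2 ln 2) constant assumes:

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def l1_distance(p, q):
    """||p - q||_1 = sum over x of |p(x) - q(x)|."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

lhs = kl_divergence(p, q)
rhs = l1_distance(p, q) ** 2 / (2 * math.log(2))
print(lhs >= rhs)  # inequality (17.34)
```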
The relative entropy between two probability mass functions P(x) and Q(x) is zero when P = Q. Around this point, the relative entropy has quadratic behavior, and the first term in the Taylor series expansion of D(P||Q) around the point P = Q is the chi-squared distance between the distributions P and Q. Let

    χ²(P, Q) = Σ_x (P(x) − Q(x))² / Q(x).    (17.35)
Lemma 17.3.3 For P near Q,

    D(P||Q) = (1/2) χ²(P, Q) + · · · .    (17.36)
Proof: See Problem 11.2.
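To see the quadratic behavior concretely, one can compare D(P||Q) (here in nats, where the 1/2 constant applies) with χ²/2 for a small perturbation of Q; the distributions below are our own illustrative choices:

```python
import math

def kl_nats(p, q):
    """D(p||q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi_squared(p, q):
    """Chi-squared distance (17.35)."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

q = [0.4, 0.35, 0.25]
eps = 1e-3
p = [0.4 + eps, 0.35 - eps, 0.25]  # a small perturbation of q

d = kl_nats(p, q)
approx = chi_squared(p, q) / 2
print(d, approx)  # agreement improves as eps -> 0
```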
|
|
|
|
|
|
17.4 INEQUALITIES FOR TYPES
The method of types is a powerful tool for proving results in large deviation theory and error exponents. We repeat the basic theorems.
Theorem 17.4.1 (Theorem 11.1.1) The number of types with denominator n is bounded by

    |Pn| ≤ (n + 1)^|X|.    (17.37)
|
|
|
Theorem 17.4.2 (Theorem 11.1.2) If X1, X2, . . . , Xn are drawn i.i.d. according to Q(x), the probability of x^n depends only on its type and is given by

    Q^n(x^n) = 2^(−n(H(P_{x^n}) + D(P_{x^n}||Q))).    (17.38)
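Equation (17.38) says that an i.i.d. source assigns the same probability to every sequence of a given type. A direct check (our own sketch; the alphabet and sequence are arbitrary choices):

```python
import math
from collections import Counter

def type_probability(seq, Q):
    """Return Q^n(x^n) computed directly and via 2^{-n(H(P) + D(P||Q))}."""
    n = len(seq)
    P = {a: c / n for a, c in Counter(seq).items()}  # the type of seq
    H = -sum(p * math.log2(p) for p in P.values())
    D = sum(p * math.log2(p / Q[a]) for a, p in P.items())
    direct = math.prod(Q[x] for x in seq)
    via_type = 2 ** (-n * (H + D))
    return direct, via_type

Q = {'a': 0.5, 'b': 0.3, 'c': 0.2}
direct, via_type = type_probability('aabacbba', Q)
print(direct, via_type)  # identical up to floating-point error
```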
Theorem 17.4.3 (Theorem 11.1.3: Size of a type class T(P)) For any type P ∈ Pn,

    (1/(n + 1)^|X|) 2^(nH(P)) ≤ |T(P)| ≤ 2^(nH(P)).    (17.39)
Theorem 17.4.4 (Theorem 11.1.4) For any P ∈ Pn and any distribution Q, the probability of the type class T(P) under Q^n is 2^(−nD(P||Q)) to first order in the exponent. More precisely,

    (1/(n + 1)^|X|) 2^(−nD(P||Q)) ≤ Q^n(T(P)) ≤ 2^(−nD(P||Q)).    (17.40)
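For a binary alphabet the type class of p = k/n is just the set of sequences with k ones, so |T(P)| = C(n, k) and both (17.39) and (17.40) can be verified exhaustively for small n; this sketch is our own addition:

```python
import math

def bounds_hold(n, k, q):
    """Check (17.39) and (17.40) for the binary type p = k/n under Bernoulli(q)."""
    p = k / n
    H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    size = math.comb(n, k)                      # |T(P)| for a binary alphabet
    ok_size = 2 ** (n * H) / (n + 1) ** 2 <= size <= 2 ** (n * H)
    D = p * math.log2(p / q) + (1 - p) * math.log2((1 - p) / (1 - q))
    prob = size * q ** k * (1 - q) ** (n - k)   # Q^n(T(P))
    ok_prob = 2 ** (-n * D) / (n + 1) ** 2 <= prob <= 2 ** (-n * D)
    return ok_size and ok_prob

all_hold = all(bounds_hold(20, k, 0.3) for k in range(1, 20))
print(all_hold)
```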
|
17.5 COMBINATORIAL BOUNDS ON ENTROPY
We give tight bounds on the size of (n choose k) when k is not 0 or n, using the result of Wozencraft and Reiffen [568]:
Lemma 17.5.1 For 0 < p < 1 and q = 1 − p such that np is an integer,

    1/√(8npq) ≤ (n choose np) 2^(−nH(p)) ≤ 1/√(πnpq).    (17.41)
|
Proof: We begin with a strong form of Stirling’s approximation [208], which states that
    √(2πn) (n/e)^n ≤ n! ≤ √(2πn) (n/e)^n e^(1/(12n)).    (17.42)
Applying this to find an upper bound, we obtain
    (n choose np) = n! / ((np)! (nq)!)

    ≤ [√(2πn) (n/e)^n e^(1/(12n))] / [√(2πnp) (np/e)^(np) · √(2πnq) (nq/e)^(nq)]    (17.43)

    = (1/√(2πnpq)) (1/(p^(np) q^(nq))) e^(1/(12n))    (17.44)

    < (1/√(πnpq)) 2^(nH(p)),    (17.45)

since e^(1/(12n)) < e^(1/12) = 1.087 < √2, hence proving the upper bound.
The lower bound is obtained similarly. Using Stirling’s formula, we
obtain

    (n choose np) ≥ [√(2πn) (n/e)^n e^(−1/(12np) − 1/(12nq))] / [√(2πnp) (np/e)^(np) · √(2πnq) (nq/e)^(nq)]    (17.46)

    = (1/√(2πnpq)) (1/(p^(np) q^(nq))) e^(−1/(12np) − 1/(12nq))    (17.47)

    = (1/√(2πnpq)) 2^(nH(p)) e^(−1/(12np) − 1/(12nq)).    (17.48)
If np ≥ 1, and nq ≥ 3, then |
    e^(−1/(12np) − 1/(12nq)) ≥ e^(−1/12 − 1/36) = e^(−1/9) = 0.8948 > √π/2 = 0.8862,    (17.49)
and the lower bound follows directly from substituting this into the equation. The exceptions to this condition are the cases where np = 1, nq = 1 or 2, and np = 2, nq = 2 (the case when np ≥ 3, nq = 1 or 2 can be handled by flipping the roles of p and q). In each of these cases
np = 1, nq = 1 → n = 2, p = 1/2, (n choose np) = 2, bound = 2,
np = 1, nq = 2 → n = 3, p = 1/3, (n choose np) = 3, bound = 2.92,
np = 2, nq = 2 → n = 4, p = 1/2, (n choose np) = 6, bound = 5.66.
Thus, even in these special cases, the bound is valid, and hence the lower bound is valid for all p ≠ 0, 1. Note that the lower bound blows up when p = 0 or p = 1, and is therefore not valid there.
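The two-sided bound of Lemma 17.5.1 can also be checked exhaustively for small n (a sketch of our own, not part of the original proof). Note that n = 2, k = 1 meets the lower bound with equality, matching the "bound = 2" special case above:

```python
import math

def lemma_holds(n, k):
    """Check 1/sqrt(8npq) <= C(n, np) 2^{-nH(p)} <= 1/sqrt(pi*npq) for p = k/n."""
    p = k / n
    q = 1 - p
    H = -p * math.log2(p) - q * math.log2(q)
    central = math.comb(n, k) * 2 ** (-n * H)
    return 1 / math.sqrt(8 * n * p * q) <= central <= 1 / math.sqrt(math.pi * n * p * q)

all_ok = all(lemma_holds(n, k) for n in range(2, 50) for k in range(1, n))
print(all_ok)
```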
17.6 ENTROPY RATES OF SUBSETS
We now generalize the chain rule for differential entropy. The chain rule provides a bound on the entropy rate of a collection of random variables in terms of the entropy of each random variable:
    h(X1, X2, . . . , Xn) ≤ Σ_{i=1}^{n} h(Xi).    (17.50)
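For jointly Gaussian random variables with covariance matrix K, h(X1, . . . , Xn) = ½ log((2πe)^n det K) and h(Xi) = ½ log(2πe K_ii), so (17.50) reduces to Hadamard's inequality det K ≤ Π K_ii. A sketch (the covariance matrix below is our own illustrative choice):

```python
import math

def det3(K):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = K
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

K = [[2.0, 0.5, 0.3],
     [0.5, 1.0, 0.2],
     [0.3, 0.2, 1.5]]  # a positive-definite covariance matrix

n = 3
h_joint = 0.5 * math.log((2 * math.pi * math.e) ** n * det3(K))
h_marginals = sum(0.5 * math.log(2 * math.pi * math.e * K[i][i]) for i in range(n))
print(h_joint <= h_marginals)  # (17.50); equivalently det K <= product of diagonal entries
```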
|
which is the desired result h_n^(n) ≤ h_{n−1}^(n). We now prove that h_k^(n) ≤ h_{k−1}^(n) for all k ≤ n by first conditioning on a k-element subset and then taking a uniform choice over its (k − 1)-element subsets. For each k-element subset, h_k^(k) ≤ h_{k−1}^(k), and hence the inequality remains true after taking the expectation over all k-element subsets chosen uniformly from the n elements.
|
|
|
Theorem 17.6.2 Let r > 0, and define

    t_k^(n) = (1/(n choose k)) Σ_{S:|S|=k} e^(r h(X(S))/k).    (17.55)

Then

    t_1^(n) ≥ t_2^(n) ≥ · · · ≥ t_n^(n).    (17.56)
Proof: Starting from (17.54), we multiply both sides by r, exponentiate, and then apply the arithmetic mean-geometric mean inequality to obtain
    e^((1/n) r h(X1, X2, . . . , Xn))

    ≤ e^((1/n) Σ_{i=1}^{n} (1/(n−1)) r h(X1, . . . , Xi−1, Xi+1, . . . , Xn))    (17.57)

    ≤ (1/n) Σ_{i=1}^{n} e^((1/(n−1)) r h(X1, . . . , Xi−1, Xi+1, . . . , Xn))  for all r ≥ 0,    (17.58)

which is equivalent to t_n^(n) ≤ t_{n−1}^(n). Now we use the same arguments as in Theorem 17.6.1, taking an average over all subsets, to prove the result that for all k ≤ n, t_k^(n) ≤ t_{k−1}^(n).
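Using the Gaussian formula h(X(S)) = ½ ln((2πe)^|S| det K_S), the monotonicity (17.56) can be spot-checked numerically; the covariance matrix and the value of r below are our own choices:

```python
import math
from itertools import combinations

def det(M):
    """Determinant by cofactor expansion (adequate for tiny matrices)."""
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

K = [[2.0, 0.5, 0.3],
     [0.5, 1.0, 0.2],
     [0.3, 0.2, 1.5]]  # positive definite
n, r = 3, 1.0

def h_subset(S):
    """Differential entropy (nats) of the Gaussian subvector indexed by S."""
    sub = [[K[i][j] for j in S] for i in S]
    return 0.5 * math.log((2 * math.pi * math.e) ** len(S) * det(sub))

t = {k: sum(math.exp(r * h_subset(S) / k) for S in combinations(range(n), k))
        / math.comb(n, k)
     for k in range(1, n + 1)}
print(t[1] >= t[2] >= t[3])  # (17.56) for n = 3
```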
Definition The average conditional entropy rate per element for all subsets of size k is the average of the above quantities for k-element subsets of {1, 2, . . . , n}:
    g_k^(n) = (1/(n choose k)) Σ_{S:|S|=k} h(X(S)|X(S^c)) / k.    (17.59)
Here g_k^(n) is the average entropy per element of the set S conditional on the elements of the set S^c. As the size of the set S increases, one can expect a greater dependence among the elements of the set S, which explains Theorem 17.6.1.
Theorem 17.7.1 (Theorem 11.10.1: Cramér-Rao inequality) The mean-squared error of any unbiased estimator T(X) of the parameter θ is lower bounded by the reciprocal of the Fisher information:
    var(T) ≥ 1/J(θ).    (17.70)
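For n i.i.d. N(θ, σ²) observations, J(θ) = n/σ² and the sample mean is an unbiased estimator whose variance σ²/n meets the bound with equality; a simulation sketch (the sample sizes and seed are our own choices):

```python
import random
import statistics

random.seed(7)
theta, sigma, n = 3.0, 2.0, 50
trials = 5000

estimates = []
for _ in range(trials):
    sample = [random.gauss(theta, sigma) for _ in range(n)]
    estimates.append(sum(sample) / n)  # T(X): the sample mean

var_T = statistics.variance(estimates)
cramer_rao = sigma ** 2 / n  # 1/J(theta), with J(theta) = n / sigma^2
print(var_T, cramer_rao)     # the empirical variance hovers at the bound
```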
We now prove a fundamental relationship between the differential entropy and the Fisher information:
Theorem 17.7.2 (de Bruijn's identity: entropy and Fisher information) Let X be any random variable with a finite variance and a density f(x). Let Z be an independent normally distributed random variable with zero mean and unit variance. Then

    (∂/∂t) h_e(X + √t Z) = (1/2) J(X + √t Z),    (17.71)

where h_e is the differential entropy to base e. In particular, if the limit exists as t → 0,
    (∂/∂t) h_e(X + √t Z) |_{t=0} = (1/2) J(X).    (17.72)
Proof: Let Y_t = X + √t Z. Then the density of Y_t is
    g_t(y) = ∫_{−∞}^{∞} f(x) (1/√(2πt)) e^(−(y−x)²/(2t)) dx.    (17.73)

Then
    (∂/∂t) g_t(y) = ∫_{−∞}^{∞} f(x) (∂/∂t) [(1/√(2πt)) e^(−(y−x)²/(2t))] dx    (17.74)

    = ∫_{−∞}^{∞} f(x) [−1/(2t) + (y−x)²/(2t²)] (1/√(2πt)) e^(−(y−x)²/(2t)) dx.    (17.75)
|||||||||||||||
We also calculate

    (∂/∂y) g_t(y) = ∫_{−∞}^{∞} f(x) (1/√(2πt)) (∂/∂y) e^(−(y−x)²/(2t)) dx    (17.76)

    = ∫_{−∞}^{∞} f(x) (1/√(2πt)) (−(y−x)/t) e^(−(y−x)²/(2t)) dx.    (17.77)
(17.84) goes to 0 at both limits and the theorem is proved. In the proof, we have exchanged integration and differentiation in (17.74), (17.76), (17.78), and (17.82). Strict justification of these exchanges requires the application of the bounded convergence and mean value theorems; the details may be found in Barron [30].
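The identity is easy to confirm in closed form when X itself is Gaussian: if X ~ N(0, σ²), then X + √t Z ~ N(0, σ² + t), so h_e = ½ ln(2πe(σ² + t)) and J = 1/(σ² + t). The sketch below (σ² and the evaluation point are our own choices) compares a finite-difference derivative with ½ J:

```python
import math

sigma2 = 1.5  # variance of X

def h_e(t):
    """Differential entropy (nats) of X + sqrt(t) Z when X ~ N(0, sigma2)."""
    return 0.5 * math.log(2 * math.pi * math.e * (sigma2 + t))

def fisher(t):
    """Fisher information of X + sqrt(t) Z: 1/(sigma2 + t) for Gaussians."""
    return 1.0 / (sigma2 + t)

t, dt = 0.7, 1e-6
numeric = (h_e(t + dt) - h_e(t - dt)) / (2 * dt)  # d/dt of h_e(X + sqrt(t) Z)
print(numeric, 0.5 * fisher(t))  # the two sides of (17.71)
```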
This theorem can be used to prove the entropy power inequality, which gives a lower bound on the entropy of a sum of independent random variables.
Theorem 17.7.3 (Entropy power inequality) If X and Y are independent random n-vectors with densities, then
    2^((2/n) h(X + Y)) ≥ 2^((2/n) h(X)) + 2^((2/n) h(Y)).    (17.86)
We outline the basic steps in the proof due to Stam [505] and Blachman [61]. A different proof is given in Section 17.8.
Stam's proof of the entropy power inequality is based on a perturbation argument. Let n = 1, and let X_t = X + √f(t) Z1 and Y_t = Y + √g(t) Z2, where Z1 and Z2 are independent N(0, 1) random variables. Then the entropy power inequality for n = 1 reduces to showing that s(0) ≤ 1, where we define
    s(t) = (2^(2h(X_t)) + 2^(2h(Y_t))) / 2^(2h(X_t + Y_t)).    (17.87)
If f(t) → ∞ and g(t) → ∞ as t → ∞, it is easy to show that s(∞) = 1. If, in addition, s′(t) ≥ 0 for t ≥ 0, this implies that s(0) ≤ 1. The proof of the fact that s′(t) ≥ 0 involves a clever choice of the functions f(t) and g(t), an application of Theorem 17.7.2, and the use of a convolution inequality for Fisher information,
    1/J(X + Y) ≥ 1/J(X) + 1/J(Y).    (17.88)
The entropy power inequality can be extended to the vector case by induction. The details may be found in the papers by Stam [505] and Blachman [61].
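For independent Gaussians the inequality is tight: the entropy power of N(0, σ²) is σ², so (17.86) becomes σ_X² + σ_Y² = σ_{X+Y}²; likewise J(N(0, σ²)) = 1/σ² makes (17.88) an equality. A closed-form check (the variances are our own choices):

```python
import math

def h(sig2):
    """Differential entropy (bits) of N(0, sig2)."""
    return 0.5 * math.log2(2 * math.pi * math.e * sig2)

sx2, sy2 = 1.3, 2.4  # variances of independent X and Y
sz2 = sx2 + sy2      # X + Y ~ N(0, sx2 + sy2)

# 2^{2h} = 2*pi*e*sigma^2 is the (scaled) entropy power, so (17.86) is an equality here.
ep_sum = 2 ** (2 * h(sz2))
ep_x, ep_y = 2 ** (2 * h(sx2)), 2 ** (2 * h(sy2))
print(ep_sum, ep_x + ep_y)  # equal up to floating-point error
```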
17.8 ENTROPY POWER INEQUALITY AND BRUNN–MINKOWSKI INEQUALITY
The entropy power inequality provides a lower bound on the differential entropy of a sum of two independent random vectors in terms of their individual differential entropies. In this section we restate and outline an