
≤ Σ_{x∈𝒳} −r(x) log r(x)                                                              (17.30)

= ||p − q||_1 Σ_{x∈𝒳} −(r(x)/||p − q||_1) log( (r(x)/||p − q||_1) · ||p − q||_1 )     (17.31)

= −||p − q||_1 log ||p − q||_1 + ||p − q||_1 H( r(x)/||p − q||_1 )                    (17.32)

≤ −||p − q||_1 log ||p − q||_1 + ||p − q||_1 log |𝒳|,                                 (17.33)

where (17.30) follows from (17.27).

 

 

 

 

 

 

 

 

 

 

Finally, relative entropy is stronger than the L1 norm in the following

sense:

 

 

 

 

Lemma 17.3.2 (Lemma 11.6.1)

D(p_1||p_2) ≥ (1/(2 ln 2)) ||p_1 − p_2||_1^2.                                         (17.34)

The relative entropy between two probability mass functions P (x) and Q(x) is zero when P = Q. Around this point, the relative entropy has a quadratic behavior, and the first term in the Taylor series expansion of the relative entropy D(P ||Q) around the point P = Q is the chi-squared

distance between the distributions P and Q. Let

 

χ²(P, Q) = Σ_x (P(x) − Q(x))² / Q(x).                                                 (17.35)

 

Lemma 17.3.3 For P near Q,

 

 

 

 

D(P||Q) = (1/2) χ² + · · · .                                                          (17.36)

Proof: See Problem 11.2.
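As a quick numerical illustration (not from the text), the sketch below compares D(P||Q), computed in nats so that (17.36) applies directly, with ½χ²(P, Q) for a small perturbation of an arbitrary pmf Q; the agreement improves as the perturbation shrinks. NumPy and the specific distributions are my own choices.

    import numpy as np

    def kl_nats(p, q):
        # relative entropy D(P||Q) in nats
        return float(np.sum(p * np.log(p / q)))

    def chi2(p, q):
        # chi-squared distance of (17.35)
        return float(np.sum((p - q) ** 2 / q))

    q = np.array([0.5, 0.3, 0.2])            # arbitrary pmf Q
    v = np.array([1.0, -0.5, -0.5])          # zero-sum direction, so P stays a pmf

    for eps in (1e-1, 1e-2, 1e-3):
        p = q + eps * v
        print(f"eps={eps:.0e}  D={kl_nats(p, q):.3e}  chi2/2={0.5 * chi2(p, q):.3e}")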

 

 

 

 

 

 

17.4 INEQUALITIES FOR TYPES

The method of types is a powerful tool for proving results in large deviation theory and error exponents. We repeat the basic theorems.

Theorem 17.4.1 (Theorem 11.1.1)   The number of types with denominator n is bounded by

|𝒫_n| ≤ (n + 1)^{|𝒳|}.                                                                (17.37)

 

 


Theorem 17.4.2 (Theorem 11.1.2)   If X_1, X_2, . . . , X_n are drawn i.i.d. according to Q(x), the probability of x^n depends only on its type and is given by

 

 

 

 

 

 

Q^n(x^n) = 2^{−n(H(P_{x^n}) + D(P_{x^n}||Q))}.                                        (17.38)
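As an illustration only (not part of the book), the following sketch checks (17.38) by computing Q^n(x^n) directly as a product and via the type of x^n; the alphabet, Q, and the sequence are arbitrary choices.

    import numpy as np
    from collections import Counter

    Q = {'a': 0.5, 'b': 0.3, 'c': 0.2}       # an arbitrary source distribution
    xn = "aabacbbcaa"                         # any sequence; only its type matters

    # direct product probability Q^n(x^n)
    direct = np.prod([Q[s] for s in xn])

    # 2^{-n(H(P_xn) + D(P_xn||Q))} from Theorem 17.4.2
    n = len(xn)
    P = {a: c / n for a, c in Counter(xn).items()}
    H = -sum(p * np.log2(p) for p in P.values())
    D = sum(p * np.log2(p / Q[a]) for a, p in P.items())
    via_type = 2.0 ** (-n * (H + D))

    print(direct, via_type)                   # the two numbers agree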

Theorem 17.4.3 (Theorem 11.1.3: Size of a type class T(P))   For any type P ∈ 𝒫_n,

2^{nH(P)} / (n + 1)^{|𝒳|} ≤ |T(P)| ≤ 2^{nH(P)}.                                       (17.39)

 

 

Theorem 17.4.4 (Theorem 11.1.4)   For any P ∈ 𝒫_n and any distribution Q, the probability of the type class T(P) under Q^n is 2^{−nD(P||Q)} to first order in the exponent. More precisely,

2^{−nD(P||Q)} / (n + 1)^{|𝒳|} ≤ Q^n(T(P)) ≤ 2^{−nD(P||Q)}.                            (17.40)
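The bounds of Theorems 17.4.1, 17.4.3, and 17.4.4 are easy to check exhaustively for a binary alphabet, where each type is determined by its number of ones, |T(P)| is a binomial coefficient, and Q^n(T(P)) has a closed form. The sketch below is an illustration of mine, not part of the text; n and Q are arbitrary.

    import math

    n = 12
    q1 = 0.3                                  # Q = (q1, 1 - q1)
    A = 2                                     # alphabet size |X|

    def H(p):                                 # binary entropy in bits
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def D(p, q):                              # binary relative entropy in bits
        d = 0.0
        if p > 0: d += p * math.log2(p / q)
        if p < 1: d += (1 - p) * math.log2((1 - p) / (1 - q))
        return d

    assert n + 1 <= (n + 1) ** A              # Theorem 17.4.1: there are n+1 binary types

    for k in range(n + 1):                    # one type P = (k/n, 1 - k/n) for each k
        p = k / n
        size = math.comb(n, k)                # |T(P)|
        prob = size * q1 ** k * (1 - q1) ** (n - k)        # Q^n(T(P))
        assert 2 ** (n * H(p)) / (n + 1) ** A <= size <= 2 ** (n * H(p)) + 1e-9
        assert 2 ** (-n * D(p, q1)) / (n + 1) ** A <= prob <= 2 ** (-n * D(p, q1)) + 1e-12

    print("Theorems 17.4.1, 17.4.3, and 17.4.4 verified for the binary case, n =", n)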

 

 

 

 

 

 

 

 

 

 

 

 

17.5 COMBINATORIAL BOUNDS ON ENTROPY

We give tight bounds on the size of the binomial coefficient C(n, k) when k is not 0 or n, using the result of Wozencraft and Reiffen [568]:

Lemma 17.5.1   For 0 < p < 1, q = 1 − p, such that np is an integer,

2^{nH(p)} / √(8npq) ≤ C(n, np) ≤ 2^{nH(p)} / √(πnpq).                                 (17.41)

 

 

Proof: We begin with a strong form of Stirling’s approximation [208], which states that

 

 

 

√(2πn) (n/e)^n ≤ n! ≤ √(2πn) (n/e)^n e^{1/(12n)}.                                     (17.42)

 

Applying this to find an upper bound, we obtain

 

 

 

 

 

 

 

 

 

 

C(n, np) ≤ √(2πn) (n/e)^n e^{1/(12n)} / [ √(2πnp) (np/e)^{np} · √(2πnq) (nq/e)^{nq} ]    (17.43)

= (1/√(2πnpq)) e^{1/(12n)} (1/(p^{np} q^{nq}))                                           (17.44)

< (1/√(πnpq)) 2^{nH(p)},                                                                 (17.45)

since e^{1/(12n)} ≤ e^{1/12} = 1.087 < √2, hence proving the upper bound.

 

The lower bound is obtained similarly. Using Stirling’s formula, we

obtain

 

 

 

 

 

 

C(n, np) ≥ √(2πn) (n/e)^n e^{−(1/(12np) + 1/(12nq))} / [ √(2πnp) (np/e)^{np} · √(2πnq) (nq/e)^{nq} ]    (17.46)

= (1/√(2πnpq)) (1/(p^{np} q^{nq})) e^{−(1/(12np) + 1/(12nq))}                                            (17.47)

= (1/√(2πnpq)) 2^{nH(p)} e^{−(1/(12np) + 1/(12nq))}.                                                     (17.48)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

If np ≥ 1, and nq ≥ 3, then

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

e^{−(1/(12np) + 1/(12nq))} ≥ e^{−(1/12 + 1/36)} = e^{−1/9} = 0.8948 > √π / 2 = 0.8862,                   (17.49)

 

 

 

 

 

 

 

 

 

and the lower bound follows directly from substituting this into the equation. The exceptions to this condition are the cases where np = 1, nq = 1 or 2, and np = 2, nq = 2 (the case when np ≥ 3, nq = 1 or 2 can be handled by flipping the roles of p and q). In each of these cases

np = 1, nq = 1 → n = 2, p = 1/2, C(n, np) = 2, bound = 2,
np = 1, nq = 2 → n = 3, p = 1/3, C(n, np) = 3, bound = 2.92,
np = 2, nq = 2 → n = 4, p = 1/2, C(n, np) = 6, bound = 5.66.

Thus, even in these special cases, the bound is valid, and hence the lower bound is valid for all p ≠ 0, 1. Note that the lower bound blows up when p = 0 or p = 1, and is therefore not valid at those points.
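The following sketch (my illustration, not from the text) checks the bounds of Lemma 17.5.1 against exact binomial coefficients for a few (n, np) pairs, including the special cases discussed above; the chosen pairs are arbitrary.

    import math

    def nH(n, p):                             # n times the binary entropy H(p), in bits
        return -n * (p * math.log2(p) + (1 - p) * math.log2(1 - p))

    for n, k in [(2, 1), (3, 1), (4, 2), (20, 5), (100, 37), (1000, 400)]:
        p, q = k / n, 1 - k / n
        lower = 2 ** nH(n, p) / math.sqrt(8 * n * p * q)
        upper = 2 ** nH(n, p) / math.sqrt(math.pi * n * p * q)
        c = math.comb(n, k)
        assert lower <= c <= upper
        print(f"n={n:5d} k={k:4d}   {lower:.4g} <= {float(c):.4g} <= {upper:.4g}")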

17.6 ENTROPY RATES OF SUBSETS

We now generalize the chain rule for differential entropy. The chain rule provides a bound on the entropy rate of a collection of random variables in terms of the entropy of each random variable:

h(X_1, X_2, . . . , X_n) ≤ Σ_{i=1}^{n} h(X_i).                                        (17.50)


We extend this to show that the entropy per element of a subset of a set of random variables decreases as the size of the subset increases. This is not true for each subset but is true on the average over subsets, as expressed in Theorem 17.6.1.

Definition   Let (X_1, X_2, . . . , X_n) have a density, and for every S ⊆ {1, 2, . . . , n}, denote by X(S) the subset {X_i : i ∈ S}. Let

h_k^{(n)} = (1 / C(n, k)) Σ_{S: |S|=k} h(X(S)) / k.                                   (17.51)

 

 

 

Here h_k^{(n)} is the average entropy in bits per symbol of a randomly drawn k-element subset of {X_1, X_2, . . . , X_n}.

The following theorem by Han [270] says that the average entropy decreases monotonically in the size of the subset.

Theorem 17.6.1

h_1^{(n)} ≥ h_2^{(n)} ≥ · · · ≥ h_n^{(n)}.                                            (17.52)

Proof: We first prove the last inequality, h_n^{(n)} ≤ h_{n−1}^{(n)}. We write

 

h(X_1, X_2, . . . , X_n) = h(X_1, X_2, . . . , X_{n−1}) + h(X_n | X_1, X_2, . . . , X_{n−1}),
h(X_1, X_2, . . . , X_n) = h(X_1, X_2, . . . , X_{n−2}, X_n) + h(X_{n−1} | X_1, X_2, . . . , X_{n−2}, X_n)
                         ≤ h(X_1, X_2, . . . , X_{n−2}, X_n) + h(X_{n−1} | X_1, X_2, . . . , X_{n−2}),
. . .
h(X_1, X_2, . . . , X_n) ≤ h(X_2, X_3, . . . , X_n) + h(X_1).

 

Adding these n inequalities and using the chain rule, we obtain

n h(X_1, X_2, . . . , X_n) ≤ Σ_{i=1}^{n} h(X_1, X_2, . . . , X_{i−1}, X_{i+1}, . . . , X_n) + h(X_1, X_2, . . . , X_n)    (17.53)

 

or

(1/n) h(X_1, X_2, . . . , X_n) ≤ (1/n) Σ_{i=1}^{n} h(X_1, X_2, . . . , X_{i−1}, X_{i+1}, . . . , X_n) / (n − 1),    (17.54)

 


which is the desired result h_n^{(n)} ≤ h_{n−1}^{(n)}. We now prove that h_k^{(n)} ≤ h_{k−1}^{(n)} for all k ≤ n by first conditioning on a k-element subset, and then taking a uniform choice over its (k − 1)-element subsets. For each k-element subset, h_k^{(k)} ≤ h_{k−1}^{(k)}, and hence the inequality remains true after taking the expectation over all k-element subsets chosen uniformly from the n elements.
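As a concrete check of Theorem 17.6.1 (my illustration, not part of the text), the sketch below evaluates h_k^{(n)} for a jointly Gaussian vector, for which h(X(S)) = ½ log₂((2πe)^{|S|} det Σ_S) is available in closed form; the covariance matrix is randomly generated.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    A = rng.normal(size=(n, n))
    Sigma = A @ A.T + n * np.eye(n)           # a random positive definite covariance

    def h_subset(S):
        # differential entropy (bits) of the Gaussian subvector X(S)
        sub = Sigma[np.ix_(S, S)]
        return 0.5 * np.log2((2 * np.pi * np.e) ** len(S) * np.linalg.det(sub))

    def h_k(k):
        # h_k^(n) of (17.51): average entropy per element over k-element subsets
        return np.mean([h_subset(S) / k for S in itertools.combinations(range(n), k)])

    rates = [h_k(k) for k in range(1, n + 1)]
    print(np.round(rates, 4))                 # nonincreasing in k, as Theorem 17.6.1 asserts
    assert all(rates[i] >= rates[i + 1] - 1e-9 for i in range(n - 1))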

 

 

 

Theorem 17.6.2 Let r > 0, and define

 

t_k^{(n)} = (1 / C(n, k)) Σ_{S: |S|=k} e^{r h(X(S)) / k}.                             (17.55)

Then

t_1^{(n)} ≥ t_2^{(n)} ≥ · · · ≥ t_n^{(n)}.                                            (17.56)

Proof: Starting from (17.54), we multiply both sides by r, exponentiate, and then apply the arithmetic mean geometric mean inequality, to obtain

e^{(1/n) r h(X_1, X_2, . . . , X_n)} ≤ e^{(1/n) Σ_{i=1}^{n} r h(X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n) / (n−1)}    (17.57)

≤ (1/n) Σ_{i=1}^{n} e^{r h(X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n) / (n−1)}   for all r ≥ 0,                         (17.58)

 

 

 

 

 

which is equivalent to t_n^{(n)} ≤ t_{n−1}^{(n)}. Now we use the same arguments as in Theorem 17.6.1, taking an average over all subsets to prove the result that for all k ≤ n, t_k^{(n)} ≤ t_{k−1}^{(n)}.

Definition The average conditional entropy rate per element for all subsets of size k is the average of the above quantities for k-element subsets of {1, 2, . . . , n}:

g_k^{(n)} = (1 / C(n, k)) Σ_{S: |S|=k} h(X(S) | X(S^c)) / k.                          (17.59)

 

 

 

Here g_k^{(n)} is the entropy per element of the set S conditional on the elements of the set S^c. When the size of the set S increases, one can expect a greater dependence among the elements of the set S, which explains Theorem 17.6.1.


In the case of the conditional entropy per element, as k increases, the size of the conditioning set Sc decreases and the entropy of the set S increases. The increase in entropy per element due to the decrease in conditioning dominates the decrease due to additional dependence among the elements, as can be seen from the following theorem due to Han [270]. Note that the conditional entropy ordering in the following theorem is the reverse of the unconditional entropy ordering in Theorem 17.6.1.

Theorem 17.6.3

g_1^{(n)} ≤ g_2^{(n)} ≤ · · · ≤ g_n^{(n)}.                                            (17.60)

Proof: The proof proceeds on lines very similar to the proof of the theorem for the unconditional entropy per element for a random subset.

We first prove that g_n^{(n)} ≥ g_{n−1}^{(n)} and then use this to prove the rest of the inequalities. By the chain rule, the entropy of a collection of random variables is less than the sum of the entropies:

h(X_1, X_2, . . . , X_n) ≤ Σ_{i=1}^{n} h(X_i).                                        (17.61)

Subtracting both sides of this inequality from nh(X1, X2, . . . , Xn), we have

(n − 1) h(X_1, X_2, . . . , X_n) ≥ Σ_{i=1}^{n} ( h(X_1, X_2, . . . , X_n) − h(X_i) )  (17.62)

= Σ_{i=1}^{n} h(X_1, . . . , X_{i−1}, X_{i+1}, . . . , X_n | X_i).                    (17.63)

Dividing this by n(n − 1), we obtain

h(X_1, X_2, . . . , X_n) / n ≥ (1/n) Σ_{i=1}^{n} h(X_1, X_2, . . . , X_{i−1}, X_{i+1}, . . . , X_n | X_i) / (n − 1),    (17.64)

which is equivalent to g_n^{(n)} ≥ g_{n−1}^{(n)}. We now prove that g_k^{(n)} ≥ g_{k−1}^{(n)} for all k ≤ n by first conditioning on a k-element subset and then taking a uniform choice over its (k − 1)-element subsets. For each k-element subset, g_k^{(k)} ≥ g_{k−1}^{(k)}, and hence the inequality remains true after taking the expectation over all k-element subsets chosen uniformly from the n elements.

 

 

 

 


Theorem 17.6.4   Let

f_k^{(n)} = (1 / C(n, k)) Σ_{S: |S|=k} I(X(S); X(S^c)) / k.                           (17.65)

Then

f_1^{(n)} ≥ f_2^{(n)} ≥ · · · ≥ f_n^{(n)}.                                            (17.66)

Proof: The theorem follows from the identity I(X(S); X(S^c)) = h(X(S)) − h(X(S) | X(S^c)) and Theorems 17.6.1 and 17.6.3.

 

17.7 ENTROPY AND FISHER INFORMATION

The differential entropy of a random variable is a measure of its descriptive complexity. The Fisher information is a measure of the minimum error in estimating a parameter of a distribution. In this section we derive a relationship between these two fundamental quantities and use this to derive the entropy power inequality.

Let X be any random variable with density f(x). We introduce a location parameter θ and write the density in a parametric form as f(x − θ). The Fisher information (Section 11.10) with respect to θ is given by

 

J(θ) = ∫_{−∞}^{∞} f(x − θ) [ (∂/∂θ) ln f(x − θ) ]² dx.                                (17.67)

In this case, differentiation with respect to x is equivalent to differentiation with respect to θ . So we can write the Fisher information as

J(X) = ∫_{−∞}^{∞} f(x − θ) [ (∂/∂x) ln f(x − θ) ]² dx

     = ∫_{−∞}^{∞} f(x) [ (∂/∂x) ln f(x) ]² dx,                                        (17.68)

which we can rewrite as

J(X) = ∫_{−∞}^{∞} (1 / f(x)) [ ∂f(x)/∂x ]² dx.                                        (17.69)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

We will call this the Fisher information of the distribution of X. Notice that like entropy, it is a function of the density.
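As a sanity check of definition (17.69) (my own illustration, not from the text), the sketch below evaluates J(X) numerically for a Gaussian density, for which J(X) = 1/σ² is known in closed form; the grid and σ are arbitrary.

    import numpy as np

    sigma = 1.7
    x = np.linspace(-12 * sigma, 12 * sigma, 200001)
    f = np.exp(-x ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

    fprime = np.gradient(f, x)                            # numerical df/dx
    J = float(np.sum(fprime ** 2 / f) * (x[1] - x[0]))    # crude quadrature of (17.69)
    print(J, 1 / sigma ** 2)                              # both approximately 0.346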

The importance of Fisher information is illustrated in the following theorem.


Theorem 17.7.1 (Theorem 11.10.1: Cramér–Rao inequality)   The mean-squared error of any unbiased estimator T(X) of the parameter θ is lower bounded by the reciprocal of the Fisher information:

var(T) ≥ 1 / J(θ).                                                                    (17.70)

We now prove a fundamental relationship between the differential entropy and the Fisher information:

Theorem 17.7.2 (de Bruijn's identity: entropy and Fisher information)   Let X be any random variable with a finite variance and a density f(x). Let Z be an independent normally distributed random variable with zero mean and unit variance. Then

 

 

 

 

 

 

 

 

 

 

 

(∂/∂t) h_e(X + √t Z) = (1/2) J(X + √t Z),                                             (17.71)

where h_e is the differential entropy to base e. In particular, if the limit exists as t → 0,

 

 

 

 

 

 

 

(∂/∂t) h_e(X + √t Z) |_{t=0} = (1/2) J(X).                                            (17.72)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Proof: Let Y_t = X + √t Z. Then the density of Y_t is

g_t(y) = ∫_{−∞}^{∞} f(x) (1/√(2πt)) e^{−(y−x)²/(2t)} dx.                              (17.73)

 

 

Then

 

 

 

 

 

 

 

 

∂g_t(y)/∂t = ∫_{−∞}^{∞} f(x) (∂/∂t) [ (1/√(2πt)) e^{−(y−x)²/(2t)} ] dx                                                       (17.74)

= ∫_{−∞}^{∞} f(x) [ −(1/(2t)) (1/√(2πt)) e^{−(y−x)²/(2t)} + ((y − x)²/(2t²)) (1/√(2πt)) e^{−(y−x)²/(2t)} ] dx.               (17.75)

We also calculate

∂g_t(y)/∂y = ∫_{−∞}^{∞} f(x) (∂/∂y) [ (1/√(2πt)) e^{−(y−x)²/(2t)} ] dx                                                       (17.76)

= ∫_{−∞}^{∞} f(x) [ −(y − x)/t ] (1/√(2πt)) e^{−(y−x)²/(2t)} dx                                                              (17.77)


and

 

 

 

 

 

 

 

 

 

 

 

∂²g_t(y)/∂y² = (∂/∂y) ∫_{−∞}^{∞} f(x) (1/√(2πt)) [ −(y − x)/t ] e^{−(y−x)²/(2t)} dx   (17.78)

= ∫_{−∞}^{∞} f(x) (1/√(2πt)) [ −1/t + (y − x)²/t² ] e^{−(y−x)²/(2t)} dx.              (17.79)

Thus,

∂g_t(y)/∂t = (1/2) ∂²g_t(y)/∂y².                                                      (17.80)

We will use this relationship to calculate the derivative of the entropy of Yt , where the entropy is given by

 

 

 

 

 

h_e(Y_t) = − ∫_{−∞}^{∞} g_t(y) ln g_t(y) dy.                                          (17.81)

 

 

 

 

 

 

Differentiating, we obtain

 

 

 

 

 

 

 

 

 

(∂/∂t) h_e(Y_t) = − ∫_{−∞}^{∞} (∂g_t(y)/∂t) dy − ∫_{−∞}^{∞} (∂g_t(y)/∂t) ln g_t(y) dy                                        (17.82)

= − (∂/∂t) ∫_{−∞}^{∞} g_t(y) dy − (1/2) ∫_{−∞}^{∞} (∂²g_t(y)/∂y²) ln g_t(y) dy.                                              (17.83)

The first term is zero since ∫ g_t(y) dy = 1. The second term can be integrated by parts to obtain

(∂/∂t) h_e(Y_t) = − (1/2) [ (∂g_t(y)/∂y) ln g_t(y) ]_{−∞}^{∞} + (1/2) ∫_{−∞}^{∞} (∂g_t(y)/∂y)² (1/g_t(y)) dy.                (17.84)

The second term in (17.84) is (1/2) J(Y_t). So the proof will be complete if we show that the first term in (17.84) is zero. We can rewrite the first term as

 

 

 

 

 

 

 

 

 

 

 

 

 

(∂g_t(y)/∂y) ln g_t(y) = [ (∂g_t(y)/∂y) / √(g_t(y)) ] [ 2 √(g_t(y)) ln √(g_t(y)) ].   (17.85)

 

 

 

 

 

 

 

 

The square of the first factor integrates to the Fisher information and hence must be bounded as y → ±∞. The second factor goes to zero since x ln x → 0 as x → 0 and gt (y) → 0 as y → ±∞. Hence, the first term in


(17.84) goes to 0 at both limits and the theorem is proved. In the proof, we have exchanged integration and differentiation in (17.74), (17.76), (17.78), and (17.82). Strict justification of these exchanges requires the application of the bounded convergence and mean value theorems; the details may be found in Barron [30].
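For a Gaussian X the identity can be checked in closed form, which makes a convenient sanity test (my illustration, not part of the text): if X ~ N(0, σ²), then X + √t Z ~ N(0, σ² + t), so h_e = ½ ln 2πe(σ² + t) and J = 1/(σ² + t).

    import numpy as np

    sigma2 = 2.5                                   # variance of X; X ~ N(0, sigma2)

    def h_e(t):
        # differential entropy (nats) of X + sqrt(t) Z, a N(0, sigma2 + t) variable
        return 0.5 * np.log(2 * np.pi * np.e * (sigma2 + t))

    def J(t):
        # Fisher information of the N(0, sigma2 + t) density
        return 1.0 / (sigma2 + t)

    t, dt = 1.0, 1e-6
    lhs = (h_e(t + dt) - h_e(t - dt)) / (2 * dt)   # d/dt h_e(X + sqrt(t) Z)
    rhs = 0.5 * J(t)                               # right-hand side of (17.71)
    print(lhs, rhs)                                # both approximately 1/7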

This theorem can be used to prove the entropy power inequality, which gives a lower bound on the entropy of a sum of independent random variables.

Theorem 17.7.3 (Entropy power inequality) If X and Y are independent random n-vectors with densities, then

2^{(2/n) h(X+Y)} ≥ 2^{(2/n) h(X)} + 2^{(2/n) h(Y)}.                                   (17.86)

We outline the basic steps in the proof due to Stam [505] and Blachman [61]. A different proof is given in Section 17.8.

Stam's proof of the entropy power inequality is based on a perturbation argument. Let n = 1. Let X_t = X + √(f(t)) Z_1, Y_t = Y + √(g(t)) Z_2, where Z_1 and Z_2 are independent N(0, 1) random variables. Then the entropy power inequality for n = 1 reduces to showing that s(0) ≤ 1, where we define

s(t) = ( 2^{2h(X_t)} + 2^{2h(Y_t)} ) / 2^{2h(X_t + Y_t)}.                             (17.87)

 

If f(t) → ∞ and g(t) → ∞ as t → ∞, it is easy to show that s(∞) = 1. If, in addition, s′(t) ≥ 0 for t ≥ 0, this implies that s(0) ≤ 1. The proof of the fact that s′(t) ≥ 0 involves a clever choice of the functions f(t) and g(t), an application of Theorem 17.7.2, and the use of a convolution inequality for Fisher information,

 

1 / J(X + Y) ≥ 1 / J(X) + 1 / J(Y).                                                   (17.88)

 

 

 

 

 

 

 

 

 

The entropy power inequality can be extended to the vector case by induction. The details may be found in the papers by Stam [505] and Blachman [61].
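Gaussians make both (17.86) and (17.88) hold with equality, which gives a quick plausibility check (my illustration, not from the text): for independent X ~ N(0, s₁) and Y ~ N(0, s₂) with n = 1, X + Y ~ N(0, s₁ + s₂), 2^{2h(N(0,s))} = 2πes, and J(N(0, s)) = 1/s.

    import numpy as np

    s1, s2 = 1.3, 0.7                              # arbitrary variances

    def entropy_power(s):
        # 2^{2 h(N(0,s))} with h in bits; equals 2*pi*e*s
        return 2.0 ** (2 * 0.5 * np.log2(2 * np.pi * np.e * s))

    def J(s):
        # Fisher information of a N(0, s) density
        return 1.0 / s

    print(entropy_power(s1 + s2), entropy_power(s1) + entropy_power(s2))  # (17.86), equal
    print(1 / J(s1 + s2), 1 / J(s1) + 1 / J(s2))                          # (17.88), equal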

17.8 ENTROPY POWER INEQUALITY AND BRUNN–MINKOWSKI INEQUALITY

The entropy power inequality provides a lower bound on the differential entropy of a sum of two independent random vectors in terms of their individual differential entropies. In this section we restate and outline an