

8.2 AEP FOR CONTINUOUS RANDOM VARIABLES

One of the important roles of the entropy for discrete random variables is in the AEP, which states that for a sequence of i.i.d. random variables, p(X1, X2, . . . , Xn) is close to 2^{−nH(X)} with high probability. This enables us to define the typical set and characterize the behavior of typical sequences.

We can do the same for a continuous random variable.

Theorem 8.2.1 Let X1, X2, . . . , Xn be a sequence of random variables drawn i.i.d. according to the density f(x). Then

−(1/n) log f(X1, X2, . . . , Xn) → E[−log f(X)] = h(X)  in probability.  (8.10)

Proof: The proof follows directly from the weak law of large numbers.
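As a numerical illustration of Theorem 8.2.1, the convergence can be checked for a density whose differential entropy is known in closed form. The following minimal sketch (assuming NumPy is available; the choice of N(0, 1) and the sample sizes are arbitrary) compares −(1/n) log f(X1, . . . , Xn) with h(X) = (1/2) log 2πe ≈ 2.05 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
h_true = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)  # differential entropy of N(0, sigma^2) in bits

def log2_density(x, sigma=1.0):
    """log2 of the N(0, sigma^2) density evaluated at x."""
    return -0.5 * np.log2(2 * np.pi * sigma**2) - (x**2 / (2 * sigma**2)) * np.log2(np.e)

for n in [10, 100, 1000, 10000]:
    x = rng.normal(0.0, sigma, size=n)
    empirical = -np.mean(log2_density(x, sigma))   # -(1/n) * sum_i log2 f(X_i)
    print(f"n = {n:6d}: -(1/n) log f(X^n) = {empirical:.3f},  h(X) = {h_true:.3f}")
```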

This leads to the following definition of the typical set.

 

 

 

Definition For ε > 0 and any n, we define the typical set A_ε^(n) with respect to f(x) as follows:

A_ε^(n) = { (x1, x2, . . . , xn) ∈ S^n :  | −(1/n) log f(x1, x2, . . . , xn) − h(X) | ≤ ε },  (8.11)

where f(x1, x2, . . . , xn) = ∏_{i=1}^{n} f(xi).

 

 

 

 

 

The properties of the typical set for continuous random variables parallel those for discrete random variables. The analog of the cardinality of the typical set for the discrete case is the volume of the typical set for continuous random variables.

Definition The volume Vol(A) of a set A ⊂ R^n is defined as

Vol(A) = ∫_A dx1 dx2 · · · dxn.  (8.12)

Theorem 8.2.2 The typical set A_ε^(n) has the following properties:

1. Pr{A_ε^(n)} > 1 − ε for n sufficiently large.

2. Vol(A_ε^(n)) ≤ 2^{n(h(X)+ε)} for all n.

3. Vol(A_ε^(n)) ≥ (1 − ε) 2^{n(h(X)−ε)} for n sufficiently large.


Proof: By Theorem 8.2.1, −(1/n) log f(X^n) = −(1/n) Σ_i log f(Xi) → h(X) in probability, establishing property 1. Also,

1 = ∫_{S^n} f(x1, x2, . . . , xn) dx1 dx2 · · · dxn  (8.13)

  ≥ ∫_{A_ε^(n)} f(x1, x2, . . . , xn) dx1 dx2 · · · dxn  (8.14)

  ≥ ∫_{A_ε^(n)} 2^{−n(h(X)+ε)} dx1 dx2 · · · dxn  (8.15)

  = 2^{−n(h(X)+ε)} ∫_{A_ε^(n)} dx1 dx2 · · · dxn  (8.16)

  = 2^{−n(h(X)+ε)} Vol(A_ε^(n)).  (8.17)

Hence we have property 2. We argue further that the volume of the typical set is at least this large. If n is sufficiently large so that property 1 is satisfied, then

 

 

1 − ε ≤ ∫_{A_ε^(n)} f(x1, x2, . . . , xn) dx1 dx2 · · · dxn  (8.18)

      ≤ ∫_{A_ε^(n)} 2^{−n(h(X)−ε)} dx1 dx2 · · · dxn  (8.19)

      = 2^{−n(h(X)−ε)} ∫_{A_ε^(n)} dx1 dx2 · · · dxn  (8.20)

      = 2^{−n(h(X)−ε)} Vol(A_ε^(n)),  (8.21)

establishing property 3. Thus for n sufficiently large, we have

(1 − ε) 2^{n(h(X)−ε)} ≤ Vol(A_ε^(n)) ≤ 2^{n(h(X)+ε)}.  (8.22)
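A similar sketch (again assuming NumPy; the block length, ε, and trial count are arbitrary) estimates Pr{A_ε^(n)} for i.i.d. N(0, 1) blocks and illustrates property 1 of Theorem 8.2.2.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, eps, n, trials = 1.0, 0.1, 500, 2000
h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)        # h(X) in bits

def neg_log_density_rate(x):
    """-(1/n) log2 f(x_1, ..., x_n) for i.i.d. N(0, sigma^2) samples."""
    log2_f = -0.5 * np.log2(2 * np.pi * sigma**2) - (x**2 / (2 * sigma**2)) * np.log2(np.e)
    return -log2_f.mean()

blocks = rng.normal(0.0, sigma, size=(trials, n))
rates = np.array([neg_log_density_rate(b) for b in blocks])
p_typical = np.mean(np.abs(rates - h) <= eps)
print(f"estimated Pr(A_eps^(n)) = {p_typical:.3f}  (should exceed 1 - eps = {1 - eps})")
```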

 

 

 

 

 

 

 

Theorem 8.2.3 The set A_ε^(n) is the smallest volume set with probability ≥ 1 − ε, to first order in the exponent.

Proof: Same as in the discrete case.

 

This theorem indicates that the volume of the smallest set that contains most of the probability is approximately 2^{nh}. This is an n-dimensional volume, so the corresponding side length is (2^{nh})^{1/n} = 2^h. This provides


an interpretation of the differential entropy: It is the logarithm of the equivalent side length of the smallest set that contains most of the probability. Hence low entropy implies that the random variable is confined to a small effective volume and high entropy indicates that the random variable is widely dispersed.
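For example, if X ∼ N(0, σ²), then h(X) = (1/2) log(2πeσ²) bits, so the equivalent side length is 2^{h(X)} = (2πe)^{1/2} σ ≈ 4.13σ: the typical set of n i.i.d. samples behaves like an n-dimensional box whose side is roughly four standard deviations.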

Note. Just as the entropy is related to the volume of the typical set, there is a quantity called Fisher information which is related to the surface area of the typical set. We discuss Fisher information in more detail in Sections 11.10 and 17.8.

8.3 RELATION OF DIFFERENTIAL ENTROPY TO DISCRETE ENTROPY

Consider a random variable X with density f(x) illustrated in Figure 8.1. Suppose that we divide the range of X into bins of length Δ. Let us assume that the density is continuous within the bins. Then, by the mean value theorem, there exists a value xi within each bin such that

 

 

 

 

 

f(xi) Δ = ∫_{iΔ}^{(i+1)Δ} f(x) dx.  (8.23)

 

Consider the quantized random variable X^Δ, which is defined by

 

 

 

 

X^Δ = xi    if iΔ ≤ X < (i + 1)Δ.  (8.24)

[Figure 8.1 plots the density f(x) against x, with the horizontal axis divided into bins of width Δ.]

FIGURE 8.1. Quantization of a continuous random variable.


Then the probability that X^Δ = xi is

 

 

 

pi = ∫_{iΔ}^{(i+1)Δ} f(x) dx = f(xi) Δ.  (8.25)

 

 

The entropy of the quantized version is

 

 

 

 

 

 

 

 

H(X^Δ) = − Σ_{i=−∞}^{∞} pi log pi  (8.26)

       = − Σ_{i=−∞}^{∞} f(xi)Δ log(f(xi)Δ)  (8.27)

       = − Σ f(xi)Δ log f(xi) − Σ f(xi)Δ log Δ  (8.28)

       = − Σ Δ f(xi) log f(xi) − log Δ,  (8.29)

since Σ f(xi)Δ = ∫ f(x) dx = 1. If f(x) log f(x) is Riemann integrable (a condition to ensure that the limit is well defined [556]), the first term in (8.29) approaches the integral of −f(x) log f(x) as Δ → 0 by definition of Riemann integrability. This proves the following.

Theorem 8.3.1 If the density f (x) of the random variable X is Riemann integrable, then

H(X^Δ) + log Δ → h(f) = h(X),  as Δ → 0.  (8.30)

Thus, the entropy of an n-bit quantization of a continuous random variable X is approximately h(X) + n.

Example 8.3.1

1. If X has a uniform distribution on [0, 1] and we let Δ = 2^{−n}, then h = 0, H(X^Δ) = n, and n bits suffice to describe X to n-bit accuracy.

2. If X is uniformly distributed on [0, 1/8], the first 3 bits to the right of the decimal point must be 0. To describe X to n-bit accuracy requires only n − 3 bits, which agrees with h(X) = −3.

3. If X ∼ N(0, σ²) with σ² = 100, describing X to n-bit accuracy would require on the average n + (1/2) log(2πeσ²) = n + 5.37 bits.


In general, h(X) + n is the number of bits on the average required to describe X to n-bit accuracy.
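Theorem 8.3.1 can be illustrated numerically. The sketch below (assuming NumPy; the standard normal and the bin widths Δ = 2^{−n} are arbitrary choices) estimates H(X^Δ) from empirical bin frequencies and shows that H(X^Δ) + log Δ approaches h(X) ≈ 2.05 bits.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=1_000_000)
h_true = 0.5 * np.log2(2 * np.pi * np.e)              # h(X) for N(0, 1), about 2.05 bits

for n_bits in [2, 4, 6, 8]:
    delta = 2.0 ** (-n_bits)
    bins = np.floor(x / delta).astype(int)            # index i such that i*delta <= x < (i+1)*delta
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    H_quantized = -np.sum(p * np.log2(p))             # discrete entropy H(X^Delta) in bits
    print(f"Delta = 2^-{n_bits}: H(X^Delta) + log2(Delta) = {H_quantized - n_bits:.3f}  (h(X) = {h_true:.3f})")
```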

The differential entropy of a discrete random variable can be considered to be −∞. Note that 2−∞ = 0, agreeing with the idea that the volume of the support set of a discrete random variable is zero.

8.4 JOINT AND CONDITIONAL DIFFERENTIAL ENTROPY

As in the discrete case, we can extend the definition of differential entropy of a single random variable to several random variables.

Definition The differential entropy of a set X1, X2, . . . , Xn of random variables with density f (x1, x2, . . . , xn) is defined as

h(X1, X2, . . . , Xn) = − ∫ f(x^n) log f(x^n) dx^n.  (8.31)

Definition If X, Y have a joint density function f (x, y), we can define the conditional differential entropy h(X|Y ) as

h(X|Y) = − ∫ f(x, y) log f(x|y) dx dy.  (8.32)

Since in general f (x|y) = f (x, y)/f (y), we can also write

h(X|Y) = h(X, Y) − h(Y).  (8.33)

But we must be careful if any of the differential entropies are infinite.

The next entropy evaluation is used frequently in the text.

Theorem 8.4.1 (Entropy of a multivariate normal distribution) Let X1, X2, . . . , Xn have a multivariate normal distribution with mean µ and covariance matrix K. Then

h(X1, X2, . . . , Xn) = h(Nn(µ, K)) = (1/2) log (2πe)^n |K|  bits,  (8.34)

where |K| denotes the determinant of K.


Proof: The probability density function of X1, X2, . . . , Xn is

f(x) = (1 / ((√(2π))^n |K|^{1/2})) e^{−(1/2)(x−µ)^T K^{−1}(x−µ)}.  (8.35)

Then

h(f) = − ∫ f(x) [ −(1/2)(x − µ)^T K^{−1}(x − µ) − ln ((√(2π))^n |K|^{1/2}) ] dx  (8.36)

     = (1/2) E[ Σ_{i,j} (Xi − µi)(K^{−1})_{ij}(Xj − µj) ] + (1/2) ln (2π)^n |K|  (8.37)

     = (1/2) E[ Σ_{i,j} (Xi − µi)(Xj − µj)(K^{−1})_{ij} ] + (1/2) ln (2π)^n |K|  (8.38)

     = (1/2) Σ_{i,j} E[(Xj − µj)(Xi − µi)] (K^{−1})_{ij} + (1/2) ln (2π)^n |K|  (8.39)

     = (1/2) Σ_j Σ_i Kji (K^{−1})_{ij} + (1/2) ln (2π)^n |K|  (8.40)

     = (1/2) Σ_j (K K^{−1})_{jj} + (1/2) ln (2π)^n |K|  (8.41)

     = (1/2) Σ_j Ijj + (1/2) ln (2π)^n |K|  (8.42)

     = n/2 + (1/2) ln (2π)^n |K|  (8.43)

     = (1/2) ln (2πe)^n |K|  nats  (8.44)

     = (1/2) log (2πe)^n |K|  bits.  (8.45)
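Since h(f) = E[−log f(X)], the closed form (8.45) can be checked by Monte Carlo. A minimal sketch, assuming NumPy and SciPy and an arbitrary 3 × 3 covariance matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])          # an arbitrary positive definite covariance matrix
n = K.shape[0]

h_closed = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))   # (1/2) log2 (2 pi e)^n |K|

mvn = multivariate_normal(mean=np.zeros(n), cov=K)
x = mvn.rvs(size=200_000, random_state=3)
h_mc = -np.mean(mvn.logpdf(x)) / np.log(2)                             # E[-log2 f(X)] from samples

print(f"closed form: {h_closed:.3f} bits, Monte Carlo: {h_mc:.3f} bits")
```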

8.5 RELATIVE ENTROPY AND MUTUAL INFORMATION

We now extend the definition of two familiar quantities, D(f ||g) and I (X; Y ), to probability densities.


Definition The relative entropy (or Kullback–Leibler distance) D(f ||g) between two densities f and g is defined by

D(f||g) = ∫ f log (f/g).  (8.46)

 

 

Note that D(f||g) is finite only if the support set of f is contained in the support set of g. [Motivated by continuity, we set 0 log (0/0) = 0.]

Definition The mutual information I (X; Y ) between two random variables with joint density f (x, y) is defined as

I(X; Y) = ∫ f(x, y) log [ f(x, y) / (f(x)f(y)) ] dx dy.  (8.47)

From the definition it is clear that

I(X; Y) = h(X) − h(X|Y) = h(Y) − h(Y|X) = h(X) + h(Y) − h(X, Y)  (8.48)

and

I(X; Y) = D(f(x, y) || f(x)f(y)).  (8.49)

The properties of D(f ||g) and I (X; Y ) are the same as in the discrete case. In particular, the mutual information between two random variables is the limit of the mutual information between their quantized

versions, since

 

I(X^Δ; Y^Δ) = H(X^Δ) − H(X^Δ|Y^Δ)  (8.50)

            ≈ h(X) − log Δ − (h(X|Y) − log Δ)  (8.51)

            = I(X; Y).  (8.52)

More generally, we can define mutual information in terms of finite partitions of the range of the random variable. Let 𝒳 be the range of a random variable X. A partition P of 𝒳 is a finite collection of disjoint sets Pi such that ∪_i Pi = 𝒳. The quantization of X by P (denoted [X]_P) is the discrete random variable defined by

Pr([X]_P = i) = Pr(X ∈ Pi) = ∫_{Pi} dF(x).  (8.53)

 

For two random variables X and Y with partitions P and Q, we can calculate the mutual information between the quantized versions of X and Y using (2.28). Mutual information can now be defined for arbitrary pairs of random variables as follows:


Definition The mutual information between two random variables X and Y is given by

I(X; Y) = sup_{P,Q} I([X]_P; [Y]_Q),  (8.54)

 

where the supremum is over all finite partitions P and Q.

This is the master definition of mutual information that always applies, even to joint distributions with atoms, densities, and singular parts. Moreover, by continuing to refine the partitions P and Q, one finds a monotonically increasing sequence I([X]_P; [Y]_Q) ↑ I.

By arguments similar to (8.52), we can show that this definition of mutual information is equivalent to (8.47) for random variables that have a density. For discrete random variables, this definition is equivalent to the definition of mutual information in (2.28).

Example 8.5.1 (Mutual information between correlated Gaussian random variables with correlation ρ)  Let (X, Y) ∼ N(0, K), where

 

 

 

K = [ σ²    ρσ²
      ρσ²   σ²  ].   (8.55)

 

 

 

 

 

Then h(X) = h(Y) = (1/2) log (2πe)σ² and h(X, Y) = (1/2) log (2πe)²|K| = (1/2) log (2πe)²σ⁴(1 − ρ²), and therefore

 

I(X; Y) = h(X) + h(Y) − h(X, Y) = −(1/2) log(1 − ρ²).  (8.56)

If ρ = 0, X and Y are independent and the mutual information is 0. If ρ = ±1, X and Y are perfectly correlated and the mutual information is infinite.
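The closed form (8.56) can be compared with a partition-based estimate in the spirit of the quantization argument above. The following sketch (assuming NumPy; ρ, the grid of bins, and the sample size are arbitrary) computes the discrete mutual information of the quantized pair and recovers −(1/2) log(1 − ρ²) ≈ 0.74 bit for ρ = 0.8.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, sigma, N = 0.8, 1.0, 1_000_000
I_closed = -0.5 * np.log2(1 - rho**2)                       # mutual information in bits

cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=N)

edges = np.arange(-5.0, 5.0 + 1e-9, 0.25)                   # a finite partition of the range
counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[edges, edges])
p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
nz = p_xy > 0
I_quantized = np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz]))

print(f"I(X;Y) closed form: {I_closed:.3f} bits, quantized estimate: {I_quantized:.3f} bits")
```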

8.6 PROPERTIES OF DIFFERENTIAL ENTROPY, RELATIVE ENTROPY, AND MUTUAL INFORMATION

Theorem 8.6.1

D(f||g) ≥ 0  (8.57)

with equality iff f = g almost everywhere (a.e.).

 

Proof: Let S be the support set of f . Then

 

 

−D(f||g) = ∫_S f log (g/f)  (8.58)

         ≤ log ∫_S f (g/f)    (by Jensen’s inequality)  (8.59)

         = log ∫_S g  (8.60)

         ≤ log 1 = 0.  (8.61)

We have equality iff we have equality in Jensen’s inequality, which occurs iff f = g a.e.
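As a numerical illustration of Theorem 8.6.1 (assuming NumPy), D(f||g) for two one-dimensional Gaussian densities can be estimated by Monte Carlo as E_f[log f(X) − log g(X)]; the estimate is strictly positive when f ≠ g and approximately zero when f = g.

```python
import numpy as np

rng = np.random.default_rng(5)

def log_normal_pdf(x, mu, sigma):
    """Natural log of the N(mu, sigma^2) density."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def kl_estimate(mu_f, sig_f, mu_g, sig_g, n=500_000):
    """Monte Carlo estimate of D(f||g) = E_f[log f(X) - log g(X)], in bits."""
    x = rng.normal(mu_f, sig_f, size=n)
    return np.mean(log_normal_pdf(x, mu_f, sig_f) - log_normal_pdf(x, mu_g, sig_g)) / np.log(2)

print(kl_estimate(0.0, 1.0, 1.0, 2.0))   # strictly positive, since f != g
print(kl_estimate(0.0, 1.0, 0.0, 1.0))   # approximately 0, since f = g
```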

Corollary I (X; Y ) ≥ 0 with equality iff X and Y are independent.

Corollary h(X|Y) ≤ h(X) with equality iff X and Y are independent.

Theorem 8.6.2 (Chain rule for differential entropy)

h(X1, X2, . . . , Xn) = Σ_{i=1}^{n} h(Xi | X1, X2, . . . , Xi−1).  (8.62)

 

Proof: Follows directly from the definitions.

 

Corollary

h(X1, X2, . . . , Xn) ≤ Σ_{i=1}^{n} h(Xi),  (8.63)

with equality iff X1, X2, . . . , Xn are independent.

Proof: Follows directly from Theorem 8.6.2 and the corollary to Theorem 8.6.1.

Application (Hadamard’s inequality) If we let X ∼ N(0, K) be a multivariate normal random variable, calculating the entropy in the above inequality gives us

|K| ≤ ∏_{i=1}^{n} Kii,  (8.64)

which is Hadamard’s inequality. A number of determinant inequalities can be derived in this fashion from information-theoretic inequalities (Chapter 17).
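A quick numerical check of (8.64) on an arbitrarily generated positive definite covariance matrix (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(5, 5))
K = A @ A.T + 0.1 * np.eye(5)          # a positive definite covariance matrix

det_K = np.linalg.det(K)
prod_diag = np.prod(np.diag(K))
print(f"|K| = {det_K:.3f} <= product of K_ii = {prod_diag:.3f}: {det_K <= prod_diag}")
```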

Theorem 8.6.3

h(X + c) = h(X).

(8.65)

Translation does not change the differential entropy.

Proof: Follows directly from the definition of differential entropy.


Theorem 8.6.4

h(aX) = h(X) + log |a|.  (8.66)

Proof: Let Y = aX. Then fY(y) = (1/|a|) fX(y/a), and

 

 

 

 

h(aX) = − ∫ fY(y) log fY(y) dy  (8.67)

      = − ∫ (1/|a|) fX(y/a) log ( (1/|a|) fX(y/a) ) dy  (8.68)

      = − ∫ fX(x) log fX(x) dx + log |a|  (8.69)

      = h(X) + log |a|,  (8.70)

after a change of variables in the integral.
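Theorems 8.6.3 and 8.6.4 can be sanity-checked against the Gaussian closed form: if X ∼ N(0, σ²), then aX + c ∼ N(c, a²σ²), so the differential entropy gains exactly log |a| and is unaffected by c. A minimal sketch (assuming NumPy; the parameter values are arbitrary):

```python
import numpy as np

def h_gaussian(sigma2):
    """Differential entropy of N(mu, sigma2) in bits (independent of the mean mu)."""
    return 0.5 * np.log2(2 * np.pi * np.e * sigma2)

sigma2, a, c = 4.0, 3.0, 7.0
lhs = h_gaussian(a**2 * sigma2)                 # h(aX + c), using aX + c ~ N(c, a^2 sigma2)
rhs = h_gaussian(sigma2) + np.log2(abs(a))      # h(X) + log|a|
print(f"h(aX + c) = {lhs:.3f}, h(X) + log|a| = {rhs:.3f}")
```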

 

 

 

 

 

 

 

 

 

 

Similarly, we can prove the following corollary for vector-valued random variables.

Corollary

h(AX) = h(X) + log |det(A)|.

(8.71)

We now show that the multivariate normal distribution maximizes the entropy over all distributions with the same covariance.

Theorem 8.6.5 Let the random vector X ∈ R^n have zero mean and covariance K = E XX^t (i.e., Kij = E Xi Xj, 1 ≤ i, j ≤ n). Then h(X) ≤ (1/2) log (2πe)^n |K|, with equality iff X ∼ N(0, K).

Proof: Let g(x) be any density satisfying ∫ g(x) xi xj dx = Kij for all i, j. Let φK be the density of a N(0, K) vector as given in (8.35), where we set µ = 0. Note that log φK(x) is a quadratic form and ∫ xi xj φK(x) dx = Kij. Then

 

 

 

 

 

 

0 ≤ D(g||φK)  (8.72)

  = ∫ g log (g/φK)  (8.73)

  = −h(g) − ∫ g log φK  (8.74)