

8.2 AEP FOR CONTINUOUS RANDOM VARIABLES

One of the important roles of the entropy for discrete random variables is in the AEP, which states that for a sequence of i.i.d. random variables, p(X1, X2, . . . , Xn) is close to 2^{−nH(X)} with high probability. This enables us to define the typical set and characterize the behavior of typical sequences.

We can do the same for a continuous random variable.

Theorem 8.2.1 Let X1, X2, . . . , Xn be a sequence of random variables drawn i.i.d. according to the density f(x). Then

−(1/n) log f(X1, X2, . . . , Xn) → E[−log f(X)] = h(X)  in probability.  (8.10)

Proof: The proof follows directly from the weak law of large numbers.
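As a numerical illustration of Theorem 8.2.1, the convergence can be checked for a density whose differential entropy is known in closed form. The following minimal sketch (assuming NumPy is available; the choice of N(0, 1) and the sample sizes are arbitrary) compares −(1/n) log f(X1, . . . , Xn) with h(X) = (1/2) log 2πe ≈ 2.05 bits.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
h_true = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)  # differential entropy of N(0, sigma^2) in bits

def log2_density(x, sigma=1.0):
    """log2 of the N(0, sigma^2) density evaluated at x."""
    return -0.5 * np.log2(2 * np.pi * sigma**2) - (x**2 / (2 * sigma**2)) * np.log2(np.e)

for n in [10, 100, 1000, 10000]:
    x = rng.normal(0.0, sigma, size=n)
    empirical = -np.mean(log2_density(x, sigma))   # -(1/n) * sum_i log2 f(X_i)
    print(f"n = {n:6d}: -(1/n) log f(X^n) = {empirical:.3f},  h(X) = {h_true:.3f}")
```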

This leads to the following definition of the typical set.

 

 

 

Definition For ε > 0 and any n, we define the typical set A_ε^(n) with respect to f(x) as follows:

A_ε^(n) = { (x1, x2, . . . , xn) ∈ S^n :  | −(1/n) log f(x1, x2, . . . , xn) − h(X) | ≤ ε },  (8.11)

where f(x1, x2, . . . , xn) = ∏_{i=1}^{n} f(xi).

 

 

 

 

 

The properties of the typical set for continuous random variables parallel those for discrete random variables. The analog of the cardinality of the typical set for the discrete case is the volume of the typical set for continuous random variables.

Definition The volume Vol(A) of a set A ⊂ R^n is defined as

Vol(A) = ∫_A dx1 dx2 · · · dxn.  (8.12)

Theorem 8.2.2 The typical set A_ε^(n) has the following properties:

1. Pr{A_ε^(n)} > 1 − ε for n sufficiently large.

2. Vol(A_ε^(n)) ≤ 2^{n(h(X)+ε)} for all n.

3. Vol(A_ε^(n)) ≥ (1 − ε) 2^{n(h(X)−ε)} for n sufficiently large.


Proof: By Theorem 8.2.1, −(1/n) log f(X^n) = −(1/n) Σ_i log f(Xi) → h(X) in probability, establishing property 1. Also,

1 = ∫_{S^n} f(x1, x2, . . . , xn) dx1 dx2 · · · dxn  (8.13)

  ≥ ∫_{A_ε^(n)} f(x1, x2, . . . , xn) dx1 dx2 · · · dxn  (8.14)

  ≥ ∫_{A_ε^(n)} 2^{−n(h(X)+ε)} dx1 dx2 · · · dxn  (8.15)

  = 2^{−n(h(X)+ε)} ∫_{A_ε^(n)} dx1 dx2 · · · dxn  (8.16)

  = 2^{−n(h(X)+ε)} Vol(A_ε^(n)).  (8.17)

Hence we have property 2. We argue further that the volume of the typical set is at least this large. If n is sufficiently large so that property 1 is satisfied, then

 

 

1 − ε ≤ ∫_{A_ε^(n)} f(x1, x2, . . . , xn) dx1 dx2 · · · dxn  (8.18)

      ≤ ∫_{A_ε^(n)} 2^{−n(h(X)−ε)} dx1 dx2 · · · dxn  (8.19)

      = 2^{−n(h(X)−ε)} ∫_{A_ε^(n)} dx1 dx2 · · · dxn  (8.20)

      = 2^{−n(h(X)−ε)} Vol(A_ε^(n)),  (8.21)

establishing property 3. Thus for n sufficiently large, we have

(1 − ε) 2^{n(h(X)−ε)} ≤ Vol(A_ε^(n)) ≤ 2^{n(h(X)+ε)}.  (8.22)
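A similar sketch (again assuming NumPy; the block length, ε, and trial count are arbitrary) estimates Pr{A_ε^(n)} for i.i.d. N(0, 1) blocks and illustrates property 1 of Theorem 8.2.2.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, eps, n, trials = 1.0, 0.1, 500, 2000
h = 0.5 * np.log2(2 * np.pi * np.e * sigma**2)        # h(X) in bits

def neg_log_density_rate(x):
    """-(1/n) log2 f(x_1, ..., x_n) for i.i.d. N(0, sigma^2) samples."""
    log2_f = -0.5 * np.log2(2 * np.pi * sigma**2) - (x**2 / (2 * sigma**2)) * np.log2(np.e)
    return -log2_f.mean()

blocks = rng.normal(0.0, sigma, size=(trials, n))
rates = np.array([neg_log_density_rate(b) for b in blocks])
p_typical = np.mean(np.abs(rates - h) <= eps)
print(f"estimated Pr(A_eps^(n)) = {p_typical:.3f}  (should exceed 1 - eps = {1 - eps})")
```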

 

 

 

 

 

 

 

Theorem 8.2.3 The set A_ε^(n) is the smallest volume set with probability ≥ 1 − ε, to first order in the exponent.

Proof: Same as in the discrete case.

 

This theorem indicates that the volume of the smallest set that contains most of the probability is approximately 2^{nh}. This is an n-dimensional volume, so the corresponding side length is (2^{nh})^{1/n} = 2^h. This provides


an interpretation of the differential entropy: It is the logarithm of the equivalent side length of the smallest set that contains most of the probability. Hence low entropy implies that the random variable is confined to a small effective volume and high entropy indicates that the random variable is widely dispersed.
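For example, if X ∼ N(0, σ²), then h(X) = (1/2) log(2πeσ²) bits, so the equivalent side length is 2^{h(X)} = (2πe)^{1/2} σ ≈ 4.13σ: the typical set of n i.i.d. samples behaves like an n-dimensional box whose side is roughly four standard deviations.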

Note. Just as the entropy is related to the volume of the typical set, there is a quantity called Fisher information which is related to the surface area of the typical set. We discuss Fisher information in more detail in Sections 11.10 and 17.8.

8.3 RELATION OF DIFFERENTIAL ENTROPY TO DISCRETE ENTROPY

Consider a random variable X with density f(x) illustrated in Figure 8.1. Suppose that we divide the range of X into bins of length Δ. Let us assume that the density is continuous within the bins. Then, by the mean value theorem, there exists a value xi within each bin such that

 

 

 

 

 

f(xi) Δ = ∫_{iΔ}^{(i+1)Δ} f(x) dx.  (8.23)

 

Consider the quantized random variable X^Δ, which is defined by

 

 

 

 

X^Δ = xi    if iΔ ≤ X < (i + 1)Δ.  (8.24)

[Figure 8.1 plots the density f(x) against x, with the horizontal axis divided into bins of width Δ.]

FIGURE 8.1. Quantization of a continuous random variable.


Then the probability that X^Δ = xi is

 

 

 

pi = ∫_{iΔ}^{(i+1)Δ} f(x) dx = f(xi) Δ.  (8.25)

 

 

The entropy of the quantized version is

 

 

 

 

 

 

 

 

H(X^Δ) = − Σ_{i=−∞}^{∞} pi log pi  (8.26)

       = − Σ_{i=−∞}^{∞} f(xi)Δ log(f(xi)Δ)  (8.27)

       = − Σ f(xi)Δ log f(xi) − Σ f(xi)Δ log Δ  (8.28)

       = − Σ Δ f(xi) log f(xi) − log Δ,  (8.29)

since Σ f(xi)Δ = ∫ f(x) dx = 1. If f(x) log f(x) is Riemann integrable (a condition to ensure that the limit is well defined [556]), the first term in (8.29) approaches the integral of −f(x) log f(x) as Δ → 0 by definition of Riemann integrability. This proves the following.

Theorem 8.3.1 If the density f (x) of the random variable X is Riemann integrable, then

H(X^Δ) + log Δ → h(f) = h(X),  as Δ → 0.  (8.30)

Thus, the entropy of an n-bit quantization of a continuous random variable X is approximately h(X) + n.

Example 8.3.1

1. If X has a uniform distribution on [0, 1] and we let Δ = 2^{−n}, then h = 0, H(X^Δ) = n, and n bits suffice to describe X to n-bit accuracy.

2. If X is uniformly distributed on [0, 1/8], the first 3 bits to the right of the decimal point must be 0. To describe X to n-bit accuracy requires only n − 3 bits, which agrees with h(X) = −3.

3. If X ∼ N(0, σ²) with σ² = 100, describing X to n-bit accuracy would require on the average n + (1/2) log(2πeσ²) = n + 5.37 bits.


In general, h(X) + n is the number of bits on the average required to describe X to n-bit accuracy.
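Theorem 8.3.1 can be illustrated numerically. The sketch below (assuming NumPy; the standard normal and the bin widths Δ = 2^{−n} are arbitrary choices) estimates H(X^Δ) from empirical bin frequencies and shows that H(X^Δ) + log Δ approaches h(X) ≈ 2.05 bits.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, size=1_000_000)
h_true = 0.5 * np.log2(2 * np.pi * np.e)              # h(X) for N(0, 1), about 2.05 bits

for n_bits in [2, 4, 6, 8]:
    delta = 2.0 ** (-n_bits)
    bins = np.floor(x / delta).astype(int)            # index i such that i*delta <= x < (i+1)*delta
    _, counts = np.unique(bins, return_counts=True)
    p = counts / counts.sum()
    H_quantized = -np.sum(p * np.log2(p))             # discrete entropy H(X^Delta) in bits
    print(f"Delta = 2^-{n_bits}: H(X^Delta) + log2(Delta) = {H_quantized - n_bits:.3f}  (h(X) = {h_true:.3f})")
```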

The differential entropy of a discrete random variable can be considered to be −∞. Note that 2−∞ = 0, agreeing with the idea that the volume of the support set of a discrete random variable is zero.

8.4 JOINT AND CONDITIONAL DIFFERENTIAL ENTROPY

As in the discrete case, we can extend the definition of differential entropy of a single random variable to several random variables.

Definition The differential entropy of a set X1, X2, . . . , Xn of random variables with density f (x1, x2, . . . , xn) is defined as

h(X1, X2, . . . , Xn) = − ∫ f(x^n) log f(x^n) dx^n.  (8.31)

Definition If X, Y have a joint density function f (x, y), we can define the conditional differential entropy h(X|Y ) as

h(X|Y) = − ∫ f(x, y) log f(x|y) dx dy.  (8.32)

Since in general f (x|y) = f (x, y)/f (y), we can also write

h(X|Y) = h(X, Y) − h(Y).  (8.33)

But we must be careful if any of the differential entropies are infinite.

The next entropy evaluation is used frequently in the text.

Theorem 8.4.1 (Entropy of a multivariate normal distribution) Let X1, X2, . . . , Xn have a multivariate normal distribution with mean µ and covariance matrix K. Then

h(X1, X2, . . . , Xn) = h(Nn(µ, K)) = (1/2) log (2πe)^n |K|  bits,  (8.34)

where |K| denotes the determinant of K.


Proof: The probability density function of X1, X2, . . . , Xn is

f(x) = (1 / ((√(2π))^n |K|^{1/2})) e^{−(1/2)(x−µ)^T K^{−1}(x−µ)}.  (8.35)

Then

h(f) = − ∫ f(x) [ −(1/2)(x − µ)^T K^{−1}(x − µ) − ln ((√(2π))^n |K|^{1/2}) ] dx  (8.36)

     = (1/2) E[ Σ_{i,j} (Xi − µi)(K^{−1})_{ij}(Xj − µj) ] + (1/2) ln (2π)^n |K|  (8.37)

     = (1/2) E[ Σ_{i,j} (Xi − µi)(Xj − µj)(K^{−1})_{ij} ] + (1/2) ln (2π)^n |K|  (8.38)

     = (1/2) Σ_{i,j} E[(Xj − µj)(Xi − µi)] (K^{−1})_{ij} + (1/2) ln (2π)^n |K|  (8.39)

     = (1/2) Σ_j Σ_i Kji (K^{−1})_{ij} + (1/2) ln (2π)^n |K|  (8.40)

     = (1/2) Σ_j (K K^{−1})_{jj} + (1/2) ln (2π)^n |K|  (8.41)

     = (1/2) Σ_j Ijj + (1/2) ln (2π)^n |K|  (8.42)

     = n/2 + (1/2) ln (2π)^n |K|  (8.43)

     = (1/2) ln (2πe)^n |K|  nats  (8.44)

     = (1/2) log (2πe)^n |K|  bits.  (8.45)
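Since h(f) = E[−log f(X)], the closed form (8.45) can be checked by Monte Carlo. A minimal sketch, assuming NumPy and SciPy and an arbitrary 3 × 3 covariance matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.5, 0.3],
              [0.5, 1.0, 0.2],
              [0.3, 0.2, 1.5]])          # an arbitrary positive definite covariance matrix
n = K.shape[0]

h_closed = 0.5 * np.log2((2 * np.pi * np.e) ** n * np.linalg.det(K))   # (1/2) log2 (2 pi e)^n |K|

mvn = multivariate_normal(mean=np.zeros(n), cov=K)
x = mvn.rvs(size=200_000, random_state=3)
h_mc = -np.mean(mvn.logpdf(x)) / np.log(2)                             # E[-log2 f(X)] from samples

print(f"closed form: {h_closed:.3f} bits, Monte Carlo: {h_mc:.3f} bits")
```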

8.5 RELATIVE ENTROPY AND MUTUAL INFORMATION

We now extend the definition of two familiar quantities, D(f ||g) and I (X; Y ), to probability densities.


Definition The relative entropy (or Kullback–Leibler distance) D(f ||g) between two densities f and g is defined by

D(f||g) = ∫ f log (f/g).  (8.46)

 

 

Note that D(f||g) is finite only if the support set of f is contained in the support set of g. [Motivated by continuity, we set 0 log (0/0) = 0.]

Definition The mutual information I (X; Y ) between two random variables with joint density f (x, y) is defined as

I(X; Y) = ∫ f(x, y) log [ f(x, y) / (f(x)f(y)) ] dx dy.  (8.47)

From the definition it is clear that

I(X; Y) = h(X) − h(X|Y) = h(Y) − h(Y|X) = h(X) + h(Y) − h(X, Y)  (8.48)

and

I(X; Y) = D(f(x, y) || f(x)f(y)).  (8.49)

The properties of D(f ||g) and I (X; Y ) are the same as in the discrete case. In particular, the mutual information between two random variables is the limit of the mutual information between their quantized

versions, since

 

I(X^Δ; Y^Δ) = H(X^Δ) − H(X^Δ|Y^Δ)  (8.50)

            ≈ h(X) − log Δ − (h(X|Y) − log Δ)  (8.51)

            = I(X; Y).  (8.52)

More generally, we can define mutual information in terms of finite partitions of the range of the random variable. Let 𝒳 be the range of a random variable X. A partition P of 𝒳 is a finite collection of disjoint sets Pi such that ∪_i Pi = 𝒳. The quantization of X by P (denoted [X]_P) is the discrete random variable defined by

Pr([X]_P = i) = Pr(X ∈ Pi) = ∫_{Pi} dF(x).  (8.53)

 

For two random variables X and Y with partitions P and Q, we can calculate the mutual information between the quantized versions of X and Y using (2.28). Mutual information can now be defined for arbitrary pairs of random variables as follows:


Definition The mutual information between two random variables X and Y is given by

I(X; Y) = sup_{P,Q} I([X]_P; [Y]_Q),  (8.54)

 

where the supremum is over all finite partitions P and Q.

This is the master definition of mutual information that always applies, even to joint distributions with atoms, densities, and singular parts. Moreover, by continuing to refine the partitions P and Q, one finds a monotonically increasing sequence I([X]_P; [Y]_Q) ↑ I.

By arguments similar to (8.52), we can show that this definition of mutual information is equivalent to (8.47) for random variables that have a density. For discrete random variables, this definition is equivalent to the definition of mutual information in (2.28).

Example 8.5.1 (Mutual information between correlated Gaussian random variables with correlation ρ)  Let (X, Y) ∼ N(0, K), where

 

 

 

K = [ σ²    ρσ²
      ρσ²   σ²  ].   (8.55)

 

 

 

 

 

Then h(X) = h(Y) = (1/2) log (2πe)σ² and h(X, Y) = (1/2) log (2πe)²|K| = (1/2) log (2πe)²σ⁴(1 − ρ²), and therefore

 

I(X; Y) = h(X) + h(Y) − h(X, Y) = −(1/2) log(1 − ρ²).  (8.56)

If ρ = 0, X and Y are independent and the mutual information is 0. If ρ = ±1, X and Y are perfectly correlated and the mutual information is infinite.
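The closed form (8.56) can be compared with a partition-based estimate in the spirit of the quantization argument above. The following sketch (assuming NumPy; ρ, the grid of bins, and the sample size are arbitrary) computes the discrete mutual information of the quantized pair and recovers −(1/2) log(1 − ρ²) ≈ 0.74 bit for ρ = 0.8.

```python
import numpy as np

rng = np.random.default_rng(4)
rho, sigma, N = 0.8, 1.0, 1_000_000
I_closed = -0.5 * np.log2(1 - rho**2)                       # mutual information in bits

cov = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
xy = rng.multivariate_normal([0.0, 0.0], cov, size=N)

edges = np.arange(-5.0, 5.0 + 1e-9, 0.25)                   # a finite partition of the range
counts, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=[edges, edges])
p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
nz = p_xy > 0
I_quantized = np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz]))

print(f"I(X;Y) closed form: {I_closed:.3f} bits, quantized estimate: {I_quantized:.3f} bits")
```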

8.6 PROPERTIES OF DIFFERENTIAL ENTROPY, RELATIVE ENTROPY, AND MUTUAL INFORMATION

Theorem 8.6.1

D(f||g) ≥ 0  (8.57)

with equality iff f = g almost everywhere (a.e.).

 

Proof: Let S be the support set of f . Then

 

 

−D(f||g) = ∫_S f log (g/f)  (8.58)

         ≤ log ∫_S f (g/f)    (by Jensen’s inequality)  (8.59)

         = log ∫_S g  (8.60)

         ≤ log 1 = 0.  (8.61)

We have equality iff we have equality in Jensen’s inequality, which occurs iff f = g a.e.
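As a numerical illustration of Theorem 8.6.1 (assuming NumPy), D(f||g) for two one-dimensional Gaussian densities can be estimated by Monte Carlo as E_f[log f(X) − log g(X)]; the estimate is strictly positive when f ≠ g and approximately zero when f = g.

```python
import numpy as np

rng = np.random.default_rng(5)

def log_normal_pdf(x, mu, sigma):
    """Natural log of the N(mu, sigma^2) density."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def kl_estimate(mu_f, sig_f, mu_g, sig_g, n=500_000):
    """Monte Carlo estimate of D(f||g) = E_f[log f(X) - log g(X)], in bits."""
    x = rng.normal(mu_f, sig_f, size=n)
    return np.mean(log_normal_pdf(x, mu_f, sig_f) - log_normal_pdf(x, mu_g, sig_g)) / np.log(2)

print(kl_estimate(0.0, 1.0, 1.0, 2.0))   # strictly positive, since f != g
print(kl_estimate(0.0, 1.0, 0.0, 1.0))   # approximately 0, since f = g
```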

Corollary I (X; Y ) ≥ 0 with equality iff X and Y are independent.

Corollary h(X|Y) ≤ h(X) with equality iff X and Y are independent.

Theorem 8.6.2 (Chain rule for differential entropy)

h(X1, X2, . . . , Xn) = Σ_{i=1}^{n} h(Xi | X1, X2, . . . , Xi−1).  (8.62)

 

Proof: Follows directly from the definitions.

 

Corollary

h(X1, X2, . . . , Xn) ≤ Σ_{i=1}^{n} h(Xi),  (8.63)

with equality iff X1, X2, . . . , Xn are independent.

Proof: Follows directly from Theorem 8.6.2 and the corollary to Theorem 8.6.1.

Application (Hadamard’s inequality) If we let X ∼ N(0, K) be a multivariate normal random variable, calculating the entropy in the above inequality gives us

|K| ≤ ∏_{i=1}^{n} Kii,  (8.64)

which is Hadamard’s inequality. A number of determinant inequalities can be derived in this fashion from information-theoretic inequalities (Chapter 17).
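A quick numerical check of (8.64) on an arbitrarily generated positive definite covariance matrix (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(5, 5))
K = A @ A.T + 0.1 * np.eye(5)          # a positive definite covariance matrix

det_K = np.linalg.det(K)
prod_diag = np.prod(np.diag(K))
print(f"|K| = {det_K:.3f} <= product of K_ii = {prod_diag:.3f}: {det_K <= prod_diag}")
```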

Theorem 8.6.3

h(X + c) = h(X).

(8.65)

Translation does not change the differential entropy.

Proof: Follows directly from the definition of differential entropy.


Theorem 8.6.4

h(aX) = h(X) + log |a|.  (8.66)

Proof: Let Y = aX. Then fY(y) = (1/|a|) fX(y/a), and

 

 

 

 

h(aX) = − ∫ fY(y) log fY(y) dy  (8.67)

      = − ∫ (1/|a|) fX(y/a) log ( (1/|a|) fX(y/a) ) dy  (8.68)

      = − ∫ fX(x) log fX(x) dx + log |a|  (8.69)

      = h(X) + log |a|,  (8.70)

after a change of variables in the integral.
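Theorems 8.6.3 and 8.6.4 can be sanity-checked against the Gaussian closed form: if X ∼ N(0, σ²), then aX + c ∼ N(c, a²σ²), so the differential entropy gains exactly log |a| and is unaffected by c. A minimal sketch (assuming NumPy; the parameter values are arbitrary):

```python
import numpy as np

def h_gaussian(sigma2):
    """Differential entropy of N(mu, sigma2) in bits (independent of the mean mu)."""
    return 0.5 * np.log2(2 * np.pi * np.e * sigma2)

sigma2, a, c = 4.0, 3.0, 7.0
lhs = h_gaussian(a**2 * sigma2)                 # h(aX + c), using aX + c ~ N(c, a^2 sigma2)
rhs = h_gaussian(sigma2) + np.log2(abs(a))      # h(X) + log|a|
print(f"h(aX + c) = {lhs:.3f}, h(X) + log|a| = {rhs:.3f}")
```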

 

 

 

 

 

 

 

 

 

 

Similarly, we can prove the following corollary for vector-valued random variables.

Corollary

h(AX) = h(X) + log |det(A)|.

(8.71)

We now show that the multivariate normal distribution maximizes the entropy over all distributions with the same covariance.

Theorem 8.6.5 Let the random vector X ∈ R^n have zero mean and covariance K = E XX^t (i.e., Kij = E Xi Xj, 1 ≤ i, j ≤ n). Then h(X) ≤ (1/2) log (2πe)^n |K|, with equality iff X ∼ N(0, K).

Proof: Let g(x) be any density satisfying ∫ g(x) xi xj dx = Kij for all i, j. Let φK be the density of a N(0, K) vector as given in (8.35), where we set µ = 0. Note that log φK(x) is a quadratic form and ∫ xi xj φK(x) dx = Kij. Then

 

 

 

 

 

 

0 ≤ D(g||φK)  (8.72)

  = ∫ g log (g/φK)  (8.73)

  = −h(g) − ∫ g log φK  (8.74)