
2.5 Conditional Entropy and Information
coding. This property is the key to generalizing these quantities to nondiscrete alphabets.
We saw in Lemma 2.3.4 that entropy was a convex $\cap$ (i.e., concave) function of the underlying distribution. The following lemma provides similar properties of mutual information considered as a function of either a marginal or a conditional distribution.

Lemma 2.5.4: Let $\mu$ denote a pmf on a discrete space $A_X$, $\mu(x) = \Pr(X = x)$, and let $q$ be a conditional pmf, $q(y|x) = \Pr(Y = y|X = x)$. Let $\mu q$ denote the resulting joint pmf $\mu q(x,y) = \mu(x)q(y|x)$. Let $I_{\mu q} = I_{\mu q}(X;Y)$ be the average mutual information. Then $I_{\mu q}$ is a convex $\cup$ function of $q$; that is, given two conditional pmf's $q_1$ and $q_2$, $\lambda \in [0,1]$, and $\bar q = \lambda q_1 + (1-\lambda)q_2$, then
$$I_{\mu\bar q} \le \lambda I_{\mu q_1} + (1-\lambda) I_{\mu q_2},$$
and $I_{\mu q}$ is a convex $\cap$ function of $\mu$; that is, given two pmf's $\mu_1$ and $\mu_2$, $\lambda \in [0,1]$, and $\bar\mu = \lambda\mu_1 + (1-\lambda)\mu_2$, then
$$I_{\bar\mu q} \ge \lambda I_{\mu_1 q} + (1-\lambda) I_{\mu_2 q}.$$
Proof: Let $r$ (respectively $r_1$, $r_2$, $\bar r$) denote the pmf for $Y$ resulting from $q$ (respectively $q_1$, $q_2$, $\bar q$), that is, $r(y) = \Pr(Y = y) = \sum_x \mu(x)q(y|x)$. From (2.5)
$$I_{\mu\bar q} = \lambda\sum_{x,y}\mu(x)q_1(y|x)\log\left(\frac{\mu(x)q_1(y|x)}{\mu(x)r_1(y)}\,\frac{\mu(x)\bar q(y|x)}{\mu(x)\bar r(y)}\,\frac{\mu(x)r_1(y)}{\mu(x)q_1(y|x)}\right)
+ (1-\lambda)\sum_{x,y}\mu(x)q_2(y|x)\log\left(\frac{\mu(x)q_2(y|x)}{\mu(x)r_2(y)}\,\frac{\mu(x)\bar q(y|x)}{\mu(x)\bar r(y)}\,\frac{\mu(x)r_2(y)}{\mu(x)q_2(y|x)}\right)$$
$$\le \lambda I_{\mu q_1} + \lambda\sum_{x,y}\mu q_1(x,y)\left(\frac{\mu(x)\bar q(y|x)}{\mu(x)\bar r(y)}\,\frac{\mu(x)r_1(y)}{\mu(x)q_1(y|x)} - 1\right)
+ (1-\lambda)I_{\mu q_2} + (1-\lambda)\sum_{x,y}\mu q_2(x,y)\left(\frac{\mu(x)\bar q(y|x)}{\mu(x)\bar r(y)}\,\frac{\mu(x)r_2(y)}{\mu(x)q_2(y|x)} - 1\right)$$
$$= \lambda I_{\mu q_1} + (1-\lambda)I_{\mu q_2} + \lambda\left(-1 + \sum_{x,y}\mu(x)\bar q(y|x)\frac{r_1(y)}{\bar r(y)}\right) + (1-\lambda)\left(-1 + \sum_{x,y}\mu(x)\bar q(y|x)\frac{r_2(y)}{\bar r(y)}\right) = \lambda I_{\mu q_1} + (1-\lambda)I_{\mu q_2},$$
where the final equality uses $\sum_x\mu(x)\bar q(y|x) = \bar r(y)$, so that each of the two correction terms is $-1 + \sum_y r_i(y) = 0$.
Similarly, let $\bar\mu = \lambda\mu_1 + (1-\lambda)\mu_2$ and let $r_1$, $r_2$, and $\bar r$ denote the induced output pmf's. Then
$$I_{\bar\mu q} = \lambda\sum_{x,y}\mu_1(x)q(y|x)\log\left(\frac{q(y|x)}{r_1(y)}\,\frac{r_1(y)}{\bar r(y)}\right) + (1-\lambda)\sum_{x,y}\mu_2(x)q(y|x)\log\left(\frac{q(y|x)}{r_2(y)}\,\frac{r_2(y)}{\bar r(y)}\right)$$
$$= \lambda I_{\mu_1 q} + (1-\lambda)I_{\mu_2 q} - \lambda\sum_{x,y}\mu_1(x)q(y|x)\log\frac{\bar r(y)}{r_1(y)} - (1-\lambda)\sum_{x,y}\mu_2(x)q(y|x)\log\frac{\bar r(y)}{r_2(y)} \ge \lambda I_{\mu_1 q} + (1-\lambda)I_{\mu_2 q}$$
from another application of (2.5). $\Box$
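Both statements are easy to check numerically. A minimal Python sketch (the alphabet sizes and the randomly drawn pmf's are arbitrary choices, not from the text) evaluates $I_{\mu q}$ directly from its definition and confirms the two inequalities for several values of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(mu, q):
    """I(X;Y) for an input pmf mu (shape [A]) and a channel q (shape [A, B], row x = q(.|x))."""
    joint = mu[:, None] * q                   # mu(x) q(y|x)
    r = joint.sum(axis=0)                     # output pmf r(y)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (mu[:, None] * r[None, :])[mask])).sum())

A, B = 3, 4
mu  = rng.dirichlet(np.ones(A))
mu1 = rng.dirichlet(np.ones(A))
mu2 = rng.dirichlet(np.ones(A))
q1  = rng.dirichlet(np.ones(B), size=A)
q2  = rng.dirichlet(np.ones(B), size=A)

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    qbar  = lam * q1 + (1 - lam) * q2
    mubar = lam * mu1 + (1 - lam) * mu2
    # convex (cup) in q:  I_{mu qbar} <= lam I_{mu q1} + (1 - lam) I_{mu q2}
    assert mutual_information(mu, qbar) <= \
        lam * mutual_information(mu, q1) + (1 - lam) * mutual_information(mu, q2) + 1e-12
    # concave (cap) in mu: I_{mubar q} >= lam I_{mu1 q} + (1 - lam) I_{mu2 q}
    assert mutual_information(mubar, q1) >= \
        lam * mutual_information(mu1, q1) + (1 - lam) * mutual_information(mu2, q1) - 1e-12
print("Lemma 2.5.4 inequalities hold for the sampled example")
```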
We consider one other notion of information: Given three finite alphabet random variables $X$, $Y$, $Z$, define the conditional mutual information between $X$ and $Y$ given $Z$ by
$$I(X;Y|Z) = D(P_{XYZ}\,\|\,P_{X\times Y|Z}) \qquad (2.25)$$
where $P_{X\times Y|Z}$ is the distribution defined by its values on rectangles as
$$P_{X\times Y|Z}(F\times G\times D) = \sum_{z\in D} P(X\in F|Z=z)\,P(Y\in G|Z=z)\,P(Z=z). \qquad (2.26)$$
$P_{X\times Y|Z}$ has the same conditional distributions for $X$ given $Z$ and for $Y$ given $Z$ as does $P_{XYZ}$, but now $X$ and $Y$ are conditionally independent given $Z$. Alternatively, the conditional distribution for $X,Y$ given $Z$ under the distribution $P_{X\times Y|Z}$ is the product distribution $P_{X|Z}\times P_{Y|Z}$. Thus
$$I(X;Y|Z) = \sum_{x,y,z} p_{XYZ}(x,y,z)\ln\frac{p_{XYZ}(x,y,z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)\,p_Z(z)}
= \sum_{x,y,z} p_{XYZ}(x,y,z)\ln\frac{p_{XY|Z}(x,y|z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)}. \qquad (2.27)$$
Since
$$\frac{p_{XYZ}}{p_{X|Z}\,p_{Y|Z}\,p_Z} = \frac{p_{XYZ}}{p_X\,p_{YZ}}\,\frac{p_X}{p_{X|Z}} = \frac{p_{XYZ}}{p_{XZ}\,p_Y}\,\frac{p_Y}{p_{Y|Z}},$$
we have the first statement in the following lemma.
Lemma 2.5.4:
$$I(X;Y|Z) + I(Y;Z) = I(Y;(X,Z)), \qquad (2.28)$$
$$I(X;Y|Z) \ge 0, \qquad (2.29)$$
with equality if and only if $X$ and $Y$ are conditionally independent given $Z$, that is, $p_{XY|Z} = p_{X|Z}\,p_{Y|Z}$. Given finite valued measurements $f$ and $g$,
$$I(f(X);g(Y)|Z) \le I(X;Y|Z).$$
Proof: The second inequality follows from the divergence inequality (2.6) with $P = P_{XYZ}$ and $M = P_{X\times Y|Z}$, i.e., the pmf's $p_{XYZ}$ and $p_{X|Z}\,p_{Y|Z}\,p_Z$. The third inequality follows from Lemma 2.3.3 or its corollary applied to the same measures. $\Box$
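The definition is straightforward to exercise numerically. A minimal sketch (the alphabet sizes and the random joint pmf are arbitrary choices) builds $P_{X\times Y|Z}$ as in (2.26), computes $I(X;Y|Z)$ as the divergence in (2.25), and checks (2.28) and (2.29):

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = 3, 3, 2                                    # alphabet sizes for X, Y, Z
p = rng.dirichlet(np.ones(A * B * C)).reshape(A, B, C)   # joint pmf p_XYZ(x, y, z)

def D(p_, m_):
    """Divergence D(p||m) = sum p ln(p/m) over a common finite alphabet."""
    mask = p_ > 0
    return float((p_[mask] * np.log(p_[mask] / m_[mask])).sum())

pZ  = p.sum(axis=(0, 1))                             # p_Z(z)
pXZ = p.sum(axis=1)                                  # p_XZ(x, z)
pYZ = p.sum(axis=0)                                  # p_YZ(y, z)
# (2.26): P_{X x Y|Z}(x, y, z) = p_{X|Z}(x|z) p_{Y|Z}(y|z) p_Z(z)
pXxY_Z = pXZ[:, None, :] * pYZ[None, :, :] / pZ[None, None, :]

I_XY_given_Z = D(p, pXxY_Z)                          # (2.25)

def mi(joint):
    """Ordinary mutual information between the two coordinates of a 2-D joint pmf."""
    return D(joint, np.outer(joint.sum(axis=1), joint.sum(axis=0)))

I_YZ   = mi(pYZ)                                     # I(Y;Z)
I_Y_XZ = mi(p.transpose(1, 0, 2).reshape(B, A * C))  # I(Y;(X,Z)), pairing X with Z

assert I_XY_given_Z >= 0.0                           # (2.29)
assert abs(I_XY_given_Z + I_YZ - I_Y_XZ) < 1e-10     # Kolmogorov's formula (2.28)
print(I_XY_given_Z, I_YZ, I_Y_XZ)
```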
Comments: Eq. (2.28) is called Kolmogorov’s formula. If X and Y are conditionally independent given Z in the above sense, then we also have that

$p_{X|YZ} = p_{XY|Z}/p_{Y|Z} = p_{X|Z}$, in which case we say that $Y \to Z \to X$ is a Markov chain and note that given $Z$, $X$ does not depend on $Y$. (Note that if $Y \to Z \to X$ is a Markov chain, then so is $X \to Z \to Y$.) Thus the conditional mutual information is 0 if and only if the variables form a Markov chain with the conditioning variable in the middle. One might be tempted to infer from Lemma 2.3.3 that given finite valued measurements $f$, $g$, and $r$
$$I(f(X);g(Y)|r(Z)) \le I(X;Y|Z). \qquad (?)$$
This does not follow, however, since it is not true that if $Q$ is the partition corresponding to the three quantizers, then $D(P_{f(X),g(Y),r(Z)}\,\|\,P_{f(X)\times g(Y)|r(Z)})$ is $H_{P_{X,Y,Z}\|P_{X\times Y|Z}}(f(X),g(Y),r(Z))$, because of the way that $P_{X\times Y|Z}$ is constructed; e.g., the fact that $X$ and $Y$ are conditionally independent given $Z$ implies that $f(X)$ and $g(Y)$ are conditionally independent given $Z$, but it does not imply that $f(X)$ and $g(Y)$ are conditionally independent given $r(Z)$. Alternatively, if $M$ is $P_{X\times Y|Z}$, then it is not true that $P_{f(X)\times g(Y)|r(Z)}$ equals $M(fgr)^{-1}$. Note that if this inequality were true, choosing $r(z)$ to be trivial (say 1 for all $z$) would result in $I(X;Y|Z) \ge I(X;Y|r(Z)) = I(X;Y)$. This cannot be true in general since, for example, choosing $Z$ as $(X,Y)$ would give $I(X;Y|Z) = 0$. Thus one must be careful when applying Lemma 2.3.3 if the measures and random variables are related as they are in the case of conditional mutual information.
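The last point is easy to see concretely: with $Z = (X,Y)$ and the trivial quantizer $r(z) = 1$, the questionable inequality would force $I(X;Y) \le I(X;Y|Z) = 0$ for a dependent pair. A minimal sketch (the dependent joint pmf is an arbitrary choice):

```python
import numpy as np

# A dependent pair (X, Y) on {0,1}^2; the joint pmf is an arbitrary choice.
pXY = np.array([[0.4, 0.1],
                [0.1, 0.4]])

# Take Z = (X, Y), coded as z = 2x + y, so p_XYZ(x, y, z) = pXY(x, y) * 1{z = 2x + y}.
p = np.zeros((2, 2, 4))
for x in range(2):
    for y in range(2):
        p[x, y, 2 * x + y] = pXY[x, y]

def D(p_, m_):
    mask = p_ > 0
    return float((p_[mask] * np.log(p_[mask] / m_[mask])).sum())

pZ, pXZ, pYZ = p.sum(axis=(0, 1)), p.sum(axis=1), p.sum(axis=0)
pXxY_Z = pXZ[:, None, :] * pYZ[None, :, :] / pZ      # product distribution as in (2.26)

I_XY_given_Z = D(p, pXxY_Z)        # essentially 0: X and Y are deterministic given Z
I_XY = D(pXY, np.outer(pXY.sum(axis=1), pXY.sum(axis=0)))   # I(X;Y) > 0

# The trivial quantizer r(z) = 1 removes the conditioning entirely, so
# I(X;Y|r(Z)) = I(X;Y) > 0 = I(X;Y|Z): the inequality marked (?) fails here
# with f and g taken to be identity maps.
print(I_XY_given_Z, I_XY)
```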
We close this section with an easy corollary of the previous lemma and of the definition of conditional entropy. Results of this type are referred to as chain rules for information and entropy.

Corollary 2.5.1: Given finite alphabet random variables $Y$, $X_1$, $X_2$, $\cdots$, $X_n$,
$$H(X_1,X_2,\cdots,X_n) = \sum_{i=1}^n H(X_i|X_1,\cdots,X_{i-1})$$
$$H_{p\|m}(X_1,X_2,\cdots,X_n) = \sum_{i=1}^n H_{p\|m}(X_i|X_1,\cdots,X_{i-1})$$
$$I(Y;(X_1,X_2,\cdots,X_n)) = \sum_{i=1}^n I(Y;X_i|X_1,\cdots,X_{i-1}).$$
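A minimal numerical check of the entropy chain rule, using the identity $H(U|V) = H(U,V) - H(V)$ for conditional entropy (the random three-variable pmf below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
A = 2
p = rng.dirichlet(np.ones(A ** 3)).reshape(A, A, A)   # joint pmf of (X1, X2, X3)

def H(pmf):
    """Entropy (nats) of a pmf given as an array summing to 1."""
    q = pmf[pmf > 0]
    return float(-(q * np.log(q)).sum())

# Left side of the chain rule.
H_joint = H(p)

# Right side: H(X1) + H(X2|X1) + H(X3|X1,X2), using H(U|V) = H(U,V) - H(V).
H1    = H(p.sum(axis=(1, 2)))
H2_1  = H(p.sum(axis=2)) - H1
H3_12 = H(p) - H(p.sum(axis=2))

assert abs(H_joint - (H1 + H2_1 + H3_12)) < 1e-12
print(H_joint, H1 + H2_1 + H3_12)
```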
2.6 Entropy Rate Revisited
The chain rule of Corollary 2.5.1 provides a means of computing entropy rates for stationary processes. We have that
$$\frac{1}{n}H(X^n) = \frac{1}{n}\sum_{i=0}^{n-1} H(X_i|X^i).$$

First suppose that the source is a stationary $k$th order Markov process, that is, for any $n > k$
$$\Pr(X_n = x_n|X_i = x_i,\ i = 0,1,\cdots,n-1) = \Pr(X_n = x_n|X_i = x_i,\ i = n-k,\cdots,n-1).$$
For such a process we have for all $n \ge k$ that
$$H(X_n|X^n) = H(X_n|X_{n-k}^k) = H(X_k|X^k),$$
where $X_i^m = (X_i,\cdots,X_{i+m-1})$. Thus taking the limit as $n \to \infty$ of the $n$th order entropy, all but a finite number of terms in the sum are identical, and hence the Cesàro (or arithmetic) mean converges to this conditional entropy. We have therefore proved the following lemma.
Lemma 2.6.1: If $\{X_n\}$ is a stationary $k$th order Markov source, then
$$\bar H(X) = H(X_k|X^k).$$
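For a first order ($k=1$) chain the lemma gives $\bar H(X) = H(X_1|X_0) = -\sum_x \pi(x)\sum_y P(x,y)\ln P(x,y)$, where $P$ is the transition matrix and $\pi$ its stationary pmf. A minimal sketch (the transition matrix is an arbitrary choice, not from the text) computes this rate and compares it with $n^{-1}H(X^n)$ for a moderate $n$:

```python
import numpy as np

# First order (k = 1) example: a stationary Markov chain with an arbitrary
# transition matrix P[x, y] = Pr(X_{n+1} = y | X_n = x).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary pmf: the left eigenvector of P for eigenvalue 1, normalized.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

# Lemma 2.6.1 with k = 1: Hbar(X) = H(X_1|X_0) = -sum_x pi(x) sum_y P(x,y) ln P(x,y)
entropy_rate = float(-(pi[:, None] * P * np.log(P)).sum())

# Compare with (1/n) H(X^n) for a moderate n by building the n-fold joint pmf.
n = 8
joint = pi.copy()
for _ in range(n - 1):
    joint = (joint[..., None] * P).reshape(-1, 2)   # append one more coordinate
joint = joint.ravel()
H_n = float(-(joint * np.log(joint)).sum())
print(entropy_rate, H_n / n)    # (1/n) H(X^n) approaches the rate from above
```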
If we have a two-sided stationary process $\{X_n\}$, then all of the previous definitions for entropies of vectors extend in an obvious fashion and a generalization of the Markov result follows if we use stationarity and the chain rule to write
$$\frac{1}{n}H(X^n) = \frac{1}{n}\sum_{i=0}^{n-1} H(X_0|X_{-1},\cdots,X_{-i}).$$
Since conditional entropy is nonincreasing with more conditioning variables ((2.22) or Lemma 2.5.2), $H(X_0|X_{-1},\cdots,X_{-i})$ has a limit. Again using the fact that a Cesàro mean of terms all converging to a common limit also converges to the same limit, we have the following result.
Lemma 2.6.2: If $\{X_n\}$ is a two-sided stationary source, then
$$\bar H(X) = \lim_{n\to\infty} H(X_0|X_{-1},\cdots,X_{-n}).$$
It is tempting to identify the above limit as the conditional entropy given the infinite past, $H(X_0|X_{-1},\cdots)$. Since the conditioning variable is a sequence and does not have a finite alphabet, such a conditional entropy is not included in any of the definitions yet introduced. We shall later demonstrate that this interpretation is indeed valid when the notion of conditional entropy has been suitably generalized.
The natural generalization of Lemma 2.6.2 to relative entropy rates unfortunately does not work because conditional relative entropies are not in general monotonic with increased conditioning, and hence the chain rule does not immediately yield a limiting argument analogous to that for entropy. The argument does work if the reference measure is $k$th order Markov, as considered in the following lemma.
Lemma 2.6.3: If $\{X_n\}$ is a source described by process distributions $p$ and $m$, if $p$ is stationary, and if $m$ is $k$th order Markov with stationary transitions, then for $n \ge k$, $H_{p\|m}(X_0|X_{-1},\cdots,X_{-n})$ is nondecreasing in $n$ and
$$\bar H_{p\|m}(X) = \lim_{n\to\infty} H_{p\|m}(X_0|X_{-1},\cdots,X_{-n}) = -\bar H_p(X) - E_p[\ln m(X_k|X^k)].$$
Proof: For $n \ge k$ we have that
$$H_{p\|m}(X_0|X_{-1},\cdots,X_{-n}) = -H_p(X_0|X_{-1},\cdots,X_{-n}) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1})\ln m_{X_k|X^k}(x_k|x^k).$$
Since the conditional entropy is nonincreasing with $n$ and the remaining term does not depend on $n$, the combination is nondecreasing with $n$. The remainder of the proof then parallels the entropy rate result. $\Box$
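A minimal worked instance of the lemma with $k = 1$ (the i.i.d. source $p$ and the Markov reference $m$ below are arbitrary choices): the rate from the formula is compared against a brute-force evaluation of $n^{-1}D(P_n\|M_n)$, which differs from it only by an $O(1/n)$ boundary term.

```python
import numpy as np
from itertools import product

# p: i.i.d. equiprobable binary (stationary); m: stationary first order Markov.
p1 = np.array([0.5, 0.5])                      # marginal pmf of p
M = np.array([[0.8, 0.2],                      # m's transition matrix m(x1 | x0)
              [0.4, 0.6]])
w, v = np.linalg.eig(M.T)
m1 = np.real(v[:, np.argmin(np.abs(w - 1))])
m1 = m1 / m1.sum()                             # m's stationary marginal

Hbar_p = float(-(p1 * np.log(p1)).sum())       # entropy rate of the i.i.d. source p
E_ln_m_cond = float((np.outer(p1, p1) * np.log(M)).sum())   # E_p[ln m(X_1|X_0)]
rel_rate = -Hbar_p - E_ln_m_cond               # Lemma 2.6.3 with k = 1

# Brute-force (1/n) D(p_{X^n} || m_{X^n}) over all 2^n length-n sequences.
n = 10
Dn = 0.0
for x in product(range(2), repeat=n):
    px = float(np.prod(p1[list(x)]))
    mx = float(m1[x[0]] * np.prod([M[x[i - 1], x[i]] for i in range(1, n)]))
    Dn += px * np.log(px / mx)

print(rel_rate, Dn / n)    # agree up to an O(1/n) boundary term
```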
It is important to note that the relative entropy analogs to entropy properties often require kth order Markov assumptions on the reference measure (but not on the original measure).
Markov Approximations
Recall that the relative entropy rate $\bar H_{p\|m}(X)$ can be thought of as a distance between the process with distribution $p$ and that with distribution $m$, and that the rate is given by a limit if the reference measure $m$ is Markov. A particular Markov measure relevant to $p$ is the distribution $p^{(k)}$, which is the $k$th order Markov approximation to $p$ in the sense that it is a $k$th order Markov source and it has the same $k$th order transition probabilities as $p$. To be more precise, the process distribution $p^{(k)}$ is specified by its finite dimensional distributions
$$p^{(k)}_{X^k}(x^k) = p_{X^k}(x^k)$$
$$p^{(k)}_{X^n}(x^n) = p_{X^k}(x^k)\prod_{l=k}^{n-1} p_{X_l|X_{l-k}^k}(x_l|x_{l-k}^k),\qquad n = k, k+1,\cdots$$
so that
$$p^{(k)}_{X_k|X^k} = p_{X_k|X^k}.$$
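A minimal sketch of the construction (the second order binary source used as $p$ is an arbitrary choice): it builds the $n$-dimensional pmf's of $p$ and of its first order approximation $p^{(1)}$ from the product formula above, confirms that the two agree on the $(k+1)$-dimensional marginals, and shows that $D(p_{X^n}\|p^{(1)}_{X^n}) > 0$ because $p$ itself is not first order Markov.

```python
import numpy as np
from itertools import product

# A stationary binary source p that is second order Markov, so its first order
# approximation p^(1) differs from it.  The kernel
# T[2*a + b, c] = Pr(X_n = c | X_{n-2} = a, X_{n-1} = b) is an arbitrary choice.
T = np.array([[0.9, 0.1],   # past pair (0,0)
              [0.5, 0.5],   # (0,1)
              [0.2, 0.8],   # (1,0)
              [0.6, 0.4]])  # (1,1)
pair = lambda a, b: 2 * a + b

# Stationary pmf of the pair (X_{n-1}, X_n): invariant vector of the 4x4 pair chain.
Q = np.zeros((4, 4))
for a, b, c in product(range(2), repeat=3):
    Q[pair(a, b), pair(b, c)] = T[pair(a, b), c]
w, v = np.linalg.eig(Q.T)
pi2 = np.real(v[:, np.argmin(np.abs(w - 1))])
pi2 = (pi2 / pi2.sum()).reshape(2, 2)          # pi2[a, b] = p_{X^2}(a, b)

def p_n(xs):
    """n-dimensional pmf of the original second order source p."""
    prob = pi2[xs[0], xs[1]]
    for l in range(2, len(xs)):
        prob *= T[pair(xs[l - 2], xs[l - 1]), xs[l]]
    return prob

def p1_n(xs):
    """n-dimensional pmf of the k = 1 Markov approximation:
       p^(1)(x^n) = p_{X^1}(x_0) * prod_l p_{X_1|X_0}(x_l | x_{l-1})."""
    marg = pi2.sum(axis=1)
    cond = pi2 / marg[:, None]                 # p_{X_1|X_0}
    prob = marg[xs[0]]
    for l in range(1, len(xs)):
        prob *= cond[xs[l - 1], xs[l]]
    return prob

n = 5
# p^(1) shares p's (k+1) = 2 dimensional marginal (same transition probabilities) ...
for a, b in product(range(2), repeat=2):
    tails = list(product(range(2), repeat=n - 2))
    assert abs(sum(p_n((a, b) + t) for t in tails) -
               sum(p1_n((a, b) + t) for t in tails)) < 1e-12
# ... but the full n-dimensional distributions differ, so D(p_{X^n}||p^(1)_{X^n}) > 0.
D = sum(p_n(x) * np.log(p_n(x) / p1_n(x)) for x in product(range(2), repeat=n))
print(D)
```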
It is natural to ask how good this approximation is, especially in the limit, that is, to study the behavior of the relative entropy rate $\bar H_{p\|p^{(k)}}(X)$ as $k \to \infty$.
Theorem 2.6.2: Given a stationary process $p$, let $p^{(k)}$ denote the $k$th order Markov approximations to $p$. Then
$$\lim_{k\to\infty}\bar H_{p\|p^{(k)}}(X) = \inf_k \bar H_{p\|p^{(k)}}(X) = 0.$$
Thus the Markov approximations are asymptotically accurate in the sense that the relative entropy rate between the source and approximation can be made arbitrarily small (zero if the original source itself happens to be Markov).
Proof: As in the proof of Lemma 2.6.3 we can write for $n \ge k$ that
$$H_{p\|p^{(k)}}(X_0|X_{-1},\cdots,X_{-n}) = -H_p(X_0|X_{-1},\cdots,X_{-n}) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1})\ln p_{X_k|X^k}(x_k|x^k)
= H_p(X_0|X_{-1},\cdots,X_{-k}) - H_p(X_0|X_{-1},\cdots,X_{-n}).$$
Note that this implies that $p^{(k)}_{X^n} \gg p_{X^n}$ for all $n$ since the entropies are finite. This automatic domination of the finite dimensional distributions of a measure by those of its Markov approximation will not hold in the general case to be encountered later; it is specific to the finite alphabet case. Taking the limit as $n \to \infty$ gives
$$\bar H_{p\|p^{(k)}}(X) = \lim_{n\to\infty} H_{p\|p^{(k)}}(X_0|X_{-1},\cdots,X_{-n}) = H_p(X_0|X_{-1},\cdots,X_{-k}) - \bar H_p(X).$$
The theorem then follows immediately from Lemma 2.6.2. $\Box$
Markov approximations will play a fundamental role when considering relative entropies for general (nonfinite alphabet) processes. The basic result above will generalize to that case, but the proof will be much more involved.
2.7 Relative Entropy Densities
Many of the convergence results to come will be given and stated in terms of relative entropy densities. In this section we present a simple but important result describing the asymptotic behavior of relative entropy densities. Although the result of this section is only for finite alphabet processes, it is stated and proved in a manner that will extend naturally to more general processes later on. The result will play a fundamental role in the basic ergodic theorems to come.
Throughout this section we will assume that $M$ and $P$ are two process distributions describing a random process $\{X_n\}$. Denote as before the sample vector $X^n = (X_0, X_1,\cdots,X_{n-1})$, that is, the vector beginning at time 0 having length $n$. The distributions of $X^n$ induced by $M$ and $P$ will be denoted by $M_n$ and $P_n$, respectively. The corresponding pmf's are $m_{X^n}$ and $p_{X^n}$. The key assumption in this section is that for all $n$, if $m_{X^n}(x^n) = 0$, then also $p_{X^n}(x^n) = 0$, that is,
$$M_n \gg P_n \ \text{for all } n. \qquad (2.30)$$

If this is the case, we can define the relative entropy density
$$h_n(x) \equiv \ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \ln f_n(x), \qquad (2.31)$$
where
$$f_n(x) \equiv \begin{cases}\dfrac{p_{X^n}(x^n)}{m_{X^n}(x^n)} & \text{if } m_{X^n}(x^n) \ne 0\\ 0 & \text{otherwise.}\end{cases} \qquad (2.32)$$
Observe that the relative entropy is found by integrating the relative entropy density:
$$H_{P\|M}(X^n) = D(P_n\|M_n) = \sum_{x^n} p_{X^n}(x^n)\ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \int \ln\frac{p_{X^n}(X^n)}{m_{X^n}(X^n)}\,dP. \qquad (2.33)$$
Thus, for example, if we assume that
$$H_{P\|M}(X^n) < \infty,\quad \text{all } n, \qquad (2.34)$$
then (2.30) holds.
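For i.i.d. measures all of the above can be written out explicitly. A minimal sketch (the two full-support binary marginals are arbitrary choices, so (2.30) holds for every $n$) tabulates $h_n$ and checks (2.33) against the product-measure identity $D(P_n\|M_n) = nD(p_{X_0}\|m_{X_0})$:

```python
import numpy as np
from itertools import product

# Two i.i.d. process distributions with full-support binary marginals (arbitrary
# choices), so that m_{X^n}(x^n) > 0 everywhere and (2.30) holds for every n.
p1 = np.array([0.7, 0.3])
m1 = np.array([0.5, 0.5])

def pmf_n(marg, xs):
    """n-dimensional pmf of an i.i.d. process with the given marginal."""
    return float(np.prod(marg[list(xs)]))

n = 6
# Relative entropy density (2.31): h_n(x) = ln p_{X^n}(x^n) - ln m_{X^n}(x^n).
h = {xs: np.log(pmf_n(p1, xs)) - np.log(pmf_n(m1, xs))
     for xs in product(range(2), repeat=n)}

# (2.33): integrating h_n against P gives H_{P||M}(X^n) = D(P_n||M_n).
D_n = sum(pmf_n(p1, xs) * h[xs] for xs in h)
# For product measures D(P_n||M_n) = n * D(p1||m1).
D_1 = float((p1 * np.log(p1 / m1)).sum())
print(D_n, n * D_1)
```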
The following lemma will prove to be useful when comparing the asymptotic behavior of relative entropy densities for different probability measures. It is the first almost everywhere result for relative entropy densities that we consider. It is somewhat narrow in the sense that it only compares limiting densities to zero and not to expectations. We shall later see that essentially the same argument implies the same result for the general case (Theorem 5.4.1); only the interim steps involving pmf's need be dropped. Note that the lemma requires neither stationarity nor asymptotic mean stationarity.
Lemma 2.7.1: Given a finite alphabet process $\{X_n\}$ with process measures $P$, $M$ satisfying (2.30). Then
$$\limsup_{n\to\infty}\frac{1}{n}h_n \le 0,\quad M\text{-a.e.} \qquad (2.35)$$
and
$$\liminf_{n\to\infty}\frac{1}{n}h_n \ge 0,\quad P\text{-a.e.} \qquad (2.36)$$
If in addition $M \gg P$, then
$$\lim_{n\to\infty}\frac{1}{n}h_n = 0,\quad P\text{-a.e.} \qquad (2.37)$$
Proof: First consider the probability
$$M\Bigl(\frac{1}{n}h_n \ge \epsilon\Bigr) = M(f_n \ge e^{n\epsilon}) \le \frac{E_M(f_n)}{e^{n\epsilon}},$$
where the final inequality is Markov's inequality. But
$$E_M(f_n) = \int f_n\,dM = \sum_{x^n:\,m_{X^n}(x^n)\ne 0} m_{X^n}(x^n)\,\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \sum_{x^n:\,m_{X^n}(x^n)\ne 0} p_{X^n}(x^n) \le 1$$
and therefore
$$M\Bigl(\frac{1}{n}h_n \ge \epsilon\Bigr) \le e^{-n\epsilon}$$
and hence
$$\sum_{n=1}^{\infty} M\Bigl(\frac{1}{n}h_n > \epsilon\Bigr) \le \sum_{n=1}^{\infty} e^{-n\epsilon} < \infty.$$
From the Borel-Cantelli Lemma (e.g., Lemma 4.6.3 of [50]) this implies that $M(n^{-1}h_n \ge \epsilon\ \text{i.o.}) = 0$, which implies the first equation of the lemma.
Next consider
$$P\Bigl(-\frac{1}{n}h_n > \epsilon\Bigr) = \sum_{x^n:\,-\frac{1}{n}\ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} > \epsilon\ \text{and}\ p_{X^n}(x^n)\ne 0} p_{X^n}(x^n)
= \sum_{x^n:\,-\frac{1}{n}\ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} > \epsilon\ \text{and}\ m_{X^n}(x^n)\ne 0} p_{X^n}(x^n),$$
where the last statement follows since if $m_{X^n}(x^n) = 0$, then also $p_{X^n}(x^n) = 0$ and hence nothing would be contributed to the sum. In other words, terms violating this condition add zero to the sum and hence adding this condition to the sum does not change the sum's value. Thus
$$P\Bigl(-\frac{1}{n}h_n > \epsilon\Bigr) = \sum_{x^n:\,-\frac{1}{n}\ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} > \epsilon\ \text{and}\ m_{X^n}(x^n)\ne 0} m_{X^n}(x^n)\,\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)}
= \int_{f_n < e^{-n\epsilon}} f_n\,dM \le \int_{f_n < e^{-n\epsilon}} e^{-n\epsilon}\,dM = e^{-n\epsilon}\,M(f_n < e^{-n\epsilon}) \le e^{-n\epsilon}.$$
Thus as before we have that $P(-n^{-1}h_n > \epsilon) \le e^{-n\epsilon}$ and hence that $P(n^{-1}h_n \le -\epsilon\ \text{i.o.}) = 0$, which proves the second claim. If also $M \gg P$, then the first equation of the lemma is also true $P$-a.e., which when coupled with the second equation proves the third. $\Box$
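A simulation consistent with the lemma (the marginals, sample size, and random seed are arbitrary choices): for two i.i.d. process measures with distinct full-support binary marginals, (2.30) holds for every $n$ although $M \gg P$ fails on the sequence space; under $P$ the density $n^{-1}h_n$ settles near $D(p_{X_0}\|m_{X_0}) \ge 0$, consistent with (2.36), while under $M$ it settles near $-D(m_{X_0}\|p_{X_0}) \le 0$, consistent with (2.35).

```python
import numpy as np

rng = np.random.default_rng(3)
p1 = np.array([0.7, 0.3])     # marginal of the i.i.d. measure P (arbitrary choice)
m1 = np.array([0.5, 0.5])     # marginal of the i.i.d. measure M (arbitrary choice)
n = 5000

def h_over_n(x):
    """n^{-1} h_n(x) = n^{-1} ln[ p_{X^n}(x^n) / m_{X^n}(x^n) ] for i.i.d. P and M."""
    return float(np.mean(np.log(p1[x] / m1[x])))

x_from_P = rng.choice(2, size=n, p=p1)    # a sample path drawn under P
x_from_M = rng.choice(2, size=n, p=m1)    # a sample path drawn under M

# Under P the density tends to D(p1||m1) >= 0, consistent with (2.36);
# under M it tends to -D(m1||p1) <= 0, consistent with (2.35).
print(h_over_n(x_from_P), float((p1 * np.log(p1 / m1)).sum()))
print(h_over_n(x_from_M), float(-(m1 * np.log(m1 / p1)).sum()))
```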
Chapter 3

The Entropy Ergodic Theorem

3.1 Introduction
The goal of this chapter is to prove an ergodic theorem for sample entropy of finite alphabet random processes. The result is sometimes called the ergodic theorem of information theory or the asymptotic equipartition theorem, but it is best known as the Shannon-McMillan-Breiman theorem. It provides a common foundation to many of the results of both ergodic theory and information theory. Shannon [129] first developed the result for convergence in probability for stationary ergodic Markov sources. McMillan [103] proved $L^1$ convergence for stationary ergodic sources and Breiman [19] [20] proved almost everywhere convergence for stationary and ergodic sources. Billingsley [15] extended the result to stationary nonergodic sources. Jacobs [67] [66] extended it to processes dominated by a stationary measure and hence to two-sided AMS processes. Gray and Kieffer [54] extended it to processes asymptotically dominated by a stationary measure and hence to all AMS processes. The generalizations to AMS processes build on the Billingsley theorem for the stationary mean. Following generalizations of the definitions of entropy and information, corresponding generalizations of the entropy ergodic theorem will be considered in Chapter 8.
Breiman's and Billingsley's approach requires the martingale convergence theorem and embeds the possibly one-sided stationary process into a two-sided process. Ornstein and Weiss [117] recently developed a proof for the stationary and ergodic case that does not require any martingale theory and considers only positive time and hence does not require any embedding into two-sided processes. The technique was described for both the ordinary ergodic theorem and the entropy ergodic theorem by Shields [132]. In addition, it uses a form of coding argument that is both more direct and more information theoretic in flavor than the traditional martingale proofs. We here follow the Ornstein and Weiss approach for the stationary ergodic result. We also use some modifications
similar to those of Katznelson and Weiss for the proof of the ergodic theorem. We then generalize the result first to nonergodic processes using the "sandwich" technique of Algoet and Cover [7] and then to AMS processes using a variation on a result of [54].
We next state the theorem to serve as a guide through the various steps. We also prove the result for the simple special case of a Markov source, for which the result follows from the usual ergodic theorem.
We consider a directly given finite alphabet source $\{X_n\}$ described by a distribution $m$ on the sequence measurable space $(\Omega,\mathcal B)$. Define as previously $X_k^n = (X_k, X_{k+1},\cdots,X_{k+n-1})$. The subscript is omitted when it is zero. For any random variable $Y$ defined on the sequence space (such as $X_k^n$) we define the random variable $m(Y)$ by $m(Y)(x) = m(Y = Y(x))$.
Theorem 3.1.1: The Entropy Ergodic Theorem
Given a finite alphabet AMS source $\{X_n\}$ with process distribution $m$ and stationary mean $\bar m$, let $\{\bar m_x;\ x\in\Omega\}$ be the ergodic decomposition of the stationary mean $\bar m$. Then
$$\lim_{n\to\infty}\frac{-\ln m(X^n)}{n} = h,\quad m\text{-a.e. and in } L^1(m), \qquad (3.1)$$
where $h(x)$ is the invariant function defined by
$$h(x) = \bar H_{\bar m_x}(X). \qquad (3.2)$$
Furthermore,
$$E_m h = \lim_{n\to\infty}\frac{1}{n}H_m(X^n) = \bar H_m(X), \qquad (3.3)$$
that is, the entropy rate of an AMS process is given by the limit, and
$$\bar H_{\bar m}(X) = \bar H_m(X). \qquad (3.4)$$
Comments: The theorem states that the sample entropy using the AMS measure $m$ converges to the entropy rate of the underlying ergodic component of the stationary mean. Thus, for example, if $m$ is itself stationary and ergodic, then the sample entropy converges to the entropy rate of the process $m$-a.e. and in $L^1(m)$. The $L^1(m)$ convergence follows immediately from the almost everywhere convergence and the fact that sample entropy is uniformly integrable (Lemma 2.3.6). $L^1$ convergence in turn immediately implies the left-hand equality of (3.3). Since the limit exists, it is the entropy rate. The final equality states that the entropy rates of an AMS process and its stationary mean are the same. This result follows from (3.2)-(3.3) by the following argument:
We have that $\bar H_m(X) = E_m h$ and $\bar H_{\bar m}(X) = E_{\bar m} h$, but $h$ is invariant and hence the two expectations are equal (see, e.g., Lemma 6.3.1 of [50]). Thus we need only prove almost everywhere convergence in (3.1) to prove the theorem.
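A quick numerical illustration of (3.1) in the simplest setting covered by the theorem (the marginal pmf, sample size, and seed are arbitrary choices): for an i.i.d., hence stationary and ergodic, source the sample entropy $-n^{-1}\ln m(X^n)$ should settle at the entropy rate $\bar H_m(X) = H(X_0)$.

```python
import numpy as np

rng = np.random.default_rng(4)
marg = np.array([0.5, 0.3, 0.2])                 # an arbitrary finite-alphabet marginal
entropy_rate = float(-(marg * np.log(marg)).sum())   # Hbar_m(X) = H(X_0) for i.i.d. m

x = rng.choice(len(marg), size=200000, p=marg)   # one sample path
for n in (100, 1000, 10000, 200000):
    sample_entropy = float(-np.mean(np.log(marg[x[:n]])))   # -(1/n) ln m(X^n)
    print(n, sample_entropy, entropy_rate)
```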
In this section we limit ourselves to the following special case of the theorem that can be proved using the ordinary ergodic theorem without any new techniques.