Gray R. M., Entropy and Information Theory, 1990.

7.4 Stationary Processes

Proof: From the chain rule for conditional relative entropy (equation (7.7)),
$$H_{p\|m}(X^N \mid X^-) = \sum_{l=0}^{N-1} H_{p\|m}(X_l \mid X^l, X^-).$$
Stationarity implies that each term in the sum equals $H_{p\|m}(X_0 \mid X^-)$, proving the corollary. $\Box$

The next corollary extends Corollary 7.3.1 to processes.

Corollary 7.4.2: Given k and $n \ge k$, let $\mathcal{M}_k$ denote the class of all k-step stationary Markov process distributions. Then
$$\inf_{m\in\mathcal{M}_k} \bar{H}_{p\|m}(X) = \bar{H}_{p\|p^{(k)}}(X) = I_p(X_k; X^- \mid X^k).$$

Proof: Follows from (7.23) and Theorem 7.3.1. $\Box$

This result gives an interpretation of the finite-gap information property (6.13): If a process has this property, then there exists a k-step Markov process which is only a finite "distance" from the given process in terms of limiting per-symbol divergence. If any such process has a finite distance, then the k-step Markov approximation also has a finite distance. Furthermore, we can apply Corollary 6.4.1 to obtain the generalization of the finite alphabet result of Theorem 2.6.2.
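In the simplest concrete setting the rate $H_{p\|m}(X_0\mid X^-)$ appearing in Corollary 7.4.1, and over which the infimum of Corollary 7.4.2 ranges, is a finite sum. The sketch below is not part of the original text: it is an illustrative Python/NumPy computation with made-up binary transition matrices, for the case where p itself is a stationary first-order Markov chain and m is first-order Markov, so that the conditional relative entropy reduces to an average of single-step divergences over the stationary distribution of p.

```python
import numpy as np

def stationary_dist(P):
    """Stationary distribution of an irreducible row-stochastic matrix P."""
    evals, evecs = np.linalg.eig(P.T)
    v = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return v / v.sum()

def relative_entropy_rate(P, M):
    """Relative entropy rate (nats/symbol) of a stationary first-order Markov p
    (transition matrix P) with respect to a first-order Markov m (matrix M),
    computed as the conditional relative entropy H_{p||m}(X_0 | X^-):
        sum_a pi(a) sum_b P(a,b) ln( P(a,b) / M(a,b) ),
    assuming M(a,b) > 0 wherever P(a,b) > 0, i.e. the required domination holds."""
    pi = stationary_dist(P)
    rate = 0.0
    for a in range(P.shape[0]):
        for b in range(P.shape[1]):
            if P[a, b] > 0:
                rate += pi[a] * P[a, b] * np.log(P[a, b] / M[a, b])
    return rate

# made-up binary example: p is a sticky chain, m is the i.i.d. fair-coin measure
P = np.array([[0.9, 0.1], [0.2, 0.8]])
M = np.array([[0.5, 0.5], [0.5, 0.5]])
print(relative_entropy_rate(P, M))   # roughly 0.31 nats per symbol
```

Since this m is a (degenerate) first-order Markov measure with stationary transitions, rates of this kind are exactly what the infimum in Corollary 7.4.2 is taken over when k = 1.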

Corollary 7.4.3: Given a stationary process distribution p which satisfies the finite-gap information property,
$$\inf_k \inf_{m\in\mathcal{M}_k} \bar{H}_{p\|m}(X) = \inf_k \bar{H}_{p\|p^{(k)}}(X) = \lim_{k\to\infty} \bar{H}_{p\|p^{(k)}}(X) = 0.$$

Lemma 7.4.1 also yields the following approximation lemma.

Corollary 7.4.4: Given a process $\{X_n\}$ with standard alphabet A, let p and m be stationary measures such that $P_{X^n} \ll M_{X^n}$ for all n and m is kth order Markov. Let $q_k$ be an asymptotically accurate sequence of quantizers for A. Then
$$\bar{H}_{p\|m}(X) = \lim_{k\to\infty} \bar{H}_{p\|m}(q_k(X)),$$
that is, the divergence rate can be approximated arbitrarily closely by that of a quantized version of the process. Thus, in particular,
$$\bar{H}_{p\|m}(X) = H_{p\|m}(X).$$


Proof: This follows from Corollary 5.2.3 by letting the generating $\sigma$-fields be $\mathcal{F}_n = \sigma(q_n(X_i);\ i = 0, -1, \cdots)$ and the representation of conditional relative entropy as an ordinary divergence. $\Box$
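The following sketch is an illustration, not part of the original text, of the quantizer approximation of Corollary 7.4.4 in the memoryless special case, where the divergence rate reduces to the single-letter divergence $D(P_{X_0}\|M_{X_0})$: for two hypothetical i.i.d. Gaussian processes (p with marginal N(0,1), m with marginal N(1,4)) the divergence of uniformly quantized marginals, computed with SciPy, approaches the closed-form Gaussian relative entropy as the quantizer is refined. The bin range [-10, 10] and the specific distributions are arbitrary choices for the example.

```python
import numpy as np
from scipy.stats import norm

def quantized_divergence(p, m, edges):
    """Divergence of the quantized marginals: sum_i P(bin_i) ln( P(bin_i) / M(bin_i) )."""
    P = np.diff(p.cdf(edges))
    M = np.diff(m.cdf(edges))
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / M[mask]))

p, m = norm(0.0, 1.0), norm(1.0, 2.0)                       # marginals of the i.i.d. processes
closed_form = np.log(2.0) + (1.0 + 1.0) / (2 * 4.0) - 0.5   # D(N(0,1) || N(1,4)) in nats

for bins in (4, 16, 64, 256, 1024):
    edges = np.concatenate(([-np.inf], np.linspace(-10.0, 10.0, bins - 1), [np.inf]))
    print(bins, quantized_divergence(p, m, edges))
print("closed form:", closed_form)
```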

Another interesting property of relative entropy rates for stationary processes is that we can "reverse time" when computing the rate in the sense of the following lemma.

Lemma 7.4.2: Let $\{X_n\}$, p, and m be as in Lemma 7.4.1. If either $\bar{H}_{p\|m}(X) < \infty$ or $H_{p\|m}(X_0\mid X^-) < \infty$, then
$$H_{p\|m}(X_0 \mid X_{-1}, \cdots, X_{-n}) = H_{p\|m}(X_0 \mid X_1, \cdots, X_n)$$
and hence
$$H_{p\|m}(X_0 \mid X_1, X_2, \cdots) = H_{p\|m}(X_0 \mid X_{-1}, X_{-2}, \cdots) = \bar{H}_{p\|m}(X) < \infty.$$

Proof: If $\bar{H}_{p\|m}(X)$ is finite, then so must be the terms $H_{p\|m}(X^n) = D(P_{X^n}\|M_{X^n})$ (since otherwise all such terms with larger n would also be infinite and hence $\bar{H}$ could not be finite). Thus from stationarity
$$H_{p\|m}(X_0 \mid X_{-1}, \cdots, X_{-n}) = H_{p\|m}(X_n \mid X^n) = D(P_{X^{n+1}}\|M_{X^{n+1}}) - D(P_{X^n}\|M_{X^n})$$
$$= D(P_{X^{n+1}}\|M_{X^{n+1}}) - D(P_{X_1^n}\|M_{X_1^n}) = H_{p\|m}(X_0 \mid X_1, \cdots, X_n),$$
from which the results follow. If on the other hand the conditional relative entropy is finite, the results then follow as in the proof of Lemma 7.4.1 using the fact that the joint relative entropies are arithmetic averages of the conditional relative entropies and that the conditional relative entropy is defined as the divergence between the P and S measures (Theorem 5.3.2). $\Box$
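For a finite alphabet the time-reversal identity of Lemma 7.4.2 can be checked by brute force. The sketch below is an illustration that is not part of the original text: with two made-up stationary binary Markov chains playing the roles of p and m, it computes $H_{p\|m}(X_0\mid X_{-1},\cdots,X_{-n})$ and $H_{p\|m}(X_0\mid X_1,\cdots,X_n)$ directly from the $(n{+}1)$-block pmf's by marginalization and confirms that they agree for several n.

```python
import itertools
import numpy as np

# made-up stationary binary Markov chains p and m
# (transition matrices and their stationary distributions)
Pp = np.array([[0.9, 0.1], [0.2, 0.8]]); pip = np.array([2.0, 1.0]) / 3.0
Pm = np.array([[0.6, 0.4], [0.3, 0.7]]); pim = np.array([3.0, 4.0]) / 7.0

def joint(P, pi, x):
    """Probability of the block x under the stationary chain (pi, P)."""
    pr = pi[x[0]]
    for a, b in zip(x, x[1:]):
        pr *= P[a, b]
    return pr

def cond_div(n, pos):
    """Conditional relative entropy (p vs. m) of the symbol at index `pos` of an
    (n+1)-block given the remaining n symbols, by brute-force marginalization."""
    total = 0.0
    for x in itertools.product((0, 1), repeat=n + 1):
        px, mx = joint(Pp, pip, x), joint(Pm, pim, x)
        p_rest = sum(joint(Pp, pip, x[:pos] + (a,) + x[pos + 1:]) for a in (0, 1))
        m_rest = sum(joint(Pm, pim, x[:pos] + (a,) + x[pos + 1:]) for a in (0, 1))
        total += px * np.log((px / p_rest) / (mx / m_rest))
    return total

# By stationarity, H_{p||m}(X_0 | X_{-1},...,X_{-n}) is the pos = n case and
# H_{p||m}(X_0 | X_1,...,X_n) is the pos = 0 case; Lemma 7.4.2 says they coincide.
for n in (1, 2, 3, 6):
    print(n, cond_div(n, pos=n), cond_div(n, pos=0))
```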

7.5 Mean Ergodic Theorems

In this section we state and prove some preliminary ergodic theorems for relative entropy densities analogous to those first developed for entropy densities in Chapter 3 and for information densities in Section 6.3. In particular, we show that an almost everywhere ergodic theorem for finite alphabet processes follows easily from the sample entropy ergodic theorem and that an approximation argument then yields an $L^1$ ergodic theorem for stationary sources. The results involve little new and closely parallel those for mutual information densities, and therefore the details are kept brief. The results are given for completeness and because the $L^1$ results yield the byproduct that relative entropy densities are uniformly integrable, a fact which does not follow as easily for relative entropy densities as it did for entropy densities.


Finite Alphabets

Suppose that we now have two process distributions p and m for a random process $\{X_n\}$ with finite alphabet. Let $P_{X^n}$ and $M_{X^n}$ denote the induced nth order distributions and $p_{X^n}$ and $m_{X^n}$ the corresponding probability mass functions (pmf's). For example, $p_{X^n}(a^n) = P_{X^n}(\{x^n : x^n = a^n\}) = p(\{x : X^n(x) = a^n\})$. We assume that $P_{X^n} \ll M_{X^n}$. In this case the relative entropy density is given simply by
$$h_n(x) = h_{X^n}(X^n)(x) = \ln \frac{p_{X^n}(x^n)}{m_{X^n}(x^n)},$$
where $x^n = X^n(x)$.

The following lemma generalizes Theorem 3.1.1 from entropy densities to relative entropy densities for finite alphabet processes. Relative entropies are of more general interest than ordinary entropies because they generalize to continuous alphabets in a useful way while ordinary entropies do not.
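Before stating the lemma, the following simulation sketch (again an illustration that is not part of the original text, with an arbitrarily chosen binary Markov chain for p and the i.i.d. fair-coin measure for m) shows the kind of behaviour it asserts: along a sample path the normalized relative entropy density $n^{-1}h_n$ settles near the rate $H_{p\|m}(X_0\mid X^-)$.

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up example: p = stationary binary Markov chain, m = i.i.d. fair-coin measure
P = np.array([[0.9, 0.1], [0.2, 0.8]])   # p(x_i = b | x_{i-1} = a) = P[a, b]
pi = np.array([2.0, 1.0]) / 3.0          # stationary distribution of P
m_pmf = np.array([0.5, 0.5])

# simulate a path of length n under p, started in the stationary distribution
n = 200_000
x = np.empty(n, dtype=int)
x[0] = rng.choice(2, p=pi)
for i in range(1, n):
    x[i] = rng.choice(2, p=P[x[i - 1]])

# relative entropy density h_n = ln [ p_{X^n}(x^n) / m_{X^n}(x^n) ]
h_n = np.log(pi[x[0]] / m_pmf[x[0]]) + np.sum(np.log(P[x[:-1], x[1:]] / m_pmf[x[1:]]))

# rate H_{p||m}(X_0 | X^-) = sum_a pi(a) sum_b P(a,b) ln( P(a,b) / m(b) )
rate = sum(pi[a] * P[a, b] * np.log(P[a, b] / m_pmf[b])
           for a in range(2) for b in range(2))
print(h_n / n, rate)   # the two numbers should be close
```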

Lemma 7.5.1: Suppose that $\{X_n\}$ is a finite alphabet process and that p and m are two process distributions with $M_{X^n} \gg P_{X^n}$ for all n, where p is AMS with stationary mean $\bar{p}$, m is a kth order Markov source with stationary transitions, and $\{p_x\}$ is the ergodic decomposition of the stationary mean of p. Assume also that $M_{X^n} \gg \bar{P}_{X^n}$ for all n. Then
$$\lim_{n\to\infty} \frac{1}{n} h_n = h, \quad p\text{-a.e. and in } L^1(p),$$
where $h(x)$ is the invariant function defined by
$$h(x) = -\bar{H}_{p_x}(X) - E_{p_x} \ln m(X_k \mid X^k) = \lim_{n\to\infty} \frac{1}{n} H_{p_x\|m}(X^n) = \bar{H}_{p_x\|m}(X), \tag{7.27}$$
where
$$m(X_k \mid X^k)(x) = \frac{m_{X^{k+1}}(x^{k+1})}{m_{X^k}(x^k)} = M_{X_k\mid X^k}(x_k \mid x^k).$$
Furthermore,
$$E_p h = \bar{H}_{p\|m}(X) = \lim_{n\to\infty} \frac{1}{n} H_{p\|m}(X^n), \tag{7.28}$$
that is, the relative entropy rate of an AMS process with respect to a Markov process with stationary transitions is given by the limit. Lastly,
$$\bar{H}_{p\|m}(X) = \bar{H}_{\bar{p}\|m}(X), \tag{7.29}$$
that is, the relative entropy rate of the AMS process with respect to m is the same as that of its stationary mean with respect to m.

Proof: We have that
$$\frac{1}{n} h_n(X^n) = \frac{1}{n}\ln p(X^n) - \frac{1}{n}\ln m(X^k) - \frac{1}{n}\sum_{i=k}^{n-1}\ln m(X_i \mid X_{i-k}^k)$$
$$= \frac{1}{n}\ln p(X^n) - \frac{1}{n}\ln m(X^k) - \frac{1}{n}\sum_{i=k}^{n-1}\ln m(X_k \mid X^k)\,T^{i-k}, \tag{7.30}$$
where T is the shift transformation, $p(X^n)$ is an abbreviation for $P_{X^n}(X^n)$, and $m(X_k\mid X^k) = M_{X_k\mid X^k}(X_k\mid X^k)$. From Theorem 3.1.1 the first term converges to $-\bar{H}_{p_x}(X)$ p-a.e. and in $L^1(p)$.

Since $M_{X^k} \gg P_{X^k}$, if $M_{X^k}(F) = 0$, then also $P_{X^k}(F) = 0$. Thus $P_{X^k}$ and hence also p assign zero probability to the event that $M_{X^k}(X^k) = 0$. Thus with probability one under p, $\ln m(X^k)$ is finite and hence the second term in (7.30) converges to 0 p-a.e. as $n \to \infty$.

Define $\gamma$ as the minimum nonzero value of the conditional probability $m(x_k\mid x^k)$. Then with probability 1 under $M_{X^n}$, and hence also under $P_{X^n}$, we have that
$$\left|\frac{1}{n}\sum_{i=k}^{n-1}\ln m(X_i\mid X_{i-k}^k)\right| \le \ln\frac{1}{\gamma},$$
since otherwise the sequence $X^n$ would have 0 probability under $M_{X^n}$ and hence also under $P_{X^n}$, and $0 \ln 0$ is considered to be 0. Thus the rightmost term of (7.30) is uniformly integrable with respect to p and hence from Theorem 1.8.3 this term converges to $-E_{p_x}(\ln m(X_k\mid X^k))$. This proves the leftmost equality of (7.27).

 

 

denote the distribution of Xn under the ergodic component px.

 

Let p

n

jx

 

 

X

 

 

n

 

 

=

 

dp„(x)„pXn x, if MX

 

 

(F ) = 0, then pXn x(F ) =

 

Since MX

n

 

 

 

n

 

n

 

 

>> PX

and PX

 

 

 

 

 

 

0 p-a.e. Since the alphabet of XnRif flnite, wejtherefore also have with probabilityj

 

one under p„ that MXn >> p

 

 

n

jx

and hence

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

X

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

H

 

(Xn) =

 

 

pn

 

 

 

 

 

p

 

n

jx

(an)

 

 

 

 

 

 

 

 

 

 

X

 

(an) ln

X

 

 

 

 

 

 

 

 

 

 

 

 

 

 

pxjjm

 

 

 

 

 

 

 

 

 

n

X jx

 

 

 

 

 

MXn (an)

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

a

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

is well deflned for p„-almost all x. This expectation can also be written as

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

1

 

 

 

 

 

 

Hpxjjm(Xn) = ¡Hpx (Xn) ¡ Epx [ln m(Xk) +

X

 

 

 

 

 

 

 

 

 

 

ln m(XkjXk)T i¡k]

 

 

 

i=k

= ¡Hpx (Xn) ¡ Epx [ln m(Xk)] ¡ (n ¡ k)Epx [ln m(XkjXk)];

where we have used the stationarity of the ergodic components. Dividing by n and taking the limit as n ! 1, the middle term goes to zero as previously and the remaining limits prove the middle equality and hence the rightmost inequality in (7.27).

Equation (7.28) follows from (7.27) and L1(p) convergence, that is, since n¡1hn ! h, we must also have that Ep(n¡1hn(Xn)) = n¡1Hpjjm(Xn) converges

to Eph. Since the former limit is Hpjjm(X), (7.28) follows. Since px is invariant (Theorem 1.8.2) and since expectations of invariant functions are the same under an AMS measure and its stationary mean (Lemma 6.3.1 of [50]), application of the previous results of the lemma to both p and p„ proves that

Z Z

„ „ „

Hpjjm(X) = dp(x)Hpx jjm(X) = dp„(x)Hpx jjm(X) = Hpjjm(X);

7.5. MEAN ERGODIC THEOREMS

143

which proves (7.30) and completes the proof of the lemma. 2

Corollary 7.5.1: Given p and m as in the Lemma, the relative entropy rate of p with respect to m has an ergodic decomposition, that is,
$$\bar{H}_{p\|m}(X) = \int d\bar{p}(x)\,\bar{H}_{p_x\|m}(X).$$
Proof: This follows immediately from (7.27) and (7.28). $\Box$
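As a hypothetical finite-alphabet illustration of this decomposition (not part of the original text), the sketch below mixes two ergodic binary Markov components into a stationary but nonergodic p, takes m to be the i.i.d. fair-coin measure, and checks that the weight-averaged component rates, which is what the corollary says $\bar{H}_{p\|m}(X)$ must equal, agree with a Monte Carlo average of per-component sample density rates; all matrices, weights, and sample sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# invented ergodic components: two stationary binary Markov chains mixed with weights w;
# the reference measure m is the i.i.d. fair coin, so ln(q/m) = ln(2 q) term by term
P1 = np.array([[0.9, 0.1], [0.2, 0.8]]); pi1 = np.array([2.0, 1.0]) / 3.0
P2 = np.array([[0.3, 0.7], [0.6, 0.4]]); pi2 = np.array([6.0, 7.0]) / 13.0
w = (0.25, 0.75)

def rate(P, pi):
    """Component rate H_{p_x||m}(X_0 | X^-) against the fair-coin measure."""
    return sum(pi[a] * P[a, b] * np.log(2.0 * P[a, b]) for a in range(2) for b in range(2))

decomposition_value = w[0] * rate(P1, pi1) + w[1] * rate(P2, pi2)

def sample_density_rate(P, pi, n):
    """n^{-1} ln [ p_component(x^n) / m(x^n) ] along one simulated path of a component."""
    x = np.empty(n, dtype=int)
    x[0] = rng.choice(2, p=pi)
    for i in range(1, n):
        x[i] = rng.choice(2, p=P[x[i - 1]])
    return (np.log(2.0 * pi[x[0]]) + np.sum(np.log(2.0 * P[x[:-1], x[1:]]))) / n

# draw a component with probability w and record its sample density rate; by (7.27)
# each draw is close to that component's rate, so the average recovers the mixture value
n, trials = 5_000, 200
est = np.mean([sample_density_rate(*((P1, pi1) if rng.random() < w[0] else (P2, pi2)), n)
               for _ in range(trials)])
print(decomposition_value, est)
```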

Standard Alphabets

We now drop the finite alphabet assumption and suppose that $\{X_n\}$ is a standard alphabet process with process distributions p and m, where p is stationary, m is kth order Markov with stationary transitions, and $M_{X^n} \gg P_{X^n}$ are the induced vector distributions for $n = 1, 2, \cdots$. Define the densities $f_n$ and entropy densities $h_n$ as previously.

As an easy consequence of the development to this point, the ergodic decomposition for divergence rate of finite alphabet processes combined with the definition of H as a supremum over rates of quantized processes yields an extension of Corollary 6.2.1 to divergences. This yields other useful properties as summarized in the following corollary.

Corollary 7.5.1: Given a standard alphabet process $\{X_n\}$, suppose that p and m are two process distributions such that p is AMS and m is kth order Markov with stationary transitions and $M_{X^n} \gg P_{X^n}$ are the induced vector distributions. Let $\bar{p}$ denote the stationary mean of p and let $\{p_x\}$ denote the ergodic decomposition of the stationary mean $\bar{p}$. Then
$$H_{p\|m}(X) = \int d\bar{p}(x)\,H_{p_x\|m}(X). \tag{7.31}$$

In addition,
$$\bar{H}_{p\|m}(X) = \bar{H}_{\bar{p}\|m}(X) = H_{\bar{p}\|m}(X) = H_{p\|m}(X), \tag{7.32}$$

that is, the two definitions of relative entropy rate yield the same values for AMS p and stationary transition Markov m and both rates are the same as the corresponding rates for the stationary mean. Thus relative entropy rate has an ergodic decomposition in the sense that
$$\bar{H}_{p\|m}(X) = \int d\bar{p}(x)\,\bar{H}_{p_x\|m}(X). \tag{7.33}$$

Comment: Note that the extra technical conditions of Theorem 6.4.2 for equality of the analogous mutual information rates $\bar{I}$ and $I$ are not needed here. Note also that only the ergodic decomposition of the stationary mean $\bar{p}$ of the AMS measure p is considered and not that of the Markov source m.


Proof: The first statement follows as previously described from the finite alphabet result and the definition of H. The left-most and right-most equalities of (7.32) both follow from the previous lemma. The middle equality of (7.32) follows from Corollary 7.4.2. Eq. (7.33) then follows from (7.31) and (7.32). $\Box$

Theorem 7.5.1: Given a standard alphabet process $\{X_n\}$, suppose that p and m are two process distributions such that p is AMS and m is kth order Markov with stationary transitions and $M_{X^n} \gg P_{X^n}$ are the induced vector distributions. Let $\{p_x\}$ denote the ergodic decomposition of the stationary mean $\bar{p}$. If
$$\lim_{n\to\infty}\frac{1}{n}H_{p\|m}(X^n) = \bar{H}_{p\|m}(X) < \infty,$$
then there is an invariant function h such that $n^{-1}h_n \to h$ in $L^1(p)$ as $n \to \infty$. In fact,

$$h(x) = \bar{H}_{p_x\|m}(X),$$
the relative entropy rate of the ergodic component $p_x$ with respect to m. Thus, in particular, under the stated conditions the relative entropy densities $n^{-1}h_n$ are uniformly integrable with respect to p.

Proof: The proof exactly parallels that of Theorem 6.3.1, the mean ergodic theorem for information densities, with the relative entropy densities replacing the mutual information densities. The density is approximated by that of a quantized version and the integral bounded above using the triangle inequality. One term goes to zero from the finite alphabet case. Since $\bar{H} = H$ (Corollary 7.5.1) the remaining terms go to zero because the relative entropy rate can be approximated arbitrarily closely by that of a quantized process. $\Box$

It should be emphasized that although Theorem 7.5.1 and Theorem 6.3.1 are similar in appearance, neither result directly implies the other. It is true that mutual information can be considered as a special case of relative entropy, but given a pair process $\{X_n, Y_n\}$ we cannot in general find a kth order Markov distribution m for which the mutual information rate $\bar{I}(X; Y)$ equals a relative entropy rate $\bar{H}_{p\|m}$. We will later consider conditions under which convergence of relative entropy densities does imply convergence of information densities.

Chapter 8

Ergodic Theorems for Densities

8.1 Introduction

This chapter is devoted to developing ergodic theorems first for relative entropy densities and then information densities for the general case of AMS processes with standard alphabets. The general results were first developed by Barron [9] using the martingale convergence theorem and a new martingale inequality. The similar results of Algoet and Cover [7] can be proved without direct recourse to martingale theory. They infer the result for the stationary Markov approximation and for the infinite order approximation from the ordinary ergodic theorem. They then demonstrate that the growth rate of the true density is asymptotically sandwiched between that for the kth order Markov approximation and the infinite order approximation and that no gap is left between these asymptotic upper and lower bounds in the limit as $k \to \infty$. They use martingale theory to show that the values between which the limiting density is sandwiched are arbitrarily close to each other, but we shall see that this is not necessary and this property follows from the results of Chapter 6.

8.2 Stationary Ergodic Sources

Theorem 8.2.1: Given a standard alphabet process $\{X_n\}$, suppose that p and m are two process distributions such that p is stationary ergodic and m is a K-step Markov source with stationary transition probabilities. Let $M_{X^n} \gg P_{X^n}$ be the vector distributions induced by p and m. As before let
$$h_n = \ln f_{X^n}(X^n) = \ln \frac{dP_{X^n}}{dM_{X^n}}(X^n).$$


Then with probability one under p
$$\lim_{n\to\infty}\frac{1}{n}h_n = \bar{H}_{p\|m}(X).$$

Proof: Let $p^{(k)}$ denote the k-step Markov approximation of p as defined in Theorem 7.3.1, that is, $p^{(k)}$ has the same kth order conditional probabilities and k-dimensional initial distribution. From Corollary 7.3.1, if $k \ge K$, then (7.8)–(7.10) hold. Consider the expectation
$$E_p\left(\frac{f^{(k)}_{X^n}(X^n)}{f_{X^n}(X^n)}\right) = E_{P_{X^n}}\left(\frac{f^{(k)}_{X^n}}{f_{X^n}}\right) = \int \frac{f^{(k)}_{X^n}}{f_{X^n}}\, dP_{X^n}.$$
Define the set $A_n = \{x^n : f_{X^n} > 0\}$; then $P_{X^n}(A_n) = 1$. Use the fact that $f_{X^n} = dP_{X^n}/dM_{X^n}$ to write
$$E_P\left(\frac{f^{(k)}_{X^n}(X^n)}{f_{X^n}(X^n)}\right) = \int_{A_n} \frac{f^{(k)}_{X^n}}{f_{X^n}}\, f_{X^n}\, dM_{X^n} = \int_{A_n} f^{(k)}_{X^n}\, dM_{X^n}.$$
From Corollary 7.3.1,
$$f^{(k)}_{X^n} = \frac{dP^{(k)}_{X^n}}{dM_{X^n}}$$
and therefore
$$E_p\left(\frac{f^{(k)}_{X^n}(X^n)}{f_{X^n}(X^n)}\right) = \int_{A_n} \frac{dP^{(k)}_{X^n}}{dM_{X^n}}\, dM_{X^n} = P^{(k)}_{X^n}(A_n) \le 1.$$

Thus we can apply Lemma 5.4.2 to the sequence $f^{(k)}_{X^n}(X^n)/f_{X^n}(X^n)$ to conclude that with p-probability 1
$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{f^{(k)}_{X^n}(X^n)}{f_{X^n}(X^n)} \le 0$$
and hence
$$\lim_{n\to\infty}\frac{1}{n}\ln f^{(k)}_{X^n}(X^n) \le \liminf_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n). \tag{8.1}$$

The left-hand limit is well defined by the usual ergodic theorem:
$$\lim_{n\to\infty}\frac{1}{n}\ln f^{(k)}_{X^n}(X^n) = \lim_{n\to\infty}\frac{1}{n}\sum_{l=k}^{n-1}\ln f_{X_l\mid X_{l-k}^k}(X_l\mid X_{l-k}^k) + \lim_{n\to\infty}\frac{1}{n}\ln f_{X^k}(X^k).$$

Since $0 < f_{X^k} < \infty$ with probability 1 under $M_{X^k}$ and hence also under $P_{X^k}$, then $0 < f_{X^k}(X^k) < \infty$ under p and therefore $n^{-1}\ln f_{X^k}(X^k) \to 0$ as $n \to \infty$ with probability one. Furthermore, from the ergodic theorem for stationary and ergodic processes (e.g., Theorem 7.2.1 of [50]), since p is stationary ergodic we have with probability one under p using (7.20) and Corollary 7.4.1 that
$$\lim_{n\to\infty}\frac{1}{n}\sum_{l=k}^{n-1}\ln f_{X_l\mid X_{l-k}^k}(X_l\mid X_{l-k}^k) = \lim_{n\to\infty}\frac{1}{n}\sum_{l=k}^{n-1}\ln f_{X_0\mid X_{-1},\cdots,X_{-k}}(X_0\mid X_{-1},\cdots,X_{-k})\,T^l$$
$$= E_p \ln f_{X_0\mid X_{-1},\cdots,X_{-k}}(X_0\mid X_{-1},\cdots,X_{-k}) = H_{p\|m}(X_0\mid X_{-1},\cdots,X_{-k}) = \bar{H}_{p^{(k)}\|m}(X).$$

Thus with (8.1) we now have that
$$\liminf_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n) \ge H_{p\|m}(X_0\mid X_{-1},\cdots,X_{-k}) \tag{8.2}$$

for any positive integer k. Since m is Kth order Markov, Lemma 7.4.1 and the above imply that
$$\liminf_{n\to\infty}\frac{1}{n}\ln f_{X^n}(X^n) \ge H_{p\|m}(X_0\mid X^-) = \bar{H}_{p\|m}(X), \tag{8.3}$$

which completes half of the sandwich proof of the theorem.
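The mechanism of this half of the sandwich can be watched numerically. The sketch below is illustrative only (a made-up binary second-order chain, Python, and the choice k = 1 are all assumptions for the example): it builds the first-order Markov approximation $p^{(1)}$ of p in the sense used above (same one-step conditional probabilities and initial marginal), simulates a p-typical path, and evaluates $n^{-1}\ln\big[f^{(1)}_{X^n}(X^n)/f_{X^n}(X^n)\big]$, which for a finite alphabet equals $n^{-1}\ln\big[p^{(1)}(X^n)/p(X^n)\big]$ for any dominating Markov m. In line with the bound $E_p\big[f^{(1)}_{X^n}(X^n)/f_{X^n}(X^n)\big]\le 1$ and (8.1), the quantity stays at or below zero for large n.

```python
import numpy as np

rng = np.random.default_rng(2)

# made-up second-order binary chain p: P2[(a, b)][c] = p(X_i = c | X_{i-2} = a, X_{i-1} = b)
P2 = {(0, 0): [0.9, 0.1], (0, 1): [0.4, 0.6], (1, 0): [0.7, 0.3], (1, 1): [0.2, 0.8]}

# stationary distribution of the pair process (X_{i-1}, X_i)
pairs = [(a, b) for a in (0, 1) for b in (0, 1)]
Q = np.zeros((4, 4))
for i, (a, b) in enumerate(pairs):
    for c in (0, 1):
        Q[i, pairs.index((b, c))] = P2[(a, b)][c]
evals, evecs = np.linalg.eig(Q.T)
pi_pair = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi_pair /= pi_pair.sum()
pi1 = {b: sum(pi_pair[pairs.index((a, b))] for a in (0, 1)) for b in (0, 1)}

# first-order Markov approximation p^(1): same one-step conditionals and initial marginal
P1 = {b: [sum(pi_pair[pairs.index((a, b))] * P2[(a, b)][c] for a in (0, 1)) / pi1[b]
          for c in (0, 1)] for b in (0, 1)}

# simulate a p-typical path and compute n^{-1} ln [ p^(1)(x^n) / p(x^n) ],
# which equals n^{-1} ln [ f^(1)_{X^n}(X^n) / f_{X^n}(X^n) ] for any dominating Markov m
n = 100_000
x = list(pairs[rng.choice(4, p=pi_pair)])
for _ in range(2, n):
    x.append(int(rng.choice(2, p=P2[(x[-2], x[-1])])))

log_p = np.log(pi_pair[pairs.index((x[0], x[1]))]) \
        + sum(np.log(P2[(x[i - 2], x[i - 1])][x[i]]) for i in range(2, n))
log_p1 = np.log(pi1[x[0]]) + sum(np.log(P1[x[i - 1]][x[i]]) for i in range(1, n))

print((log_p1 - log_p) / n)   # nonpositive for large n, as (8.1) requires
```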

If $\bar{H}_{p\|m}(X) = \infty$, the proof is completed with (8.3). Hence we can suppose that $\bar{H}_{p\|m}(X) < \infty$. From Lemma 7.4.1, using the distribution $S_{X_0,X_{-1},X_{-2},\cdots}$ constructed there, we have that

$$D(P_{X_0,X_{-1},\cdots}\|S_{X_0,X_{-1},\cdots}) = H_{p\|m}(X_0\mid X^-) = \int dP_{X_0,X^-}\,\ln f_{X_0\mid X^-},$$
where
$$f_{X_0\mid X^-} = \frac{dP_{X_0,X_{-1},\cdots}}{dS_{X_0,X_{-1},\cdots}}.$$

It should be pointed out that we have not (and will not) prove that $f_{X_0\mid X_{-1},\cdots,X_{-n}} \to f_{X_0\mid X^-}$, the convergence of conditional probability densities which follows from the martingale convergence theorem and the result about which most generalized Shannon-McMillan-Breiman theorems are built (see, e.g., Barron [9]). We have proved, however, that the expectations converge (Lemma 7.4.1), which is what is needed to make the sandwich argument work.

For the second half of the sandwich proof we construct a measure Q which will be dominated by p on semi-infinite sequences using the above conditional densities given the infinite past. Define the semi-infinite sequence $X_n^- = (\cdots, X_{n-2}, X_{n-1})$ for all nonnegative integers n. Let $\mathcal{B}_k^n = \sigma(X_k^n)$ and $\mathcal{B}_k^- = \sigma(X_k^-) = \sigma(\cdots, X_{k-1})$ be the $\sigma$-fields generated by the finite dimensional random vector $X_k^n$ and the semi-infinite sequence $X_k^-$, respectively. Let Q be the process distribution having the same restriction to $\sigma(X_k^-)$ as does p and the same restriction to $\sigma(X_0, X_1, \cdots)$ as does p, but which makes $X^-$ and $X_k^n$ conditionally independent given $X^k$ for any n; that is,
$$Q_{X_k^-} = P_{X_k^-},$$
$$Q_{X_k,X_{k+1},\cdots} = P_{X_k,X_{k+1},\cdots},$$
and $X^- \to X^k \to X_k^n$ is a Markov chain for all positive integers n so that
$$Q(X_k^n \in F \mid X_k^-) = Q(X_k^n \in F \mid X^k).$$
The measure Q is a (nonstationary) k-step Markov approximation to P in the sense of Section 5.3 and
$$Q = P_{X^-\times(X_k,X_{k+1},\cdots)\mid X^k}$$
(in contrast to $P = P_{X^- X^k X_k^\infty}$). Observe that $X^- \to X^k \to X_k^n$ is a Markov chain under both Q and m.

By assumption, $H_{p\|m}(X_0\mid X^-) < \infty$ and hence from Corollary 7.4.1
$$H_{p\|m}(X_k^n \mid X_k^-) = n\,H_{p\|m}(X_0\mid X^-) < \infty$$

 

and hence from Theorem 5.3.2 the density $f_{X_k^n\mid X_k^-}$ is well-defined as
$$f_{X_k^n\mid X_k^-} = \frac{dP_{X_{n+k}^-}}{dS_{X_{n+k}^-}}, \tag{8.4}$$
where
$$S_{X_{n+k}^-} = M_{X_k^n\mid X^k}\,P_{X_k^-}, \tag{8.5}$$
and
$$\int dP_{X_{n+k}^-}\,\ln f_{X_k^n\mid X_k^-} = D(P_{X_{n+k}^-}\|S_{X_{n+k}^-}) = H_{p\|m}(X_k^n\mid X_k^-) < \infty. \tag{8.6}$$

Thus, in particular,
$$S_{X_{n+k}^-} \gg P_{X_{n+k}^-}.$$

Consider now the sequence of ratios of conditional densities
$$\zeta_n = \frac{f_{X_k^n\mid X^k}(X^{n+k})}{f_{X_k^n\mid X_k^-}(X_{n+k}^-)}.$$
We have that
$$\int dp\,\zeta_n = \int_{G_n} \cdots$$