Gray R.M., Entropy and Information Theory, 1990.

CHAPTER 3. THE ENTROPY ERGODIC THEOREM

3.1. INTRODUCTION

Lemma 3.1.1: Given a finite alphabet stationary $k$th order Markov source $\{X_n\}$, there is an invariant function $h$ such that

$$\lim_{n\to\infty} -\frac{\ln m(X^n)}{n} = h, \quad m\text{-a.e. and in } L^1(m),$$

where $h$ is defined by

$$h(x) = -E_{m_x}\ln m(X_k|X^k), \qquad (3.5)$$

where $\{m_x\}$ is the ergodic decomposition of the stationary mean $\bar m$. Furthermore,

$$h(x) = \bar H_{m_x}(X) = H_{m_x}(X_k|X^k). \qquad (3.6)$$

Proof of Lemma: We have that

$$-\frac{1}{n}\ln m(X^n) = -\frac{1}{n}\sum_{i=0}^{n-1}\ln m(X_i|X^i).$$

Since the process is $k$th order Markov with stationary transition probabilities, for $i \ge k$ we have that

$$m(X_i|X^i) = m(X_i|X_{i-k},\cdots,X_{i-1}) = m(X_k|X^k)T^{i-k}.$$

The terms $-\ln m(X_i|X^i)$, $i = 0,1,\cdots,k-1$, have finite expectation and hence are finite $m$-a.e., so that the ergodic theorem can be applied to deduce

$$-\frac{1}{n}\ln m(X^n)(x) = -\frac{1}{n}\sum_{i=0}^{k-1}\ln m(X_i|X^i)(x) - \frac{1}{n}\sum_{i=k}^{n-1}\ln m(X_k|X^k)(T^{i-k}x)$$
$$= -\frac{1}{n}\sum_{i=0}^{k-1}\ln m(X_i|X^i)(x) - \frac{1}{n}\sum_{i=0}^{n-k-1}\ln m(X_k|X^k)(T^i x)$$
$$\xrightarrow[n\to\infty]{} E_{m_x}(-\ln m(X_k|X^k)),$$

proving the first statement of the lemma. It follows from the ergodic decomposition of Markov sources (see Lemma 8.6.3) that with probability 1, $m_x(X_k|X^k) = m_{\psi(x)}(X_k|X^k) = m(X_k|X^k)$, where $\psi$ is the ergodic component function. This completes the proof. $\Box$
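Lemma 3.1.1 can be illustrated numerically. The sketch below assumes a two-state first-order ($k=1$) Markov source with an arbitrarily chosen transition matrix (all parameter values are illustrative, not from the text) and compares the sample entropy $-n^{-1}\ln m(X^n)$ of a long simulated path with the entropy rate $-E\ln m(X_1|X_0)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-state, first-order (k = 1) Markov source with an
# assumed transition matrix P; pi is its stationary distribution.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi /= pi.sum()

# Entropy rate h = -E ln m(X_1|X_0) = -sum_i pi_i sum_j P_ij ln P_ij (nats).
h = -sum(pi[i] * P[i, j] * np.log(P[i, j]) for i in range(2) for j in range(2))

# Sample a long path and compute the sample entropy -(1/n) ln m(X^n).
n = 200_000
x = np.empty(n, dtype=int)
x[0] = rng.choice(2, p=pi)
for t in range(1, n):
    x[t] = 1 if rng.random() < P[x[t - 1], 1] else 0

log_prob = np.log(pi[x[0]]) + np.log(P[x[:-1], x[1:]]).sum()
sample_entropy = -log_prob / n
print(f"entropy rate h = {h:.4f}, sample entropy = {sample_entropy:.4f}")
```

For an ergodic Markov chain the two numbers agree to within sampling error, as the lemma predicts.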

We prove the theorem in three steps: The first step considers stationary and ergodic sources and uses the approach of Ornstein and Weiss (see also Shields). The second step removes the requirement for ergodicity. This result will later be seen to provide an information theoretic interpretation of the ergodic decomposition. The third step extends the result to AMS processes by showing that such processes inherit limiting sample entropies from their stationary mean. The later extension of these results to more general relative entropy and information densities will closely parallel the proofs of the second and third steps for the finite case.

3.2 Stationary Ergodic Sources

This section is devoted to proving the entropy ergodic theorem for the special case of stationary ergodic sources. The result was originally proved by Breiman. The original proof first used the martingale convergence theorem to infer the convergence of conditional probabilities of the form $m(X_0|X_{-1},X_{-2},\cdots,X_{-k})$ to $m(X_0|X_{-1},X_{-2},\cdots)$. This result was combined with an extended form of the ergodic theorem stating that if $g_k \to g$ as $k \to \infty$ and if the $g_k$ are $L^1$-dominated

($\sup_k |g_k|$ is in $L^1$), then $\frac{1}{n}\sum_{k=0}^{n-1} g_k T^k$ has the same limit as $\frac{1}{n}\sum_{k=0}^{n-1} g T^k$.

Combining these facts yields that

$$\frac{1}{n}\ln m(X^n) = \frac{1}{n}\sum_{k=0}^{n-1}\ln m(X_k|X^k) = \frac{1}{n}\sum_{k=0}^{n-1}\ln m(X_0|X_{-k}^k)T^k$$

has the same limit as

$$\frac{1}{n}\sum_{k=0}^{n-1}\ln m(X_0|X_{-1},X_{-2},\cdots)T^k,$$

which, from the usual ergodic theorem, is the expectation

$$E(\ln m(X_0|X^-)) \equiv E(\ln m(X_0|X_{-1},X_{-2},\cdots)).$$

As suggested at the end of the preceding chapter, this should be minus the conditional entropy $H(X_0|X_{-1},X_{-2},\cdots)$, which in turn should be the entropy rate $\bar H_X$. This approach has three shortcomings: it requires a result from martingale theory which has not been proved here or in the companion volume, it requires an extended ergodic theorem which has similarly not been proved here, and it requires a more advanced definition of entropy which has not yet been introduced. Another approach is the sandwich proof of Algoet and Cover. They show, without using martingale theory or the extended ergodic theorem, that $\frac{1}{n}\sum_{i=0}^{n-1}\ln m(X_0|X_{-i}^i)T^i$ is asymptotically sandwiched between the entropy rate of a $k$th order Markov approximation and the conditional entropy given the infinite past:

$$\frac{1}{n}\sum_{i=k}^{n-1}\ln m(X_0|X_{-k}^k)T^i \to E_m[\ln m(X_0|X_{-k}^k)] = -H(X_0|X_{-k}^k)$$

and

$$\frac{1}{n}\sum_{i=k}^{n-1}\ln m(X_0|X_{-1},X_{-2},\cdots)T^i \to E_m[\ln m(X_0|X_{-1},\cdots)]$$
$$= -H(X_0|X_{-1},X_{-2},\cdots).$$

By showing that these two limits are arbitrarily close as $k \to \infty$, the result is proved. The drawback of this approach for present purposes is that again the

more advanced notion of conditional entropy given the infinite past is required. Algoet and Cover's proof that the above two entropies are asymptotically close involves martingale theory, but this can be avoided by using Corollary 5.2.4, as will be seen.

The result can, however, be proved without martingale theory, the extended ergodic theorem, or advanced notions of entropy using the approach of Ornstein and Weiss, which is the approach we shall take in this chapter. In a later chapter, when the entropy ergodic theorem is generalized to nonfinite alphabets and the convergence of entropy and information densities is proved, the sandwich approach will be used since the appropriate general definitions of entropy will have been developed and the necessary side results will have been proved.

Lemma 3.2.1: Given a finite alphabet source $\{X_n\}$ with a stationary ergodic distribution $m$, we have that

$$\lim_{n\to\infty} -\frac{\ln m(X^n)}{n} = h, \quad m\text{-a.e.},$$

where $h(x)$ is the invariant function defined by

$$h(x) = \bar H_m(X).$$

Proof: Define

$$h_n(x) = -\ln m(X^n)(x) = -\ln m(x^n)$$

and

$$h(x) = \liminf_{n\to\infty}\frac{1}{n}h_n(x) = \liminf_{n\to\infty}\frac{-\ln m(x^n)}{n}.$$

Since $m((x_0,\cdots,x_{n-1})) \le m((x_1,\cdots,x_{n-1}))$, we have that

$$h_n(x) \ge h_{n-1}(Tx).$$

Dividing by $n$ and taking the limit infimum of both sides shows that $h(x) \ge h(Tx)$. Since the $n^{-1}h_n$ are nonnegative and uniformly integrable (Lemma 2.3.6), we can use Fatou's lemma to deduce that $h$ and hence also $hT$ are integrable with respect to $m$. Integrating with respect to the stationary measure $m$ yields

$$\int dm(x)\,h(x) = \int dm(x)\,h(Tx),$$

which can only be true if

$$h(x) = h(Tx), \quad m\text{-a.e.},$$

that is, if $h$ is an invariant function with $m$-probability one. If $h$ is invariant almost everywhere, however, it must be a constant with probability one since $m$ is ergodic (Lemma 6.7.1). Since it has a finite integral (bounded by $\bar H_m(X)$), $h$ must also be finite. Henceforth we consider $h$ to be a finite constant.


We now proceed with steps that resemble those of the proof of the ergodic theorem in Section 7.2. Fix $\epsilon > 0$. We also choose for later use a $\delta > 0$ small enough to have the following properties: If $A$ is the alphabet of $X_0$ and $||A||$ is the finite cardinality of the alphabet, then

$$\delta\ln||A|| < \epsilon \qquad (3.7)$$

and

$$-\delta\ln\delta - (1-\delta)\ln(1-\delta) \equiv h_2(\delta) < \epsilon. \qquad (3.8)$$

The latter property is possible since $h_2(\delta) \to 0$ as $\delta \to 0$.

Define the random variable $n(x)$ to be the smallest integer $n$ for which $n^{-1}h_n(x) \le h + \epsilon$. By definition of the limit infimum there must be infinitely many $n$ for which this is true, and hence $n(x)$ is everywhere finite. Define the set of "bad" sequences by $B = \{x : n(x) > N\}$, where $N$ is chosen so large that $m(B) < \delta/2$. Still mimicking the proof of the ergodic theorem, we define a bounded modification of $n(x)$ by

$$\tilde n(x) = \begin{cases} n(x) & x \notin B \\ 1 & x \in B \end{cases}$$

so that $\tilde n(x) \le N$ for all $x \in B^c$. We now parse the sequence into variable-length blocks. Iteratively define $n_k(x)$ by

$$n_0(x) = 0$$
$$n_1(x) = \tilde n(x)$$
$$n_2(x) = n_1(x) + \tilde n(T^{n_1(x)}x) = n_1(x) + l_1(x)$$
$$\vdots$$
$$n_{k+1}(x) = n_k(x) + \tilde n(T^{n_k(x)}x) = n_k(x) + l_k(x),$$

where $l_k(x)$ is the length of the $k$th block:

$$l_k(x) = \tilde n(T^{n_k(x)}x).$$

We have parsed a long sequence $x^L = (x_0,\cdots,x_{L-1})$, where $L \gg N$, into blocks $x_{n_k(x)},\cdots,x_{n_{k+1}(x)-1} = x_{n_k(x)}^{l_k(x)}$, which begin at time $n_k(x)$ and have length $l_k(x)$ for $k = 0,1,\cdots$. We refer to this parsing as the block decomposition of a sequence. The $k$th block, which begins at time $n_k(x)$, must either have sample entropy satisfying

$$-\frac{\ln m(x_{n_k(x)}^{l_k(x)})}{l_k(x)} \le h + \epsilon \qquad (3.9)$$

or, equivalently, probability at least

$$m(x_{n_k(x)}^{l_k(x)}) \ge e^{-l_k(x)(h+\epsilon)}, \qquad (3.10)$$

or it must consist of only a single symbol. Blocks having length 1 ($l_k = 1$) could have the correct sample entropy, that is,

$$-\frac{\ln m(x_{n_k(x)}^1)}{1} \le h + \epsilon,$$

or they could be bad in the sense that they are the first symbol of a sequence with $n(x) > N$; that is,

$$n(T^{n_k(x)}x) > N$$

or, equivalently,

$$T^{n_k(x)}x \in B.$$

Except for these bad symbols, each of the blocks by construction will have a probability which satisfies the above bound.
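The block decomposition can be sketched in code. The fragment below assumes, for simplicity, an iid binary source (so that $m(x^n)$ factors) and illustrative values of $\epsilon$ and $N$; it computes $\tilde n$ from the $N$-symbol prefix at each position and parses the sequence into variable-length blocks exactly as above:

```python
import math
import random

random.seed(1)

# Sketch of the block decomposition. For simplicity we assume an iid
# binary source, so m(x^n) factors; p, eps and N are illustrative choices.
p = {0: 0.7, 1: 0.3}
h = -sum(q * math.log(q) for q in p.values())  # entropy rate in nats
eps, N = 0.05, 60

x = [0 if random.random() < p[0] else 1 for _ in range(5000)]

def n_tilde(prefix):
    # Smallest n <= N with -(1/n) ln m(prefix^n) <= h + eps, else 1
    # (the block is then a "bad" single symbol).
    log_m = 0.0
    for n in range(1, N + 1):
        log_m += math.log(p[prefix[n - 1]])
        if -log_m / n <= h + eps:
            return n
    return 1

# Parse x into variable-length blocks (n_k, l_k) as in the text.
blocks, pos = [], 0
while pos + N <= len(x):
    l = n_tilde(x[pos:pos + N])
    blocks.append((pos, l))
    pos += l

lengths = [l for _, l in blocks]
print(f"{len(blocks)} blocks, mean length {sum(lengths) / len(lengths):.1f}")
```

Every block either meets the sample entropy threshold or is a single symbol, mirroring the trichotomy in the text.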

Define for nonnegative integers $n$ and positive integers $l$ the sets

$$S(n,l) = \{x : m(X_n^l(x)) \ge e^{-l(h+\epsilon)}\},$$

that is, the collection of infinite sequences for which (3.9) and (3.10) hold for a block starting at $n$ and having length $l$. Observe that for such blocks there cannot be more than $e^{l(h+\epsilon)}$ distinct $l$-tuples for which the bound holds (lest the probabilities sum to something greater than 1). In symbols this is

$$||S(n,l)|| \le e^{l(h+\epsilon)}. \qquad (3.11)$$
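The counting bound (3.11) can be checked by brute force in a small example. The sketch below assumes an iid binary source (parameters illustrative), enumerates all $l$-tuples whose probability meets the threshold $e^{-l(h+\epsilon)}$, and compares the count with the bound:

```python
import itertools
import math

# Check the counting bound (3.11) for an assumed iid binary source:
# count the l-tuples whose probability is at least e^{-l(h+eps)}.
p = {0: 0.7, 1: 0.3}
h = -sum(q * math.log(q) for q in p.values())  # entropy rate in nats
eps, l = 0.1, 12

count = 0
for tup in itertools.product((0, 1), repeat=l):
    log_m = sum(math.log(p[s]) for s in tup)
    if log_m >= -l * (h + eps):  # m(x^l) >= e^{-l(h+eps)}
        count += 1

bound = math.exp(l * (h + eps))
print(f"{count} qualifying {l}-tuples, bound {bound:.0f}")
```

The count can never exceed the bound, since each qualifying tuple carries probability at least $e^{-l(h+\epsilon)}$ and the probabilities sum to at most 1.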

The ergodic theorem will imply that there cannot be too many single-symbol blocks with $n(T^{n_k(x)}x) > N$ because this event has small probability. These facts will be essential to the proof.

Even though we write $\tilde n(x)$ as a function of the entire infinite sequence, we can determine its value by observing only the prefix $x^N$ of $x$, since either there is an $n \le N$ for which $-n^{-1}\ln m(x^n) \le h + \epsilon$ or there is not. Hence there is a function $\hat n(x^N)$ such that $\tilde n(x) = \hat n(x^N)$. Define the finite length sequence event

$$C = \{x^N : \hat n(x^N) = 1 \text{ and } -\ln m(x^1) > h + \epsilon\},$$

that is, $C$ is the collection of all $N$-tuples $x^N$ that are prefixes of bad infinite sequences, sequences $x$ for which $n(x) > N$. Thus in particular,

$$x \in B \text{ if and only if } x^N \in C. \qquad (3.12)$$

Now recall that we parse sequences of length $L \gg N$ and define the set $G_L$ of "good" $L$-tuples by

$$G_L = \{x^L : \frac{1}{L-N}\sum_{i=0}^{L-N-1} 1_C(x_i^N) \le \delta\},$$

that is, $G_L$ is the collection of all $L$-tuples which have fewer than $(L-N)\delta \le \delta L$ time slots $i$ for which $x_i^N$ is a prefix of a bad infinite sequence. From (3.12) and


the ergodic theorem for stationary ergodic sources we know that $m$-a.e. we get an $x$ for which

$$\lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} 1_C(x_i^N) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=0}^{n-1} 1_B(T^ix) = m(B) \le \frac{\delta}{2}. \qquad (3.13)$$

From the definition of a limit, this means that with probability 1 we get an $x$ for which there is an $L_0 = L_0(x)$ such that

$$\frac{1}{L-N}\sum_{i=0}^{L-N-1} 1_C(x_i^N) \le \delta \text{ for all } L > L_0. \qquad (3.14)$$

This follows simply because if the limit is less than $\delta/2$, there must be an $L_0$ so large that for larger $L$ the time average is no greater than $2\delta/2 = \delta$. We can restate (3.14) as follows: with probability 1 we get an $x$ for which $x^L \in G_L$ for all but a finite number of $L$. Stating this in negative fashion, we have one of the key properties required by the proof: If $x^L \in G_L$ for all but a finite number of $L$, then $x^L$ cannot be in the complement $G_L^c$ infinitely often, that is,

$$m(x : x^L \in G_L^c \text{ i.o.}) = 0. \qquad (3.15)$$

We now change tack to develop another key result for the proof. For each $L$ we bound above the cardinality $||G_L||$ of the set of good $L$-tuples. By construction there are no more than $\delta L$ bad symbols in an $L$-tuple in $G_L$, and these can occur in any of at most

$$\sum_{k \le \delta L}\binom{L}{k} \le e^{h_2(\delta)L} \qquad (3.16)$$

places, where we have used Lemma 2.3.5. Eq. (3.16) provides an upper bound on the number of ways that a sequence in $G_L$ can be parsed by the given rules. The bad symbols and the final $N$ symbols in the $L$-tuple can take on any of the $||A||$ different values in the alphabet. Eq. (3.11) bounds the number of finite length sequences that can occur in each of the remaining blocks, and hence for any given block decomposition the number of ways that the remaining blocks can be filled is bounded above by

$$\prod_{k:\,T^{n_k(x)}x \notin B} e^{l_k(x)(h+\epsilon)} = e^{\sum_k l_k(x)(h+\epsilon)} \le e^{L(h+\epsilon)}, \qquad (3.17)$$

regardless of the details of the parsing. Combining these bounds we have that

$$||G_L|| \le e^{h_2(\delta)L}\,||A||^{\delta L}\,||A||^{N}\,e^{L(h+\epsilon)} = e^{h_2(\delta)L + (\delta L+N)\ln||A|| + L(h+\epsilon)}$$

or

$$||G_L|| \le e^{L(h+\epsilon+h_2(\delta)+(\delta+\frac{N}{L})\ln||A||)}.$$

Since $\delta$ satisfies (3.7)-(3.8), we can choose $L_1$ large enough so that $N\ln||A||/L_1 < \epsilon$ and thereby obtain

$$||G_L|| \le e^{L(h+4\epsilon)}, \quad L \ge L_1. \qquad (3.18)$$

This bound provides the second key result in the proof of the lemma. We now combine (3.18) and (3.15) to complete the proof.
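The combinatorial ingredient (3.16), $\sum_{k\le\delta L}\binom{L}{k} \le e^{h_2(\delta)L}$ for $\delta \le 1/2$ (Lemma 2.3.5), is easy to verify numerically; the parameter grid below is arbitrary:

```python
import math

# Numeric check of the bound in (3.16):
# sum_{k <= delta L} C(L, k) <= e^{h2(delta) L} for delta <= 1/2.
def h2(d):
    return 0.0 if d in (0.0, 1.0) else -d * math.log(d) - (1 - d) * math.log(1 - d)

for L in (50, 200, 1000):
    for delta in (0.05, 0.1, 0.3):
        lhs = sum(math.comb(L, k) for k in range(int(delta * L) + 1))
        rhs = math.exp(h2(delta) * L)
        assert lhs <= rhs
print("binomial bound verified")
```

The standard proof evaluates $1 = \sum_k \binom{L}{k}\delta^k(1-\delta)^{L-k} \ge \sum_{k\le\delta L}\binom{L}{k}\,\delta^{\delta L}(1-\delta)^{(1-\delta)L}$ and rearranges.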

Let $B_L$ denote a collection of $L$-tuples that are bad in the sense of having too large a sample entropy or, equivalently, too small a probability; that is, if $x^L \in B_L$, then

$$m(x^L) \le e^{-L(h+5\epsilon)}$$

or, equivalently, for any $x$ with prefix $x^L$,

$$\frac{1}{L}h_L(x) \ge h + 5\epsilon.$$

The upper bound on $||G_L||$ provides a bound on the probability of $B_L \cap G_L$:

$$m(B_L \cap G_L) = \sum_{x^L \in B_L \cap G_L} m(x^L) \le \sum_{x^L \in G_L} e^{-L(h+5\epsilon)} \le ||G_L||\,e^{-L(h+5\epsilon)} \le e^{-\epsilon L}.$$

Recall now that the above bound is true for a fixed $\epsilon > 0$ and for all $L \ge L_1$. Thus

$$\sum_{L=1}^{\infty} m(B_L \cap G_L) = \sum_{L=1}^{L_1-1} m(B_L \cap G_L) + \sum_{L=L_1}^{\infty} m(B_L \cap G_L) \le L_1 + \sum_{L=L_1}^{\infty} e^{-\epsilon L} < \infty,$$

and hence from the Borel-Cantelli lemma (Lemma 4.6.3) $m(x : x^L \in B_L \cap G_L \text{ i.o.}) = 0$. We also have from (3.15), however, that $m(x : x^L \in G_L^c \text{ i.o.}) = 0$ and hence $x^L \in G_L$ for all but a finite number of $L$. Thus $x^L \in B_L$ i.o. if and only if $x^L \in B_L \cap G_L$ i.o. As this latter event has zero probability, we have shown that $m(x : x^L \in B_L \text{ i.o.}) = 0$ and hence

$$\limsup_{L\to\infty}\frac{1}{L}h_L(x) \le h + 5\epsilon.$$

Since $\epsilon$ is arbitrary, we have proved that the limit supremum of the sample entropy $-n^{-1}\ln m(X^n)$ is less than or equal to the limit infimum, and therefore the limit exists and hence with $m$-probability 1

$$\lim_{n\to\infty} -\frac{\ln m(X^n)}{n} = h. \qquad (3.19)$$

Since the terms on the left in (3.19) are uniformly integrable from Lemma 2.3.6, we can integrate to the limit and apply Lemma 2.4.1 to find that

$$h = \lim_{n\to\infty}\int dm(x)\,\frac{-\ln m(X^n(x))}{n} = \bar H_m(X),$$

which completes the proof of the lemma and hence also proves Theorem 3.1.1 for the special case of stationary ergodic measures. $\Box$

3.3 Stationary Nonergodic Sources

Next suppose that a source is stationary with ergodic decomposition $\{m_\lambda;\ \lambda \in \Lambda\}$ and ergodic component function $\psi$ as in Theorem 1.8.3. The source will produce with probability one under $m$ an ergodic component $m_\lambda$, and Lemma 3.2.1 will hold for this ergodic component. In other words, we should have that

$$\lim_{n\to\infty} -\frac{1}{n}\ln m_\psi(X^n) = \bar H_{m_\psi}(X), \quad m\text{-a.e.}, \qquad (3.20)$$

that is,

$$m(\{x : -\lim_{n\to\infty}\frac{1}{n}\ln m_{\psi(x)}(x^n) = \bar H_{m_{\psi(x)}}(X)\}) = 1.$$

This argument is made rigorous in the following lemma.

Lemma 3.3.1: Suppose that $\{X_n\}$ is a stationary, not necessarily ergodic, source with ergodic component function $\psi$. Then

$$m(\{x : -\lim_{n\to\infty}\frac{1}{n}\ln m_{\psi(x)}(x^n) = \bar H_{m_{\psi(x)}}(X)\}) = 1. \qquad (3.21)$$

Proof: Let

$$G = \{x : -\lim_{n\to\infty}\frac{1}{n}\ln m_{\psi(x)}(x^n) = \bar H_{m_{\psi(x)}}(X)\}$$

and let $G_\lambda$ denote the section of $G$ at $\lambda$, that is,

$$G_\lambda = \{x : -\lim_{n\to\infty}\frac{1}{n}\ln m_\lambda(x^n) = \bar H_{m_\lambda}(X)\}.$$

From the ergodic decomposition (e.g., Theorem 1.8.3 or Theorem 8.5.1) and (1.26)

$$m(G) = \int dP_\psi(\lambda)\,m_\lambda(G),$$

where

$$m_\lambda(G) = m(G|\psi = \lambda) = m(G \cap \{x : \psi(x) = \lambda\}\,|\,\psi = \lambda) = m(G_\lambda|\psi = \lambda) = m_\lambda(G_\lambda),$$

which is 1 for all $\lambda$ from the stationary ergodic result. Thus

$$m(G) = \int dP_\psi(\lambda)\,m_\lambda(G) = 1.$$

It is straightforward to verify that all of the sets considered are in fact measurable. $\Box$

Unfortunately, it is not the sample entropy using the distribution of the ergodic component that is of interest; rather, it is the original sample entropy $-n^{-1}\ln m(X^n)$ for which we wish to prove convergence. The following lemma shows that the two sample entropies converge to the same limit, and hence Lemma 3.3.1 will also provide the limit of the sample entropy with respect to the stationary measure.

Lemma 3.3.2: Given a stationary source $\{X_n\}$, let $\{m_\lambda;\ \lambda \in \Lambda\}$ denote the ergodic decomposition and $\psi$ the ergodic component function of Theorem 1.8.3. Then

$$\lim_{n\to\infty}\frac{1}{n}\ln\frac{m(X^n)}{m_\psi(X^n)} = 0, \quad m\text{-a.e.}$$

Proof: First observe that if $m(a^n)$ is 0, then from the ergodic decomposition, with probability 1 $m_\psi(a^n)$ will also be 0. One part is easy. For any $\epsilon > 0$ we have from the Markov inequality that

$$m\left(\frac{1}{n}\ln\frac{m(X^n)}{m_\psi(X^n)} > \epsilon\right) = m\left(\frac{m(X^n)}{m_\psi(X^n)} > e^{n\epsilon}\right) \le E_m\left(\frac{m(X^n)}{m_\psi(X^n)}\right)e^{-n\epsilon}.$$

The expectation, however, can be evaluated as follows: Let $A(\lambda) = \{a^n : m_\lambda(a^n) > 0\}$. Then

$$E_m\left(\frac{m(X^n)}{m_\psi(X^n)}\right) = \int dP_\psi(\lambda)\sum_{a^n \in A(\lambda)} m_\lambda(a^n)\,\frac{m(a^n)}{m_\lambda(a^n)} = \int dP_\psi(\lambda)\,m(A(\lambda)) \le 1,$$

where $P_\psi$ is the distribution of $\psi$. Thus

$$m\left(\frac{1}{n}\ln\frac{m(X^n)}{m_\psi(X^n)} > \epsilon\right) \le e^{-n\epsilon}$$

and hence

$$\sum_{n=1}^{\infty} m\left(\frac{1}{n}\ln\frac{m(X^n)}{m_\psi(X^n)} > \epsilon\right) < \infty,$$

and hence from the Borel-Cantelli lemma

$$m\left(\frac{1}{n}\ln\frac{m(X^n)}{m_\psi(X^n)} > \epsilon \text{ i.o.}\right) = 0,$$

and hence with $m$-probability 1

$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{m(X^n)}{m_\psi(X^n)} \le \epsilon.$$

Since $\epsilon$ is arbitrary,

$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{m(X^n)}{m_\psi(X^n)} \le 0, \quad m\text{-a.e.} \qquad (3.22)$$

For later use we restate this as

$$\liminf_{n\to\infty}\frac{1}{n}\ln\frac{m_\psi(X^n)}{m(X^n)} \ge 0, \quad m\text{-a.e.} \qquad (3.23)$$
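The content of (3.22) can be seen in a toy nonergodic source: a 50/50 mixture of two iid Bernoulli components. The sketch below (all parameters illustrative) draws the ergodic component $\psi$, then a sequence from it, and evaluates $n^{-1}\ln[m(x^n)/m_\psi(x^n)]$, which should be near 0:

```python
import math
import random

random.seed(2)

# Toy stationary nonergodic source: m = (1/2) m_A + (1/2) m_B with
# m_A, m_B iid Bernoulli; the parameters are illustrative.
p_A, p_B, n = 0.2, 0.8, 20_000

# Draw the ergodic component psi(x), then the sequence from it.
p = p_A if random.random() < 0.5 else p_B
x = [1 if random.random() < p else 0 for _ in range(n)]

k = sum(x)
log_mA = k * math.log(p_A) + (n - k) * math.log(1 - p_A)
log_mB = k * math.log(p_B) + (n - k) * math.log(1 - p_B)
# log m(x^n) = log((e^{log_mA} + e^{log_mB}) / 2), computed stably.
big = max(log_mA, log_mB)
log_m = big + math.log(0.5 * (math.exp(log_mA - big) + math.exp(log_mB - big)))
log_m_psi = log_mA if p == p_A else log_mB

ratio = (log_m - log_m_psi) / n  # (1/n) ln [ m(x^n) / m_psi(x^n) ]
print(f"(1/n) ln m/m_psi = {ratio:.6f}")
```

Here $m \ge \frac{1}{2}m_\psi$ forces the ratio above $-\ln 2/n$, while the wrong component's likelihood is exponentially negligible, so the normalized log ratio vanishes, exactly as the lemma asserts.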

We now turn to the converse inequality. For any positive integer $k$, we can construct a stationary $k$-step Markov approximation to $m$ as in Section 2.6, that is, construct a process $m^{(k)}$ with the conditional probabilities

$$m^{(k)}(X_n \in F|X^n) = m^{(k)}(X_n \in F|X_{n-k}^k) = m(X_n \in F|X_{n-k}^k)$$

and the same $k$th order distributions $m^{(k)}(X^k \in F) = m(X^k \in F)$. Consider the probability

$$m\left(\frac{1}{n}\ln\frac{m^{(k)}(X^n)}{m(X^n)} \ge \epsilon\right) = m\left(\frac{m^{(k)}(X^n)}{m(X^n)} \ge e^{n\epsilon}\right) \le E_m\left(\frac{m^{(k)}(X^n)}{m(X^n)}\right)e^{-n\epsilon}.$$

The expectation is evaluated as

$$\sum_{x^n:\,m(x^n)>0} m(x^n)\,\frac{m^{(k)}(x^n)}{m(x^n)} \le 1,$$

and hence we again have, using Borel-Cantelli, that

$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{m^{(k)}(X^n)}{m(X^n)} \le 0.$$
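The $k$-step Markov approximation and the bound just derived can be illustrated numerically. The sketch below uses an assumed second-order binary source (the conditional table is an arbitrary example), builds the $k=1$ approximation $m^{(1)}$ from the stationary pair distribution, and checks that $n^{-1}\ln[m^{(1)}(x^n)/m(x^n)]$ stays non-positive up to sampling error along a long path:

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed second-order binary Markov source m:
# P1[a, b] = m(X_t = 1 | X_{t-2} = a, X_{t-1} = b)  (illustrative values).
P1 = np.array([[0.1, 0.6],
               [0.3, 0.9]])

# Stationary distribution of the pair process (X_{t-1}, X_t).
T = np.zeros((4, 4))  # state (a, b) -> (b, c)
for a in range(2):
    for b in range(2):
        for c in range(2):
            pc = P1[a, b] if c == 1 else 1 - P1[a, b]
            T[2 * a + b, 2 * b + c] = pc
evals, evecs = np.linalg.eig(T.T)
pair = np.real(evecs[:, np.argmax(np.real(evals))])
pair /= pair.sum()

# k = 1 approximation: m1(c | b) from the stationary pair law.
marg = np.array([pair[2 * b] + pair[2 * b + 1] for b in range(2)])
m1 = np.array([[pair[2 * b + c] / marg[b] for c in range(2)] for b in range(2)])

# Sample a long path from m and compare conditional log-likelihoods
# (the initial symbols are ignored; their effect is O(1/n)).
n = 100_000
x = [0, 0]
for t in range(2, n):
    x.append(int(rng.random() < P1[x[t - 2], x[t - 1]]))
x = np.array(x)

log_m = sum(np.log(P1[a, b] if c else 1 - P1[a, b])
            for a, b, c in zip(x[:-2], x[1:-1], x[2:]))
log_m1 = sum(np.log(m1[b, c] if c else 1 - m1[b, 1])
             for b, c in zip(x[1:-1], x[2:]))

ratio = (log_m1 - log_m) / n  # (1/n) ln [ m^(1)(x^n) / m(x^n) ]
print(f"(1/n) ln m1/m = {ratio:.4f}")
```

For a genuinely second-order source the ratio converges to minus the relative entropy rate between $m$ and $m^{(1)}$, a nonpositive number, consistent with the limsup bound above.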

We can apply the usual ergodic theorem to conclude that with probability 1 under $m$

$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{1}{m(X^n)} \le \lim_{n\to\infty}\frac{1}{n}\ln\frac{1}{m^{(k)}(X^n)} = E_{m_\psi}[-\ln m(X_k|X^k)].$$

Combining this result with (3.20) we have, using Lemma 2.4.3, that

$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{m_\psi(X^n)}{m(X^n)} \le -\bar H_{m_\psi}(X) - E_{m_\psi}[\ln m(X_k|X^k)] = \bar H_{m_\psi||m^{(k)}}(X). \qquad (3.24)$$

This bound holds for any integer $k$, and hence it must also be true that $m$-a.e.

$$\limsup_{n\to\infty}\frac{1}{n}\ln\frac{m_\psi(X^n)}{m(X^n)} \le \inf_k \bar H_{m_\psi||m^{(k)}}(X) \equiv \zeta. \qquad (3.25)$$