
Chapter 6

Information Rates II

6.1 Introduction

In this chapter we develop general definitions of information rate for processes with standard alphabets and we prove a mean ergodic theorem for information densities. The $L^1$ results are extensions of the results of Moy [105] and Perez [123] for stationary processes, which in turn extended the Shannon-McMillan theorem from entropies of discrete alphabet processes to information densities. (See also Kieffer [85].) We also relate several different measures of information rate and consider the mutual information between a stationary process and its ergodic component function. In the next chapter we apply the results of Chapter 5 on divergence to the definitions of this chapter for limiting information and entropy rates to obtain a number of results describing the behavior of such rates. In Chapter 8 almost everywhere ergodic theorems for relative entropy and information densities are proved.

6.2 Information Rates for General Alphabets

Suppose that we are given a pair random process $\{X_n, Y_n\}$ with distribution $p$. The most natural definition of the information rate between the two processes is the extension of the definition for the finite alphabet case:

$$\bar I(X;Y) = \limsup_{n\to\infty} \frac{1}{n} I(X^n; Y^n).$$

This was the first general definition of information rate and it is due to Dobrushin [32]. While this definition has its uses, it also has its problems. Another definition is more in the spirit of the definition of information itself: we formed the general definitions by taking a supremum of the finite alphabet definitions over all finite alphabet codings or quantizers. The above definition takes the limit of such suprema. An alternative definition is to instead reverse the order and take the supremum of the limit and hence the supremum of the information rate over all finite alphabet codings of the process. This will provide a definition of information rate similar to the definition of the entropy of a dynamical system. There is a question as to what kind of codings we permit, that is, do the quantizers quantize individual outputs or long sequences of outputs. We shall shortly see that it makes no difference. Suppose that we have a pair random process $\{X_n, Y_n\}$ with standard alphabets $A_X$ and $A_Y$ and suppose that $f: A_X^\infty \to A_f$ and $g: A_Y^\infty \to A_g$ are stationary codings of the $X$ and $Y$ sequence spaces into a finite alphabet. We will call such finite alphabet stationary mappings sliding block codes or stationary digital codes. Let $\{f_n, g_n\}$ be the induced output process, that is, if $T$ denotes the shift (on any of the sequence spaces) then $f_n(x, y) = f(T^n x)$ and $g_n(x, y) = g(T^n y)$. Recall that $f(T^n(x, y)) = f_n(x, y)$, that is, shifting the input $n$ times results in the output being shifted $n$ times.
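To make the coding operation concrete, the following is a minimal Python sketch (not from the text) of a sliding block code with a finite one-sided window: each output symbol depends only on a short stretch of the input starting at the current time, so shifting the input shifts the output. The window length and the thresholding rule are arbitrary choices made only for illustration; the codes in the text are allowed to depend on the entire sequence.

```python
import numpy as np

def sliding_block_code(x, window=2):
    """Finite-alphabet sliding block code: output n depends only on
    (x_n, ..., x_{n+window-1}), so the coding commutes with the shift.
    The rule used (sign of a local average) is an arbitrary example."""
    x = np.asarray(x, dtype=float)
    n_out = len(x) - window + 1
    out = [1 if x[i:i + window].mean() > 0 else 0 for i in range(n_out)]
    return np.array(out, dtype=int)

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

f = sliding_block_code(x)
f_of_shifted = sliding_block_code(x[3:])       # code applied to the shifted input T^3 x
# stationarity of the code: f_n(T^3 x) = f_{n+3}(x)
assert np.array_equal(f_of_shifted, f[3:3 + len(f_of_shifted)])
```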

Since the new process $\{f_n, g_n\}$ has a finite alphabet, its mutual information rate is defined. We now define the information rate for general alphabets as follows:

$$I(X;Y) = \sup_{\text{sliding block codes } f,g} \bar I(f;g) = \sup_{\text{sliding block codes } f,g} \limsup_{n\to\infty} \frac{1}{n} I(f^n; g^n).$$

We now focus on AMS processes, in which case the information rate for finite alphabet processes (e.g., quantized processes) is given by a limit, that is,

$$I(X;Y) = \sup_{\text{sliding block codes } f,g} \bar I(f;g) = \sup_{\text{sliding block codes } f,g} \lim_{n\to\infty} \frac{1}{n} I(f^n; g^n).$$

The following lemma shows that for AMS sources $I$ can also be evaluated by constraining the sliding block codes to be scalar quantizers.

Lemma 6.2.1: Given an AMS pair random process $\{X_n, Y_n\}$ with standard alphabets,
$$I(X;Y) = \sup_{q,r} \bar I(q(X); r(Y)) = \sup_{q,r} \limsup_{n\to\infty} \frac{1}{n} I(q(X)^n; r(Y)^n),$$
where the supremum is over all quantizers $q$ of $A_X$ and $r$ of $A_Y$ and where $q(X)^n = q(X_0), \cdots, q(X_{n-1})$.

Proof: Clearly the right-hand side above is less than $I$ since a scalar quantizer is a special case of a stationary code. Conversely, suppose that $f$ and $g$ are sliding block codes such that $\bar I(f; g) \geq I(X; Y) - \epsilon$. Then from Corollary 4.3.1 there are quantizers $q$ and $r$ and codes $f'$ and $g'$ depending only on the quantized processes $q(X_n)$ and $r(Y_n)$ such that $\bar I(f'; g') \geq \bar I(f; g) - \epsilon$. From Lemma 4.3.3, however, $\bar I(q(X); r(Y)) \geq \bar I(f'; g')$ since $f'$ and $g'$ are stationary codings of the quantized processes. Thus $\bar I(q(X); r(Y)) \geq I(X; Y) - 2\epsilon$, which proves the lemma. $\Box$

Corollary 6.2.1:
$$\bar I(X;Y) \geq I(X;Y).$$
If the alphabets are finite, then the two rates are equal.

Proof: The inequality follows from the lemma and the fact that
$$I(X^n; Y^n) \geq I(q(X)^n; r(Y)^n)$$
for any scalar quantizers $q$ and $r$ (where $q(X)^n$ is $q(X_0), \cdots, q(X_{n-1})$). If the alphabets are finite, then the identity mappings are quantizers and yield $I(X^n; Y^n)$ for all $n$. $\Box$
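As a hedged numerical illustration of this inequality (the example is not from the text): for an i.i.d. jointly Gaussian pair with correlation coefficient $\rho$ the per-symbol mutual information is $-\frac12\ln(1-\rho^2)$, while one-bit sign quantizers leave a binary pair whose agreement probability is given by Sheppard's formula, and the quantized information is strictly smaller. The value of $\rho$ and the choice of sign quantizers are assumptions made only for the example.

```python
import numpy as np

rho = 0.7
# per-symbol mutual information of the jointly Gaussian pair, in nats
I_full = -0.5 * np.log(1.0 - rho**2)

# one-bit quantizers q(x) = r(y) = 1{x > 0}; by Sheppard's formula the
# quantized symbols agree with probability 1/2 + arcsin(rho)/pi
p = 0.5 + np.arcsin(rho) / np.pi
I_quant = np.log(2.0) + p * np.log(p) + (1 - p) * np.log(1 - p)

# data processing: I(X^n;Y^n) >= I(q(X)^n; r(Y)^n), here per symbol
assert I_full >= I_quant
print(I_full, I_quant)   # roughly 0.337 vs 0.127 nats
```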

Pinsker [125] introduced the definition of information rate as a supremum over all scalar quantizers and hence we shall refer to this information rate as the Pinsker rate. The Pinsker definition has the advantage that we can use the known properties of information rates for finite alphabet processes to infer those for general processes, an attribute the first definition lacks.

Corollary 6.2.2: Given a standard pair process alphabet $A_X \times A_Y$, there is a sequence of scalar quantizers $q_m$ and $r_m$ such that for any AMS pair process $\{X_n, Y_n\}$ having this alphabet (that is, for any process distribution on the corresponding sequence space)
$$I(X^n; Y^n) = \lim_{m\to\infty} I(q_m(X)^n; r_m(Y)^n),$$
$$I(X;Y) = \lim_{m\to\infty} \bar I(q_m(X); r_m(Y)).$$
Furthermore, the above limits can be taken to be increasing by using finer and finer quantizers.

Comment: It is important to note that the same sequence of quantizers gives both of the limiting results.

Proof: The first result is Lemma 5.5.5. The second follows from the previous lemma. $\Box$
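A hedged numerical sketch of the corollary (not from the text): nested uniform quantizers $q_m$ with $2^m$ cells over a fixed interval refine one another, and for an i.i.d. jointly Gaussian pair the quantized informations increase toward the true value $-\frac12\ln(1-\rho^2)$. The plug-in Monte Carlo estimator, the clipping interval $[-4,4]$, and the value of $\rho$ are assumptions made only for this illustration.

```python
import numpy as np

def plug_in_mi(a, b, na, nb):
    """Plug-in estimate (nats) of I(A;B) from integer-labeled samples."""
    joint = np.zeros((na, nb))
    np.add.at(joint, (a, b), 1.0)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))

def quantize(x, m, lo=-4.0, hi=4.0):
    """Uniform quantizer with 2^m cells on [lo, hi]; level m+1 refines level m."""
    cells = 2 ** m
    idx = np.floor((x - lo) / (hi - lo) * cells).astype(int)
    return np.clip(idx, 0, cells - 1)

rho = 0.7
rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

print("target", -0.5 * np.log(1.0 - rho**2))
for m in range(1, 6):
    qx, ry = quantize(x, m), quantize(y, m)
    print(m, plug_in_mi(qx, ry, 2**m, 2**m))   # increases toward the target
```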

Observe that
$$I(X;Y) = \lim_{m\to\infty} \limsup_{n\to\infty} \frac{1}{n} I(q_m(X)^n; r_m(Y)^n),$$
whereas
$$\bar I(X;Y) = \limsup_{n\to\infty} \lim_{m\to\infty} \frac{1}{n} I(q_m(X)^n; r_m(Y)^n).$$
Thus the two notions of information rate are equal if the two limits can be interchanged. We shall later consider conditions under which this is true and we shall see that equality of these two rates is important for proving ergodic theorems for information densities.

Lemma 6.2.2: Suppose that $\{X_n, Y_n\}$ is an AMS standard alphabet random process with distribution $p$ and stationary mean $\bar p$. Then
$$I_p(X;Y) = I_{\bar p}(X;Y).$$


$I_p$ is an affine function of the distribution $p$. If $\bar p$ has ergodic decomposition $p_{xy}$, then
$$I_p(X;Y) = \int d\bar p(x,y)\, I_{p_{xy}}(X;Y).$$
If $f$ and $g$ are stationary codings of $X$ and $Y$, then
$$I_p(f;g) = \int d\bar p(x,y)\, I_{p_{xy}}(f;g).$$

Proof: For any scalar quantizers $q$ and $r$ of $X$ and $Y$ we have that $\bar I_p(q(X); r(Y)) = \bar I_{\bar p}(q(X); r(Y))$. Taking a limit with ever finer quantizers yields the first equality. The fact that $I$ is affine follows similarly. Suppose that $\bar p$ has ergodic decomposition $p_{xy}$. Define the induced distributions of the quantized process by $m$ and $m_{xy}$, that is, $m(F) = \bar p(x,y : \{q(x_i), r(y_i);\ i \in T\} \in F)$ and similarly for $m_{xy}$. It is easy to show that $m$ is stationary (since it is a stationary coding of a stationary process), that the $m_{xy}$ are stationary ergodic (since they are stationary codings of stationary ergodic processes), and that the $m_{xy}$ form an ergodic decomposition of $m$. If we let $X_n', Y_n'$ denote the coordinate functions on the quantized output sequence space (that is, the processes $\{q(X_n), r(Y_n)\}$ and $\{X_n', Y_n'\}$ are equivalent), then using the ergodic decomposition of mutual information for finite alphabet processes (Lemma 4.3.1) we have that
$$\bar I_{\bar p}(q(X); r(Y)) = \bar I_m(X'; Y') = \int \bar I_{m_{x'y'}}(X'; Y')\, dm(x', y') = \int \bar I_{p_{xy}}(q(X); r(Y))\, d\bar p(x,y).$$
Replacing the quantizers by the sequences $q_m$, $r_m$, the result then follows by taking the limit using the monotone convergence theorem. The result for stationary codings follows similarly by applying the previous result to the induced distributions and then relating the equation to the original distributions. $\Box$

The above properties are not known to hold for $\bar I$ in the general case. Thus although $\bar I$ may appear to be a more natural definition of mutual information rate, $I$ is better behaved since it inherits properties from the discrete alphabet case. It will be of interest to find conditions under which the two rates are the same, since then $\bar I$ will share the properties possessed by $I$. The first result of the next section adds to the interest by demonstrating that when the two rates are equal, a mean ergodic theorem holds for the information densities.

6.3 A Mean Ergodic Theorem for Densities

Theorem 6.3.1: Given an AMS pair process $\{X_n, Y_n\}$ with standard alphabets, assume that for all $n$
$$P_{X^n} \times P_{Y^n} \gg P_{X^n Y^n}$$
and hence that the information densities
$$i_{X^n;Y^n} = \ln \frac{dP_{X^n Y^n}}{d(P_{X^n} \times P_{Y^n})}$$
are well defined. For simplicity we abbreviate $i_{X^n;Y^n}$ to $i_n$ when there is no possibility of confusion. If the limit

$$\lim_{n\to\infty} \frac{1}{n} I(X^n; Y^n) = \bar I(X;Y)$$
exists and
$$\bar I(X;Y) = I(X;Y) < \infty,$$
then $n^{-1} i_n(X^n, Y^n)$ converges in $L^1$ to an invariant function $i(X;Y)$. If the stationary mean of the process has an ergodic decomposition $p_{xy}$, then the limiting density is $I_{p_{xy}}(X;Y)$, the information rate of the ergodic component in effect.
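Before turning to the proof, here is a hedged illustration of the theorem in the simplest possible case (not from the text): for an i.i.d. jointly Gaussian pair process the $n$th order measures factor, so the information density is a sum of i.i.d. per-pair terms and the law of large numbers already gives the $L^1$ (indeed almost everywhere) convergence of $n^{-1} i_n$ to the rate; the process is ergodic, so the limit is the constant $\bar I(X;Y) = I(X;Y) = -\frac12\ln(1-\rho^2)$. The correlation value and sample size are arbitrary.

```python
import numpy as np

rho = 0.7
rate = -0.5 * np.log(1.0 - rho**2)   # information rate of this i.i.d. pair process (nats)

def per_pair_density(x, y, rho):
    """i_1(x,y) = ln[ f_XY(x,y) / (f_X(x) f_Y(y)) ] for standard bivariate normals."""
    return (-0.5 * np.log(1.0 - rho**2)
            - (x**2 - 2.0 * rho * x * y + y**2) / (2.0 * (1.0 - rho**2))
            + (x**2 + y**2) / 2.0)

rng = np.random.default_rng(1)
n = 100_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

# i_n(X^n, Y^n) is the sum of the per-pair densities because the measures factor
i_n = np.sum(per_pair_density(x, y, rho))
print(i_n / n, rate)                 # n^{-1} i_n is close to the rate for large n
```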

Proof: Let $q_m$ and $r_m$ be asymptotically accurate quantizers for $A_X$ and $A_Y$. Define the discrete approximations $\hat X_n = q_m(X_n)$ and $\hat Y_n = r_m(Y_n)$. Observe that $P_{X^n} \times P_{Y^n} \gg P_{X^n Y^n}$ implies that $P_{\hat X^n} \times P_{\hat Y^n} \gg P_{\hat X^n \hat Y^n}$ and hence we can define the information densities of the quantized vectors by
$$\hat i_n = \ln \frac{dP_{\hat X^n \hat Y^n}}{d(P_{\hat X^n} \times P_{\hat Y^n})}.$$

For any $m$ we have that
$$\int \Bigl| \frac{1}{n} i_n(x^n, y^n) - I_{p_{xy}}(X;Y) \Bigr|\, dp(x,y) \leq \int \Bigl| \frac{1}{n} i_n(x^n, y^n) - \frac{1}{n} \hat i_n(q_m(x)^n, r_m(y)^n) \Bigr|\, dp(x,y)$$
$$+ \int \Bigl| \frac{1}{n} \hat i_n(q_m(x)^n, r_m(y)^n) - I_{p_{xy}}(q_m(X); r_m(Y)) \Bigr|\, dp(x,y) + \int \Bigl| I_{p_{xy}}(q_m(X); r_m(Y)) - I_{p_{xy}}(X;Y) \Bigr|\, dp(x,y), \qquad (6.1)$$

where
$$q_m(x)^n = (q_m(x_0), \cdots, q_m(x_{n-1})), \qquad r_m(y)^n = (r_m(y_0), \cdots, r_m(y_{n-1})),$$
and $I_p(q_m(X); r_m(Y))$ denotes the information rate of the process $\{q_m(X_n), r_m(Y_n);\ n = 0, 1, \cdots\}$ when $p$ is the process measure describing $\{X_n, Y_n\}$.

Consider first the right-most term of (6.1). Since $I$ is the supremum over all quantized versions,
$$\int \Bigl| I_{p_{xy}}(q_m(X); r_m(Y)) - I_{p_{xy}}(X;Y) \Bigr|\, dp(x,y) = \int \Bigl( I_{p_{xy}}(X;Y) - I_{p_{xy}}(q_m(X); r_m(Y)) \Bigr)\, dp(x,y).$$
Using the ergodic decomposition of $I$ (Lemma 6.2.2) and that of $\bar I$ for discrete alphabet processes (Lemma 4.3.1) this becomes
$$\int \Bigl| I_{p_{xy}}(q_m(X); r_m(Y)) - I_{p_{xy}}(X;Y) \Bigr|\, dp(x,y) = I_p(X;Y) - I_p(q_m(X); r_m(Y)). \qquad (6.2)$$

For fixed $m$ the middle term of (6.1) can be made arbitrarily small by taking $n$ large enough from the finite alphabet result of Lemma 4.3.1. The first term on the right can be bounded above using Corollary 5.2.6 with $\mathcal{F} = \sigma(q(X)^n, r(Y)^n)$ by
$$\frac{1}{n}\left( I(X^n; Y^n) - I(\hat X^n; \hat Y^n) + \frac{2}{e} \right),$$

which as $n \to \infty$ goes to $\bar I(X;Y) - \bar I(q_m(X); r_m(Y))$. Thus we have for any $m$ that
$$\limsup_{n\to\infty} \int \Bigl| \frac{1}{n} i_n(x^n, y^n) - I_{p_{xy}}(X;Y) \Bigr|\, dp(x,y) \leq \bar I(X;Y) - \bar I(q_m(X); r_m(Y)) + I(X;Y) - I(q_m(X); r_m(Y)),$$
which as $m \to \infty$ becomes $\bar I(X;Y) - I(X;Y)$, which is 0 by assumption. $\Box$

6.4 Information Rates of Stationary Processes

In this section we introduce two more definitions of information rates for the case of stationary two-sided processes. These rates are useful tools in relating the Dobrushin and Pinsker rates and they provide additional interpretations of mutual information rates in terms of ordinary mutual information. The definitions follow Pinsker [125].

Henceforth assume that $\{X_n, Y_n\}$ is a stationary two-sided pair process with standard alphabets. Define the sequences $y = \{y_i;\ i \in T\}$ and $Y = \{Y_i;\ i \in T\}$.

First define
$$\tilde I(X;Y) = \limsup_{n\to\infty} \frac{1}{n} I(X^n; Y),$$
that is, consider the per-letter limiting information between $n$-tuples of $X$ and the entire sequence from $Y$. Next define
$$I^-(X;Y) = I(X_0; Y \mid X_{-1}, X_{-2}, \cdots),$$
that is, the average conditional mutual information between one letter from $X$ and the entire $Y$ sequence given the infinite past of the $X$ process. We could define the first rate for one-sided processes, but the second makes sense only when we can consider an infinite past. For brevity we write $X^- = X_{-1}, X_{-2}, \cdots$ and hence
$$I^-(X;Y) = I(X_0; Y \mid X^-).$$
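As a simple worked case (this example is not in the text, but follows directly from the definitions): suppose the pairs $(X_n, Y_n)$ are i.i.d. Then $I(X^n; Y) = n\, I(X_0; Y_0)$, since the block $(X^n, Y^n)$ is independent of the remaining $Y$ coordinates and the pairs within the block are independent of one another. Likewise $I(X_0; X^-) = 0$ and $I(X_0; (Y, X^-)) = I(X_0; Y_0)$, so Kolmogorov's formula gives $I^-(X;Y) = I(X_0; Y \mid X^-) = I(X_0; Y_0)$. Hence
$$\tilde I(X;Y) = I(X_0; Y_0) = I^-(X;Y),$$
and by the theorem that follows all of the rates coincide with the single-letter information $I(X_0; Y_0)$ for such a process.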


Theorem 6.4.1:
$$\tilde I(X;Y) \geq \bar I(X;Y) \geq I(X;Y) \geq I^-(X;Y).$$
If the alphabet of $X$ is finite, then the above rates are all equal.

Comment: We will later see more general sufficient conditions for the equality of the various rates, but the case where one alphabet is finite is simple and important and points out that the rates are all equal in the finite alphabet case.

Proof: We have already proved the middle inequality. The left inequality follows immediately from the fact that $I(X^n; Y) \geq I(X^n; Y^n)$ for all $n$. The remaining inequality is more involved. We prove it in two steps. First we prove the second half of the theorem, that the rates are the same if $X$ has a finite alphabet. We then couple this with an approximation argument to prove the remaining inequality. Suppose now that the alphabet of $X$ is finite. Using the chain rule and stationarity we have that
$$\frac{1}{n} I(X^n; Y^n) = \frac{1}{n} \sum_{i=0}^{n-1} I(X_i; Y^n \mid X_0, \cdots, X_{i-1}) = \frac{1}{n} \sum_{i=0}^{n-1} I(X_0; Y_{-i}^n \mid X_{-1}, \cdots, X_{-i}),$$
where $Y_{-i}^n$ is $Y_{-i}, \cdots, Y_{-i+n-1}$, that is, the $n$-vector starting at $-i$. Since $X$ has a finite alphabet, each term in the sum is bounded. We can show as in Section 5.5 (or using Kolmogorov's formula and Lemma 5.5.1) that each term converges as $i \to \infty$, $n \to \infty$, and $n - i \to \infty$ to $I(X_0; Y \mid X_{-1}, X_{-2}, \cdots)$ or $I^-(X;Y)$. These facts, however, imply that the above Cesàro average converges to the same limit and hence $\bar I = I^-$. We can similarly expand $\tilde I$ as

$$\frac{1}{n} \sum_{i=0}^{n-1} I(X_i; Y \mid X_0, \cdots, X_{i-1}) = \frac{1}{n} \sum_{i=0}^{n-1} I(X_0; Y \mid X_{-1}, \cdots, X_{-i}),$$
which converges to the same limit for the same reasons. Thus $\tilde I = \bar I = I^-$ for

stationary processes when the alphabet of $X$ is finite. Now suppose that $X$ has a standard alphabet and let $q_m$ be an asymptotically accurate sequence of quantizers. Recall that the corresponding partitions are increasing, that is, each refines the previous partition. Fix $\epsilon > 0$ and choose $m$ large enough so that the quantizer $\alpha(X_0) = q_m(X_0)$ satisfies
$$I(\alpha(X_0); Y \mid X^-) \geq I(X_0; Y \mid X^-) - \epsilon.$$
Observe that so far we have only quantized $X_0$ and not the past. Since
$$\mathcal{F}_m = \sigma(\alpha(X_0), Y, q_m(X_{-i});\ i = 1, 2, \cdots)$$
asymptotically generates
$$\sigma(\alpha(X_0), Y, X_{-i};\ i = 1, 2, \cdots),$$
we can choose for $m$ large enough (larger than before) a quantizer $\beta(x) = q_m(x)$ such that if we define $\beta(X^-)$ to be $\beta(X_{-1}), \beta(X_{-2}), \cdots$, then
$$|I(\alpha(X_0); (Y, \beta(X^-))) - I(\alpha(X_0); (Y, X^-))| \leq \epsilon$$
and
$$|I(\alpha(X_0); \beta(X^-)) - I(\alpha(X_0); X^-)| \leq \epsilon.$$
Using Kolmogorov's formula this implies that
$$|I(\alpha(X_0); Y \mid X^-) - I(\alpha(X_0); Y \mid \beta(X^-))| \leq 2\epsilon$$
and hence that
$$I(\alpha(X_0); Y \mid \beta(X^-)) \geq I(\alpha(X_0); Y \mid X^-) - 2\epsilon \geq I(X_0; Y \mid X^-) - 3\epsilon.$$
But the partition corresponding to $\beta$ refines that of $\alpha$ and hence increases the information; hence
$$I(\beta(X_0); Y \mid \beta(X^-)) \geq I(\alpha(X_0); Y \mid \beta(X^-)) \geq I(X_0; Y \mid X^-) - 3\epsilon.$$
Since $\beta(X_n)$ has a finite alphabet, however, from the finite alphabet result the left-most term above must be $I^-(\beta(X); Y) = I(\beta(X); Y)$, which is no greater than $I(X;Y)$. Since $\epsilon$ is arbitrary, this proves the final inequality. $\Box$
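Since the proof invokes Kolmogorov's formula $I(X; (Y,Z)) = I(X;Z) + I(X;Y \mid Z)$, here is a small numerical sanity check of that identity (not from the text) on an arbitrary finite joint distribution; the alphabet sizes and the random pmf are assumptions made only for the check.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random((3, 3, 3))
p /= p.sum()                          # arbitrary joint pmf of (X, Y, Z) on {0,1,2}^3

def mi(pxy):
    """Mutual information (nats) of a two-dimensional joint pmf."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz])))

I_x_yz = mi(p.reshape(3, 9))          # I(X; (Y,Z)), treating (Y,Z) as one coordinate
I_x_z = mi(p.sum(axis=1))             # I(X; Z)
pz = p.sum(axis=(0, 1))
I_x_y_given_z = sum(pz[k] * mi(p[:, :, k] / pz[k]) for k in range(3))

# Kolmogorov's formula: I(X;(Y,Z)) = I(X;Z) + I(X;Y|Z)
assert abs(I_x_yz - (I_x_z + I_x_y_given_z)) < 1e-12
```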

The following two theorems provide sufficient conditions for equality of the various information rates. The first result is almost a special case of the second, but it is handled separately as it is simpler, much of the proof applies to the second case, and it is not an exact special case of the subsequent result since it does not require the second condition of that result. The result corresponds to condition (7.4.33) of Pinsker [125], who also provides more general conditions. The more general condition is also due to Pinsker and strongly resembles that considered by Barron [9].

Theorem 6.4.2: Given a stationary pair process $\{X_n, Y_n\}$ with standard alphabets, if
$$I(X_0; (X_{-1}, X_{-2}, \cdots)) < \infty,$$
then
$$\tilde I(X;Y) = \bar I(X;Y) = I(X;Y) = I^-(X;Y). \qquad (6.3)$$

Proof: We have that
$$\frac{1}{n} I(X^n; Y) \leq \frac{1}{n} I(X^n; (Y, X^-)) = \frac{1}{n} I(X^n; X^-) + \frac{1}{n} I(X^n; Y \mid X^-), \qquad (6.4)$$

where, as before, $X^- = \{X_{-1}, X_{-2}, \cdots\}$. Consider the first term on the right. Using the chain rule for mutual information,
$$\frac{1}{n} I(X^n; X^-) = \frac{1}{n} \sum_{i=0}^{n-1} I(X_i; X^- \mid X^i) = \frac{1}{n} \sum_{i=0}^{n-1} \Bigl( I(X_i; (X^i, X^-)) - I(X_i; X^i) \Bigr). \qquad (6.5)$$

Using stationarity we have that
$$\frac{1}{n} I(X^n; X^-) = \frac{1}{n} \sum_{i=0}^{n-1} \Bigl( I(X_0; X^-) - I(X_0; (X_{-1}, \cdots, X_{-i})) \Bigr). \qquad (6.6)$$

The terms $I(X_0; (X_{-1}, \cdots, X_{-i}))$ are converging to $I(X_0; X^-)$, hence the terms in the sum are converging to 0, i.e.,
$$\lim_{i\to\infty} I(X_i; X^- \mid X^i) = 0. \qquad (6.7)$$
The Cesàro mean of (6.5) is converging to the same thing and hence
$$\frac{1}{n} I(X^n; X^-) \to 0. \qquad (6.8)$$

Next consider the term $I(X^n; Y \mid X^-)$. For any positive integers $n, m$ we have
$$I(X^{n+m}; Y \mid X^-) = I(X^n; Y \mid X^-) + I(X_n^m; Y \mid X^-, X^n), \qquad (6.9)$$
where $X_n^m = X_n, \cdots, X_{n+m-1}$. From stationarity, however, the rightmost term is just $I(X^m; Y \mid X^-)$ and hence
$$I(X^{m+n}; Y \mid X^-) = I(X^n; Y \mid X^-) + I(X^m; Y \mid X^-). \qquad (6.10)$$

This is just a linear functional equation of the form $f(n+m) = f(n) + f(m)$ and the unique solution to such an equation is $f(n) = n f(1)$, that is,
$$\frac{1}{n} I(X^n; Y \mid X^-) = I(X_0; Y \mid X^-) = I^-(X;Y). \qquad (6.11)$$

 

Taking the limit supremum in (6.4) we have shown that
$$\tilde I(X;Y) \leq I^-(X;Y), \qquad (6.12)$$
which with Theorem 6.4.1 completes the proof. $\Box$

Intuitively, the theorem states that if one of the processes has finite average mutual information between one symbol and its infinite past, then the Dobrushin and Pinsker information rates yield the same value and hence there is an $L^1$ ergodic theorem for the information density.
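For a concrete, hedged check of the hypothesis (not from the text): a stationary Gaussian AR(1) process is Markov, so $I(X_0; X^-) = I(X_0; X_{-1}) = -\frac12\ln(1-\rho^2) < \infty$ and the theorem applies. The sketch below uses the Gaussian determinant formula for mutual information to verify that $I(X_0; (X_{-1}, \cdots, X_{-m}))$ is already constant in $m$; the value of $\rho$ is arbitrary.

```python
import numpy as np

rho = 0.8
target = -0.5 * np.log(1.0 - rho**2)   # I(X_0; X_{-1}) for the unit-variance AR(1)

def gaussian_mi(cov, k):
    """I(A;B) in nats for a zero-mean Gaussian vector with covariance `cov`,
    where A is the first k coordinates and B the rest."""
    _, logdet = np.linalg.slogdet(cov)
    _, logdet_a = np.linalg.slogdet(cov[:k, :k])
    _, logdet_b = np.linalg.slogdet(cov[k:, k:])
    return 0.5 * (logdet_a + logdet_b - logdet)

print("target", target)
for m in range(1, 6):
    # covariance of (X_0, X_{-1}, ..., X_{-m}): entries rho^{|i-j|}
    idx = np.arange(m + 1)
    cov = rho ** np.abs(idx[:, None] - idx[None, :])
    print(m, gaussian_mi(cov, 1))      # constant in m: the past beyond X_{-1} adds nothing
```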

To generalize the theorem we introduce a condition that will often be useful when studying asymptotic properties of entropy and information. A stationary process $\{X_n\}$ is said to have the finite-gap information property if there exists an integer $K$ such that
$$I(X_K; X^- \mid X^K) < \infty, \qquad (6.13)$$
where, as usual, $X^- = (X_{-1}, X_{-2}, \cdots)$. When a process has this property for a specific $K$, we shall say that it has the $K$-gap information property. Observe that if a process possesses this property, then it follows from Lemma 5.5.4 that
$$I(X_K; (X_{-1}, \cdots, X_{-l}) \mid X^K) < \infty, \quad l = 1, 2, \cdots. \qquad (6.14)$$
Since these informations are finite,
$$P_{X^n}^{(K)} \gg P_{X^n}, \quad n = 1, 2, \cdots, \qquad (6.15)$$
where $P_{X^n}^{(K)}$ is the $K$th order Markov approximation to $P_{X^n}$.

Theorem 6.4.3: Given a stationary standard alphabet pair process $\{X_n, Y_n\}$, if $\{X_n\}$ satisfies the finite-gap information property (6.13) and if, in addition,
$$I(X^K; Y) < \infty, \qquad (6.16)$$
then (6.3) holds. (If $K = 0$ then there is no conditioning and (6.16) is trivial, that is, the previous theorem is the special case with $K = 0$.)

Comment: This theorem shows that if there is any finite dimensional future vector $(X_K, X_{K+1}, \cdots, X_{K+l})$ which has finite mutual information with respect to the infinite past $X^-$ when conditioned on the intervening gap $(X_0, \cdots, X_{K-1})$, then the various definitions of mutual information are equivalent provided that the mutual information between the "gap" $X^K$ and the sequence $Y$ is finite. Note that this latter condition will hold if, for example, $\tilde I(X;Y)$ is finite.

Proof: For $n > K$,
$$\frac{1}{n} I(X^n; Y) = \frac{1}{n} I(X^K; Y) + \frac{1}{n} I(X_K^{n-K}; Y \mid X^K).$$
By assumption the first term on the right will tend to 0 as $n \to \infty$ and hence we focus on the second, which can be broken up analogous to the previous theorem with the addition of the conditioning:
$$\frac{1}{n} I(X_K^{n-K}; Y \mid X^K) \leq \frac{1}{n} I(X_K^{n-K}; (Y, X^-) \mid X^K) = \frac{1}{n} I(X_K^{n-K}; X^- \mid X^K) + \frac{1}{n} I(X_K^{n-K}; Y \mid X^-, X^K).$$
Consider first the term
$$\frac{1}{n} I(X_K^{n-K}; X^- \mid X^K) = \frac{1}{n} \sum_{i=K}^{n-1} I(X_i; X^- \mid X^i),$$
which is as (6.5) in the proof of Theorem 6.4.2 except that the first $K$ terms are missing. The same argument then shows that the limit of the sum is 0. The remaining term is
$$\frac{1}{n} I(X_K^{n-K}; Y \mid X^-, X^K) = \frac{1}{n} I(X^n; Y \mid X^-)$$