Information Theory / Gray R.M., Entropy and Information Theory, 1990, 284p.
Chapter 6
Information Rates II
6.1 Introduction
In this chapter we develop general definitions of information rate for processes with standard alphabets and we prove a mean ergodic theorem for information densities. The L1 results are extensions of the results of Moy [105] and Perez [123] for stationary processes, which in turn extended the Shannon-McMillan theorem from entropies of discrete alphabet processes to information densities. (See also Kieffer [85].) We also relate several different measures of information rate and consider the mutual information between a stationary process and its ergodic component function. In the next chapter we apply the results of Chapter 5 on divergence to the definitions of this chapter for limiting information and entropy rates to obtain a number of results describing the behavior of such rates. In Chapter 8 almost everywhere ergodic theorems for relative entropy and information densities are proved.
6.2 Information Rates for General Alphabets
Suppose that we are given a pair random process {X_n, Y_n} with distribution p. The most natural definition of the information rate between the two processes is the extension of the definition for the finite alphabet case:
Ī(X; Y) = lim sup_{n→∞} (1/n) I(X^n; Y^n).
This was the first general definition of information rate and it is due to Dobrushin [32]. While this definition has its uses, it also has its problems. Another definition is more in the spirit of the definition of information itself: we formed the general definitions by taking a supremum of the finite alphabet definitions over all finite alphabet codings or quantizers. The above definition takes the limit of such suprema. An alternative definition is to instead reverse the order
and take the supremum of the limit and hence the supremum of the information rate over all finite alphabet codings of the process. This will provide a definition of information rate similar to the definition of the entropy of a dynamical system. There is a question as to what kind of codings we permit, that is, do the quantizers quantize individual outputs or long sequences of outputs. We shall shortly see that it makes no difference. Suppose that we have a pair random process {X_n, Y_n} with standard alphabets A_X and A_Y and suppose that f : A_X^∞ → A_f and g : A_Y^∞ → A_g are stationary codings of the X and Y sequence spaces into a finite alphabet. We will call such finite alphabet stationary mappings sliding block codes or stationary digital codes. Let {f_n, g_n} be the induced output process, that is, if T denotes the shift (on any of the sequence spaces) then f_n(x, y) = f(T^n x) and g_n(x, y) = g(T^n y). Recall that f(T^n(x, y)) = f_n(x, y), that is, shifting the input n times results in the output being shifted n times.
Since the new process {f_n, g_n} has a finite alphabet, its mutual information rate is defined. We now define the information rate for general alphabets as follows:
I*(X; Y) = sup_{sliding block codes f,g} Ī(f; g)
         = sup_{sliding block codes f,g} lim sup_{n→∞} (1/n) I(f^n; g^n).
We now focus on AMS processes, in which case the information rates for finite alphabet processes (e.g., quantized processes) are given by the limit, that is,
I*(X; Y) = sup_{sliding block codes f,g} Ī(f; g)
         = sup_{sliding block codes f,g} lim_{n→∞} (1/n) I(f^n; g^n).
The following lemma shows that for AMS sources I* can also be evaluated by constraining the sliding block codes to be scalar quantizers.
Lemma 6.2.1: Given an AMS pair random process {X_n, Y_n} with standard alphabets,
I*(X; Y) = sup_{q,r} Ī(q(X); r(Y)) = sup_{q,r} lim sup_{n→∞} (1/n) I(q(X)^n; r(Y)^n),

where the supremum is over all quantizers q of A_X and r of A_Y and where q(X)^n = (q(X_0), ···, q(X_{n−1})).
Proof: Clearly the right hand side above is less than I* since a scalar quantizer is a special case of a stationary code. Conversely, suppose that f and g are sliding block codes such that Ī(f; g) ≥ I*(X; Y) − ε. Then from Corollary 4.3.1 there are quantizers q and r and codes f′ and g′ depending only on the quantized processes q(X_n) and r(Y_n) such that Ī(f′; g′) ≥ Ī(f; g) − ε. From Lemma 4.3.3, however, Ī(q(X); r(Y)) ≥ Ī(f′; g′) since f′ and g′ are stationary codings of the quantized processes. Thus Ī(q(X); r(Y)) ≥ I*(X; Y) − 2ε, which proves the lemma. □
Corollary 6.2.1:
I*(X; Y) ≤ Ī(X; Y).
If the alphabets are finite, then the two rates are equal.
Proof: The inequality follows from the lemma and the fact that

I(X^n; Y^n) ≥ I(q(X)^n; r(Y)^n)

for any scalar quantizers q and r (where q(X)^n is (q(X_0), ···, q(X_{n−1}))). If the alphabets are finite, then the identity mappings are quantizers and yield I(X^n; Y^n) for all n. □
Pinsker [125] introduced the definition of information rate as a supremum over all scalar quantizers and hence we shall refer to this information rate as the Pinsker rate. The Pinsker definition has the advantage that we can use the known properties of information rates for finite alphabet processes to infer those for general processes, an attribute the first definition lacks.
Corollary 6.2.2: Given a standard alphabet pair process with alphabet A_X × A_Y there is a sequence of scalar quantizers q_m and r_m such that for any AMS pair process {X_n, Y_n} having this alphabet (that is, for any process distribution on the corresponding sequence space)
I(X^n; Y^n) = lim_{m→∞} I(q_m(X)^n; r_m(Y)^n)

I*(X; Y) = lim_{m→∞} Ī(q_m(X); r_m(Y)).
Furthermore, the above limits can be taken to be increasing by using finer and finer quantizers. Comment: It is important to note that the same sequence of quantizers gives both of the limiting results.
Proof: The first result is Lemma 5.5.5. The second follows from the previous lemma. □
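As a rough numerical illustration of the Pinsker rate (an assumed toy setup, not from the text): for an i.i.d. jointly Gaussian pair with correlation ρ, the mutual information rate is −½ ln(1 − ρ²) nats, and plug-in estimates of I(q(X); r(Y)) under a nested sequence of scalar quantizers increase toward it, as the corollary suggests.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8
true_rate = -0.5 * np.log(1 - rho**2)  # I(X; Y) in nats for jointly Gaussian X, Y

# A large i.i.d. sample from the standard bivariate normal with correlation rho.
n = 200_000
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

def quantized_mi(x, y, edges):
    """Plug-in estimate of I(q(X); r(Y)) for the scalar quantizer with given bin edges."""
    qx = np.digitize(x, edges)
    qy = np.digitize(y, edges)
    k = len(edges) + 1
    p_xy = np.zeros((k, k))
    np.add.at(p_xy, (qx, qy), 1.0)
    p_xy /= p_xy.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Nested partitions: each edge set refines the previous one, so by the data
# processing inequality the quantized informations are nondecreasing.
coarse = quantized_mi(x, y, np.array([0.0]))                       # 2 cells
medium = quantized_mi(x, y, np.array([-1.0, 0.0, 1.0]))            # 4 cells
fine = quantized_mi(x, y, np.arange(-2.0, 2.5, 0.5))               # 10 cells
```

The quantizer levels and sample size here are arbitrary choices; the point is only that coarse ≤ medium ≤ fine ≤ −½ ln(1 − ρ²), with the gap shrinking as the partitions refine.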
Observe that
I*(X; Y) = lim_{m→∞} lim sup_{n→∞} (1/n) I(q_m(X)^n; r_m(Y)^n)

whereas

Ī(X; Y) = lim sup_{n→∞} lim_{m→∞} (1/n) I(q_m(X)^n; r_m(Y)^n).
Thus the two notions of information rate are equal if the two limits can be interchanged. We shall later consider conditions under which this is true and we shall see that equality of these two rates is important for proving ergodic theorems for information densities.
Lemma 6.2.2: Suppose that {X_n, Y_n} is an AMS standard alphabet random process with distribution p and stationary mean p̄. Then

I*_p(X; Y) = I*_p̄(X; Y).
I*_p is an affine function of the distribution p. If p̄ has ergodic decomposition p̄_xy, then

I*_p(X; Y) = ∫ dp̄(x, y) I*_{p̄_xy}(X; Y).

If f and g are stationary codings of X and Y, then

I*_p(f; g) = ∫ dp̄(x, y) I*_{p̄_xy}(f; g).
Proof: For any scalar quantizers q and r of X and Y we have that Ī_p(q(X); r(Y)) = Ī_p̄(q(X); r(Y)). Taking a limit with ever finer quantizers yields the first equality. The fact that I* is affine follows similarly. Suppose that p̄ has ergodic decomposition p̄_xy. Define the induced distributions of the quantized process by m and m_xy, that is, m(F) = p̄(x, y : {q(x_i), r(y_i); i ∈ T} ∈ F) and similarly for m_xy. It is easy to show that m is stationary (since it is a stationary coding of a stationary process), that the m_xy are stationary ergodic (since they are stationary codings of stationary ergodic processes), and that the m_xy form an ergodic decomposition of m. If we let X′_n, Y′_n denote the coordinate functions on the quantized output sequence space (that is, the processes {q(X_n), r(Y_n)} and {X′_n, Y′_n} are equivalent), then using the ergodic decomposition of mutual information for finite alphabet processes (Lemma 4.3.1) we have that

Ī_p(q(X); r(Y)) = Ī_m(X′; Y′) = ∫ Ī_{m_{x′y′}}(X′; Y′) dm(x′, y′) = ∫ Ī_{p̄_xy}(q(X); r(Y)) dp̄(x, y).

Replacing the quantizers by the sequence q_m, r_m the result then follows by taking the limit using the monotone convergence theorem. The result for stationary codings follows similarly by applying the previous result to the induced distributions and then relating the equation to the original distributions. □
The above properties are not known to hold for Ī in the general case. Thus although Ī may appear to be a more natural definition of mutual information rate, I* is better behaved since it inherits properties from the discrete alphabet case. It will be of interest to find conditions under which the two rates are the same, since then Ī will share the properties possessed by I*. The first result of the next section adds to the interest by demonstrating that when the two rates are equal, a mean ergodic theorem holds for the information densities.
6.3 A Mean Ergodic Theorem for Densities
Theorem 6.3.1: Given an AMS pair process {X_n, Y_n} with standard alphabets, assume that for all n

P_{X^n} × P_{Y^n} >> P_{X^n Y^n}
and hence that the information densities

i_{X^n;Y^n} = ln [dP_{X^n Y^n} / d(P_{X^n} × P_{Y^n})]

are well defined. For simplicity we abbreviate i_{X^n;Y^n} to i_n when there is no possibility of confusion. If the limit
lim_{n→∞} (1/n) I(X^n; Y^n) = Ī(X; Y)

exists and

Ī(X; Y) = I*(X; Y) < ∞,

then n^{−1} i_n(X^n; Y^n) converges in L1 to an invariant function i(X; Y). If the stationary mean of the process has an ergodic decomposition p̄_xy, then the limiting density is I*_{p̄_xy}(X; Y), the information rate of the ergodic component in effect.
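Before turning to the proof, note that for a discrete pair the Radon-Nikodym derivative above reduces to a ratio of pmfs, so the single-letter information density is just a log-likelihood ratio whose expectation is the mutual information. A minimal sketch with illustrative numbers:

```python
import numpy as np

# Single-letter joint pmf on binary alphabets (numbers are illustrative).
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def info_density(i, j):
    """i_1(x, y) = ln[ p(x, y) / (p(x) p(y)) ] for one symbol pair, in nats."""
    return float(np.log(p_xy[i, j] / (p_x[i] * p_y[j])))

# The expectation of the information density is the mutual information I(X_0; Y_0).
mi = sum(p_xy[i, j] * info_density(i, j)
         for i in range(2) for j in range(2))
```

The density is positive on pairs that occur more often than independence would predict and negative on the others; the theorem concerns the sample averages (1/n) i_n of such quantities over n-blocks.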
Proof: Let q_m and r_m be asymptotically accurate quantizers for A_X and A_Y. Define the discrete approximations X̂_n = q_m(X_n) and Ŷ_n = r_m(Y_n). Observe that P_{X^n} × P_{Y^n} >> P_{X^n Y^n} implies that P_{X̂^n} × P_{Ŷ^n} >> P_{X̂^n Ŷ^n} and hence we can define the information densities of the quantized vectors by

î_n = ln [dP_{X̂^n Ŷ^n} / d(P_{X̂^n} × P_{Ŷ^n})].
For any m we have that

∫ |(1/n) i_n(x^n, y^n) − I*_{p̄_xy}(X; Y)| dp(x, y) ≤

∫ |(1/n) i_n(x^n, y^n) − (1/n) î_n(q_m(x)^n, r_m(y)^n)| dp(x, y) +

∫ |(1/n) î_n(q_m(x)^n, r_m(y)^n) − Ī_{p̄_xy}(q_m(X); r_m(Y))| dp(x, y) +

∫ |Ī_{p̄_xy}(q_m(X); r_m(Y)) − I*_{p̄_xy}(X; Y)| dp(x, y),    (6.1)

where

q_m(x)^n = (q_m(x_0), ···, q_m(x_{n−1})), r_m(y)^n = (r_m(y_0), ···, r_m(y_{n−1})),

and Ī_p(q_m(X); r_m(Y)) denotes the information rate of the process {q_m(X_n), r_m(Y_n); n = 0, 1, ···} when p is the process measure describing {X_n, Y_n}.
Consider first the right-most term of (6.1). Since I* is the supremum over all quantized versions,

∫ |Ī_{p̄_xy}(q_m(X); r_m(Y)) − I*_{p̄_xy}(X; Y)| dp(x, y)

= ∫ (I*_{p̄_xy}(X; Y) − Ī_{p̄_xy}(q_m(X); r_m(Y))) dp(x, y).

Using the ergodic decomposition of I* (Lemma 6.2.2) and that of Ī for discrete alphabet processes (Lemma 4.3.1) this becomes

∫ |Ī_{p̄_xy}(q_m(X); r_m(Y)) − I*_{p̄_xy}(X; Y)| dp(x, y) = I*_p(X; Y) − Ī_p(q_m(X); r_m(Y)).    (6.2)
For fixed m the middle term of (6.1) can be made arbitrarily small by taking n large enough from the finite alphabet result of Lemma 4.3.1. The first term on the right can be bounded above using Corollary 5.2.6 with F = σ(q(X)^n, r(Y)^n) by
(1/n) (I(X^n; Y^n) − I(X̂^n; Ŷ^n) + 2/e),

which as n → ∞ goes to Ī(X; Y) − Ī(q_m(X); r_m(Y)). Thus we have for any m that

lim sup_{n→∞} ∫ |(1/n) i_n(x^n, y^n) − I*_{p̄_xy}(X; Y)| dp(x, y)

≤ Ī(X; Y) − Ī(q_m(X); r_m(Y)) + I*(X; Y) − Ī(q_m(X); r_m(Y)),

which as m → ∞ becomes Ī(X; Y) − I*(X; Y), which is 0 by assumption. □
6.4 Information Rates of Stationary Processes
In this section we introduce two more definitions of information rates for the case of stationary two-sided processes. These rates are useful tools in relating the Dobrushin and Pinsker rates and they provide additional interpretations of mutual information rates in terms of ordinary mutual information. The definitions follow Pinsker [125].
Henceforth assume that {X_n, Y_n} is a stationary two-sided pair process with standard alphabets. Define the sequences y = {y_i; i ∈ T} and Y = {Y_i; i ∈ T}.
First define
Ĩ(X; Y) = lim sup_{n→∞} (1/n) I(X^n; Y),
that is, consider the per-letter limiting information between n-tuples of X and the entire sequence from Y. Next define
I⁻(X; Y) = I(X_0; Y | X_{−1}, X_{−2}, ···),
that is, the average conditional mutual information between one letter from X and the entire Y sequence given the infinite past of the X process. We could define the first rate for one-sided processes, but the second makes sense only when we can consider an infinite past. For brevity we write X⁻ = (X_{−1}, X_{−2}, ···) and hence

I⁻(X; Y) = I(X_0; Y | X⁻).
Theorem 6.4.1:

Ĩ(X; Y) ≥ Ī(X; Y) ≥ I*(X; Y) ≥ I⁻(X; Y).

If the alphabet of X is finite, then the above rates are all equal.
Comment: We will later see more general sufficient conditions for the equality of the various rates, but the case where one alphabet is finite is simple and important and points out that the rates are all equal in the finite alphabet case.
Proof: We have already proved the middle inequality. The left inequality follows immediately from the fact that I(X^n; Y) ≥ I(X^n; Y^n) for all n. The remaining inequality is more involved. We prove it in two steps. First we prove the second half of the theorem, that the rates are the same if X has a finite alphabet. We then couple this with an approximation argument to prove the remaining inequality. Suppose now that the alphabet of X is finite. Using the chain rule and stationarity we have that
(1/n) I(X^n; Y^n) = (1/n) Σ_{i=0}^{n−1} I(X_i; Y^n | X_0, ···, X_{i−1})

                  = (1/n) Σ_{i=0}^{n−1} I(X_0; Y^n_{−i} | X_{−1}, ···, X_{−i}),
where Y^n_{−i} is (Y_{−i}, ···, Y_{−i+n−1}), that is, the n-vector starting at −i. Since X has a finite alphabet, each term in the sum is bounded. We can show as in Section 5.5 (or using Kolmogorov's formula and Lemma 5.5.1) that each term converges as i → ∞, n → ∞, and n − i → ∞ to I(X_0; Y | X_{−1}, X_{−2}, ···) = I⁻(X; Y). These facts, however, imply that the above Cesàro average converges to the same limit and hence Ī = I⁻. We can similarly expand Ĩ as
(1/n) Σ_{i=0}^{n−1} I(X_i; Y | X_0, ···, X_{i−1}) = (1/n) Σ_{i=0}^{n−1} I(X_0; Y | X_{−1}, ···, X_{−i}),
which converges to the same limit for the same reasons. Thus Ĩ = Ī = I⁻ for stationary processes when the alphabet of X is finite. Now suppose that X has a standard alphabet and let q_m be an asymptotically accurate sequence of quantizers. Recall that the corresponding partitions are increasing, that is, each refines the previous partition. Fix ε > 0 and choose m large enough so that the quantizer α(X_0) = q_m(X_0) satisfies

I(α(X_0); Y | X⁻) ≥ I(X_0; Y | X⁻) − ε.

Observe that so far we have only quantized X_0 and not the past. Since

F_m = σ(α(X_0), Y, q_m(X_{−i}); i = 1, 2, ···)

asymptotically generates

σ(α(X_0), Y, X_{−i}; i = 1, 2, ···),
given ε we can choose for m large enough (larger than before) a quantizer β(x) = q_m(x) such that if we define β(X⁻) to be (β(X_{−1}), β(X_{−2}), ···), then

|I(α(X_0); (Y, β(X⁻))) − I(α(X_0); (Y, X⁻))| ≤ ε

and

|I(α(X_0); β(X⁻)) − I(α(X_0); X⁻)| ≤ ε.

Using Kolmogorov's formula this implies that

|I(α(X_0); Y | X⁻) − I(α(X_0); Y | β(X⁻))| ≤ 2ε

and hence that

I(α(X_0); Y | β(X⁻)) ≥ I(α(X_0); Y | X⁻) − 2ε ≥ I(X_0; Y | X⁻) − 3ε.

But the partition corresponding to β refines that of α and hence increases the information; hence

I(β(X_0); Y | β(X⁻)) ≥ I(α(X_0); Y | β(X⁻)) ≥ I(X_0; Y | X⁻) − 3ε.

Since β(X_n) has a finite alphabet, however, from the finite alphabet result the left-most term above must be Ī(β(X); Y), which can be made arbitrarily close to I*(X; Y). Since ε is arbitrary, this proves the final inequality. □
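The chain-rule expansion used at the start of the proof can be checked numerically on a toy two-letter block, since I(X²; Y²) = I(X_0; Y²) + I(X_1; Y² | X_0), where both sides are combinations of joint entropies. The joint pmf below is randomly generated, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
# A generic joint pmf p(x0, x1, y0, y1) on binary alphabets.
p = rng.random((2, 2, 2, 2))
p /= p.sum()

def ent(axes):
    """Entropy (nats) of the marginal of p on the given axis indices."""
    drop = tuple(i for i in range(p.ndim) if i not in axes)
    q = p.sum(axis=drop) if drop else p
    q = q[q > 0]
    return float(-(q * np.log(q)).sum())

# Axis key: 0 -> X0, 1 -> X1, 2 -> Y0, 3 -> Y1.
I_blocks = ent((0, 1)) + ent((2, 3)) - ent((0, 1, 2, 3))              # I(X^2; Y^2)
term0 = ent((0,)) + ent((2, 3)) - ent((0, 2, 3))                      # I(X0; Y^2)
term1 = ent((0, 1)) + ent((0, 2, 3)) - ent((0, 1, 2, 3)) - ent((0,))  # I(X1; Y^2 | X0)
```

Writing each mutual information as H(A) + H(B) − H(A, B) (and the conditional term via the corresponding four entropies) makes the chain rule an algebraic identity, which the assertion below confirms to floating-point precision.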
The following two theorems provide sufficient conditions for equality of the various information rates. The first result is almost a special case of the second, but it is handled separately as it is simpler, much of the proof applies to the second case, and it is not an exact special case of the subsequent result since it does not require the second condition of that result. The result corresponds to condition (7.4.33) of Pinsker [125], who also provides more general conditions. The more general condition is also due to Pinsker and strongly resembles that considered by Barron [9].
Theorem 6.4.2: Given a stationary pair process {X_n, Y_n} with standard alphabets, if

I(X_0; (X_{−1}, X_{−2}, ···)) < ∞,
then

Ĩ(X; Y) = Ī(X; Y) = I*(X; Y) = I⁻(X; Y).    (6.3)
Proof: We have that

(1/n) I(X^n; Y) ≤ (1/n) I(X^n; (Y, X⁻)) = (1/n) I(X^n; X⁻) + (1/n) I(X^n; Y | X⁻),    (6.4)

where, as before, X⁻ = (X_{−1}, X_{−2}, ···). Consider the first term on the right. Using the chain rule for mutual information
(1/n) I(X^n; X⁻) = (1/n) Σ_{i=0}^{n−1} I(X_i; X⁻ | X^i)

                 = (1/n) Σ_{i=0}^{n−1} (I(X_i; (X^i, X⁻)) − I(X_i; X^i)).    (6.5)
Using stationarity we have that
(1/n) I(X^n; X⁻) = (1/n) Σ_{i=0}^{n−1} (I(X_0; X⁻) − I(X_0; (X_{−1}, ···, X_{−i}))).    (6.6)
The terms I(X_0; (X_{−1}, ···, X_{−i})) are converging to I(X_0; X⁻); hence the terms in the sum are converging to 0, i.e.,
lim_{i→∞} I(X_i; X⁻ | X^i) = 0.    (6.7)
The Cesàro mean of (6.5) is converging to the same thing and hence

(1/n) I(X^n; X⁻) → 0.    (6.8)
Next consider the term I(X^n; Y | X⁻). For any positive integers n, m we have
I(X^{n+m}; Y | X⁻) = I(X^n; Y | X⁻) + I(X_n^m; Y | X⁻, X^n),    (6.9)
where X_n^m = (X_n, ···, X_{n+m−1}). From stationarity, however, the rightmost term is just I(X^m; Y | X⁻) and hence
I(X^{m+n}; Y | X⁻) = I(X^n; Y | X⁻) + I(X^m; Y | X⁻).    (6.10)
This is just a linear functional equation of the form f(n + m) = f(n) + f(m), and the unique solution to such an equation is f(n) = n f(1) (by induction, since f(n) = f(n − 1) + f(1)), that is,
(1/n) I(X^n; Y | X⁻) = I(X_0; Y | X⁻) = I⁻(X; Y).    (6.11)
Taking the limit supremum in (6.4) we have shown that
Ĩ(X; Y) ≤ I⁻(X; Y),    (6.12)
which with Theorem 6.4.1 completes the proof. □
Intuitively, the theorem states that if one of the processes has finite average mutual information between one symbol and its infinite past, then the Dobrushin and Pinsker information rates yield the same value and hence there is an L1 ergodic theorem for the information density.
To generalize the theorem we introduce a condition that will often be useful when studying asymptotic properties of entropy and information. A stationary process {X_n} is said to have the finite-gap information property if there exists an integer K such that

I(X_K; X⁻ | X^K) < ∞,    (6.13)
where, as usual, X⁻ = (X_{−1}, X_{−2}, ···). When a process has this property for a specific K, we shall say that it has the K-gap information property. Observe that if a process possesses this property, then it follows from Lemma 5.5.4 that
I(X_K; (X_{−1}, ···, X_{−l}) | X^K) < ∞,  l = 1, 2, ···.    (6.14)
Since these informations are finite,

P^{(K)}_{X^n} >> P_{X^n},  n = 1, 2, ···,    (6.15)
(1/n) I(X_K^{n−K}; Y | X⁻, X^K) = (1/n) I(X^n; Y | X⁻)