
2.5 Conditional Entropy and Information
coding. This property is the key to generalizing these quantities to nondiscrete alphabets.
We saw in Lemma 2.3.4 that entropy was a convex $\cap$ (i.e., concave) function of the underlying distribution. The following lemma provides similar properties of mutual information considered as a function of either a marginal or a conditional distribution.

Lemma 2.5.4: Let $\mu$ denote a pmf on a discrete space $A_X$, $\mu(x) = \Pr(X = x)$, and let $q$ be a conditional pmf, $q(y|x) = \Pr(Y = y|X = x)$. Let $\mu q$ denote the resulting joint pmf $\mu q(x,y) = \mu(x)q(y|x)$. Let $I_{\mu q} = I_{\mu q}(X;Y)$ be the average mutual information. Then $I_{\mu q}$ is a convex $\cup$ function of $q$; that is, given two conditional pmf's $q_1$ and $q_2$, $\lambda \in [0,1]$, and $\bar q = \lambda q_1 + (1-\lambda)q_2$, then
$$I_{\mu\bar q} \le \lambda I_{\mu q_1} + (1-\lambda) I_{\mu q_2},$$
and $I_{\mu q}$ is a convex $\cap$ function of $\mu$; that is, given two pmf's $\mu_1$ and $\mu_2$, $\lambda \in [0,1]$, and $\bar\mu = \lambda\mu_1 + (1-\lambda)\mu_2$, then
$$I_{\bar\mu q} \ge \lambda I_{\mu_1 q} + (1-\lambda) I_{\mu_2 q}.$$
Proof: Let $r$ (respectively $r_1$, $r_2$, $\bar r$) denote the pmf for $Y$ resulting from $q$ (respectively $q_1$, $q_2$, $\bar q$), that is, $r(y) = \Pr(Y = y) = \sum_x \mu(x)q(y|x)$. From (2.5)
$$I_{\mu\bar q} = \lambda\sum_{x,y}\mu(x)q_1(y|x)\log\left(\frac{\mu(x)q_1(y|x)}{\mu(x)r_1(y)}\,\frac{\mu(x)\bar q(y|x)}{\mu(x)\bar r(y)}\,\frac{\mu(x)r_1(y)}{\mu(x)q_1(y|x)}\right)
+ (1-\lambda)\sum_{x,y}\mu(x)q_2(y|x)\log\left(\frac{\mu(x)q_2(y|x)}{\mu(x)r_2(y)}\,\frac{\mu(x)\bar q(y|x)}{\mu(x)\bar r(y)}\,\frac{\mu(x)r_2(y)}{\mu(x)q_2(y|x)}\right)$$
$$\le \lambda I_{\mu q_1} + \lambda\sum_{x,y}\mu q_1(x,y)\left(\frac{\mu(x)\bar q(y|x)}{\mu(x)\bar r(y)}\,\frac{\mu(x)r_1(y)}{\mu(x)q_1(y|x)} - 1\right)
+ (1-\lambda)I_{\mu q_2} + (1-\lambda)\sum_{x,y}\mu q_2(x,y)\left(\frac{\mu(x)\bar q(y|x)}{\mu(x)\bar r(y)}\,\frac{\mu(x)r_2(y)}{\mu(x)q_2(y|x)} - 1\right)$$
$$= \lambda I_{\mu q_1} + (1-\lambda)I_{\mu q_2} + \lambda\left(-1 + \sum_{x,y}\mu(x)\bar q(y|x)\frac{r_1(y)}{\bar r(y)}\right) + (1-\lambda)\left(-1 + \sum_{x,y}\mu(x)\bar q(y|x)\frac{r_2(y)}{\bar r(y)}\right) = \lambda I_{\mu q_1} + (1-\lambda)I_{\mu q_2},$$
where the final equality uses $\sum_x\mu(x)\bar q(y|x) = \bar r(y)$, so that each of the two correction terms is $-1 + \sum_y r_i(y) = 0$.
Similarly, let $\bar\mu = \lambda\mu_1 + (1-\lambda)\mu_2$ and let $r_1$, $r_2$, and $\bar r$ denote the induced output pmf's. Then
$$I_{\bar\mu q} = \lambda\sum_{x,y}\mu_1(x)q(y|x)\log\left(\frac{q(y|x)}{r_1(y)}\,\frac{r_1(y)}{\bar r(y)}\right) + (1-\lambda)\sum_{x,y}\mu_2(x)q(y|x)\log\left(\frac{q(y|x)}{r_2(y)}\,\frac{r_2(y)}{\bar r(y)}\right)$$
$$= \lambda I_{\mu_1 q} + (1-\lambda)I_{\mu_2 q} - \lambda\sum_{x,y}\mu_1(x)q(y|x)\log\frac{\bar r(y)}{r_1(y)} - (1-\lambda)\sum_{x,y}\mu_2(x)q(y|x)\log\frac{\bar r(y)}{r_2(y)} \ge \lambda I_{\mu_1 q} + (1-\lambda)I_{\mu_2 q}$$
from another application of (2.5). $\Box$
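Both statements are easy to check numerically. A minimal Python sketch (the alphabet sizes and the randomly drawn pmf's are arbitrary choices, not from the text) evaluates $I_{\mu q}$ directly from its definition and confirms the two inequalities for several values of $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(mu, q):
    """I(X;Y) for an input pmf mu (shape [A]) and a channel q (shape [A, B], row x = q(.|x))."""
    joint = mu[:, None] * q                   # mu(x) q(y|x)
    r = joint.sum(axis=0)                     # output pmf r(y)
    mask = joint > 0
    return float((joint[mask] * np.log(joint[mask] / (mu[:, None] * r[None, :])[mask])).sum())

A, B = 3, 4
mu  = rng.dirichlet(np.ones(A))
mu1 = rng.dirichlet(np.ones(A))
mu2 = rng.dirichlet(np.ones(A))
q1  = rng.dirichlet(np.ones(B), size=A)
q2  = rng.dirichlet(np.ones(B), size=A)

for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    qbar  = lam * q1 + (1 - lam) * q2
    mubar = lam * mu1 + (1 - lam) * mu2
    # convex (cup) in q:  I_{mu qbar} <= lam I_{mu q1} + (1 - lam) I_{mu q2}
    assert mutual_information(mu, qbar) <= \
        lam * mutual_information(mu, q1) + (1 - lam) * mutual_information(mu, q2) + 1e-12
    # concave (cap) in mu: I_{mubar q} >= lam I_{mu1 q} + (1 - lam) I_{mu2 q}
    assert mutual_information(mubar, q1) >= \
        lam * mutual_information(mu1, q1) + (1 - lam) * mutual_information(mu2, q1) - 1e-12
print("Lemma 2.5.4 inequalities hold for the sampled example")
```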
We consider one other notion of information: Given three finite alphabet random variables $X$, $Y$, $Z$, define the conditional mutual information between $X$ and $Y$ given $Z$ by
$$I(X;Y|Z) = D(P_{XYZ}\,\|\,P_{X\times Y|Z}) \qquad (2.25)$$
where $P_{X\times Y|Z}$ is the distribution defined by its values on rectangles as
$$P_{X\times Y|Z}(F\times G\times D) = \sum_{z\in D} P(X\in F|Z=z)\,P(Y\in G|Z=z)\,P(Z=z). \qquad (2.26)$$
$P_{X\times Y|Z}$ has the same conditional distributions for $X$ given $Z$ and for $Y$ given $Z$ as does $P_{XYZ}$, but now $X$ and $Y$ are conditionally independent given $Z$. Alternatively, the conditional distribution for $X,Y$ given $Z$ under the distribution $P_{X\times Y|Z}$ is the product distribution $P_{X|Z}\times P_{Y|Z}$. Thus
$$I(X;Y|Z) = \sum_{x,y,z} p_{XYZ}(x,y,z)\ln\frac{p_{XYZ}(x,y,z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)\,p_Z(z)}
= \sum_{x,y,z} p_{XYZ}(x,y,z)\ln\frac{p_{XY|Z}(x,y|z)}{p_{X|Z}(x|z)\,p_{Y|Z}(y|z)}. \qquad (2.27)$$
Since
$$\frac{p_{XYZ}}{p_{X|Z}\,p_{Y|Z}\,p_Z} = \frac{p_{XYZ}}{p_X\,p_{YZ}}\,\frac{p_X}{p_{X|Z}} = \frac{p_{XYZ}}{p_{XZ}\,p_Y}\,\frac{p_Y}{p_{Y|Z}},$$
we have the first statement in the following lemma.
Lemma 2.5.4:
$$I(X;Y|Z) + I(Y;Z) = I(Y;(X,Z)), \qquad (2.28)$$
$$I(X;Y|Z) \ge 0, \qquad (2.29)$$
with equality if and only if $X$ and $Y$ are conditionally independent given $Z$, that is, $p_{XY|Z} = p_{X|Z}\,p_{Y|Z}$. Given finite valued measurements $f$ and $g$,
$$I(f(X);g(Y)|Z) \le I(X;Y|Z).$$
Proof: The second inequality follows from the divergence inequality (2.6) with $P = P_{XYZ}$ and $M = P_{X\times Y|Z}$, i.e., the pmf's $p_{XYZ}$ and $p_{X|Z}\,p_{Y|Z}\,p_Z$. The third inequality follows from Lemma 2.3.3 or its corollary applied to the same measures. $\Box$
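The definition is straightforward to exercise numerically. A minimal sketch (the alphabet sizes and the random joint pmf are arbitrary choices) builds $P_{X\times Y|Z}$ as in (2.26), computes $I(X;Y|Z)$ as the divergence in (2.25), and checks (2.28) and (2.29):

```python
import numpy as np

rng = np.random.default_rng(1)
A, B, C = 3, 3, 2                                    # alphabet sizes for X, Y, Z
p = rng.dirichlet(np.ones(A * B * C)).reshape(A, B, C)   # joint pmf p_XYZ(x, y, z)

def D(p_, m_):
    """Divergence D(p||m) = sum p ln(p/m) over a common finite alphabet."""
    mask = p_ > 0
    return float((p_[mask] * np.log(p_[mask] / m_[mask])).sum())

pZ  = p.sum(axis=(0, 1))                             # p_Z(z)
pXZ = p.sum(axis=1)                                  # p_XZ(x, z)
pYZ = p.sum(axis=0)                                  # p_YZ(y, z)
# (2.26): P_{X x Y|Z}(x, y, z) = p_{X|Z}(x|z) p_{Y|Z}(y|z) p_Z(z)
pXxY_Z = pXZ[:, None, :] * pYZ[None, :, :] / pZ[None, None, :]

I_XY_given_Z = D(p, pXxY_Z)                          # (2.25)

def mi(joint):
    """Ordinary mutual information between the two coordinates of a 2-D joint pmf."""
    return D(joint, np.outer(joint.sum(axis=1), joint.sum(axis=0)))

I_YZ   = mi(pYZ)                                     # I(Y;Z)
I_Y_XZ = mi(p.transpose(1, 0, 2).reshape(B, A * C))  # I(Y;(X,Z)), pairing X with Z

assert I_XY_given_Z >= 0.0                           # (2.29)
assert abs(I_XY_given_Z + I_YZ - I_Y_XZ) < 1e-10     # Kolmogorov's formula (2.28)
print(I_XY_given_Z, I_YZ, I_Y_XZ)
```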
Comments: Eq. (2.28) is called Kolmogorov’s formula. If X and Y are conditionally independent given Z in the above sense, then we also have that

$p_{X|YZ} = p_{XY|Z}/p_{Y|Z} = p_{X|Z}$, in which case we say that $Y \to Z \to X$ is a Markov chain and note that given $Z$, $X$ does not depend on $Y$. (Note that if $Y \to Z \to X$ is a Markov chain, then so is $X \to Z \to Y$.) Thus the conditional mutual information is 0 if and only if the variables form a Markov chain with the conditioning variable in the middle. One might be tempted to infer from Lemma 2.3.3 that given finite valued measurements $f$, $g$, and $r$
$$I(f(X);g(Y)|r(Z)) \le I(X;Y|Z). \qquad (?)$$
This does not follow, however, since it is not true that if $Q$ is the partition corresponding to the three quantizers, then $D(P_{f(X),g(Y),r(Z)}\,\|\,P_{f(X)\times g(Y)|r(Z)})$ is $H_{P_{X,Y,Z}\|P_{X\times Y|Z}}(f(X),g(Y),r(Z))$, because of the way that $P_{X\times Y|Z}$ is constructed; e.g., the fact that $X$ and $Y$ are conditionally independent given $Z$ implies that $f(X)$ and $g(Y)$ are conditionally independent given $Z$, but it does not imply that $f(X)$ and $g(Y)$ are conditionally independent given $r(Z)$. Alternatively, if $M$ is $P_{X\times Y|Z}$, then it is not true that $P_{f(X)\times g(Y)|r(Z)}$ equals $M(fgr)^{-1}$. Note that if this inequality were true, choosing $r(z)$ to be trivial (say 1 for all $z$) would result in $I(X;Y|Z) \ge I(X;Y|r(Z)) = I(X;Y)$. This cannot be true in general since, for example, choosing $Z$ as $(X,Y)$ would give $I(X;Y|Z) = 0$. Thus one must be careful when applying Lemma 2.3.3 if the measures and random variables are related as they are in the case of conditional mutual information.
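The last point is easy to see concretely: with $Z = (X,Y)$ and the trivial quantizer $r(z) = 1$, the questionable inequality would force $I(X;Y) \le I(X;Y|Z) = 0$ for a dependent pair. A minimal sketch (the dependent joint pmf is an arbitrary choice):

```python
import numpy as np

# A dependent pair (X, Y) on {0,1}^2; the joint pmf is an arbitrary choice.
pXY = np.array([[0.4, 0.1],
                [0.1, 0.4]])

# Take Z = (X, Y), coded as z = 2x + y, so p_XYZ(x, y, z) = pXY(x, y) * 1{z = 2x + y}.
p = np.zeros((2, 2, 4))
for x in range(2):
    for y in range(2):
        p[x, y, 2 * x + y] = pXY[x, y]

def D(p_, m_):
    mask = p_ > 0
    return float((p_[mask] * np.log(p_[mask] / m_[mask])).sum())

pZ, pXZ, pYZ = p.sum(axis=(0, 1)), p.sum(axis=1), p.sum(axis=0)
pXxY_Z = pXZ[:, None, :] * pYZ[None, :, :] / pZ      # product distribution as in (2.26)

I_XY_given_Z = D(p, pXxY_Z)        # essentially 0: X and Y are deterministic given Z
I_XY = D(pXY, np.outer(pXY.sum(axis=1), pXY.sum(axis=0)))   # I(X;Y) > 0

# The trivial quantizer r(z) = 1 removes the conditioning entirely, so
# I(X;Y|r(Z)) = I(X;Y) > 0 = I(X;Y|Z): the inequality marked (?) fails here
# with f and g taken to be identity maps.
print(I_XY_given_Z, I_XY)
```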
We close this section with an easy corollary of the previous lemma and of the definition of conditional entropy. Results of this type are referred to as chain rules for information and entropy.

Corollary 2.5.1: Given finite alphabet random variables $Y$, $X_1$, $X_2$, $\cdots$, $X_n$,
$$H(X_1,X_2,\cdots,X_n) = \sum_{i=1}^n H(X_i|X_1,\cdots,X_{i-1})$$
$$H_{p\|m}(X_1,X_2,\cdots,X_n) = \sum_{i=1}^n H_{p\|m}(X_i|X_1,\cdots,X_{i-1})$$
$$I(Y;(X_1,X_2,\cdots,X_n)) = \sum_{i=1}^n I(Y;X_i|X_1,\cdots,X_{i-1}).$$
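A minimal numerical check of the entropy chain rule, using the identity $H(U|V) = H(U,V) - H(V)$ for conditional entropy (the random three-variable pmf below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
A = 2
p = rng.dirichlet(np.ones(A ** 3)).reshape(A, A, A)   # joint pmf of (X1, X2, X3)

def H(pmf):
    """Entropy (nats) of a pmf given as an array summing to 1."""
    q = pmf[pmf > 0]
    return float(-(q * np.log(q)).sum())

# Left side of the chain rule.
H_joint = H(p)

# Right side: H(X1) + H(X2|X1) + H(X3|X1,X2), using H(U|V) = H(U,V) - H(V).
H1    = H(p.sum(axis=(1, 2)))
H2_1  = H(p.sum(axis=2)) - H1
H3_12 = H(p) - H(p.sum(axis=2))

assert abs(H_joint - (H1 + H2_1 + H3_12)) < 1e-12
print(H_joint, H1 + H2_1 + H3_12)
```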
2.6 Entropy Rate Revisited
The chain rule of Corollary 2.5.1 provides a means of computing entropy rates for stationary processes. We have that
$$\frac{1}{n}H(X^n) = \frac{1}{n}\sum_{i=0}^{n-1} H(X_i|X^i).$$

First suppose that the source is a stationary $k$th order Markov process, that is, for any $n > k$
$$\Pr(X_n = x_n|X_i = x_i,\ i = 0,1,\cdots,n-1) = \Pr(X_n = x_n|X_i = x_i,\ i = n-k,\cdots,n-1).$$
For such a process we have for all $n \ge k$ that
$$H(X_n|X^n) = H(X_n|X_{n-k}^k) = H(X_k|X^k),$$
where $X_i^m = (X_i,\cdots,X_{i+m-1})$. Thus taking the limit as $n \to \infty$ of the $n$th order entropy, all but a finite number of terms in the sum are identical, and hence the Cesàro (or arithmetic) mean converges to this conditional entropy. We have therefore proved the following lemma.
Lemma 2.6.1: If $\{X_n\}$ is a stationary $k$th order Markov source, then
$$\bar H(X) = H(X_k|X^k).$$
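For a first order ($k=1$) chain the lemma gives $\bar H(X) = H(X_1|X_0) = -\sum_x \pi(x)\sum_y P(x,y)\ln P(x,y)$, where $P$ is the transition matrix and $\pi$ its stationary pmf. A minimal sketch (the transition matrix is an arbitrary choice, not from the text) computes this rate and compares it with $n^{-1}H(X^n)$ for a moderate $n$:

```python
import numpy as np

# First order (k = 1) example: a stationary Markov chain with an arbitrary
# transition matrix P[x, y] = Pr(X_{n+1} = y | X_n = x).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary pmf: the left eigenvector of P for eigenvalue 1, normalized.
w, v = np.linalg.eig(P.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi = pi / pi.sum()

# Lemma 2.6.1 with k = 1: Hbar(X) = H(X_1|X_0) = -sum_x pi(x) sum_y P(x,y) ln P(x,y)
entropy_rate = float(-(pi[:, None] * P * np.log(P)).sum())

# Compare with (1/n) H(X^n) for a moderate n by building the n-fold joint pmf.
n = 8
joint = pi.copy()
for _ in range(n - 1):
    joint = (joint[..., None] * P).reshape(-1, 2)   # append one more coordinate
joint = joint.ravel()
H_n = float(-(joint * np.log(joint)).sum())
print(entropy_rate, H_n / n)    # (1/n) H(X^n) approaches the rate from above
```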
If we have a two-sided stationary process $\{X_n\}$, then all of the previous definitions for entropies of vectors extend in an obvious fashion and a generalization of the Markov result follows if we use stationarity and the chain rule to write
$$\frac{1}{n}H(X^n) = \frac{1}{n}\sum_{i=0}^{n-1} H(X_0|X_{-1},\cdots,X_{-i}).$$
Since conditional entropy is nonincreasing with more conditioning variables ((2.22) or Lemma 2.5.2), $H(X_0|X_{-1},\cdots,X_{-i})$ has a limit. Again using the fact that a Cesàro mean of terms all converging to a common limit also converges to the same limit, we have the following result.
Lemma 2.6.2: If $\{X_n\}$ is a two-sided stationary source, then
$$\bar H(X) = \lim_{n\to\infty} H(X_0|X_{-1},\cdots,X_{-n}).$$
It is tempting to identify the above limit as the conditional entropy given the infinite past, $H(X_0|X_{-1},\cdots)$. Since the conditioning variable is a sequence and does not have a finite alphabet, such a conditional entropy is not included in any of the definitions yet introduced. We shall later demonstrate that this interpretation is indeed valid when the notion of conditional entropy has been suitably generalized.
The natural generalization of Lemma 2.6.2 to relative entropy rates unfortunately does not work because conditional relative entropies are not in general monotonic with increased conditioning, and hence the chain rule does not immediately yield a limiting argument analogous to that for entropy. The argument does work if the reference measure is $k$th order Markov, as considered in the following lemma.
Lemma 2.6.3: If $\{X_n\}$ is a source described by process distributions $p$ and $m$, if $p$ is stationary, and if $m$ is $k$th order Markov with stationary transitions, then for $n \ge k$, $H_{p\|m}(X_0|X_{-1},\cdots,X_{-n})$ is nondecreasing in $n$ and
$$\bar H_{p\|m}(X) = \lim_{n\to\infty} H_{p\|m}(X_0|X_{-1},\cdots,X_{-n}) = -\bar H_p(X) - E_p[\ln m(X_k|X^k)].$$
Proof: For $n \ge k$ we have that
$$H_{p\|m}(X_0|X_{-1},\cdots,X_{-n}) = -H_p(X_0|X_{-1},\cdots,X_{-n}) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1})\ln m_{X_k|X^k}(x_k|x^k).$$
Since the conditional entropy is nonincreasing with $n$ and the remaining term does not depend on $n$, the combination is nondecreasing with $n$. The remainder of the proof then parallels the entropy rate result. $\Box$
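A minimal worked instance of the lemma with $k = 1$ (the i.i.d. source $p$ and the Markov reference $m$ below are arbitrary choices): the rate from the formula is compared against a brute-force evaluation of $n^{-1}D(P_n\|M_n)$, which differs from it only by an $O(1/n)$ boundary term.

```python
import numpy as np
from itertools import product

# p: i.i.d. equiprobable binary (stationary); m: stationary first order Markov.
p1 = np.array([0.5, 0.5])                      # marginal pmf of p
M = np.array([[0.8, 0.2],                      # m's transition matrix m(x1 | x0)
              [0.4, 0.6]])
w, v = np.linalg.eig(M.T)
m1 = np.real(v[:, np.argmin(np.abs(w - 1))])
m1 = m1 / m1.sum()                             # m's stationary marginal

Hbar_p = float(-(p1 * np.log(p1)).sum())       # entropy rate of the i.i.d. source p
E_ln_m_cond = float((np.outer(p1, p1) * np.log(M)).sum())   # E_p[ln m(X_1|X_0)]
rel_rate = -Hbar_p - E_ln_m_cond               # Lemma 2.6.3 with k = 1

# Brute-force (1/n) D(p_{X^n} || m_{X^n}) over all 2^n length-n sequences.
n = 10
Dn = 0.0
for x in product(range(2), repeat=n):
    px = float(np.prod(p1[list(x)]))
    mx = float(m1[x[0]] * np.prod([M[x[i - 1], x[i]] for i in range(1, n)]))
    Dn += px * np.log(px / mx)

print(rel_rate, Dn / n)    # agree up to an O(1/n) boundary term
```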
It is important to note that the relative entropy analogs to entropy properties often require kth order Markov assumptions on the reference measure (but not on the original measure).
Markov Approximations
Recall that the relative entropy rate $\bar H_{p\|m}(X)$ can be thought of as a distance between the process with distribution $p$ and that with distribution $m$, and that the rate is given by a limit if the reference measure $m$ is Markov. A particular Markov measure relevant to $p$ is the distribution $p^{(k)}$, which is the $k$th order Markov approximation to $p$ in the sense that it is a $k$th order Markov source and it has the same $k$th order transition probabilities as $p$. To be more precise, the process distribution $p^{(k)}$ is specified by its finite dimensional distributions
$$p^{(k)}_{X^k}(x^k) = p_{X^k}(x^k)$$
$$p^{(k)}_{X^n}(x^n) = p_{X^k}(x^k)\prod_{l=k}^{n-1} p_{X_l|X_{l-k}^k}(x_l|x_{l-k}^k),\qquad n = k, k+1,\cdots$$
so that
$$p^{(k)}_{X_k|X^k} = p_{X_k|X^k}.$$
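A minimal sketch of the construction (the second order binary source used as $p$ is an arbitrary choice): it builds the $n$-dimensional pmf's of $p$ and of its first order approximation $p^{(1)}$ from the product formula above, confirms that the two agree on the $(k+1)$-dimensional marginals, and shows that $D(p_{X^n}\|p^{(1)}_{X^n}) > 0$ because $p$ itself is not first order Markov.

```python
import numpy as np
from itertools import product

# A stationary binary source p that is second order Markov, so its first order
# approximation p^(1) differs from it.  The kernel
# T[2*a + b, c] = Pr(X_n = c | X_{n-2} = a, X_{n-1} = b) is an arbitrary choice.
T = np.array([[0.9, 0.1],   # past pair (0,0)
              [0.5, 0.5],   # (0,1)
              [0.2, 0.8],   # (1,0)
              [0.6, 0.4]])  # (1,1)
pair = lambda a, b: 2 * a + b

# Stationary pmf of the pair (X_{n-1}, X_n): invariant vector of the 4x4 pair chain.
Q = np.zeros((4, 4))
for a, b, c in product(range(2), repeat=3):
    Q[pair(a, b), pair(b, c)] = T[pair(a, b), c]
w, v = np.linalg.eig(Q.T)
pi2 = np.real(v[:, np.argmin(np.abs(w - 1))])
pi2 = (pi2 / pi2.sum()).reshape(2, 2)          # pi2[a, b] = p_{X^2}(a, b)

def p_n(xs):
    """n-dimensional pmf of the original second order source p."""
    prob = pi2[xs[0], xs[1]]
    for l in range(2, len(xs)):
        prob *= T[pair(xs[l - 2], xs[l - 1]), xs[l]]
    return prob

def p1_n(xs):
    """n-dimensional pmf of the k = 1 Markov approximation:
       p^(1)(x^n) = p_{X^1}(x_0) * prod_l p_{X_1|X_0}(x_l | x_{l-1})."""
    marg = pi2.sum(axis=1)
    cond = pi2 / marg[:, None]                 # p_{X_1|X_0}
    prob = marg[xs[0]]
    for l in range(1, len(xs)):
        prob *= cond[xs[l - 1], xs[l]]
    return prob

n = 5
# p^(1) shares p's (k+1) = 2 dimensional marginal (same transition probabilities) ...
for a, b in product(range(2), repeat=2):
    tails = list(product(range(2), repeat=n - 2))
    assert abs(sum(p_n((a, b) + t) for t in tails) -
               sum(p1_n((a, b) + t) for t in tails)) < 1e-12
# ... but the full n-dimensional distributions differ, so D(p_{X^n}||p^(1)_{X^n}) > 0.
D = sum(p_n(x) * np.log(p_n(x) / p1_n(x)) for x in product(range(2), repeat=n))
print(D)
```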
It is natural to ask how good this approximation is, especially in the limit, that is, to study the behavior of the relative entropy rate $\bar H_{p\|p^{(k)}}(X)$ as $k \to \infty$.
Theorem 2.6.2: Given a stationary process $p$, let $p^{(k)}$ denote the $k$th order Markov approximations to $p$. Then
$$\lim_{k\to\infty}\bar H_{p\|p^{(k)}}(X) = \inf_k \bar H_{p\|p^{(k)}}(X) = 0.$$
Thus the Markov approximations are asymptotically accurate in the sense that the relative entropy rate between the source and approximation can be made arbitrarily small (zero if the original source itself happens to be Markov).
Proof: As in the proof of Lemma 2.6.3 we can write for $n \ge k$ that
$$H_{p\|p^{(k)}}(X_0|X_{-1},\cdots,X_{-n}) = -H_p(X_0|X_{-1},\cdots,X_{-n}) - \sum_{x^{k+1}} p_{X^{k+1}}(x^{k+1})\ln p_{X_k|X^k}(x_k|x^k)
= H_p(X_0|X_{-1},\cdots,X_{-k}) - H_p(X_0|X_{-1},\cdots,X_{-n}).$$
Note that this implies that $p^{(k)}_{X^n} \gg p_{X^n}$ for all $n$ since the entropies are finite. This automatic domination of the finite dimensional distributions of a measure by those of its Markov approximation will not hold in the general case to be encountered later; it is specific to the finite alphabet case. Taking the limit as $n \to \infty$ gives
$$\bar H_{p\|p^{(k)}}(X) = \lim_{n\to\infty} H_{p\|p^{(k)}}(X_0|X_{-1},\cdots,X_{-n}) = H_p(X_0|X_{-1},\cdots,X_{-k}) - \bar H_p(X).$$
The theorem then follows immediately from Lemma 2.6.2. $\Box$
Markov approximations will play a fundamental role when considering relative entropies for general (nonfinite alphabet) processes. The basic result above will generalize to that case, but the proof will be much more involved.
2.7 Relative Entropy Densities
Many of the convergence results to come will be given and stated in terms of relative entropy densities. In this section we present a simple but important result describing the asymptotic behavior of relative entropy densities. Although the result of this section is only for finite alphabet processes, it is stated and proved in a manner that will extend naturally to more general processes later on. The result will play a fundamental role in the basic ergodic theorems to come.
Throughout this section we will assume that $M$ and $P$ are two process distributions describing a random process $\{X_n\}$. Denote as before the sample vector $X^n = (X_0, X_1,\cdots,X_{n-1})$, that is, the vector beginning at time 0 having length $n$. The distributions of $X^n$ induced by $M$ and $P$ will be denoted by $M_n$ and $P_n$, respectively. The corresponding pmf's are $m_{X^n}$ and $p_{X^n}$. The key assumption in this section is that for all $n$, if $m_{X^n}(x^n) = 0$, then also $p_{X^n}(x^n) = 0$, that is,
$$M_n \gg P_n \ \text{for all } n. \qquad (2.30)$$

If this is the case, we can define the relative entropy density
$$h_n(x) \equiv \ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \ln f_n(x), \qquad (2.31)$$
where
$$f_n(x) \equiv \begin{cases}\dfrac{p_{X^n}(x^n)}{m_{X^n}(x^n)} & \text{if } m_{X^n}(x^n) \ne 0\\ 0 & \text{otherwise.}\end{cases} \qquad (2.32)$$
Observe that the relative entropy is found by integrating the relative entropy density:
$$H_{P\|M}(X^n) = D(P_n\|M_n) = \sum_{x^n} p_{X^n}(x^n)\ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \int \ln\frac{p_{X^n}(X^n)}{m_{X^n}(X^n)}\,dP. \qquad (2.33)$$
Thus, for example, if we assume that
$$H_{P\|M}(X^n) < \infty,\quad \text{all } n, \qquad (2.34)$$
then (2.30) holds.
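For i.i.d. measures all of the above can be written out explicitly. A minimal sketch (the two full-support binary marginals are arbitrary choices, so (2.30) holds for every $n$) tabulates $h_n$ and checks (2.33) against the product-measure identity $D(P_n\|M_n) = nD(p_{X_0}\|m_{X_0})$:

```python
import numpy as np
from itertools import product

# Two i.i.d. process distributions with full-support binary marginals (arbitrary
# choices), so that m_{X^n}(x^n) > 0 everywhere and (2.30) holds for every n.
p1 = np.array([0.7, 0.3])
m1 = np.array([0.5, 0.5])

def pmf_n(marg, xs):
    """n-dimensional pmf of an i.i.d. process with the given marginal."""
    return float(np.prod(marg[list(xs)]))

n = 6
# Relative entropy density (2.31): h_n(x) = ln p_{X^n}(x^n) - ln m_{X^n}(x^n).
h = {xs: np.log(pmf_n(p1, xs)) - np.log(pmf_n(m1, xs))
     for xs in product(range(2), repeat=n)}

# (2.33): integrating h_n against P gives H_{P||M}(X^n) = D(P_n||M_n).
D_n = sum(pmf_n(p1, xs) * h[xs] for xs in h)
# For product measures D(P_n||M_n) = n * D(p1||m1).
D_1 = float((p1 * np.log(p1 / m1)).sum())
print(D_n, n * D_1)
```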
The following lemma will prove to be useful when comparing the asymptotic behavior of relative entropy densities for different probability measures. It is the first almost everywhere result for relative entropy densities that we consider. It is somewhat narrow in the sense that it only compares limiting densities to zero and not to expectations. We shall later see that essentially the same argument implies the same result for the general case (Theorem 5.4.1); only the interim steps involving pmf's need be dropped. Note that the lemma requires neither stationarity nor asymptotic mean stationarity.
Lemma 2.7.1: Given a finite alphabet process $\{X_n\}$ with process measures $P$, $M$ satisfying (2.30). Then
$$\limsup_{n\to\infty}\frac{1}{n}h_n \le 0,\quad M\text{-a.e.} \qquad (2.35)$$
and
$$\liminf_{n\to\infty}\frac{1}{n}h_n \ge 0,\quad P\text{-a.e.} \qquad (2.36)$$
If in addition $M \gg P$, then
$$\lim_{n\to\infty}\frac{1}{n}h_n = 0,\quad P\text{-a.e.} \qquad (2.37)$$
Proof: First consider the probability
$$M\Bigl(\frac{1}{n}h_n \ge \epsilon\Bigr) = M(f_n \ge e^{n\epsilon}) \le \frac{E_M(f_n)}{e^{n\epsilon}},$$
where the final inequality is Markov's inequality. But
$$E_M(f_n) = \int f_n\,dM = \sum_{x^n:\,m_{X^n}(x^n)\ne 0} m_{X^n}(x^n)\,\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} = \sum_{x^n:\,m_{X^n}(x^n)\ne 0} p_{X^n}(x^n) \le 1$$
and therefore
$$M\Bigl(\frac{1}{n}h_n \ge \epsilon\Bigr) \le e^{-n\epsilon}$$
and hence
$$\sum_{n=1}^{\infty} M\Bigl(\frac{1}{n}h_n > \epsilon\Bigr) \le \sum_{n=1}^{\infty} e^{-n\epsilon} < \infty.$$
From the Borel-Cantelli Lemma (e.g., Lemma 4.6.3 of [50]) this implies that $M(n^{-1}h_n \ge \epsilon\ \text{i.o.}) = 0$, which implies the first equation of the lemma.
Next consider
$$P\Bigl(-\frac{1}{n}h_n > \epsilon\Bigr) = \sum_{x^n:\,-\frac{1}{n}\ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} > \epsilon\ \text{and}\ p_{X^n}(x^n)\ne 0} p_{X^n}(x^n)
= \sum_{x^n:\,-\frac{1}{n}\ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} > \epsilon\ \text{and}\ m_{X^n}(x^n)\ne 0} p_{X^n}(x^n),$$
where the last statement follows since if $m_{X^n}(x^n) = 0$, then also $p_{X^n}(x^n) = 0$ and hence nothing would be contributed to the sum. In other words, terms violating this condition add zero to the sum and hence adding this condition to the sum does not change the sum's value. Thus
$$P\Bigl(-\frac{1}{n}h_n > \epsilon\Bigr) = \sum_{x^n:\,-\frac{1}{n}\ln\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)} > \epsilon\ \text{and}\ m_{X^n}(x^n)\ne 0} m_{X^n}(x^n)\,\frac{p_{X^n}(x^n)}{m_{X^n}(x^n)}
= \int_{f_n < e^{-n\epsilon}} f_n\,dM \le \int_{f_n < e^{-n\epsilon}} e^{-n\epsilon}\,dM = e^{-n\epsilon}\,M(f_n < e^{-n\epsilon}) \le e^{-n\epsilon}.$$
Thus as before we have that $P(-n^{-1}h_n > \epsilon) \le e^{-n\epsilon}$ and hence that $P(n^{-1}h_n \le -\epsilon\ \text{i.o.}) = 0$, which proves the second claim. If also $M \gg P$, then the first equation of the lemma is also true $P$-a.e., which when coupled with the second equation proves the third. $\Box$
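A simulation consistent with the lemma (the marginals, sample size, and random seed are arbitrary choices): for two i.i.d. process measures with distinct full-support binary marginals, (2.30) holds for every $n$ although $M \gg P$ fails on the sequence space; under $P$ the density $n^{-1}h_n$ settles near $D(p_{X_0}\|m_{X_0}) \ge 0$, consistent with (2.36), while under $M$ it settles near $-D(m_{X_0}\|p_{X_0}) \le 0$, consistent with (2.35).

```python
import numpy as np

rng = np.random.default_rng(3)
p1 = np.array([0.7, 0.3])     # marginal of the i.i.d. measure P (arbitrary choice)
m1 = np.array([0.5, 0.5])     # marginal of the i.i.d. measure M (arbitrary choice)
n = 5000

def h_over_n(x):
    """n^{-1} h_n(x) = n^{-1} ln[ p_{X^n}(x^n) / m_{X^n}(x^n) ] for i.i.d. P and M."""
    return float(np.mean(np.log(p1[x] / m1[x])))

x_from_P = rng.choice(2, size=n, p=p1)    # a sample path drawn under P
x_from_M = rng.choice(2, size=n, p=m1)    # a sample path drawn under M

# Under P the density tends to D(p1||m1) >= 0, consistent with (2.36);
# under M it tends to -D(m1||p1) <= 0, consistent with (2.35).
print(h_over_n(x_from_P), float((p1 * np.log(p1 / m1)).sum()))
print(h_over_n(x_from_M), float(-(m1 * np.log(m1 / p1)).sum()))
```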
Chapter 3

The Entropy Ergodic Theorem

3.1 Introduction
The goal of this chapter is to prove an ergodic theorem for sample entropy of finite alphabet random processes. The result is sometimes called the ergodic theorem of information theory or the asymptotic equipartition theorem, but it is best known as the Shannon-McMillan-Breiman theorem. It provides a common foundation to many of the results of both ergodic theory and information theory. Shannon [129] first developed the result for convergence in probability for stationary ergodic Markov sources. McMillan [103] proved $L^1$ convergence for stationary ergodic sources and Breiman [19] [20] proved almost everywhere convergence for stationary and ergodic sources. Billingsley [15] extended the result to stationary nonergodic sources. Jacobs [67] [66] extended it to processes dominated by a stationary measure and hence to two-sided AMS processes. Gray and Kieffer [54] extended it to processes asymptotically dominated by a stationary measure and hence to all AMS processes. The generalizations to AMS processes build on the Billingsley theorem for the stationary mean. Following generalizations of the definitions of entropy and information, corresponding generalizations of the entropy ergodic theorem will be considered in Chapter 8.
Breiman's and Billingsley's approach requires the martingale convergence theorem and embeds the possibly one-sided stationary process into a two-sided process. Ornstein and Weiss [117] recently developed a proof for the stationary and ergodic case that does not require any martingale theory and considers only positive time and hence does not require any embedding into two-sided processes. The technique was described for both the ordinary ergodic theorem and the entropy ergodic theorem by Shields [132]. In addition, it uses a form of coding argument that is both more direct and more information theoretic in flavor than the traditional martingale proofs. We here follow the Ornstein and Weiss approach for the stationary ergodic result. We also use some modifications
similar to those of Katznelson and Weiss for the proof of the ergodic theorem. We then generalize the result first to nonergodic processes using the "sandwich" technique of Algoet and Cover [7] and then to AMS processes using a variation on a result of [54].
We next state the theorem to serve as a guide through the various steps. We also prove the result for the simple special case of a Markov source, for which the result follows from the usual ergodic theorem.
We consider a directly given finite alphabet source $\{X_n\}$ described by a distribution $m$ on the sequence measurable space $(\Omega,\mathcal B)$. Define as previously $X_k^n = (X_k, X_{k+1},\cdots,X_{k+n-1})$. The subscript is omitted when it is zero. For any random variable $Y$ defined on the sequence space (such as $X_k^n$) we define the random variable $m(Y)$ by $m(Y)(x) = m(Y = Y(x))$.
Theorem 3.1.1: The Entropy Ergodic Theorem
Given a finite alphabet AMS source $\{X_n\}$ with process distribution $m$ and stationary mean $\bar m$, let $\{\bar m_x;\ x\in\Omega\}$ be the ergodic decomposition of the stationary mean $\bar m$. Then
$$\lim_{n\to\infty}\frac{-\ln m(X^n)}{n} = h,\quad m\text{-a.e. and in } L^1(m), \qquad (3.1)$$
where $h(x)$ is the invariant function defined by
$$h(x) = \bar H_{\bar m_x}(X). \qquad (3.2)$$
Furthermore,
$$E_m h = \lim_{n\to\infty}\frac{1}{n}H_m(X^n) = \bar H_m(X), \qquad (3.3)$$
that is, the entropy rate of an AMS process is given by the limit, and
$$\bar H_{\bar m}(X) = \bar H_m(X). \qquad (3.4)$$
Comments: The theorem states that the sample entropy using the AMS measure $m$ converges to the entropy rate of the underlying ergodic component of the stationary mean. Thus, for example, if $m$ is itself stationary and ergodic, then the sample entropy converges to the entropy rate of the process $m$-a.e. and in $L^1(m)$. The $L^1(m)$ convergence follows immediately from the almost everywhere convergence and the fact that sample entropy is uniformly integrable (Lemma 2.3.6). $L^1$ convergence in turn immediately implies the left-hand equality of (3.3). Since the limit exists, it is the entropy rate. The final equality states that the entropy rates of an AMS process and its stationary mean are the same. This result follows from (3.2)-(3.3) by the following argument:
We have that $\bar H_m(X) = E_m h$ and $\bar H_{\bar m}(X) = E_{\bar m} h$, but $h$ is invariant and hence the two expectations are equal (see, e.g., Lemma 6.3.1 of [50]). Thus we need only prove almost everywhere convergence in (3.1) to prove the theorem.
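A quick numerical illustration of (3.1) in the simplest setting covered by the theorem (the marginal pmf, sample size, and seed are arbitrary choices): for an i.i.d., hence stationary and ergodic, source the sample entropy $-n^{-1}\ln m(X^n)$ should settle at the entropy rate $\bar H_m(X) = H(X_0)$.

```python
import numpy as np

rng = np.random.default_rng(4)
marg = np.array([0.5, 0.3, 0.2])                 # an arbitrary finite-alphabet marginal
entropy_rate = float(-(marg * np.log(marg)).sum())   # Hbar_m(X) = H(X_0) for i.i.d. m

x = rng.choice(len(marg), size=200000, p=marg)   # one sample path
for n in (100, 1000, 10000, 200000):
    sample_entropy = float(-np.mean(np.log(marg[x[:n]])))   # -(1/n) ln m(X^n)
    print(n, sample_entropy, entropy_rate)
```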
In this section we limit ourselves to the following special case of the theorem that can be proved using the ordinary ergodic theorem without any new techniques.