
4.2 Stationary Codes and Approximation

$$= P(Q_i \cap R_i) - P\Big(Q_i \cap R_i \cap \bigcup_{j<i} R_j\Big), \qquad (4.3)$$

where we have used the fact that a set difference is unchanged if the portion being removed is intersected with the set it is being removed from, and the fact that $P(F - G) = P(F) - P(G)$ if $G \subset F$. Combining (4.2) and (4.3) we have that

$$P(Q_i \Delta Q'_i) \le P(Q_i \cup R_i) - P(Q_i \cap R_i) + P\Big(Q_i \cap R_i \cap \bigcup_{j<i} R_j\Big)$$
$$= P(Q_i \Delta R_i) + P\Big(Q_i \cap R_i \cap \bigcup_{j<i} R_j\Big) \le \gamma + \sum_{j<i} P(Q_i \cap R_j).$$

For $j \ne i$, however, we have that

$$P(Q_i \cap R_j) = P(Q_i \cap R_j \cap Q_j^c) \le P(R_j \cap Q_j^c) \le P(R_j \Delta Q_j) \le \gamma,$$

which with the previous equation implies that

$$P(Q_i \Delta Q'_i) \le K\gamma, \quad i = 1, 2, \ldots, K - 1.$$

For the remaining atom we have

 

$$P(Q_K \Delta Q'_K) = P\big((Q_K \cap Q_K'^c) \cup (Q_K^c \cap Q'_K)\big). \qquad (4.4)$$

 

 

 

 

 

 

 

 

 

 

$$Q_K \cap Q_K'^c = Q_K \cap \Big(\bigcup_{j<K} Q'_j\Big) = Q_K \cap \Big(\bigcup_{j<K} (Q'_j \cap Q_j^c)\Big),$$

where the last equality follows since points in $Q'_j$ that are also in $Q_j$ cannot contribute to the intersection with $Q_K$ since the $Q_j$ are disjoint. Since $Q'_j \cap Q_j^c \subset Q'_j \Delta Q_j$ we have

$$Q_K \cap Q_K'^c \subset Q_K \cap \Big(\bigcup_{j<K} (Q'_j \Delta Q_j)\Big) \subset \bigcup_{j<K} (Q'_j \Delta Q_j).$$

A similar argument shows that

$$Q_K^c \cap Q'_K \subset \bigcup_{j<K} (Q'_j \Delta Q_j)$$

and hence with (4.4)

$$P(Q_K \Delta Q'_K) \le P\Big(\bigcup_{j<K} (Q_j \Delta Q'_j)\Big) \le \sum_{j<K} P(Q_j \Delta Q'_j) \le K^2\gamma.$$


To summarize, we have shown that

$$P(Q_i \Delta Q'_i) \le K^2\gamma, \quad i = 1, 2, \ldots, K.$$

If we now choose $\gamma$ so small that $K^2\gamma \le \epsilon/K$, the lemma is proved. $\Box$

Corollary 4.2.2: Let $(\Omega, \mathcal{B}, P)$ be a probability space and $\mathcal{F}$ a generating field. Let $f: \Omega \to A$ be a finite alphabet measurement. Given $\epsilon > 0$ there is a measurement $g: \Omega \to A$ that is measurable with respect to $\mathcal{F}$ (that is, $g^{-1}(a) \in \mathcal{F}$ for all $a \in A$) for which $P(f \ne g) \le \epsilon$.

Proof: Follows from the previous lemma by setting $\mathcal{Q} = \{f^{-1}(a);\ a \in A\}$, choosing $\mathcal{Q}'$ from the lemma, and then assigning $g$ for atom $Q'_i$ in $\mathcal{Q}'$ the same value that $f$ takes on in atom $Q_i$ in $\mathcal{Q}$. Then

$$P(f \ne g) = \frac{1}{2}\sum_i P(Q_i \Delta Q'_i) \le \epsilon. \qquad \Box$$
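As a concrete illustration of the corollary (a minimal numerical sketch, not an example from the text), take $\Omega = [0,1)$ with Lebesgue measure and the generating field of finite unions of dyadic intervals; a two-valued measurement $f$ is approximated arbitrarily well by a code $g$ that is constant on dyadic intervals of depth $N$. The measurement $f$ below and the midpoint assignment rule are illustrative choices.

```python
# A minimal sketch (not from the text) of Corollary 4.2.2 on a concrete space:
# Omega = [0,1) with Lebesgue measure P, whose Borel field is generated by the
# field of finite unions of dyadic intervals.
import numpy as np

def f(x):
    # a finite-alphabet measurement with alphabet {0, 1}
    return ((x >= 1/3) & (x < 2/3)).astype(int)

def g_factory(N):
    # g is constant on each dyadic interval [k 2^-N, (k+1) 2^-N), hence
    # measurable with respect to the generating field; each interval is
    # assigned the value f takes at its midpoint (one simple rule).
    def g(x):
        k = np.floor(x * 2**N)
        midpoints = (k + 0.5) / 2**N
        return f(midpoints)
    return g

# Estimate P(f != g) by a fine Riemann sum over [0,1).
x = (np.arange(2**16) + 0.5) / 2**16
for N in (2, 4, 6, 8):
    g = g_factory(N)
    print(N, np.mean(f(x) != g(x)))   # disagreement probability shrinks with N
```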

We now develop applications of the previous results which relate the idea of the entropy of a dynamical system with the entropy rate of a random process. The result is not required for later coding theorems, but it provides insight into the connections between entropy as considered in ergodic theory and entropy as used in information theory. In addition, the development involves some ideas of coding and approximation which are useful in proving the ergodic theorems of information theory used to prove coding theorems.

Let $\{X_n\}$ be a random process with alphabet $A_X$. Let $A_X^\infty$ denote the one- or two-sided sequence space. Consider the dynamical system $(\Omega, \mathcal{B}, P, T)$ defined by $(A_X^\infty, \mathcal{B}(A_X)^\infty, P, T)$, where $P$ is the process distribution and $T$ the shift. Recall from Section 2.2 that a stationary coding or infinite length sliding block coding of $\{X_n\}$ is a measurable mapping $f: A_X^\infty \to A_f$ into a finite alphabet which produces an encoded process $\{f_n\}$ defined by

$$f_n(x) = f(T^n x), \quad x \in A_X^\infty.$$

The entropy $H(P, T)$ of the dynamical system was defined by

$$H(P, T) = \sup_f \bar H_P(f),$$

the supremum of the entropy rates of finite alphabet stationary codings of the original process. We shall soon show that if the original alphabet is finite, then the entropy of the dynamical system is exactly the entropy rate of the process. First, however, we require several preliminary results, some of independent interest.

Lemma 4.2.3: If $f$ is a stationary coding of an AMS process, then the process $\{f_n\}$ is also AMS. If the input process is ergodic, then so is $\{f_n\}$.

Proof: Suppose that the input process has alphabet $A_X$ and distribution $P$ and that the measurement $f$ has alphabet $A_f$. Define the sequence mapping

$\bar f: A_X^\infty \to A_f^\infty$ by $\bar f(x) = \{f_n(x);\ n \in \mathcal{T}\}$, where $f_n(x) = f(T^n x)$ and $T$ is


the shift on the input sequence space $A_X^\infty$. If $T$ also denotes the shift on the output space, then by construction $\bar f(Tx) = T\bar f(x)$ and hence for any output event $F$, $\bar f^{-1}(T^{-1}F) = T^{-1}\bar f^{-1}(F)$. Let $m$ denote the process distribution for the encoded process. Since $m(F) = P(\bar f^{-1}(F))$ for any event $F \in \mathcal{B}(A_f)^\infty$, we have using the stationarity of the mapping $\bar f$ that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} m(T^{-i}F) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} P(\bar f^{-1}(T^{-i}F)) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} P(T^{-i}\bar f^{-1}(F)) = \bar P(\bar f^{-1}(F)),$$

where $\bar P$ is the stationary mean of $P$. Thus $m$ is AMS. If $G$ is an invariant output event, then $\bar f^{-1}(G)$ is also invariant since $T^{-1}\bar f^{-1}(G) = \bar f^{-1}(T^{-1}G)$. Hence if input invariant sets can only have probability 1 or 0, the same is true for output invariant sets. $\Box$

The lemma and Theorem 3.1.1 immediately yield the following:

Corollary 4.2.3: If $f$ is a stationary coding of an AMS process, then
$$\bar H(f) = \lim_{n\to\infty} \frac{1}{n} H(f^n),$$

that is, the limit exists.

For later use the next result considers general standard alphabets. A stationary code $f$ is a scalar quantizer if there is a map $q: A_X \to A_f$ such that $f(x) = q(x_0)$. Intuitively, $f$ depends on the input sequence only through the current symbol. Mathematically, $f$ is measurable with respect to $\sigma(X_0)$. Such codes are effectively the simplest possible and have no memory or dependence on the future.
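For instance (a minimal sketch, not from the text), a uniform scalar quantizer on a real-valued source can be written as a stationary code that looks only at the current sample; the bin edges below are an arbitrary illustrative choice.

```python
# A scalar quantizer viewed as a stationary (sliding-block) code with no
# memory: the output at time n depends only on the input sample x[n].
import numpy as np

edges = np.array([-1.0, 0.0, 1.0])      # illustrative thresholds
def q(x0):
    return int(np.digitize(x0, edges))  # finite alphabet {0, 1, 2, 3}

def f(x, n):
    # stationary code induced by q: f_n(x) = q(x_n)
    return q(x[n])

x = np.random.default_rng(0).normal(size=10)
print([f(x, n) for n in range(len(x))])
```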

Lemma 4.2.4: Let $\{X_n\}$ be an AMS process with standard alphabet $A_X$ and distribution $m$. Let $f$ be a stationary coding of the process with finite alphabet $A_f$. Fix $\epsilon > 0$. If the process is two-sided, then there is a scalar quantizer $q: A_X \to A_q$, an integer $N$, and a mapping $g: A_q^{2N+1} \to A_f$ such that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \Pr\big(f_i \ne g(q(X_{i-N}), q(X_{i-N+1}), \ldots, q(X_{i+N}))\big) \le \epsilon.$$

If the process is one-sided, then there is a scalar quantizer $q: A_X \to A_q$, an integer $N$, and a mapping $g: A_q^N \to A_f$ such that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \Pr\big(f_i \ne g(q(X_i), q(X_{i+1}), \ldots, q(X_{i+N-1}))\big) \le \epsilon.$$

Comment: The lemma states that any stationary coding of an AMS process can

be approximated by a code that depends only on a finite number of quantized


inputs, that is, by a coding of a finite window of a scalar quantized version of the original process. In the special case of a finite alphabet input process, the lemma states that an arbitrary stationary coding can be well approximated by a coding depending only on a finite number of the input symbols.
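The structure promised by the lemma can be sketched in code (a hypothetical illustration, not the text's construction): the approximating code applies a scalar quantizer to each sample and then a table lookup on a finite window of the quantized values. The particular quantizer, window length, and table below are placeholders.

```python
# Sketch of the approximating code of Lemma 4.2.4 (two-sided case):
# f_i is approximated by g(q(x[i-N]), ..., q(x[i+N])), a function of a finite
# window of scalar-quantized samples.  q, N, and the map g here are arbitrary
# stand-ins, not the ones the proof constructs.
import numpy as np

N = 1
edges = np.array([0.0])                 # binary quantizer: sign of the sample

def q(x0):
    return int(np.digitize(x0, edges))  # alphabet {0, 1}

def g(window):
    # finite-window map A_q^{2N+1} -> A_f; here: majority vote over the window
    return int(sum(window) > len(window) // 2)

def approx_code(x, i):
    window = [q(x[j]) for j in range(i - N, i + N + 1)]
    return g(window)

x = np.random.default_rng(1).normal(size=12)
print([approx_code(x, i) for i in range(N, len(x) - N)])
```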

Proof: Suppose that $\bar m$ is the stationary mean and hence for any measurements $f$ and $g$

$$\bar m(f_0 \ne g_0) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \Pr(f_i \ne g_i).$$

Let $q_n$ be an asymptotically accurate scalar quantizer in the sense that $\sigma(q_n(X_0))$ asymptotically generates $\mathcal{B}(A_X)$. (Since $A_X$ is standard this exists. If $A_X$ is finite, then take $q(a) = a$.) Then $\mathcal{F}_n = \sigma(q_n(X_i);\ i = 0, 1, 2, \ldots, n-1)$ asymptotically generates $\mathcal{B}(A_X)^\infty$ for one-sided processes and $\mathcal{F}_n = \sigma(q_n(X_i);\ i = -n, \ldots, n)$ does the same for two-sided processes. Hence from Corollary 4.2.2, given $\epsilon$ we can find a sufficiently large $n$ and a mapping $g$ that is measurable with respect to $\mathcal{F}_n$ such that $\bar m(f \ne g) \le \epsilon$. Since $g$ is measurable with respect to $\mathcal{F}_n$, it must depend on only the finite number of quantized samples that generate $\mathcal{F}_n$. (See, e.g., Lemma 5.2.1 of [50].) This proves the lemma. $\Box$

Combining the lemma and Corollary 4.2.1 immediately yields the following corollary, which permits us to study the entropy rate of general stationary codes by considering codes which depend on only a finite number of inputs (and hence for which the ordinary entropy results for random vectors can be applied).

Corollary 4.2.4: Given a stationary coding $f$ of an AMS process let $\mathcal{F}_n$ be defined as above. Then given $\epsilon > 0$ there exists for sufficiently large $n$ a code $g$ measurable with respect to $\mathcal{F}_n$ such that

$$|\bar H(f) - \bar H(g)| \le \epsilon.$$

The above corollary can be used to show that entropy rate, like entropy, is reduced by coding. The general stationary code is approximated by a code depending on only a finite number of inputs and then the result that entropy is reduced by mapping (Lemma 2.3.3) is applied.

Corollary 4.2.5: Given an AMS process $\{X_n\}$ with finite alphabet $A_X$ and a stationary coding $f$ of the process, then

$$\bar H(X) \ge \bar H(f),$$

that is, stationary coding reduces entropy rate.

Proof: For integer $n$ define $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$ in the one-sided case and $\sigma(X_{-n}, \ldots, X_n)$ in the two-sided case. Then $\mathcal{F}_n$ asymptotically generates $\mathcal{B}(A_X)^\infty$. Hence given a code $f$ and an $\epsilon > 0$ we can choose, using the finite alphabet special case of the previous lemma, a large $k$ and an $\mathcal{F}_k$-measurable code

$g$ such that $|\bar H(f) - \bar H(g)| \le \epsilon$. We shall show that $\bar H(g) \le \bar H(X)$, which will prove the lemma. To see this in the one-sided case observe that $g$ is a function of $X^k$ and hence $g^n$ depends only on $X^{n+k}$ and hence

$$H(g^n) \le H(X^{n+k})$$


and hence

 

 

 

 

 

 

 

 

 

 

$$\bar H(g) = \lim_{n\to\infty} \frac{1}{n} H(g^n) \le \lim_{n\to\infty} \frac{n+k}{n}\cdot\frac{1}{n+k}\, H(X^{n+k}) = \bar H(X).$$

 

In the two-sided case $g$ depends on $\{X_{-k}, \ldots, X_k\}$ and hence $g^n$ depends on $\{X_{-k}, \ldots, X_{n+k}\}$ and hence

$$H(g^n) \le H(X_{-k}, \ldots, X_{-1}, X_0, \ldots, X_{n+k}) \le H(X_{-k}, \ldots, X_{-1}) + H(X_0, \ldots, X_{n+k}).$$

Dividing by $n$ and taking the limit completes the proof as before. $\Box$

Theorem 4.2.1: Let $\{X_n\}$ be a random process with alphabet $A_X$. Let $A_X^\infty$ denote the one- or two-sided sequence space. Consider the dynamical system $(\Omega, \mathcal{B}, P, T)$ defined by $(A_X^\infty, \mathcal{B}(A_X)^\infty, P, T)$, where $P$ is the process distribution and $T$ is the shift. Then

$$H(P, T) = \bar H(X).$$

Proof: From (2.2.4), $H(P, T) \ge \bar H(X)$. Conversely suppose that $f$ is a code which yields $\bar H(f) \ge H(P, T) - \epsilon$. Since $f$ is a stationary coding of the process $\{X_n\}$, the previous corollary implies that $\bar H(f) \le \bar H(X)$, which completes the proof. $\Box$
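As a concrete instance of the theorem (a minimal sketch, not an example from the text), the entropy rate of a stationary binary Markov source, and hence by the theorem the entropy of the corresponding dynamical system, can be computed in closed form; the transition matrix below is an arbitrary choice and entropies are in nats.

```python
# Entropy rate (in nats) of a stationary binary Markov chain:
# H-bar(X) = lim (1/n) H(X^n) = sum_i pi_i * sum_j -P_ij ln P_ij.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                    # illustrative transition matrix
pi = np.array([0.8, 0.2])                     # its stationary distribution
assert np.allclose(pi @ P, pi)

def H(p):                                     # entropy of a pmf, in nats
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

rate = sum(pi[i] * H(P[i]) for i in range(2)) # conditional entropy H(X_1|X_0)

# For a stationary Markov chain (1/n) H(X^n) = (H(X_0) + (n-1)*rate)/n,
# which decreases to the entropy rate as n grows.
for n in (1, 2, 10, 100):
    print(n, (H(pi) + (n - 1) * rate) / n)
print("entropy rate:", rate)
```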

4.3 Information Rate of Finite Alphabet Processes

Let $\{(X_n, Y_n)\}$ be a one-sided random process with finite alphabet $A \times B$ and let $((A\times B)^{Z_+}, \mathcal{B}(A\times B)^{Z_+})$ be the corresponding one-sided sequence space of outputs of the pair process. We consider $X_n$ and $Y_n$ to be the sampling functions on the sequence spaces $A^\infty$ and $B^\infty$ and $(X_n, Y_n)$ to be the pair sampling function on the product space, that is, for $(x, y) \in A^\infty \times B^\infty$, $(X_n, Y_n)(x, y) = (X_n(x), Y_n(y)) = (x_n, y_n)$. Let $p$ denote the process distribution induced by the original space on the process $\{(X_n, Y_n)\}$. Analogous to entropy rate we can define the mutual information rate (or simply information rate) of a finite alphabet pair process by

$$\bar I(X; Y) = \limsup_{n\to\infty} \frac{1}{n}\, I(X^n; Y^n).$$

The following lemma follows immediately from the properties of entropy rates of Theorems 2.4.1 and 3.1.1 since for AMS finite alphabet processes

$$\bar I(X; Y) = \bar H(X) + \bar H(Y) - \bar H(X, Y)$$

and since from (3.1.4) the entropy rate of an AMS process is the same as that of its stationary mean. Analogous to Theorem 3.1.1 we define the random variables $p(X^n, Y^n)$ by $p(X^n, Y^n)(x, y) = p(X^n = x^n, Y^n = y^n)$, $p(X^n)$ by $p(X^n)(x, y) = p(X^n = x^n)$, and similarly for $p(Y^n)$.
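To make the decomposition concrete (a minimal sketch under the assumption of an i.i.d. pair process, not an example from the text): for an i.i.d. pair source the information rate reduces to the single-letter mutual information $I(X_0; Y_0) = H(X_0) + H(Y_0) - H(X_0, Y_0)$, which can be computed directly from a joint pmf.

```python
# Mutual information (in nats) of a single pair from its joint pmf,
# I = H(X) + H(Y) - H(X,Y); for an i.i.d. pair process this single-letter
# quantity equals the information rate.  The joint pmf is an arbitrary choice.
import numpy as np

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])            # joint pmf of (X_0, Y_0)

def H(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

px, py = pxy.sum(axis=1), pxy.sum(axis=0)
I = H(px) + H(py) - H(pxy.flatten())
print("I(X;Y) =", I)                    # about 0.193 nats for this pmf
```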


Lemma 4.3.1: Suppose that $\{X_n, Y_n\}$ is an AMS finite alphabet random process with distribution $p$ and stationary mean $\bar p$. Then the limits supremum defining information rates are limits and

$$\bar I_p(X; Y) = \bar I_{\bar p}(X; Y).$$

$\bar I_p$ is an affine function of the distribution $p$. If $\bar p$ has ergodic decomposition $p_{xy}$, then

$$\bar I_p(X; Y) = \int d\bar p(x, y)\, \bar I_{p_{xy}}(X; Y).$$

If we define the information density

 

 

 

 

 

$$i_n(X^n; Y^n) = \ln \frac{p(X^n, Y^n)}{p(X^n)\,p(Y^n)},$$

then

 

 

 

 

 

 

 

$$\lim_{n\to\infty} \frac{1}{n}\, i_n(X^n; Y^n) = \bar I_{p_{xy}}(X; Y)$$

almost everywhere with respect to $\bar p$ and $p$ and in $L_1(p)$.
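A minimal simulation sketch (assuming the i.i.d. special case, which is stationary and ergodic, and reusing an arbitrary joint pmf) illustrates this convergence: the normalized information density of a sample pair sequence settles near the mutual information.

```python
# Monte Carlo sketch: for an i.i.d. pair process, (1/n) i_n(X^n;Y^n) is the
# sample mean of ln p(x,y)/(p(x)p(y)) and converges to I(X;Y) (in nats).
import numpy as np

rng = np.random.default_rng(0)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])                      # arbitrary joint pmf
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
I = float(np.sum(pxy * np.log(pxy / np.outer(px, py))))

n = 100000
pairs = rng.choice(4, size=n, p=pxy.flatten())    # encode (x,y) as 0..3
x, y = pairs // 2, pairs % 2
density = np.log(pxy[x, y] / (px[x] * py[y]))     # per-symbol information density
print("sample average:", density.mean(), " I(X;Y):", I)
```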

The following lemmas follow either directly from or similarly to the corresponding results for entropy rate of the previous section.

Lemma 4.3.2: Suppose that $\{X_n, Y_n, X'_n, Y'_n\}$ is an AMS process and

$$P = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \Pr\big((X_i, Y_i) \ne (X'_i, Y'_i)\big)$$

(the limit exists since the process is AMS). Then

 

 

 

$$|\bar I(X; Y) - \bar I(X'; Y')| \le 3\big(P \ln(\|A\| - 1) + h_2(P)\big).$$

Proof: The inequality follows from Corollary 4.2.1 since

 

 

$$|\bar I(X; Y) - \bar I(X'; Y')| \le |\bar H(X) - \bar H(X')| + |\bar H(Y) - \bar H(Y')| + |\bar H(X, Y) - \bar H(X', Y')|$$

and since $\Pr((X_i, Y_i) \ne (X'_i, Y'_i)) = \Pr(X_i \ne X'_i \text{ or } Y_i \ne Y'_i)$ is no smaller than $\Pr(X_i \ne X'_i)$ or $\Pr(Y_i \ne Y'_i)$. $\Box$
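For a feel for how tight the bound is (a minimal sketch with arbitrary numbers; the binary entropy function $h_2$ is taken in nats as elsewhere in the text), one can evaluate the right-hand side directly; it vanishes as the limiting disagreement probability goes to zero, which is the point of the lemma.

```python
# Evaluate the continuity bound 3*(P*ln(||A||-1) + h2(P)) for a few values
# of the limiting disagreement probability P and an illustrative alphabet size.
import numpy as np

def h2(p):
    # binary entropy in nats
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log(p) - (1 - p) * np.log(1 - p))

A = 4                                        # illustrative alphabet size
for P in (0.001, 0.01, 0.1):
    print(P, 3 * (P * np.log(A - 1) + h2(P)))
```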

 

 

 

 

Corollary 4.3.1: Let $\{X_n, Y_n\}$ be an AMS process and let $f$ and $g$ be stationary measurements on $X$ and $Y$, respectively. Given $\epsilon > 0$ there is an $N$ sufficiently large, scalar quantizers $q$ and $r$, and mappings $f'$ and $g'$ which depend only on $\{q(X_0), \ldots, q(X_{N-1})\}$ and $\{r(Y_0), \ldots, r(Y_{N-1})\}$ in the one-sided case and $\{q(X_{-N}), \ldots, q(X_N)\}$ and $\{r(Y_{-N}), \ldots, r(Y_N)\}$ in the two-sided case such that

$$|\bar I(f; g) - \bar I(f'; g')| \le \epsilon.$$

Proof: Choose the codes $f'$ and $g'$ from Lemma 4.2.4 and apply the previous lemma. $\Box$


Lemma 4.3.3: If $\{X_n, Y_n\}$ is an AMS process and $f$ and $g$ are stationary codings of $X$ and $Y$, respectively, then

$$\bar I(X; Y) \ge \bar I(f; g).$$

Proof: This is proved as Corollary 4.2.5 by first approximating $f$ and $g$ by finite-window stationary codes, applying the result for mutual information (Lemma 2.5.2), and then taking the limit. $\Box$


Chapter 5

Relative Entropy

5.1 Introduction

A variety of information measures have been introduced for finite alphabet random variables, vectors, and processes: entropy, mutual information, relative entropy, conditional entropy, and conditional mutual information. All of these can be expressed in terms of divergence and hence the generalization of these definitions to infinite alphabets will follow from a general definition of divergence. Many of the properties of generalized information measures will then follow from those of generalized divergence.

In this chapter we extend the definition and develop the basic properties of divergence, including the formulas for evaluating divergence as expectations of information densities and as limits of divergences of finite codings. We also develop several inequalities for and asymptotic properties of divergence. These results provide the groundwork needed for generalizing the ergodic theorems of information theory from finite to standard alphabets. The general definitions of entropy and information measures originated in the pioneering work of Kolmogorov and his colleagues Gelfand, Yaglom, Dobrushin, and Pinsker [45] [90] [32] [125].

5.2 Divergence

Given a probability space $(\Omega, \mathcal{B}, P)$ (not necessarily with finite alphabet) and another probability measure $M$ on the same space, define the divergence of $P$ with respect to $M$ by

$$D(P\|M) = \sup_{\mathcal{Q}} H_{P\|M}(\mathcal{Q}) = \sup_f D(P_f\|M_f), \qquad (5.1)$$

 

where the first supremum is over all finite measurable partitions $\mathcal{Q}$ of $\Omega$ and the second is over all finite alphabet measurements on $\Omega$. The two forms have the same interpretation: the divergence is the supremum of the relative entropies


or divergences obtainable by finite alphabet codings of the sample space. The partition form is perhaps more common when considering divergence per se, but the measurement or code form is usually more intuitive when considering entropy and information. This section is devoted to developing the basic properties of divergence, all of which will yield immediate corollaries for the measures of information.
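The code-and-refine character of definition (5.1) can be illustrated numerically (a minimal sketch, not from the text): for two distributions on the real line, the divergence of their images under ever finer scalar quantizers is nondecreasing under refinement and approaches the full divergence. The two Gaussian measures and the quantizer grid below are arbitrary choices.

```python
# Divergence D(P_f || M_f) of quantized (finite alphabet) versions of two
# Gaussian measures, computed for increasingly fine scalar quantizers.  The
# values grow toward the closed-form divergence between the Gaussians.
import numpy as np
from scipy.stats import norm

P, M = norm(0.0, 1.0), norm(0.5, 1.5)            # arbitrary choices

def quantized_divergence(bins):
    # partition R into cells bounded by P-quantiles (a refining sequence)
    edges = P.ppf(np.linspace(0.0, 1.0, bins + 1))
    p = np.diff(P.cdf(edges))
    m = np.diff(M.cdf(edges))
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / m[mask])))

for bins in (2, 4, 16, 256):
    print(bins, quantized_divergence(bins))

# closed form for two Gaussians, for comparison (in nats)
s1, s2, mu = 1.0, 1.5, 0.5
print("exact:", np.log(s2 / s1) + (s1**2 + mu**2) / (2 * s2**2) - 0.5)
```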

The first result is a generalization of the divergence inequality that is a trivial consequence of the definition and the finite alphabet special case.

Lemma 5.2.1: The Divergence Inequality:

For any two probability measures P and M

$$D(P\|M) \ge 0$$

with equality if and only if P = M .

Proof: Given any partition $\mathcal{Q}$, Theorem 2.3.1 implies that

$$\sum_{Q\in\mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \ge 0$$

with equality if and only if $P(Q) = M(Q)$ for all atoms $Q$ of the partition. Since $D(P\|M)$ is the supremum over all such partitions, it is also nonnegative. It can be 0 only if $P$ and $M$ assign the same probabilities to all atoms in all partitions (the supremum is 0 only if the above sum is 0 for all partitions) and hence the divergence is 0 only if the measures are identical. $\Box$

As in the finite alphabet case, Lemma 5.2.1 justifies interpreting divergence as a form of distance or dissimilarity between two probability measures. It is not a true distance or metric in the mathematical sense since it is not symmetric and it does not satisfy the triangle inequality. Since it is nonnegative and equals zero only if two measures are identical, the divergence is a distortion measure as considered in information theory [51], which is a generalization of the notion of distance. This view often provides interpretations of the basic properties of divergence. We shall develop several relations between the divergence and other distance measures. The reader is referred to Csiszár [25] for a development of the distance-like properties of divergence.

The following two lemmas provide means for computing divergences and studying their behavior. The first result shows that the supremum can be confined to partitions with atoms in a generating field. This will provide a means for computing divergences by approximation or limits. The result is due to Dobrushin and is referred to as Dobrushin's theorem. The second result shows that the divergence can be evaluated as the expectation of an entropy density defined as the logarithm of the Radon-Nikodym derivative of one measure relative to the other. This result is due to Gelfand, Yaglom, and Perez. The proofs largely follow the translator's remarks in Chapter 2 of Pinsker [125] (which in turn follows Dobrushin [32]).

Lemma 5.2.2: Suppose that $(\Omega, \mathcal{B})$ is a measurable space where $\mathcal{B}$ is generated by a field $\mathcal{F}$, $\mathcal{B} = \sigma(\mathcal{F})$. Then if $P$ and $M$ are two probability measures