
5.3 Conditional Relative Entropy

or memoryless approximation to $P_{XY}$. The next corollary further enhances this name by showing that $P_X \times P_Y$ is the best such approximation in the sense of yielding the minimum divergence with respect to the original distribution.

Corollary 5.3.2: Given a distribution $P_{XY}$, let $\mathcal{M}$ denote the class of all product distributions for $XY$; that is, if $M_{XY} \in \mathcal{M}$, then $M_{XY} = M_X \times M_Y$. Then

$$\inf_{M_{XY} \in \mathcal{M}} D(P_{XY}\|M_{XY}) = D(P_{XY}\|P_X \times P_Y).$$

Proof: We need only consider those $M$ yielding finite divergence (since if there are none, both sides of the formula are infinite and the corollary is trivially true). Then

$$D(P_{XY}\|M_{XY}) = D(P_{XY}\|P_X \times P_Y) + D(P_X \times P_Y\|M_{XY}) \geq D(P_{XY}\|P_X \times P_Y)$$

with equality if and only if $D(P_X \times P_Y\|M_{XY}) = 0$, which it will be if $M_{XY} = P_X \times P_Y$. $\Box$
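The Pythagorean identity used in this proof is easy to check numerically on a finite alphabet. The following sketch is not from the text; the joint pmf and the competing product measure are arbitrary choices, and divergences are computed in nats.

```python
import numpy as np

def kl(p, q):
    """Divergence D(p||q) in nats for pmfs on a common finite alphabet."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(0)
p_xy = rng.random((3, 4)); p_xy /= p_xy.sum()      # an arbitrary joint pmf P_XY
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
prod = np.outer(p_x, p_y)                          # P_X x P_Y, the memoryless approximation

m_x = rng.random(3); m_x /= m_x.sum()              # some other product measure M_X x M_Y
m_y = rng.random(4); m_y /= m_y.sum()
m = np.outer(m_x, m_y)

lhs = kl(p_xy.ravel(), m.ravel())
rhs = kl(p_xy.ravel(), prod.ravel()) + kl(prod.ravel(), m.ravel())
print(lhs, rhs)                                    # agree: D(P||M) = D(P||PxP) + D(PxP||M)
print(kl(p_xy.ravel(), prod.ravel()) <= lhs)       # True: the infimum is attained at P_X x P_Y
```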

Recall that given random variables $(X, Y, Z)$ with distribution $M_{XYZ}$, $X \to Y \to Z$ is a Markov chain (with respect to $M_{XYZ}$) if for any event $F \in \mathcal{B}_{A_Z}$, with probability one

$$M_{Z|YX}(F|y, x) = M_{Z|Y}(F|y).$$

If this holds, we also say that $X$ and $Z$ are conditionally independent given $Y$. Equivalently, if we define the distribution $M_{X\times Z|Y}$ by

$$M_{X\times Z|Y}(F_X \times F_Z \times F_Y) = \int_{F_Y} M_{X|Y}(F_X|y)\, M_{Z|Y}(F_Z|y)\, dM_Y(y),$$
$$F_X \in \mathcal{B}_{A_X},\ F_Z \in \mathcal{B}_{A_Z},\ F_Y \in \mathcal{B}_{A_Y},$$

then $Z \to Y \to X$ is a Markov chain if $M_{X\times Z|Y} = M_{XYZ}$. (See Section 5.10 of [50].) This construction shows that a Markov chain is symmetric in the sense that $X \to Y \to Z$ if and only if $Z \to Y \to X$.

Note that for any measure $M_{XYZ}$, $X \to Y \to Z$ is a Markov chain under $M_{X\times Z|Y}$ by construction.
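On finite alphabets this construction is concrete: $M_{X\times Z|Y}$ keeps $M_Y$, $M_{X|Y}$, and $M_{Z|Y}$ but forces $X$ and $Z$ to be conditionally independent given $Y$. A minimal sketch (the pmf and its shape are arbitrary, not from the text):

```python
import numpy as np

def cond_indep_given_y(m_xyz):
    """Return the pmf of M_{X x Z | Y} from a joint pmf indexed as [x, y, z]:
    same M_Y, M_{X|Y}, M_{Z|Y}, but X and Z conditionally independent given Y."""
    m_y = m_xyz.sum(axis=(0, 2))
    out = np.zeros_like(m_xyz)
    for y in range(m_xyz.shape[1]):
        if m_y[y] > 0:
            m_xz = m_xyz[:, y, :]
            out[:, y, :] = np.outer(m_xz.sum(axis=1), m_xz.sum(axis=0)) / m_y[y]
    return out

rng = np.random.default_rng(1)
m = rng.random((2, 3, 2)); m /= m.sum()
m_hat = cond_indep_given_y(m)
print(np.allclose(cond_indep_given_y(m_hat), m_hat))   # True: m_hat already has the Markov property
```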

The following corollary highlights special properties of the various densities and relative entropies when the dominating measure is a Markov chain. It will lead to the idea of a Markov approximation to an arbitrary distribution on triples extending the independent approximation of the previous corollary.

Corollary 5.3.3: Given a probability space, suppose that $M_{XYZ}$ and $P_{XYZ}$ are two distributions for a random vector $(X, Y, Z)$ such that $M_{XYZ} \gg P_{XYZ}$ and $Z \to Y \to X$ forms a Markov chain under $M$. Then

$$M_{XYZ} \gg P_{X\times Z|Y} \gg P_{XYZ}$$

and

$$\frac{dP_{XYZ}}{dP_{X\times Z|Y}} = \frac{f_{X|YZ}}{f_{X|Y}} \qquad (5.21)$$

$$\frac{dP_{X\times Z|Y}}{dM_{XYZ}} = f_{YZ}\, f_{X|Y}. \qquad (5.22)$$

Thus

$$\ln \frac{dP_{XYZ}}{dP_{X\times Z|Y}} + h_{X|Y} = h_{X|YZ}$$

$$\ln \frac{dP_{X\times Z|Y}}{dM_{XYZ}} = h_{YZ} + h_{X|Y}$$

and taking expectations yields

$$D(P_{XYZ}\|P_{X\times Z|Y}) + H_{P\|M}(X|Y) = H_{P\|M}(X|YZ) \qquad (5.23)$$

$$D(P_{X\times Z|Y}\|M_{XYZ}) = D(P_{YZ}\|M_{YZ}) + H_{P\|M}(X|Y). \qquad (5.24)$$

Furthermore,

$$P_{X\times Z|Y} = P_{X|Y} P_{YZ}, \qquad (5.25)$$

that is,

$$P_{X\times Z|Y}(F_X \times F_Z \times F_Y) = \int_{F_Y \times F_Z} P_{X|Y}(F_X|y)\, dP_{ZY}(z, y). \qquad (5.26)$$

Lastly, if $Z \to Y \to X$ is a Markov chain under $M$, then it is also a Markov chain under $P$ if and only if

$$h_{X|Y} = h_{X|YZ} \qquad (5.27)$$

in which case

$$H_{P\|M}(X|Y) = H_{P\|M}(X|YZ). \qquad (5.28)$$

Proof: Define

$$g(x, y, z) = \frac{f_{X|YZ}(x|y,z)}{f_{X|Y}(x|y)} = \frac{f_{XYZ}(x,y,z)\, f_Y(y)}{f_{YZ}(y,z)\, f_{XY}(x,y)}$$

and simplify notation by defining the measure $Q = P_{X\times Z|Y}$. Note that $Z \to Y \to X$ is a Markov chain with respect to $Q$. Proving the first statement of the corollary requires proving the following relation:

$$P_{XYZ}(F_X \times F_Y \times F_Z) = \int_{F_X \times F_Y \times F_Z} g\, dQ, \quad \text{all } F_X \in \mathcal{B}_{A_X},\ F_Z \in \mathcal{B}_{A_Z},\ F_Y \in \mathcal{B}_{A_Y}.$$

From iterated expectation with respect to Q (e.g., Section 5.9 of [50])

$$E\big(g\, 1_{F_X}(X) 1_{F_Z}(Z) 1_{F_Y}(Y)\big) = E\big(1_{F_Y}(Y) 1_{F_Z}(Z)\, E(g\, 1_{F_X}(X) \mid YZ)\big)$$
$$= \int 1_{F_Y}(y) 1_{F_Z}(z) \Big(\int_{F_X} g(x,y,z)\, dQ_{X|YZ}(x|y,z)\Big)\, dQ_{YZ}(y,z).$$

Since $Q_{YZ} = P_{YZ}$ and $Q_{X|YZ} = P_{X|Y}$ $Q$-a.e. by construction, the previous formula implies that

$$\int_{F_X \times F_Y \times F_Z} g\, dQ = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} g\, dP_{X|Y}.$$

This proves (5.25). Since $M_{XYZ} \gg P_{XYZ}$, we also have that $M_{XY} \gg P_{XY}$ and hence application of Theorem 5.3.1 yields

$$\int_{F_X \times F_Y \times F_Z} g\, dQ = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} g\, f_{X|Y}\, dM_{X|Y} = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} f_{X|YZ}\, dM_{X|Y}.$$

By assumption, however, $M_{X|Y} = M_{X|YZ}$ a.e. and therefore

$$\int_{F_X \times F_Y \times F_Z} g\, dQ = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} f_{X|YZ}\, dM_{X|YZ} = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} dP_{X|YZ} = P_{XYZ}(F_X \times F_Y \times F_Z),$$

where the final step follows from iterated expectation. This proves (5.21) and that $Q \gg P_{XYZ}$.

 

 

 

To prove (5.22) we proceed in a similar manner and replace $g$ by $f_{X|Y} f_{ZY}$ and replace $Q$ by $\hat{M}_{XYZ} = M_{X\times Z|Y}$. Also abbreviate $P_{X\times Z|Y}$ to $\hat{P}$. As in the proof of (5.21) we have, since $Z \to Y \to X$ is a Markov chain under $M$, that

$$\int_{F_X \times F_Y \times F_Z} g\, dQ = \int_{F_Y \times F_Z} dM_{YZ} \int_{F_X} g\, dM_{X|Y}$$
$$= \int_{F_Y \times F_Z} f_{ZY}\, dM_{YZ} \int_{F_X} f_{X|Y}\, dM_{X|Y} = \int_{F_Y \times F_Z} dP_{YZ} \int_{F_X} f_{X|Y}\, dM_{X|Y}.$$

From Theorem 5.3.1 this is

$$\int_{F_Y \times F_Z} P_{X|Y}(F_X|y)\, dP_{YZ}.$$

But $\hat{P}_{YZ} = P_{YZ}$ and

$$\hat{P}_{X|Y}(F_X|y) = P_{X|Y}(F_X|y) = \hat{P}_{X|YZ}(F_X|y,z)$$

since $\hat{P}$ yields a Markov chain. Thus the previous formula is $\hat{P}(F_X \times F_Y \times F_Z)$, proving (5.22) and the corresponding absolute continuity.

If $Z \to Y \to X$ is a Markov chain under both $M$ and $P$, then $P_{X\times Z|Y} = P_{XYZ}$ and hence

$$\frac{dP_{XYZ}}{dP_{X\times Z|Y}} = 1 = \frac{f_{X|YZ}}{f_{X|Y}},$$

which implies (5.27). Conversely, if (5.27) holds, then $f_{X|YZ} = f_{X|Y}$, which with (5.21) implies that $P_{XYZ} = P_{X\times Z|Y}$, proving that $Z \to Y \to X$ is a Markov chain under $P$. $\Box$

The previous corollary and one of the constructions used will prove important later, and hence it is emphasized now with a definition and another corollary giving an interesting interpretation.

 

 

 

 

 

Given a distribution $P_{XYZ}$, define the distribution $P_{X\times Z|Y}$ as the Markov approximation to $P_{XYZ}$. Abbreviate $P_{X\times Z|Y}$ to $\hat{P}$. The definition has two motivations. First, the distribution $\hat{P}$ makes $Z \to Y \to X$ a Markov chain which has the same initial distribution $\hat{P}_{ZY} = P_{ZY}$ and the same conditional distribution $\hat{P}_{X|Y} = P_{X|Y}$; the only difference is that $\hat{P}$ yields a Markov chain, that is, $\hat{P}_{X|ZY} = \hat{P}_{X|Y}$. The second motivation is the following corollary, which shows that of all Markov distributions, $\hat{P}$ is the closest to $P$ in the sense of minimizing the divergence.

Corollary 5.3.4: Given a distribution $P = P_{XYZ}$, let $\mathcal{M}$ denote the class of all distributions for $XYZ$ for which $Z \to Y \to X$ is a Markov chain under $M_{XYZ}$ (that is, $M_{XYZ} = M_{X\times Z|Y}$). Then

$$\inf_{M_{XYZ} \in \mathcal{M}} D(P_{XYZ}\|M_{XYZ}) = D(P_{XYZ}\|P_{X\times Z|Y});$$

that is, the infimum is a minimum and it is achieved by the Markov approximation.

Proof: If no $M_{XYZ}$ in the constraint set satisfies $M_{XYZ} \gg P_{XYZ}$, then both sides of the above equation are infinite. Hence confine interest to the case $M_{XYZ} \gg P_{XYZ}$. Similarly, if all such $M_{XYZ}$ yield an infinite divergence, we are done. Hence we also consider only $M_{XYZ}$ yielding finite divergence. Then the previous corollary implies that $M_{XYZ} \gg P_{X\times Z|Y} \gg P_{XYZ}$ and hence

$$D(P_{XYZ}\|M_{XYZ}) = D(P_{XYZ}\|P_{X\times Z|Y}) + D(P_{X\times Z|Y}\|M_{XYZ}) \geq D(P_{XYZ}\|P_{X\times Z|Y})$$

with equality if and only if

$$D(P_{X\times Z|Y}\|M_{XYZ}) = D(P_{YZ}\|M_{YZ}) + H_{P\|M}(X|Y) = 0.$$

But this will be zero if $M$ is the Markov approximation to $P$, since then $M_{YZ} = P_{YZ}$ and $M_{X|Y} = P_{X|Y}$ by construction. $\Box$
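The divergence decomposition used in this proof can be checked numerically on finite alphabets. The sketch below reuses the conditional-independence construction from earlier (array axes ordered $[x, y, z]$); the particular pmfs are arbitrary choices, not from the text.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def markov_approx(pmf):
    """Force X and Z conditionally independent given Y (axes [x, y, z])."""
    out = np.zeros_like(pmf)
    for y in range(pmf.shape[1]):
        p_y = pmf[:, y, :].sum()
        if p_y > 0:
            out[:, y, :] = np.outer(pmf[:, y, :].sum(1), pmf[:, y, :].sum(0)) / p_y
    return out

rng = np.random.default_rng(2)
p = rng.random((2, 3, 2)); p /= p.sum()        # arbitrary P_XYZ
q = rng.random((2, 3, 2)); q /= q.sum()
m = markov_approx(q)                           # a dominating measure under which Z -> Y -> X is Markov
p_hat = markov_approx(p)                       # the Markov approximation of P

lhs = kl(p.ravel(), m.ravel())
rhs = kl(p.ravel(), p_hat.ravel()) + kl(p_hat.ravel(), m.ravel())
print(lhs, rhs)                                # equal, and both are at least D(P || P_hat)
```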

Generalized Conditional Relative Entropy

We now return to the issue of providing a general definition of conditional relative entropy, that is, one which does not require the existence of the densities or, equivalently, the absolute continuity of the underlying measures. We require, however, that the general definition reduce to that considered thus far when the densities exist, so that all of the results of this section will remain valid when applicable. The general definition takes advantage of the basic construction of the early part of this section. Once again let $M_{XY}$ and $P_{XY}$ be two measures, where we no longer assume that $M_{XY} \gg P_{XY}$. Define as in Theorem 5.3.1 the modified measure $S_{XY}$ by

$$S_{XY}(F \times G) = \int_G M_{X|Y}(F|y)\, dP_Y(y), \qquad (5.29)$$

that is, $S_{XY}$ has the same $Y$ marginal as $P_{XY}$ and the same conditional distribution of $X$ given $Y$ as $M_{XY}$. We now replace the previous definition by the following: the conditional relative entropy is defined by

$$H_{P\|M}(X|Y) = D(P_{XY}\|S_{XY}). \qquad (5.30)$$

If $M_{XY} \gg P_{XY}$ as before, then from Theorem 5.3.1 this is the same quantity as the original definition and there is no change. The divergence of (5.30), however, is well-defined even if it is not true that $M_{XY} \gg P_{XY}$ and hence the densities used in the original definition do not work. The key question is whether or not the chain rule

$$H_{P\|M}(Y) + H_{P\|M}(X|Y) = H_{P\|M}(XY) \qquad (5.31)$$

remains valid in the more general setting. It has already been proven in the case that $M_{XY} \gg P_{XY}$, hence suppose this does not hold. In this case, if it is also true that $M_Y \gg P_Y$ does not hold, then both the marginal and joint relative entropies will be infinite and (5.31) again must hold since the conditional relative entropy is nonnegative. Thus we need only show that the formula holds for the case where $M_Y \gg P_Y$ but it is not true that $M_{XY} \gg P_{XY}$. By assumption there must be an event $F$ for which

$$M_{XY}(F) = \int M_{X|Y}(F_y|y)\, dM_Y(y) = 0$$

but

$$P_{XY}(F) = \int P_{X|Y}(F_y|y)\, dP_Y(y) \neq 0,$$

where $F_y = \{x : (x, y) \in F\}$ is the section of $F$ at $y$. Thus $M_{X|Y}(F_y|y) = 0$ $M_Y$-a.e. and hence also $P_Y$-a.e. since $M_Y \gg P_Y$. Thus

$$S_{XY}(F) = \int M_{X|Y}(F_y|y)\, dP_Y(y) = 0$$

and hence it is not true that $S_{XY} \gg P_{XY}$, and therefore

$$D(P_{XY}\|S_{XY}) = \infty,$$

which proves that the chain rule holds in the general case.

It can happen that $P_{XY}$ is not absolutely continuous with respect to $M_{XY}$ and yet $D(P_{XY}\|S_{XY}) < \infty$, in which case $P_{XY} \ll S_{XY}$ and

$$H_{P\|M}(X|Y) = \int dP_{XY} \ln \frac{dP_{XY}}{dS_{XY}},$$

in which case it makes sense to define the conditional density

$$f_{X|Y} \equiv \frac{dP_{XY}}{dS_{XY}}$$

so that exactly as in the original tentative definition in terms of densities (5.17) we have that

$$H_{P\|M}(X|Y) = \int dP_{XY} \ln f_{X|Y}.$$

Note that this allows us to define a meaningful conditional density even though the joint density $f_{XY}$ does not exist! If the joint density does exist, then the conditional density reduces to the previous definition from Theorem 5.3.1.

We summarize the generalization in the following theorem.

Theorem 5.3.2: The conditional relative entropy defined by (5.30) and (5.29) agrees with the definition (5.17) in terms of densities and satisfies the chain rule (5.31). If the conditional relative entropy is finite, then

$$H_{P\|M}(X|Y) = \int dP_{XY} \ln f_{X|Y},$$

where the conditional density is defined by

$$f_{X|Y} \equiv \frac{dP_{XY}}{dS_{XY}}.$$

If $M_{XY} \gg P_{XY}$, then this reduces to the usual definition

$$f_{X|Y} = \frac{f_{XY}}{f_Y}.$$

The generalizations can be extended to three or more random variables in the obvious manner.
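For finite alphabets the measure $S_{XY}$ is easy to form directly, and the chain rule (5.31) can then be checked numerically. A minimal sketch with arbitrarily chosen pmfs (here $M_{XY}$ has full support, so the result also matches the density definition):

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(3)
p_xy = rng.random((4, 3)); p_xy /= p_xy.sum()      # arbitrary P_XY  (rows: x, columns: y)
m_xy = rng.random((4, 3)); m_xy /= m_xy.sum()      # dominating M_XY (full support here)

p_y, m_y = p_xy.sum(0), m_xy.sum(0)
m_x_given_y = m_xy / m_y                           # columns are M_{X|Y}(.|y)
s_xy = m_x_given_y * p_y                           # S_XY: M's conditional, P's Y-marginal

h_cond = kl(p_xy.ravel(), s_xy.ravel())            # H_{P||M}(X|Y) = D(P_XY || S_XY)
h_joint = kl(p_xy.ravel(), m_xy.ravel())           # H_{P||M}(XY)
h_marg = kl(p_y, m_y)                              # H_{P||M}(Y)
print(np.isclose(h_marg + h_cond, h_joint))        # chain rule (5.31): True
```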

5.4 Limiting Entropy Densities

We now combine several of the results of the previous section to obtain results characterizing the limits of certain relative entropy densities.

Lemma 5.4.1: Given a probability space $(\Omega, \mathcal{B})$, an asymptotically generating sequence of sub-$\sigma$-fields $\mathcal{F}_n$, and two measures $M \gg P$, let $P_n = P_{\mathcal{F}_n}$, $M_n = M_{\mathcal{F}_n}$ and let $h_n = \ln dP_n/dM_n$ and $h = \ln dP/dM$ denote the entropy densities. If $D(P\|M) < \infty$, then

$$\lim_{n\to\infty} \int |h_n - h|\, dP = 0,$$

that is, $h_n \to h$ in $L^1$. Thus the entropy densities $h_n$ are uniformly integrable.

Proof: Follows from Corollaries 5.2.3 and 5.2.6. $\Box$

The following lemma is Lemma 1 of Algoet and Cover [7].


Lemma 5.4.2: Given a sequence of nonnegative random variables $\{f_n\}$ defined on a probability space $(\Omega, \mathcal{B}, P)$, suppose that

$$E(f_n) \leq 1, \quad \text{all } n.$$

Then

$$\limsup_{n\to\infty} \frac{1}{n} \ln f_n \leq 0.$$

Proof: Given any $\epsilon > 0$, the Markov inequality and the given assumption imply that

$$P(f_n > e^{n\epsilon}) \leq \frac{E(f_n)}{e^{n\epsilon}} \leq e^{-n\epsilon}.$$

We therefore have that

$$P\Big(\frac{1}{n} \ln f_n \geq \epsilon\Big) \leq e^{-n\epsilon}$$

and therefore

$$\sum_{n=1}^{\infty} P\Big(\frac{1}{n} \ln f_n \geq \epsilon\Big) \leq \sum_{n=1}^{\infty} e^{-n\epsilon} = \frac{1}{e^{\epsilon} - 1} < \infty.$$

Thus from the Borel-Cantelli lemma (Lemma 4.6.3 of [50]), $P(n^{-1} \ln f_n \geq \epsilon \text{ i.o.}) = 0$. Since $\epsilon$ is arbitrary, the lemma is proved. $\Box$

The lemma easily gives the first half of the following result, which is also due to Algoet and Cover [7], but the proof here is different and does not use martingale theory. The result is the generalization of Lemma 2.7.1.

Theorem 5.4.1: Given a probability space $(\Omega, \mathcal{B})$ and an asymptotically generating sequence of sub-$\sigma$-fields $\mathcal{F}_n$, let $M$ and $P$ be two probability measures with their restrictions $M_n = M_{\mathcal{F}_n}$ and $P_n = P_{\mathcal{F}_n}$. Suppose that $M_n \gg P_n$ for all $n$ and define $f_n = dP_n/dM_n$ and $h_n = \ln f_n$. Then

$$\limsup_{n\to\infty} \frac{1}{n} h_n \leq 0, \quad M\text{-a.e.}$$

and

$$\liminf_{n\to\infty} \frac{1}{n} h_n \geq 0, \quad P\text{-a.e.}$$

If it is also true that $M \gg P$ (e.g., $D(P\|M) < \infty$), then

$$\lim_{n\to\infty} \frac{1}{n} h_n = 0, \quad P\text{-a.e.}$$

Proof: Since

$$E_M f_n = E_{M_n} f_n = 1,$$

the first statement follows from the previous lemma. To prove the second statement consider the probability

$$P\Big(-\frac{1}{n} \ln \frac{dP_n}{dM_n} > \epsilon\Big) = P\Big(-\frac{1}{n} \ln f_n > \epsilon\Big) = P(f_n < e^{-n\epsilon})$$
$$= \int_{f_n < e^{-n\epsilon}} dP_n = \int_{f_n < e^{-n\epsilon}} f_n\, dM_n < e^{-n\epsilon} \int_{f_n < e^{-n\epsilon}} dM_n = e^{-n\epsilon} M_n(f_n < e^{-n\epsilon}) \leq e^{-n\epsilon}.$$

Thus it has been shown that

$$P\Big(\frac{1}{n} h_n < -\epsilon\Big) \leq e^{-n\epsilon}$$

and hence again applying the Borel-Cantelli lemma we have that

$$P(n^{-1} h_n \leq -\epsilon \text{ i.o.}) = 0,$$

which proves the second claim of the theorem.

If $M \gg P$, then the first result also holds $P$-a.e., which with the second result proves the final claim. $\Box$

Barron [9] provides an additional property of the sequence $h_n/n$: if $M \gg P$, then the sequence $h_n/n$ is dominated by an integrable function.
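A concrete way to see Theorem 5.4.1 (a sketch, not from the text): take $\Omega = [0,1)$ with $\mathcal{F}_n$ generated by the dyadic intervals of length $2^{-n}$, let $M$ be Lebesgue measure, and let $P$ have density $2x$. Then $h_n(\omega) = \ln[P(I_n(\omega))/M(I_n(\omega))]$, where $I_n(\omega)$ is the dyadic interval containing $\omega$, converges to $\ln(dP/dM)(\omega)$, so $h_n/n \to 0$. The sample size and the choice of densities below are arbitrary.

```python
import numpy as np

# P has density 2x on [0,1); M is uniform (Lebesgue). M >> P and D(P||M) is finite.
rng = np.random.default_rng(4)
omega = np.sqrt(rng.random(5000))          # samples from P (inverse CDF of F(x) = x^2)

def h_n(omega, n):
    """Entropy density ln dP_n/dM_n on the dyadic field F_n."""
    k = np.floor(omega * 2**n)             # index of the dyadic interval containing omega
    a, b = k / 2**n, (k + 1) / 2**n
    p_cell = b**2 - a**2                   # P of the interval under density 2x
    m_cell = b - a                         # M (Lebesgue) measure of the interval
    return np.log(p_cell / m_cell)

for n in (1, 5, 10, 20):
    print(n, np.mean(np.abs(h_n(omega, n) / n)))   # shrinks toward 0: (1/n) h_n -> 0
```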

5.5 Information for General Alphabets

We can now use the divergence results of the previous sections to generalize the definitions of information and to develop their basic properties. We assume now that all random variables and processes are defined on a common underlying probability space $(\Omega, \mathcal{B}, P)$. Since we have seen how all of the various information quantities (entropy, mutual information, conditional mutual information) can be expressed in terms of divergence in the finite case, we immediately have definitions for the general case. Given two random variables $X$ and $Y$, define the average mutual information between them by

$$I(X;Y) = D(P_{XY}\|P_X \times P_Y), \qquad (5.32)$$

where $P_{XY}$ is the joint distribution of the random variables $X$ and $Y$ and $P_X \times P_Y$ is the product distribution.

Define the entropy of a single random variable $X$ by

$$H(X) = I(X;X). \qquad (5.33)$$

From the definition of divergence this implies that

$$I(X;Y) = \sup_{Q} H_{P_{XY}\|P_X \times P_Y}(Q).$$

From Dobrushin's theorem (Lemma 5.2.2), the supremum can be taken over partitions whose elements are contained in a generating field. Letting the generating field be the field of all rectangles of the form $F \times G$, $F \in \mathcal{B}_{A_X}$ and $G \in \mathcal{B}_{A_Y}$, we have the following lemma, which is often used as a definition of mutual information.


Lemma 5.5.1:

$$I(X;Y) = \sup_{q,r} I(q(X); r(Y)),$$

where the supremum is over all quantizers $q$ and $r$ of $A_X$ and $A_Y$. Hence there exist sequences of increasingly fine quantizers $q_n : A_X \to A_n$ and $r_n : A_Y \to B_n$ such that

$$I(X;Y) = \lim_{n\to\infty} I(q_n(X); r_n(Y)).$$

Applying this result to entropy we have that

$$H(X) = \sup_{q} H(q(X)),$$

where the supremum is over all quantizers.

By "increasingly fine" quantizers is meant that the corresponding partitions $Q_n = \{q_n^{-1}(a);\ a \in A_n\}$ are successive refinements, e.g., atoms in $Q_n$ are unions of atoms in $Q_{n+1}$. (If this were not so, a new quantizer could be defined for which it was true.) There is an important drawback to the lemma (which will shortly be removed in Lemma 5.5.5 for the special case where the alphabets are standard): the quantizers which approach the suprema may depend on the underlying measure $P_{XY}$. In particular, a sequence of quantizers which works for one measure need not work for another.
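To illustrate Lemma 5.5.1, here is a sketch (not from the text) that estimates $I(X;Y)$ for a jointly Gaussian pair through a sequence of increasingly fine uniform quantizers; the closed-form limit is $-\tfrac{1}{2}\ln(1-\rho^2)$. The truncation range, bin counts, and sample size are arbitrary choices, and the plug-in estimates carry some finite-sample bias.

```python
import numpy as np

def mutual_information_from_counts(counts):
    """Plug-in I(q(X); r(Y)) in nats from a 2-D contingency table."""
    p = counts / counts.sum()
    px, py = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask])))

rho = 0.8
rng = np.random.default_rng(5)
z = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=500_000)
x, y = z[:, 0], z[:, 1]

for bins in (2, 4, 8, 16, 32):            # increasingly fine uniform quantizers on [-4, 4]
    edges = np.linspace(-4, 4, bins + 1)
    counts, _, _ = np.histogram2d(x, y, bins=[edges, edges])
    print(bins, mutual_information_from_counts(counts))   # climbs toward the limit below

print("limit:", -0.5 * np.log(1 - rho**2))   # I(X;Y) for the Gaussian pair, about 0.511 nats
```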

Given a third random variable $Z$, let $A_X$, $A_Y$, and $A_Z$ denote the alphabets of $X$, $Y$, and $Z$ and define the conditional average mutual information

$$I(X;Y|Z) = D(P_{XYZ}\|P_{X\times Y|Z}). \qquad (5.34)$$

This is the extension of the discrete alphabet definition of (2.25) and it makes sense only if the distribution $P_{X\times Y|Z}$ exists, which is the case if the alphabets are standard but may not be the case otherwise. We shall later provide an alternative definition due to Wyner [152] that is valid more generally and equal to the above when the spaces are standard.

Note that $I(X;Y|Z)$ can be interpreted using Corollary 5.3.4 as the divergence between $P_{XYZ}$ and its Markov approximation.

Combining these definitions with Lemma 5.2.1 yields the following generalizations of the discrete alphabet results.

Lemma 5.5.2: Given two random variables $X$ and $Y$, then

$$I(X;Y) \geq 0$$

with equality if and only if $X$ and $Y$ are independent. Given three random variables $X$, $Y$, and $Z$, then

$$I(X;Y|Z) \geq 0$$

with equality if and only if $Y \to Z \to X$ forms a Markov chain.

Proof: The first statement follows from Lemma 5.2.1 since $X$ and $Y$ are independent if and only if $P_{XY} = P_X \times P_Y$. The second statement follows


from (5.5.3) and the fact that $Y \to Z \to X$ is a Markov chain if and only if $P_{XYZ} = P_{X\times Y|Z}$ (see, e.g., Corollary 5.10.1 of [50]). $\Box$
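A finite-alphabet sanity check of the second statement (a sketch with arbitrary pmfs, axes ordered $[x, y, z]$): the divergence from $P_{XYZ}$ to its conditional-independence version is positive in general and vanishes exactly when $X$ and $Y$ are conditionally independent given $Z$.

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def cond_indep_approx(p_xyz):
    """P_{X x Y | Z}: same P_Z, P_{X|Z}, P_{Y|Z}, with X and Y independent given Z.
    Axes are ordered [x, y, z]."""
    out = np.zeros_like(p_xyz)
    for z in range(p_xyz.shape[2]):
        p_z = p_xyz[:, :, z].sum()
        if p_z > 0:
            out[:, :, z] = np.outer(p_xyz[:, :, z].sum(1), p_xyz[:, :, z].sum(0)) / p_z
    return out

rng = np.random.default_rng(6)
p = rng.random((2, 2, 3)); p /= p.sum()
print(kl(p.ravel(), cond_indep_approx(p).ravel()))           # I(X;Y|Z) > 0 in general

markov = cond_indep_approx(p)                                # force X and Y independent given Z
print(kl(markov.ravel(), cond_indep_approx(markov).ravel())) # 0: Y -> Z -> X gives I(X;Y|Z) = 0
```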

The properties of divergence provide means of computing and approximating these information measures. From Lemma 5.2.3, if $I(X;Y)$ is finite, then

$$I(X;Y) = \int \ln \frac{dP_{XY}}{d(P_X \times P_Y)}\, dP_{XY} \qquad (5.35)$$

and if $I(X;Y|Z)$ is finite, then

$$I(X;Y|Z) = \int \ln \frac{dP_{XYZ}}{dP_{X\times Y|Z}}\, dP_{XYZ}. \qquad (5.36)$$

For example, if $X, Y$ are two random variables whose distribution is absolutely continuous with respect to Lebesgue measure $dx\,dy$ and hence which have a pdf $f_{XY}(x,y) = dP_{XY}(x,y)/dx\,dy$, then

$$I(X;Y) = \int dx\, dy\, f_{XY}(x,y) \ln \frac{f_{XY}(x,y)}{f_X(x) f_Y(y)},$$

where $f_X$ and $f_Y$ are the marginal pdfs, e.g.,

$$f_X(x) = \int f_{XY}(x,y)\, dy = \frac{dP_X(x)}{dx}.$$
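As a numerical sanity check of this formula (a sketch, not from the text), the integral can be evaluated on a grid for a bivariate Gaussian with unit variances and correlation $\rho$, for which $I(X;Y) = -\tfrac{1}{2}\ln(1-\rho^2)$; the grid spacing and the truncation to $[-6,6]^2$ are arbitrary choices.

```python
import numpy as np

rho = 0.6
t = np.linspace(-6, 6, 601)               # truncate the plane to [-6, 6]^2
dx = t[1] - t[0]
x, y = np.meshgrid(t, t, indexing="ij")

# Bivariate Gaussian pdf with unit variances and correlation rho, and its marginals.
f_xy = np.exp(-(x**2 - 2*rho*x*y + y**2) / (2*(1 - rho**2))) / (2*np.pi*np.sqrt(1 - rho**2))
f_x = np.exp(-x**2 / 2) / np.sqrt(2*np.pi)
f_y = np.exp(-y**2 / 2) / np.sqrt(2*np.pi)

integrand = f_xy * np.log(f_xy / (f_x * f_y))
print(np.sum(integrand) * dx * dx)        # numerical value of the density formula
print(-0.5 * np.log(1 - rho**2))          # closed form, about 0.2231 nats
```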

In the cases where these densities exist, we define the information densities

$$i_{X;Y} = \ln \frac{dP_{XY}}{d(P_X \times P_Y)} \qquad (5.37)$$

$$i_{X;Y|Z} = \ln \frac{dP_{XYZ}}{dP_{X\times Y|Z}}.$$

The results of Section 5.3 can be used to provide conditions under which the various information densities exist and to relate them to each other. Corollaries 5.3.1 and 5.3.2 combined with the definition of mutual information immediately yield the following two results.

Lemma 5.5.3: Let $X$ and $Y$ be standard alphabet random variables with distribution $P_{XY}$. Suppose that there exists a product distribution $M_{XY} = M_X \times M_Y$ such that $M_{XY} \gg P_{XY}$. Then

$$M_{XY} \gg P_X \times P_Y \gg P_{XY},$$

$$i_{X;Y} = \ln(f_{XY}/f_X f_Y) = \ln(f_{X|Y}/f_X)$$

and

$$I(X;Y) + H_{P\|M}(X) = H_{P\|M}(X|Y). \qquad (5.38)$$