
4.5 FUNCTIONS OF MARKOV CHAINS


It would be useful computationally to have upper and lower bounds converging to the limit from above and below. We can halt the computation when the difference between upper and lower bounds is small, and we will then have a good estimate of the limit.

We already know that H(Yn|Yn−1, . . . , Y1) converges monotonically to H(Y) from above. For a lower bound, we will use H(Yn|Yn−1, . . . , Y1, X1). This is a neat trick based on the idea that X1 contains as much information about Yn as Y1, Y0, Y−1, . . . .
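To make this bracketing concrete, here is a minimal Python sketch (an illustration, not from the text) that brute-forces both conditional entropies for a small stationary Markov chain and a deterministic function φ, halting once the bounds agree to within a tolerance. The transition matrix, the function φ, and all helper names are assumptions chosen for the example.

```python
# A minimal sketch (assumed example): bracket the entropy rate H(Y) of Y_i = phi(X_i)
# between the upper bound H(Y_n | Y_{n-1},...,Y_1) and the lower bound
# H(Y_n | Y_{n-1},...,Y_1, X_1), stopping when the two agree to within a tolerance.
from itertools import product
from math import log2

P = [[0.9, 0.1, 0.0],          # hypothetical 3-state transition matrix
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]]
phi = [0, 0, 1]                # phi collapses states 0 and 1; state 2 maps to 1

def stationary(P, iters=5000):
    """Approximate the stationary distribution by power iteration."""
    mu = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return mu

def entropy(dist):
    """Entropy (in bits) of a dict mapping outcomes to probabilities."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def bounds(n, P, phi, mu):
    """Return (upper, lower) = (H(Y_n|Y^{n-1}), H(Y_n|Y^{n-1}, X_1)) by enumeration."""
    dist_y, dist_y1, dist_xy, dist_xy1 = {}, {}, {}, {}
    for xs in product(range(len(P)), repeat=n):
        p = mu[xs[0]]
        for a, b in zip(xs, xs[1:]):
            p *= P[a][b]
        if p == 0:
            continue
        ys = tuple(phi[x] for x in xs)
        dist_y[ys] = dist_y.get(ys, 0.0) + p
        dist_y1[ys[:-1]] = dist_y1.get(ys[:-1], 0.0) + p
        dist_xy[(xs[0], ys)] = dist_xy.get((xs[0], ys), 0.0) + p
        dist_xy1[(xs[0], ys[:-1])] = dist_xy1.get((xs[0], ys[:-1]), 0.0) + p
    upper = entropy(dist_y) - entropy(dist_y1)       # H(Y_n | Y_{n-1},...,Y_1)
    lower = entropy(dist_xy) - entropy(dist_xy1)     # H(Y_n | Y_{n-1},...,Y_1, X_1)
    return upper, lower

mu = stationary(P)
for n in range(2, 12):
    upper, lower = bounds(n, P, phi, mu)
    print(f"n={n:2d}  lower={lower:.6f}  upper={upper:.6f}")
    if upper - lower < 1e-3:                         # halt once the bracket is tight
        break
```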

Lemma 4.5.1

$$H(Y_n \mid Y_{n-1}, \ldots, Y_2, X_1) \le H(\mathcal{Y}). \tag{4.57}$$

Proof: We have for k = 1, 2, . . . ,

 

 

 

$$
\begin{aligned}
H(Y_n \mid Y_{n-1}, \ldots, Y_2, X_1)
&\overset{(a)}{=} H(Y_n \mid Y_{n-1}, \ldots, Y_2, Y_1, X_1) && (4.58)\\
&\overset{(b)}{=} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1, X_0, X_{-1}, \ldots, X_{-k}) && (4.59)\\
&\overset{(c)}{=} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1, X_0, X_{-1}, \ldots, X_{-k}, Y_0, \ldots, Y_{-k}) && (4.60)\\
&\overset{(d)}{\le} H(Y_n \mid Y_{n-1}, \ldots, Y_1, Y_0, \ldots, Y_{-k}) && (4.61)\\
&\overset{(e)}{=} H(Y_{n+k+1} \mid Y_{n+k}, \ldots, Y_1), && (4.62)
\end{aligned}
$$

where (a) follows from the fact that Y1 is a function of X1, (b) follows from the Markovity of X, (c) follows from the fact that Yi is a function of Xi, (d) follows from the fact that conditioning reduces entropy, and (e) follows by stationarity. Since the inequality is true for all k, it is true in the limit. Thus,

$$
\begin{aligned}
H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) &\le \lim_{k \to \infty} H(Y_{n+k+1} \mid Y_{n+k}, \ldots, Y_1) && (4.63)\\
&= H(\mathcal{Y}). && (4.64)
\end{aligned}
$$

The next lemma shows that the interval between the upper and the lower bounds decreases in length.

Lemma 4.5.2

$$H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \to 0. \tag{4.65}$$


 

Proof: The interval length can be rewritten as

$$H(Y_n \mid Y_{n-1}, \ldots, Y_1) - H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1). \tag{4.66}$$

By the properties of mutual information,

 

$$I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1), \tag{4.67}$$

and I (X1; Y1, Y2, . . . , Yn) increases with n. Thus, lim I (X1; Y1, Y2, . . . , Yn) exists and

$$\lim_{n \to \infty} I(X_1; Y_1, Y_2, \ldots, Y_n) \le H(X_1). \tag{4.68}$$

By the chain rule,

 

 

 

 

 

 

$$
\begin{aligned}
H(X_1) &\ge \lim_{n \to \infty} I(X_1; Y_1, Y_2, \ldots, Y_n) && (4.69)\\
&= \lim_{n \to \infty} \sum_{i=1}^{n} I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1) && (4.70)\\
&= \sum_{i=1}^{\infty} I(X_1; Y_i \mid Y_{i-1}, \ldots, Y_1). && (4.71)
\end{aligned}
$$

Since this infinite sum is finite and the terms are nonnegative, the terms must tend to 0; that is,

$$\lim_{n \to \infty} I(X_1; Y_n \mid Y_{n-1}, \ldots, Y_1) = 0, \tag{4.72}$$

which proves the lemma.

 

Combining Lemmas 4.5.1 and 4.5.2, we have the following theorem.

Theorem 4.5.1 If X1, X2, . . . , Xn form a stationary Markov chain, and

Yi = φ (Xi ), then

$$H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le H(\mathcal{Y}) \le H(Y_n \mid Y_{n-1}, \ldots, Y_1) \tag{4.73}$$

and

$$\lim_{n \to \infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = H(\mathcal{Y}) = \lim_{n \to \infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1). \tag{4.74}$$

In general, we could also consider the case where Yi is a stochastic function (as opposed to a deterministic function) of Xi . Consider a Markov


process X1, X2, . . . , Xn, and define a new process Y1, Y2, . . . , Yn, where each Yi is drawn according to p(yi|xi), conditionally independent of all the other Xj, j ≠ i; that is,

$$p(x^n, y^n) = p(x_1) \prod_{i=1}^{n-1} p(x_{i+1} \mid x_i) \prod_{i=1}^{n} p(y_i \mid x_i). \tag{4.75}$$

 

Such a process, called a hidden Markov model (HMM), is used extensively in speech recognition, handwriting recognition, and so on. The same argument as that used above for functions of a Markov chain carries over to hidden Markov models, and we can lower bound the entropy rate of a hidden Markov model by conditioning it on the underlying Markov state. The details of the argument are left to the reader.
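As a small illustration (an assumed example, not from the text), the sketch below draws one sample path from the factorization in (4.75); the transition matrix P, emission matrix Q, and initial law μ are arbitrary choices. The bracketing computation sketched earlier in this section carries over by replacing the deterministic map φ with the emission distribution p(y|x).

```python
# A minimal sketch (assumed example): sample (X^n, Y^n) from the HMM factorization
# p(x^n, y^n) = p(x_1) * prod_i p(x_{i+1}|x_i) * prod_i p(y_i|x_i).
import random

P  = [[0.9, 0.1], [0.2, 0.8]]   # hypothetical state transitions p(x_{i+1}|x_i)
Q  = [[0.7, 0.3], [0.1, 0.9]]   # hypothetical emissions p(y_i|x_i)
mu = [2/3, 1/3]                 # stationary distribution of P (mu P = mu)

def draw(dist):
    """Sample an index according to a probability vector."""
    r, acc = random.random(), 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1

def sample_hmm(n):
    """Return one realization (x_1..x_n, y_1..y_n) of the hidden Markov model."""
    xs, ys = [], []
    x = draw(mu)
    for _ in range(n):
        xs.append(x)
        ys.append(draw(Q[x]))   # Y_i depends only on X_i
        x = draw(P[x])          # X_{i+1} depends only on X_i
    return xs, ys

xs, ys = sample_hmm(10)
print("X:", xs)
print("Y:", ys)
```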

SUMMARY

Entropy rate. Two definitions of entropy rate for a stochastic process are

$$
\begin{aligned}
H(\mathcal{X}) &= \lim_{n \to \infty} \frac{1}{n} H(X_1, X_2, \ldots, X_n), && (4.76)\\
H'(\mathcal{X}) &= \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1). && (4.77)
\end{aligned}
$$

For a stationary stochastic process,

$$H(\mathcal{X}) = H'(\mathcal{X}). \tag{4.78}$$

Entropy rate of a stationary Markov chain

$$H(\mathcal{X}) = -\sum_{ij} \mu_i P_{ij} \log P_{ij}. \tag{4.79}$$
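A minimal sketch of (4.79) in Python, assuming a hypothetical transition matrix: the stationary distribution μ is approximated by power iteration, and the entropy rate is the μ-weighted average of the row entropies of P.

```python
# Entropy rate of a stationary Markov chain, Eq. (4.79):
#   H = -sum_{ij} mu_i P_ij log P_ij,  with mu the stationary distribution (mu P = mu).
# The transition matrix below is an arbitrary illustrative choice.
from math import log2

P = [[0.50, 0.50, 0.00],
     [0.25, 0.50, 0.25],
     [0.00, 0.50, 0.50]]

def stationary(P, iters=5000):
    """Approximate mu satisfying mu P = mu by repeated multiplication."""
    mu = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        mu = [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]
    return mu

def entropy_rate(P):
    mu = stationary(P)
    return -sum(mu[i] * P[i][j] * log2(P[i][j])
                for i in range(len(P)) for j in range(len(P)) if P[i][j] > 0)

print(f"entropy rate = {entropy_rate(P):.4f} bits per step")
```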

 

 

 

 

Second law of thermodynamics. For a Markov chain:

1. Relative entropy D(μn||μ′n) between two distributions on the states at time n decreases with time.

2. Relative entropy D(μn||μ) between a distribution on the states at time n and the stationary distribution decreases with time (illustrated numerically in the sketch following this list).

3. Entropy H(Xn) increases if the stationary distribution is uniform.

4. The conditional entropy H(Xn|X1) increases with time for a stationary Markov chain.

5. The conditional entropy H(X0|Xn) of the initial condition X0 increases for any Markov chain.
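The following small numerical sketch (an illustration, not from the text) demonstrates items 2 and 3 above for a hypothetical doubly stochastic chain: iterating μn+1 = μnP drives D(μn||μ) toward zero while H(Xn) climbs toward the entropy of the uniform distribution.

```python
# Illustration of the second-law statements: mu_{n+1} = mu_n P, with P doubly
# stochastic so that the stationary distribution mu is uniform.
# The matrix P and the initial distribution mu_0 are arbitrary illustrative choices.
from math import log2

P   = [[0.6, 0.3, 0.1],    # doubly stochastic: rows and columns each sum to 1
       [0.3, 0.4, 0.3],
       [0.1, 0.3, 0.6]]
mu  = [1.0, 0.0, 0.0]      # mu_0: start concentrated on state 0
uni = [1/3, 1/3, 1/3]      # stationary distribution of a doubly stochastic matrix

def step(mu, P):
    return [sum(mu[i] * P[i][j] for i in range(len(P))) for j in range(len(P))]

def kl(p, q):
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def H(p):
    return -sum(pi * log2(pi) for pi in p if pi > 0)

for n in range(6):
    print(f"n={n}  D(mu_n||mu) = {kl(mu, uni):.4f}   H(X_n) = {H(mu):.4f}")
    mu = step(mu, P)
```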

Functions of a Markov chain. If X1, X2, . . . , Xn form a stationary Markov chain and Yi = φ (Xi ), then

$$H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) \le H(\mathcal{Y}) \le H(Y_n \mid Y_{n-1}, \ldots, Y_1) \tag{4.80}$$

and

 

 

$$\lim_{n \to \infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1, X_1) = H(\mathcal{Y}) = \lim_{n \to \infty} H(Y_n \mid Y_{n-1}, \ldots, Y_1). \tag{4.81}$$

PROBLEMS

4.1 Doubly stochastic matrices. An n × n matrix P = [Pij] is said to be doubly stochastic if Pij ≥ 0 and Σj Pij = 1 for all i and Σi Pij = 1 for all j. An n × n matrix P is said to be a permutation matrix if it is doubly stochastic and there is precisely one Pij = 1 in each row and each column. It can be shown that every doubly stochastic matrix can be written as the convex combination of permutation matrices.

(a) Let a^t = (a1, a2, . . . , an), ai ≥ 0, Σ ai = 1, be a probability vector. Let b = aP, where P is doubly stochastic. Show that b is a probability vector and that H(b1, b2, . . . , bn) ≥ H(a1, a2, . . . , an). Thus, stochastic mixing increases entropy.

(b) Show that a stationary distribution μ for a doubly stochastic matrix P is the uniform distribution.

(c) Conversely, prove that if the uniform distribution is a stationary distribution for a Markov transition matrix P, then P is doubly stochastic.

4.2 Time's arrow. Let {Xi}, −∞ < i < ∞, be a stationary stochastic process. Prove that

H(X0|X−1, X−2, . . . , X−n) = H(X0|X1, X2, . . . , Xn).


In other words, the present has a conditional entropy given the past equal to the conditional entropy given the future. This is true even though it is quite easy to concoct stationary random processes for which the flow into the future looks quite different from the flow into the past. That is, one can determine the direction of time by looking at a sample function of the process. Nonetheless, given the present state, the conditional uncertainty of the next symbol in the future is equal to the conditional uncertainty of the previous symbol in the past.

4.3 Shuffles increase entropy. Argue that for any distribution on shuffles T and any distribution on card positions X that

$$
\begin{aligned}
H(TX) &\ge H(TX \mid T) && (4.82)\\
&= H(T^{-1}TX \mid T) && (4.83)\\
&= H(X \mid T) && (4.84)\\
&= H(X) && (4.85)
\end{aligned}
$$

if X and T are independent.

4.4 Second law of thermodynamics. Let X1, X2, X3, . . . be a stationary first-order Markov chain. In Section 4.4 it was shown that H(Xn | X1) ≥ H(Xn−1 | X1) for n = 2, 3, . . . . Thus, conditional uncertainty about the future grows with time. This is true although the unconditional uncertainty H(Xn) remains constant. However, show by example that H(Xn|X1 = x1) does not necessarily grow with n for every x1.

4.5 Entropy of a random tree. Consider the following method of generating a random tree with n nodes. First expand the root node:

Then expand one of the two terminal nodes at random:

At time k, choose one of the k − 1 terminal nodes according to a uniform distribution and expand it. Continue until n terminal nodes


have been generated. Thus, a sequence leading to a five-node tree might look like this:

Surprisingly, the following method of generating random trees yields the same probability distribution on trees with n terminal nodes. First choose an integer N1 uniformly distributed on {1, 2, . . . , n − 1}. We then have the picture

[Figure: the root node with two subtrees containing N1 and n − N1 terminal nodes.]

Then choose an integer N2 uniformly distributed over {1, 2, . . . , N1 − 1}, and independently choose another integer N3 uniformly over {1, 2, . . . , (n − N1) − 1}. The picture is now

[Figure: the four subtrees now contain N2, N1 − N2, N3, and n − N1 − N3 terminal nodes.]

Continue the process until no further subdivision can be made. (The equivalence of these two tree generation schemes follows, for example, from Polya’s urn model.)
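As a quick empirical sanity check (not part of the problem statement), the sketch below generates trees by both schemes, representing a tree as nested tuples with 1 standing for a terminal node, and compares the two empirical distributions for n = 4; the representation and all function names are illustrative choices.

```python
# Scheme A grows a tree by expanding a uniformly chosen terminal node;
# scheme B recursively splits n terminal nodes into (N1, n - N1) with N1 uniform.
# For small n the two empirical distributions should agree.
import random
from collections import Counter

def count(t):
    """Number of terminal nodes of a nested-tuple tree (a leaf is the integer 1)."""
    return 1 if t == 1 else count(t[0]) + count(t[1])

def expand(t, k):
    """Replace the k-th leaf (left to right) of t by an internal node with two leaves."""
    if t == 1:
        return (1, 1)
    left, right = t
    nl = count(left)
    return (expand(left, k), right) if k < nl else (left, expand(right, k - nl))

def grow(n):
    """Scheme A: start from the expanded root and expand random leaves up to n leaves."""
    tree, leaves = (1, 1), 2
    while leaves < n:
        tree = expand(tree, random.randrange(leaves))
        leaves += 1
    return tree

def split(n):
    """Scheme B: choose N1 uniform on {1,...,n-1}, then recurse on both sides."""
    if n == 1:
        return 1
    n1 = random.randint(1, n - 1)
    return (split(n1), split(n - n1))

trials = 50_000
a = Counter(grow(4) for _ in range(trials))
b = Counter(split(4) for _ in range(trials))
for t in sorted(set(a) | set(b), key=str):
    print(f"{str(t):26s}  scheme A: {a[t]/trials:.3f}   scheme B: {b[t]/trials:.3f}")
```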

Now let Tn denote a random n-node tree generated as described. The probability distribution on such trees seems difficult to describe, but we can find the entropy of this distribution in recursive form.

First some examples. For n = 2, we have only one tree. Thus, H (T2) = 0. For n = 3, we have two equally probable trees:


Thus, H(T3) = log 2. For n = 4, we have five possible trees, with probabilities 1/3, 1/6, 1/6, 1/6, 1/6.

Now for the recurrence relation. Let N1(Tn) denote the number of terminal nodes of Tn in the right half of the tree. Justify each of the steps in the following:

$$
\begin{aligned}
H(T_n) &\overset{(a)}{=} H(N_1, T_n) && (4.86)\\
&\overset{(b)}{=} H(N_1) + H(T_n \mid N_1) && (4.87)\\
&\overset{(c)}{=} \log(n-1) + H(T_n \mid N_1) && (4.88)\\
&\overset{(d)}{=} \log(n-1) + \frac{1}{n-1} \sum_{k=1}^{n-1} \left( H(T_k) + H(T_{n-k}) \right) && (4.89)\\
&\overset{(e)}{=} \log(n-1) + \frac{2}{n-1} \sum_{k=1}^{n-1} H(T_k) && (4.90)\\
&= \log(n-1) + \frac{2}{n-1} \sum_{k=1}^{n-1} H_k. && (4.91)
\end{aligned}
$$

(f) Use this to show that

$$(n-1)H_n = nH_{n-1} + (n-1)\log(n-1) - (n-2)\log(n-2) \tag{4.92}$$

or

$$\frac{H_n}{n} = \frac{H_{n-1}}{n-1} + c_n \tag{4.93}$$

for appropriately defined cn. Since Σ cn = c < ∞, you have proved that (1/n)H(Tn) converges to a constant. Thus, the expected number of bits necessary to describe the random tree Tn grows linearly with n.

4.6 Monotonicity of entropy per element. For a stationary stochastic process X1, X2, . . . , Xn, show that

(a)

$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \le \frac{H(X_1, X_2, \ldots, X_{n-1})}{n-1}. \tag{4.94}$$

(b)

$$\frac{H(X_1, X_2, \ldots, X_n)}{n} \ge H(X_n \mid X_{n-1}, \ldots, X_1). \tag{4.95}$$


4.7 Entropy rates of Markov chains

(a) Find the entropy rate of the two-state Markov chain with transition matrix

$$P = \begin{bmatrix} 1 - p_{01} & p_{01} \\ p_{10} & 1 - p_{10} \end{bmatrix}.$$

(b) What values of p01, p10 maximize the entropy rate?

(c) Find the entropy rate of the two-state Markov chain with transition matrix

$$P = \begin{bmatrix} 1 - p & p \\ 1 & 0 \end{bmatrix}.$$

(d) Find the maximum value of the entropy rate of the Markov chain of part (c). We expect that the maximizing value of p should be less than 1/2, since the 0 state permits more information to be generated than the 1 state.

(e) Let N(t) be the number of allowable state sequences of length t for the Markov chain of part (c). Find N(t) and calculate

$$H_0 = \lim_{t \to \infty} \frac{1}{t} \log N(t).$$

[Hint: Find a linear recurrence that expresses N (t ) in terms of N (t − 1) and N (t − 2). Why is H0 an upper bound on the entropy rate of the Markov chain? Compare H0 with the maximum entropy found in part (d).]

4.8 Maximum entropy process. A discrete memoryless source has the alphabet {1, 2}, where the symbol 1 has duration 1 and the symbol 2 has duration 2. The probabilities of 1 and 2 are p1 and p2, respectively. Find the value of p1 that maximizes the source entropy per unit time H(X)/E[T]. What is the maximum value of this entropy per unit time?

4.9 Initial conditions. Show, for a Markov chain, that

$$H(X_0 \mid X_n) \ge H(X_0 \mid X_{n-1}).$$

Thus, initial conditions X0 become more difficult to recover as the future Xn unfolds.

4.10 Pairwise independence. Let X1, X2, . . . , Xn−1 be i.i.d. random variables taking values in {0, 1}, with Pr{Xi = 1} = 1/2. Let Xn = 1 if Σ_{i=1}^{n−1} Xi is odd and Xn = 0 otherwise. Let n ≥ 3.


(a) Show that Xi and Xj are independent for i ≠ j, i, j ∈ {1, 2, . . . , n}.

(b) Find H(Xi, Xj) for i ≠ j.

(c) Find H(X1, X2, . . . , Xn). Is this equal to nH(X1)?

4.11 Stationary processes. Let . . . , X−1, X0, X1, . . . be a stationary (not necessarily Markov) stochastic process. Which of the following statements are true? Prove or provide a counterexample.

(a) H(Xn|X0) = H(X−n|X0).

(b) H(Xn|X0) ≥ H(Xn−1|X0).

(c) H(Xn|X1, X2, . . . , Xn−1, Xn+1) is nonincreasing in n.

(d) H(Xn|X1, . . . , Xn−1, Xn+1, . . . , X2n) is nonincreasing in n.

4.12 Entropy rate of a dog looking for a bone. A dog walks on the integers, possibly reversing direction at each step with probability p = 0.1. Let X0 = 0. The first step is equally likely to be positive or negative. A typical walk might look like this:

(X0, X1, . . .) = (0, −1, −2, −3, −4, −3, −2, −1, 0, 1, . . .).

(a) Find H(X1, X2, . . . , Xn).

(b) Find the entropy rate of the dog.

(c) What is the expected number of steps that the dog takes before reversing direction?

4.13 The past has little to say about the future. For a stationary stochastic process X1, X2, . . . , Xn, . . . , show that

$$\lim_{n \to \infty} \frac{1}{2n} I(X_1, X_2, \ldots, X_n; X_{n+1}, X_{n+2}, \ldots, X_{2n}) = 0. \tag{4.96}$$

Thus, the dependence between adjacent n-blocks of a stationary process does not grow linearly with n.

4.14 Functions of a stochastic process

(a) Consider a stationary stochastic process X1, X2, . . . , Xn, and let Y1, Y2, . . . , Yn be defined by

Yi = φ (Xi ), i = 1, 2, . . .

(4.97)

for some function φ. Prove that

$$H(\mathcal{Y}) \le H(\mathcal{X}). \tag{4.98}$$


(b) What is the relationship between the entropy rates H(Z) and H(X) if

Zi = ψ (Xi , Xi+1), i = 1, 2, . . .

(4.99)

for some function ψ ?

4.15 Entropy rate. Let {Xi} be a discrete stationary stochastic process with entropy rate H(X). Show that

$$\frac{1}{n} H(X_n, \ldots, X_1 \mid X_0, X_{-1}, \ldots, X_{-k}) \to H(\mathcal{X}) \tag{4.100}$$

for k = 1, 2, . . ..

4.16 Entropy rate of constrained sequences. In magnetic recording, the mechanism of recording and reading the bits imposes constraints on the sequences of bits that can be recorded. For example, to ensure proper synchronization, it is often necessary to limit the length of runs of 0's between two 1's. Also, to reduce intersymbol interference, it may be necessary to require at least one 0 between any two 1's. We consider a simple example of such a constraint.

Suppose that we are required to have at least one 0 and at most two 0's between any pair of 1's in a sequence. Thus, sequences like 101001 and 0101001 are valid sequences, but 0110010 and 0000101 are not. We wish to calculate the number of valid sequences of length n.
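For small n the counts can be sanity-checked by brute force. The sketch below (an assumption, not the state-diagram method the problem intends) interprets the constraint so that it matches the four examples given: no two adjacent 1's and no run of three or more 0's anywhere in the sequence.

```python
# Brute-force count of constrained binary sequences of length n, for small n.
from itertools import product

def valid(bits):
    """Matches the examples: no adjacent 1's and no run of three or more 0's."""
    s = "".join(map(str, bits))
    return "11" not in s and "000" not in s

for n in range(1, 9):
    count = sum(valid(seq) for seq in product((0, 1), repeat=n))
    print(f"n={n}: {count} valid sequences")
```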

(a) Show that the set of constrained sequences is the same as the set of allowed paths on the following state diagram:

(b) Let Xi (n) be the number of valid paths of length n ending at state i. Argue that X(n) = [X1(n) X2(n) X3(n)]t satisfies the