
PROBLEMS

for any ε > 0,

    Pr{|Y − µ| > ε} ≤ σ^2/ε^2.   (3.32)

(c) (Weak law of large numbers) Let Z1, Z2, . . . , Zn be a sequence of i.i.d. random variables with mean µ and variance σ^2. Let

 

Z̄_n = (1/n) Σ_{i=1}^{n} Zi be the sample mean. Show that

    Pr{|Z̄_n − µ| > ε} ≤ σ^2/(nε^2).   (3.33)

Thus, Pr{|Z̄_n − µ| > ε} → 0 as n → ∞. This is known as the weak law of large numbers.
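A quick numerical check of (3.33), not part of the original problem: the sketch below (the choice of distribution, ε, and sample sizes is illustrative) compares the empirical tail probability of the sample mean with the Chebyshev bound σ^2/(nε^2).

```python
import numpy as np

# Empirical check of Pr{|Z̄_n - µ| > ε} <= σ²/(n ε²) for i.i.d. Uniform(0,1)
# variables, for which µ = 1/2 and σ² = 1/12.
rng = np.random.default_rng(0)
mu, var, eps, trials = 0.5, 1 / 12, 0.05, 10_000

for n in [10, 100, 1000]:
    z_bar = rng.random((trials, n)).mean(axis=1)    # sample means of n variables
    empirical = np.mean(np.abs(z_bar - mu) > eps)   # empirical tail probability
    bound = var / (n * eps**2)                      # right-hand side of (3.33)
    print(f"n={n:5d}  empirical={empirical:.4f}  Chebyshev bound={min(bound, 1):.4f}")
```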

3.2 AEP and mutual information. Let (Xi, Yi) be i.i.d. ∼ p(x, y). We form the log likelihood ratio of the hypothesis that X and Y are independent vs. the hypothesis that X and Y are dependent. What is the limit of

    (1/n) log [ p(X^n)p(Y^n) / p(X^n, Y^n) ] ?
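For intuition only (not part of the problem), the quantity can be estimated by Monte Carlo for a small joint pmf; the pmf below is an arbitrary illustration. Because the pairs are i.i.d., the n-letter log ratio is an average of per-pair terms.

```python
import numpy as np

# Monte Carlo estimate of (1/n) log2[ p(X^n) p(Y^n) / p(X^n, Y^n) ]
# for an illustrative joint pmf on {0,1} x {0,1}.
rng = np.random.default_rng(5)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginals

n = 200_000
idx = rng.choice(4, size=n, p=pxy.ravel())  # draw (X, Y) pairs i.i.d. from pxy
x, y = idx // 2, idx % 2
ratio = (np.log2(px[x]) + np.log2(py[y]) - np.log2(pxy[x, y])).mean()
print(ratio)   # the average per-pair term equals the n-letter quantity
```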

3.3 Piece of cake. A cake is sliced roughly in half, the largest piece being chosen each time, the other pieces discarded. We will assume that a random cut creates pieces of proportions

    P = { (2/3, 1/3)  with probability 3/4
          (2/5, 3/5)  with probability 1/4.

Thus, for example, the first cut (and choice of largest piece) may result in a piece of size 3/5. Cutting and choosing from this piece might reduce it to size (3/5)(2/3) at time 2, and so on. How large, to first order in the exponent, is the piece of cake after n cuts?
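For intuition, a short simulation of the cutting process (illustrative, not part of the text) estimates the first-order exponent as the average of log2 of the fraction kept at each cut.

```python
import numpy as np

# Simulate the cake-cutting of Problem 3.3 and estimate (1/n) log2(T_n),
# where T_n is the size of the remaining piece after n cuts.
rng = np.random.default_rng(1)
n = 100_000
# The larger piece kept at each cut: 2/3 with probability 3/4, 3/5 with probability 1/4.
kept = rng.choice([2/3, 3/5], size=n, p=[3/4, 1/4])
exponent = np.log2(kept).mean()   # (1/n) log2 of the product of the kept fractions
print(f"estimated exponent ≈ {exponent:.4f} bits per cut")
```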

3.4 AEP. Let Xi be i.i.d. ∼ p(x), x ∈ {1, 2, . . . , m}. Let µ = EX and H = −Σ p(x) log p(x). Let

    A^n = {x^n ∈ X^n : |−(1/n) log p(x^n) − H| ≤ ε}.

Let

    B^n = {x^n ∈ X^n : |(1/n) Σ_{i=1}^{n} Xi − µ| ≤ ε}.

 

 

(a) Does Pr{X^n ∈ A^n} → 1?

(b) Does Pr{X^n ∈ A^n ∩ B^n} → 1?


(c) Show that |A^n ∩ B^n| ≤ 2^{n(H+ε)} for all n.

(d) Show that |A^n ∩ B^n| ≥ (1/2) 2^{n(H−ε)} for n sufficiently large.

3.5 Sets defined by probabilities. Let X1, X2, . . . be an i.i.d. sequence of discrete random variables with entropy H(X). Let

    C_n(t) = {x^n ∈ X^n : p(x^n) ≥ 2^{−nt}}

denote the subset of n-sequences with probabilities ≥ 2^{−nt}.

(a) Show that |C_n(t)| ≤ 2^{nt}.

(b) For what values of t does P({X^n ∈ C_n(t)}) → 1?

3.6 AEP-like limit. Let X1, X2, . . . be i.i.d. drawn according to probability mass function p(x). Find

    lim_{n→∞} [ p(X1, X2, . . . , Xn) ]^{1/n}.

3.7 AEP and source coding. A discrete memoryless source emits a sequence of statistically independent binary digits with probabilities p(1) = 0.005 and p(0) = 0.995. The digits are taken 100 at a time and a binary codeword is provided for every sequence of 100 digits containing three or fewer 1's.

(a) Assuming that all codewords are the same length, find the minimum length required to provide codewords for all sequences with three or fewer 1's.

(b) Calculate the probability of observing a source sequence for which no codeword has been assigned.

(c) Use Chebyshev's inequality to bound the probability of observing a source sequence for which no codeword has been assigned. Compare this bound with the actual probability computed in part (b).
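The counts and probabilities involved are easy to compute directly. The sketch below is illustrative; in part (c) it applies Chebyshev's inequality to the number of 1's in the block, which is one natural choice.

```python
from math import comb, log2, ceil

n, p1 = 100, 0.005
# (a) number of length-100 sequences with three or fewer 1's, and bits to index them
count = sum(comb(n, k) for k in range(4))
min_len = ceil(log2(count))
# (b) probability that a sequence contains four or more 1's (no codeword assigned)
p_le3 = sum(comb(n, k) * p1**k * (1 - p1) ** (n - k) for k in range(4))
p_no_codeword = 1 - p_le3
# (c) Chebyshev bound: Pr{S >= 4} <= Pr{|S - np| >= 4 - np} <= Var(S)/(4 - np)^2
mean, var = n * p1, n * p1 * (1 - p1)
cheb_bound = var / (4 - mean) ** 2
print(min_len, p_no_codeword, cheb_bound)
```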

3.8 Products. Let

    X = { 1  with probability 1/2
          2  with probability 1/4
          3  with probability 1/4.

Let X1, X2, . . . be drawn i.i.d. according to this distribution. Find the limiting behavior of the product

    (X1 X2 · · · Xn)^{1/n}.
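A simulation (illustrative only) shows how the n-th root of the product settles down as n grows:

```python
import numpy as np

# Observe (X1 X2 ... Xn)^{1/n} for X in {1, 2, 3} with probabilities 1/2, 1/4, 1/4.
rng = np.random.default_rng(2)
for n in [100, 10_000, 1_000_000]:
    x = rng.choice([1, 2, 3], size=n, p=[0.5, 0.25, 0.25])
    print(n, 2 ** np.log2(x).mean())   # geometric mean of the sample, computed in log space
```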


3.9 AEP. Let X1, X2, . . . be independent, identically distributed random variables drawn according to the probability mass function p(x), x ∈ {1, 2, . . . , m}. Thus, p(x1, x2, . . . , xn) = ∏_{i=1}^{n} p(xi). We know that −(1/n) log p(X1, X2, . . . , Xn) → H(X) in probability. Let q(x1, x2, . . . , xn) = ∏_{i=1}^{n} q(xi), where q is another probability mass function on {1, 2, . . . , m}.

(a) Evaluate lim −(1/n) log q(X1, X2, . . . , Xn), where X1, X2, . . . are i.i.d. ∼ p(x).

(b) Now evaluate the limit of the log likelihood ratio (1/n) log [ q(X1, . . . , Xn) / p(X1, . . . , Xn) ] when X1, X2, . . . are i.i.d. ∼ p(x). Thus, the odds favoring q are exponentially small when p is true.

3.10 Random box size. An n-dimensional rectangular box with sides X1, X2, X3, . . . , Xn is to be constructed. The volume is V_n = ∏_{i=1}^{n} Xi. The edge length l of an n-cube with the same volume as the random box is l = V_n^{1/n}. Let X1, X2, . . . be i.i.d. uniform random variables over the unit interval [0, 1]. Find lim_{n→∞} V_n^{1/n} and compare to (E V_n)^{1/n}. Clearly, the expected edge length does not capture the idea of the volume of the box. The geometric mean, rather than the arithmetic mean, characterizes the behavior of products.
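A numerical comparison for intuition (illustrative only):

```python
import numpy as np

# Compare the edge length V_n^{1/n} of the volume-equivalent n-cube with
# (E V_n)^{1/n} for i.i.d. Uniform(0,1) edges; E X_i = 1/2, so E V_n = (1/2)^n.
rng = np.random.default_rng(3)
for n in [10, 100, 10_000]:
    x = rng.random(n)
    edge = np.exp(np.log(x).mean())   # V_n^{1/n}, computed in log space to avoid underflow
    print(f"n={n:6d}  V_n^(1/n)={edge:.4f}  (E V_n)^(1/n)={0.5:.4f}")
```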

3.11 Proof of Theorem 3.3.1. This problem shows that the size of the smallest “probable” set is about 2^{nH}. Let X1, X2, . . . , Xn be i.i.d. ∼ p(x). Let B_δ^(n) ⊂ X^n such that Pr(B_δ^(n)) > 1 − δ. Fix ε < 1/2.

(a) Given any two sets A, B such that Pr(A) > 1 − ε1 and Pr(B) > 1 − ε2, show that Pr(A ∩ B) > 1 − ε1 − ε2. Hence, Pr(A_ε^(n) ∩ B_δ^(n)) ≥ 1 − ε − δ.

(b) Justify the steps in the chain of inequalities

    1 − ε − δ ≤ Pr(A_ε^(n) ∩ B_δ^(n))                         (3.34)
              = Σ_{x^n ∈ A_ε^(n) ∩ B_δ^(n)} p(x^n)            (3.35)
              ≤ Σ_{x^n ∈ A_ε^(n) ∩ B_δ^(n)} 2^{−n(H−ε)}       (3.36)
              = |A_ε^(n) ∩ B_δ^(n)| 2^{−n(H−ε)}               (3.37)
              ≤ |B_δ^(n)| 2^{−n(H−ε)}.                        (3.38)

(c) Complete the proof of the theorem.


3.12 Monotonic convergence of the empirical distribution. Let p̂_n denote the empirical probability mass function corresponding to X1, X2, . . . , Xn i.i.d. ∼ p(x), x ∈ X. Specifically,

    p̂_n(x) = (1/n) Σ_{i=1}^{n} I(Xi = x)

is the proportion of times that Xi = x in the first n samples, where I is the indicator function.

(a) Show for X binary that

    E D(p̂_{2n} ‖ p) ≤ E D(p̂_n ‖ p).

Thus, the expected relative entropy “distance” from the empirical distribution to the true distribution decreases with sample size. (Hint: Write p̂_{2n} = (1/2)p̂_n + (1/2)p̂'_n, where p̂_n and p̂'_n are the empirical distributions of the first n and the second n samples, and use the convexity of D.)

(b) Show for an arbitrary discrete X that

    E D(p̂_n ‖ p) ≤ E D(p̂_{n−1} ‖ p).

(Hint: Write p̂_n as the average of n empirical mass functions with each of the n samples deleted in turn.)
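As a sanity check on part (a) (illustrative only, not part of the problem), E D(p̂_n ‖ p) can be estimated by Monte Carlo for a binary source and watched as it decreases in n:

```python
import numpy as np

def expected_divergence(p, n, trials=200_000, rng=None):
    """Monte Carlo estimate of E D(p̂_n || p) in bits for a Bernoulli(p) source."""
    if rng is None:
        rng = np.random.default_rng(4)
    k = rng.binomial(n, p, size=trials)   # number of 1's in each sample of size n
    phat = k / n
    with np.errstate(divide="ignore", invalid="ignore"):   # 0 log 0 treated as 0
        d = np.where(phat > 0, phat * np.log2(phat / p), 0.0) + \
            np.where(phat < 1, (1 - phat) * np.log2((1 - phat) / (1 - p)), 0.0)
    return float(d.mean())

rng = np.random.default_rng(4)
for n in [2, 4, 8, 16, 32]:
    print(n, round(expected_divergence(0.3, n, rng=rng), 5))
```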

3.13 Calculation of typical set. To clarify the notion of a typical set A_ε^(n) and the smallest set of high probability B_δ^(n), we will calculate the set for a simple example. Consider a sequence of i.i.d. binary random variables, X1, X2, . . . , Xn, where the probability that Xi = 1 is 0.6 (and therefore the probability that Xi = 0 is 0.4).

(a) Calculate H(X).

(b) With n = 25 and ε = 0.1, which sequences fall in the typical set A_ε^(n)? What is the probability of the typical set? How many elements are there in the typical set? (This involves computation of a table of probabilities for sequences with k 1's, 0 ≤ k ≤ 25, and finding those sequences that are in the typical set.)

(c) How many elements are there in the smallest set that has probability 0.9?

(d) How many elements are there in the intersection of the sets in parts (b) and (c)? What is the probability of this intersection?

 

 

 

 

 

 k    (n choose k)    (n choose k) p^k (1 − p)^{n−k}    −(1/n) log p(x^n)
 0           1                 0.000000                    1.321928
 1          25                 0.000000                    1.298530
 2         300                 0.000000                    1.275131
 3        2300                 0.000001                    1.251733
 4       12650                 0.000007                    1.228334
 5       53130                 0.000054                    1.204936
 6      177100                 0.000227                    1.181537
 7      480700                 0.001205                    1.158139
 8     1081575                 0.003121                    1.134740
 9     2042975                 0.013169                    1.111342
10     3268760                 0.021222                    1.087943
11     4457400                 0.077801                    1.064545
12     5200300                 0.075967                    1.041146
13     5200300                 0.267718                    1.017748
14     4457400                 0.146507                    0.994349
15     3268760                 0.575383                    0.970951
16     2042975                 0.151086                    0.947552
17     1081575                 0.846448                    0.924154
18      480700                 0.079986                    0.900755
19      177100                 0.970638                    0.877357
20       53130                 0.019891                    0.853958
21       12650                 0.997633                    0.830560
22        2300                 0.001937                    0.807161
23         300                 0.999950                    0.783763
24          25                 0.000047                    0.760364
25           1                 0.000003                    0.736966
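The computation that part (b) of Problem 3.13 calls for can be carried out with a short script along the following lines (an illustrative sketch, not part of the text): it tabulates, for each k, the number of sequences with k 1's, their total probability, −(1/n) log p(x^n), and whether such sequences are ε-typical.

```python
from math import comb, log2

n, p, eps = 25, 0.6, 0.1
H = -p * log2(p) - (1 - p) * log2(1 - p)   # entropy of a Bernoulli(0.6) source in bits

typical_prob, typical_count = 0.0, 0
for k in range(n + 1):
    seq_prob = p**k * (1 - p) ** (n - k)   # probability of a single sequence with k ones
    neg_log = -log2(seq_prob) / n          # -(1/n) log p(x^n)
    typical = abs(neg_log - H) <= eps      # membership test for the typical set
    if typical:
        typical_count += comb(n, k)
        typical_prob += comb(n, k) * seq_prob
    print(k, comb(n, k), round(comb(n, k) * seq_prob, 6), round(neg_log, 6), typical)

print("Pr(typical set) =", round(typical_prob, 4), " size =", typical_count)
```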

 

 

 

 

 

 

 

 

 

 

 

 

HISTORICAL NOTES

The asymptotic equipartition property (AEP) was first stated by Shannon in his original 1948 paper [472], where he proved the result for i.i.d. processes and stated the result for stationary ergodic processes. McMillan [384] and Breiman [74] proved the AEP for ergodic finite alphabet sources. The result is now referred to as the AEP or the Shannon–McMillan–Breiman theorem. Chung [101] extended the theorem to the case of countable alphabets and Moy [392], Perez [417], and Kieffer [312] proved the L1 convergence when {Xi} is continuous valued and ergodic. Barron [34] and Orey [402] proved almost sure convergence for real-valued ergodic processes; a simple sandwich argument (Algoet and Cover [20]) will be used in Section 16.8 to prove the general AEP.

CHAPTER 4

ENTROPY RATES OF A STOCHASTIC PROCESS

The asymptotic equipartition property in Chapter 3 establishes that nH (X) bits suffice on the average to describe n independent and identically distributed random variables. But what if the random variables are dependent? In particular, what if the random variables form a stationary process? We will show, just as in the i.i.d. case, that the entropy H (X1, X2, . . . , Xn) grows (asymptotically) linearly with n at a rate H (X), which we will call the entropy rate of the process. The interpretation of H (X) as the best achievable data compression will await the analysis in Chapter 5.

4.1 MARKOV CHAINS

A stochastic process {Xi } is an indexed sequence of random variables. In general, there can be an arbitrary dependence among the random variables. The process is characterized by the joint probability mass functions

Pr{(X1, X2, . . . , Xn) = (x1, x2, . . . , xn)} = p(x1, x2, . . . , xn), (x1, x2, . . . , xn) ∈ X^n for n = 1, 2, . . . .

Definition A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index; that is,

Pr{X1 = x1, X2 = x2, . . . , Xn = xn}

= Pr{X1+l = x1, X2+l = x2, . . . , Xn+l = xn}   (4.1)

for every n and every shift l and for all x1, x2, . . . , xn ∈ X.


A simple example of a stochastic process with dependence is one in which each random variable depends only on the one preceding it and is conditionally independent of all the other preceding random variables. Such a process is said to be Markov.

Definition A discrete stochastic process X1, X2, . . . is said to be a

Markov chain or a Markov process if for n = 1, 2, . . . ,

Pr(Xn+1 = xn+1 | Xn = xn, Xn−1 = xn−1, . . . , X1 = x1)
        = Pr(Xn+1 = xn+1 | Xn = xn)   (4.2)

for all x1, x2, . . . , xn, xn+1 ∈ X.

In this case, the joint probability mass function of the random variables

can be written as

 

p(x1, x2, . . . , xn) = p(x1)p(x2|x1)p(x3|x2) · · · p(xn|xn−1).

(4.3)

Definition The Markov chain is said to be time invariant if the conditional probability p(xn+1|xn) does not depend on n; that is, for n = 1, 2, . . . ,

Pr{Xn+1 = b | Xn = a} = Pr{X2 = b | X1 = a}   for all a, b ∈ X.   (4.4)

We will assume that the Markov chain is time invariant unless otherwise stated.

If {Xi} is a Markov chain, Xn is called the state at time n. A time-invariant Markov chain is characterized by its initial state and a probability transition matrix P = [Pij], i, j ∈ {1, 2, . . . , m}, where Pij = Pr{Xn+1 = j | Xn = i}.

If it is possible to go with positive probability from any state of the Markov chain to any other state in a finite number of steps, the Markov chain is said to be irreducible. If the largest common factor of the lengths of different paths from a state to itself is 1, the Markov chain is said to be aperiodic.

If the probability mass function of the random variable at time n is p(xn), the probability mass function at time n + 1 is

    p(xn+1) = Σ_{xn} p(xn) P_{xn xn+1}.   (4.5)
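In matrix form, (4.5) says that the row vector of state probabilities is multiplied by P at each step. A minimal illustration (the transition matrix values below are arbitrary):

```python
import numpy as np

# One step of (4.5): the distribution at time n+1 is the row vector p_n times P.
P = np.array([[0.9, 0.1],     # illustrative two-state transition matrix
              [0.4, 0.6]])
p_n = np.array([1.0, 0.0])    # start in state 1 with probability 1
p_next = p_n @ P              # p(x_{n+1}) = sum over x_n of p(x_n) P_{x_n x_{n+1}}
print(p_next)                 # -> [0.9 0.1]
```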

 

A distribution on the states such that the distribution at time n + 1 is the same as the distribution at time n is called a stationary distribution. The


stationary distribution is so called because if the initial state of a Markov chain is drawn according to a stationary distribution, the Markov chain forms a stationary process.

If the finite-state Markov chain is irreducible and aperiodic, the stationary distribution is unique, and from any starting distribution, the distribution of Xn tends to the stationary distribution as n → ∞.

Example 4.1.1 Consider a two-state Markov chain with a probability transition matrix

    P = [ 1 − α      α   ]
        [   β      1 − β ]                (4.6)

as shown in Figure 4.1.

Let the stationary distribution be represented by a vector µ whose components are the stationary probabilities of states 1 and 2, respectively. Then the stationary probability can be found by solving the equation µP = µ or, more simply, by balancing probabilities. For the stationary distribution, the net probability flow across any cut set in the state transition graph is zero. Applying this to Figure 4.1, we obtain

µ1α = µ2β.

(4.7)

Since µ1 + µ2 = 1, the stationary distribution is

    µ1 = β/(α + β),    µ2 = α/(α + β).   (4.8)

If the Markov chain has an initial state drawn according to the stationary distribution, the resulting process will be stationary.

[FIGURE 4.1. Two-state Markov chain, with transition probability α from state 1 to state 2, transition probability β from state 2 to state 1, and self-loop probabilities 1 − α and 1 − β.]

The entropy of the state Xn at time n is

    H(Xn) = H( β/(α + β), α/(α + β) ).   (4.9)
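The stationary distribution (4.8) and the entropy (4.9) are easy to check numerically. The sketch below (the values of α and β are arbitrary) finds the stationary distribution as the left eigenvector of P for eigenvalue 1 and compares it with the closed form.

```python
import numpy as np

# Two-state Markov chain of Example 4.1.1 with illustrative values of alpha, beta.
alpha, beta = 0.3, 0.1
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu = mu / mu.sum()
closed_form = np.array([beta, alpha]) / (alpha + beta)   # (4.8)

def entropy_bits(p):
    """Entropy in bits of a probability vector p, with 0 log 0 = 0."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print("numerical stationary distribution:", mu)
print("closed form (4.8):               ", closed_form)
print("H(Xn) under the stationary distribution, cf. (4.9):", entropy_bits(mu))
```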

 

 

 

 

 

 

 

 

However, this is not the rate at which entropy grows for H (X1, X2, . . . , Xn). The dependence among the Xi ’s will take a steady toll.

4.2 ENTROPY RATE

If we have a sequence of n random variables, a natural question to ask is: How does the entropy of the sequence grow with n? We define the entropy rate as this rate of growth as follows.

Definition The entropy of a stochastic process {Xi } is defined by

    H(X) = lim_{n→∞} (1/n) H(X1, X2, . . . , Xn)   (4.10)

when the limit exists.

We now consider some simple examples of stochastic processes and their corresponding entropy rates.

1. Typewriter. Consider the case of a typewriter that has m equally likely output letters. The typewriter can produce m^n sequences of length n, all of them equally likely. Hence H(X1, X2, . . . , Xn) = log m^n and the entropy rate is H(X) = log m bits per symbol.

2. X1, X2, . . . are i.i.d. random variables. Then

    H(X) = lim H(X1, X2, . . . , Xn)/n = lim nH(X1)/n = H(X1),   (4.11)

which is what one would expect for the entropy rate per symbol.

3. Sequence of independent but not identically distributed random variables. In this case,

    H(X1, X2, . . . , Xn) = Σ_{i=1}^{n} H(Xi),   (4.12)

but the H(Xi)'s are not all equal. We can choose a sequence of distributions on X1, X2, . . . such that the limit of (1/n) Σ H(Xi) does not exist. An example of such a sequence is a random binary sequence