
PROBLEMS

for any ε > 0,

    Pr{|Y − µ| > ε} ≤ σ^2/ε^2.   (3.32)

(c) (Weak law of large numbers) Let Z1, Z2, . . . , Zn be a sequence of i.i.d. random variables with mean µ and variance σ^2. Let

 

Z̄_n = (1/n) Σ_{i=1}^{n} Zi be the sample mean. Show that

    Pr{|Z̄_n − µ| > ε} ≤ σ^2/(nε^2).   (3.33)

Thus, Pr{|Z̄_n − µ| > ε} → 0 as n → ∞. This is known as the weak law of large numbers.
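A quick numerical check of (3.33), not part of the original problem: the sketch below (the choice of distribution, ε, and sample sizes is illustrative) compares the empirical tail probability of the sample mean with the Chebyshev bound σ^2/(nε^2).

```python
import numpy as np

# Empirical check of Pr{|Z̄_n - µ| > ε} <= σ²/(n ε²) for i.i.d. Uniform(0,1)
# variables, for which µ = 1/2 and σ² = 1/12.
rng = np.random.default_rng(0)
mu, var, eps, trials = 0.5, 1 / 12, 0.05, 10_000

for n in [10, 100, 1000]:
    z_bar = rng.random((trials, n)).mean(axis=1)    # sample means of n variables
    empirical = np.mean(np.abs(z_bar - mu) > eps)   # empirical tail probability
    bound = var / (n * eps**2)                      # right-hand side of (3.33)
    print(f"n={n:5d}  empirical={empirical:.4f}  Chebyshev bound={min(bound, 1):.4f}")
```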

3.2 AEP and mutual information. Let (Xi, Yi) be i.i.d. ∼ p(x, y). We form the log likelihood ratio of the hypothesis that X and Y are independent vs. the hypothesis that X and Y are dependent. What is the limit of

    (1/n) log [ p(X^n)p(Y^n) / p(X^n, Y^n) ] ?
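For intuition only (not part of the problem), the quantity can be estimated by Monte Carlo for a small joint pmf; the pmf below is an arbitrary illustration. Because the pairs are i.i.d., the n-letter log ratio is an average of per-pair terms.

```python
import numpy as np

# Monte Carlo estimate of (1/n) log2[ p(X^n) p(Y^n) / p(X^n, Y^n) ]
# for an illustrative joint pmf on {0,1} x {0,1}.
rng = np.random.default_rng(5)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)   # marginals

n = 200_000
idx = rng.choice(4, size=n, p=pxy.ravel())  # draw (X, Y) pairs i.i.d. from pxy
x, y = idx // 2, idx % 2
ratio = (np.log2(px[x]) + np.log2(py[y]) - np.log2(pxy[x, y])).mean()
print(ratio)   # the average per-pair term equals the n-letter quantity
```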

3.3 Piece of cake. A cake is sliced roughly in half, the largest piece being chosen each time, the other pieces discarded. We will assume that a random cut creates pieces of proportions

    P = { (2/3, 1/3)  with probability 3/4
          (2/5, 3/5)  with probability 1/4.

Thus, for example, the first cut (and choice of largest piece) may result in a piece of size 3/5. Cutting and choosing from this piece might reduce it to size (3/5)(2/3) at time 2, and so on. How large, to first order in the exponent, is the piece of cake after n cuts?
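For intuition, a short simulation of the cutting process (illustrative, not part of the text) estimates the first-order exponent as the average of log2 of the fraction kept at each cut.

```python
import numpy as np

# Simulate the cake-cutting of Problem 3.3 and estimate (1/n) log2(T_n),
# where T_n is the size of the remaining piece after n cuts.
rng = np.random.default_rng(1)
n = 100_000
# The larger piece kept at each cut: 2/3 with probability 3/4, 3/5 with probability 1/4.
kept = rng.choice([2/3, 3/5], size=n, p=[3/4, 1/4])
exponent = np.log2(kept).mean()   # (1/n) log2 of the product of the kept fractions
print(f"estimated exponent ≈ {exponent:.4f} bits per cut")
```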

3.4 AEP. Let Xi be i.i.d. ∼ p(x), x ∈ {1, 2, . . . , m}. Let µ = EX and H = −Σ p(x) log p(x). Let

    A^n = {x^n ∈ X^n : |−(1/n) log p(x^n) − H| ≤ ε}.

Let

    B^n = {x^n ∈ X^n : |(1/n) Σ_{i=1}^{n} Xi − µ| ≤ ε}.

 

 

(a) Does Pr{X^n ∈ A^n} → 1?

(b) Does Pr{X^n ∈ A^n ∩ B^n} → 1?


(c) Show that |A^n ∩ B^n| ≤ 2^{n(H+ε)} for all n.

(d) Show that |A^n ∩ B^n| ≥ (1/2) 2^{n(H−ε)} for n sufficiently large.

3.5 Sets defined by probabilities. Let X1, X2, . . . be an i.i.d. sequence of discrete random variables with entropy H(X). Let

    C_n(t) = {x^n ∈ X^n : p(x^n) ≥ 2^{−nt}}

denote the subset of n-sequences with probabilities ≥ 2^{−nt}.

(a) Show that |C_n(t)| ≤ 2^{nt}.

(b) For what values of t does P({X^n ∈ C_n(t)}) → 1?

3.6 AEP-like limit. Let X1, X2, . . . be i.i.d. drawn according to probability mass function p(x). Find

    lim_{n→∞} [ p(X1, X2, . . . , Xn) ]^{1/n}.

3.7 AEP and source coding. A discrete memoryless source emits a sequence of statistically independent binary digits with probabilities p(1) = 0.005 and p(0) = 0.995. The digits are taken 100 at a time and a binary codeword is provided for every sequence of 100 digits containing three or fewer 1's.

(a) Assuming that all codewords are the same length, find the minimum length required to provide codewords for all sequences with three or fewer 1's.

(b) Calculate the probability of observing a source sequence for which no codeword has been assigned.

(c) Use Chebyshev's inequality to bound the probability of observing a source sequence for which no codeword has been assigned. Compare this bound with the actual probability computed in part (b).
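The counts and probabilities involved are easy to compute directly. The sketch below is illustrative; in part (c) it applies Chebyshev's inequality to the number of 1's in the block, which is one natural choice.

```python
from math import comb, log2, ceil

n, p1 = 100, 0.005
# (a) number of length-100 sequences with three or fewer 1's, and bits to index them
count = sum(comb(n, k) for k in range(4))
min_len = ceil(log2(count))
# (b) probability that a sequence contains four or more 1's (no codeword assigned)
p_le3 = sum(comb(n, k) * p1**k * (1 - p1) ** (n - k) for k in range(4))
p_no_codeword = 1 - p_le3
# (c) Chebyshev bound: Pr{S >= 4} <= Pr{|S - np| >= 4 - np} <= Var(S)/(4 - np)^2
mean, var = n * p1, n * p1 * (1 - p1)
cheb_bound = var / (4 - mean) ** 2
print(min_len, p_no_codeword, cheb_bound)
```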

3.8 Products. Let

    X = { 1  with probability 1/2
          2  with probability 1/4
          3  with probability 1/4.

Let X1, X2, . . . be drawn i.i.d. according to this distribution. Find the limiting behavior of the product

    (X1 X2 · · · Xn)^{1/n}.
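A simulation (illustrative only) shows how the n-th root of the product settles down as n grows:

```python
import numpy as np

# Observe (X1 X2 ... Xn)^{1/n} for X in {1, 2, 3} with probabilities 1/2, 1/4, 1/4.
rng = np.random.default_rng(2)
for n in [100, 10_000, 1_000_000]:
    x = rng.choice([1, 2, 3], size=n, p=[0.5, 0.25, 0.25])
    print(n, 2 ** np.log2(x).mean())   # geometric mean of the sample, computed in log space
```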


3.9 AEP. Let X1, X2, . . . be independent, identically distributed random variables drawn according to the probability mass function p(x), x ∈ {1, 2, . . . , m}. Thus, p(x1, x2, . . . , xn) = ∏_{i=1}^{n} p(xi). We know that −(1/n) log p(X1, X2, . . . , Xn) → H(X) in probability. Let q(x1, x2, . . . , xn) = ∏_{i=1}^{n} q(xi), where q is another probability mass function on {1, 2, . . . , m}.

(a) Evaluate lim −(1/n) log q(X1, X2, . . . , Xn), where X1, X2, . . . are i.i.d. ∼ p(x).

(b) Now evaluate the limit of the log likelihood ratio (1/n) log [ q(X1, . . . , Xn) / p(X1, . . . , Xn) ] when X1, X2, . . . are i.i.d. ∼ p(x). Thus, the odds favoring q are exponentially small when p is true.

3.10 Random box size. An n-dimensional rectangular box with sides X1, X2, X3, . . . , Xn is to be constructed. The volume is V_n = ∏_{i=1}^{n} Xi. The edge length l of an n-cube with the same volume as the random box is l = V_n^{1/n}. Let X1, X2, . . . be i.i.d. uniform random variables over the unit interval [0, 1]. Find lim_{n→∞} V_n^{1/n} and compare to (E V_n)^{1/n}. Clearly, the expected edge length does not capture the idea of the volume of the box. The geometric mean, rather than the arithmetic mean, characterizes the behavior of products.
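A numerical comparison for intuition (illustrative only):

```python
import numpy as np

# Compare the edge length V_n^{1/n} of the volume-equivalent n-cube with
# (E V_n)^{1/n} for i.i.d. Uniform(0,1) edges; E X_i = 1/2, so E V_n = (1/2)^n.
rng = np.random.default_rng(3)
for n in [10, 100, 10_000]:
    x = rng.random(n)
    edge = np.exp(np.log(x).mean())   # V_n^{1/n}, computed in log space to avoid underflow
    print(f"n={n:6d}  V_n^(1/n)={edge:.4f}  (E V_n)^(1/n)={0.5:.4f}")
```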

3.11 Proof of Theorem 3.3.1. This problem shows that the size of the smallest “probable” set is about 2^{nH}. Let X1, X2, . . . , Xn be i.i.d. ∼ p(x). Let B_δ^(n) ⊂ X^n such that Pr(B_δ^(n)) > 1 − δ. Fix ε < 1/2.

(a) Given any two sets A, B such that Pr(A) > 1 − ε1 and Pr(B) > 1 − ε2, show that Pr(A ∩ B) > 1 − ε1 − ε2. Hence, Pr(A_ε^(n) ∩ B_δ^(n)) ≥ 1 − ε − δ.

(b) Justify the steps in the chain of inequalities

    1 − ε − δ ≤ Pr(A_ε^(n) ∩ B_δ^(n))                         (3.34)
              = Σ_{x^n ∈ A_ε^(n) ∩ B_δ^(n)} p(x^n)            (3.35)
              ≤ Σ_{x^n ∈ A_ε^(n) ∩ B_δ^(n)} 2^{−n(H−ε)}       (3.36)
              = |A_ε^(n) ∩ B_δ^(n)| 2^{−n(H−ε)}               (3.37)
              ≤ |B_δ^(n)| 2^{−n(H−ε)}.                        (3.38)

(c) Complete the proof of the theorem.


3.12 Monotonic convergence of the empirical distribution. Let p̂_n denote the empirical probability mass function corresponding to X1, X2, . . . , Xn i.i.d. ∼ p(x), x ∈ X. Specifically,

    p̂_n(x) = (1/n) Σ_{i=1}^{n} I(Xi = x)

is the proportion of times that Xi = x in the first n samples, where I is the indicator function.

(a) Show for X binary that

    E D(p̂_{2n} ‖ p) ≤ E D(p̂_n ‖ p).

Thus, the expected relative entropy “distance” from the empirical distribution to the true distribution decreases with sample size. (Hint: Write p̂_{2n} = (1/2)p̂_n + (1/2)p̂'_n, where p̂_n and p̂'_n are the empirical distributions of the first n and the second n samples, and use the convexity of D.)

(b) Show for an arbitrary discrete X that

    E D(p̂_n ‖ p) ≤ E D(p̂_{n−1} ‖ p).

(Hint: Write p̂_n as the average of n empirical mass functions with each of the n samples deleted in turn.)
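As a sanity check on part (a) (illustrative only, not part of the problem), E D(p̂_n ‖ p) can be estimated by Monte Carlo for a binary source and watched as it decreases in n:

```python
import numpy as np

def expected_divergence(p, n, trials=200_000, rng=None):
    """Monte Carlo estimate of E D(p̂_n || p) in bits for a Bernoulli(p) source."""
    if rng is None:
        rng = np.random.default_rng(4)
    k = rng.binomial(n, p, size=trials)   # number of 1's in each sample of size n
    phat = k / n
    with np.errstate(divide="ignore", invalid="ignore"):   # 0 log 0 treated as 0
        d = np.where(phat > 0, phat * np.log2(phat / p), 0.0) + \
            np.where(phat < 1, (1 - phat) * np.log2((1 - phat) / (1 - p)), 0.0)
    return float(d.mean())

rng = np.random.default_rng(4)
for n in [2, 4, 8, 16, 32]:
    print(n, round(expected_divergence(0.3, n, rng=rng), 5))
```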

3.13 Calculation of typical set. To clarify the notion of a typical set A_ε^(n) and the smallest set of high probability B_δ^(n), we will calculate the set for a simple example. Consider a sequence of i.i.d. binary random variables, X1, X2, . . . , Xn, where the probability that Xi = 1 is 0.6 (and therefore the probability that Xi = 0 is 0.4).

(a) Calculate H(X).

(b) With n = 25 and ε = 0.1, which sequences fall in the typical set A_ε^(n)? What is the probability of the typical set? How many elements are there in the typical set? (This involves computation of a table of probabilities for sequences with k 1's, 0 ≤ k ≤ 25, and finding those sequences that are in the typical set.)

(c) How many elements are there in the smallest set that has probability 0.9?

(d) How many elements are there in the intersection of the sets in parts (b) and (c)? What is the probability of this intersection?

 

 

 

 

 

 k    (n choose k)    (n choose k) p^k (1 − p)^{n−k}    −(1/n) log p(x^n)
 0           1                 0.000000                    1.321928
 1          25                 0.000000                    1.298530
 2         300                 0.000000                    1.275131
 3        2300                 0.000001                    1.251733
 4       12650                 0.000007                    1.228334
 5       53130                 0.000054                    1.204936
 6      177100                 0.000227                    1.181537
 7      480700                 0.001205                    1.158139
 8     1081575                 0.003121                    1.134740
 9     2042975                 0.013169                    1.111342
10     3268760                 0.021222                    1.087943
11     4457400                 0.077801                    1.064545
12     5200300                 0.075967                    1.041146
13     5200300                 0.267718                    1.017748
14     4457400                 0.146507                    0.994349
15     3268760                 0.575383                    0.970951
16     2042975                 0.151086                    0.947552
17     1081575                 0.846448                    0.924154
18      480700                 0.079986                    0.900755
19      177100                 0.970638                    0.877357
20       53130                 0.019891                    0.853958
21       12650                 0.997633                    0.830560
22        2300                 0.001937                    0.807161
23         300                 0.999950                    0.783763
24          25                 0.000047                    0.760364
25           1                 0.000003                    0.736966
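The computation that part (b) of Problem 3.13 calls for can be carried out with a short script along the following lines (an illustrative sketch, not part of the text): it tabulates, for each k, the number of sequences with k 1's, their total probability, −(1/n) log p(x^n), and whether such sequences are ε-typical.

```python
from math import comb, log2

n, p, eps = 25, 0.6, 0.1
H = -p * log2(p) - (1 - p) * log2(1 - p)   # entropy of a Bernoulli(0.6) source in bits

typical_prob, typical_count = 0.0, 0
for k in range(n + 1):
    seq_prob = p**k * (1 - p) ** (n - k)   # probability of a single sequence with k ones
    neg_log = -log2(seq_prob) / n          # -(1/n) log p(x^n)
    typical = abs(neg_log - H) <= eps      # membership test for the typical set
    if typical:
        typical_count += comb(n, k)
        typical_prob += comb(n, k) * seq_prob
    print(k, comb(n, k), round(comb(n, k) * seq_prob, 6), round(neg_log, 6), typical)

print("Pr(typical set) =", round(typical_prob, 4), " size =", typical_count)
```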

 

 

 

 

 

 

 

 

 

 

 

 

HISTORICAL NOTES

The asymptotic equipartition property (AEP) was first stated by Shannon in his original 1948 paper [472], where he proved the result for i.i.d. processes and stated the result for stationary ergodic processes. McMillan [384] and Breiman [74] proved the AEP for ergodic finite alphabet sources. The result is now referred to as the AEP or the Shannon–McMillan–Breiman theorem. Chung [101] extended the theorem to the case of countable alphabets and Moy [392], Perez [417], and Kieffer [312] proved the L1 convergence when {Xi} is continuous valued and ergodic. Barron [34] and Orey [402] proved almost sure convergence for real-valued ergodic processes; a simple sandwich argument (Algoet and Cover [20]) will be used in Section 16.8 to prove the general AEP.

CHAPTER 4

ENTROPY RATES OF A STOCHASTIC PROCESS

The asymptotic equipartition property in Chapter 3 establishes that nH (X) bits suffice on the average to describe n independent and identically distributed random variables. But what if the random variables are dependent? In particular, what if the random variables form a stationary process? We will show, just as in the i.i.d. case, that the entropy H (X1, X2, . . . , Xn) grows (asymptotically) linearly with n at a rate H (X), which we will call the entropy rate of the process. The interpretation of H (X) as the best achievable data compression will await the analysis in Chapter 5.

4.1 MARKOV CHAINS

A stochastic process {Xi } is an indexed sequence of random variables. In general, there can be an arbitrary dependence among the random variables. The process is characterized by the joint probability mass functions

Pr{(X1, X2, . . . , Xn) = (x1, x2, . . . , xn)} = p(x1, x2, . . . , xn), (x1, x2, . . . , xn) ∈ X^n for n = 1, 2, . . . .

Definition A stochastic process is said to be stationary if the joint distribution of any subset of the sequence of random variables is invariant with respect to shifts in the time index; that is,

Pr{X1 = x1, X2 = x2, . . . , Xn = xn}

= Pr{X1+l = x1, X2+l = x2, . . . , Xn+l = xn}   (4.1)

for every n and every shift l and for all x1, x2, . . . , xn ∈ X.


A simple example of a stochastic process with dependence is one in which each random variable depends only on the one preceding it and is conditionally independent of all the other preceding random variables. Such a process is said to be Markov.

Definition A discrete stochastic process X1, X2, . . . is said to be a

Markov chain or a Markov process if for n = 1, 2, . . . ,

Pr(Xn+1 = xn+1 | Xn = xn, Xn−1 = xn−1, . . . , X1 = x1)
        = Pr(Xn+1 = xn+1 | Xn = xn)   (4.2)

for all x1, x2, . . . , xn, xn+1 ∈ X.

In this case, the joint probability mass function of the random variables

can be written as

 

p(x1, x2, . . . , xn) = p(x1)p(x2|x1)p(x3|x2) · · · p(xn|xn−1).

(4.3)

Definition The Markov chain is said to be time invariant if the conditional probability p(xn+1|xn) does not depend on n; that is, for n = 1, 2, . . . ,

Pr{Xn+1 = b | Xn = a} = Pr{X2 = b | X1 = a}   for all a, b ∈ X.   (4.4)

We will assume that the Markov chain is time invariant unless otherwise stated.

If {Xi} is a Markov chain, Xn is called the state at time n. A time-invariant Markov chain is characterized by its initial state and a probability transition matrix P = [Pij], i, j ∈ {1, 2, . . . , m}, where Pij = Pr{Xn+1 = j | Xn = i}.

If it is possible to go with positive probability from any state of the Markov chain to any other state in a finite number of steps, the Markov chain is said to be irreducible. If the largest common factor of the lengths of different paths from a state to itself is 1, the Markov chain is said to be aperiodic.

If the probability mass function of the random variable at time n is p(xn), the probability mass function at time n + 1 is

    p(xn+1) = Σ_{xn} p(xn) P_{xn xn+1}.   (4.5)
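In matrix form, (4.5) says that the row vector of state probabilities is multiplied by P at each step. A minimal illustration (the transition matrix values below are arbitrary):

```python
import numpy as np

# One step of (4.5): the distribution at time n+1 is the row vector p_n times P.
P = np.array([[0.9, 0.1],     # illustrative two-state transition matrix
              [0.4, 0.6]])
p_n = np.array([1.0, 0.0])    # start in state 1 with probability 1
p_next = p_n @ P              # p(x_{n+1}) = sum over x_n of p(x_n) P_{x_n x_{n+1}}
print(p_next)                 # -> [0.9 0.1]
```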

 

A distribution on the states such that the distribution at time n + 1 is the same as the distribution at time n is called a stationary distribution. The


stationary distribution is so called because if the initial state of a Markov chain is drawn according to a stationary distribution, the Markov chain forms a stationary process.

If the finite-state Markov chain is irreducible and aperiodic, the stationary distribution is unique, and from any starting distribution, the distribution of Xn tends to the stationary distribution as n → ∞.

Example 4.1.1 Consider a two-state Markov chain with a probability transition matrix

    P = [ 1 − α      α   ]
        [   β      1 − β ]                (4.6)

as shown in Figure 4.1.

Let the stationary distribution be represented by a vector µ whose components are the stationary probabilities of states 1 and 2, respectively. Then the stationary probability can be found by solving the equation µP = µ or, more simply, by balancing probabilities. For the stationary distribution, the net probability flow across any cut set in the state transition graph is zero. Applying this to Figure 4.1, we obtain

µ1α = µ2β.

(4.7)

Since µ1 + µ2 = 1, the stationary distribution is

    µ1 = β/(α + β),    µ2 = α/(α + β).   (4.8)

If the Markov chain has an initial state drawn according to the stationary distribution, the resulting process will be stationary.

[FIGURE 4.1. Two-state Markov chain, with transition probability α from state 1 to state 2, transition probability β from state 2 to state 1, and self-loop probabilities 1 − α and 1 − β.]

The entropy of the state Xn at time n is

    H(Xn) = H( β/(α + β), α/(α + β) ).   (4.9)
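The stationary distribution (4.8) and the entropy (4.9) are easy to check numerically. The sketch below (the values of α and β are arbitrary) finds the stationary distribution as the left eigenvector of P for eigenvalue 1 and compares it with the closed form.

```python
import numpy as np

# Two-state Markov chain of Example 4.1.1 with illustrative values of alpha, beta.
alpha, beta = 0.3, 0.1
P = np.array([[1 - alpha, alpha],
              [beta, 1 - beta]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1.
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
mu = mu / mu.sum()
closed_form = np.array([beta, alpha]) / (alpha + beta)   # (4.8)

def entropy_bits(p):
    """Entropy in bits of a probability vector p, with 0 log 0 = 0."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print("numerical stationary distribution:", mu)
print("closed form (4.8):               ", closed_form)
print("H(Xn) under the stationary distribution, cf. (4.9):", entropy_bits(mu))
```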

 

 

 

 

 

 

 

 

However, this is not the rate at which entropy grows for H (X1, X2, . . . , Xn). The dependence among the Xi ’s will take a steady toll.

4.2 ENTROPY RATE

If we have a sequence of n random variables, a natural question to ask is: How does the entropy of the sequence grow with n? We define the entropy rate as this rate of growth as follows.

Definition The entropy of a stochastic process {Xi } is defined by

    H(X) = lim_{n→∞} (1/n) H(X1, X2, . . . , Xn)   (4.10)

when the limit exists.

We now consider some simple examples of stochastic processes and their corresponding entropy rates.

1. Typewriter. Consider the case of a typewriter that has m equally likely output letters. The typewriter can produce m^n sequences of length n, all of them equally likely. Hence H(X1, X2, . . . , Xn) = log m^n and the entropy rate is H(X) = log m bits per symbol.

2. X1, X2, . . . are i.i.d. random variables. Then

    H(X) = lim H(X1, X2, . . . , Xn)/n = lim nH(X1)/n = H(X1),   (4.11)

which is what one would expect for the entropy rate per symbol.

3. Sequence of independent but not identically distributed random variables. In this case,

    H(X1, X2, . . . , Xn) = Σ_{i=1}^{n} H(Xi),   (4.12)

but the H(Xi)'s are not all equal. We can choose a sequence of distributions on X1, X2, . . . such that the limit of (1/n) Σ H(Xi) does not exist. An example of such a sequence is a random binary sequence