Cover T. M., Thomas J. A., Elements of Information Theory, 2006, 748 pp.


as a double maximization using Lemma 10.8.1,

C = \max_{q(x|y)} \max_{r(x)} \sum_{x} \sum_{y} r(x)\, p(y|x) \log \frac{q(x|y)}{r(x)}.   (10.145)

In this case, the Csiszár–Tusnády algorithm becomes one of alternating maximization: we start with a guess of the maximizing distribution r(x) and find the best conditional distribution, which is, by Lemma 10.8.1,

q(x|y) = \frac{r(x)\, p(y|x)}{\sum_{x} r(x)\, p(y|x)}.   (10.146)

 

For this conditional distribution, we find the best input distribution r(x) by solving the constrained maximization problem with Lagrange multipliers. The optimum input distribution is

 

 

r(x) = \frac{\prod_{y} \left( q(x|y) \right)^{p(y|x)}}{\sum_{x} \prod_{y} \left( q(x|y) \right)^{p(y|x)}},   (10.147)

which we can use as the basis for the next iteration.

These algorithms for the computation of the channel capacity and the rate distortion function were established by Blahut [65] and Arimoto [25], and the convergence for the rate distortion computation was proved by Csiszár [139]. The alternating minimization procedure of Csiszár and Tusnády can be specialized to many other situations as well, including the EM algorithm [166] and the algorithm for finding the log-optimal portfolio for a stock market [123].
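The double maximization above can be run numerically. Below is a minimal sketch (our own code, not from the text) of the Blahut-Arimoto capacity iteration of (10.145)-(10.147) in NumPy; the function name and the binary-symmetric-channel example are our own choices for illustration, and the code assumes a channel matrix with strictly positive entries.

```python
import numpy as np

def blahut_arimoto(p_y_given_x, iters=200):
    """Channel capacity via alternating maximization (eqs. 10.145-10.147).

    p_y_given_x: channel matrix p(y|x), rows indexed by x, columns by y.
    Returns (capacity in bits, capacity-achieving input distribution r).
    """
    m = p_y_given_x.shape[0]
    r = np.full(m, 1.0 / m)              # initial guess for r(x)
    for _ in range(iters):
        # Best conditional q(x|y) for the current r(x)  (eq. 10.146)
        q = r[:, None] * p_y_given_x
        q /= q.sum(axis=0, keepdims=True)
        # Best input distribution r(x) for the current q(x|y)  (eq. 10.147)
        log_r = (p_y_given_x * np.log(q + 1e-300)).sum(axis=1)
        r = np.exp(log_r - log_r.max())  # subtract max for numerical stability
        r /= r.sum()
    # Evaluate the capacity functional (eq. 10.145), in bits
    q = r[:, None] * p_y_given_x
    q /= q.sum(axis=0, keepdims=True)
    C = (r[:, None] * p_y_given_x * np.log2(q / r[:, None])).sum()
    return C, r

# Binary symmetric channel with crossover 0.1: C = 1 - H(0.1) ≈ 0.531 bits
bsc = np.array([[0.9, 0.1], [0.1, 0.9]])
C, r = blahut_arimoto(bsc)
```

Each half-step increases the objective, so the iteration climbs to the capacity; for the symmetric example the uniform input is already optimal and the iteration is at a fixed point immediately.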

SUMMARY

Rate distortion. The rate distortion function for a source X ~ p(x) and distortion measure d(x, x̂) is

R(D) = \min_{p(\hat{x}|x):\ \sum_{(x,\hat{x})} p(x)\, p(\hat{x}|x)\, d(x,\hat{x}) \le D} I(X; \hat{X}),   (10.148)

where the minimization is over all conditional distributions p(x̂|x) for which the joint distribution p(x, x̂) = p(x)p(x̂|x) satisfies the expected distortion constraint.


Rate distortion theorem. If R > R(D), there exists a sequence of codes X̂^n(X^n) with the number of codewords |X̂^n(·)| ≤ 2^{nR} such that E d(X^n, X̂^n(X^n)) → D. If R < R(D), no such codes exist.

Bernoulli source. For a Bernoulli source with Hamming distortion,

R(D) = H(p) - H(D).   (10.149)
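The Blahut iteration of Section 10.8 applies to R(D) as well. The following is a minimal numerical sketch (our own code, not from the text; the slope parameter s and the function name are our own) that recovers one point of eq. (10.149):

```python
import numpy as np

def blahut_arimoto_rd(p_x, dist, s, iters=500):
    """One point on the R(D) curve via alternating minimization.

    p_x:  source distribution; dist: distortion matrix d(x, xhat);
    s:    slope parameter (s > 0); larger s gives smaller distortion.
    Returns (D, R) with R in bits.
    """
    m = dist.shape[1]
    q = np.full(m, 1.0 / m)                  # output distribution q(xhat)
    A = np.exp(-s * dist)
    for _ in range(iters):
        # optimal test channel p(xhat|x) for the current q
        p_cond = q[None, :] * A
        p_cond /= p_cond.sum(axis=1, keepdims=True)
        q = p_x @ p_cond                     # induced output distribution
    joint = p_x[:, None] * p_cond
    D = (joint * dist).sum()
    R = (joint * np.log2(p_cond / q[None, :])).sum()
    return D, R

# Bernoulli(1/2) with Hamming distortion: R(D) = 1 - H(D)
p = np.array([0.5, 0.5])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
D, R = blahut_arimoto_rd(p, d, s=np.log(9))  # s chosen so that D = 0.1
```

For this choice of s the iteration settles at D = 0.1 and R = 1 − H(0.1) ≈ 0.531 bits, matching the formula.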

Gaussian source. For a Gaussian source with squared-error distortion,

R(D) = \frac{1}{2} \log \frac{\sigma^2}{D}.   (10.150)

Source-channel separation. A source with rate distortion R(D) can be sent over a channel of capacity C and recovered with distortion D if and only if R(D) < C.

Multivariate Gaussian source. The rate distortion function for a multivariate normal vector with Euclidean mean-squared-error distortion is given by reverse water-filling on the eigenvalues.
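Reverse water-filling can be computed numerically; a minimal sketch (our own code, with the bisection approach and the function name as our own choices) chooses a water level λ with Σᵢ min(λ, σᵢ²) = D and spends rate only on eigenvalues above the level:

```python
import numpy as np

def reverse_waterfill(eigvals, D):
    """Rate distortion for N(0, K) with squared error, by reverse
    water-filling on the eigenvalues of K (assumes 0 < D <= sum(eigvals)).

    Find lambda so that sum_i min(lambda, sigma_i^2) = D; then
    R = sum_i (1/2) log2(sigma_i^2 / D_i) with D_i = min(lambda, sigma_i^2).
    """
    lo, hi = 0.0, max(eigvals)
    for _ in range(100):                     # bisection on the water level
        lam = 0.5 * (lo + hi)
        if sum(min(lam, s) for s in eigvals) < D:
            lo = lam                         # level too low: raise it
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    Di = [min(lam, s) for s in eigvals]
    R = sum(0.5 * np.log2(s / d) for s, d in zip(eigvals, Di))
    return R, Di

# Two unit eigenvalues, total distortion 0.5: each gets D_i = 0.25,
# so R = 2 * (1/2) log2(1/0.25) = 2 bits.
R, Di = reverse_waterfill([1.0, 1.0], 0.5)
```

Eigenvalues at or below the water level are described at zero rate (their log term vanishes), which is exactly the reverse water-filling picture.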

PROBLEMS

10.1 One-bit quantization of a single Gaussian random variable. Let X ~ N(0, σ²) and let the distortion measure be squared error. Here we do not allow block descriptions. Show that the optimum reproduction points for 1-bit quantization are ±√(2/π) σ and that the expected distortion for 1-bit quantization is \frac{\pi - 2}{\pi} σ². Compare this with the distortion rate bound D = σ² 2^{−2R} for R = 1.
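A quick numerical check of the two claimed values (a sketch we added, not part of the problem), integrating against the standard normal density on a fine grid:

```python
import numpy as np

# One-bit quantizer with reproduction points +/- a: each x is mapped to
# the nearer point, so the distortion integrand is min((x-a)^2, (x+a)^2).
sigma = 1.0
a = np.sqrt(2 / np.pi) * sigma                 # claimed optimal reproduction point
x = np.linspace(-8 * sigma, 8 * sigma, 400001)
dx = x[1] - x[0]
phi = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
distortion = (np.minimum((x - a)**2, (x + a)**2) * phi).sum() * dx

claimed = (np.pi - 2) / np.pi * sigma**2       # about 0.3634 sigma^2
bound = sigma**2 * 2.0 ** (-2 * 1)             # distortion rate bound at R = 1
```

The numeric integral matches (π − 2)/π σ², which exceeds the block-coding bound σ²/4, as the problem anticipates.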

10.2 Rate distortion function with infinite distortion. Find the rate distortion function R(D) = min I(X; X̂) for X ~ Bernoulli(1/2) and distortion

d(x, \hat{x}) = \begin{cases} 0, & x = \hat{x}, \\ 1, & x = 1,\ \hat{x} = 0, \\ \infty, & x = 0,\ \hat{x} = 1. \end{cases}


10.3 Rate distortion for binary source with asymmetric distortion. Fix p(x̂|x) and evaluate I(X; X̂) and D for X ~ Bernoulli(1/2),

d(x, \hat{x}) = \begin{pmatrix} 0 & a \\ b & 0 \end{pmatrix}.

(The rate distortion function cannot be expressed in closed form.)

10.4 Properties of R(D). Consider a discrete source X ∈ X = {1, 2, . . . , m} with distribution p_1, p_2, . . . , p_m and a distortion measure d(i, j). Let R(D) be the rate distortion function for this source and distortion measure. Let d'(i, j) = d(i, j) − w_i be a new distortion measure, and let R'(D) be the corresponding rate distortion function. Show that R'(D) = R(D + \bar{w}), where \bar{w} = \sum_i p_i w_i, and use this to show that there is no essential loss of generality in assuming that \min_{\hat{x}} d(i, \hat{x}) = 0 (i.e., for each x ∈ X, there is one symbol x̂ that reproduces the source with zero distortion). This result is due to Pinkston [420].

10.5 Rate distortion for uniform source with Hamming distortion. Consider a source X uniformly distributed on the set {1, 2, . . . , m}. Find the rate distortion function for this source with Hamming distortion; that is,

d(x, \hat{x}) = \begin{cases} 0 & \text{if } x = \hat{x}, \\ 1 & \text{if } x \ne \hat{x}. \end{cases}

10.6 Shannon lower bound for the rate distortion function. Consider a source X with a distortion measure d(x, x̂) that satisfies the following property: All columns of the distortion matrix are permutations of the set {d_1, d_2, . . . , d_m}. Define the function

\phi(D) = \max_{p:\ \sum_{i=1}^{m} p_i d_i \le D} H(p).   (10.151)

The Shannon lower bound on the rate distortion function [485] is proved by the following steps:

 

 

(a) Show that φ(D) is a concave function of D.

(b) Justify the following series of inequalities for I(X; X̂) if Ed(X, X̂) ≤ D:

I(X; \hat{X}) = H(X) - H(X|\hat{X})   (10.152)
             = H(X) - \sum_{\hat{x}} p(\hat{x}) H(X|\hat{X} = \hat{x})   (10.153)
             \ge H(X) - \sum_{\hat{x}} p(\hat{x})\, \phi(D_{\hat{x}})   (10.154)
             \ge H(X) - \phi\left( \sum_{\hat{x}} p(\hat{x}) D_{\hat{x}} \right)   (10.155)
             \ge H(X) - \phi(D),   (10.156)

where D_{\hat{x}} = \sum_x p(x|\hat{x})\, d(x, \hat{x}).

 

(c) Argue that

R(D) \ge H(X) - \phi(D),   (10.157)

which is the Shannon lower bound on the rate distortion function.

(d) If, in addition, we assume that the source has a uniform distribution and that the rows of the distortion matrix are permutations of each other, then R(D) = H(X) − φ(D) (i.e., the lower bound is tight).

10.7 Erasure distortion. Consider X ~ Bernoulli(1/2), and let the distortion measure be given by the matrix

d(x, \hat{x}) = \begin{pmatrix} 0 & \infty & 1 \\ \infty & 0 & 1 \end{pmatrix}.   (10.158)

Calculate the rate distortion function for this source. Can you suggest a simple scheme to achieve any value of the rate distortion function for this source?

10.8 Bounds on the rate distortion function for squared-error distortion. For the case of a continuous random variable X with mean zero and variance σ² and squared-error distortion, show that

h(X) - \frac{1}{2}\log(2\pi e D) \le R(D) \le \frac{1}{2}\log\frac{\sigma^2}{D}.   (10.159)

For the upper bound, consider the following joint distribution:

[Figure: forward test channel for the upper bound, \hat{X} = \frac{\sigma^2 - D}{\sigma^2}(X + Z), where Z ~ N\left(0, \frac{D\sigma^2}{\sigma^2 - D}\right) is independent of X.]

Are Gaussian random variables harder or easier to describe than other random variables with the same variance?
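As a sanity check on the suggested joint distribution (our own sketch; σ² = 1 and D = 0.25 are arbitrary choices), simulation confirms that the test channel meets the distortion constraint with equality:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, D, n = 1.0, 0.25, 1_000_000

# Test channel from the figure: Xhat = (sigma^2 - D)/sigma^2 * (X + Z),
# with Z ~ N(0, D*sigma^2/(sigma^2 - D)) independent of X.
X = rng.normal(0.0, np.sqrt(sigma2), n)
Z = rng.normal(0.0, np.sqrt(D * sigma2 / (sigma2 - D)), n)
Xhat = (sigma2 - D) / sigma2 * (X + Z)

mse = np.mean((X - Xhat) ** 2)          # should be close to D
```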

10.9 Properties of optimal rate distortion code. A good (R, D) rate distortion code with R ≈ R(D) puts severe constraints on the relationship of the source X^n and the representations X̂^n. Examine the chain of inequalities (10.58)-(10.71) considering the conditions for equality and interpret as properties of a good code. For example, equality in (10.59) implies that X̂^n is a deterministic function of X^n.

 

10.10 Rate distortion. Find and verify the rate distortion function R(D) for X uniform on X = {1, 2, . . . , 2m} and

d(x, \hat{x}) = \begin{cases} 1 & \text{for } x - \hat{x} \text{ odd}, \\ 0 & \text{for } x - \hat{x} \text{ even}, \end{cases}

where x̂ is defined on X̂ = {1, 2, . . . , 2m}. (You may wish to use the Shannon lower bound in your argument.)

10.11 Lower bound. Let X be a random variable with density

f(x) = \frac{e^{-x^4}}{\int_{-\infty}^{\infty} e^{-x^4}\, dx},

and let

c = \frac{\int_{-\infty}^{\infty} x^4 e^{-x^4}\, dx}{\int_{-\infty}^{\infty} e^{-x^4}\, dx}.

Define g(a) = max h(X) over all densities such that EX⁴ ≤ a. Let R(D) be the rate distortion function for X with the density above and with distortion criterion d(x, x̂) = (x − x̂)⁴. Show that

R(D) \ge g(c) - g(D).


10.12 Adding a column to the distortion matrix. Let R(D) be the rate distortion function for an i.i.d. process with probability mass function p(x) and distortion function d(x, x̂), x ∈ X, x̂ ∈ X̂. Now suppose that we add a new reproduction symbol x̂₀ to X̂ with associated distortion d(x, x̂₀), x ∈ X. Does this increase or decrease R(D), and why?

10.13 Simplification. Suppose that X = {1, 2, 3, 4}, X̂ = {1, 2, 3, 4}, p(i) = 1/4, i = 1, 2, 3, 4, and X₁, X₂, . . . are i.i.d. ~ p(x). The distortion matrix d(x, x̂) is given by

          x̂ = 1   2   3   4
    x = 1     0   0   1   1
    x = 2     0   0   1   1
    x = 3     1   1   0   0
    x = 4     1   1   0   0

(a) Find R(0), the rate necessary to describe the process with zero distortion.

(b) Find the rate distortion function R(D). There are some irrelevant distinctions in alphabets X and X̂, which allow the problem to be collapsed.

(c) Suppose that we have a nonuniform distribution p(i) = p_i, i = 1, 2, 3, 4. What is R(D)?

10.14 Rate distortion for two independent sources. Can one compress two independent sources simultaneously better than by compressing the sources individually? The following problem addresses this question. Let {X_i} be i.i.d. ~ p(x) with distortion d(x, x̂) and rate distortion function R_X(D). Similarly, let {Y_i} be i.i.d. ~ p(y) with distortion d(y, ŷ) and rate distortion function R_Y(D). Suppose we now wish to describe the process {(X_i, Y_i)} subject to distortions

 

 

Ed(X, X̂) ≤ D₁ and Ed(Y, Ŷ) ≤ D₂. Thus, a rate R_{X,Y}(D₁, D₂) is sufficient, where

R_{X,Y}(D_1, D_2) = \min_{p(\hat{x},\hat{y}|x,y):\ Ed(X,\hat{X}) \le D_1,\ Ed(Y,\hat{Y}) \le D_2} I(X, Y; \hat{X}, \hat{Y}).

Now suppose that the {Xi } process and the {Yi } process are independent of each other.

(a) Show that

R_{X,Y}(D_1, D_2) \le R_X(D_1) + R_Y(D_2).


(b) Does equality hold? Now answer the question.

10.15 Distortion rate function. Let

D(R) = \min_{p(\hat{x}|x):\ I(X;\hat{X}) \le R} Ed(X, \hat{X})   (10.160)

be the distortion rate function.

(a) Is D(R) increasing or decreasing in R?

(b) Is D(R) convex or concave in R?

(c) Converse for distortion rate functions: We now wish to prove the converse by focusing on D(R). Let X₁, X₂, . . . , Xₙ be i.i.d. ~ p(x). Suppose that one is given a (2^{nR}, n) rate distortion code X^n → i(X^n) → X̂^n(i(X^n)), with i(X^n) ∈ {1, 2, . . . , 2^{nR}}, and suppose that the resulting distortion is D = Ed(X^n, X̂^n(i(X^n))). We must show that D ≥ D(R). Give reasons for the following steps in the proof:

D = Ed\left( X^n, \hat{X}^n(i(X^n)) \right)   (10.161)
  \overset{(a)}{=} E\, \frac{1}{n} \sum_{i=1}^{n} d(X_i, \hat{X}_i)   (10.162)
  \overset{(b)}{=} \frac{1}{n} \sum_{i=1}^{n} Ed(X_i, \hat{X}_i)   (10.163)
  \overset{(c)}{\ge} \frac{1}{n} \sum_{i=1}^{n} D\left( I(X_i; \hat{X}_i) \right)   (10.164)
  \overset{(d)}{\ge} D\left( \frac{1}{n} \sum_{i=1}^{n} I(X_i; \hat{X}_i) \right)   (10.165)
  \overset{(e)}{\ge} D\left( \frac{1}{n} I(X^n; \hat{X}^n) \right)   (10.166)
  \overset{(f)}{\ge} D(R).   (10.167)

10.16 Probability of conditionally typical sequences. In Chapter 7 we calculated the probability that two independently drawn sequences X^n and Y^n are weakly jointly typical. To prove the rate distortion theorem, however, we need to calculate this probability when one of the sequences is fixed and the other is random. The techniques of weak typicality allow us only to calculate the average set size of the conditionally typical set. Using the ideas of strong typicality, on the other hand, provides us with stronger bounds that work for all typical x^n sequences. We outline the proof that Pr{(x^n, Y^n) ∈ A_ε^{*(n)}} ≈ 2^{−nI(X;Y)} for all typical x^n. This approach was introduced by Berger [53] and is fully developed in the book by Csiszár and Körner [149].

Let (X_i, Y_i) be drawn i.i.d. ~ p(x, y). Let the marginals of X and Y be p(x) and p(y), respectively.

(a) Let A_ε^{*(n)} be the strongly typical set for X. Show that

|A_\epsilon^{*(n)}| \doteq 2^{nH(X)}.   (10.168)

(Hint: Theorems 11.1.1 and 11.1.3.)

(b) The joint type of a pair of sequences (x^n, y^n) is the proportion of times (x_i, y_i) = (a, b) in the pair of sequences:

p_{x^n,y^n}(a, b) = \frac{1}{n} N(a, b|x^n, y^n) = \frac{1}{n} \sum_{i=1}^{n} I(x_i = a,\ y_i = b).   (10.169)

The conditional type of a sequence y^n given x^n is a stochastic matrix that gives the proportion of times a particular element of Y occurred with each element of X in the pair of sequences. Specifically, the conditional type V_{y^n|x^n}(b|a) is defined as

V_{y^n|x^n}(b|a) = \frac{N(a, b|x^n, y^n)}{N(a|x^n)}.   (10.170)

Show that the number of conditional types is bounded by (n + 1)^{|X||Y|}.
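The bound can be checked by direct counting (a sketch we added; the helper name and the small example are our own): for each a ∈ X, the row of counts (N(a, b))_b is a composition of N(a|x^n) into |Y| nonnegative parts, and each count lies in {0, . . . , n}.

```python
from math import comb

def num_conditional_types(counts_x, y_size):
    """Exact number of conditional types V(b|a), given N(a|x^n) for each a.

    Row a is determined by the counts N(a, b), b in Y, which sum to
    N(a|x^n): there are C(N(a) + |Y| - 1, |Y| - 1) such rows.
    """
    total = 1
    for n_a in counts_x:
        total *= comb(n_a + y_size - 1, y_size - 1)
    return total

# Binary alphabets, n = 10, with N(0|x^n) = 6 and N(1|x^n) = 4:
n, x_size, y_size = 10, 2, 2
count = num_conditional_types([6, 4], y_size)    # 7 * 5 = 35
assert count <= (n + 1) ** (x_size * y_size)     # 35 <= 11^4
```

The exact count is far below the polynomial bound, which is all the method of types needs: polynomially many types against exponentially many sequences.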

(c) The set of sequences y^n ∈ Y^n with conditional type V with respect to a sequence x^n is called the conditional type class T_V(x^n). Show that

\frac{1}{(n+1)^{|X||Y|}}\, 2^{nH(Y|X)} \le |T_V(x^n)| \le 2^{nH(Y|X)}.   (10.171)

 

 

 

 

 

 

 

 

 

 

 

(d) The sequence y^n ∈ Y^n is said to be ε-strongly conditionally typical with the sequence x^n with respect to the conditional distribution V(·|·) if the conditional type is close to V. The conditional type should satisfy the following two conditions:

 

 

 


(i) For all (a, b) ∈ X × Y with V(b|a) > 0,

\frac{1}{n} \left| N(a, b|x^n, y^n) - V(b|a)\, N(a|x^n) \right| \le \frac{\epsilon}{|Y| + 1}.   (10.172)

(ii) N(a, b|x^n, y^n) = 0 for all (a, b) such that V(b|a) = 0.

The set of such sequences is called the conditionally typical set and is denoted A_ε^{*(n)}(Y|x^n). Show that the number of sequences y^n that are conditionally typical with a given x^n ∈ X^n is bounded by

 

 

\frac{1}{(n+1)^{|X||Y|}}\, 2^{n(H(Y|X) - \epsilon_1)} \le |A_\epsilon^{*(n)}(Y|x^n)| \le (n+1)^{|X||Y|}\, 2^{n(H(Y|X) + \epsilon_1)},   (10.173)

where ε₁ → 0 as ε → 0.

(e) For a pair of random variables (X, Y) with joint distribution p(x, y), the ε-strongly typical set A_ε^{*(n)} is the set of sequences (x^n, y^n) ∈ X^n × Y^n satisfying

(i)

\left| \frac{1}{n} N(a, b|x^n, y^n) - p(a, b) \right| < \frac{\epsilon}{|X||Y|}   (10.174)

for every pair (a, b) ∈ X × Y with p(a, b) > 0.

(ii) N(a, b|x^n, y^n) = 0 for all (a, b) ∈ X × Y with p(a, b) = 0.

 

 

 

 

 

 

 

The set of ε-strongly jointly typical sequences is called the ε-strongly jointly typical set and is denoted A_ε^{*(n)}(X, Y). Let (X, Y) be drawn i.i.d. ~ p(x, y). For any x^n such that there exists at least one pair (x^n, y^n) ∈ A_ε^{*(n)}(X, Y), the set of sequences y^n such that (x^n, y^n) ∈ A_ε^{*(n)} satisfies

\frac{1}{(n+1)^{|X||Y|}}\, 2^{n(H(Y|X) - \delta(\epsilon))} \le \left| \{ y^n : (x^n, y^n) \in A_\epsilon^{*(n)} \} \right| \le (n+1)^{|X||Y|}\, 2^{n(H(Y|X) + \delta(\epsilon))},   (10.175)

where δ(ε) → 0 as ε → 0. In particular, we can write

 

2^{n(H(Y|X) - \epsilon_2)} \le \left| \{ y^n : (x^n, y^n) \in A_\epsilon^{*(n)} \} \right| \le 2^{n(H(Y|X) + \epsilon_2)},   (10.176)


where we can make ε₂ arbitrarily small with an appropriate choice of ε and n.

(f) Let Y₁, Y₂, . . . , Yₙ be drawn i.i.d. ~ p(y). For x^n ∈ A_ε^{*(n)}, the probability that (x^n, Y^n) ∈ A_ε^{*(n)} is bounded by

2^{-n(I(X;Y) + \epsilon_3)} \le \Pr\{ (x^n, Y^n) \in A_\epsilon^{*(n)} \} \le 2^{-n(I(X;Y) - \epsilon_3)},   (10.177)

where ε₃ goes to 0 as ε → 0 and n → ∞.

10.17 Source-channel separation theorem with distortion. Let V₁, V₂, . . . , Vₙ be a finite-alphabet i.i.d. source which is encoded as a sequence of n input symbols X^n of a discrete memoryless channel. The output of the channel Y^n is mapped onto the reconstruction alphabet V̂^n = g(Y^n). Let

D = Ed(V^n, \hat{V}^n) = \frac{1}{n} \sum_{i=1}^{n} Ed(V_i, \hat{V}_i)

be the average distortion achieved by this combined source and channel coding scheme.

 

 

V^n → X^n(V^n) → [Channel, capacity C] → Y^n → V̂^n

(a) Show that if C > R(D), where R(D) is the rate distortion function for V, it is possible to find encoders and decoders that achieve an average distortion arbitrarily close to D.

(b) (Converse) Show that if the average distortion is equal to D, the capacity of the channel C must be greater than R(D).

10.18 Rate distortion. Let d(x, x̂) be a distortion function. We have a source X ~ p(x). Let R(D) be the associated rate distortion function.

(a) Find R̃(D) in terms of R(D), where R̃(D) is the rate distortion function associated with the distortion d̃(x, x̂) = d(x, x̂) + a for some constant a > 0. (They are not equal.)

(b) Now suppose that d(x, x̂) ≥ 0 for all x, x̂ and define a new distortion function d'(x, x̂) = b d(x, x̂), where b is some number ≥ 0. Find the associated rate distortion function R'(D) in terms of R(D).

(c) Let X ~ N(0, σ²) and d(x, x̂) = 5(x − x̂)² + 3. What is R(D)?