Cover T.M., Thomas J.A. Elements of Information Theory. 2006, 748 p.

17.8 ENTROPY POWER INEQUALITY AND BRUNN–MINKOWSKI INEQUALITY  675
alternative proof of the entropy power inequality. We also show how the entropy power inequality and the Brunn–Minkowski inequality are related by means of a common proof.
We can rewrite the entropy power inequality for dimension n = 1 in a form that emphasizes its relationship to the normal distribution. Let X and Y be two independent random variables with densities, and let X' and Y' be independent normals with the same entropy as X and Y, respectively. Then $2^{2h(X)} = 2^{2h(X')} = (2\pi e)\sigma_{X'}^2$ and similarly, $2^{2h(Y)} = (2\pi e)\sigma_{Y'}^2$. Hence the entropy power inequality can be rewritten as

$$2^{2h(X+Y)} \ge (2\pi e)\left(\sigma_{X'}^2 + \sigma_{Y'}^2\right) = 2^{2h(X'+Y')}, \qquad (17.89)$$

since X' and Y' are independent. Thus, we have a new statement of the entropy power inequality.
Theorem 17.8.1 (Restatement of the entropy power inequality) For two independent random variables X and Y,

$$h(X + Y) \ge h(X' + Y'), \qquad (17.90)$$

where X' and Y' are independent normal random variables with $h(X') = h(X)$ and $h(Y') = h(Y)$.
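As a quick numerical sanity check (an illustration, not from the text), the restatement can be verified in closed form for two Uniform[0, 1] variables, whose sum is triangular on [0, 2] with differential entropy 1/2 nat:

```python
import math

# Differential entropies in nats (standard closed forms):
# X, Y ~ Uniform[0, 1]   =>  h(X) = h(Y) = 0
# X + Y is triangular on [0, 2]  =>  h(X + Y) = 1/2
h_X, h_Y, h_sum = 0.0, 0.0, 0.5

# EPI in exponential form: exp(2 h(X+Y)) >= exp(2 h(X)) + exp(2 h(Y))
lhs = math.exp(2 * h_sum)                      # e ~ 2.718
rhs = math.exp(2 * h_X) + math.exp(2 * h_Y)    # 1 + 1 = 2
assert lhs >= rhs

# Matching normals X', Y' with h(X') = h(X): (2 pi e) sigma'^2 = exp(2 h(X))
var_Xp = math.exp(2 * h_X) / (2 * math.pi * math.e)
var_Yp = math.exp(2 * h_Y) / (2 * math.pi * math.e)
h_sum_normals = 0.5 * math.log(2 * math.pi * math.e * (var_Xp + var_Yp))
assert h_sum >= h_sum_normals                  # Theorem 17.8.1: h(X+Y) >= h(X'+Y')
```

Here the normal comparison gives $h(X'+Y') = \tfrac{1}{2}\log 2 \approx 0.347$ nat, comfortably below $h(X+Y) = 0.5$ nat.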
This form of the entropy power inequality bears a striking resemblance to the Brunn – Minkowski inequality, which bounds the volume of set sums.
Definition The set sum $A + B$ of two sets $A, B \subset \mathbb{R}^n$ is defined as the set $\{x + y : x \in A,\ y \in B\}$.
Example 17.8.1 The set sum of two spheres of radius 1 is a sphere of radius 2.
Theorem 17.8.2 (Brunn–Minkowski inequality) The volume of the set sum of two sets A and B is at least the volume of the set sum of two spheres A' and B' with the same volumes as A and B, respectively:

$$V(A + B) \ge V(A' + B'), \qquad (17.91)$$

where A' and B' are spheres with $V(A') = V(A)$ and $V(B') = V(B)$.
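The inequality is easy to check by hand for rectangles in the plane, where both set sums have closed forms; the rectangle sizes below are arbitrary illustrative choices:

```python
import math

# A = [0, a1] x [0, a2], B = [0, b1] x [0, b2]  (hypothetical sizes)
a1, a2 = 3.0, 1.0
b1, b2 = 1.0, 2.0
vol_A, vol_B = a1 * a2, b1 * b2

# Set sum of the rectangles is the rectangle [0, a1+b1] x [0, a2+b2].
vol_sum = (a1 + b1) * (a2 + b2)

# Balls A', B' in R^2 with the same areas: pi r^2 = vol.
r_A = math.sqrt(vol_A / math.pi)
r_B = math.sqrt(vol_B / math.pi)
# Set sum of balls of radii r_A and r_B is a ball of radius r_A + r_B.
vol_sum_balls = math.pi * (r_A + r_B) ** 2

assert vol_sum >= vol_sum_balls        # V(A + B) >= V(A' + B')
# Equivalent classical form: V(A+B)^(1/n) >= V(A)^(1/n) + V(B)^(1/n), n = 2
assert vol_sum ** 0.5 >= vol_A ** 0.5 + vol_B ** 0.5
```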
676 INEQUALITIES IN INFORMATION THEORY

The similarity between the two theorems was pointed out in [104]. A common proof was found by Dembo [162] and Lieb, starting from a strengthened version of Young's inequality. The same proof can be used to prove a range of inequalities which includes the entropy power inequality and the Brunn–Minkowski inequality as special cases. We begin with a few definitions.
Definition Let f and g be two densities over $\mathbb{R}^n$ and let $f * g$ denote the convolution of the two densities. Let the $L^r$ norm of the density be defined by

$$\|f\|_r = \left(\int f^r(x)\, dx\right)^{1/r}. \qquad (17.92)$$
Lemma 17.8.1 (Strengthened Young's inequality) For any two densities f and g over $\mathbb{R}^n$,

$$\|f * g\|_r \le \left(\frac{C_p C_q}{C_r}\right)^{n/2} \|f\|_p\, \|g\|_q, \qquad (17.93)$$

where

$$\frac{1}{r} = \frac{1}{p} + \frac{1}{q} - 1 \qquad (17.94)$$

and

$$C_p = \frac{p^{1/p}}{p'^{1/p'}}, \qquad \frac{1}{p} + \frac{1}{p'} = 1. \qquad (17.95)$$

Proof: The proof of this inequality may be found in [38] and [73].
We define a generalization of the entropy.
Definition The Rényi entropy $h_r(X)$ of order $r$ is defined as

$$h_r(X) = \frac{1}{1 - r} \log \int f^r(x)\, dx \qquad (17.96)$$
for $0 < r < \infty$, $r \ne 1$. If we take the limit as $r \to 1$, we obtain the Shannon entropy function,
$$h(X) = h_1(X) = -\int f(x) \log f(x)\, dx. \qquad (17.97)$$
If we take the limit as r → 0, we obtain the logarithm of the volume of the support set,
$$h_0(X) = \log \mu(\{x : f(x) > 0\}). \qquad (17.98)$$
Thus, the zeroth-order Rényi entropy gives the logarithm of the measure of the support set of the density f, and the Shannon entropy $h_1$ gives the logarithm of the size of the "effective" support set (Theorem 8.2.2). We now define the equivalent of the entropy power for Rényi entropies.
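These two limits can be illustrated numerically for a standard normal density; this is only a sketch, and the grid bounds and resolution below are arbitrary numerical choices:

```python
import numpy as np

# Standard normal density on a finite grid (bounds/step are arbitrary choices).
sigma = 1.0
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
f = np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def renyi_entropy(r):
    # h_r(X) = 1/(1 - r) * log int f^r(x) dx   (natural log, nats)
    return np.log(np.sum(f**r) * dx) / (1 - r)

# r -> 1 recovers the Shannon differential entropy 0.5 * log(2 pi e sigma^2).
h_shannon = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
assert abs(renyi_entropy(0.999) - h_shannon) < 1e-3
assert abs(renyi_entropy(1.001) - h_shannon) < 1e-3

# r -> 0 recovers the log-volume of the support; on this truncated grid the
# support is the interval [-12, 12], so h_0 is approximately log 24.
assert abs(renyi_entropy(1e-9) - np.log(24.0)) < 1e-3
```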
Definition The Rényi entropy power $V_r(X)$ of order $r$ is defined as

$$V_r(X) = \begin{cases} \left[\displaystyle\int f^r(x)\, dx\right]^{-\frac{2}{n}\frac{r'}{r}}, & 0 < r \le \infty,\ r \ne 1,\ \dfrac{1}{r} + \dfrac{1}{r'} = 1, \\[2ex] \exp\left[\dfrac{2}{n} h(X)\right], & r = 1, \\[2ex] \mu(\{x : f(x) > 0\})^{2/n}, & r = 0. \end{cases} \qquad (17.99)$$
Theorem 17.8.3 For two independent random variables X and Y and any $0 \le r < \infty$ and any $0 \le \lambda \le 1$, we have

$$\log V_r(X + Y) \ge \lambda \log V_p(X) + (1 - \lambda) \log V_q(Y) + H(\lambda) + \frac{1+r}{1-r}\, H\!\left(\frac{r + \lambda(1-r)}{1+r}\right) - \frac{1+r}{1-r}\, H\!\left(\frac{r}{1+r}\right), \qquad (17.100)$$

where $p = \dfrac{r}{r + \lambda(1-r)}$, $q = \dfrac{r}{r + (1-\lambda)(1-r)}$, and $H(\lambda) = -\lambda \log \lambda - (1 - \lambda)\log(1 - \lambda)$.
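As a consistency check (my own, not in the text), the stated p and q satisfy the Young's-inequality constraint (17.94) with exponent r; exact rational arithmetic makes the identity easy to verify:

```python
from fractions import Fraction

# Check that p = r/(r + lam*(1-r)) and q = r/(r + (1-lam)*(1-r)) satisfy
# 1/r = 1/p + 1/q - 1  (Eq. 17.94), exactly, on a grid of rational r, lambda.
for r in [Fraction(1, 3), Fraction(1, 2), Fraction(2), Fraction(5, 2)]:
    for lam in [Fraction(0), Fraction(1, 4), Fraction(1, 2),
                Fraction(3, 4), Fraction(1)]:
        p = r / (r + lam * (1 - r))
        q = r / (r + (1 - lam) * (1 - r))
        assert 1 / p + 1 / q - 1 == 1 / r
```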
Proof: If we take the logarithm of Young's inequality (17.93), we obtain

$$\frac{1}{r'} \log V_r(X + Y) \ge \frac{1}{p'} \log V_p(X) + \frac{1}{q'} \log V_q(Y) + \log C_r - \log C_p - \log C_q. \qquad (17.101)$$
Setting $\lambda = r'/p'$ and using (17.94), we have $1 - \lambda = r'/q'$, $p = \dfrac{r}{r + \lambda(1-r)}$, and $q = \dfrac{r}{r + (1-\lambda)(1-r)}$. Thus, (17.101) becomes
$$\begin{aligned} \log V_r(X + Y) \ge{} & \lambda \log V_p(X) + (1 - \lambda) \log V_q(Y) \\ & + \frac{r'}{r}\log r - \frac{r'}{r'}\log r' - \frac{r'}{p}\log p + \frac{r'}{p'}\log p' - \frac{r'}{q}\log q + \frac{r'}{q'}\log q' \qquad (17.102) \\ ={} & \lambda \log V_p(X) + (1 - \lambda) \log V_q(Y) \\ & + \frac{r'}{r}\log r - (\lambda + 1 - \lambda)\log r' - \frac{r'}{p}\log p + \lambda \log p' - \frac{r'}{q}\log q + (1 - \lambda)\log q'. \qquad (17.103) \end{aligned}$$
The general theorem unifies the entropy power inequality and the Brunn – Minkowski inequality and introduces a continuum of new inequalities that lie between the entropy power inequality and the Brunn – Minkowski inequality. This further strengthens the analogy between entropy power and volume.
17.9 INEQUALITIES FOR DETERMINANTS
Throughout the remainder of this chapter, we assume that K is a nonnegative definite symmetric $n \times n$ matrix. Let $|K|$ denote the determinant of K.
We first give an information-theoretic proof of a result due to Ky Fan [199].
Theorem 17.9.1 $\log |K|$ is a concave function of K.
Proof: Let $X_1$ and $X_2$ be normally distributed n-vectors, $X_i \sim N(0, K_i)$, $i = 1, 2$. Let the random variable $\theta$ have the distribution
$$\Pr\{\theta = 1\} = \lambda, \qquad (17.111)$$
$$\Pr\{\theta = 2\} = 1 - \lambda \qquad (17.112)$$

for some $0 \le \lambda \le 1$. Let $\theta$, $X_1$, and $X_2$ be independent, and let $Z = X_\theta$. Then Z has covariance $K_Z = \lambda K_1 + (1 - \lambda)K_2$. However, Z will not be multivariate normal. By first using Theorem 17.2.3, followed by Theorem 17.2.1, we have
$$\begin{aligned} \tfrac{1}{2}\log\left[(2\pi e)^n |\lambda K_1 + (1 - \lambda)K_2|\right] &\ge h(Z) \qquad (17.113) \\ &\ge h(Z \mid \theta) \qquad (17.114) \\ &= \lambda\, \tfrac{1}{2}\log\left[(2\pi e)^n |K_1|\right] + (1 - \lambda)\, \tfrac{1}{2}\log\left[(2\pi e)^n |K_2|\right]. \end{aligned}$$
Thus,

$$|\lambda K_1 + (1 - \lambda)K_2| \ge |K_1|^{\lambda} |K_2|^{1-\lambda}, \qquad (17.115)$$

as desired.
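A numerical spot-check of this concavity, equivalently inequality (17.115), on randomly generated positive definite matrices (the generator and sizes are arbitrary test choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(n):
    # Random symmetric positive definite matrix (hypothetical test data).
    a = rng.standard_normal((n, n))
    return a @ a.T + n * np.eye(n)

n = 5
K1, K2 = random_psd(n), random_psd(n)
for lam in [0.0, 0.25, 0.5, 0.75, 1.0]:
    # slogdet returns (sign, log|K|); take the log-determinant.
    lhs = np.linalg.slogdet(lam * K1 + (1 - lam) * K2)[1]
    rhs = lam * np.linalg.slogdet(K1)[1] + (1 - lam) * np.linalg.slogdet(K2)[1]
    assert lhs >= rhs - 1e-9   # concavity of log|K|, i.e. (17.115) in log form
```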
We now give Hadamard’s inequality using an information-theoretic proof [128].
Translating this to determinants, one obtains

$$\lim_{n \to \infty} |K_n|^{1/n} = \lim_{n \to \infty} \frac{|K_n|}{|K_{n-1}|}. \qquad (17.135)$$
Theorem 17.9.7 (Minkowski inequality [390])

$$|K_1 + K_2|^{1/n} \ge |K_1|^{1/n} + |K_2|^{1/n}. \qquad (17.136)$$
Proof: Let $X_1, X_2$ be independent with $X_i \sim N(0, K_i)$. Noting that $X_1 + X_2 \sim N(0, K_1 + K_2)$ and using the entropy power inequality (Theorem 17.7.3) yields

$$\begin{aligned} (2\pi e)|K_1 + K_2|^{1/n} &= 2^{\frac{2}{n} h(X_1 + X_2)} \qquad (17.137) \\ &\ge 2^{\frac{2}{n} h(X_1)} + 2^{\frac{2}{n} h(X_2)} \qquad (17.138) \\ &= (2\pi e)|K_1|^{1/n} + (2\pi e)|K_2|^{1/n}. \qquad (17.139) \end{aligned}$$
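The determinant form of the inequality can likewise be spot-checked numerically on random positive definite matrices (arbitrary test data, not from the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_psd(n):
    # Random symmetric positive definite matrix (hypothetical test data).
    a = rng.standard_normal((n, n))
    return a @ a.T + np.eye(n)

n = 4
K1, K2 = random_psd(n), random_psd(n)
det = np.linalg.det
lhs = det(K1 + K2) ** (1 / n)
rhs = det(K1) ** (1 / n) + det(K2) ** (1 / n)
assert lhs >= rhs - 1e-9       # |K1 + K2|^(1/n) >= |K1|^(1/n) + |K2|^(1/n)
```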
17.10 INEQUALITIES FOR RATIOS OF DETERMINANTS
We now prove similar inequalities for ratios of determinants. Before developing the next theorem, we make an observation about minimum mean-squared-error linear prediction. If $(X_1, X_2, \ldots, X_n) \sim N(0, K_n)$, we know that the conditional density of $X_n$ given $(X_1, X_2, \ldots, X_{n-1})$ is univariate normal with mean linear in $X_1, X_2, \ldots, X_{n-1}$ and conditional variance $\sigma_n^2$. Here $\sigma_n^2$ is the minimum mean squared error $E(X_n - \hat{X}_n)^2$ over all linear estimators $\hat{X}_n$ based on $X_1, X_2, \ldots, X_{n-1}$.
Lemma 17.10.1 $\sigma_n^2 = |K_n| / |K_{n-1}|$.
Proof: Using the conditional normality of $X_n$, we have

$$\begin{aligned} \tfrac{1}{2}\log 2\pi e \sigma_n^2 &= h(X_n \mid X_1, X_2, \ldots, X_{n-1}) \qquad (17.140) \\ &= h(X_1, X_2, \ldots, X_n) - h(X_1, X_2, \ldots, X_{n-1}) \qquad (17.141) \\ &= \tfrac{1}{2}\log\left[(2\pi e)^n |K_n|\right] - \tfrac{1}{2}\log\left[(2\pi e)^{n-1} |K_{n-1}|\right] \qquad (17.142) \\ &= \tfrac{1}{2}\log 2\pi e \frac{|K_n|}{|K_{n-1}|}. \qquad (17.143) \end{aligned}$$
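The lemma is the familiar Schur-complement identity for the best linear predictor; a numerical check with an arbitrary random covariance matrix (test data of my choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
a = rng.standard_normal((n, n))
K = a @ a.T + np.eye(n)        # covariance of (X_1, ..., X_n), positive definite

# Best linear predictor of X_n from X_1..X_{n-1} has mean squared error equal
# to the Schur complement K_nn - k^T K_{n-1}^{-1} k.
K_sub = K[:-1, :-1]            # K_{n-1}
k = K[:-1, -1]                 # cross-covariances with X_n
mse = K[-1, -1] - k @ np.linalg.solve(K_sub, k)

ratio = np.linalg.det(K) / np.linalg.det(K_sub)
assert abs(mse - ratio) < 1e-6     # sigma_n^2 = |K_n| / |K_{n-1}|
```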
Minimization of $\sigma_n^2$ over a set of allowed covariance matrices $\{K_n\}$ is aided by the following theorem. Such problems arise in maximum entropy spectral density estimation.
Theorem 17.10.1 (Bergstrøm [42]) $\log(|K_n| / |K_{n-p}|)$ is concave in $K_n$.

Proof: We remark that Theorem 17.9.1 cannot be used because $\log(|K_n|/|K_{n-p}|)$ is the difference of two concave functions. Let $Z = X_\theta$, where $X_1 \sim N(0, S_n)$, $X_2 \sim N(0, T_n)$, $\Pr\{\theta = 1\} = \lambda = 1 - \Pr\{\theta = 2\}$, and let $X_1, X_2, \theta$ be independent. The covariance matrix $K_n$ of Z is given by

$$K_n = \lambda S_n + (1 - \lambda)T_n. \qquad (17.144)$$
The following chain of inequalities proves the theorem:

$$\begin{aligned} \lambda\, \tfrac{1}{2} &\log\left[(2\pi e)^p |S_n|/|S_{n-p}|\right] + (1 - \lambda)\, \tfrac{1}{2}\log\left[(2\pi e)^p |T_n|/|T_{n-p}|\right] \\ &\stackrel{(a)}{=} \lambda\, h(X_{1,n}, X_{1,n-1}, \ldots, X_{1,n-p+1} \mid X_{1,1}, \ldots, X_{1,n-p}) \\ &\qquad + (1 - \lambda)\, h(X_{2,n}, X_{2,n-1}, \ldots, X_{2,n-p+1} \mid X_{2,1}, \ldots, X_{2,n-p}) \qquad (17.145) \\ &= h(Z_n, Z_{n-1}, \ldots, Z_{n-p+1} \mid Z_1, \ldots, Z_{n-p}, \theta) \qquad (17.146) \\ &\stackrel{(b)}{\le} h(Z_n, Z_{n-1}, \ldots, Z_{n-p+1} \mid Z_1, \ldots, Z_{n-p}) \qquad (17.147) \\ &\stackrel{(c)}{\le} \tfrac{1}{2}\log\left[(2\pi e)^p \frac{|K_n|}{|K_{n-p}|}\right], \qquad (17.148) \end{aligned}$$

where (a) follows from $h(X_n, X_{n-1}, \ldots, X_{n-p+1} \mid X_1, \ldots, X_{n-p}) = h(X_1, \ldots, X_n) - h(X_1, \ldots, X_{n-p})$, (b) follows from the conditioning lemma, and (c) follows from a conditional version of Theorem 17.2.3.
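A numerical spot-check of Bergstrøm's theorem on random positive definite matrices, taking $K_{n-p}$ to be the leading principal submatrix; the sizes and generator are arbitrary test choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 6, 2

def random_psd(n):
    # Random symmetric positive definite matrix (hypothetical test data).
    a = rng.standard_normal((n, n))
    return a @ a.T + np.eye(n)

def g(K):
    # log(|K_n| / |K_{n-p}|), with K_{n-p} the leading principal submatrix.
    return np.linalg.slogdet(K)[1] - np.linalg.slogdet(K[:n - p, :n - p])[1]

S, T = random_psd(n), random_psd(n)
for lam in [0.0, 0.3, 0.5, 0.7, 1.0]:
    K = lam * S + (1 - lam) * T
    assert g(K) >= lam * g(S) + (1 - lam) * g(T) - 1e-9   # concavity
```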
Theorem 17.10.2 (Bergstrøm [42]) $|K_n|/|K_{n-1}|$ is concave in $K_n$.
Proof: Again we use the properties of Gaussian random variables. Let us assume that we have two independent Gaussian random n-vectors, $X \sim N(0, A_n)$ and $Y \sim N(0, B_n)$. Let $Z = X + Y$. Then

$$\tfrac{1}{2}\log 2\pi e \frac{|A_n + B_n|}{|A_{n-1} + B_{n-1}|} \stackrel{(a)}{=} h(Z_n \mid Z_{n-1}, Z_{n-2}, \ldots, Z_1) \qquad (17.149)$$