
4.2 Stationary Codes and Approximation

$$= P(Q_i \cap R_i) - P\Big(Q_i \cap R_i \cap \bigcup_{j<i} R_j\Big), \qquad (4.3)$$

where we have used the fact that a set difference is unchanged if the portion being removed is intersected with the set it is being removed from, and the fact that $P(F - G) = P(F) - P(G)$ if $G \subset F$. Combining (4.2) and (4.3) we have that

$$P(Q_i \Delta Q'_i) \le P(Q_i \cup R_i) - P(Q_i \cap R_i) + P\Big(Q_i \cap R_i \cap \bigcup_{j<i} R_j\Big)$$
$$= P(Q_i \Delta R_i) + P\Big(Q_i \cap R_i \cap \bigcup_{j<i} R_j\Big) \le \gamma + \sum_{j<i} P(Q_i \cap R_j).$$

For $j \ne i$, however, we have that

$$P(Q_i \cap R_j) = P(Q_i \cap R_j \cap Q_j^c) \le P(R_j \cap Q_j^c) \le P(R_j \Delta Q_j) \le \gamma,$$

which with the previous equation implies that

$$P(Q_i \Delta Q'_i) \le K\gamma, \quad i = 1, 2, \ldots, K - 1.$$

For the remaining atom we have

 

$$P(Q_K \Delta Q'_K) = P\big((Q_K \cap Q_K'^c) \cup (Q_K^c \cap Q'_K)\big). \qquad (4.4)$$

 

 

 

 

 

 

 

 

 

 

$$Q_K \cap Q_K'^c = Q_K \cap \Big(\bigcup_{j<K} Q'_j\Big) = Q_K \cap \Big(\bigcup_{j<K} (Q'_j \cap Q_j^c)\Big),$$

where the last equality follows since points in $Q'_j$ that are also in $Q_j$ cannot contribute to the intersection with $Q_K$ since the $Q_j$ are disjoint. Since $Q'_j \cap Q_j^c \subset Q'_j \Delta Q_j$ we have

$$Q_K \cap Q_K'^c \subset Q_K \cap \Big(\bigcup_{j<K} (Q'_j \Delta Q_j)\Big) \subset \bigcup_{j<K} (Q'_j \Delta Q_j).$$

A similar argument shows that

$$Q_K^c \cap Q'_K \subset \bigcup_{j<K} (Q'_j \Delta Q_j)$$

and hence with (4.4)

$$P(Q_K \Delta Q'_K) \le P\Big(\bigcup_{j<K} (Q_j \Delta Q'_j)\Big) \le \sum_{j<K} P(Q_j \Delta Q'_j) \le K^2\gamma.$$


To summarize, we have shown that

$$P(Q_i \Delta Q'_i) \le K^2\gamma, \quad i = 1, 2, \ldots, K.$$

If we now choose $\gamma$ so small that $K^2\gamma \le \epsilon/K$, the lemma is proved. $\Box$

Corollary 4.2.2: Let $(\Omega, \mathcal{B}, P)$ be a probability space and $\mathcal{F}$ a generating field. Let $f: \Omega \to A$ be a finite alphabet measurement. Given $\epsilon > 0$ there is a measurement $g: \Omega \to A$ that is measurable with respect to $\mathcal{F}$ (that is, $g^{-1}(a) \in \mathcal{F}$ for all $a \in A$) for which $P(f \ne g) \le \epsilon$.

Proof: Follows from the previous lemma by setting $\mathcal{Q} = \{f^{-1}(a);\ a \in A\}$, choosing $\mathcal{Q}'$ from the lemma, and then assigning $g$ for atom $Q'_i$ in $\mathcal{Q}'$ the same value that $f$ takes on in atom $Q_i$ in $\mathcal{Q}$. Then

$$P(f \ne g) = \frac{1}{2}\sum_i P(Q_i \Delta Q'_i) \le \epsilon. \qquad \Box$$
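As a concrete illustration of the corollary (a minimal numerical sketch, not an example from the text), take $\Omega = [0,1)$ with Lebesgue measure and the generating field of finite unions of dyadic intervals; a two-valued measurement $f$ is approximated arbitrarily well by a code $g$ that is constant on dyadic intervals of depth $N$. The measurement $f$ below and the midpoint assignment rule are illustrative choices.

```python
# A minimal sketch (not from the text) of Corollary 4.2.2 on a concrete space:
# Omega = [0,1) with Lebesgue measure P, whose Borel field is generated by the
# field of finite unions of dyadic intervals.
import numpy as np

def f(x):
    # a finite-alphabet measurement with alphabet {0, 1}
    return ((x >= 1/3) & (x < 2/3)).astype(int)

def g_factory(N):
    # g is constant on each dyadic interval [k 2^-N, (k+1) 2^-N), hence
    # measurable with respect to the generating field; each interval is
    # assigned the value f takes at its midpoint (one simple rule).
    def g(x):
        k = np.floor(x * 2**N)
        midpoints = (k + 0.5) / 2**N
        return f(midpoints)
    return g

# Estimate P(f != g) by a fine Riemann sum over [0,1).
x = (np.arange(2**16) + 0.5) / 2**16
for N in (2, 4, 6, 8):
    g = g_factory(N)
    print(N, np.mean(f(x) != g(x)))   # disagreement probability shrinks with N
```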

We now develop applications of the previous results which relate the idea of the entropy of a dynamical system with the entropy rate of a random process. The result is not required for later coding theorems, but it provides insight into the connections between entropy as considered in ergodic theory and entropy as used in information theory. In addition, the development involves some ideas of coding and approximation which are useful in proving the ergodic theorems of information theory used to prove coding theorems.

Let $\{X_n\}$ be a random process with alphabet $A_X$. Let $A_X^\infty$ denote the one- or two-sided sequence space. Consider the dynamical system $(\Omega, \mathcal{B}, P, T)$ defined by $(A_X^\infty, \mathcal{B}(A_X)^\infty, P, T)$, where $P$ is the process distribution and $T$ the shift. Recall from Section 2.2 that a stationary coding or infinite length sliding block coding of $\{X_n\}$ is a measurable mapping $f: A_X^\infty \to A_f$ into a finite alphabet which produces an encoded process $\{f_n\}$ defined by

$$f_n(x) = f(T^n x), \quad x \in A_X^\infty.$$

The entropy $H(P, T)$ of the dynamical system was defined by

$$H(P, T) = \sup_f \bar H_P(f),$$

the supremum of the entropy rates of finite alphabet stationary codings of the original process. We shall soon show that if the original alphabet is finite, then the entropy of the dynamical system is exactly the entropy rate of the process. First, however, we require several preliminary results, some of independent interest.

Lemma 4.2.3: If $f$ is a stationary coding of an AMS process, then the process $\{f_n\}$ is also AMS. If the input process is ergodic, then so is $\{f_n\}$.

Proof: Suppose that the input process has alphabet $A_X$ and distribution $P$ and that the measurement $f$ has alphabet $A_f$. Define the sequence mapping

$\bar f: A_X^\infty \to A_f^\infty$ by $\bar f(x) = \{f_n(x);\ n \in \mathcal{T}\}$, where $f_n(x) = f(T^n x)$ and $T$ is


the shift on the input sequence space $A_X^\infty$. If $T$ also denotes the shift on the output space, then by construction $\bar f(Tx) = T\bar f(x)$ and hence for any output event $F$, $\bar f^{-1}(T^{-1}F) = T^{-1}\bar f^{-1}(F)$. Let $m$ denote the process distribution for the encoded process. Since $m(F) = P(\bar f^{-1}(F))$ for any event $F \in \mathcal{B}(A_f)^\infty$, we have using the stationarity of the mapping $\bar f$ that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} m(T^{-i}F) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} P(\bar f^{-1}(T^{-i}F)) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} P(T^{-i}\bar f^{-1}(F)) = \bar P(\bar f^{-1}(F)),$$

where $\bar P$ is the stationary mean of $P$. Thus $m$ is AMS. If $G$ is an invariant output event, then $\bar f^{-1}(G)$ is also invariant since $T^{-1}\bar f^{-1}(G) = \bar f^{-1}(T^{-1}G)$. Hence if input invariant sets can only have probability 1 or 0, the same is true for output invariant sets. $\Box$

The lemma and Theorem 3.1.1 immediately yield the following:

Corollary 4.2.3: If $f$ is a stationary coding of an AMS process, then
$$\bar H(f) = \lim_{n\to\infty} \frac{1}{n} H(f^n),$$

that is, the limit exists.

For later use the next result considers general standard alphabets. A stationary code $f$ is a scalar quantizer if there is a map $q: A_X \to A_f$ such that $f(x) = q(x_0)$. Intuitively, $f$ depends on the input sequence only through the current symbol. Mathematically, $f$ is measurable with respect to $\sigma(X_0)$. Such codes are effectively the simplest possible and have no memory or dependence on the future.
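For instance (a minimal sketch, not from the text), a uniform scalar quantizer on a real-valued source can be written as a stationary code that looks only at the current sample; the bin edges below are an arbitrary illustrative choice.

```python
# A scalar quantizer viewed as a stationary (sliding-block) code with no
# memory: the output at time n depends only on the input sample x[n].
import numpy as np

edges = np.array([-1.0, 0.0, 1.0])      # illustrative thresholds
def q(x0):
    return int(np.digitize(x0, edges))  # finite alphabet {0, 1, 2, 3}

def f(x, n):
    # stationary code induced by q: f_n(x) = q(x_n)
    return q(x[n])

x = np.random.default_rng(0).normal(size=10)
print([f(x, n) for n in range(len(x))])
```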

Lemma 4.2.4: Let $\{X_n\}$ be an AMS process with standard alphabet $A_X$ and distribution $m$. Let $f$ be a stationary coding of the process with finite alphabet $A_f$. Fix $\epsilon > 0$. If the process is two-sided, then there is a scalar quantizer $q: A_X \to A_q$, an integer $N$, and a mapping $g: A_q^{2N+1} \to A_f$ such that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \Pr\big(f_i \ne g(q(X_{i-N}), q(X_{i-N+1}), \ldots, q(X_{i+N}))\big) \le \epsilon.$$

If the process is one-sided, then there is a scalar quantizer $q: A_X \to A_q$, an integer $N$, and a mapping $g: A_q^N \to A_f$ such that

$$\lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \Pr\big(f_i \ne g(q(X_i), q(X_{i+1}), \ldots, q(X_{i+N-1}))\big) \le \epsilon.$$

Comment: The lemma states that any stationary coding of an AMS process can

be approximated by a code that depends only on a finite number of quantized


inputs, that is, by a coding of a finite window of a scalar quantized version of the original process. In the special case of a finite alphabet input process, the lemma states that an arbitrary stationary coding can be well approximated by a coding depending only on a finite number of the input symbols.
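The structure promised by the lemma can be sketched in code (a hypothetical illustration, not the text's construction): the approximating code applies a scalar quantizer to each sample and then a table lookup on a finite window of the quantized values. The particular quantizer, window length, and table below are placeholders.

```python
# Sketch of the approximating code of Lemma 4.2.4 (two-sided case):
# f_i is approximated by g(q(x[i-N]), ..., q(x[i+N])), a function of a finite
# window of scalar-quantized samples.  q, N, and the map g here are arbitrary
# stand-ins, not the ones the proof constructs.
import numpy as np

N = 1
edges = np.array([0.0])                 # binary quantizer: sign of the sample

def q(x0):
    return int(np.digitize(x0, edges))  # alphabet {0, 1}

def g(window):
    # finite-window map A_q^{2N+1} -> A_f; here: majority vote over the window
    return int(sum(window) > len(window) // 2)

def approx_code(x, i):
    window = [q(x[j]) for j in range(i - N, i + N + 1)]
    return g(window)

x = np.random.default_rng(1).normal(size=12)
print([approx_code(x, i) for i in range(N, len(x) - N)])
```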

Proof: Suppose that $\bar m$ is the stationary mean and hence for any measurements $f$ and $g$

$$\bar m(f_0 \ne g_0) = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \Pr(f_i \ne g_i).$$

Let $q_n$ be an asymptotically accurate scalar quantizer in the sense that $\sigma(q_n(X_0))$ asymptotically generates $\mathcal{B}(A_X)$. (Since $A_X$ is standard this exists. If $A_X$ is finite, then take $q(a) = a$.) Then $\mathcal{F}_n = \sigma(q_n(X_i);\ i = 0, 1, 2, \ldots, n-1)$ asymptotically generates $\mathcal{B}(A_X)^\infty$ for one-sided processes and $\mathcal{F}_n = \sigma(q_n(X_i);\ i = -n, \ldots, n)$ does the same for two-sided processes. Hence from Corollary 4.2.2, given $\epsilon$ we can find a sufficiently large $n$ and a mapping $g$ that is measurable with respect to $\mathcal{F}_n$ such that $\bar m(f \ne g) \le \epsilon$. Since $g$ is measurable with respect to $\mathcal{F}_n$, it must depend on only the finite number of quantized samples that generate $\mathcal{F}_n$. (See, e.g., Lemma 5.2.1 of [50].) This proves the lemma. $\Box$

Combining the lemma and Corollary 4.2.1 immediately yields the following corollary, which permits us to study the entropy rate of general stationary codes by considering codes which depend on only a finite number of inputs (and hence for which the ordinary entropy results for random vectors can be applied).

Corollary 4.2.4: Given a stationary coding $f$ of an AMS process let $\mathcal{F}_n$ be defined as above. Then given $\epsilon > 0$ there exists for sufficiently large $n$ a code $g$ measurable with respect to $\mathcal{F}_n$ such that

$$|\bar H(f) - \bar H(g)| \le \epsilon.$$

The above corollary can be used to show that entropy rate, like entropy, is reduced by coding. The general stationary code is approximated by a code depending on only a finite number of inputs and then the result that entropy is reduced by mapping (Lemma 2.3.3) is applied.

Corollary 4.2.5: Given an AMS process $\{X_n\}$ with finite alphabet $A_X$ and a stationary coding $f$ of the process, then

$$\bar H(X) \ge \bar H(f),$$

that is, stationary coding reduces entropy rate.

Proof: For integer $n$ define $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$ in the one-sided case and $\sigma(X_{-n}, \ldots, X_n)$ in the two-sided case. Then $\mathcal{F}_n$ asymptotically generates $\mathcal{B}(A_X)^\infty$. Hence given a code $f$ and an $\epsilon > 0$ we can choose, using the finite alphabet special case of the previous lemma, a large $k$ and an $\mathcal{F}_k$-measurable code

$g$ such that $|\bar H(f) - \bar H(g)| \le \epsilon$. We shall show that $\bar H(g) \le \bar H(X)$, which will prove the lemma. To see this in the one-sided case observe that $g$ is a function of $X^k$ and hence $g^n$ depends only on $X^{n+k}$ and hence

$$H(g^n) \le H(X^{n+k})$$


and hence

 

 

 

 

 

 

 

 

 

 

$$\bar H(g) = \lim_{n\to\infty} \frac{1}{n} H(g^n) \le \lim_{n\to\infty} \frac{n+k}{n}\cdot\frac{1}{n+k}\, H(X^{n+k}) = \bar H(X).$$

 

In the two-sided case $g$ depends on $\{X_{-k}, \ldots, X_k\}$ and hence $g^n$ depends on $\{X_{-k}, \ldots, X_{n+k}\}$ and hence

$$H(g^n) \le H(X_{-k}, \ldots, X_{-1}, X_0, \ldots, X_{n+k}) \le H(X_{-k}, \ldots, X_{-1}) + H(X_0, \ldots, X_{n+k}).$$

Dividing by $n$ and taking the limit completes the proof as before. $\Box$

Theorem 4.2.1: Let $\{X_n\}$ be a random process with alphabet $A_X$. Let $A_X^\infty$ denote the one- or two-sided sequence space. Consider the dynamical system $(\Omega, \mathcal{B}, P, T)$ defined by $(A_X^\infty, \mathcal{B}(A_X)^\infty, P, T)$, where $P$ is the process distribution and $T$ is the shift. Then

$$H(P, T) = \bar H(X).$$

Proof: From (2.2.4), $H(P, T) \ge \bar H(X)$. Conversely suppose that $f$ is a code which yields $\bar H(f) \ge H(P, T) - \epsilon$. Since $f$ is a stationary coding of the process $\{X_n\}$, the previous corollary implies that $\bar H(f) \le \bar H(X)$, which completes the proof. $\Box$
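As a concrete instance of the theorem (a minimal sketch, not an example from the text), the entropy rate of a stationary binary Markov source, and hence by the theorem the entropy of the corresponding dynamical system, can be computed in closed form; the transition matrix below is an arbitrary choice and entropies are in nats.

```python
# Entropy rate (in nats) of a stationary binary Markov chain:
# H-bar(X) = lim (1/n) H(X^n) = sum_i pi_i * sum_j -P_ij ln P_ij.
import numpy as np

P = np.array([[0.9, 0.1],
              [0.4, 0.6]])                    # illustrative transition matrix
pi = np.array([0.8, 0.2])                     # its stationary distribution
assert np.allclose(pi @ P, pi)

def H(p):                                     # entropy of a pmf, in nats
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

rate = sum(pi[i] * H(P[i]) for i in range(2)) # conditional entropy H(X_1|X_0)

# For a stationary Markov chain (1/n) H(X^n) = (H(X_0) + (n-1)*rate)/n,
# which decreases to the entropy rate as n grows.
for n in (1, 2, 10, 100):
    print(n, (H(pi) + (n - 1) * rate) / n)
print("entropy rate:", rate)
```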

4.3 Information Rate of Finite Alphabet Processes

Let $\{(X_n, Y_n)\}$ be a one-sided random process with finite alphabet $A \times B$ and let $((A\times B)^{Z_+}, \mathcal{B}(A\times B)^{Z_+})$ be the corresponding one-sided sequence space of outputs of the pair process. We consider $X_n$ and $Y_n$ to be the sampling functions on the sequence spaces $A^\infty$ and $B^\infty$ and $(X_n, Y_n)$ to be the pair sampling function on the product space, that is, for $(x, y) \in A^\infty \times B^\infty$, $(X_n, Y_n)(x, y) = (X_n(x), Y_n(y)) = (x_n, y_n)$. Let $p$ denote the process distribution induced by the original space on the process $\{(X_n, Y_n)\}$. Analogous to entropy rate we can define the mutual information rate (or simply information rate) of a finite alphabet pair process by

$$\bar I(X; Y) = \limsup_{n\to\infty} \frac{1}{n}\, I(X^n; Y^n).$$

The following lemma follows immediately from the properties of entropy rates of Theorems 2.4.1 and 3.1.1 since for AMS finite alphabet processes

$$\bar I(X; Y) = \bar H(X) + \bar H(Y) - \bar H(X, Y)$$

and since from (3.1.4) the entropy rate of an AMS process is the same as that of its stationary mean. Analogous to Theorem 3.1.1 we define the random variables $p(X^n, Y^n)$ by $p(X^n, Y^n)(x, y) = p(X^n = x^n, Y^n = y^n)$, $p(X^n)$ by $p(X^n)(x, y) = p(X^n = x^n)$, and similarly for $p(Y^n)$.
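To make the decomposition concrete (a minimal sketch under the assumption of an i.i.d. pair process, not an example from the text): for an i.i.d. pair source the information rate reduces to the single-letter mutual information $I(X_0; Y_0) = H(X_0) + H(Y_0) - H(X_0, Y_0)$, which can be computed directly from a joint pmf.

```python
# Mutual information (in nats) of a single pair from its joint pmf,
# I = H(X) + H(Y) - H(X,Y); for an i.i.d. pair process this single-letter
# quantity equals the information rate.  The joint pmf is an arbitrary choice.
import numpy as np

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])            # joint pmf of (X_0, Y_0)

def H(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

px, py = pxy.sum(axis=1), pxy.sum(axis=0)
I = H(px) + H(py) - H(pxy.flatten())
print("I(X;Y) =", I)                    # about 0.193 nats for this pmf
```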


Lemma 4.3.1: Suppose that $\{X_n, Y_n\}$ is an AMS finite alphabet random process with distribution $p$ and stationary mean $\bar p$. Then the limits supremum defining information rates are limits and

$$\bar I_p(X; Y) = \bar I_{\bar p}(X; Y).$$

$\bar I_p$ is an affine function of the distribution $p$. If $\bar p$ has ergodic decomposition $p_{xy}$, then

$$\bar I_p(X; Y) = \int d\bar p(x, y)\, \bar I_{p_{xy}}(X; Y).$$

If we define the information density

 

 

 

 

 

$$i_n(X^n; Y^n) = \ln \frac{p(X^n, Y^n)}{p(X^n)\,p(Y^n)},$$

then

 

 

 

 

 

 

 

$$\lim_{n\to\infty} \frac{1}{n}\, i_n(X^n; Y^n) = \bar I_{p_{xy}}(X; Y)$$

almost everywhere with respect to $\bar p$ and $p$ and in $L_1(p)$.
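A minimal simulation sketch (assuming the i.i.d. special case, which is stationary and ergodic, and reusing an arbitrary joint pmf) illustrates this convergence: the normalized information density of a sample pair sequence settles near the mutual information.

```python
# Monte Carlo sketch: for an i.i.d. pair process, (1/n) i_n(X^n;Y^n) is the
# sample mean of ln p(x,y)/(p(x)p(y)) and converges to I(X;Y) (in nats).
import numpy as np

rng = np.random.default_rng(0)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])                      # arbitrary joint pmf
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
I = float(np.sum(pxy * np.log(pxy / np.outer(px, py))))

n = 100000
pairs = rng.choice(4, size=n, p=pxy.flatten())    # encode (x,y) as 0..3
x, y = pairs // 2, pairs % 2
density = np.log(pxy[x, y] / (px[x] * py[y]))     # per-symbol information density
print("sample average:", density.mean(), " I(X;Y):", I)
```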

The following lemmas follow either directly from or similarly to the corresponding results for entropy rate of the previous section.

Lemma 4.3.2: Suppose that $\{X_n, Y_n, X'_n, Y'_n\}$ is an AMS process and

$$P = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} \Pr\big((X_i, Y_i) \ne (X'_i, Y'_i)\big)$$

(the limit exists since the process is AMS). Then

 

 

 

$$|\bar I(X; Y) - \bar I(X'; Y')| \le 3\big(P \ln(\|A\| - 1) + h_2(P)\big).$$

Proof: The inequality follows from Corollary 4.2.1 since

 

 

$$|\bar I(X; Y) - \bar I(X'; Y')| \le |\bar H(X) - \bar H(X')| + |\bar H(Y) - \bar H(Y')| + |\bar H(X, Y) - \bar H(X', Y')|$$

and since $\Pr((X_i, Y_i) \ne (X'_i, Y'_i)) = \Pr(X_i \ne X'_i \text{ or } Y_i \ne Y'_i)$ is no smaller than $\Pr(X_i \ne X'_i)$ or $\Pr(Y_i \ne Y'_i)$. $\Box$
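For a feel for how tight the bound is (a minimal sketch with arbitrary numbers; the binary entropy function $h_2$ is taken in nats as elsewhere in the text), one can evaluate the right-hand side directly; it vanishes as the limiting disagreement probability goes to zero, which is the point of the lemma.

```python
# Evaluate the continuity bound 3*(P*ln(||A||-1) + h2(P)) for a few values
# of the limiting disagreement probability P and an illustrative alphabet size.
import numpy as np

def h2(p):
    # binary entropy in nats
    if p in (0.0, 1.0):
        return 0.0
    return float(-p * np.log(p) - (1 - p) * np.log(1 - p))

A = 4                                        # illustrative alphabet size
for P in (0.001, 0.01, 0.1):
    print(P, 3 * (P * np.log(A - 1) + h2(P)))
```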

 

 

 

 

Corollary 4.3.1: Let $\{X_n, Y_n\}$ be an AMS process and let $f$ and $g$ be stationary measurements on $X$ and $Y$, respectively. Given $\epsilon > 0$ there is an $N$ sufficiently large, scalar quantizers $q$ and $r$, and mappings $f'$ and $g'$ which depend only on $\{q(X_0), \ldots, q(X_{N-1})\}$ and $\{r(Y_0), \ldots, r(Y_{N-1})\}$ in the one-sided case and $\{q(X_{-N}), \ldots, q(X_N)\}$ and $\{r(Y_{-N}), \ldots, r(Y_N)\}$ in the two-sided case such that

$$|\bar I(f; g) - \bar I(f'; g')| \le \epsilon.$$

Proof: Choose the codes $f'$ and $g'$ from Lemma 4.2.4 and apply the previous lemma. $\Box$


Lemma 4.3.3: If $\{X_n, Y_n\}$ is an AMS process and $f$ and $g$ are stationary codings of $X$ and $Y$, respectively, then

$$\bar I(X; Y) \ge \bar I(f; g).$$

Proof: This is proved as Corollary 4.2.5 by first approximating $f$ and $g$ by finite-window stationary codes, applying the result for mutual information (Lemma 2.5.2), and then taking the limit. $\Box$


Chapter 5

Relative Entropy

5.1 Introduction

A variety of information measures have been introduced for finite alphabet random variables, vectors, and processes: entropy, mutual information, relative entropy, conditional entropy, and conditional mutual information. All of these can be expressed in terms of divergence and hence the generalization of these definitions to infinite alphabets will follow from a general definition of divergence. Many of the properties of generalized information measures will then follow from those of generalized divergence.

In this chapter we extend the definition and develop the basic properties of divergence, including the formulas for evaluating divergence as expectations of information densities and as limits of divergences of finite codings. We also develop several inequalities for and asymptotic properties of divergence. These results provide the groundwork needed for generalizing the ergodic theorems of information theory from finite to standard alphabets. The general definitions of entropy and information measures originated in the pioneering work of Kolmogorov and his colleagues Gelfand, Yaglom, Dobrushin, and Pinsker [45] [90] [32] [125].

5.2 Divergence

Given a probability space $(\Omega, \mathcal{B}, P)$ (not necessarily with finite alphabet) and another probability measure $M$ on the same space, define the divergence of $P$ with respect to $M$ by

$$D(P\|M) = \sup_{\mathcal{Q}} H_{P\|M}(\mathcal{Q}) = \sup_f D(P_f\|M_f), \qquad (5.1)$$

 

where the first supremum is over all finite measurable partitions $\mathcal{Q}$ of $\Omega$ and the second is over all finite alphabet measurements on $\Omega$. The two forms have the same interpretation: the divergence is the supremum of the relative entropies


or divergences obtainable by finite alphabet codings of the sample space. The partition form is perhaps more common when considering divergence per se, but the measurement or code form is usually more intuitive when considering entropy and information. This section is devoted to developing the basic properties of divergence, all of which will yield immediate corollaries for the measures of information.
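The code-and-refine character of definition (5.1) can be illustrated numerically (a minimal sketch, not from the text): for two distributions on the real line, the divergence of their images under ever finer scalar quantizers is nondecreasing under refinement and approaches the full divergence. The two Gaussian measures and the quantizer grid below are arbitrary choices.

```python
# Divergence D(P_f || M_f) of quantized (finite alphabet) versions of two
# Gaussian measures, computed for increasingly fine scalar quantizers.  The
# values grow toward the closed-form divergence between the Gaussians.
import numpy as np
from scipy.stats import norm

P, M = norm(0.0, 1.0), norm(0.5, 1.5)            # arbitrary choices

def quantized_divergence(bins):
    # partition R into cells bounded by P-quantiles (a refining sequence)
    edges = P.ppf(np.linspace(0.0, 1.0, bins + 1))
    p = np.diff(P.cdf(edges))
    m = np.diff(M.cdf(edges))
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / m[mask])))

for bins in (2, 4, 16, 256):
    print(bins, quantized_divergence(bins))

# closed form for two Gaussians, for comparison (in nats)
s1, s2, mu = 1.0, 1.5, 0.5
print("exact:", np.log(s2 / s1) + (s1**2 + mu**2) / (2 * s2**2) - 0.5)
```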

The first result is a generalization of the divergence inequality that is a trivial consequence of the definition and the finite alphabet special case.

Lemma 5.2.1: The Divergence Inequality:

For any two probability measures P and M

$$D(P\|M) \ge 0$$

with equality if and only if P = M .

Proof: Given any partition $\mathcal{Q}$, Theorem 2.3.1 implies that

$$\sum_{Q\in\mathcal{Q}} P(Q) \ln \frac{P(Q)}{M(Q)} \ge 0$$

with equality if and only if $P(Q) = M(Q)$ for all atoms $Q$ of the partition. Since $D(P\|M)$ is the supremum over all such partitions, it is also nonnegative. It can be 0 only if $P$ and $M$ assign the same probabilities to all atoms in all partitions (the supremum is 0 only if the above sum is 0 for all partitions) and hence the divergence is 0 only if the measures are identical. $\Box$

As in the finite alphabet case, Lemma 5.2.1 justifies interpreting divergence as a form of distance or dissimilarity between two probability measures. It is not a true distance or metric in the mathematical sense since it is not symmetric and it does not satisfy the triangle inequality. Since it is nonnegative and equals zero only if two measures are identical, the divergence is a distortion measure as considered in information theory [51], which is a generalization of the notion of distance. This view often provides interpretations of the basic properties of divergence. We shall develop several relations between the divergence and other distance measures. The reader is referred to Csiszár [25] for a development of the distance-like properties of divergence.

The following two lemmas provide means for computing divergences and studying their behavior. The first result shows that the supremum can be confined to partitions with atoms in a generating field. This will provide a means for computing divergences by approximation or limits. The result is due to Dobrushin and is referred to as Dobrushin's theorem. The second result shows that the divergence can be evaluated as the expectation of an entropy density defined as the logarithm of the Radon-Nikodym derivative of one measure relative to the other. This result is due to Gelfand, Yaglom, and Perez. The proofs largely follow the translator's remarks in Chapter 2 of Pinsker [125] (which in turn follows Dobrushin [32]).

Lemma 5.2.2: Suppose that $(\Omega, \mathcal{B})$ is a measurable space where $\mathcal{B}$ is generated by a field $\mathcal{F}$, $\mathcal{B} = \sigma(\mathcal{F})$. Then if $P$ and $M$ are two probability measures