An Introduction to Statistical Signal Processing
4.15. WEAK LAW OF LARGE NUMBERS

A Mean Ergodic Theorem

Theorem 4.11 Let $\{X_n\}$ be a discrete time uncorrelated random process such that $EX_n = \overline{X}$ is finite and $\sigma_{X_n}^2 = \sigma_X^2 < \infty$ for all $n$; that is, the mean and variance are the same for all sample times. Then

\[
\operatorname{l.i.m.}_{n\to\infty}\, \frac{1}{n}\sum_{i=0}^{n-1} X_i = \overline{X}\,,
\]
that is, $\frac{1}{n}\sum_{i=0}^{n-1} X_i \to \overline{X}$ in mean square.

Proof. The proof follows directly from the last section with $S_n = \frac{1}{n}\sum_{i=0}^{n-1} X_i$, $ES_n = EX_i = \overline{X}$. To summarize from (4.106),
\[
\lim_{n\to\infty} E[(S_n - \overline{X})^2]
= \lim_{n\to\infty} E[(S_n - ES_n)^2]
= \lim_{n\to\infty} \sigma_{S_n}^2
= \lim_{n\to\infty} \frac{\sigma_X^2}{n} = 0\,.
\]

This theorem is called a mean ergodic theorem because it is a special case of the more general mean ergodic theorem — it is a special case since it holds only for uncorrelated random processes. We shall later consider more general results along this line, but this simple result and the one to follow provide the basic ideas.

Combining lemma 4.3 with the mean ergodic theorem 4.11 yields the following famous result, one of the original limit theorems of probability theory:

Theorem 4.12 The Weak Law of Large Numbers.

Let $\{X_n\}$ be a discrete time process with finite mean $EX_n = \overline{X}$ and variance $\sigma_{X_n}^2 = \sigma_X^2 < \infty$ for all $n$. If the process is uncorrelated, then the sample average $\frac{1}{n}\sum_{i=0}^{n-1} X_i$ converges to $\overline{X}$ in probability.
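Theorem 4.12 is easy to probe numerically. The sketch below is not from the text: it builds a hypothetical uncorrelated process whose marginal distribution alternates between uniform and Gaussian (an arbitrary illustrative choice), with mean 0 and variance 1 held constant as the theorem requires, and estimates $\Pr(|S_n - \overline{X}| > \epsilon)$ for growing $n$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_process(n, trials):
    """Uncorrelated process with constant mean 0 and variance 1 whose
    marginal distribution alternates between uniform and Gaussian."""
    x = np.empty((trials, n))
    half = np.sqrt(3.0)  # uniform on [-sqrt(3), sqrt(3)] has variance 1
    x[:, 0::2] = rng.uniform(-half, half, size=(trials, (n + 1) // 2))
    x[:, 1::2] = rng.normal(0.0, 1.0, size=(trials, n // 2))
    return x

eps, fracs = 0.1, {}
for n in (10, 100, 1000):
    s_n = sample_process(n, trials=5000).mean(axis=1)
    fracs[n] = np.mean(np.abs(s_n) > eps)  # estimate of Pr(|S_n| > eps)
    print(n, fracs[n])
```

The estimated probability falls toward 0 as $n$ grows even though the distributions vary with time, exactly because only the constant mean, constant variance, and uncorrelatedness are used.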

An alternative means of describing a law of large numbers is to define the limiting time-average or sample average of a sequence of random variables $\{X_n\}$ by
\[
\langle X_n \rangle = \lim_{n\to\infty} \frac{1}{n}\sum_{i=0}^{n-1} X_i\,, \tag{4.109}
\]

CHAPTER 4. EXPECTATION AND AVERAGES

if the limit exists in any of the manners considered, e.g., in mean square, in probability, or with probability 1. Note that ordinarily the limiting time average must be considered as a random variable since it is a function of random variables. Laws of large numbers then provide conditions under which
\[
\langle X_n \rangle = E(X_k)\,, \tag{4.110}
\]
which requires that $\langle X_n \rangle$ not be a random variable, i.e., that it be a constant and not vary with the underlying sample point $\omega$, and that $E(X_k)$ not depend on time, i.e., that it be a constant and not vary with time $k$.

The best-known (and earliest) application of the weak law of large numbers is to iid processes such as the Bernoulli process. Note that the iid specification is not needed, however. All that is used for the weak law of large numbers is constant means, constant variances, and uncorrelation. The actual distributions could be time varying and dependent within these constraints. The weak law is called weak because convergence in probability is one of the weaker forms of convergence. Convergence of individual realizations of the random process is not assured. This could be very annoying because in many practical engineering applications, we have only one realization to work with (i.e., only one ω), and we need to calculate averages that converge as determined by actual calculations, e.g., with a computer.

The strong law of large numbers considers convergence with probability one. Such strong theorems are much harder to prove, but fortunately are satisfied in most engineering situations.

The astute reader may have noticed the remarkable difference in behavior caused by the apparently slight change of division by $\sqrt{n}$ instead of $n$ when normalizing sums of iid random variables. In particular, if $\{X_n\}$ is a zero mean process with unit variance, then the weighted sum $n^{-1/2}\sum_{k=0}^{n-1} X_k$ converges to a Gaussian random variable in some sense because of the central limit theorem, while the weighted sum $n^{-1}\sum_{k=0}^{n-1} X_k$ converges to a constant, the mean 0 of the individual random variables!
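The contrast between the two normalizations shows up in a small simulation (an illustration, not part of the text): with iid $N(0,1)$ samples, the spread of $n^{-1/2}\sum_k X_k$ stays near 1 while the spread of $n^{-1}\sum_k X_k$ shrinks like $1/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(1)

n, trials = 1000, 4000
x = rng.normal(0.0, 1.0, size=(trials, n))  # zero mean, unit variance iid

clt_scaled = x.sum(axis=1) / np.sqrt(n)  # n^{-1/2} normalization
lln_scaled = x.sum(axis=1) / n           # n^{-1} normalization

print(clt_scaled.std())  # stays near 1: a nondegenerate Gaussian limit
print(lln_scaled.std())  # about 1/sqrt(n): collapsing to the mean 0
```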

 

 

 

4.16 Strong Law of Large Numbers

The strong law of large numbers replaces the convergence in probability of the weak law with convergence with probability one. It will shortly be shown that convergence with probability one implies convergence in probability, so the "strong" law is indeed stronger than the "weak" law. Although the two terms sound the same, they are really quite different. Convergence with probability one applies to individual realizations of the random process, while convergence in probability does not. Convergence with probability one is closer to the usual definition of convergence of a sequence of numbers since it says that for each sample point $\omega$, the limiting sample average $\lim_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n-1} X_k(\omega)$ exists in the usual sense for all $\omega$ in a set of probability one. Although a more satisfying notion of convergence, it is notably harder to prove than the weaker result and hence we consider only the special case of iid sequences, where the added difficulty is moderate. In this section convergence with probability one is considered and a strong law of large numbers is proved. The key new tools are the Borel-Cantelli lemma, which provides a condition ensuring convergence with probability one, and the Chernoff inequality, an improvement on the Tchebychev inequality which is a simple result of the Markov inequality.

Lemma 4.4 If Yn converges to Y with probability one, then it also converges in probability.

Proof: Given an $\epsilon > 0$, define the sequence of sets
\[
F_n(\epsilon) = \{\omega : |Y_m(\omega) - Y(\omega)| > \epsilon \text{ for some } m \geq n\}\,.
\]
The $F_n(\epsilon)$ form a decreasing sequence of sets as $n$ grows, that is, $F_n \subset F_{n-1}$ for all $n$. Thus $\Pr(F_n)$ is nonincreasing in $n$ and hence it must converge to some limit. From the definition of convergence with probability one, this limit must be 0 since if $Y_n(\omega)$ converges to $Y(\omega)$, given $\epsilon$ there must be an $n$ such that for all $m \geq n$, $|Y_m(\omega) - Y(\omega)| < \epsilon$. Thus
\[
\lim_{n\to\infty} \Pr(|Y_n - Y| > \epsilon) \leq \lim_{n\to\infty} \Pr(F_n(\epsilon)) = 0\,,
\]
which establishes convergence in probability.

Convergence in probability does not imply convergence with probability one; i.e., they are not equivalent. This can be shown by counterexample (problem 32). There is, however, a test that can be applied to determine convergence with probability one. The result is one form of a result known as the first Borel-Cantelli lemma.

Lemma 4.5 $Y_n$ converges to $Y$ with probability one if for any $\epsilon > 0$
\[
\sum_{n=1}^{\infty} \Pr(|Y_n - Y| > \epsilon) < \infty\,. \tag{4.111}
\]

Proof: Consider two collections of bad sequences. Let $F(\epsilon)$ be the set of all $\omega$ such that the corresponding sequence $Y_n(\omega)$ does not satisfy the convergence criterion, i.e.,
\[
F(\epsilon) = \{\omega : |Y_n - Y| > \epsilon, \text{ for some } n \geq N, \text{ for any } N < \infty\}\,.
\]


$F(\epsilon)$ is the set of points for which the sequence does not converge. Consider also the simpler sets where things look bad at a particular time:
\[
F_n(\epsilon) = \{\omega : |Y_n - Y| > \epsilon\}\,.
\]

The complicated collection of points with nonconvergent sequences can be written as a subset of the union of all of the simpler sets:
\[
F(\epsilon) \subset \bigcup_{n \geq N} F_n(\epsilon) \equiv G_N(\epsilon)
\]
for any finite $N$. This in turn implies that
\[
\Pr(F(\epsilon)) \leq \Pr\Big(\bigcup_{n \geq N} F_n(\epsilon)\Big)\,.
\]
From the union bound this implies that
\[
\Pr(F(\epsilon)) \leq \sum_{n=N}^{\infty} \Pr(F_n(\epsilon))\,.
\]
By assumption
\[
\sum_{n=1}^{\infty} \Pr(F_n(\epsilon)) < \infty\,,
\]
which implies that
\[
\lim_{N\to\infty} \sum_{n=N}^{\infty} \Pr(F_n(\epsilon)) = 0
\]
and hence $\Pr(F(\epsilon)) = 0$, proving the result.
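A numerical illustration of the lemma (my addition, with an arbitrary choice of event probabilities): take independent events $A_n$ with $\Pr(A_n) = 1/n^2$, standing in for the events $\{|Y_n - Y| > \epsilon\}$. The probabilities are summable (the series sums to $\pi^2/6$), so the lemma predicts that with probability one only finitely many $A_n$ occur, and a typical simulated realization shows exactly that.

```python
import numpy as np

rng = np.random.default_rng(2)

# Independent events A_n with Pr(A_n) = 1/n^2, a summable sequence.
N = 100_000
n = np.arange(1, N + 1)
p = 1.0 / n**2
assert p.sum() < np.pi**2 / 6  # partial sum of the convergent series

# One realization of the event sequence.  Note p[0] = 1, so the first
# event always occurs; the point is that events stop occurring quickly.
exceed = rng.random(N) < p
last = n[exceed].max()
print(exceed.sum(), last)  # few events in total, all at small n
```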

Convergence with probability one does not imply — nor is it implied by — convergence in mean square. This can be shown by counterexamples (problem 32).

We now apply this result to sample averages to obtain a strong law of large numbers for an iid random process {Xn}. For simplicity we focus on a zero mean Gaussian iid process and prove that with probability one

\[
\lim_{n\to\infty} S_n = 0
\]
where
\[
S_n = \frac{1}{n}\sum_{k=0}^{n-1} X_k\,.
\]


Assuming zero mean does not lose any generality since if this result is true, the result for nonzero mean $m$ follows immediately by applying the zero mean result to the zero-mean process $\{X_n - m\}$.

The approach is to use the Borel-Cantelli lemma with $Y_n = S_n$ and $Y = 0 = E[X_n]$, and hence the immediate problem is to bound $\Pr(|S_n| > \epsilon)$ in a way such that the sum over $n$ will be finite. The Tchebychev inequality does not work here as it would give the sum
\[
\sum_{n=1}^{\infty} \sigma_X^2 \frac{1}{n}\,,
\]

which is not finite. A better upper bound than Tchebychev is needed, and this is provided by a different application of the Markov inequality. Given a random variable $Y$, fix a $\lambda > 0$ and observe that $Y > y$ if and only if $e^{\lambda Y} > e^{\lambda y}$. Application of the Markov inequality then yields
\[
\Pr(Y > y) = \Pr(e^{\lambda Y} > e^{\lambda y}) = \Pr(e^{\lambda(Y - y)} > 1) \leq E[e^{\lambda(Y - y)}]\,. \tag{4.112}
\]
This inequality is called the Chernoff inequality and it provides the needed bound.
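As a quick sanity check (not in the text), the sketch below compares the bounds for a standard Gaussian $Y$: optimizing the Chernoff bound $e^{-\lambda y} E[e^{\lambda Y}] = e^{\lambda^2/2 - \lambda y}$ over $\lambda$ gives $\lambda = y$ and the bound $e^{-y^2/2}$, which decays far faster than the Tchebychev bound $1/y^2$.

```python
import math

def gaussian_tail(y):
    """Exact Pr(Y > y) for Y ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(y / math.sqrt(2.0))

results = {}
for y in (2.0, 4.0, 6.0):
    chernoff = math.exp(-y * y / 2.0)  # min over lambda of exp(lambda^2/2 - lambda*y)
    tchebychev = 1.0 / (y * y)         # Pr(|Y| >= y) <= var(Y)/y^2
    results[y] = (gaussian_tail(y), chernoff, tchebychev)
    print(y, results[y])
```

For every $y$ tested the exact tail sits below the Chernoff bound, which in turn sits below the Tchebychev bound.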

Applying the Chernoff inequality yields for any $\lambda > 0$
\begin{align*}
\Pr(|S_n| > \epsilon) &= \Pr(S_n > \epsilon) + \Pr(S_n < -\epsilon) \\
&= \Pr(S_n > \epsilon) + \Pr(-S_n > \epsilon) \\
&\leq E[e^{\lambda(S_n - \epsilon)}] + E[e^{\lambda(-S_n - \epsilon)}] \\
&= e^{-\lambda\epsilon}\left(E[e^{\lambda S_n}] + E[e^{-\lambda S_n}]\right) \\
&= e^{-\lambda\epsilon}\left(M_{S_n}(\lambda) + M_{S_n}(-\lambda)\right)\,.
\end{align*}
These moment generating functions are easily found from lemma 4.1 to be
\[
E[e^{\gamma S_n}] = M_X^n\!\left(\frac{\gamma}{n}\right)\,, \tag{4.113}
\]
where $M_X(ju) = E[e^{juX}]$ is the common characteristic function of the iid $X_i$ and $M_X(w)$ is the corresponding moment generating function. Combining these steps yields the bound
\[
\Pr(|S_n| > \epsilon) \leq e^{-\lambda\epsilon}\left(M_X^n\!\left(\frac{\lambda}{n}\right) + M_X^n\!\left(-\frac{\lambda}{n}\right)\right)\,. \tag{4.114}
\]
So far $\lambda > 0$ is completely arbitrary and we can choose a different $\lambda$ for each $n$. Choosing $\lambda = n\epsilon/\sigma_X^2$ yields
\[
\Pr(|S_n| > \epsilon) \leq e^{-\frac{n\epsilon^2}{\sigma_X^2}}\left(M_X^n\!\left(\frac{\epsilon}{\sigma_X^2}\right) + M_X^n\!\left(-\frac{\epsilon}{\sigma_X^2}\right)\right)\,. \tag{4.115}
\]

Plugging in the form for the Gaussian moment generating function $M_X(w) = e^{w^2\sigma_X^2/2}$ yields
\[
\Pr(|S_n| > \epsilon) \leq 2\, e^{-\frac{n\epsilon^2}{\sigma_X^2}}\, e^{\frac{n}{2}\left(\frac{\epsilon}{\sigma_X^2}\right)^2 \sigma_X^2}
= 2\, e^{-\frac{n\epsilon^2}{2\sigma_X^2}}\,, \tag{4.116}
\]

which has the form $\Pr(|S_n| > \epsilon) \leq 2\beta^n$ for $\beta = e^{-\epsilon^2/2\sigma_X^2} < 1$. Hence summing a geometric progression yields
\[
\sum_{n=1}^{\infty} \Pr(|S_n| > \epsilon) \leq 2\sum_{n=1}^{\infty} \beta^n = \frac{2\beta}{1 - \beta} < \infty\,, \tag{4.117}
\]
which completes the proof for the iid Gaussian case.
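The chain of bounds can be checked numerically (an illustration with arbitrary $\sigma$ and $\epsilon$, not from the text): since $S_n \sim N(0, \sigma^2/n)$ for iid Gaussian samples, the exact probability $\Pr(|S_n| > \epsilon)$ is an erfc, and it should sit below $2\beta^n$.

```python
import math

sigma, eps = 1.0, 0.5
beta = math.exp(-eps * eps / (2.0 * sigma * sigma))
assert beta < 1.0

checks = []
for n in (1, 10, 100):
    # S_n ~ N(0, sigma^2/n), so the exact two-sided tail is an erfc.
    exact = math.erfc(eps * math.sqrt(n) / (sigma * math.sqrt(2.0)))
    bound = 2.0 * beta**n
    checks.append((n, exact, bound))
    print(n, exact, bound)

# The bounds are summable, as the geometric series (4.117) requires.
print(2.0 * beta / (1.0 - beta))
```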

The non-Gaussian case can be handled by combining the above approach with the approximation of (4.16). The bound for the Borel-Cantelli limit need only be demonstrated for small $\epsilon$ since if it is true for small $\epsilon$ it must also be true for large $\epsilon$. For small $\epsilon$, however, (4.16) implies that $M_X(\pm\epsilon/\sigma_X^2)$ in (4.115) can be written as $1 + \epsilon^2/2\sigma_X^2 + o(\epsilon^2/2\sigma_X^2)$, which is arbitrarily close to $e^{\epsilon^2/2\sigma_X^2}$ for sufficiently small $\epsilon$, and the proof is completed as above.

The following theorem summarizes the results of this section.

Theorem 4.13 Strong Law of Large Numbers

Given an iid process $\{X_n\}$ with finite mean $E[X]$ and variance, then
\[
\lim_{n\to\infty} \frac{1}{n}\sum_{k=0}^{n-1} X_k = E[X] \quad \text{with probability 1.} \tag{4.118}
\]
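A single-realization simulation (my illustration, using a uniform rather than Gaussian process) shows the behavior the strong law promises: the running sample average along one path, one $\omega$, settles onto $E[X]$.

```python
import numpy as np

rng = np.random.default_rng(3)

# One realization (one omega) of an iid uniform process with E[X] = 0.5.
x = rng.uniform(0.0, 1.0, size=100_000)
running_avg = np.cumsum(x) / np.arange(1, x.size + 1)

print(running_avg[99], running_avg[9_999], running_avg[-1])
```

The printed averages approach 0.5; this is the convergence of an individual realization that the weak law alone does not guarantee.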


4.17 Stationarity

Stationarity Properties

In the development of the weak law of large numbers we made two assumptions on a random process $\{X_t;\, t \in \mathcal{Z}\}$: that the mean $EX_t$ of the process did not depend on time and that the covariance function had the form $K_X(t, s) = \sigma_X^2 \delta_{t-s}$.

The assumption of a constant mean, independent of time, is an example of a stationarity property in the sense that it assumes that some property describing a random process does not vary with time (or is time-invariant). The process itself is not usually "stationary" in the usual literal sense of remaining still, but attributes of the process, such as the first moment in this case, can remain still in the sense of not changing with time. In the mean example we can also express this as
\[
EX_t = EX_{t+\tau}; \quad \text{all } t, \tau\,, \tag{4.119}
\]
which can be interpreted as saying that the mean of a random variable at time $t$ is not affected by a shift of any amount of time $\tau$. Conditions on moments can be thought of as weak stationarity properties since they constrain only an expectation and not the distribution itself. Instead of simply constraining a moment, we could make the stronger assumption of constraining the marginal distribution. The assumption of a constant mean would follow, for example, if the marginal distribution of the process, the distribution of a single random variable $X_t$, did not depend on the sample time $t$. Thus a sufficient (but not necessary) condition for ensuring that a random process has a constant mean is that its marginal distribution $P_{X_t}$ satisfies the condition
\[
P_{X_t} = P_{X_{t+\tau}}; \quad \text{all } t, \tau\,. \tag{4.120}
\]

This will be true, for example, if the same relation holds with the distribution replaced by cdf's, pdf's, or pmf's. If a process meets this condition, it is said to be first order stationary. For example, an iid process is clearly first order stationary. The word stationary refers to the fact that the first order distribution (in this case) does not change with time, i.e., it is not affected by shifting the sample time by an amount $\tau$.

Next consider the covariance used to prove the weak law of large numbers. It has a very special form in that it is the variance if the two sample times are the same, and zero otherwise. This class of constant mean, constant variance, and uncorrelated processes is admittedly a very special case. A more general class of processes which will share many important properties with this very special case is formed by requiring a mean and variance


that do not change with time, but easing the restriction on the covariance. We say that a random process is weakly stationary or stationary in the weak sense if $EX_t$ does not depend on $t$, $\sigma_{X_t}^2$ does not depend on $t$, and if the covariance $K_X(t, s)$ depends on $t$ and $s$ only through the difference $t - s$, that is, if
\[
K_X(t, s) = K_X(t + \tau, s + \tau) \tag{4.121}
\]
for all $t, s, \tau$ for which $s, s + \tau, t, t + \tau \in \mathcal{T}$. When this is true, it is often expressed by writing
\[
K_X(t, t + \tau) = K_X(\tau) \tag{4.122}
\]
for all $t, \tau$ such that $t, t + \tau \in \mathcal{T}$. A function of two variables of this type is said to be Toeplitz [26, 21] and much of the theory of weakly stationary processes follows from the theory of Toeplitz forms.

If we form a covariance matrix by sampling such a covariance function, then the matrix (called a Toeplitz matrix) will have the property that all elements on any fixed diagonal of the matrix will be equal. For example, the (3,5) element will be the same as the (7,9) element since $5 - 3 = 9 - 7$. Thus, for example, if the sample times are $0, 1, \dots, n - 1$, then the covariance matrix is $\{K_X(k, j) = K_X(j - k);\ k = 0, 1, \dots, n - 1,\ j = 0, 1, \dots, n - 1\}$ or
\[
\left[
\begin{array}{ccccc}
K_X(0) & K_X(1) & K_X(2) & \cdots & K_X(n-1) \\
K_X(-1) & K_X(0) & K_X(1) & & \\
K_X(-2) & K_X(-1) & K_X(0) & & \vdots \\
\vdots & & & \ddots & \\
K_X(-(n-1)) & \cdots & & & K_X(0)
\end{array}
\right]
\]
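A small sketch of how sampling a weakly stationary covariance produces a Toeplitz matrix, using an assumed example covariance function $K_X(\tau) = \rho^{|\tau|}$ chosen only for illustration:

```python
import numpy as np

rho, n = 0.5, 10

def K(tau):
    """Assumed example covariance function K_X(tau) = rho^|tau|."""
    return rho ** abs(tau)

# Sample the covariance function at times 0, 1, ..., n-1:
# entry (k, j) is K_X(j - k), so it depends only on the lag j - k.
cov = np.array([[K(j - k) for j in range(n)] for k in range(n)])

# Every diagonal is constant; e.g. the (3,5) and (7,9) entries agree
# because 5 - 3 = 9 - 7.
print(cov[3, 5], cov[7, 9])  # both equal K(2) = 0.25
```

(`scipy.linalg.toeplitz` builds such matrices directly from the first row and column, though plain indexing as above makes the lag dependence explicit.)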

As in the case of the constant mean, the adjective weakly refers to the fact that the constraint is placed on the moments and not on the distributions. Mimicking the earlier discussion, we could make a stronger assumption that is sufficient to ensure weak stationarity. A process is said to be second order stationary if the pairwise distributions are not affected by shifting, that is, if analogous to the moment condition (4.121) we make the stronger assumption that
\[
P_{X_t,X_s} = P_{X_{t+\tau},X_{s+\tau}}; \quad \text{all } t, s, \tau\,. \tag{4.123}
\]

Observe that second order stationarity implies first order since the marginals can be computed from the joints. The class of iid processes is second order stationary since the joint probabilities are products of the marginals, which do not depend on time.


There are a variety of such stationarity properties that can be defined, but weak stationarity is one of the two most important for two reasons. The first reason will be seen shortly: combining weak stationarity with an asymptotic version of uncorrelatedness gives a more general law of large numbers than the ones derived previously. The second reason will be seen in the next chapter: if a covariance depends only on a single argument (the difference of the sample times), then it will have an ordinary Fourier transform. Transforms of correlation and covariance functions provide a useful analysis tool for stochastic systems.

It is useful before proceeding to consider the other most important stationarity property: strict stationarity (sometimes the adjective "strict" is omitted). As the notion of weak stationarity can be considered as a generalization of uncorrelated, the notion of strict stationarity can be considered as a generalization of iid: if a process is iid, the probability distribution of a $k$-dimensional random vector $X_n, X_{n+1}, \dots, X_{n+k-1}$ does not depend on the starting time of the collection of samples, i.e., for an iid process we have that
\[
P_{X_n,X_{n+1},\dots,X_{n+k-1}} = P_{X_{n+m},X_{n+m+1},\dots,X_{n+m+k-1}}, \quad \text{all } n, k, m\,. \tag{4.124}
\]

This property can be interpreted as saying that the probability of any event involving a finite collection of samples of the random process does not depend on the starting time $n$ of the samples and hence on the definition of time 0. Alternatively, these joint distributions are not affected by shifting the samples by a common amount $m$. In the simple Bernoulli process case this means things like

\begin{align*}
p_{X_n}(0) &= p_{X_0}(0) = 1 - p, \quad \text{all } n \\
p_{X_n,X_k}(0, 1) &= p_{X_0,X_{k-n}}(0, 1) = p(1 - p), \quad \text{all } n, k \\
p_{X_n,X_k,X_l}(0, 1, 0) &= p_{X_0,X_{k-n},X_{l-n}}(0, 1, 0) = (1 - p)^2 p, \quad \text{all } n, k, l,
\end{align*}

and so on. Note that the relative sample times stay the same, that is, the differences between the sample times are preserved, but all of the samples together are shifted without changing the probabilities. A process need not be iid to possess this property of joint probabilities being unaffected by shifts, so we formalize this idea with a definition.
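The shift invariance of the Bernoulli joint pmf's can be checked empirically (a sketch with arbitrary $p$ and shift, not from the text): estimates of $\Pr(X_n = 0, X_k = 1)$ at shifted sample times agree with each other and with $p(1 - p)$.

```python
import numpy as np

rng = np.random.default_rng(4)

# Many independent realizations of an iid Bernoulli(p) process.
p, trials, T = 0.3, 200_000, 16
x = rng.random((trials, T)) < p

def joint01(n, k):
    """Estimate Pr(X_n = 0, X_k = 1) across realizations."""
    return np.mean((~x[:, n]) & x[:, k])

# Shifting both sample times by m = 7 should leave the joint pmf alone.
print(joint01(2, 5), joint01(9, 12), p * (1.0 - p))  # all near 0.21
```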

A discrete time random process {Xn} is said to be stationary or strictly stationary or stationary in the strict sense if (4.124) holds. We have argued that a discrete alphabet iid process is an example of a stationary random process. This definition extends immediately to continuous alphabet discrete time processes by replacing the pmf’s by pdf’s. Both cases can be combined by using cdf’s or the distributions. Hence we can make a


more general definition for discrete time processes: A discrete time random process $\{X_n\}$ is said to be stationary if
\[
P_{X_n,X_{n+1},\dots,X_{n+k-1}} = P_{X_{n+m},X_{n+m+1},\dots,X_{n+m+k-1}}, \quad \text{all } k, n, m\,. \tag{4.125}
\]

This will hold if the corresponding formula holds for pmf’s, pdf’s, or cdf’s. For example, any iid random process is stationary.

Generalizing the definition to include continuous time random processes requires only a little more work, much like that used to describe the Kolmogorov extension theorem. We would like all joint distributions involving a finite collection of samples to not depend on the starting time or, equivalently, to not be affected by shifts. The following general definition does this and it reduces to the previous definition when the process is a discrete time process.

A random process $\{X_t;\, t \in \mathcal{T}\}$ is stationary if
\[
P_{X_{t_0},X_{t_1},\dots,X_{t_{k-1}}} = P_{X_{t_0-\tau},X_{t_1-\tau},\dots,X_{t_{k-1}-\tau}}, \quad \text{all } k, t_0, t_1, \dots, t_{k-1}, \tau\,. \tag{4.126}
\]
The word "all" above must be interpreted with care; it means all choices of dimension $k$, sample times $t_0, \dots, t_{k-1}$, and shift $\tau$ for which the equation makes sense, e.g., $k$ must be a positive integer and $t_i \in \mathcal{T}$ and $t_i - \tau \in \mathcal{T}$ for $i = 0, \dots, k - 1$.

It should be obvious that strict stationarity implies weak stationarity since it implies that $P_{X_t}$ does not depend on $t$, and hence the mean computed from this distribution does not depend on $t$, and $P_{X_t,X_s} = P_{X_{t-s},X_0}$ and hence $K_X(t, s) = K_X(t - s, 0)$. The converse is generally not true: knowing that two moments are unaffected by shifts does not in general imply that all finite dimensional distributions will be unaffected by shifts. This is why weak stationarity is indeed a "weaker" definition of stationarity. There is, however, one extremely important case where weak stationarity is sufficient to ensure strict stationarity: the case of Gaussian random processes. We shall not construct a careful proof of this fact because it is a notational mess that obscures the basic idea, which is actually rather easy to describe. A Gaussian process $\{X_t;\, t \in \mathcal{T}\}$ is completely characterized by knowledge of its mean function $\{m_t;\, t \in \mathcal{T}\}$ and its covariance function $\{K_X(t, s);\, t, s \in \mathcal{T}\}$. All joint pdf's for all possible finite collections of sample times are expressed in terms of these two functions. If the process is known to be weakly stationary, then $m_t = m$ for all $t$, and $K_X(t, s) = K_X(t - s)$ for all $t, s$. This implies that all of the joint pdf's will be unaffected by a time shift, since the mean vector stays the same and the covariance matrix depends only on the relative differences of the sample
