An Introduction to Statistical Signal Processing


4.12 The Central Limit Theorem

The characteristic function of a sum of iid Gaussian random variables has been shown to also be Gaussian, and linear combinations of jointly Gaussian variables have also been shown to be Gaussian. Far more surprising is that the characteristic function of the sum of many non-Gaussian random variables turns out to be approximately Gaussian if the variables are suitably scaled and shifted. This result is called the central limit theorem and is one of the primary reasons for the importance of Gaussian distributions. When a large number of effects are added up with suitable scaling and shifting, the resulting random variable looks Gaussian even if the underlying individual effects are not at all Gaussian. This result is developed in this section.

Just as with laws of large numbers, there is no single central limit theorem — there are many versions of central limit theorems. The various central limit theorems differ in the conditions of applicability. However, they have a common conclusion: the distribution or characteristic function of the sum of a collection of random variables converges to that of a Gaussian random variable. We will present only the simplest form of central limit theorem, a central limit theorem for iid random variables.

Suppose that {X_n} is an iid random process with a common distribution F_X described by a pmf or pdf that is arbitrary except that it has a finite mean EX_n = m and a finite variance σ_{X_n}^2 = σ^2. It will also be assumed that the characteristic function M_X(ju) is well behaved for small u in a manner to be made precise. Consider the “standardized” or “normalized” sum

R_n = (1/n^{1/2}) Σ_{k=0}^{n−1} (X_k − m)/σ . (4.100)

By subtracting the means and dividing by the square root of the variance (the standard deviation), the resulting random variable is easily seen to have zero mean and unit variance; that is,

ER_n = 0 , σ_{R_n}^2 = 1 ,

hence the description “standardized,” or “normalized.” Note that unlike the sample average that appears in the law of large numbers, the sum here is normalized by n^{1/2} and not n.
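As a numerical illustration (added here, not part of the text), the standardization in (4.100) can be checked by simulation. The uniform marginal, sample sizes, and helper name below are all illustrative choices:

```python
import math
import random

def standardized_sum(samples, m, sigma):
    # R_n of (4.100): subtract the mean, divide by sigma, scale by n^(1/2)
    n = len(samples)
    return sum(x - m for x in samples) / (sigma * math.sqrt(n))

# Illustrative iid process: X_i uniform on [0, 1), so m = 1/2, sigma^2 = 1/12.
random.seed(0)
m, sigma = 0.5, math.sqrt(1.0 / 12.0)
n, trials = 30, 20000
realizations = [
    standardized_sum([random.random() for _ in range(n)], m, sigma)
    for _ in range(trials)
]

mean_R = sum(realizations) / trials
var_R = sum((r - mean_R) ** 2 for r in realizations) / trials
# mean_R is near 0 and var_R is near 1, matching ER_n = 0 and var R_n = 1.
```

Any other marginal with finite mean and variance would serve equally well; only the standardization matters.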

Using characteristic functions, we have from the independence of the {Xi} and lemma 4.1 that

 

M_{R_n}(ju) = [M_{(X−m)/σ}(ju/n^{1/2})]^n . (4.101)

CHAPTER 4. EXPECTATION AND AVERAGES

 

We wish to investigate the asymptotic behavior of the characteristic function of (4.101) as n → ∞. This is accomplished by assuming that σ^2 is finite, applying the approximation of (4.16) to M_{(X−m)/σ}(ju/n^{1/2}), and then finding the limiting behavior of the expression. Let Y = (X − m)/σ. Y has zero mean and a second moment of 1, and hence from (4.16)

M_{(X−m)/σ}(ju/n^{1/2}) = 1 − u^2/(2n) + o(u^2/n) ,

where the rightmost term goes to zero faster than u^2/n. Combining this result with (4.101) produces

lim_{n→∞} M_{R_n}(ju) = lim_{n→∞} (1 − u^2/(2n) + o(u^2/n))^n .

From elementary real analysis, however, this limit is

lim_{n→∞} M_{R_n}(ju) = e^{−u^2/2} , (4.102)

the characteristic function of a Gaussian random variable with zero mean and unit variance! Thus, provided that (4.102) holds, a standardized sum of a family of iid random variables has a transform that converges to the transform of a Gaussian random variable regardless of the actual marginal distribution of the iid sequence.

By taking inverse transforms, the convergence of transforms implies that the cdf’s will also converge to a Gaussian cdf (provided some technical conditions are satisfied to ensure that the operations of limits and integration can be exchanged). This does not imply convergence to a Gaussian pdf, however, because, for example, a finite sum of discrete random variables will not have a pdf (unless one resorts to Dirac delta functions). Given a sequence of random variables R_n with cdf F_n and a random variable R with distribution F, then if lim_{n→∞} F_n(r) = F(r) for all real r, we say that R_n converges to R in distribution. Thus the central limit theorem states that under certain conditions, sums of iid random variables adjusted to have zero mean and unit variance converge in distribution to a Gaussian random variable with the same mean and variance.
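Convergence in distribution can be seen numerically by comparing the empirical cdf of R_n with the Gaussian cdf Φ. This sketch, with an arbitrarily chosen uniform marginal, is an added illustration rather than part of the original development:

```python
import bisect
import math
import random

def gaussian_cdf(x):
    # Standard Gaussian cdf expressed via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

random.seed(1)
m, sigma = 0.5, math.sqrt(1.0 / 12.0)   # uniform [0,1) marginal
n, trials = 50, 20000
R = sorted(
    sum(random.random() - m for _ in range(n)) / (sigma * math.sqrt(n))
    for _ in range(trials)
)

def empirical_cdf(sorted_data, r):
    # Fraction of realizations that are <= r
    return bisect.bisect_right(sorted_data, r) / len(sorted_data)

# The empirical cdf tracks the Gaussian cdf at several test points.
max_gap = max(abs(empirical_cdf(R, r) - gaussian_cdf(r)) for r in (-2, -1, 0, 1, 2))
```

The residual gap combines the finite-n distance to the Gaussian and the sampling error of the empirical cdf; both shrink as n and the trial count grow.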

A slight modification of the above development shows that if {X_n} is an iid sequence with mean m and variance σ^2, then

n^{−1/2} Σ_{k=0}^{n−1} (X_k − m)

will have a transform and a cdf converging to those of a Gaussian random variable with mean 0 and variance σ^2. We summarize the central limit theorem that we have established as follows.


Theorem 4.10 (A Central Limit Theorem). Let {X_n} be an iid random process with a finite mean m and variance σ^2. Then

n^{−1/2} Σ_{k=0}^{n−1} (X_k − m)

converges in distribution to a Gaussian random variable with mean 0 and variance σ^2.

Intuitively the theorem states that if we sum up a large number of independent random variables and normalize by n^{1/2} so that the variance of the normalized sum stays constant, then the resulting sum will be approximately Gaussian. For example, a current meter across a resistor will measure the effects of the sum of millions of electrons randomly moving and colliding with each other. Regardless of the probabilistic description of these micro-events, the global current will appear to be Gaussian. Making this precise yields a model of thermal noise in resistors. Similarly, if dust particles are suspended on a dish of water and subjected to the random collisions of millions of molecules, then the motion of any individual particle in two dimensions will appear to be Gaussian. Making this rigorous yields the classic model for what is called “Brownian motion.” A similar development in one dimension yields the Wiener process.

Note that in (4.101), if the Gaussian characteristic function is substituted on the right-hand side, a Gaussian characteristic function appears on the left. Thus the central limit theorem says that if you sum up random variables, you approach a Gaussian distribution. Once you have a Gaussian distribution, you “get stuck” there — adding more random variables of the same type (or Gaussian random variables) to the sum does not change the Gaussian character. The Gaussian distribution is an example of an infinitely divisible distribution: the nth root of its characteristic function is again the characteristic function of a distribution of the same type, as seen in (4.101). Equivalently stated, the distribution class is invariant under summations.

4.13 Sample Averages

In many applications, engineers analyze the accuracy of estimates, the probability of detector error, etc., as a function of the amount of data available. This and the next sections are a prelude to such analyses. They also provide some very good practice manipulating expectations and a few results of interest in their own right.

In this section we study the behavior of the arithmetic average of the first n values of a discrete time random process with either a discrete or a continuous alphabet. Specifically, the variance of the average is considered as a function of n.

Suppose we are given a process {X_n}. The sample average of the first n values of {X_n} is

S_n = (1/n) Σ_{i=0}^{n−1} X_i .

The mean of S_n is found easily using the linearity of expectation (expectation property 3) as

ES_n = E[(1/n) Σ_{i=0}^{n−1} X_i] = (1/n) Σ_{i=0}^{n−1} EX_i . (4.103)

Hence the mean of the sample average is the same as the average of the means of the random variables produced by the process. Suppose now that we assume that the mean of the random variables is a constant, EX_i = X̄, independent of i. Then ES_n = X̄. In terms of estimation theory, if one estimates an unknown random process mean, X̄, by S_n, then the estimate is said to be unbiased because the expected value of the estimate is equal to the value being estimated. Obviously an unbiased estimate is not unique, so being unbiased is only one desirable characteristic of an estimate (problem 4.25).
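A quick simulation (an added illustration, with an arbitrary exponential marginal) shows the unbiasedness numerically: averaging many independent copies of S_n recovers the true mean.

```python
import random

def sample_average(xs):
    # S_n: arithmetic average of the first n samples
    return sum(xs) / len(xs)

random.seed(2)
true_mean = 3.0   # illustrative choice of EX_i
n, trials = 25, 20000
estimates = [
    sample_average([random.expovariate(1.0 / true_mean) for _ in range(n)])
    for _ in range(trials)
]

# ES_n = EX_i: the average of many independent estimates is close to 3.0.
avg_estimate = sum(estimates) / trials
```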

Next consider the variance of the sample average:

 

 

 

 

 

 

σ_{S_n}^2 = E[(S_n − E(S_n))^2]
 = E[((1/n) Σ_{i=0}^{n−1} X_i − (1/n) Σ_{i=0}^{n−1} EX_i)^2]
 = E[((1/n) Σ_{i=0}^{n−1} (X_i − EX_i))^2]
 = (1/n^2) Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} E[(X_i − EX_i)(X_j − EX_j)] .

The reader should be certain that the preceding operations are well understood, as they are frequently encountered in analyses. Note that expanding the square requires the use of separate dummy indices in order to get all of the cross products. Once expanded, linearity of expectation permits the interchange of expectation and summation.
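The expansion with two dummy indices can be sanity-checked numerically: for any data set, the empirical variance of the sample average equals the double sum of empirical cross-product expectations. The correlated toy process below is an invented example, not from the text:

```python
import random

rng = random.Random(4)
n, trials = 4, 20000

# Toy correlated process: X_i = Z_i + Z_{i+1} with iid uniform Z's.
runs = []
for _ in range(trials):
    z = [rng.random() for _ in range(n + 1)]
    runs.append([z[i] + z[i + 1] for i in range(n)])

means = [sum(run[i] for run in runs) / trials for i in range(n)]

# Empirical variance of the sample average S_n ...
S = [sum(run) / n for run in runs]
S_mean = sum(S) / trials
var_S = sum((s - S_mean) ** 2 for s in S) / trials

# ... equals (1/n^2) * double sum of E[(X_i - EX_i)(X_j - EX_j)].
cross = [[sum((run[i] - means[i]) * (run[j] - means[j]) for run in runs) / trials
          for j in range(n)] for i in range(n)]
double_sum = sum(cross[i][j] for i in range(n) for j in range(n)) / n ** 2
```

The two quantities agree to floating-point precision because the expansion is an algebraic identity, not an approximation.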

Recognizing the expectation in the sum as the covariance function, the variance of the sample average becomes

σ_{S_n}^2 = (1/n^2) Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} K_X(i, j) . (4.104)


Note that so far we have used none of the specific knowledge of the process, i.e., the above formula holds for general discrete time processes and does not require such assumptions as time-constant mean, time-constant variance, identical marginal distributions, independence, uncorrelated processes, etc. If we now use the assumption that the process is uncorrelated, the covariance becomes zero except when i = j, and expression (4.104) becomes

σ_{S_n}^2 = (1/n^2) Σ_{i=0}^{n−1} σ_{X_i}^2 . (4.105)

If we now also assume that the variances σ_{X_i}^2 are equal to some constant value σ_X^2 for all times i, e.g., the process has identical marginal distributions as for an iid process, then the equation becomes

σ_{S_n}^2 = σ_X^2/n . (4.106)

Thus, for uncorrelated discrete time random processes with mean and variance not depending on time, the sample average has expectation equal to the (time-constant) mean of the process, and the variance of the sample average tends to zero as n → ∞. Of course we have only specified sufficient conditions. Expression (4.104) goes to zero with n under more general circumstances, as we shall see later.
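The 1/n decay in (4.106) can be sketched in a short simulation; the uniform marginal and trial counts below are arbitrary choices for illustration:

```python
import random

def var_of_sample_average(n, trials, rng):
    # Empirical variance of S_n over many independent realizations
    means = [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]
    mu = sum(means) / trials
    return sum((s - mu) ** 2 for s in means) / trials

rng = random.Random(3)
sigma2 = 1.0 / 12.0          # variance of one uniform [0,1) sample
v10 = var_of_sample_average(10, 40000, rng)
v40 = var_of_sample_average(40, 40000, rng)
# (4.106): var S_n = sigma^2 / n, so quadrupling n cuts the variance by 4.
```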

For now, however, we stick with uncorrelated processes with mean and variance independent of time and require only a definition to obtain our first law of large numbers, a result implicit in equation (4.106).

4.14 Convergence of Random Variables

The preceding section demonstrated a form of convergence for the sequence of random variables, {S_n}, the sequence of sample averages, that is different from convergence as it is seen for a nonrandom sequence. To review, a nonrandom sequence {x_n} is said to converge to the limit x if for every ε > 0 there exists an N such that |x_n − x| < ε for every n > N. The preceding section did not see S_n converge in this sense. Nothing was said about the individual realizations S_n(ω) as a function of ω. Only the variance of the sequence, σ_{S_n}^2, was shown to converge to zero in the usual ε, N sense. The variance calculation probabilistically averages across ω. For any particular ω, the realization S_n may, in fact, not converge to zero.

Thus, in order to make precise the notion of convergence of sample averages to a limit, we need to make precise the notion of convergence of a sequence of random variables. In this section we will describe four notions of convergence of random variables. These are perhaps the most commonly encountered, but they are by no means an exhaustive list. The common goal is to provide a useful definition for saying that a sequence of random variables, say Y_n, n = 1, 2, . . . , converges to a random variable Y, which will be considered the limit of the sequence. Our main application will be the case where Y_n = S_n, a sample average of n samples of a random process, and Y is the expectation of the samples; that is, the limit is a trivial random variable, a constant.

The most straightforward generalization of the usual idea of a limit to random variables is easy to define, but virtually useless. If for every sample point ω we had limn→∞ Yn(ω) = Y (ω) in the usual sense of convergence of numbers, then we could say that Yn converges pointwise to Y , that is, for every sample point in the sample space. Unfortunately it is rarely possible to prove so strong a result, nor is it necessary.

A slight variation of this yields a far more important notion of convergence. A sequence of random variables Y_n, n = 1, 2, . . . , is said to converge to a random variable Y with probability one, or converge w.p. 1, if the set of sample points ω such that lim_{n→∞} Y_n(ω) = Y(ω) is an event with probability one. Thus a sequence converges with probability one if it converges pointwise on a set of probability one; it can do anything outside of that set, e.g., converge to something else or not converge at all. Since the total probability of all such bad sequences is 0, this has no practical significance. Although this is the easiest useful concept of convergence to define, it is the most difficult to work with, and most proofs involving convergence with probability one are far beyond the mathematical prerequisites and capabilities of this course. Hence we will focus on two other notions of convergence that are perhaps less intuitive to understand, but are far easier to use when proving results. First note, however, that there are many equivalent names for convergence with probability one. It is often called convergence almost surely, abbreviated a.s., or convergence almost everywhere, abbreviated a.e. Convergence with probability one will not be considered in any depth here, but some toy examples will be considered in the problems to help get the concept across.

Henceforth two definitions of convergence of random variables will be emphasized, both well suited to the type of results developed here (and one that is used in the first such results, Bernoulli’s weak law of large numbers for iid random processes). The first is convergence in mean square, convergence of the type seen in the last section, which leads to a result called a mean ergodic theorem. The second is called convergence in probability, which is implied by the first and leads to a result called the weak law of large numbers. The second result will follow from the first via a simple but powerful inequality relating probabilities and expectations.


A sequence of random variables Yn; n = 1, 2, . . . is said to converge in mean square or converge in quadratic mean to a random variable Y if

lim_{n→∞} E[(Y_n − Y)^2] = 0 .

This is also written Y_n → Y in mean square or Y_n → Y in quadratic mean. If Y_n converges to Y in mean square, we state this convergence mathematically by writing

l.i.m._{n→∞} Y_n = Y ,

where l.i.m. is an acronym for “limit in the mean.” Although it is likely not obvious to the novice, it is important to understand that convergence in mean square does not imply convergence with probability one. Examples converging in one sense and not the other may be found in problem 32.

Thus a sequence of random variables converges in mean square to another random variable if the second moment of the difference converges to zero in the ordinary sense of convergence of a sequence of real numbers. Although the definition encompasses convergence to a random variable with any degree of “randomness,” in most applications that we shall encounter the limiting random variable is a degenerate random variable, i.e., a constant. In particular, the sequence of sample averages, {S_n}, of the preceding section is next seen to converge in this sense.

The final notion of convergence bears a strong resemblance to the notion of convergence with probability one, but the resemblance is a faux ami; the two notions are fundamentally different. A sequence of random variables Y_n; n = 1, 2, . . . is said to converge in probability to a random variable Y if for every ε > 0,

lim_{n→∞} Pr(|Y_n − Y| > ε) = 0 .

Thus a sequence of random variables converges in probability if the probability that the nth member of the sequence differs from the limit by more than an arbitrarily small ε goes to zero as n → ∞. Note that just as with convergence in mean square, convergence in probability is silent on the question of convergence of individual realizations Y_n(ω). You could, in fact, have no realizations converge individually and yet have convergence in probability. All convergence in probability states is that at each n, Pr(ω : |Y_n(ω) − Y(ω)| > ε) tends to zero with n. Suppose at time n a given subset of Ω satisfies the inequality, at time n + 1 a different subset satisfies the inequality, at time n + 2 still a different subset satisfies the inequality, etc. As long as the subsets have diminishing probability, convergence in probability can occur without convergence of the individual sequences.

Also, as in convergence in the mean square sense, convergence in probability is to a random variable in general, but this includes the most interesting case of a degenerate random variable — i.e., a constant.


The two notions of convergence — convergence in mean square and convergence in probability — can be related to each other via simple, but important, inequalities. It will be seen that convergence in mean square is the stronger of the two notions; that is, if a sequence converges in mean square, then it also converges in probability, but not necessarily vice versa. The two inequalities are slight variations on each other, but they are stated separately for clarity, and both an elementary and a more elegant proof are presented.

The Tchebychev Inequality

Suppose that X is a random variable with mean m_X and variance σ_X^2. Then

Pr(|X − m_X| > ε) ≤ σ_X^2/ε^2 . (4.107)

We prove the result here for the discrete case. The continuous case is similar (and can be inferred from the more general proof of the Markov inequality to follow).

The result follows from a sequence of inequalities.

σ_X^2 = E[(X − m_X)^2]
 = Σ_x (x − m_X)^2 p_X(x)
 = Σ_{x: |x−m_X| ≤ ε} (x − m_X)^2 p_X(x) + Σ_{x: |x−m_X| > ε} (x − m_X)^2 p_X(x)
 ≥ Σ_{x: |x−m_X| > ε} (x − m_X)^2 p_X(x)
 ≥ ε^2 Σ_{x: |x−m_X| > ε} p_X(x)
 = ε^2 Pr(|X − m_X| > ε) .

Note that the Tchebychev inequality implies that

Pr(|V − EV| ≥ γσ_V) ≤ 1/γ^2 ,

that is, the probability that V is farther from its mean than γ times its standard deviation (the square root of its variance) is no greater than 1/γ^2.
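Because the Tchebychev bound holds for any distribution with the stated mean and variance, it can be checked exactly on a small pmf; the pmf below is an arbitrary example:

```python
# An arbitrary zero-mean pmf for checking (4.107) exactly.
pmf = {-2: 0.1, 0: 0.8, 2: 0.1}

m = sum(x * p for x, p in pmf.items())                 # mean: 0.0
var = sum((x - m) ** 2 * p for x, p in pmf.items())    # variance: 0.8

eps = 1.0
tail = sum(p for x, p in pmf.items() if abs(x - m) > eps)  # Pr(|X - m| > 1)
bound = var / eps ** 2
# Here tail = 0.2 and bound = 0.8, so the inequality holds with room to spare.
```

The slack between 0.2 and 0.8 is typical: Tchebychev trades tightness for universality.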

The Markov Inequality. Given a nonnegative random variable U with finite expectation EU, for any a > 0 we have

Pr(U ≥ a) = P_U([a, ∞)) ≤ EU/a .


Proof: The result can be proved in the same manner as the Tchebychev inequality by separate consideration of the discrete and continuous cases. Here we give a more general proof. Fix a > 0 and set F = {u : u ≥ a}. Let 1_F(u) be the indicator function of F, 1 if u ≥ a and 0 otherwise. Then since F ∩ F^c = ∅ and F ∪ F^c = Ω, we have using the linearity of expectation and the fact that U ≥ 0 with probability one that

E[U] = E[U(1_F(U) + 1_{F^c}(U))]
 = E[U 1_F(U)] + E[U 1_{F^c}(U)] ≥ E[U 1_F(U)] ≥ a E[1_F(U)]
 = a P(F) ,

completing the proof.

Observe that if a random variable U is nonnegative and has small expectation, say EU ≤ ε, then the Markov inequality with a = √ε implies that

Pr(U ≥ √ε) ≤ √ε .

This can be interpreted as saying that the random variable can take on values greater than √ε no more than √ε of the time.
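The Markov inequality can likewise be verified exactly on a small nonnegative pmf; the values below are an invented example:

```python
# An arbitrary nonnegative pmf; EU = 1*0.3 + 4*0.2 = 1.1.
pmf = {0: 0.5, 1: 0.3, 4: 0.2}
EU = sum(u * p for u, p in pmf.items())

checks = []
for a in (0.5, 1.0, 2.0, 4.0):
    tail = sum(p for u, p in pmf.items() if u >= a)   # Pr(U >= a)
    checks.append(tail <= EU / a)                     # Markov bound
# The bound holds at every threshold a.
```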

Before applying this result, we pause to present a second proof of the Markov inequality that has a side result of some interest in its own right. As before, assume that U ≥ 0. Assume for the moment that U is continuous so that

E[U] = ∫_0^∞ x f_U(x) dx .

Consider the admittedly strange looking equality

x = ∫_0^∞ 1_{[α,∞)}(x) dα ,

which follows since the integrand is 1 if and only if α ≤ x, and hence integrating 1 as α ranges from 0 to x yields x. Plugging this equality into the previous integral expression for the expectation and changing the order of integration yields

E[U] = ∫_0^∞ (∫_0^∞ 1_{[α,∞)}(x) dα) f_U(x) dx
 = ∫_0^∞ (∫_0^∞ 1_{[α,∞)}(x) f_U(x) dx) dα ,


which can be expressed as

 

 

E[U] = ∫_0^∞ Pr(U > α) dα = ∫_0^∞ (1 − F_U(α)) dα . (4.108)

This result immediately gives the Markov inequality since for any fixed a > 0,

E[U] = ∫_0^∞ Pr(U > α) dα ≥ a Pr(U > a) .

To see this, note that Pr(U > α) is monotonically nonincreasing in α, so for all α ≤ a we must have Pr(U > α) ≥ Pr(U > a) (and for all other α, Pr(U > α) ≥ 0). Plugging these bounds into the integral yields the claimed inequality.
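Formula (4.108) can be checked numerically for a distribution with a known tail; taking U exponential with rate λ (so Pr(U > α) = e^{−λα} and E[U] = 1/λ) is an illustrative choice:

```python
import math

lam = 2.0                       # illustrative rate; E[U] = 1/lam = 0.5

def tail(alpha):
    # Pr(U > alpha) for an exponential random variable
    return math.exp(-lam * alpha)

# Trapezoidal approximation of the tail integral in (4.108).
h, T = 1e-4, 20.0
steps = int(T / h)
integral = h * (0.5 * tail(0.0)
                + sum(tail(i * h) for i in range(1, steps))
                + 0.5 * tail(T))

expected = 1.0 / lam
```

The quadrature error and the truncation beyond T = 20 are both far below the tolerance used here.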

Lemma 4.3 If Yn converges to Y in mean square, then it also converges in probability.

Proof. From the Markov inequality applied to |Y_n − Y|^2, we have for any ε > 0

Pr(|Y_n − Y| > ε) = Pr(|Y_n − Y|^2 > ε^2) ≤ E[|Y_n − Y|^2]/ε^2 .

The right-hand term goes to zero as n → ∞ by definition of convergence in mean square.

Although convergence in mean square implies convergence in probability, the reverse statement cannot be made; i.e., they are not equivalent. This is shown by a simple counterexample. Let Y_n be a discrete random variable with pmf

p_{Y_n}(y) = 1 − 1/n if y = 0, 1/n if y = n .

Convergence in probability to zero without convergence in mean square is easily verified. In particular, the sequence converges in probability since Pr[|Y_n − 0| > ε] = Pr[Y_n > 0] = 1/n, which goes to 0 as n → ∞. On the other hand, E[|Y_n − 0|^2] would have to go to 0 for Y_n to converge to 0 in mean square, but it is E[Y_n^2] = 0^2(1 − 1/n) + n^2/n = n, which does not converge to 0 as n → ∞.
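The counterexample is simple enough to compute exactly; the sketch below just evaluates the two quantities in closed form:

```python
def pr_exceeds(n, eps):
    # Pr(|Y_n - 0| > eps) for 0 < eps < n: only the point mass at y = n counts.
    return 1.0 / n

def second_moment(n):
    # E[|Y_n - 0|^2] = 0^2 * (1 - 1/n) + n^2 * (1/n) = n
    return 0.0 * (1.0 - 1.0 / n) + (n ** 2) / n

# Convergence in probability: the exceedance probability shrinks like 1/n ...
probs = [pr_exceeds(n, 0.5) for n in (10, 100, 1000)]
# ... but the second moment grows like n, so no mean square convergence.
moments = [second_moment(n) for n in (10, 100, 1000)]
```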

4.15 Weak Law of Large Numbers

We now have the definitions and preliminaries to prove laws of large numbers showing that sample averages converge to the expectation of the individual samples. The basic (and classical) results hold for uncorrelated random processes with constant variance.
