
allow the estimator to be an arbitrary function of the observation, it must be a linear function of the form
$$\hat{Y}(x) = ax + b, \qquad (4.76)$$
where a and b are fixed constants which are chosen to minimize the mean squared error. Strictly speaking, this is an affine function rather than a linear function; it is linear only if b = 0. The terminology is common, however, and we will use it.
The goal now is to find the a and b which minimize
$$E[(Y - \hat{Y}(X))^2] = E[(Y - aX - b)^2]. \qquad (4.77)$$
Rewriting the formula for the error in terms of the mean-removed random variables yields for any a, b:
E &[Y − (aX + b)]2'
=E &[(Y − EY ) − a(X − EX) − (b − EY + aEX)]2'
=σY2 + a2σX2 + (b − EY + aEX)2 − 2aCOV (X, Y )
since the remaining cross products are all zero (why?). Since the first term does not depend on a or b, minimizing the mean squared error is equivalent to minimizing
$$a^2\sigma_X^2 + (b - EY + aEX)^2 - 2a\,\mathrm{COV}(X, Y).$$
First note that the middle term is nonnegative. Once a is chosen, this term will be minimized by choosing b = EY − aEX, which makes this term 0. Thus the best a must minimize $a^2\sigma_X^2 - 2a\,\mathrm{COV}(X, Y)$. A little calculus (setting the derivative $2a\sigma_X^2 - 2\,\mathrm{COV}(X, Y)$ to zero) shows that the minimizing a is
$$a = \frac{\mathrm{COV}(X, Y)}{\sigma_X^2} \qquad (4.78)$$
and hence the best b is
$$b = EY - \frac{\mathrm{COV}(X, Y)}{\sigma_X^2}\,EX. \qquad (4.79)$$
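As a concrete illustration (a minimal sketch, not from the text; the simulated joint distribution and variable names are assumptions chosen for the example), the coefficients of (4.78) and (4.79) can be estimated from sample moments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed model for illustration: Y = 2X + 1 + noise.
n = 100_000
X = rng.normal(loc=1.0, scale=2.0, size=n)
Y = 2.0 * X + 1.0 + rng.normal(scale=0.5, size=n)

# Sample estimates of the moments appearing in (4.78) and (4.79).
cov_xy = np.cov(X, Y, bias=True)[0, 1]    # COV(X, Y)
var_x = X.var()                           # sigma_X^2
a = cov_xy / var_x                        # equation (4.78)
b = Y.mean() - a * X.mean()               # equation (4.79)

Y_hat = a * X + b
print("a ≈", a, "b ≈", b)
print("mean squared error ≈", np.mean((Y - Y_hat) ** 2))
```

For this simulated model the estimates should come out close to a = 2 and b = 1, and the resulting mean squared error close to the noise variance 0.25.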
The connection of second order moments and linear estimation also plays a fundamental role in the vector analog to the problem of the previous section, that is, in the estimation of a vector Y given an observed vector X. The details are more complicated, but the basic ideas are essentially the same.
Unfortunately the conditional expectation is mathematically tractable only in a few very special cases, e.g., the case of jointly Gaussian vectors. In the Gaussian case the conditional expectation given X is formed by a simple matrix multiplication on X with possibly a constant vector being added; that is, the optimal estimate has a linear form. (As in the scalar case, technically this is an affine form and not a linear form if a constant vector is added.) Even when the random vectors are not Gaussian, linear predictors or estimates are important because of their simplicity. Although they are not in general optimal, they play an important role in signal processing. Hence we next turn to the problem of finding the optimal linear estimate of one vector given another.
Suppose as before that we are given a k-dimensional vector X and wish to predict an m-dimensional vector Y . We now restrict ourselves to estimates of the form
$$\hat{Y} = AX,$$
where the m × k-dimensional matrix A can be considered as a matrix of k-dimensional row vectors $a_i^t$; $i = 0, \cdots, m - 1$:
$$A = [a_0, a_1, \cdots, a_{m-1}]^t$$
so that if $\hat{Y} = (\hat{Y}_0, \cdots, \hat{Y}_{m-1})^t$, then
$$\hat{Y}_i = a_i^t X$$
and hence
$$\epsilon^2(\hat{Y}) = \sum_{i=0}^{m-1} E[(Y_i - a_i^t X)^2]. \qquad (4.80)$$
The goal is to find the matrix A that minimizes $\epsilon^2$, which can be considered as a function of the estimate $\hat{Y}$ or of the matrix A defining the estimate. We shall provide two separate solutions which are almost, but not quite, equivalent. The first is constructive in nature: a specific A will be given and shown to be optimal. The second development is descriptive: without actually giving the matrix A, we will show that a certain property is necessary and sufficient for the matrix to be optimal. That property is called the orthogonality principle, and it states that the optimal matrix is the one that causes the error vector $Y - \hat{Y}$ to be orthogonal to (have zero correlation with) the observed vector X. The first development is easier to use because it provides a formula for A that can be immediately computed in many cases. The second development is less direct and less immediately applicable, but it turns out to be more general: the descriptive property can
be used to derive A even when the first development is not applicable. The orthogonality principle plays a fundamental role in all of linear estimation theory.
The error $\epsilon^2(A)$ is minimized if each term $E[(Y_i - a_i^t X)^2]$ is minimized over $a_i$ since there is no interaction among the terms in the sum. We can do no better when minimizing a sum of such nonnegative terms than to minimize each term separately. Thus the fundamental problem is the following simpler one: Given a random vector X and a random variable (one-dimensional or scalar vector) Y , we seek a vector a that minimizes
$$\epsilon^2(a) = E[(Y - a^t X)^2]. \qquad (4.81)$$
One way to find the optimal a is to use calculus, setting derivatives of $\epsilon^2(a)$ to zero and verifying that the stationary point so obtained is a global minimum. As previously discussed, variational techniques can be avoided via elementary inequalities if the answer is known. We shall show that the optimal a is a solution of
$$a^t R_X = E(Y X^t), \qquad (4.82)$$
so that if the autocorrelation matrix defined by
$$R_X = E[XX^t] = \{R_X(i, j) = E(X_i X_j);\ i, j = 0, \cdots, k - 1\}$$
is invertible, then the optimal a is given by
$$a^t = E(Y X^t) R_X^{-1}. \qquad (4.83)$$
To prove this we assume that a satisfies (4.83) and show that for any other vector b
$$\epsilon^2(b) \geq \epsilon^2(a). \qquad (4.84)$$
To do this we write
$$\epsilon^2(b) = E[(Y - b^t X)^2] = E[(Y - a^t X + a^t X - b^t X)^2]$$
$$= E[(Y - a^t X)^2] + 2E[(Y - a^t X)(a^t X - b^t X)] + E[(a^t X - b^t X)^2].$$
Of the final terms, the first term is just $\epsilon^2(a)$ and the rightmost term is obviously nonnegative. Thus we have the bound
$$\epsilon^2(b) \geq \epsilon^2(a) + 2E[(Y - a^t X)(a^t - b^t)X]. \qquad (4.85)$$
The crossproduct term can be written as
$$2E[(Y - a^t X)(a^t - b^t)X] = 2E[(Y - a^t X)X^t(a - b)]$$
$$= 2E[(Y - a^t X)X^t](a - b)$$
$$= 2\left(E[Y X^t] - a^t E[XX^t]\right)(a - b)$$
$$= 2\left(E[Y X^t] - a^t R_X\right)(a - b)$$
$$= 0 \qquad (4.86)$$
invoking (4.82). Combining this with (4.85) proves (4.84) and hence optimality. Note that because of the symmetry of autocorrelation matrices and their inverses, we can rewrite (4.83) as
$$a = R_X^{-1} E[Y X]. \qquad (4.87)$$
Using the above result to perform a termwise minimization of (4.80) now yields the following theorem describing the optimal linear vector predictor.
Theorem 4.7 The minimum mean squared error linear predictor of the form $\hat{Y} = AX$ is given by any solution A of the equation:
$$A R_X = E(Y X^t).$$
If the matrix $R_X$ is invertible, then A is uniquely given by
$$A^t = R_X^{-1} E[X Y^t],$$
that is, the matrix A has rows $a_i^t$; $i = 0, 1, \ldots, m - 1$, with
$$a_i = R_X^{-1} E[Y_i X].$$
Alternatively,
$$A = E[Y X^t] R_X^{-1}. \qquad (4.88)$$
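As an illustrative sketch (not part of the text; the simulated model, dimensions, and variable names are assumptions chosen for the example), the predictor matrix of Theorem 4.7 can be formed from sample second moments:

```python
import numpy as np

rng = np.random.default_rng(1)

k, m, n = 4, 2, 200_000          # dim(X), dim(Y), number of samples
A_true = rng.normal(size=(m, k)) # assumed "true" linear relation

# Zero-mean data: each column is one observation of the k-vector X.
X = rng.normal(size=(k, n))
Y = A_true @ X + 0.1 * rng.normal(size=(m, n))

# Sample versions of R_X = E[X X^t] and E[Y X^t].
R_X = (X @ X.T) / n
R_YX = (Y @ X.T) / n

# Theorem 4.7: A = E[Y X^t] R_X^{-1} (solve a linear system rather than invert).
A_hat = np.linalg.solve(R_X.T, R_YX.T).T

print("recovered A:\n", np.round(A_hat, 3))
print("true A:\n", np.round(A_true, 3))
```

Because the data are zero mean and the additive noise is independent of X, the recovered matrix should closely match the assumed A_true.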
Having found the best linear estimate, it is easy to modify the development to find the best estimate of the form
$$\hat{Y}(X) = AX + b, \qquad (4.89)$$
where now we allow an additional constant term. This is also often called a linear estimate, although as previously noted it is more correctly called an affine estimate because of the extra constant vector term. As the end result and proof strongly resemble the linear estimate result, we proceed directly to the theorem.
Theorem 4.8 The minimum mean squared estimate of the form $\hat{Y} = AX + b$ is given by any solution A of the equation:
$$A K_X = E[(Y - E(Y))(X - E(X))^t] \qquad (4.90)$$
where the covariance matrix $K_X$ is defined by
$$K_X = E[(X - E(X))(X - E(X))^t] = R_{X - E(X)},$$
and
$$b = E(Y) - A E(X).$$
If $K_X$ is invertible, then
$$A = E[(Y - E(Y))(X - E(X))^t] K_X^{-1}. \qquad (4.91)$$
Note that if X and Y have zero means, then the result reduces to the previous result; that is, affine predictors offer no advantage over linear predictors for zero mean random vectors. To prove the theorem, let C be any matrix and d any vector (both of suitable dimensions) and note that
$$E(\|Y - (CX + d)\|^2)$$
$$= E(\|(Y - E(Y)) - C(X - E(X)) + E(Y) - CE(X) - d\|^2)$$
$$= E(\|(Y - E(Y)) - C(X - E(X))\|^2)$$
$$+\, E(\|E(Y) - CE(X) - d\|^2)$$
$$+\, 2E[Y - E(Y) - C(X - E(X))]^t\,[E(Y) - CE(X) - d].$$
From Theorem 4.7, the first term is minimized by choosing C = A, where A is a solution of (4.90); also, the second term is the expectation of the squared norm of a vector that is identically zero if C = A and d = b, and similarly for this choice of C and d the third term is zero. Thus
$$E(\|Y - (CX + d)\|^2) \geq E(\|Y - (AX + b)\|^2).$$
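The following hedged sketch (not from the text; the model parameters and names are assumptions for illustration) applies Theorem 4.8 to nonzero-mean simulated data, estimating A from sample covariances and b from the sample means:

```python
import numpy as np

rng = np.random.default_rng(2)

k, m, n = 3, 2, 200_000
A_true = rng.normal(size=(m, k))
b_true = np.array([5.0, -3.0])

# Nonzero-mean observations, stored as columns.
X = rng.normal(loc=2.0, size=(k, n))
Y = A_true @ X + b_true[:, None] + 0.1 * rng.normal(size=(m, n))

mX, mY = X.mean(axis=1, keepdims=True), Y.mean(axis=1, keepdims=True)
Xc, Yc = X - mX, Y - mY

K_X = (Xc @ Xc.T) / n            # covariance matrix of X
K_YX = (Yc @ Xc.T) / n           # E[(Y - EY)(X - EX)^t]

A_hat = np.linalg.solve(K_X.T, K_YX.T).T   # equation (4.91)
b_hat = (mY - A_hat @ mX).ravel()          # b = E(Y) - A E(X)

print("A_hat ≈\n", np.round(A_hat, 3))
print("b_hat ≈", np.round(b_hat, 3))
```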
We often restrict interest to linear estimates by assuming that the various vectors have zero mean. This is not always possible, however. For example, groups of pixels in a sampled image intensity raster can be used to predict other pixel groups, but pixel values are always nonnegative and hence always have nonzero means. Hence in some problems affine predictors may be preferable. Nonetheless, we will often follow the common practice of focusing on the linear case and extending when necessary. In most studies of linear prediction it is assumed that the mean is zero, i.e., that any dc value of the process has been removed. If this assumption is not made, linear estimation theory is still applicable but will generally give performance inferior to that of affine prediction.
The Orthogonality Principle
Although we have proved the form of the optimal linear predictor of one vector given another, there is another way to describe the result that is often useful for deriving optimal linear predictors in somewhat different situations. To develop this alternative viewpoint we focus on the error vector
$$e = Y - \hat{Y}. \qquad (4.92)$$
Rewriting (4.92) as $Y = \hat{Y} + e$ points out that the vector Y can be considered as its estimate plus an error or “noise” term. The goal of an optimal predictor is then to minimize the error energy $e^t e = \sum_{n=0}^{m-1} e_n^2$. If the estimate is linear, then
$$e = Y - AX.$$
As with the basic development for the linear predictor, we simplify things for the moment and look at the scalar prediction problem of predicting a random variable Y by $\hat{Y} = a^t X$, yielding a scalar error of $e = Y - \hat{Y} = Y - a^t X$. Since we have seen that the overall mean squared error $E[e^t e]$ in the vector case is minimized by separately minimizing each component $E[e_k^2]$, we can later easily extend our results for the scalar case to the vector case.
Suppose that a is chosen optimally and consider the crosscorrelation between an arbitrary error term and the observable vector:
$$E[(Y - \hat{Y})X] = E[(Y - a^t X)X]$$
$$= E[Y X] - E[X(X^t a)]$$
$$= E[Y X] - R_X a = 0$$
using (4.82).
Thus for the optimal predictor, the error satisfies
E[eX] = 0,
or, equivalently,
$$E[e X_n] = 0;\quad n = 0, \cdots, k - 1. \qquad (4.93)$$
When two random variables e and X are such that their expected product E(eX) is 0, they are said to be orthogonal and we write
$$e \perp X.$$
We have therefore shown that the optimal linear estimate of a scalar random variable given a vector of observations causes the error to be orthogonal to
all of the observables and hence orthogonality of error and observations is a necessary condition for optimality of a linear estimate.
Conversely, suppose that we know a linear estimate a is such that it renders the prediction error orthogonal to all of the observations. Arguing as we have before, suppose that b is any other linear predictor vector and observe that
$$\epsilon^2(b) = E[(Y - b^t X)^2]$$
$$= E[(Y - a^t X + a^t X - b^t X)^2]$$
$$\geq \epsilon^2(a) + 2E[(Y - a^t X)(a^t X - b^t X)],$$
where the equality holds if b = a. Letting $e = Y - a^t X$ denote the error resulting from an a that makes the error orthogonal to the observations, the rightmost term can be rewritten as
$$2E[e(a^t X - b^t X)] = 2(a^t - b^t)E[eX] = 0.$$
Thus we have shown that $\epsilon^2(b) \geq \epsilon^2(a)$; no linear estimate can outperform one yielding an error orthogonal to the observations, and hence such orthogonality is sufficient as well as necessary for optimality.
Since the optimal estimate of a vector Y given X is given by the componentwise optimal predictions given X, we have thus proved the following alternative to Theorem 4.7.
Theorem 4.9 The Orthogonality Principle:
A linear estimate $\hat{Y} = AX$ is optimal (in the mean squared error sense) if and only if the resulting errors are orthogonal to the observations, that is, if $e = Y - AX$, then
$$E[e_i X_n] = 0;\quad i = 0, \cdots, m - 1;\ n = 0, \cdots, k - 1.$$
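A quick numerical check of the orthogonality principle (an illustrative sketch, not from the text; the simulated model is an assumption): compute the optimal scalar predictor $a = R_X^{-1}E[YX]$ from sample moments and verify that the resulting error is approximately uncorrelated with every component of the observation.

```python
import numpy as np

rng = np.random.default_rng(3)

k, n = 3, 500_000
a_true = np.array([1.0, -2.0, 0.5])

X = rng.normal(size=(k, n))                     # zero-mean observations
Y = a_true @ X + 0.2 * rng.normal(size=n)       # scalar variable to predict

R_X = (X @ X.T) / n                             # sample E[X X^t]
r_YX = (X * Y).mean(axis=1)                     # sample E[Y X]

a_opt = np.linalg.solve(R_X, r_YX)              # a = R_X^{-1} E[Y X], per (4.87)
e = Y - a_opt @ X                               # prediction error

# Orthogonality principle: E[e X_n] should be ~0 for every component n.
print("E[e X_n] ≈", np.round((X * e).mean(axis=1), 4))
```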
4.11 Correlation and Covariance Functions
We turn now to correlation in the framework of random processes. The notion of an iid random process can be generalized by specifying the component random variables to be merely uncorrelated rather than independent. Although uncorrelatedness is a much weaker requirement than independence, it is sufficient for many applications, as will be seen in several ways. In particular, in this chapter, the basic laws of large numbers require only the weaker assumption and hence are more general than they would be if independence were required. To define the
class of uncorrelated processes, it is convenient to introduce the notions of autocorrelation functions and covariance functions of random processes.
Given a random process $\{X_t;\ t \in \mathcal{T}\}$, the autocorrelation function $R_X(t, s);\ t, s \in \mathcal{T}$ is defined by
$$R_X(t, s) = E(X_t X_s);\quad \text{all } t, s \in \mathcal{T}.$$
The autocovariance function or simply the covariance function $K_X(t, s);\ t, s \in \mathcal{T}$ is defined by
$$K_X(t, s) = \mathrm{COV}(X_t, X_s).$$
Observe that (4.19) relates the two functions by
$$K_X(t, s) = R_X(t, s) - (EX_t)(EX_s). \qquad (4.94)$$
Thus the autocorrelation and covariance functions are equal if the process has zero mean, that is, if $EX_t = 0$ for all t. The covariance function of a process $\{X_t\}$ can be viewed as the autocorrelation function of the process $\{X_t - EX_t\}$ formed by removing the mean from the given process to form a new process having zero mean.
The autocorrelation function of a random process is given by the correlation of all possible pairs of samples; the covariance function is the covariance of all possible pairs of samples. Both functions provide a measure of how dependent the samples are and will be seen to play a crucial role in laws of large numbers. Note that both definitions are valid for random processes in either discrete time or continuous time and having either a discrete alphabet or a continuous alphabet.
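As an illustration (a sketch under assumed simulation parameters, not from the text), the autocorrelation and covariance functions of a discrete-time process can be estimated by averaging over many independent realizations:

```python
import numpy as np

rng = np.random.default_rng(4)

n_real, T = 50_000, 8            # number of realizations, samples per realization

# Assumed example process: X_t = 1 + W_t + 0.5*W_{t-1}, with W_t iid N(0,1).
W = rng.normal(size=(n_real, T + 1))
X = 1.0 + W[:, 1:] + 0.5 * W[:, :-1]

# R_X(t, s) = E[X_t X_s], estimated by averaging over realizations.
R = (X.T @ X) / n_real

# K_X(t, s) = R_X(t, s) - (E X_t)(E X_s), per (4.94).
m = X.mean(axis=0)
K = R - np.outer(m, m)

print("K_X(t, t)   ≈", np.round(np.diag(K), 2))        # ≈ 1.25 for this model
print("K_X(t, t+1) ≈", np.round(np.diag(K, k=1), 2))   # ≈ 0.5
print("K_X(t, t+2) ≈", np.round(np.diag(K, k=2), 2))   # ≈ 0.0
```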
In terms of the correlation function, a random process $\{X_t;\ t \in \mathcal{T}\}$ is said to be uncorrelated if
$$R_X(t, s) = \begin{cases} E(X_t^2) & \text{if } t = s \\ EX_t\,EX_s & \text{if } t \neq s \end{cases}$$
or, equivalently, if
$$K_X(t, s) = \begin{cases} \sigma_{X_t}^2 & \text{if } t = s \\ 0 & \text{if } t \neq s. \end{cases}$$
The reader should not overlook the obvious fact that if a process is iid or uncorrelated, the random variables are independent or uncorrelated only if taken at different times. That is, $X_t$ and $X_s$ will not be independent or uncorrelated when t = s, only when t ≠ s (except, of course, in such trivial cases as that where $\{X_t\} = \{a_t\}$, a sequence of constants, where $E(X_t X_t) = a_t a_t = EX_t EX_t$ and hence $X_t$ is uncorrelated with itself).
Gaussian Processes Revisited
Recall from chapter 3 that a Gaussian random process $\{X_t;\ t \in \mathcal{T}\}$ is completely described by a mean function $\{m_t;\ t \in \mathcal{T}\}$ and a covariance function $\{\Lambda(t, s);\ t, s \in \mathcal{T}\}$. As one might suspect, the names of these functions come from the fact that they are indeed the mean and covariance functions as defined in terms of expectations, i.e.,
$$m_t = EX_t \qquad (4.95)$$
$$\Lambda(t, s) = K_X(t, s). \qquad (4.96)$$
The result for the mean follows immediately from our computation of the mean of a Gaussian $N(m, \sigma^2)$ random variable. The result for the covariance can be derived by brute force integration (not too bad if the integrator is well versed in matrix transformations of multidimensional integrals) or looked up in tables somewhere. The computation is tedious and we will simply state the result without proof. The multidimensional characteristic functions to be introduced later can be used to give a relatively simple proof, but again it is not worth the effort to fill in the details.
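To make the roles of $m_t$ and $K_X(t, s)$ concrete, here is a brief sketch (not from the text; the particular mean and covariance functions are assumptions) that draws samples of a Gaussian process at a finite set of times and checks the sample mean and covariance:

```python
import numpy as np

rng = np.random.default_rng(5)

# Sample times and assumed mean/covariance functions.
t = np.arange(6)
m = 0.5 * t                                           # mean function m_t
K = np.exp(-0.5 * np.abs(t[:, None] - t[None, :]))    # covariance K_X(t, s)

# A Gaussian process restricted to finitely many times is just a Gaussian vector.
samples = rng.multivariate_normal(mean=m, cov=K, size=100_000)

print("sample mean ≈", np.round(samples.mean(axis=0), 2))
print("sample covariance ≈")
print(np.round(np.cov(samples, rowvar=False), 2))
```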
A more important issue is the properties that were required for a covariance function when the Gaussian process was defined. Recall that it was required that the covariance function of the process be symmetric, i.e., $K_X(t, s) = K_X(s, t)$, and positive definite, i.e., given any positive integer k, any collection of sample times $\{t_0, \ldots, t_{k-1}\}$, and any k real numbers $a_i$; $i = 0, \ldots, k - 1$ (not all 0), then
$$\sum_{i=0}^{k-1}\sum_{l=0}^{k-1} a_i a_l K_X(t_i, t_l) > 0. \qquad (4.97)$$
We now return to these conditions to see if they are indeed necessary conditions for all covariance functions, Gaussian or not.
Symmetry is easy. It immediately follows from the definitions that
$$K_X(t, s) = E[(X_t - EX_t)(X_s - EX_s)]$$
$$= E[(X_s - EX_s)(X_t - EX_t)]$$
$$= K_X(s, t) \qquad (4.98)$$
and hence clearly all covariance functions are symmetric, and so are covariance matrices formed by sampling covariance functions. To see that
positive definiteness is indeed almost a requirement, consider the fact that
$$\sum_{i=0}^{k-1}\sum_{l=0}^{k-1} a_i a_l K_X(t_i, t_l) = \sum_{i=0}^{k-1}\sum_{l=0}^{k-1} a_i a_l E[(X_{t_i} - EX_{t_i})(X_{t_l} - EX_{t_l})]$$
$$= E\left[\sum_{i=0}^{k-1}\sum_{l=0}^{k-1} a_i a_l (X_{t_i} - EX_{t_i})(X_{t_l} - EX_{t_l})\right]$$
$$= E\left[\left|\sum_{i=0}^{k-1} a_i (X_{t_i} - EX_{t_i})\right|^2\right]$$
$$\geq 0. \qquad (4.99)$$
Thus any covariance function $K_X$ must at least be nonnegative definite, which implies that any covariance matrix formed by sampling the covariance function must also be nonnegative definite. Thus nonnegative definiteness is necessary for a covariance function, and our requirement for a Gaussian process was only slightly stronger than what was needed. We will later see how to define a Gaussian process when the covariance function is only nonnegative definite and not necessarily positive definite.
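A numerical illustration of this necessity (an assumed example, not from the text): any covariance matrix, whether sampled from a covariance function or estimated from data, should be symmetric with no negative eigenvalues (up to rounding error).

```python
import numpy as np

rng = np.random.default_rng(6)

# Estimate a covariance matrix from realizations of a 5-dimensional vector.
Z = rng.normal(size=(100_000, 5))
X = Z @ rng.normal(size=(5, 5))      # mix the components so they are correlated
K = np.cov(X, rowvar=False)          # sample covariance matrix

# Symmetry and nonnegative definiteness.
print("symmetric:", np.allclose(K, K.T))
print("eigenvalues:", np.round(np.linalg.eigvalsh(K), 3))  # all >= 0 within roundoff
```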
A slight variation on the above argument shows that if $X = (X_0, \ldots, X_{k-1})^t$ is any random vector, then the covariance matrix $\Lambda = \{\lambda_{i,l};\ i, l \in \mathcal{Z}_k\}$ defined by $\lambda_{i,l} = E[(X_i - EX_i)(X_l - EX_l)]$ must also be symmetric and nonnegative definite. This was the reason for assuming that the covariance matrix for a Gaussian random vector had at least these properties.
We make two important observations before proceeding. First, remember that the four basic properties of expectation have nothing to do with independence. In particular, whether or not the random variables involved are independent or uncorrelated, one can always interchange the expectation operation and the summation operation (property 3), because expectation is linear. On the other hand, one cannot interchange the expectation operation with the product operation (this is not a property of expectation) unless the random variables involved are uncorrelated, e.g., when they are independent. Second, an iid process is also a discrete time uncorrelated random process with identical marginal distributions. The converse statement is not true in general; that is, the notion of an uncorrelated process is more general than that of an iid process. Correlation measures only a weak pairwise degree of independence. A random process could even be pairwise independent (and hence uncorrelated) but still not be iid (problem 4.28).
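To illustrate the last point numerically, the following sketch (my own construction for illustration, not the process of problem 4.28) builds a zero-mean, identically distributed process that is uncorrelated but clearly not iid, since neighboring samples share a common factor:

```python
import numpy as np

rng = np.random.default_rng(7)

n = 1_000_000
Z = rng.normal(size=n + 1)
X = Z[:-1] * Z[1:]        # X_n = Z_n * Z_{n+1}: zero mean, identically distributed

# Adjacent samples are uncorrelated ...
print("corr(X_n, X_{n+1})     ≈", round(np.corrcoef(X[:-1], X[1:])[0, 1], 3))
# ... but not independent: their squares are correlated through the shared Z term.
print("corr(X_n^2, X_{n+1}^2) ≈", round(np.corrcoef(X[:-1] ** 2, X[1:] ** 2)[0, 1], 3))
```

For this construction the first correlation should be near 0 while the second is near 0.25, confirming that an uncorrelated process need not be iid.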