
and show that a is a k × k identity matrix, d is an m × m identity matrix, and that b and c contain all zeros, so that the right-hand matrix is indeed an identity matrix.
The conditional pdf for Y given X follows directly from the definitions
as
\[
f_{Y|X}(y|x) = \frac{f_{XY}(x,y)}{f_X(x)}
= \frac{(2\pi)^{-(k+m)/2}(\det K_U)^{-1/2}
\exp\!\left(-\frac{1}{2}\,\big((x-m_X)^t,\ (y-m_Y)^t\big)\,K_U^{-1}
\begin{pmatrix} x-m_X \\ y-m_Y \end{pmatrix}\right)}
{(2\pi)^{-k/2}(\det K_X)^{-1/2}
\exp\!\left(-\frac{1}{2}(x-m_X)^t K_X^{-1}(x-m_X)\right)}
\]
\[
= (2\pi)^{-m/2}\left(\frac{\det K_U}{\det K_X}\right)^{-1/2}
\exp\!\left\{-\frac{1}{2}\left[\big((x-m_X)^t,\ (y-m_Y)^t\big)\,K_U^{-1}
\begin{pmatrix} x-m_X \\ y-m_Y \end{pmatrix}
-(x-m_X)^t K_X^{-1}(x-m_X)\right]\right\}.
\]
Again using some brute force linear algebra, it can be shown that the quadratic terms in the exponential can be expressed in the form
\[
\big((x-m_X)^t,\ (y-m_Y)^t\big)\,K_U^{-1}
\begin{pmatrix} x-m_X \\ y-m_Y \end{pmatrix}
-(x-m_X)^t K_X^{-1}(x-m_X)
= \big(y-m_Y-K_{YX}K_X^{-1}(x-m_X)\big)^t K_{Y|X}^{-1}\,\big(y-m_Y-K_{YX}K_X^{-1}(x-m_X)\big).
\]
Defining
\[
m_{Y|x} = m_Y + K_{YX}K_X^{-1}(x-m_X) \tag{4.40}
\]
the conditional density simplifies to
\[
f_{Y|X}(y|x) = (2\pi)^{-m/2}\left(\frac{\det K_U}{\det K_X}\right)^{-1/2}
\exp\!\left\{-\frac{1}{2}(y-m_{Y|x})^t K_{Y|X}^{-1}(y-m_{Y|x})\right\}, \tag{4.41}
\]
which shows that conditioned on X = x, Y has a Gaussian density. This means that we can immediately recognize the conditional expectation of Y given X as
\[
E(Y|X = x) = m_{Y|x} = m_Y + K_{YX}K_X^{-1}(x-m_X), \tag{4.42}
\]
so that the conditional expectation is an affine function of the vector x. We can also infer from the form that KY |X is the (conditional) covariance
\[
K_{Y|X} = E[(Y - E(Y|X = x))(Y - E(Y|X = x))^t \,|\, x], \tag{4.43}
\]
which unlike the conditional mean does not depend on the vector x! Furthermore, since we know how the normalization must relate to the covariance matrix, we have that
\[
\det(K_{Y|X}) = \frac{\det(K_U)}{\det(K_X)}. \tag{4.44}
\]
These relations completely describe the conditional densities of one subvector of a Gaussian vector given another subvector. We shall see, however, that the importance of these results goes beyond the above evaluation and provides some fundamental results regarding optimal nonlinear estimation for Gaussian vectors and optimal linear estimation in general.
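Because (4.40)–(4.44) are just matrix computations, they are easy to check numerically. The following is a minimal sketch, assuming NumPy; the randomly generated covariance and the helper name `conditional_gaussian` are illustrative choices, not from the text, and the explicit form of K_{Y|X} used in the helper is the one given below in (4.50).

```python
# A minimal sketch of (4.40) and (4.44), assuming NumPy.  The randomly
# generated covariance and the helper name `conditional_gaussian` are
# illustrative, not from the text.
import numpy as np

def conditional_gaussian(m_X, m_Y, K_X, K_XY, K_YX, K_Y, x):
    """Return m_{Y|x} from (4.40) and the conditional covariance K_{Y|X}
    (using the explicit form K_Y - K_YX K_X^{-1} K_XY given below in (4.50))."""
    K_X_inv = np.linalg.inv(K_X)
    m_Y_given_x = m_Y + K_YX @ K_X_inv @ (x - m_X)
    K_Y_given_X = K_Y - K_YX @ K_X_inv @ K_XY
    return m_Y_given_x, K_Y_given_X

rng = np.random.default_rng(0)
k, m = 2, 2
B = rng.standard_normal((k + m, k + m))
K_U = B @ B.T + 0.5 * np.eye(k + m)          # covariance of U = (X^t, Y^t)^t
K_X, K_XY = K_U[:k, :k], K_U[:k, k:]
K_YX, K_Y = K_U[k:, :k], K_U[k:, k:]
m_X, m_Y = np.zeros(k), np.zeros(m)

m_cond, K_cond = conditional_gaussian(m_X, m_Y, K_X, K_XY, K_YX, K_Y,
                                      x=np.array([1.0, -0.5]))
# Check the determinant relation (4.44): det(K_{Y|X}) = det(K_U)/det(K_X).
print(np.isclose(np.linalg.det(K_cond),
                 np.linalg.det(K_U) / np.linalg.det(K_X)))   # True
```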
4.8 Expectation as Estimation
Suppose that one is asked to guess the value that a random variable Y will take on, knowing the distribution of the random variable. What is the best guess or estimate, say Ŷ? Obviously there are many ways to define a best estimate, but one of the most popular ways to define a cost or distortion resulting from estimating the “true” value of Y by Ŷ is to look at the expected value of the square of the error Y − Ŷ, that is, E[(Y − Ŷ)²], the so-called mean squared error or MSE. Many arguments have been advanced in support of this approach, perhaps the simplest being that if one views the error as a voltage, then the average squared error is the average energy in the error. The smaller the energy, the weaker the signal in some sense. Perhaps a more honest reason for the popularity of the measure is its tractability in a wide variety of problems: it often leads to nice solutions that indeed work well in practice. As an example, we show that the optimal estimate of the value of an unknown random variable is in fact the mean of the random variable, a result that is highly intuitive. Rather than use calculus to prove this result (a tedious approach requiring setting derivatives to zero and then looking at second derivatives to verify that the stationary point is indeed a minimum), we directly prove the global optimality of the result.
Suppose that our estimate is Ŷ = a, some constant. We will show that this estimate can never have mean squared error smaller than that resulting from using the expected value of Y as an estimate. This is accomplished by a simple sequence of equalities and inequalities. Begin by adding and subtracting the mean, expanding the square, and using the second and third properties of expectation as
E[(Y − a)2] = E[(Y − EY + EY − a)2]
= E[(Y − EY )2] + 2E[(Y − EY )(EY − a)] + (EY − a)2.
The cross product is evaluated using the linearity of expectation and the fact that EY is a constant as
E[(Y − EY )(EY − a)] = (EY )2 − aEY − (EY )2 + aEY = 0
and hence from Property 1 of expectation,
E[(Y − a)2] = E[(Y − EY )2] + (EY − a)2 ≥ E[(Y − EY )2], (4.45)
which is the mean squared error resulting from using the mean of Y as an estimate. Thus the mean of a random variable is the minimum mean squared error estimate (MMSE) of the value of a random variable in the absence of any a priori information.
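A short simulation makes the point concrete. The sketch below, assuming NumPy, sweeps over candidate constants a and confirms that the empirical MSE is smallest near a = EY; the exponential distribution used here is an arbitrary illustrative choice.

```python
# A small Monte Carlo check of (4.45), assuming NumPy: among constant guesses
# a, the mean EY gives the smallest mean squared error.  The exponential
# distribution is an arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(1)
Y = rng.exponential(scale=2.0, size=100_000)      # EY = 2

candidates = np.linspace(0.0, 4.0, 81)            # constant estimates to try
mse = [np.mean((Y - a) ** 2) for a in candidates]
best = candidates[int(np.argmin(mse))]
print(best, Y.mean())                             # both are close to 2
```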
What if one is given a priori information? For example, suppose that now you are told that X = x. What then is the best estimate of Y, say Ŷ(X)? This problem is easily solved by modifying the previous derivation to use conditional expectation, that is, by using the conditional distribution for Y given X instead of the a priori distribution for Y. Once again we try to minimize the mean squared error:
\[
E[(Y - \hat{Y}(X))^2] = E\big(E[(Y - \hat{Y}(X))^2 \,|\, X]\big)
= \sum_x p_X(x)\, E[(Y - \hat{Y}(X))^2 \,|\, x].
\]
Each of the terms in the sum, however, is just a mean squared error between a random variable and an estimate of that variable with respect to a distribution, here the conditional distribution pY |X(·|x). By the same argument as was used in the unconditional case, the best estimate is the mean, but now the mean with respect to the conditional distribution, i.e.,
E(Y|x). In other words, for each x the best Ŷ(x) in the sense of minimizing the mean squared error is E(Y|x). Plugging in the random variable X in place of the dummy variable x, we have the following interpretation:
The conditional expectation E(Y |X) of a random variable Y given a random variable X is the minimum mean squared estimate of Y given X.
A direct proof of this result without invoking the conditional version of the result for unconditional expectation follows from general iterated expectation. Suppose that g(X) is an estimate of Y given X. Then the resulting mean squared error is
\begin{align*}
E[(Y - g(X))^2] &= E[(Y - E(Y|X) + E(Y|X) - g(X))^2] \\
&= E[(Y - E(Y|X))^2] + 2E[(Y - E(Y|X))(E(Y|X) - g(X))] + E[(E(Y|X) - g(X))^2].
\end{align*}
Expanding the cross term yields
\[
E[(Y - E(Y|X))(E(Y|X) - g(X))]
= E[Y\,E(Y|X)] - E[Y g(X)] - E[E(Y|X)^2] + E[E(Y|X)g(X)].
\]
From the general iterated expectation (4.36), E[Y E(Y |X)] = E[E(Y |X)2] (setting g(X) of the lemma to E(Y |X) and h(X, Y ) = Y ) and E[Y g(X)] = E[E(Y |X)g(X)] (setting g(X) of the lemma to the g(X) used here and h(X, Y ) = Y ). Hence the cross term is zero and E[(Y − g(X))2] ≥ E[(Y − E(Y |X))2], with equality when g(X) = E(Y |X).
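The same conclusion is easy to check by simulation. In the sketch below (assuming NumPy), the model Y = X² + noise and the competing estimator g(X) = 2X are arbitrary illustrative choices; the empirical conditional mean yields the smaller mean squared error.

```python
# A tiny simulation (assuming NumPy) that the conditional mean beats another
# function of X in mean squared error.  The model Y = X^2 + noise and the
# competitor g(X) = 2X are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(5)
X = rng.integers(0, 3, size=100_000)               # X uniform on {0, 1, 2}
Y = X ** 2 + rng.normal(scale=1.0, size=X.size)    # so E(Y|X = x) = x^2

cond_mean = np.array([Y[X == x].mean() for x in range(3)])   # approx (0, 1, 4)
mse_cond = np.mean((Y - cond_mean[X]) ** 2)        # MSE of E(Y|X)
mse_other = np.mean((Y - 2.0 * X) ** 2)            # MSE of g(X) = 2X
print(mse_cond, mse_other)                         # mse_cond is smaller
```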
As with ordinary expectation, the ideas of conditional expectation can be extended to continuous random variables by substituting conditional pdf’s for the unconditional pdf’s. As is the case with conditional probability, however, this constructive definition has its limitations and only makes sense when the pdf’s are well defined. The rigorous development of conditional expectation is, like conditional probability, analogous to the rigorous treatment of the Dirac delta: it is defined by its behavior underneath the integral sign rather than by a construction. When the constructive definition makes sense, the two approaches agree.
One of the unfortunately rare examples for which conditional expectations can be explicitly evaluated is the case of jointly Gaussian random
variables. In this case we can immediately identify from (3.61) that
\[
E[Y|X] = m_Y + \rho\,(\sigma_Y/\sigma_X)(X - m_X). \tag{4.46}
\]
It will prove important that this is in fact an affine function of X.
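A quick numerical check of (4.46), assuming NumPy, is given below; the means, standard deviations, and correlation coefficient are arbitrary illustrative values.

```python
# A quick numerical check of (4.46), assuming NumPy.  The means, standard
# deviations, and correlation coefficient are arbitrary illustrative values.
import numpy as np

rng = np.random.default_rng(2)
m_X, m_Y, s_X, s_Y, rho = 1.0, -2.0, 2.0, 0.5, 0.7
cov = [[s_X ** 2, rho * s_X * s_Y],
       [rho * s_X * s_Y, s_Y ** 2]]
X, Y = rng.multivariate_normal([m_X, m_Y], cov, size=200_000).T

x0 = 2.5                                     # condition on X being near x0
mask = np.abs(X - x0) < 0.05
print(Y[mask].mean())                        # empirical conditional mean
print(m_Y + rho * (s_Y / s_X) * (x0 - m_X))  # affine formula (4.46)
```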
The same ideas extend from scalars to vectors. Suppose we observe a real-valued column vector X = (X0, · · · , Xk−1)t and we wish to predict or estimate a second random vector Y = (Y0, · · · , Ym−1)t. Note that the dimensions of the two vectors need not be the same.
The prediction Ŷ = Ŷ(X) is to be chosen as a function of X which yields the smallest possible mean squared error, as in the scalar case. The mean squared error is defined as
\[
\epsilon^2(\hat{Y}) \stackrel{\Delta}{=} E[\|Y - \hat{Y}\|^2]
= E[(Y - \hat{Y})^t (Y - \hat{Y})]
= \sum_{i=0}^{m-1} E[(Y_i - \hat{Y}_i)^2]. \tag{4.47}
\]
An estimator or predictor is said to be optimal within some class of predictors if it minimizes the mean squared error over all predictors in the given class.
Two specific examples of vector estimation are of particular interest. In the first case, the vector X consists of k consecutive samples from a
stationary random process, say X = (Xn−1, Xn−2, . . . , Xn−k) and Y is the next, or “future”, sample Y = Xn. In this case the goal is to find the best one-step predictor given the finite past. In the second example, Y is a rectangular subblock of pixels in a sampled image intensity raster and X consists of similar subgroups above and to the left of Y . Here the goal is to use portions of an image already coded or processed to predict a new portion of the same image. This vector prediction problem is depicted in Figure 4.1 where subblocks A, B, and C would be used to predict subblock D.
[Figure 4.1: Vector Prediction of Image Subblocks. Subblocks A and B lie above subblocks C and D; A, B, and C are used to predict D.]
The following theorem shows that the best nonlinear predictor of Y given X is simply the conditional expectation of Y given X. Intuitively, our best guess of an unknown vector is its expectation or mean given whatever observations that we have. This extends the interpretation of a conditional expectation as an optimal estimator to the vector case.
Theorem 4.5 Given two random vectors Y and X, the minimum mean squared error estimate of Y given X is
\[
\hat{Y}(X) = E(Y|X). \tag{4.48}
\]
Proof: As in the scalar case, the proof does not require calculus or Lagrange minimizations. Suppose that Ŷ is the claimed optimal estimate and that Ỹ is some other estimate. We will show that Ỹ must yield a mean squared error no smaller than does Ŷ. To see this consider
\begin{align*}
\epsilon^2(\tilde{Y}) = E[\|Y - \tilde{Y}\|^2]
&= E[\|Y - \hat{Y} + \hat{Y} - \tilde{Y}\|^2] \\
&= E[\|Y - \hat{Y}\|^2] + E[\|\hat{Y} - \tilde{Y}\|^2] + 2E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})] \\
&\geq \epsilon^2(\hat{Y}) + 2E[(Y - \hat{Y})^t(\hat{Y} - \tilde{Y})].
\end{align*}
We will prove that the rightmost term is zero and hence that ε²(Ỹ) ≥ ε²(Ŷ), which will prove the theorem. Recall that Ŷ = E(Y|X) and hence E[(Y − Ŷ)|X] = 0.
|||||
ˆ |
˜ |
|
|
|
|
|
|
Since Y |
− Y is a deterministic function of X, |
|
|
||||
|
|
|
ˆ t |
ˆ |
˜ |
|
|
|
E[(Y − Y ) |
(Y |
− Y )|X] = 0. |
|
|
||
Then, by iterated expectation applied to vectors, we have |
|||||||
|
ˆ t |
ˆ |
˜ |
|
ˆ t |
ˆ |
˜ |
|
E(E[(Y − Y ) |
(Y |
− Y )|X]) = E[(Y − Y ) |
(Y |
− Y )] = 0 |
as claimed, which proves the theorem.
As in the scalar case, the conditional expectation is in general a difficult function to evaluate, with the notable exception of jointly Gaussian vectors. Recall from (4.41)–(4.44) that the conditional pdf for jointly Gaussian vectors Y and X with $K_{(X,Y)} = E[((X^t, Y^t) - (m_X^t, m_Y^t))^t((X^t, Y^t) - (m_X^t, m_Y^t))]$, $K_Y = E[(Y - m_Y)(Y - m_Y)^t]$, $K_X = E[(X - m_X)(X - m_X)^t]$, $K_{XY} = E[(X - m_X)(Y - m_Y)^t]$, and $K_{YX} = E[(Y - m_Y)(X - m_X)^t]$ is
\[
f_{Y|X}(y|x) = (2\pi)^{-m/2} (\det(K_{Y|X}))^{-1/2}
\exp\!\left\{-\frac{1}{2}(y - m_{Y|x})^t K_{Y|X}^{-1}(y - m_{Y|x})\right\}, \tag{4.49}
\]
where
\[
K_{Y|X} \stackrel{\Delta}{=} K_Y - K_{YX}K_X^{-1}K_{XY}
= E[(Y - E(Y|X))(Y - E(Y|X))^t \,|\, X], \tag{4.50}
\]
\[
\det(K_{Y|X}) = \frac{\det(K_{(Y,X)})}{\det(K_X)}, \tag{4.51}
\]
and
\[
E(Y|X = x) = m_{Y|x} = m_Y + K_{YX}K_X^{-1}(x - m_X), \tag{4.52}
\]
and hence the minimum mean squared estimate of Y given X is
\[
E(Y|X) = m_Y + K_{YX}K_X^{-1}(X - m_X), \tag{4.53}
\]
which is an affine (linear plus constant) function of X! The resulting mean squared error is (using iterated expectation)
\begin{align*}
E[(Y - E(Y|X))^t(Y - E(Y|X))]
&= E\big(E[(Y - E(Y|X))^t(Y - E(Y|X)) \,|\, X]\big) \tag{4.54} \\
&= E\big(E[\,\mathrm{Tr}[(Y - E(Y|X))(Y - E(Y|X))^t]\,|\, X]\big) \\
&= \mathrm{Tr}(K_{Y|X}). \tag{4.55}
\end{align*}
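The estimator (4.53) and the error expression (4.55) can be verified by simulation. The sketch below, assuming NumPy, draws jointly Gaussian vectors with an arbitrary randomly generated covariance and compares the empirical mean squared error of the affine estimator with Tr(K_{Y|X}).

```python
# A sketch (assuming NumPy) checking the estimator (4.53) and the error
# formula (4.55) by simulation.  The randomly generated covariance is an
# arbitrary illustrative choice.
import numpy as np

rng = np.random.default_rng(3)
k, m = 3, 2
B = rng.standard_normal((k + m, k + m))
K_U = B @ B.T + 0.5 * np.eye(k + m)            # covariance of (X^t, Y^t)^t
K_X, K_XY = K_U[:k, :k], K_U[:k, k:]
K_YX, K_Y = K_U[k:, :k], K_U[k:, k:]

U = rng.multivariate_normal(np.zeros(k + m), K_U, size=200_000)
X, Y = U[:, :k], U[:, k:]

A = K_YX @ np.linalg.inv(K_X)                  # A = K_YX K_X^{-1}
Y_hat = X @ A.T                                # E(Y|X) for zero means, (4.53)
empirical_mse = np.mean(np.sum((Y - Y_hat) ** 2, axis=1))
K_Y_given_X = K_Y - K_YX @ np.linalg.inv(K_X) @ K_XY
print(empirical_mse, np.trace(K_Y_given_X))    # approximately equal, (4.55)
```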
In the special case where $X = X^n = (X_0, X_1, \ldots, X_{n-1})$ and $Y = X_n$, the so-called one-step linear prediction problem, the solution takes an interesting form. For this case define the nth order covariance matrix as the n × n matrix
\[
K_X^{(n)} = E[(X^n - E(X^n))(X^n - E(X^n))^t], \tag{4.56}
\]
i.e., the $(k, j)$ entry of $K_X^{(n)}$ is $E[(X_k - E(X_k))(X_j - E(X_j))]$, $k, j = 0, 1, \ldots, n-1$. Then if $X^{n+1}$ is Gaussian, the optimal one-step predictor for $X_n$ given $X^n$ is
\[
\hat{X}_n(X^n) = E(X_n) + E[(X_n - E(X_n))(X^n - E(X^n))^t]\,(K_X^{(n)})^{-1}(X^n - E(X^n)), \tag{4.57}
\]
which has an affine form
\[
\hat{X}_n(X^n) = A X^n + b, \tag{4.58}
\]
where
\[
A = r^t (K_X^{(n)})^{-1}, \qquad
r = \begin{pmatrix} K_X(n,0) \\ K_X(n,1) \\ \vdots \\ K_X(n, n-1) \end{pmatrix}, \tag{4.59}
\]
and
\[
b = E(X_n) - A\,E(X^n). \tag{4.60}
\]
The resulting mean squared error is
\[
\mathrm{MMSE} = E[(X_n - \hat{X}_n(X^n))^2]
= \mathrm{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY})
= \sigma_{X_n}^2 - r^t (K_X^{(n)})^{-1} r \tag{4.61}
\]
or
\[
\mathrm{MMSE} = E[(X_n - \hat{X}_n(X^n))^2] = \sigma^2_{X_n|X^n}, \tag{4.62}
\]
which from (4.51) can be expressed as
\[
\mathrm{MMSE} = \frac{\det(K_X^{(n+1)})}{\det(K_X^{(n)})}, \tag{4.63}
\]
a classical result from minimum mean squared error estimation theory.
If the Xn are samples of a weakly stationary random process with zero mean, then this simplifies to
\[
\hat{X}_n(X^n) = r^t (K_X^{(n)})^{-1} X^n, \tag{4.64}
\]
where $r$ is the $n$-dimensional vector
\[
r = \begin{pmatrix} K_X(n) \\ K_X(n-1) \\ \vdots \\ K_X(1) \end{pmatrix}. \tag{4.65}
\]
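As an illustration of (4.64)–(4.65) and of the error expressions (4.61) and (4.63), the sketch below, assuming NumPy, uses the covariance function K_X(k) = σ_X² a^{|k|} of a first-order autoregression; this particular process is an illustrative choice, not an example from the text. The recovered predictor weights put essentially all of their mass on the most recent sample, as expected for such a process.

```python
# A sketch (assuming NumPy) of the one-step predictor (4.64)-(4.65).  The
# covariance function K_X(k) = sigma2 * a**|k| of a first-order autoregression
# is an illustrative choice, not an example from the text.
import numpy as np

a, sigma2, n = 0.8, 1.0, 5
def K_X(k):                                    # covariance function of the process
    return sigma2 * a ** abs(k)

# n-th order covariance matrix K_X^(n) and the vector r of (4.65).
Kn = np.array([[K_X(i - j) for j in range(n)] for i in range(n)])
r = np.array([K_X(n - i) for i in range(n)])   # (K_X(n), ..., K_X(1))^t

weights = r @ np.linalg.inv(Kn)                # predictor weights r^t (K_X^(n))^{-1}
print(weights)                                 # approx (0, 0, 0, 0, 0.8): a * X_{n-1}

# One-step prediction error two ways: (4.61) and the determinant ratio (4.63).
Kn1 = np.array([[K_X(i - j) for j in range(n + 1)] for i in range(n + 1)])
print(sigma2 - r @ np.linalg.inv(Kn) @ r)      # sigma_X^2 - r^t (K_X^(n))^{-1} r
print(np.linalg.det(Kn1) / np.linalg.det(Kn))  # determinant ratio
```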
4.9 Implications for Linear Estimation
The development of optimal mean squared estimation for the Gaussian case provides a preview of and an approach to the problem of optimal mean squared estimation for the situation of completely general random vectors (not necessarily Gaussian) where only linear or affine estimators are allowed (to avoid the problem of possibly intractable conditional expectations in the non-Gaussian case). This topic will be developed in some detail in a later section, but the key results will here be shown to follow directly from the Gaussian case by reinterpreting the results.
The key fact is that the optimal estimator for a vector Y given a vector X when the two are jointly Gaussian was found to be an affine estimator, that is, to have the form
\[
\hat{Y}(X) = AX + b.
\]
Since it was found that the lowest possible MSE over all possible estimators was achieved by an estimator of this form with $A = K_{YX}K_X^{-1}$ and $b = E(Y) - A\,E(X)$, with a resulting MSE of $\mathrm{MMSE} = \mathrm{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY})$, it is obviously true that this MMSE must be the minimum achievable MSE over all affine estimators, i.e., that for all $m \times k$ matrices $A$ and $m$-dimensional vectors $b$ it is true that
\[
\mathrm{MMSE}(A,b) = \mathrm{Tr}\!\left(E\!\left[(Y - AX - b)(Y - AX - b)^t\right]\right)
\geq \mathrm{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY}) \tag{4.66}
\]
and that equality holds if and only if $A = K_{YX}K_X^{-1}$ and $b = E(Y) - A\,E(X)$. We shall now see that this version of the result has nothing to do with Gaussianity and that the inequality and solution are true for any distribution (provided, of course, that $K_X$ is invertible).
Expanding the MSE and using some linear algebra results in
\begin{align*}
\mathrm{MMSE}(A,b) &= \mathrm{Tr}\!\left(E\!\left[(Y - AX - b)(Y - AX - b)^t\right]\right)\\
&= \mathrm{Tr}\Big(E\Big[\big((Y - m_Y) - A(X - m_X) - (b - m_Y + A m_X)\big)\\
&\qquad\qquad\times\big((Y - m_Y) - A(X - m_X) - (b - m_Y + A m_X)\big)^t\Big]\Big)\\
&= \mathrm{Tr}\!\left(K_Y - A K_{XY} - K_{YX}A^t + A K_X A^t\right)
+ (b - m_Y + A m_X)^t (b - m_Y + A m_X),
\end{align*}
where all the remaining cross terms are zero. Regardless of $A$, the final term is nonnegative and hence is bounded below by 0, a minimum achieved by the choice
\[
b = m_Y - A m_X. \tag{4.67}
\]
Thus the inequality we wish to prove becomes
\[
\mathrm{Tr}\!\left(K_Y - A K_{XY} - K_{YX}A^t + A K_X A^t\right)
\geq \mathrm{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY}) \tag{4.68}
\]
or
\[
\mathrm{Tr}\!\left(K_{YX}K_X^{-1}K_{XY} + A K_X A^t - A K_{XY} - K_{YX}A^t\right) \geq 0. \tag{4.69}
\]
Since $K_X$ is a covariance matrix it is Hermitian, and since it has an inverse it must be positive definite. Hence it has a well-defined square root $K_X^{1/2}$ (see Section A.4), and the left-hand side of (4.69) can be written as
\[
\mathrm{Tr}\!\left((A K_X^{1/2} - K_{YX}K_X^{-1/2})(A K_X^{1/2} - K_{YX}K_X^{-1/2})^t\right) \tag{4.70}
\]
(just expand this expression to verify that it is the same as the previous expression). But this has the form $\mathrm{Tr}(BB^t) = \sum_{i,j} b_{i,j}^2$, which is nonnegative, proving the inequality. Plugging in $A = K_{YX}K_X^{-1}$ achieves the lower bound with equality.
We summarize the result in the following theorem.
Theorem 4.6 Given random vectors $X$ and $Y$ with $K_{(X,Y)} = E[((X^t, Y^t) - (m_X^t, m_Y^t))^t((X^t, Y^t) - (m_X^t, m_Y^t))]$, $K_Y = E[(Y - m_Y)(Y - m_Y)^t]$, $K_X = E[(X - m_X)(X - m_X)^t]$, $K_{XY} = E[(X - m_X)(Y - m_Y)^t]$, and $K_{YX} = E[(Y - m_Y)(X - m_X)^t]$, assume that $K_X$ is invertible (e.g., it is positive definite). Then
\[
\min_{A,b} \mathrm{MMSE}(A,b)
= \min_{A,b} \mathrm{Tr}\!\left(E\!\left[(Y - AX - b)(Y - AX - b)^t\right]\right)
= \mathrm{Tr}(K_Y - K_{YX}K_X^{-1}K_{XY}) \tag{4.71}
\]
and the minimum is achieved by $A = K_{YX}K_X^{-1}$ and $b = E(Y) - A\,E(X)$.
In particular, this result does not require that the vectors be jointly Gaussian.
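To emphasize the point, the sketch below (assuming NumPy) applies Theorem 4.6 to data that are decidedly non-Gaussian and nonlinearly related; the data-generating model is an arbitrary illustrative choice. The affine estimator built from sample means and covariances achieves a mean squared error close to the trace expression in (4.71).

```python
# A sketch (assuming NumPy) of Theorem 4.6 on non-Gaussian data: the best
# affine estimator is built from means and covariances alone.  The
# data-generating model below is arbitrary.
import numpy as np

rng = np.random.default_rng(4)
N, k, m = 200_000, 3, 2
X = rng.uniform(-1.0, 1.0, size=(N, k))              # not Gaussian
W = rng.laplace(scale=0.3, size=(N, m))              # not Gaussian
C = rng.standard_normal((m, k))
Y = np.tanh(X @ C.T) + W                             # nonlinear dependence on X

m_X, m_Y = X.mean(axis=0), Y.mean(axis=0)
K_X = np.cov(X, rowvar=False)
K_YX = (Y - m_Y).T @ (X - m_X) / N                   # sample cross-covariance
K_Y = np.cov(Y, rowvar=False)

A = K_YX @ np.linalg.inv(K_X)                        # A = K_YX K_X^{-1}
b = m_Y - A @ m_X                                    # b = E(Y) - A E(X)
Y_hat = X @ A.T + b

print(np.mean(np.sum((Y - Y_hat) ** 2, axis=1)))     # achieved mean squared error
print(np.trace(K_Y - K_YX @ np.linalg.inv(K_X) @ K_YX.T))  # trace bound of (4.71)
```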
As in the Gaussian case, the results can be specialized to the situation where $Y = X_n$ and $X = X^n$ and $\{X_n\}$ is a weakly stationary process to obtain that the optimal linear estimator of $X_n$ given $(X_0, \ldots, X_{n-1})$ in the sense of minimizing the mean squared error is
\[
\hat{X}_n(X^n) = r^t (K_X^{(n)})^{-1} X^n, \tag{4.72}
\]
where $r$ is the $n$-dimensional vector
\[
r = \begin{pmatrix} K_X(n) \\ K_X(n-1) \\ \vdots \\ K_X(1) \end{pmatrix}. \tag{4.73}
\]
The resulting minimum mean squared error (called the “linear least squares error”) is
\begin{align*}
\mathrm{LLSE} &= \sigma_X^2 - r^t (K_X^{(n)})^{-1} r \tag{4.74}\\
&= \frac{\det(K_X^{(n+1)})}{\det(K_X^{(n)})}, \tag{4.75}
\end{align*}
a classical result of linear estimation theory. Note that the equation with the determinant form does not require a Gaussian density, although a Gaussian density was used to identify the first form with the determinant form (both being $\sigma^2_{X_n|X^n}$ in the Gaussian case).
4.10 Correlation and Linear Estimation
As an example of the application of correlations, we consider a constrained form of the minimum mean squared error estimation problem that provided an application and interpretation for conditional expectation. A problem with the earlier result is that in some applications the conditional expectation will be complicated or unknown, but the simpler correlation might be known or at least one can approximate it based on observed data. While the conditional expectation provides the optimal estimator over all possible estimators, the correlation turns out to provide an optimal estimator over a restricted class of estimators.
Suppose again that the value of X is observed and that a good estimate of Y, say Ŷ(X), is desired. Once again the quality of an estimator will be measured by the resulting mean squared error, but this time we do not