Advanced Digital Signal Processing and Noise Reduction (Saeed V. Vaseghi), Chapter 4: Bayesian Estimation
[Figure 4.10: two conditional pdfs fY|Θ(y|θ) plotted against θ. For the symmetric pdf the mode, median and mean coincide, so the MAP, ML, MAVE and MMSE estimates are identical; for the asymmetric pdf the mode (MAP), median (MAVE) and mean (MMSE) are distinct.]
Figure 4.10 Illustration of a symmetric and an asymmetric pdf and their respective mode, mean and median and the relations to MAP, MAVE and MMSE estimates.
4.2.6 The Influence of the Prior on Estimation Bias and Variance
The use of a prior pdf introduces a bias in the estimate towards the range of parameter values with a relatively high prior pdf, and reduces the variance of the estimate. To illustrate the effects of the prior pdf on the bias and the variance of an estimate, we consider the following examples in which the bias and the variance of the ML and the MAP estimates of the mean of a process are compared.
Example 4.6 Consider the ML estimation of a random scalar parameter θ, observed in a zero-mean additive white Gaussian noise (AWGN) n(m), and expressed as
$$ y(m) = \theta + n(m), \qquad m = 0, \ldots, N-1 \tag{4.55} $$
It is assumed that, for each realisation of the parameter θ, N observation samples are available. Note that, since the noise is assumed to be a zero-mean process, this problem is equivalent to estimation of the mean of the process y(m). The likelihood of an observation vector y=[y(0), y(1), …, y(N–1)] and a parameter value of θ is given by
$$ f_{Y|\Theta}(\mathbf{y}\,|\,\theta) = \prod_{m=0}^{N-1} f_N\big(y(m)-\theta\big) = \frac{1}{(2\pi\sigma_n^2)^{N/2}} \exp\!\left(-\frac{1}{2\sigma_n^2}\sum_{m=0}^{N-1}\big[y(m)-\theta\big]^2\right) \tag{4.56} $$
From Equation (4.56) the log-likelihood function is given by
$$ \ln f_{Y|\Theta}(\mathbf{y}\,|\,\theta) = -\frac{N}{2}\ln(2\pi\sigma_n^2) - \frac{1}{2\sigma_n^2}\sum_{m=0}^{N-1}\big[y(m)-\theta\big]^2 \tag{4.57} $$

The ML estimate of θ, obtained by setting the derivative of $\ln f_{Y|\Theta}(\mathbf{y}\,|\,\theta)$ to zero, is given by
$$ \hat{\theta}_{ML} = \frac{1}{N}\sum_{m=0}^{N-1} y(m) = \overline{y} \tag{4.58} $$
where $\overline{y}$ denotes the time average of y(m). From Equation (4.56), we note that the ML solution is an unbiased estimate:
$$ E\big[\hat{\theta}_{ML}\big] = E\!\left[\frac{1}{N}\sum_{m=0}^{N-1}\big[\theta + n(m)\big]\right] = \theta \tag{4.59} $$

and the variance of the ML estimate is given by
$$ \mathrm{Var}\big[\hat{\theta}_{ML}\big] = E\big[(\hat{\theta}_{ML}-\theta)^2\big] = E\!\left[\left(\frac{1}{N}\sum_{m=0}^{N-1} y(m) - \theta\right)^{\!2}\right] = \frac{\sigma_n^2}{N} \tag{4.60} $$
Note that the variance of the ML estimate decreases with increasing length of observation.
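The behaviour of Equations (4.58) and (4.60) is easy to check numerically. The following is a minimal Monte Carlo sketch, not from the text; the parameter values (theta, sigma_n and the trial count) are illustrative choices:

```python
import numpy as np

# Monte Carlo check of Equations (4.58) and (4.60): the sample mean is the ML
# estimate of theta in AWGN, and its variance falls as sigma_n^2 / N.
# The parameter values below are illustrative, not from the text.
rng = np.random.default_rng(0)
theta, sigma_n = 2.0, 1.0

for N in (10, 100, 1000):
    # 20000 independent realisations of an N-sample observation vector y
    y = theta + sigma_n * rng.standard_normal((20000, N))
    theta_ml = y.mean(axis=1)                   # Equation (4.58)
    print(N, theta_ml.var(), sigma_n**2 / N)    # empirical vs. theoretical (4.60)
```

For each N the empirical variance of the estimate matches σn²/N closely, while its mean stays at θ, illustrating both the unbiasedness (4.59) and the 1/N decay of the variance (4.60).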
Example 4.7 Estimation of a uniformly-distributed parameter observed in AWGN. Consider the effects of using a uniform parameter prior on the mean and the variance of the estimate in Example 4.6. Assume that the prior for the parameter θ is given by
$$ f_\Theta(\theta) = \begin{cases} 1/(\theta_{max}-\theta_{min}), & \theta_{min} \le \theta \le \theta_{max} \\ 0, & \text{otherwise} \end{cases} \tag{4.61} $$
as illustrated in Figure 4.11. From Bayes’ rule, the posterior pdf is given by
[Figure 4.11: the likelihood fY|Θ(y|θ), peaked at the ML estimate θML; the uniform prior fΘ(θ) over [θmin, θmax]; and the resulting posterior fΘ|Y(θ|y), whose MAP and MMSE estimates are confined to [θmin, θmax].]
Figure 4.11 Illustration of the effects of a uniform prior.
$$ f_{\Theta|Y}(\theta\,|\,\mathbf{y}) = \frac{1}{f_Y(\mathbf{y})}\, f_{Y|\Theta}(\mathbf{y}\,|\,\theta)\, f_\Theta(\theta) = \begin{cases} \dfrac{1}{f_Y(\mathbf{y})}\,\dfrac{1}{\theta_{max}-\theta_{min}}\,\dfrac{1}{(2\pi\sigma_n^2)^{N/2}} \exp\!\left(-\dfrac{1}{2\sigma_n^2}\sum_{m=0}^{N-1}\big[y(m)-\theta\big]^2\right), & \theta_{min} \le \theta \le \theta_{max} \\[2ex] 0, & \text{otherwise} \end{cases} \tag{4.62} $$
The MAP estimate is obtained by maximising the posterior pdf:
$$ \hat{\theta}_{MAP}(\mathbf{y}) = \begin{cases} \theta_{min}, & \text{if } \hat{\theta}_{ML}(\mathbf{y}) < \theta_{min} \\ \hat{\theta}_{ML}(\mathbf{y}), & \text{if } \theta_{min} \le \hat{\theta}_{ML}(\mathbf{y}) \le \theta_{max} \\ \theta_{max}, & \text{if } \hat{\theta}_{ML}(\mathbf{y}) > \theta_{max} \end{cases} \tag{4.63} $$
Note that the MAP estimate is constrained to the range θmin to θmax. This constraint is desirable: it moderates estimates that, due, say, to a low signal-to-noise ratio, fall outside the range of possible values of θ. It is easy to see that the variance of an estimate constrained to the range θmin to θmax is less than the variance of the ML estimate, in which there is no constraint on the range of the parameter estimate:
$$ \mathrm{Var}\big[\hat{\theta}_{MAP}\big] = \int_{\theta_{min}}^{\theta_{max}} (\hat{\theta}_{MAP}-\theta)^2 f_{Y|\Theta}(\mathbf{y}\,|\,\theta)\, d\mathbf{y} \;\le\; \mathrm{Var}\big[\hat{\theta}_{ML}\big] = \int_{-\infty}^{\infty} (\hat{\theta}_{ML}-\theta)^2 f_{Y|\Theta}(\mathbf{y}\,|\,\theta)\, d\mathbf{y} \tag{4.64} $$
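The variance reduction promised by Equation (4.64) can be demonstrated with a short simulation. This is a sketch, not from the text; the support [theta_min, theta_max], the noise level and the observation length are illustrative:

```python
import numpy as np

# Sketch of Equation (4.63): with a uniform prior over [theta_min, theta_max],
# the MAP estimate is the ML estimate clipped to the prior's support, which
# can only reduce the estimation error (Equation (4.64)). Values are
# illustrative: a short, noisy observation so that clipping matters.
rng = np.random.default_rng(1)
theta_min, theta_max = -1.0, 1.0
theta, sigma_n, N = 0.5, 2.0, 5          # true parameter inside the support

y = theta + sigma_n * rng.standard_normal((50000, N))
theta_ml = y.mean(axis=1)                            # Equation (4.58)
theta_map = np.clip(theta_ml, theta_min, theta_max)  # Equation (4.63)

print(np.mean((theta_ml - theta)**2))    # ML mean-squared error
print(np.mean((theta_map - theta)**2))   # MAP mean-squared error (smaller)
```

At this low SNR the unconstrained ML estimate frequently lands outside [θmin, θmax], and clipping those excursions back to the support gives a visibly smaller mean-squared error.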
[Figure 4.12: the posterior fΘ|Y(θ|y)fY(y) formed as the product of the likelihood fY|Θ(y|θ), peaked at θML, and the Gaussian prior fΘ(θ), peaked at its mean µθ; the MAP estimate θMAP lies between the two peaks.]
Figure 4.12 Illustration of the posterior pdf as product of the likelihood and the prior.
Example 4.8 Estimation of a Gaussian-distributed parameter observed in AWGN. In this example, we consider the effect of a Gaussian prior on the mean and the variance of the MAP estimate. Assume that the parameter θ is Gaussian-distributed with a mean µθ and a variance σθ2 as
$$ f_\Theta(\theta) = \frac{1}{(2\pi\sigma_\theta^2)^{1/2}} \exp\!\left(-\frac{(\theta-\mu_\theta)^2}{2\sigma_\theta^2}\right) \tag{4.65} $$
From Bayes' rule, the posterior pdf is given by the product of the likelihood and the prior pdfs:
$$ f_{\Theta|Y}(\theta\,|\,\mathbf{y}) = \frac{1}{f_Y(\mathbf{y})}\, f_{Y|\Theta}(\mathbf{y}\,|\,\theta)\, f_\Theta(\theta) = \frac{1}{f_Y(\mathbf{y})}\, \frac{1}{(2\pi\sigma_n^2)^{N/2}}\, \frac{1}{(2\pi\sigma_\theta^2)^{1/2}} \exp\!\left(-\frac{1}{2\sigma_n^2}\sum_{m=0}^{N-1}\big[y(m)-\theta\big]^2 - \frac{(\theta-\mu_\theta)^2}{2\sigma_\theta^2}\right) \tag{4.66} $$

The maximum a posteriori solution is obtained by setting the derivative of the log-posterior function, $\ln f_{\Theta|Y}(\theta\,|\,\mathbf{y})$, with respect to θ to zero:
$$ \hat{\theta}_{MAP}(\mathbf{y}) = \frac{\sigma_\theta^2}{\sigma_\theta^2 + \sigma_n^2/N}\,\overline{y} + \frac{\sigma_n^2/N}{\sigma_\theta^2 + \sigma_n^2/N}\,\mu_\theta \tag{4.67} $$

where $\overline{y} = \frac{1}{N}\sum_{m=0}^{N-1} y(m)$.
Note that the MAP estimate is an interpolation between the ML estimate $\overline{y}$ and the mean of the prior pdf µθ, as shown in Figure 4.12. The expectation
[Figure 4.13: two panels showing the likelihood fY|Θ(y|θ) and the prior fΘ(θ) for observation lengths N1 and N2 >> N1; as N increases the likelihood narrows and θMAP moves from near the prior mean µθ towards θML.]
Figure 4.13 Illustration of the effect of increasing length of observation on the variance of an estimator.
of the MAP estimate is obtained by noting that the only random variable on the right-hand side of Equation (4.67) is the term $\overline{y}$, and that $E[\overline{y}] = \theta$:

$$ E\big[\hat{\theta}_{MAP}(\mathbf{y})\big] = \frac{\sigma_\theta^2}{\sigma_\theta^2+\sigma_n^2/N}\,\theta + \frac{\sigma_n^2/N}{\sigma_\theta^2+\sigma_n^2/N}\,\mu_\theta \tag{4.68} $$
and the variance of the MAP estimate is given by

$$ \mathrm{Var}\big[\hat{\theta}_{MAP}(\mathbf{y})\big] = \left(\frac{\sigma_\theta^2}{\sigma_\theta^2+\sigma_n^2/N}\right)^{\!2} \mathrm{Var}[\overline{y}] = \frac{\sigma_n^2/N}{\big(1+\sigma_n^2/(N\sigma_\theta^2)\big)^{2}} \tag{4.69} $$
Substitution of Equation (4.60) in Equation (4.69) yields

$$ \mathrm{Var}\big[\hat{\theta}_{MAP}(\mathbf{y})\big] = \frac{1}{\big(1+\mathrm{Var}[\hat{\theta}_{ML}(\mathbf{y})]/\sigma_\theta^2\big)^{2}}\, \mathrm{Var}\big[\hat{\theta}_{ML}(\mathbf{y})\big] \tag{4.70} $$
Note that as σθ², the variance of the parameter θ, increases, the influence of the prior decreases, and the variance of the MAP estimate tends towards the variance of the ML estimate.
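The bias and variance relations (4.67), (4.68) and (4.70) can be verified numerically. The following sketch is not from the text; all numerical values (mu_theta, sigma_theta, sigma_n, N, the fixed true θ) are illustrative assumptions:

```python
import numpy as np

# Numeric check of Equations (4.67)-(4.70): under a Gaussian prior the MAP
# estimate shrinks the sample mean towards mu_theta, which biases it (4.68)
# but cuts its variance relative to the ML estimate (4.70).
rng = np.random.default_rng(2)
mu_theta, sigma_theta, sigma_n, N = 0.0, 0.5, 1.0, 20
theta = 0.3                                   # a fixed true parameter value
trials = 200000

# theta_ml is the sample mean over N points; simulate it directly
theta_ml = theta + (sigma_n / np.sqrt(N)) * rng.standard_normal(trials)

w = sigma_theta**2 / (sigma_theta**2 + sigma_n**2 / N)
theta_map = w * theta_ml + (1 - w) * mu_theta           # Equation (4.67)

var_ml = sigma_n**2 / N                                 # Equation (4.60)
print(theta_map.var())                                  # empirical MAP variance
print(var_ml / (1 + var_ml / sigma_theta**2)**2)        # Equation (4.70)
```

The empirical variance of the MAP estimate matches (4.70), while its mean sits at the weighted point wθ + (1 − w)µθ of Equation (4.68): the prior has traded a small bias towards µθ for a reduced variance.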
4.2.7 The Relative Importance of the Prior and the Observation
A fundamental issue in the Bayesian inference method is the relative influence of the observation signal and the prior pdf on the outcome. The importance of the observation depends on the confidence in the observation, and the confidence in turn depends on the length of the observation and on
the signal-to-noise ratio (SNR). In general, as the number of observation samples and the SNR increase, the variance of the estimate and the influence of the prior decrease. From Equation (4.67) for the estimation of a Gaussian distributed parameter observed in AWGN, as the length of the observation N increases, the importance of the prior decreases, and the MAP estimate tends to the ML estimate:
$$ \lim_{N\to\infty} \hat{\theta}_{MAP}(\mathbf{y}) = \lim_{N\to\infty} \left[\frac{\sigma_\theta^2}{\sigma_\theta^2+\sigma_n^2/N}\,\overline{y} + \frac{\sigma_n^2/N}{\sigma_\theta^2+\sigma_n^2/N}\,\mu_\theta\right] = \overline{y} = \hat{\theta}_{ML} \tag{4.71} $$
As illustrated in Figure 4.13, as the length of the observation N tends to infinity then both the MAP and the ML estimates of the parameter should tend to its true value θ.
Example 4.9 MAP estimation of a signal in additive noise. Consider the estimation of a scalar-valued Gaussian signal x(m), observed in an additive Gaussian white noise n(m), and modelled as
y(m)=x(m)+n(m)
The posterior pdf of the signal x(m) is given by
$$ f_{X|Y}\big(x(m)\,|\,y(m)\big) = \frac{1}{f_Y(y(m))}\, f_{Y|X}\big(y(m)\,|\,x(m)\big)\, f_X\big(x(m)\big) \tag{4.72} $$

$$ = \frac{1}{f_Y(y(m))}\, f_N\big(y(m)-x(m)\big)\, f_X\big(x(m)\big) \tag{4.73} $$
where $f_X(x(m)) = \mathcal{N}(x(m), \mu_x, \sigma_x^2)$ and $f_N(n(m)) = \mathcal{N}(n(m), \mu_n, \sigma_n^2)$ are the Gaussian pdfs of the signal and noise respectively. Substitution of the signal and noise pdfs in Equation (4.73) yields
$$ f_{X|Y}\big(x(m)\,|\,y(m)\big) = \frac{1}{f_Y(y(m))}\, \frac{1}{\sqrt{2\pi}\,\sigma_n} \exp\!\left(-\frac{\big[y(m)-x(m)-\mu_n\big]^2}{2\sigma_n^2}\right) \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\!\left(-\frac{\big[x(m)-\mu_x\big]^2}{2\sigma_x^2}\right) \tag{4.74} $$
This equation can be rewritten as
$$ f_{X|Y}\big(x(m)\,|\,y(m)\big) = \frac{1}{f_Y(y(m))}\, \frac{1}{2\pi\sigma_n\sigma_x} \exp\!\left(-\frac{\sigma_x^2\big[y(m)-x(m)-\mu_n\big]^2 + \sigma_n^2\big[x(m)-\mu_x\big]^2}{2\sigma_x^2\sigma_n^2}\right) \tag{4.75} $$
To obtain the MAP estimate we set the derivative of the log-posterior function $\ln f_{X|Y}(x(m)\,|\,y(m))$ with respect to x(m) to zero:

$$ \frac{\partial \ln f_{X|Y}\big(x(m)\,|\,y(m)\big)}{\partial x(m)} = -\frac{-2\sigma_x^2\big(y(m)-\hat{x}(m)-\mu_n\big) + 2\sigma_n^2\big(\hat{x}(m)-\mu_x\big)}{2\sigma_x^2\sigma_n^2} = 0 \tag{4.76} $$
From Equation (4.76) the MAP signal estimate is given by

$$ \hat{x}(m) = \frac{\sigma_x^2}{\sigma_x^2+\sigma_n^2}\big[y(m)-\mu_n\big] + \frac{\sigma_n^2}{\sigma_x^2+\sigma_n^2}\,\mu_x \tag{4.77} $$
Note that the estimate $\hat{x}(m)$ is a weighted linear interpolation between the unconditional mean of x(m), µx, and the observed value (y(m) − µn). At a very poor SNR, i.e. when σx² << σn², we have $\hat{x}(m) \approx \mu_x$; on the other hand, for a noise-free signal, σn² = 0 and µn = 0, and we have $\hat{x}(m) = y(m)$.
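Equation (4.77) is a per-sample shrinkage rule, simple to implement when the signal and noise statistics are known. The following is a minimal sketch, not from the text; the statistics and sample count are illustrative assumptions:

```python
import numpy as np

# A minimal sketch of the per-sample MAP estimator of Equation (4.77): each
# noisy sample is pulled towards the signal mean mu_x, with a weight set by
# the signal-to-noise variance ratio. All statistics are assumed known.
def map_estimate(y, mu_x, var_x, mu_n, var_n):
    w = var_x / (var_x + var_n)                 # weight on the observation
    return w * (y - mu_n) + (1 - w) * mu_x      # Equation (4.77)

rng = np.random.default_rng(3)
mu_x, var_x = 1.0, 4.0                          # Gaussian signal statistics
mu_n, var_n = 0.0, 1.0                          # Gaussian noise statistics
x = mu_x + np.sqrt(var_x) * rng.standard_normal(10000)
y = x + mu_n + np.sqrt(var_n) * rng.standard_normal(10000)

x_hat = map_estimate(y, mu_x, var_x, mu_n, var_n)
print(np.mean((y - x)**2), np.mean((x_hat - x)**2))  # MAP error is smaller
```

With these values the weight is σx²/(σx²+σn²) = 0.8, and the MAP estimate's mean-squared error is below the raw observation's, as the interpolation argument above predicts.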
Example 4.10 MAP estimate of a Gaussian–AR process observed in AWGN. Consider a vector of N samples x from an autoregressive (AR) process observed in an additive Gaussian noise, and modelled as
$$ \mathbf{y} = \mathbf{x} + \mathbf{n} \tag{4.78} $$
From Chapter 8, a vector x from an AR process may be expressed as
$$ \mathbf{e} = \mathbf{A}\mathbf{x} \tag{4.79} $$
where A is a matrix of the AR model coefficients, and the vector e is the input signal of the AR model. Assuming that the signal x is Gaussian, and that the P initial samples x0 are known, the pdf of the signal x is given by
$$ f_X(\mathbf{x}\,|\,\mathbf{x}_0) = f_E(\mathbf{e}) = \frac{1}{(2\pi\sigma_e^2)^{N/2}} \exp\!\left(-\frac{1}{2\sigma_e^2}\, \mathbf{x}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}}\mathbf{A}\,\mathbf{x}\right) \tag{4.80} $$
where it is assumed that the input signal e of the AR model is a zero-mean uncorrelated process with variance σe². The pdf of a zero-mean Gaussian noise vector n, with covariance matrix Σnn, is given by
$$ f_N(\mathbf{n}) = \frac{1}{(2\pi)^{N/2}\,|\Sigma_{nn}|^{1/2}} \exp\!\left(-\frac{1}{2}\, \mathbf{n}^{\mathrm{T}}\Sigma_{nn}^{-1}\mathbf{n}\right) \tag{4.81} $$
From Bayes’ rule, the pdf of the signal given the noisy observation is
$$ f_{X|Y}(\mathbf{x}\,|\,\mathbf{y}) = \frac{f_{Y|X}(\mathbf{y}\,|\,\mathbf{x})\, f_X(\mathbf{x})}{f_Y(\mathbf{y})} = \frac{1}{f_Y(\mathbf{y})}\, f_N(\mathbf{y}-\mathbf{x})\, f_X(\mathbf{x}) \tag{4.82} $$
Substitution of the pdfs of the signal and noise in Equation (4.82) yields
$$ f_{X|Y}(\mathbf{x}\,|\,\mathbf{y}) = \frac{1}{f_Y(\mathbf{y})\,(2\pi)^{N}\sigma_e^{N}\,|\Sigma_{nn}|^{1/2}} \exp\!\left(-\frac{1}{2}\left[(\mathbf{y}-\mathbf{x})^{\mathrm{T}}\Sigma_{nn}^{-1}(\mathbf{y}-\mathbf{x}) + \frac{\mathbf{x}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}}\mathbf{A}\,\mathbf{x}}{\sigma_e^2}\right]\right) \tag{4.83} $$
The MAP estimate corresponds to the minimum of the argument of the exponential function in Equation (4.83). Assuming that the argument of the exponential function is differentiable, and has a well-defined minimum, we can obtain the MAP estimate from
$$ \hat{\mathbf{x}}_{MAP}(\mathbf{y}) = \arg\operatorname*{zero}_{\mathbf{x}}\; \frac{\partial}{\partial \mathbf{x}} \left[-(\mathbf{y}-\mathbf{x})^{\mathrm{T}}\Sigma_{nn}^{-1}(\mathbf{y}-\mathbf{x}) - \frac{\mathbf{x}^{\mathrm{T}}\mathbf{A}^{\mathrm{T}}\mathbf{A}\,\mathbf{x}}{\sigma_e^2}\right] \tag{4.84} $$
The MAP estimate is

$$ \hat{\mathbf{x}}_{MAP}(\mathbf{y}) = \left(\mathbf{I} + \frac{1}{\sigma_e^2}\,\Sigma_{nn}\mathbf{A}^{\mathrm{T}}\mathbf{A}\right)^{-1}\mathbf{y} \tag{4.85} $$
where I is the identity matrix.
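Equation (4.85) is a single linear solve once A, σe² and Σnn are known. The sketch below, not from the text, uses a first-order AR model with an illustrative coefficient a, unit excitation and noise variances, and ignores the initial-condition subtleties of x0:

```python
import numpy as np

# A sketch of Equation (4.85) for a first-order AR signal, assuming the AR
# coefficient a, the excitation variance sigma_e^2 and the noise covariance
# Sigma_nn are all known. The matrix A maps the signal to its excitation,
# e = A x; here A is the identity with -a on the first sub-diagonal.
rng = np.random.default_rng(4)
N, a, sigma_e, sigma_n = 500, 0.9, 1.0, 1.0

A = np.eye(N) - a * np.eye(N, k=-1)

# Synthesise the AR signal by inverse-filtering white excitation, add noise
e = sigma_e * rng.standard_normal(N)
x = np.linalg.solve(A, e)
y = x + sigma_n * rng.standard_normal(N)

Sigma_nn = sigma_n**2 * np.eye(N)
x_map = np.linalg.solve(np.eye(N) + Sigma_nn @ A.T @ A / sigma_e**2, y)  # (4.85)

print(np.mean((y - x)**2), np.mean((x_map - x)**2))  # MAP error is smaller
```

Because the AR prior concentrates the signal's energy at low frequencies, the estimator of (4.85) acts as a model-matched smoother and its error is noticeably below that of the raw noisy observation.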
4.3 The Estimate–Maximise (EM) Method
The EM algorithm is an iterative likelihood maximisation method with applications in blind deconvolution, model-based signal interpolation, spectral estimation from noisy observations, estimation of a set of model parameters from a training data set, etc. EM provides a framework for problems where a direct ML estimate is difficult to obtain, either because the data is incomplete or because direct maximisation of the likelihood is intractable.
To define the term incomplete data, consider a signal x from a random process X with an unknown parameter vector θ and a pdf fX;Θ(x;θ). The notation fX;Θ(x;θ) expresses the dependence of the pdf of X on the value of the unknown parameter θ. The signal x is the so-called complete data and the ML estimate of the parameter vector θ may be obtained from fX;Θ(x;θ). Now assume that the signal x goes through a many-to-one non-invertible transformation (e.g. when a number of samples of the vector x are lost) and is observed as y. The observation y is the so-called incomplete data.
Maximisation of the likelihood of the incomplete data, fY;Θ(y;θ), with respect to the parameter vector θ is often a difficult task, whereas maximisation of the likelihood of the complete data fX;Θ(x;θ) is relatively
easy. Since the complete data is unavailable, the parameter estimate is obtained through maximisation of the conditional expectation of the loglikelihood of the complete data defined as
$$ E\big[\ln f_{X;\Theta}(\mathbf{x};\theta)\,\big|\,\mathbf{y}\big] = \int_X f_{X|Y;\Theta}(\mathbf{x}\,|\,\mathbf{y};\theta)\, \ln f_{X;\Theta}(\mathbf{x};\theta)\, d\mathbf{x} \tag{4.86} $$
In Equation (4.86), the computation of the term fX|Y;Θ(x|y;θ) requires an
estimate of the unknown parameter vector θ. For this reason, the expectation of the likelihood function is maximised iteratively starting with an initial estimate of θ, and updating the estimate as described in the following.
[Figure 4.14: a signal process with parameter θ generates the "complete data" x, with pdf fX;Θ(x;θ); a non-invertible transformation maps x to the observed "incomplete data" y, with pdf fY;Θ(y;θ).]
Figure 4.14 Illustration of transformation of complete data to incomplete data.
EM Algorithm
Step 1: Initialisation Select an initial parameter estimate θ0 , and for i = 0, 1, ... until convergence:
Step 2: Expectation Compute
$$ U(\theta,\hat{\theta}_i) = E\big[\ln f_{X;\Theta}(\mathbf{x};\theta)\,\big|\,\mathbf{y};\hat{\theta}_i\big] = \int_X f_{X|Y;\Theta}(\mathbf{x}\,|\,\mathbf{y};\hat{\theta}_i)\, \ln f_{X;\Theta}(\mathbf{x};\theta)\, d\mathbf{x} \tag{4.87} $$
Step 3: Maximisation Select
$$ \hat{\theta}_{i+1} = \arg\max_{\theta}\, U(\theta,\hat{\theta}_i) \tag{4.88} $$
Step 4: Convergence test If not converged then go to Step 2.
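The four steps above can be sketched on a classic incomplete-data problem that is not taken from the text: data drawn from a two-component Gaussian mixture with known equal weights and unit variances, where the hidden component labels are the missing part of the complete data and θ = (µ0, µ1) are the unknown means. All numerical values are illustrative:

```python
import numpy as np

# A minimal EM sketch following Steps 1-4 above. The hidden mixture labels
# play the role of the missing data; U(theta, theta_i) is maximised in closed
# form by responsibility-weighted means.
rng = np.random.default_rng(5)
mu_true = np.array([-2.0, 2.0])
z = rng.integers(0, 2, size=4000)              # hidden labels (unobserved)
y = mu_true[z] + rng.standard_normal(4000)     # the incomplete data

mu = np.array([-0.5, 0.5])                     # Step 1: initial estimate
for _ in range(50):
    # Step 2 (Expectation): posterior responsibilities of each component,
    # i.e. f_{X|Y;Theta}(x | y; theta_i) for the hidden labels
    logp = -0.5 * (y[:, None] - mu[None, :])**2
    r = np.exp(logp - logp.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # Step 3 (Maximisation): weighted means maximise U(theta, theta_i)
    mu = (r * y[:, None]).sum(axis=0) / r.sum(axis=0)
    # Step 4: a fixed iteration count stands in for a convergence test

print(mu)   # approaches the true component means
```

Each pass through the loop performs one expectation and one maximisation of Equations (4.87) and (4.88); the estimated means converge towards the true component means from the deliberately poor starting point.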
4.3.1 Convergence of the EM Algorithm
In this section, it is shown that the EM algorithm converges to a maximum of the likelihood of the incomplete data fY;Θ(y;θ). The likelihood of the complete data can be written as
$$ f_{X,Y;\Theta}(\mathbf{x},\mathbf{y};\theta) = f_{X|Y;\Theta}(\mathbf{x}\,|\,\mathbf{y};\theta)\, f_{Y;\Theta}(\mathbf{y};\theta) \tag{4.89} $$
where fX,Y;Θ(x,y;θ) is the likelihood of x and y with θ as a parameter. From Equation (4.89), the log-likelihood of the incomplete data is obtained as
$$ \ln f_{Y;\Theta}(\mathbf{y};\theta) = \ln f_{X,Y;\Theta}(\mathbf{x},\mathbf{y};\theta) - \ln f_{X|Y;\Theta}(\mathbf{x}\,|\,\mathbf{y};\theta) \tag{4.90} $$
Using an estimate θˆi of the parameter vector θ, and taking the expectation of Equation (4.90) over the space of the complete signal x, we obtain
$$ \ln f_{Y;\Theta}(\mathbf{y};\theta) = U(\theta;\hat{\theta}_i) - V(\theta;\hat{\theta}_i) \tag{4.91} $$
where, for a given y, the expectation of $\ln f_{Y;\Theta}(\mathbf{y};\theta)$ is itself, and the function $U(\theta;\hat{\theta}_i)$ is the conditional expectation of $\ln f_{X,Y;\Theta}(\mathbf{x},\mathbf{y};\theta)$: