Since the denominator of the conditional pmf does not depend on x (only on y), given y the denominator has no effect on the maximization
$$\hat{X}(y) = \operatorname*{argmax}_x p_{X|Y}(x|y) = \operatorname*{argmax}_x f_W(y - x)\, p_X(x).$$
Assume for simplicity that X is equally likely to be 0 or 1 so that the rule becomes
$$\hat{X}(y) = \operatorname*{argmax}_x p_{X|Y}(x|y) = \operatorname*{argmax}_x \frac{1}{\sqrt{2\pi\sigma_W^2}}\, e^{-\frac{(x-y)^2}{2\sigma_W^2}}.$$
The constant in front of the pdf does not affect the maximization. In addition, the exponential is a monotonically decreasing function of |x − y|, so the exponential is maximized by minimizing this magnitude difference, i.e.,
$$\hat{X}(y) = \operatorname*{argmax}_x p_{X|Y}(x|y) = \operatorname*{argmin}_x |x - y|, \tag{3.99}$$
which yields a final simple rule: choose whichever of x = 0 or x = 1 is closer to y as the best guess of x. This choice yields the MAP detection and hence the minimum probability of error. In our example this gives the rule
$$\hat{X}(y) = \begin{cases} 0 & y < 0.5 \\ 1 & y > 0.5 \end{cases}. \tag{3.100}$$
Because the optimal detector chooses the x that minimizes the Euclidean distance |x−y| to the observation y, it is called a minimum distance detector or rule. Because the guess can be computed by comparing the observation to a threshold (the value midway between the two possible values of x), the detector is also called a threshold detector.
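As a concrete illustration, here is a minimal Python sketch of the minimum distance rule; the function name and parameter values are illustrative, and numpy is assumed to be available.

```python
import numpy as np

def threshold_detector(y, levels=(0.0, 1.0)):
    """Minimum distance rule: guess whichever level is closer to y.

    For levels 0 and 1 this reduces to comparing y to the midpoint 0.5.
    """
    threshold = 0.5 * (levels[0] + levels[1])
    return np.where(y > threshold, levels[1], levels[0])

# Example: detect a single noisy observation.
rng = np.random.default_rng(0)
x = rng.integers(0, 2)            # equiprobable signal in {0, 1}
w = rng.normal(0.0, 0.4)          # Gaussian noise, sigma_W = 0.4 (illustrative)
y = x + w
print(x, threshold_detector(y))
```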
Assumptions have been made to keep things fairly simple. The reader is invited to work out what happens if the random variable X is biased and if its alphabet is taken to be {−1, 1} instead of {0, 1}. It is instructive to sketch the conditional pmf’s for these cases.
Having derived the optimal detector, it is reasonable to look at the resulting, minimized, probability of error. This can be found using conditional probability:
$$\begin{aligned}
P_e &= \Pr(\hat{X}(Y) \neq X) \\
&= \Pr(\hat{X}(Y) \neq 0 \mid X = 0)\,p_X(0) + \Pr(\hat{X}(Y) \neq 1 \mid X = 1)\,p_X(1) \\
&= \Pr(Y > 0.5 \mid X = 0)\,p_X(0) + \Pr(Y < 0.5 \mid X = 1)\,p_X(1) \\
&= \Pr(W + X > 0.5 \mid X = 0)\,p_X(0) + \Pr(W + X < 0.5 \mid X = 1)\,p_X(1) \\
&= \Pr(W > 0.5 \mid X = 0)\,p_X(0) + \Pr(W + 1 < 0.5 \mid X = 1)\,p_X(1) \\
&= \Pr(W > 0.5)\,p_X(0) + \Pr(W < -0.5)\,p_X(1)
\end{aligned}$$
where we have used the independence of W and X. These probabilities can be stated in terms of the Φ function of (2.78) as in (2.82), which combined with the assumption that X is uniform and (2.84) yields
$$P_e = \frac{1}{2}\left(1 - \Phi\!\left(\frac{0.5}{\sigma_W}\right) + \Phi\!\left(-\frac{0.5}{\sigma_W}\right)\right) = \Phi\!\left(-\frac{1}{2\sigma_W}\right). \tag{3.101}$$
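A quick Monte Carlo experiment makes (3.101) tangible. The following sketch, assuming numpy and scipy are available and using an illustrative noise level σ_W = 0.4, estimates the error probability of the threshold detector empirically and compares it with the closed form.

```python
import numpy as np
from scipy.stats import norm   # norm.cdf is the Phi function

rng = np.random.default_rng(1)
sigma_w, n = 0.4, 200_000
x = rng.integers(0, 2, size=n)            # equiprobable bits
y = x + rng.normal(0.0, sigma_w, size=n)  # observation Y = X + W
x_hat = (y > 0.5).astype(int)             # threshold detector of (3.100)

pe_empirical = np.mean(x_hat != x)
pe_theory = norm.cdf(-0.5 / sigma_w)      # Phi(-1/(2 sigma_W)), eq. (3.101)
print(pe_empirical, pe_theory)            # the two should agree closely
```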
3.11 Statistical Estimation
Discrete conditional probabilities were seen to provide a method for guessing an unknown class from an observation: if all incorrect choices have equal costs so that the overall optimality criterion is to minimize the probability of error, then the optimal classification rule is to guess the class X = k for which $p_{X|Y}(k|y) = \max_x p_{X|Y}(x|y)$, the maximum a posteriori or MAP decision rule. There is an analogous problem and solution in the continuous case, but the result does not have as strong an interpretation as in the discrete case. A more complete analogy will be derived in the next chapter.
As in the discrete case, suppose that a random variable Y is observed and the goal is to make a good guess $\hat{X}(Y)$ of another random variable X that is jointly distributed with Y. Unfortunately, in the continuous case it does not make sense to measure the quality of such a guess by the probability of its being correct because now that probability is usually zero. For example, if Y is formed by adding a Gaussian signal X to an independent Gaussian noise W to form an observation Y = X + W as in the previous section, then no rule is going to recover X perfectly from Y. Nonetheless, intuitively there should be reasonable ways to make such guesses in continuous situations. Since X is continuous, such guesses are referred to as "estimation" or "prediction" of X rather than as "classification" or "detection" as used in the discrete case. In the statistical literature the general problem is referred to as "regression".
One approach is to mimic the discrete approach on intuitive grounds. If the best guess in the classification problem of a random variable X given an observation Y is the MAP classifier $\hat{X}_{\mathrm{MAP}}(y) = \operatorname*{argmax}_x p_{X|Y}(x|y)$, then a natural analog in the continuous case is the so-called MAP estimator defined by
$$\hat{X}_{\mathrm{MAP}}(y) = \operatorname*{argmax}_x f_{X|Y}(x|y), \tag{3.102}$$
the value of x maximizing the conditional pdf given y. The advantage of this estimator is that it is easy to describe and provides an immediate application of conditional pdf's paralleling that of classification for discrete conditional probability. The disadvantage is that we cannot argue that this estimate is "optimal" in the sense of optimizing some specified criterion; it is essentially an ad hoc (but reasonable) rule. As an example of its use, consider the Gaussian signal plus noise of the previous section. There it was found that the pdf $f_{X|Y}(x|y)$ is Gaussian with mean $\frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, y$. Since the Gaussian density has its peak at its mean, in this case the MAP estimate of X given Y = y is given by the conditional mean $\frac{\sigma_X^2}{\sigma_X^2 + \sigma_W^2}\, y$.
Knowledge of the conditional pdf is all that is needed to define another estimator: the maximum likelihood or ML estimate of X given Y = y is defined as the value of x that maximizes the conditional pdf $f_{Y|X}(y|x)$, the pdf with the roles of input and output reversed from that of the MAP estimator. Thus
$$\hat{X}_{\mathrm{ML}}(y) = \operatorname*{argmax}_x f_{Y|X}(y|x). \tag{3.103}$$
Thus in the Gaussian case treated above, $\hat{X}_{\mathrm{ML}}(y) = y$.
The main interest in the ML estimator in some applications is that it is sometimes simpler and that it does not require any assumption on the input statistics. The MAP estimator depends strongly on $f_X$; the ML estimator does not depend on it at all. It is easy to see that if the input pdf is uniform, the MAP estimator and the ML estimator are the same.
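For the Gaussian signal-plus-noise example the two estimators are easy to compare side by side. The sketch below is illustrative only; the variances and the observed value are made up for the example.

```python
import numpy as np

def map_estimate(y, var_x, var_w):
    # Conditional mean of X given Y = y for Gaussian X plus Gaussian W.
    return (var_x / (var_x + var_w)) * y

def ml_estimate(y):
    # Maximizing f_{Y|X}(y|x) over x places the noise at zero: x = y.
    return y

y = 0.8                                        # an observed value (illustrative)
print(map_estimate(y, var_x=1.0, var_w=0.5))   # 0.533..., shrunk toward the mean 0
print(ml_estimate(y))                          # 0.8, no shrinkage
```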
3.12 Characteristic Functions
We have seen that summing two independent random variables produces a new random variable whose pmf or pdf is found by convolving the two pmf's or pdf's of the original random variables. Anyone with an engineering background will likely have had experience with convolution and recall that it can be somewhat messy to evaluate. To make matters worse, if one wishes to sum additional independent random variables onto the existing sum, say to form $Y = \sum_{k=1}^{N} X_k$ from an iid collection $\{X_k\}$, then the result will be an N-fold convolution, a potential nightmare in all but the simplest of cases. As in other engineering applications such as circuit design, convolutions can be avoided by Fourier transform methods, and in this subsection we describe the method as an alternative approach for the examples to come. We begin with the discrete case.
Historically, the transforms used in probability theory have been slightly different from those used in traditional Fourier analysis. For a discrete random variable with pmf $p_X$, define the characteristic function $M_X$ of the random variable (or of the pmf) as
$$M_X(ju) = \sum_x p_X(x)\, e^{jux}, \tag{3.104}$$
where u is usually assumed to be real. Recalling the definition (2.34) of the expectation of a function g defined on a sample space, choosing $g(\omega) = e^{juX(\omega)}$ shows that the characteristic function can be more simply defined as
$$M_X(ju) = E[e^{juX}]. \tag{3.105}$$
Thus characteristic functions, like probabilities, can be viewed as special cases of expectations.
This transform, which is also referred to as an exponential transform or operational transform, bears a strong resemblance to the discrete-parameter Fourier transform
$$F_\nu(p_X) = \sum_x p_X(x)\, e^{-j2\pi\nu x} \tag{3.106}$$
and the z-transform
$$Z_z(p_X) = \sum_x p_X(x)\, z^x. \tag{3.107}$$
In particular, $M_X(ju) = F_{-u/2\pi}(p_X) = Z_{e^{ju}}(p_X)$. As a result, all of the properties of characteristic functions follow immediately from (are equivalent to) similar properties of Fourier or z-transforms. As with Fourier and z-transforms, the original pmf $p_X$ can be recovered from the transform $M_X$ by suitable inversion. For example, given a pmf $p_X(k);\ k \in \mathbb{Z}_N$,
$$\begin{aligned}
\frac{1}{2\pi}\int_{-\pi}^{\pi} M_X(ju)\, e^{-juk}\, du
&= \frac{1}{2\pi}\int_{-\pi}^{\pi} \left(\sum_x p_X(x)\, e^{jux}\right) e^{-juk}\, du \\
&= \sum_x p_X(x)\, \frac{1}{2\pi}\int_{-\pi}^{\pi} e^{ju(x-k)}\, du \\
&= \sum_x p_X(x)\, \delta_{k-x} = p_X(k). \tag{3.108}
\end{aligned}$$
Consider again the problem of summing two independent random variables X and W with pmf's $p_X$ and $p_W$ and characteristic functions $M_X$ and $M_W$, respectively. If Y = X + W as before, we can evaluate the characteristic function of Y as
$$M_Y(ju) = \sum_y p_Y(y)\, e^{juy}$$
where from the inverse image formula
$$p_Y(y) = \sum_{x,w:\, x+w=y} p_{X,W}(x,w)$$
so that
$$\begin{aligned}
M_Y(ju) &= \sum_y \left(\sum_{x,w:\, x+w=y} p_{X,W}(x,w)\right) e^{juy} \\
&= \sum_y \sum_{x,w:\, x+w=y} p_{X,W}(x,w)\, e^{juy} \\
&= \sum_y \sum_{x,w:\, x+w=y} p_{X,W}(x,w)\, e^{ju(x+w)} \\
&= \sum_{x,w} p_{X,W}(x,w)\, e^{ju(x+w)}
\end{aligned}$$
where the last equality follows because each of the sums for distinct y collects together different x and w, and together the sums over all y gather all of the x and w. This last sum factors, however, as
$$\begin{aligned}
M_Y(ju) &= \sum_{x,w} p_X(x)\, p_W(w)\, e^{jux} e^{juw} \\
&= \left(\sum_x p_X(x)\, e^{jux}\right)\left(\sum_w p_W(w)\, e^{juw}\right) \\
&= M_X(ju)\, M_W(ju), \tag{3.109}
\end{aligned}$$
|
which shows that the transform of the pmf of the sum of independent random variables is simply the product of the transforms.
Iterating (3.109) several times gives an extremely useful result that we state formally as a theorem. It can be proved by repeating the above argument, but we shall later see a shorter proof.
Theorem 3.1 If $\{X_i;\ i = 1, \ldots, N\}$ are independent random variables with characteristic functions $M_{X_i}$, then the characteristic function of the random variable $Y = \sum_{i=1}^{N} X_i$ is
$$M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju). \tag{3.110}$$
If the $X_i$ are independent and identically distributed with common characteristic function $M_X$, then
$$M_Y(ju) = M_X^N(ju). \tag{3.111}$$
As a simple example, the characteristic function of a binary random variable X with parameter $p = p_X(1) = 1 - p_X(0)$ is easily found to be
$$M_X(ju) = \sum_{k=0}^{1} e^{juk}\, p_X(k) = (1-p) + p\, e^{ju}. \tag{3.112}$$
If $\{X_i;\ i = 1, \ldots, n\}$ are independent Bernoulli random variables with identical distributions and $Y_n = \sum_{i=1}^{n} X_i$, then $M_{Y_n}(ju) = \left[(1-p) + p\, e^{ju}\right]^n$ and hence
$$\begin{aligned}
M_{Y_n}(ju) &= \sum_{k=0}^{n} p_{Y_n}(k)\, e^{juk} \\
&= \left((1-p) + p\, e^{ju}\right)^n \\
&= \sum_{k=0}^{n} \left[\binom{n}{k} (1-p)^{n-k} p^k\right] e^{juk},
\end{aligned}$$
where we have invoked the binomial theorem in the last step. For the equality to hold, however, we have from the uniqueness of transforms that pYn (k) must be the bracketed term, that is, the binomial pmf
$$p_{Y_n}(k) = \binom{n}{k} (1-p)^{n-k} p^k;\quad k \in \mathbb{Z}_{n+1}. \tag{3.113}$$
As in the discrete case, convolutions can be avoided by transforming the densities involved. The derivation is exactly analogous to the discrete case, with integrals replacing sums in the usual way.
For a continuous random variable X with pdf $f_X$, define the characteristic function $M_X$ of the random variable (or of the pdf) as
$$M_X(ju) = \int_{-\infty}^{\infty} f_X(x)\, e^{jux}\, dx. \tag{3.114}$$
As in the discrete case, this can be considered as a special case of expectation for continuous random variables as defined in (2.34) so that
$$M_X(ju) = E[e^{juX}]. \tag{3.115}$$
The characteristic function is related to the continuous-parameter Fourier transform
$$F_\nu(f_X) = \int_{-\infty}^{\infty} f_X(x)\, e^{-j2\pi\nu x}\, dx \tag{3.116}$$
and the Laplace transform
$$L_s(f_X) = \int_{-\infty}^{\infty} f_X(x)\, e^{sx}\, dx \tag{3.117}$$
by $M_X(ju) = F_{-u/2\pi}(f_X) = L_{ju}(f_X)$. As a result, all of the properties of characteristic functions of densities follow immediately from (are equivalent to) similar properties of Fourier or Laplace transforms. For example, given a well-behaved density $f_X(x);\ x \in \mathbb{R}$, with characteristic function $M_X(ju)$,
$$f_X(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} M_X(ju)\, e^{-jux}\, du. \tag{3.118}$$
Consider again the problem of summing two independent random variables X and W with pdf's $f_X$ and $f_W$ and characteristic functions $M_X$ and $M_W$, respectively. As in the discrete case it can be shown that
$$M_Y(ju) = M_X(ju)\, M_W(ju). \tag{3.119}$$
Rather than mimic the proof of the discrete case, however, we postpone the proof to a more general treatment of characteristic functions in chapter 4.
As in the discrete case, iterating (3.119) several times yields the following result, which now includes both discrete and continuous cases.
Theorem 3.2 If $\{X_i;\ i = 1, \ldots, N\}$ are independent random variables with characteristic functions $M_{X_i}$, then the characteristic function of the random variable $Y = \sum_{i=1}^{N} X_i$ is
$$M_Y(ju) = \prod_{i=1}^{N} M_{X_i}(ju). \tag{3.120}$$
If the $X_i$ are independent and identically distributed with common characteristic function $M_X$, then
$$M_Y(ju) = M_X^N(ju). \tag{3.121}$$
As an example of characteristic functions and continuous random variables, consider the Gaussian random variable. The evaluation requires a bit of effort, either using the "complete the square" technique of calculus or by looking it up in published tables. Assume that X is a Gaussian random variable with mean m and variance $\sigma^2$. Then
$$\begin{aligned}
M_X(ju) &= E(e^{juX}) \\
&= \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x-m)^2/2\sigma^2}\, e^{jux}\, dx \\
&= \int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x^2 - 2mx - 2\sigma^2 jux + m^2)/2\sigma^2}\, dx \\
&= \left(\int_{-\infty}^{\infty} \frac{1}{(2\pi\sigma^2)^{1/2}}\, e^{-(x - (m + ju\sigma^2))^2/2\sigma^2}\, dx\right) e^{jum - u^2\sigma^2/2} \\
&= e^{jum - u^2\sigma^2/2}. \tag{3.122}
\end{aligned}$$
|||
Thus the characteristic function of a Gaussian random variable with mean m and variance $\sigma^2$ is
$$M_X(ju) = e^{jum - u^2\sigma^2/2}. \tag{3.123}$$
If $\{X_i;\ i = 1, \ldots, n\}$ are independent Gaussian random variables with identical densities $\mathcal{N}(m, \sigma^2)$ and $Y_n = \sum_{i=1}^{n} X_i$, then
$$M_{Y_n}(ju) = \left[e^{jum - u^2\sigma^2/2}\right]^n = e^{ju(nm) - u^2(n\sigma^2)/2}, \tag{3.124}$$
which is the characteristic function of a Gaussian random variable with mean $nm$ and variance $n\sigma^2$.
The following maxim should be kept in mind whenever faced with sums of independent random variables:

When given a derived distribution problem involving the sum of independent random variables, first find the characteristic function of the sum by taking the product of the characteristic functions of the individual random variables. Then find the corresponding probability function by inverting the transform. This technique is valid if the random variables are independent; they do not have to be identically distributed.
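As a sanity check on (3.124) and the maxim, one can draw sums of iid Gaussian variables and compare the sample mean and variance with nm and nσ². A minimal sketch, assuming numpy and using illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, sigma = 5, 1.0, 0.7
# Each row is one realization of (X_1, ..., X_n); sum the rows to get Y_n.
samples = rng.normal(m, sigma, size=(100_000, n)).sum(axis=1)

print(samples.mean(), n * m)           # sample mean     vs  n m
print(samples.var(), n * sigma**2)     # sample variance vs  n sigma^2
```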
3.13 Gaussian Random Vectors
A random vector is said to be Gaussian if its density is Gaussian, that is, if its distribution is described by the multidimensional pdf explained in chapter 2. The component random variables of a Gaussian random vector are said to be jointly Gaussian random variables. Note that the symmetric matrix Λ of the k-dimensional vector pdf has $k(k+1)/2$ parameters and that the vector m has k parameters. On the other hand, the k marginal pdf's together have only 2k parameters. Again we note the impossibility of constructing joint pdf's without more specification than the marginal pdf's alone. As previously, the marginals will suffice to describe the entire vector if we also know that the vector has independent components, e.g., the vector is iid. In this case the matrix Λ is diagonal.
Although difficult to describe, Gaussian random vectors have several nice properties. One of the most important of these is that linear or affine operations on Gaussian random vectors produce Gaussian random vectors. This result can be demonstrated with only a modest amount of work using multidimensional characteristic functions, the extension of transforms from scalars to vectors.
The multidimensional characteristic function of a distribution is defined as follows: Given a random vector X = (X0, . . . , Xn−1) and a vector parameter u = (u0, . . . , un−1), the n-dimensional characteristic function MX(ju) is defined by
$$\begin{aligned}
M_X(ju) &= M_{X_0, \ldots, X_{n-1}}(ju_0, \ldots, ju_{n-1}) \\
&= E\left[e^{ju^t X}\right] \\
&= E\left[\exp\left(j \sum_{k=0}^{n-1} u_k X_k\right)\right]. \tag{3.125}
\end{aligned}$$
It can be shown using multivariable calculus (problem 3.49) that a Gaussian random vector with mean vector m and covariance matrix Λ has characteristic function
$$\begin{aligned}
M_X(ju) &= e^{ju^t m - \frac{1}{2} u^t \Lambda u} \\
&= \exp\left(j \sum_{k=0}^{n-1} u_k m_k - \frac{1}{2} \sum_{k=0}^{n-1} \sum_{m=0}^{n-1} u_k\, \Lambda(k,m)\, u_m\right). \tag{3.126}
\end{aligned}$$
Observe that the Gaussian characteristic function has the same form as the Gaussian pdf, an exponential quadratic in its argument. However, unlike the pdf, the characteristic function depends on the covariance matrix directly, whereas the pdf contains the inverse of the covariance matrix. Thus the Gaussian characteristic function is in some sense simpler than the Gaussian pdf. As a further consequence of the direct dependence on the covariance matrix, it is interesting to note that, unlike the Gaussian pdf, the characteristic function is well-defined even if Λ is only nonnegative definite and not strictly positive definite. Previously we gave a definition of a Gaussian random vector in terms of its pdf. Now we can give an alternate, more general (in the sense that a strictly positive definite covariance matrix is not required) definition of a Gaussian random vector (and hence random process):
A random vector is Gaussian if and only if it has a characteristic function of the form of (3.126).
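Equation (3.126) can be spot-checked by Monte Carlo: estimate $E[e^{ju^t X}]$ from samples of a Gaussian vector and compare with the closed form. The mean vector, covariance matrix, and u below are illustrative, and numpy is assumed to be available.

```python
import numpy as np

rng = np.random.default_rng(3)
m = np.array([1.0, -0.5])            # mean vector (illustrative)
cov = np.array([[1.0, 0.3],
                [0.3, 0.5]])         # covariance matrix Lambda (illustrative)
u = np.array([0.4, -0.2])            # transform argument (illustrative)

x = rng.multivariate_normal(m, cov, size=200_000)
empirical = np.mean(np.exp(1j * x @ u))               # sample estimate of E[e^{ju^t X}]
closed_form = np.exp(1j * u @ m - 0.5 * u @ cov @ u)  # eq. (3.126)
print(empirical, closed_form)                         # should agree closely
```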
3.14 Examples: Simple Random Processes
In this section several examples of random processes defined on simple probability spaces are given to illustrate the basic definition of an infinite collection of random variables defined on a single space. In the next section more complicated examples are considered by defining random variables on a probability space which is the output space for another random process, a setup that can be viewed as signal processing.
[3.22] Consider the binary probability space (Ω, F, P ) with Ω = {0, 1}, F the usual event space, and P induced by the pmf p(0) = α and p(1) = 1 − α, where α is some constant, 0 ≤ α ≤ 1. Define a random process on this space as follows:
$$X(t, \omega) = \cos(\omega t) = \begin{cases} \cos(t), & \text{all } t, & \text{if } \omega = 1 \\ 1, & \text{all } t, & \text{if } \omega = 0 \end{cases}.$$
Thus if a 1 occurs a cosine is sent forever, and if a 0 occurs a constant 1 is sent forever.
This process clearly has continuous time and at first glance it might appear to also have continuous amplitude, but only two waveforms are possible, a cosine and a constant. Thus the alphabet at each time contains at most two values and these possible values change with time. Hence this process is in fact a discrete amplitude process and random vectors drawn from this source are described by pmf's. We can consider the alphabet of the process to be either $\mathbb{R}^T$ or $[-1, 1]^T$, among other possibilities. Fix time at $t = \pi/2$. Then $X(\pi/2)$ is a random variable with pmf
$$p_{X(\pi/2)}(x) = \begin{cases} \alpha, & \text{if } x = 1 \\ 1 - \alpha, & \text{if } x = 0 \end{cases}.$$
The reader should try other instances of time. What happens at $t = 0, 2\pi, 4\pi, \ldots$?
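A short sketch makes the two-waveform structure of this process concrete; α, the function names, and the sample times below are illustrative.

```python
import numpy as np

def sample_path(alpha, rng):
    """Draw omega in {0, 1} and return the resulting waveform X(t, omega)."""
    omega = int(rng.random() < 1 - alpha)   # P(omega = 1) = 1 - alpha
    return lambda t: np.cos(omega * t)      # cos(t) if omega = 1, else constant 1

rng = np.random.default_rng(4)
x = sample_path(alpha=0.5, rng=rng)
t = np.array([0.0, np.pi / 2, np.pi, 2 * np.pi])
print(x(t))   # approximately [1, 0, -1, 1] (cosine) or [1, 1, 1, 1] (constant)
```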
