
recorded, and which of these five measurements are most informative in this classification exercise.
Type A Skulls

   X1       X2       X3       X4       X5
 190.5    152.5    145.0     73.5    136.5
 172.5    132.0    125.5     63.0    121.0
 167.0    130.0    125.5     69.5    119.5
 169.5    150.5    133.5     64.5    128.0
 175.0    138.5    126.0     77.5    135.5
 177.5    142.5    142.5     71.5    131.0
 179.5    142.5    127.5     70.5    134.5
 179.5    138.0    133.5     73.5    132.5
 173.5    135.5    130.5     70.0    133.5
 162.5    139.0    131.0     62.0    126.0
 178.5    135.0    136.0     71.0    124.0
 171.5    148.5    132.5     65.0    146.5
 180.5    139.0    132.0     74.5    134.5
 183.0    149.0    121.5     76.5    142.0
 169.5    130.0    131.0     68.0    119.0
 172.0    140.0    136.0     70.5    133.5
 170.0    126.5    134.5     66.0    118.5

Type B Skulls

   X1       X2       X3       X4       X5
 182.5    136.0    138.5     76.0    134.0
 179.5    135.0    128.5     74.0    132.0
 191.0    140.5    140.5     72.5    131.5
 184.5    141.5    134.5     76.5    141.5
 181.0    142.0    132.5     79.0    136.5
 173.5    136.5    126.0     71.5    136.5
 188.5    130.0    143.0     79.5    136.0
 175.0    153.0    130.0     76.5    142.0
 196.0    142.5    123.5     76.0    134.0
 200.0    139.5    143.5     82.5    146.0
 185.0    134.5    140.0     81.5    137.0
©2002 CRC Press LLC

Type B Skulls (Continued)

   X1       X2       X3       X4       X5
 174.5    143.5    132.5     74.0    136.5
 195.5    144.0    138.5     78.5    144.0
 197.0    131.5    135.0     80.5    139.0
 182.5    131.0    135.0     68.5    136.0

Note: X1 = greatest length of skull; X2 = greatest horizontal breadth of skull; X3 = height of skull; X4 = upper face height; and X5 = face breadth, between outermost points of cheek bones.
Display 15.1
15.2 Discriminant Function Analysis
Discriminant analysis is concerned with deriving rules for allocating observations to one or another of a set of a priori defined classes in some optimal way, using the information provided by a series of measurements made on each sample member. The technique is used when the investigator has one set of observations, the training sample, for which group membership is known with certainty a priori, and a second set, the test sample, for which group membership is unknown and whose members must be allocated to one of the known groups with as few misclassifications as possible.
An initial question that might be asked is: since the members of the training sample can be classified with certainty, why not apply the procedure used in their classification to the test sample? Reasons are not difficult to find. In medicine, for example, it might be possible to diagnose a particular condition with certainty only as a result of a post-mortem examination. Clearly, for patients still alive and in need of treatment, a different diagnostic procedure would be useful!
Several methods for discriminant analysis are available, but here we concentrate on the one proposed by Fisher (1936) as a method for classifying an observation into one of two possible groups using measurements x1, x2, …, xp. Fisher’s approach to the problem was to seek a linear function z of the variables:
$$ z = a_1 x_1 + a_2 x_2 + \cdots + a_p x_p \qquad (15.1) $$
such that the ratio of the between-groups variance of z to its within-group variance is maximized. This implies that the coefficients a′ = [a1, …, ap ] have to be chosen so that V, given by:
$$ V = \frac{a' B a}{a' S a} \qquad (15.2) $$
is maximized. In Eq. (15.2), S is the pooled within-groups covariance matrix; that is
$$ S = \frac{(n_1 - 1) S_1 + (n_2 - 1) S_2}{n_1 + n_2 - 2} \qquad (15.3) $$
where S1 and S2 are the covariance matrices of the two groups, and n1 and n2 the group sample sizes. The matrix B in Eq. (15.2) is the covariance matrix of the group means.
The vector a that maximizes V is given by the solution of the equation:
$$ (B - \lambda S)\, a = 0 \qquad (15.4) $$
In the two-group situation, the single solution can be shown to be:
$$ a = S^{-1} (\bar{x}_1 - \bar{x}_2) \qquad (15.5) $$
where $\bar{x}_1$ and $\bar{x}_2$ are the mean vectors of the measurements for the observations in each group.
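As a rough numerical check of Eq. (15.5), the sketch below pools the two within-group covariance matrices printed in Display 15.2 according to Eq. (15.3) and then solves S a = x̄1 – x̄2 by Gaussian elimination. The helper names (pooled_covariance, solve) are illustrative and not from the text, and Type A is taken as group 1.

```python
# Within-group covariance matrices and mean vectors as printed in Display 15.2
# (variable order: length, width, height, faceheight, facewidth).
S_A = [
    [45.52941176, 25.22242647, 12.39062500, 22.15441176, 27.97242647],
    [25.22242647, 57.80514706, 11.87500000, 7.51930147, 48.05514706],
    [12.39062500, 11.87500000, 36.09375000, -0.31250000, 1.40625000],
    [22.15441176, 7.51930147, -0.31250000, 20.93566176, 16.76930147],
    [27.97242647, 48.05514706, 1.40625000, 16.76930147, 66.21139706],
]
S_B = [
    [74.42380952, -9.52261905, 22.73690476, 17.79404762, 11.12500000],
    [-9.52261905, 37.35238095, -11.26309524, 0.70476190, 9.46428571],
    [22.73690476, -11.26309524, 36.31666667, 10.72380952, 7.19642857],
    [17.79404762, 0.70476190, 10.72380952, 15.30238095, 8.66071429],
    [11.12500000, 9.46428571, 7.19642857, 8.66071429, 17.96428571],
]
mean_A = [174.82353, 139.35294, 132.00000, 69.82353, 130.35294]
mean_B = [185.73333, 138.73333, 134.76667, 76.46667, 137.50000]
n_A, n_B = 17, 15


def pooled_covariance(s1, s2, n1, n2):
    """Pooled within-groups covariance matrix, Eq. (15.3)."""
    p = len(s1)
    return [
        [((n1 - 1) * s1[i][j] + (n2 - 1) * s2[i][j]) / (n1 + n2 - 2)
         for j in range(p)]
        for i in range(p)
    ]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


S = pooled_covariance(S_A, S_B, n_A, n_B)
mean_diff = [ma - mb for ma, mb in zip(mean_A, mean_B)]
coefficients = solve(S, mean_diff)  # Eq. (15.5): a = S^{-1}(xbar_1 - xbar_2)
print([round(v, 4) for v in coefficients])
```

The resulting vector should agree, up to rounding, with the difference between the A and B coefficient columns of the linear discriminant function in Display 15.2.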
The assumptions under which Fisher’s method is optimal are:
The data in both groups have a multivariate normal distribution.
The covariance matrices of each group are the same.
If the covariance matrices are not the same, but the data are multivariate normal, a quadratic discriminant function may be required. If the data are not multivariate normal, an alternative such as logistic discrimination (Everitt and Dunn [2001]) may be more useful, although Fisher’s method is known to be relatively robust against departures from normality (Hand [1981]).
Assuming $\bar{z}_1 > \bar{z}_2$, where $\bar{z}_1$ and $\bar{z}_2$ are the discriminant function score means in each group, the classification rule for an observation with discriminant score $z_i$ is:
Assign to group 1 if $z_i - z_c > 0$,
Assign to group 2 if $z_i - z_c \le 0$,
where
$$ z_c = \frac{\bar{z}_1 + \bar{z}_2}{2} \qquad (15.6) $$
(This rule assumes that the prior probabilities of belonging to each group are the same.) Subsets of variables most useful for discrimination can be identified using procedures similar to the stepwise methods described in Chapter 4.
A question of some importance about a discriminant function is: how well does it perform? One possible method of evaluating performance would be to apply the derived classification rule to the training set data and calculate the misclassification rate; this is known as the resubstitution estimate. However, estimating misclassification rates in this way, although simple, is known in general to be optimistic (in some cases wildly so), because the same observations are used both to derive the rule and to assess it. Better estimates of misclassification rates in discriminant analysis can be defined in a variety of ways (see Hand [1997]). One method that is commonly used is the so-called leaving one out method, in which the discriminant function is first derived from only n – 1 sample members, and then used to classify the observation not included. The procedure is repeated n times, each time omitting a different observation.
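The difference between the two estimates can be sketched in a few lines. The example below is a deliberate simplification: it uses only one variable (upper face height, X4, from Display 15.1), so that the allocation rule reduces to comparing a value with the midpoint of the two group means. It is not the five-variable discriminant function used later in the chapter, and the function names are illustrative.

```python
# Upper face height (X4) for the two skull types in Display 15.1.
type_a = [73.5, 63.0, 69.5, 64.5, 77.5, 71.5, 70.5, 73.5, 70.0,
          62.0, 71.0, 65.0, 74.5, 76.5, 68.0, 70.5, 66.0]
type_b = [76.0, 74.0, 72.5, 76.5, 79.0, 71.5, 79.5, 76.5, 76.0,
          82.5, 81.5, 74.0, 78.5, 80.5, 68.5]


def mean(values):
    return sum(values) / len(values)


def assign(x, group_a, group_b):
    """Allocate x to the group whose mean lies on the same side of the midpoint cutoff."""
    cutoff = (mean(group_a) + mean(group_b)) / 2
    on_a_side = (x < cutoff) == (mean(group_a) < mean(group_b))
    return "A" if on_a_side else "B"


def resubstitution_errors(a, b):
    """Derive the rule from all observations, then classify those same observations."""
    errors = sum(assign(x, a, b) != "A" for x in a)
    errors += sum(assign(x, a, b) != "B" for x in b)
    return errors


def leave_one_out_errors(a, b):
    """Derive the rule from n - 1 observations and classify the one left out."""
    errors = 0
    for i, x in enumerate(a):
        errors += assign(x, a[:i] + a[i + 1:], b) != "A"
    for i, x in enumerate(b):
        errors += assign(x, a, b[:i] + b[i + 1:]) != "B"
    return errors


n = len(type_a) + len(type_b)
resub = resubstitution_errors(type_a, type_b)
loo = leave_one_out_errors(type_a, type_b)
print(f"resubstitution error rate: {resub / n:.3f}")
print(f"leave-one-out error rate:  {loo / n:.3f}")
```

Comparing the two rates shows how much, if at all, the resubstitution estimate flatters the rule for a given set of data.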
15.3 Analysis Using SAS
The data from Display 15.1 can be read in as follows:
data skulls;
   infile 'n:\handbook2\datasets\tibetan.dat' expandtabs;
   input length width height faceheight facewidth;
   if _n_ < 18 then type='A';
   else type='B';
run;
A parametric discriminant analysis can be specified as follows:
proc discrim data=skulls pool=test simple manova wcov crossvalidate;
   class type;
   var length--facewidth;
run;
The option pool=test provides a test of the equality of the within-group covariance matrices. If the test is significant beyond a level specified by slpool, then a quadratic rather than a linear discriminant function is derived. The default value of slpool is 0.1.
The manova option provides a test of the equality of the mean vectors of the two groups. Clearly, if there is no difference, a discriminant analysis is mostly a waste of time.
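With only two groups, the four multivariate test criteria are all functions of the generalized squared distance between the group means, which is why they share a single F-value. The sketch below illustrates this using the standard two-group relationship between Hotelling's T² and the squared distance; the value 3.50144 is the pairwise generalized squared distance reported in Display 15.2, and the variable names are illustrative.

```python
n1, n2, p = 17, 15, 5   # group sizes and number of variables
d2 = 3.50144            # generalized squared distance between types A and B

# Hotelling's T^2 divided by the within-classes degrees of freedom
# gives the Hotelling-Lawley trace in the two-group case.
t2 = (n1 * n2 / (n1 + n2)) * d2
hotelling_lawley = t2 / (n1 + n2 - 2)

# With a single nonzero eigenvalue, the other criteria follow directly.
wilks_lambda = 1 / (1 + hotelling_lawley)
pillai_trace = hotelling_lawley / (1 + hotelling_lawley)
roy_root = hotelling_lawley

# Exact F statistic on p and n1 + n2 - p - 1 degrees of freedom.
f_value = hotelling_lawley * (n1 + n2 - p - 1) / p

print(round(wilks_lambda, 4), round(pillai_trace, 4),
      round(hotelling_lawley, 4), round(f_value, 2))
```

These values can be compared with the Multivariate Statistics and Exact F Statistics table in Display 15.2.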
The simple option provides useful summary statistics, both overall and within groups; wcov gives the within-group covariance matrices; the crossvalidate option is discussed later in the chapter; the class statement names the variable that defines the groups; and the var statement names the variables to be used to form the discriminant function.
The output is shown in Display 15.2, which includes the results of the test of the equality of the within-group covariance matrices. The chi-squared test of the equality of the two covariance matrices is not significant at the 0.1 level, and thus a linear discriminant function is derived. The results of the multivariate analysis of variance are also shown in Display 15.2. Because there are only two groups here, all four test criteria lead to the same F-value, which is significant well beyond the 5% level.
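The chi-squared statistic for the homogeneity test can be reproduced approximately from the log determinants listed under Within Covariance Matrix Information in Display 15.2. The sketch below assumes the formula printed in the display, which simplifies algebraically to RHO × [N ln|S| – Σ N(i) ln|S_i|]; the variable names are illustrative, and small discrepancies arise because the log determinants are rounded to five decimals.

```python
p, k = 5, 2                                   # number of variables, number of groups
df_within = {"A": 16, "B": 14}                # N(i): group size minus one
log_det = {"A": 16.16370, "B": 15.77333}      # ln|S_i| from Display 15.2
log_det_pooled = 16.72724                     # ln|S| for the pooled matrix

n_total = sum(df_within.values())             # N: observations minus groups

# -2 ln of the likelihood-ratio-type quantity, written in terms of log determinants.
m = n_total * log_det_pooled - sum(df_within[g] * log_det[g] for g in df_within)

# Correction factor RHO as defined in the display.
rho = 1.0 - (sum(1 / v for v in df_within.values()) - 1 / n_total) \
    * (2 * p * p + 3 * p - 1) / (6 * (p + 1) * (k - 1))

chi_square = rho * m
df = (k - 1) * p * (p + 1) // 2

print(round(chi_square, 2), df)
```

The statistic is then referred to a chi-squared distribution with df degrees of freedom and compared with the slpool level to decide between a pooled and a separate covariance matrix.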
The results defining the discriminant function are given in Display 15.2. The two sets of coefficients given need to be subtracted (type A minus type B) to give the discriminant function in the form described in Section 15.2.
This leads to:
$$ a' = [-0.0893,\ 0.1558,\ 0.0052,\ -0.1772,\ -0.1774] \qquad (15.7) $$
The group means on the discriminant function are $\bar{z}_1 = -28.713$ and $\bar{z}_2 = -32.214$, leading to a value of $z_c = -30.463$.
Thus, for example, a skull having a vector of measurements x′ = [185, 142, 130, 72, 133] has a discriminant score of –30.07; zi – zc in this case is therefore 0.39, and the skull should be assigned to group 1.
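The arithmetic of this example can be sketched as follows. The coefficients are obtained by subtracting the type B column from the type A column of the Linear Discriminant Function table in Display 15.2, and the group means of z are the values quoted in the text. The rule allocates to group 1 (type A) when the score exceeds zc, which is the direction implied by the group-1 mean being the larger of the two. The function name classify is illustrative.

```python
# Columns of the "Linear Discriminant Function for type" table in Display 15.2.
const_a, const_b = -514.26257, -544.72605
coef_a = [1.46831, 2.36106, 2.75219, 0.77530, 0.19475]
coef_b = [1.55762, 2.20528, 2.74696, 0.95250, 0.37216]

# Fisher's coefficients: difference between the two columns (A minus B).
a = [ca - cb for ca, cb in zip(coef_a, coef_b)]

# Group means of the discriminant scores quoted in the text, and their midpoint.
z_bar_1, z_bar_2 = -28.713, -32.214
z_c = (z_bar_1 + z_bar_2) / 2


def classify(x):
    """Return the discriminant score and the allocated group (1 = type A, 2 = type B)."""
    z = sum(ai * xi for ai, xi in zip(a, x))
    return z, (1 if z - z_c > 0 else 2)


skull = [185, 142, 130, 72, 133]
z, group = classify(skull)
print(f"score = {z:.2f}, score - cutoff = {z - z_c:.2f}, group = {group}")

# The difference between the two printed constants equals -z_c up to rounding,
# so comparing the two linear functions directly gives the same allocation.
print(round(const_a - const_b, 3), round(-z_c, 3))
```

Because the subtraction here uses five-decimal coefficients rather than the four-decimal values in Eq. (15.7), the score may differ from the value quoted in the text in the second decimal place; the allocation is unaffected.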

The DISCRIM Procedure

Observations     32     DF Total               31
Variables         5     DF Within Classes      30
Classes           2     DF Between Classes      1

Class Level Information

         Variable                                            Prior
type     Name        Frequency     Weight     Proportion     Probability
A        A                  17    17.0000       0.531250        0.500000
B        B                  15    15.0000       0.468750        0.500000

The DISCRIM Procedure
Within-Class Covariance Matrices

type = A,  DF = 16

Variable           length          width         height     faceheight      facewidth
length        45.52941176    25.22242647    12.39062500    22.15441176    27.97242647
width         25.22242647    57.80514706    11.87500000     7.51930147    48.05514706
height        12.39062500    11.87500000    36.09375000    -0.31250000     1.40625000
faceheight    22.15441176     7.51930147    -0.31250000    20.93566176    16.76930147
facewidth     27.97242647    48.05514706     1.40625000    16.76930147    66.21139706
--------------------------------------------------------------------------------------------------

type = B,  DF = 14

Variable           length          width         height     faceheight      facewidth
length        74.42380952    -9.52261905    22.73690476    17.79404762    11.12500000
width         -9.52261905    37.35238095   -11.26309524     0.70476190     9.46428571
height        22.73690476   -11.26309524    36.31666667    10.72380952     7.19642857
faceheight    17.79404762     0.70476190    10.72380952    15.30238095     8.66071429
facewidth     11.12500000     9.46428571     7.19642857     8.66071429    17.96428571
--------------------------------------------------------------------------------------------------
The DISCRIM Procedure
Simple Statistics
Total-Sample

                                                        Standard
Variable        N     Sum         Mean     Variance    Deviation
length         32    5758    179.93750     87.70565       9.3651
width          32    4450    139.06250     46.80242       6.8412
height         32    4266    133.29688     36.99773       6.0826
faceheight     32    2334     72.93750     29.06048       5.3908
facewidth      32    4279    133.70313     55.41709       7.4443
--------------------------------------------------------------------------------------------------
type = A

                                                        Standard
Variable        N     Sum         Mean     Variance    Deviation
length         17    2972    174.82353     45.52941       6.7475
width          17    2369    139.35294     57.80515       7.6030
height         17    2244    132.00000     36.09375       6.0078
faceheight     17    1187     69.82353     20.93566       4.5756
facewidth      17    2216    130.35294     66.21140       8.1370
--------------------------------------------------------------------------------------------------
type = B

                                                        Standard
Variable        N     Sum         Mean     Variance    Deviation
length         15    2786    185.73333     74.42381       8.6269
width          15    2081    138.73333     37.35238       6.1117
height         15    2022    134.76667     36.31667       6.0263
faceheight     15    1147     76.46667     15.30238       3.9118
facewidth      15    2063    137.50000     17.96429       4.2384
--------------------------------------------------------------------------------------------------
Within Covariance Matrix Information

                              Natural Log of the
           Covariance         Determinant of the
type       Matrix Rank        Covariance Matrix
A               5                  16.16370
B               5                  15.77333
Pooled          5                  16.72724

The DISCRIM Procedure
Test of Homogeneity of Within Covariance Matrices

Notation:  K    = Number of Groups
           P    = Number of Variables
           N    = Total Number of Observations - Number of Groups
           N(i) = Number of Observations in the i'th Group - 1

$$ V = \frac{\prod_i |\text{Within SS Matrix}(i)|^{N(i)/2}}{|\text{Pooled SS Matrix}|^{N/2}} $$
$$ \text{RHO} = 1.0 - \left[ \sum_i \frac{1}{N(i)} - \frac{1}{N} \right] \frac{2P^2 + 3P - 1}{6\,(P+1)\,(K-1)} $$
$$ \text{DF} = 0.5\,(K-1)\,P\,(P+1) $$
Under the null hypothesis:
$$ -2\ \text{RHO}\ \ln \left[ \frac{N^{PN/2}\ V}{\prod_i N(i)^{PN(i)/2}} \right] $$
is distributed approximately as Chi-Square(DF).

Chi-Square      DF     Pr > ChiSq
 18.370512      15         0.2437

Since the Chi-Square value is not significant at the 0.1 level, a pooled covariance matrix will be used in the discriminant function.
Reference: Morrison, D.F. (1976) Multivariate Statistical Methods p252.
The DISCRIM Procedure
Pairwise Generalized Squared Distances Between Groups
$$ D^2(i|j) = (\bar{X}_i - \bar{X}_j)'\ \text{COV}^{-1}\ (\bar{X}_i - \bar{X}_j) $$

Generalized Squared Distance to type

From type            A            B
A                    0      3.50144
B              3.50144            0

The DISCRIM Procedure
Multivariate Statistics and Exact F Statistics

S=1    M=1.5    N=12

Statistic                       Value     F Value    Num DF    Den DF    Pr > F
Wilks' Lambda              0.51811582        4.84         5        26    0.0029
Pillai's Trace             0.48188418        4.84         5        26    0.0029
Hotelling-Lawley Trace     0.93007040        4.84         5        26    0.0029
Roy's Greatest Root        0.93007040        4.84         5        26    0.0029

Linear Discriminant Function

$$ \text{Constant} = -0.5\ \bar{X}_j'\ \text{COV}^{-1}\ \bar{X}_j \qquad \text{Coefficient Vector} = \text{COV}^{-1}\ \bar{X}_j $$

Linear Discriminant Function for type

Variable                A               B
Constant       -514.26257      -544.72605
length            1.46831         1.55762
width             2.36106         2.20528
height            2.75219         2.74696
faceheight        0.77530         0.95250
facewidth         0.19475         0.37216

The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SKULLS
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function
$$ D_j^2(X) = (X - \bar{X}_j)'\ \text{COV}^{-1}\ (X - \bar{X}_j) $$

Posterior Probability of Membership in Each type
$$ \Pr(j|X) = \frac{\exp(-0.5\, D_j^2(X))}{\sum_k \exp(-0.5\, D_k^2(X))} $$

Number of Observations and Percent Classified into type

From type          A          B      Total
A                 14          3         17
               82.35      17.65     100.00
B                  3         12         15
               20.00      80.00     100.00
Total             17         15         32
               53.13      46.88     100.00
Priors           0.5        0.5

Error Count Estimates for type

                   A          B      Total
Rate          0.1765     0.2000     0.1882
Priors        0.5000     0.5000

The DISCRIM Procedure
Classification Summary for Calibration Data: WORK.SKULLS
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function
$$ D_j^2(X) = (X - \bar{X}_{(X)j})'\ \text{COV}_{(X)}^{-1}\ (X - \bar{X}_{(X)j}) $$

Posterior Probability of Membership in Each type
$$ \Pr(j|X) = \frac{\exp(-0.5\, D_j^2(X))}{\sum_k \exp(-0.5\, D_k^2(X))} $$