Number of Observations and Percent Classified into type

From type         A          B      Total
A                12          5         17
              70.59      29.41     100.00
B                 6          9         15
              40.00      60.00     100.00
Total            18         14         32
              56.25      43.75     100.00
Priors          0.5        0.5

Error Count Estimates for type

                  A          B      Total
Rate         0.2941     0.4000     0.3471
Priors       0.5000     0.5000
Display 15.2
The resubstitution estimate of the misclassification rate of the derived allocation rule is seen from Display 15.2 to be 18.82%, but the leaving-one-out (cross-validation) approach increases this to a more realistic 34.71%.
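The cross-validation figure can be checked by hand from the counts in Display 15.2. The sketch below assumes that the Total error rate is the prior-weighted average of the two per-class rates; that assumption reproduces the printed value.

```latex
\begin{align*}
\text{rate}_A &= 5/17 \approx 0.2941, \qquad \text{rate}_B = 6/15 = 0.4000, \\
\text{total}  &= 0.5 \times 0.2941 + 0.5 \times 0.4000 \approx 0.3471 .
\end{align*}
```

The raw proportion misclassified, 11/32 ≈ 0.3438, differs slightly because the two groups are of unequal size while the priors are equal.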
To identify the most important variables for discrimination, proc stepdisc can be used as follows. The output is shown in Display 15.3.
proc stepdisc data=skulls sle=.05 sls=.05;
   class type;
   var length--facewidth;
run;
The significance levels required for variables to enter and be retained are set with the sle (slentry) and sls (slstay) options, respectively. The default value for both is p=.15. By default, a “stepwise” procedure is used (other choices can be specified with the method= option). Variables are chosen to enter or leave the discriminant function according to one of two criteria:
The significance level of an F-test from an analysis of covariance, where the variables already chosen act as covariates and the variable under consideration is the dependent variable.
The squared partial correlation for predicting the variable under consideration from the class variable, controlling for the effects of the variables already chosen.
The significance level and the squared partial correlation criteria select variables in the same order, although they may select different numbers of variables. Increasing the sample size tends to increase the number of variables selected when using significance levels, but has little effect on the number selected when using squared partial correlations.
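As a sketch of how the selection method might be changed, the following run requests forward selection rather than the default stepwise search. The choice of method=forward is purely illustrative and is not an analysis reported in the text; with forward selection, variables are never removed once entered, so only the entry level (sle) is given here.

```sas
/* Illustrative only: forward selection on the skull data */
proc stepdisc data=skulls method=forward sle=.05;
   class type;
   var length--facewidth;
run;
```

Backward elimination (method=backward) would instead start from all five variables and rely on the sls value to decide which to drop.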
At Step 1 in Display 15.3, the variable faceheight has the highest R-square value and is the first variable selected. At Step 2, none of the partial R-square values of the other variables meets the criterion for inclusion, and the process therefore ends. The tolerance shown for each variable is one minus the squared multiple correlation of the variable with the other variables already selected. A variable can only be entered if its tolerance is above the value specified by the singular= option; the default is 1.0E-8.
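Several entries in Display 15.3 can be checked by hand. The sketch below assumes the usual relation between an R-square and its F statistic on 1 and ν degrees of freedom; the formula itself is not quoted in the text.

```latex
\begin{align*}
F &= \frac{R^2}{1 - R^2}\,\nu \\
\text{Step 1, faceheight } (\nu = 30):\quad F &= \frac{0.3904}{1 - 0.3904} \times 30 \approx 19.21 \\
\text{Step 2, length } (\nu = 29):\quad F &= \frac{0.0541}{1 - 0.0541} \times 29 \approx 1.66 \\
\text{Wilks' lambda after Step 1}:\quad \Lambda &= 1 - 0.390366 = 0.609634 \\
\text{Tolerance of length at Step 2}:\quad 0.4304 &= 1 - R^2 \;\Rightarrow\; R^2 \approx 0.5696
\end{align*}
```

The last line says that roughly 57% of the variation in length is shared with faceheight, which is consistent with length having a large R-square at Step 1 but adding little once faceheight is in the model.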
The STEPDISC Procedure

The Method for Selecting Variables is STEPWISE

Observations     32     Variable(s) in the Analysis        5
Class Levels      2     Variable(s) will be Included       0
                        Significance Level to Enter     0.05
                        Significance Level to Stay      0.05

Class Level Information

        Variable
type    Name       Frequency     Weight    Proportion
A       A                 17    17.0000      0.531250
B       B                 15    15.0000      0.468750
The STEPDISC Procedure

Stepwise Selection: Step 1

Statistics for Entry, DF = 1, 30

Variable      R-Square   F Value   Pr > F   Tolerance
length          0.3488     16.07   0.0004      1.0000
width           0.0021      0.06   0.8029      1.0000
height          0.0532      1.69   0.2041      1.0000
faceheight      0.3904     19.21   0.0001      1.0000
facewidth       0.2369      9.32   0.0047      1.0000

Variable faceheight will be entered.

Variable(s) that have been Entered

faceheight

Multivariate Statistics

Statistic                                  Value   F Value   Num DF   Den DF   Pr > F
Wilks' Lambda                           0.609634     19.21        1       30   0.0001
Pillai's Trace                          0.390366     19.21        1       30   0.0001
Average Squared Canonical Correlation   0.390366
The STEPDISC Procedure

Stepwise Selection: Step 2

Statistics for Removal, DF = 1, 30

Variable      R-Square   F Value   Pr > F
faceheight      0.3904     19.21   0.0001

No variables can be removed.

Statistics for Entry, DF = 1, 29

               Partial
Variable      R-Square   F Value   Pr > F   Tolerance
length          0.0541      1.66   0.2081      0.4304
width           0.0162      0.48   0.4945      0.9927
height          0.0047      0.14   0.7135      0.9177
facewidth       0.0271      0.81   0.3763      0.6190

No variables can be entered.

No further steps are possible.
The STEPDISC Procedure

Stepwise Selection Summary

                                                                                  Averaged
                                                                                  Squared
      Number                       Partial                    Wilks'      Pr <    Canonical    Pr >
Step  In      Entered     Removed  R-Square  F Value  Pr > F  Lambda      Lambda  Correlation  ASCC
1     1       faceheight           0.3904    19.21    0.0001  0.60963388  0.0001  0.39036612   0.0001
Display 15.3
Details of the “discriminant function” using only faceheight are found as follows:
proc discrim data=skulls crossvalidate;
   class type;
   var faceheight;
run;
The output is shown in Display 15.4. Here, the coefficient of faceheight in each class is simply the mean of the class on faceheight divided by the pooled within-group variance of the variable. The resubstitution and leaving-one-out methods of estimating the misclassification rate give the same value of 24.71%.
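The relation between the coefficients, the class means, and the pooled variance can be checked against Display 15.4. The class means are not printed there, so in this sketch they are backed out from the coefficients; the pooled variance is taken as the exponential of the printed log determinant, which with a single variable is the variance itself. All derived values are approximate.

```latex
\begin{align*}
s^2 &= e^{2.90727} \approx 18.31 \\
\bar{x}_A &\approx 3.81408 \times 18.31 \approx 69.8, \qquad
\bar{x}_B \approx 4.17695 \times 18.31 \approx 76.5 \\
\text{Constant}_A &= -\tfrac{1}{2}\,\bar{x}_A^{2}/s^{2} \approx -133.2, \qquad
\text{Constant}_B = -\tfrac{1}{2}\,\bar{x}_B^{2}/s^{2} \approx -159.7 \\
D^2(A|B) &= (\bar{x}_A - \bar{x}_B)^2 / s^2 \approx 2.41 \\
\text{allocate to B if}\quad x &> \frac{159.69891 - 133.15615}{4.17695 - 3.81408} \approx 73.1
\end{align*}
```

With equal priors, the cut-off in the last line is the faceheight value at which the two linear discriminant functions are equal; it lies midway between the two class means.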
The DISCRIM Procedure

Observations    32     DF Total               31
Variables        1     DF Within Classes      30
Classes          2     DF Between Classes      1

Class Level Information

        Variable                                          Prior
type    Name       Frequency     Weight    Proportion    Probability
A       A                 17    17.0000      0.531250       0.500000
B       B                 15    15.0000      0.468750       0.500000

Pooled Covariance Matrix Information

                Natural Log of the
Covariance      Determinant of the
Matrix Rank     Covariance Matrix
1               2.90727
The DISCRIM Procedure

Pairwise Generalized Squared Distances Between Groups

$D^2(i|j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (\bar{X}_i - \bar{X}_j)$

Generalized Squared Distance to type

From type           A           B
A                   0     2.41065
B             2.41065           0

Linear Discriminant Function

$\text{Constant} = -.5 \, \bar{X}_j' \, \mathrm{COV}^{-1} \, \bar{X}_j \qquad \text{Coefficient Vector} = \mathrm{COV}^{-1} \, \bar{X}_j$

Linear Discriminant Function for type

Variable               A             B
Constant      -133.15615    -159.69891
faceheight       3.81408       4.17695
The DISCRIM Procedure

Classification Summary for Calibration Data: WORK.SKULLS
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (X - \bar{X}_j)$

Posterior Probability of Membership in Each type

$\Pr(j|X) = \exp(-.5\,D_j^2(X)) \,/\, \sum_k \exp(-.5\,D_k^2(X))$

Number of Observations and Percent Classified into type

From type         A          B      Total
A                12          5         17
              70.59      29.41     100.00
B                 3         12         15
              20.00      80.00     100.00
Total            15         17         32
              46.88      53.13     100.00
Priors          0.5        0.5

Error Count Estimates for type

                  A          B      Total
Rate         0.2941     0.2000     0.2471
Priors       0.5000     0.5000
The DISCRIM Procedure

Classification Summary for Calibration Data: WORK.SKULLS
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_{(X)j})' \, \mathrm{COV}_{(X)}^{-1} \, (X - \bar{X}_{(X)j})$

Posterior Probability of Membership in Each type

$\Pr(j|X) = \exp(-.5\,D_j^2(X)) \,/\, \sum_k \exp(-.5\,D_k^2(X))$

Number of Observations and Percent Classified into type

From type         A          B      Total
A                12          5         17
              70.59      29.41     100.00
B                 3         12         15
              20.00      80.00     100.00
Total            15         17         32
              46.88      53.13     100.00
Priors          0.5        0.5

Error Count Estimates for type

                  A          B      Total
Rate         0.2941     0.2000     0.2471
Priors       0.5000     0.5000
Display 15.4
Exercises
15.1 Use the posterr option in proc discrim to estimate error rates for the discriminant functions derived for the skull data. Compare these with those given in Displays 15.2 and 15.4.
15.2 Investigate the use of the nonparametric discriminant methods available in proc discrim for the skull data. Compare the results with those for the simple linear discriminant function given in the text.
Chapter 16
Correspondence Analysis: Smoking and Motherhood, Sex and the Single Girl, and European Stereotypes
16.1 Description of Data
Three sets of data are considered in this chapter, all of which arise in the form of two-dimensional contingency tables as met previously in Chapter 3. The three data sets are given in Displays 16.1, 16.2, and 16.3; details are as follows.
Display 16.1: These data involve the association between a girl’s age and her relationship with her boyfriend.
Display 16.2: These data show the distribution of birth outcomes by age of mother, length of gestation, and whether or not the mother smoked during the prenatal period. We consider the data as a two-dimensional contingency table with four row categories and four column categories.
Display 16.3: These data were obtained by asking a large number of people in the U.K. which of 13 characteristics they would associate with the nationals of the U.K.’s partner countries in the European Community. Entries in the table give the percentages of respondents agreeing that the nationals of a particular country possess the particular characteristic.
                                                  Age Group
                                   Under 16   16–17   17–18   18–19   19–20
No boyfriend                             21      21      14      13       8
Boyfriend/No sexual intercourse           8       9       6       8       2
Boyfriend/Sexual intercourse              2       3       4      10      10
Display 16.1
                          Premature                  Full-Term
                    Died in     Alive at       Died in     Alive at
                    1st year    year 1         1st year    year 1
Young mothers
   Non-smokers            50         315             24        4012
   Smokers                 9          40              6         459
Old mothers
   Non-smokers            41         147             14        1594
   Smokers                 4          11              1         124
Display 16.2
                               Characteristic
Country     1    2    3    4    5    6    7    8    9   10   11   12   13
France     37   29   21   19   10   10    8    8    6    6    5    2    1
Spain       7   14    8    9   27    7    3    7    3   23   12    1    3
Italy      30   12   19   10   20    7   12    6    5   13   10    1    2
U.K.        9   14    4    6   27   12    2   13   26   16   29    6   25
Ireland     1    7    1   16   30    3   10    9    5   11   22    2   27
Holland     5    4    2    2   15    2    0   13   24    1   28    4    6
Germany     4   48    1   12    3    9    2   11   41    1   38    8    8
Note: Characteristics: (1) stylish; (2) arrogant; (3) sexy; (4) devious; (5) easy-going; (6) greedy; (7) cowardly; (8) boring; (9) efficient; (10) lazy; (11) hard working; (12) clever; (13) courageous.
Display 16.3
16.2 Displaying Contingency Table Data Graphically Using Correspondence Analysis
Correspondence analysis is a technique for displaying the associations among a set of categorical variables in a type of scatterplot or map, thus allowing a visual examination of the structure or pattern of these associations. A correspondence analysis should ideally be seen as an extremely useful supplement to, rather than a replacement for, the more formal inferential procedures generally used with categorical data (see Chapters 3 and 8). The aim when using correspondence analysis is nicely summarized in the following quotation from Greenacre (1992):
An important aspect of correspondence analysis which distinguishes it from more conventional statistical methods is that it is not a confirmatory technique, trying to prove a hypothesis, but rather an exploratory technique, trying to reveal the data content. One can say that it serves as a window onto the data, allowing researchers easier access to their numerical results and facilitating discussion of the data and possibly generating hypotheses which can be formally tested at a later stage.