Handbook_of_statistical_analysis_using_SAS


Number of Observations and Percent Classified into type

From type        A         B     Total
A               12         5        17
             70.59     29.41    100.00
B                6         9        15
             40.00     60.00    100.00
Total           18        14        32
             56.25     43.75    100.00
Priors         0.5       0.5

Error Count Estimates for type

                 A         B     Total
Rate        0.2941    0.4000    0.3471
Priors      0.5000    0.5000

Display 15.2

The resubstitution estimate of the misclassification rate of the derived allocation rule is seen from Display 15.2 to be 18.82%, but the leaving-one-out (cross-validation) approach increases this to a more realistic 34.71%.
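As a check on the cross-validation figure, the total error rate in Display 15.2 is the prior-weighted average of the two class error rates. The symbols $\pi_j$ (prior probability) and $e_j$ (class error rate) are notation introduced here for illustration and do not appear in the SAS output:

```latex
\[
\hat{e} \;=\; \sum_{j \in \{A,B\}} \pi_j \, e_j
        \;=\; 0.5 \times \frac{5}{17} \;+\; 0.5 \times \frac{6}{15}
        \;=\; 0.5\,(0.2941) + 0.5\,(0.4000)
        \;\approx\; 0.3471 .
\]
```

Because equal priors are used, this differs slightly from the raw proportion of misclassified skulls, 11/32 ≈ 0.3438.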

To identify the most important variables for discrimination, proc stepdisc can be used as follows. The output is shown in Display 15.3.

proc stepdisc data=skulls sle=.05 sls=.05;
   class type;
   var length--facewidth;
run;

The significance levels required for variables to enter and be retained are set with the sle (slentry) and sls (slstay) options, respectively. The default value for both is p=.15. By default, a “stepwise” procedure is used (other methods can be requested with the method= option on the proc statement). Variables are chosen to enter or leave the discriminant function according to one of two criteria:

©2002 CRC Press LLC

The significance level of an F-test from an analysis of covariance, where the variables already chosen act as covariates and the variable under consideration is the dependent variable.

The squared partial correlation for predicting the variable under consideration from the class variable, controlling for the effects of the variables already chosen.

The significance level and the squared partial correlation criteria select variables in the same order, although they may select different numbers of variables. Increasing the sample size tends to increase the number of variables selected when using significance levels, but has little effect on the number selected when using squared partial correlations.

At Step 1 in Display 15.3, the variable faceheight has the highest R² value and is the first variable selected. At Step 2, none of the partial R² values of the other variables meets the criterion for inclusion and the process therefore ends. The tolerance shown for each variable is one minus the squared multiple correlation of the variable with the other variables already selected. A variable can only be entered if its tolerance is above the value specified with the singular= option. The value set by default is 1.0E-8.
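The options discussed above can be combined on the proc statement. The following is a minimal sketch, not taken from the text, that assumes the same skulls data set and requests forward selection in place of the default stepwise search; singular=1e-8 simply restates the default tolerance threshold so that it is visible in the code.

```sas
/* Sketch only: forward selection rather than the default stepwise method. */
/* slentry= is the long form of sle=; slstay= is not used by method=forward. */
proc stepdisc data=skulls method=forward slentry=.05 singular=1e-8;
   class type;
   var length--facewidth;
run;
```

With method=forward, variables are only ever added, so the "Statistics for Removal" table seen in the stepwise output would not be expected.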

The STEPDISC Procedure

The Method for Selecting Variables is STEPWISE

Observations     32     Variable(s) in the Analysis      5
Class Levels      2     Variable(s) will be Included     0
                        Significance Level to Enter   0.05
                        Significance Level to Stay    0.05

Class Level Information

        Variable
type    Name        Frequency     Weight    Proportion
A       A                  17    17.0000      0.531250
B       B                  15    15.0000      0.468750


The STEPDISC Procedure

Stepwise Selection: Step 1

Statistics for Entry, DF = 1, 30

Variable      R-Square    F Value    Pr > F    Tolerance
length          0.3488      16.07    0.0004       1.0000
width           0.0021       0.06    0.8029       1.0000
height          0.0532       1.69    0.2041       1.0000
faceheight      0.3904      19.21    0.0001       1.0000
facewidth       0.2369       9.32    0.0047       1.0000

Variable faceheight will be entered.

Variable(s) that have been Entered

faceheight

Multivariate Statistics

Statistic                                   Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda                            0.609634      19.21         1        30    0.0001
Pillai's Trace                           0.390366      19.21         1        30    0.0001
Average Squared Canonical Correlation    0.390366

The STEPDISC Procedure

Stepwise Selection: Step 2

Statistics for Removal, DF = 1, 30

Variable      R-Square    F Value    Pr > F
faceheight      0.3904      19.21    0.0001

No variables can be removed.

Statistics for Entry, DF = 1, 29

               Partial
Variable      R-Square    F Value    Pr > F    Tolerance
length          0.0541       1.66    0.2081       0.4304
width           0.0162       0.48    0.4945       0.9927
height          0.0047       0.14    0.7135       0.9177
facewidth       0.0271       0.81    0.3763       0.6190

No variables can be entered.

No further steps are possible.

The STEPDISC Procedure

Stepwise Selection Summary

                                                                                         Averaged
                                                                                         Squared
      Number                           Partial        F              Wilks'      Pr <    Canonical     Pr >
Step      In  Entered     Removed     R-Square    Value   Pr > F     Lambda    Lambda    Correlation   ASCC
   1       1  faceheight                0.3904    19.21   0.0001 0.60963388    0.0001    0.39036612  0.0001

Display 15.3
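The F values in Display 15.3 can be recovered from the corresponding R² (or partial R²) values. The following is a sketch of that check; $n = 32$ observations, $g = 2$ classes, and $p$ (the number of variables already selected) are notation assumed here rather than taken from the output:

```latex
\[
F \;=\; \frac{R^2/(g-1)}{(1-R^2)/(n-g-p)}
\]
\[
\text{Step 1 } (p=0),\ \texttt{faceheight}:\quad
F = \frac{0.3904}{1-0.3904}\times 30 \approx 19.21
\]
\[
\text{Step 2 } (p=1),\ \texttt{length}:\quad
F = \frac{0.0541}{1-0.0541}\times 29 \approx 1.66
\]
\[
\text{Wilks' } \Lambda \text{ after Step 1}:\quad
\Lambda = 1 - 0.390366 = 0.609634
\]
```

The denominator degrees of freedom drop from 30 to 29 at Step 2 because faceheight is by then acting as a covariate.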

Details of the “discriminant function” using only faceheight are found as follows:

proc discrim data=skulls crossvalidate;
   class type;
   var faceheight;
run;

The output is shown in Display 15.4. Here, the coefficient of faceheight in each class is simply the class mean of faceheight divided by the pooled within-group variance of the variable. The resubstitution and leaving-one-out methods of estimating the misclassification rate give the same value of 24.71%.
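The relationship between the coefficients, the constants, and the pooled variance can be checked with a short data step. This is a sketch that uses only numbers copied from Display 15.4; the variable names are hypothetical, and the class means are back-calculated from the rounded output rather than quoted from the text.

```sas
/* Check the one-variable linear discriminant function in Display 15.4. */
data _null_;
   s2      = exp(2.90727);       /* pooled variance: exp of the log determinant   */
   coef_a  = 3.81408;            /* faceheight coefficient for type A             */
   coef_b  = 4.17695;            /* faceheight coefficient for type B             */
   mean_a  = coef_a * s2;        /* implied class means (coefficient = mean / s2) */
   mean_b  = coef_b * s2;
   const_a = -0.5 * mean_a * coef_a;       /* -.5 * mean**2 / s2                  */
   const_b = -0.5 * mean_b * coef_b;
   d2      = (mean_a - mean_b)**2 / s2;    /* generalized squared distance        */
   cutoff  = (mean_a + mean_b) / 2;        /* equal-priors boundary on faceheight */
   put s2= 8.3 mean_a= 8.2 mean_b= 8.2;
   put const_a= 10.3 const_b= 10.3 d2= 8.3 cutoff= 8.2;
run;
```

Up to rounding in the displayed coefficients, the log should show constants close to -133.16 and -159.70 and a squared distance close to 2.41, in agreement with the display. With equal priors, a skull is allocated to type B when its faceheight exceeds the midpoint of the two class means.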

 

 

The DISCRIM Procedure

Observations     32     DF Total               31
Variables         1     DF Within Classes      30
Classes           2     DF Between Classes      1

Class Level Information

        Variable                                           Prior
type    Name        Frequency     Weight    Proportion    Probability
A       A                  17    17.0000      0.531250       0.500000
B       B                  15    15.0000      0.468750       0.500000

Pooled Covariance Matrix Information

                  Natural Log of the
Covariance        Determinant of the
Matrix Rank       Covariance Matrix
          1                  2.90727

The DISCRIM Procedure

Pairwise Generalized Squared Distances Between Groups

$D^2(i|j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (\bar{X}_i - \bar{X}_j)$

Generalized Squared Distance to type

From type           A           B
A                   0     2.41065
B             2.41065           0

Linear Discriminant Function

$\text{Constant} = -.5 \, \bar{X}_j' \, \mathrm{COV}^{-1} \, \bar{X}_j \qquad \text{Coefficient Vector} = \mathrm{COV}^{-1} \, \bar{X}_j$

Linear Discriminant Function for type

Variable               A             B
Constant      -133.15615    -159.69891
faceheight       3.81408       4.17695

The DISCRIM Procedure

Classification Summary for Calibration Data: WORK.SKULLS
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (X - \bar{X}_j)$

Posterior Probability of Membership in Each type

$\Pr(j|X) = \exp(-.5 \, D_j^2(X)) \,/\, \sum_k \exp(-.5 \, D_k^2(X))$

Number of Observations and Percent Classified into type

From type        A         B     Total
A               12         5        17
             70.59     29.41    100.00
B                3        12        15
             20.00     80.00    100.00
Total           15        17        32
             46.88     53.13    100.00
Priors         0.5       0.5

Error Count Estimates for type

                 A         B     Total
Rate        0.2941    0.2000    0.2471
Priors      0.5000    0.5000

The DISCRIM Procedure

Classification Summary for Calibration Data: WORK.SKULLS
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_{(X)j})' \, \mathrm{COV}_{(X)}^{-1} \, (X - \bar{X}_{(X)j})$

Posterior Probability of Membership in Each type

$\Pr(j|X) = \exp(-.5 \, D_j^2(X)) \,/\, \sum_k \exp(-.5 \, D_k^2(X))$

Number of Observations and Percent Classified into type

From type        A         B     Total
A               12         5        17
             70.59     29.41    100.00
B                3        12        15
             20.00     80.00    100.00
Total           15        17        32
             46.88     53.13    100.00
Priors         0.5       0.5

Error Count Estimates for type

                 A         B     Total
Rate        0.2941    0.2000    0.2471
Priors      0.5000    0.5000

Display 15.4

Exercises

15.1 Use the posterr option in proc discrim to estimate error rates for the discriminant functions derived for the skull data. Compare these with those given in Displays 15.2 and 15.4.

15.2 Investigate the use of the nonparametric discriminant methods available in proc discrim for the skull data. Compare the results with those for the simple linear discriminant function given in the text.


Chapter 16

Correspondence Analysis: Smoking and Motherhood, Sex and the Single Girl, and European Stereotypes

16.1 Description of Data

Three sets of data are considered in this chapter, all of which arise in the form of two-dimensional contingency tables as met previously in Chapter 3. The three data sets are given in Displays 16.1, 16.2, and 16.3; details are as follows.

Display 16.1: These data involve the association between a girl’s age and her relationship with her boyfriend.

Display 16.2: These data show the distribution of birth outcomes by age of mother, length of gestation, and whether or not the mother smoked during the prenatal period. We consider the data as a two-dimensional contingency table with four row categories and four column categories.


Display 16.3: These data were obtained by asking a large number of people in the U.K. which of 13 characteristics they would associate with the nationals of the U.K.’s partner countries in the European Community. Entries in the table give the percentages of respondents agreeing that the nationals of a particular country possess the particular characteristic.

 

 

 

 

                                                 Age Group
                                   Under 16   16–17   17–18   18–19   19–20
No boyfriend                             21      21      14      13       8
Boyfriend/No sexual intercourse           8       9       6       8       2
Boyfriend/Sexual intercourse              2       3       4      10      10

Display 16.1

 

 

 

                        Premature                Full-Term
                   Died in    Alive at      Died in    Alive at
                   1st year   year 1        1st year   year 1
Young mothers
   Non-smokers           50        315            24       4012
   Smokers                9         40             6        459
Old mothers
   Non-smokers           41        147            14       1594
   Smokers                4         11             1        124

Display 16.2


 

 

 

 

 

 

 

                              Characteristic
Country     1    2    3    4    5    6    7    8    9   10   11   12   13
France     37   29   21   19   10   10    8    8    6    6    5    2    1
Spain       7   14    8    9   27    7    3    7    3   23   12    1    3
Italy      30   12   19   10   20    7   12    6    5   13   10    1    2
U.K.        9   14    4    6   27   12    2   13   26   16   29    6   25
Ireland     1    7    1   16   30    3   10    9    5   11   22    2   27
Holland     5    4    2    2   15    2    0   13   24    1   28    4    6
Germany     4   48    1   12    3    9    2   11   41    1   38    8    8

Note: Characteristics: (1) stylish; (2) arrogant; (3) sexy; (4) devious; (5) easy-going; (6) greedy; (7) cowardly; (8) boring; (9) efficient; (10) lazy; (11) hard working; (12) clever; (13) courageous.

Display 16.3

16.2 Displaying Contingency Table Data Graphically Using Correspondence Analysis

Correspondence analysis is a technique for displaying the associations among a set of categorical variables in a type of scatterplot or map, thus allowing a visual examination of the structure or pattern of these associations. A correspondence analysis should ideally be seen as an extremely useful supplement to, rather than a replacement for, the more formal inferential procedures generally used with categorical data (see Chapters 3 and 8). The aim when using correspondence analysis is nicely summarized in the following quotation from Greenacre (1992):

An important aspect of correspondence analysis which distinguishes it from more conventional statistical methods is that it is not a confirmatory technique, trying to prove a hypothesis, but rather an exploratory technique, trying to reveal the data content. One can say that it serves as a window onto the data, allowing researchers easier access to their numerical results and facilitating discussion of the data and possibly generating hypothesis which can be formally tested at a later stage.
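To make the idea concrete, the following is a minimal sketch of how the table in Display 16.1 might be read into SAS and passed to proc corresp. The data set names (boyfriends, coords), the variable names, and the shortened row labels are hypothetical choices made for this illustration and are not taken from the text.

```sas
/* Sketch only: Display 16.1 entered as a 3 x 5 table of counts. */
data boyfriends;
   length status $ 8;
   input status $ under16 age16_17 age17_18 age18_19 age19_20;
   datalines;
NoBoy  21 21 14 13  8
NoSex   8  9  6  8  2
Both    2  3  4 10 10
;
run;

/* var lists the table columns; id labels the table rows;     */
/* out= saves the row and column coordinates for plotting.    */
proc corresp data=boyfriends out=coords;
   var under16 age16_17 age17_18 age18_19 age19_20;
   id status;
run;
```

The coordinates saved in the out= data set are what would be plotted to obtain the scatterplot or map described above.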
