
Number of Observations and Percent Classified into type

From type         A         B     Total
A                12         5        17
              70.59     29.41    100.00
B                 6         9        15
              40.00     60.00    100.00
Total            18        14        32
              56.25     43.75    100.00
Priors          0.5       0.5

Error Count Estimates for type

                  A         B     Total
Rate         0.2941    0.4000    0.3471
Priors       0.5000    0.5000

Display 15.2

The resubstitution estimate of the misclassification rate of the derived allocation rule is seen from Display 15.2 to be 18.82%. The leaving-one-out (cross-validation) approach increases this to a more realistic 34.71%.
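The total error rate in the listing is the prior-weighted average of the per-class error rates, not the raw proportion of misclassified skulls. The following is a minimal sketch of that arithmetic in Python, using the cross-validation counts from Display 15.2; the function and variable names are illustrative only and are not part of SAS.

```python
# Cross-validation counts from Display 15.2: rows are the true type,
# columns the type each skull was classified into.
counts = {
    "A": {"A": 12, "B": 5},
    "B": {"A": 6, "B": 9},
}
priors = {"A": 0.5, "B": 0.5}  # equal priors, as shown in the listing


def error_rates(counts, priors):
    """Return per-class error rates and their prior-weighted total."""
    rates = {}
    for true_type, row in counts.items():
        n = sum(row.values())
        wrong = n - row[true_type]
        rates[true_type] = wrong / n
    total = sum(priors[t] * rates[t] for t in rates)
    return rates, total


rates, total = error_rates(counts, priors)
print({t: round(r, 4) for t, r in rates.items()}, round(total, 4))
```

With equal priors the total is the simple average of 0.2941 and 0.4000, that is 0.3471. The raw proportion misclassified, 11/32, is slightly different because the two groups are not the same size.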

To identify the most important variables for discrimination, proc stepdisc can be used as follows. The output is shown in Display 15.3.

proc stepdisc data=skulls sle=.05 sls=.05;
   class type;
   var length--facewidth;
run;

The significance levels required for variables to enter and to be retained are set with the sle (slentry) and sls (slstay) options, respectively. The default value for both is p=.15. By default, a “stepwise” procedure is used (other procedures can be requested with the method= option). Variables are chosen to enter or leave the discriminant function according to one of two criteria:

The significance level of an F-test from an analysis of covariance, where the variables already chosen act as covariates and the variable under consideration is the dependent variable.

The squared partial correlation for predicting the variable under consideration from the class variable, controlling for the effects of the variables already chosen.

The significance level and the squared partial correlation criteria select variables in the same order, although they may select different numbers of variables. Increasing the sample size tends to increase the number of variables selected when using significance levels, but has little effect on the number selected when using squared partial correlations.

At Step 1 in Display 15.3, the variable faceheight has the highest R² value and is the first variable selected. At Step 2, none of the partial R² values of the other variables meets the criterion for inclusion, and the process therefore ends. The tolerance shown for each variable is one minus the squared multiple correlation of the variable with the other variables already selected. A variable can only be entered if its tolerance is above the value specified in the singular= option. The value set by default is 1.0E-8.
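The F values in Display 15.3 can be recovered from the (partial) R-square values. The sketch below assumes the standard relation F = R²/(1 − R²) × (denominator DF / numerator DF) for a test with one numerator degree of freedom; the helper name f_to_enter is hypothetical. Because the listing prints R-square to four decimals only, the last digit of a recomputed F value can differ from the printed one.

```python
def f_to_enter(r_square, den_df, num_df=1):
    """F statistic implied by a (partial) R-square with the given degrees of freedom."""
    return (r_square / (1.0 - r_square)) * (den_df / num_df)


# R-square values at Step 1 (DF = 1, 30) and partial R-square values
# at Step 2 (DF = 1, 29), copied from Display 15.3.
step1 = {"length": 0.3488, "width": 0.0021, "height": 0.0532,
         "faceheight": 0.3904, "facewidth": 0.2369}
step2 = {"length": 0.0541, "width": 0.0162, "height": 0.0047,
         "facewidth": 0.0271}

for name, r2 in step1.items():
    print(f"step 1  {name:<10}  F = {f_to_enter(r2, den_df=30):.2f}")
for name, r2 in step2.items():
    print(f"step 2  {name:<10}  F = {f_to_enter(r2, den_df=29):.2f}")

# The variable with the largest R-square is the first candidate for entry.
best = max(step1, key=step1.get)
print("first candidate:", best)
```

For a fixed R-square, F grows in proportion to the denominator degrees of freedom, which is why a larger sample tends to admit more variables under the significance-level criterion.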

The STEPDISC Procedure

The Method for Selecting Variables is STEPWISE

Observations    32    Variable(s) in the Analysis       5
Class Levels     2    Variable(s) will be Included      0
                      Significance Level to Enter    0.05
                      Significance Level to Stay     0.05

Class Level Information

        Variable
Type    Name        Frequency     Weight    Proportion
A       A                  17    17.0000      0.531250
B       B                  15    15.0000      0.468750

The STEPDISC Procedure

Stepwise Selection: Step 1

Statistics for Entry, DF = 1, 30

Variable      R-Square    F Value    Pr > F    Tolerance
length          0.3488      16.07    0.0004       1.0000
width           0.0021       0.06    0.8029       1.0000
height          0.0532       1.69    0.2041       1.0000
faceheight      0.3904      19.21    0.0001       1.0000
facewidth       0.2369       9.32    0.0047       1.0000

Variable faceheight will be entered.

Variable(s) that have been Entered

faceheight

Multivariate Statistics

Statistic                                  Value    F Value    Num DF    Den DF    Pr > F
Wilks' Lambda                           0.609634      19.21         1        30    0.0001
Pillai's Trace                          0.390366      19.21         1        30    0.0001
Average Squared Canonical Correlation   0.390366

The STEPDISC Procedure

Stepwise Selection: Step 2

Statistics for Removal, DF = 1, 30

Variable	R-Square	F Value	Pr > F
faceheight	0.3904	19.21	0.0001

No variables can be removed.

Statistics for Entry, DF = 1, 29

              Partial
Variable      R-Square    F Value    Pr > F    Tolerance
length          0.0541       1.66    0.2081       0.4304
width           0.0162       0.48    0.4945       0.9927
height          0.0047       0.14    0.7135       0.9177
facewidth       0.0271       0.81    0.3763       0.6190

No variables can be entered.

No further steps are possible.

The STEPDISC Procedure

Stepwise Selection Summary

Step  Number In  Entered     Removed  Partial R-Square  F Value  Pr > F  Wilks' Lambda  Pr < Lambda  Averaged Squared Canonical Correlation  Pr > ASCC
1     1          faceheight           0.3904            19.21    0.0001  0.60963388     0.0001       0.39036612                              0.0001

Display 15.3

Details of the “discriminant function” using only faceheight are found as follows:

proc discrim data=skulls crossvalidate;
   class type;
   var faceheight;
run;

The output is shown in Display 15.4. Here, the coefficient of faceheight in each class is simply the class mean of faceheight divided by the pooled within-group variance of the variable. The resubstitution and leaving-one-out methods of estimating the misclassification rate give the same value of 24.71%.
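The relation between the coefficients, the constants, and the pooled variance can be checked against the numbers printed in Display 15.4. The sketch below is in Python rather than SAS and rests on two assumptions: with a single variable, the determinant of the pooled covariance matrix is the pooled variance itself, and the class means (which this listing does not print) are recovered as coefficient times pooled variance. All names are illustrative.

```python
import math

# Values copied from Display 15.4.
ln_det = 2.90727                      # natural log of the determinant of the pooled covariance matrix
coef = {"A": 3.81408, "B": 4.17695}   # coefficients of faceheight in each class

# With one variable the determinant is the pooled within-group variance.
pooled_var = math.exp(ln_det)

# Coefficient = class mean / pooled variance, so the means can be recovered.
means = {t: c * pooled_var for t, c in coef.items()}

# Constant = -0.5 * mean' COV^-1 mean, which here is -0.5 * mean * coefficient.
constants = {t: -0.5 * means[t] * coef[t] for t in coef}

# Generalized squared distance between the two groups.
d2 = (means["A"] - means["B"]) ** 2 / pooled_var

print("pooled variance:", round(pooled_var, 3))
print("recovered means:", {t: round(m, 2) for t, m in means.items()})
print("constants:", {t: round(c, 3) for t, c in constants.items()})
print("squared distance:", round(d2, 4))
```

The recovered constants and squared distance agree with the listing (-133.15615, -159.69891, and 2.41065) up to the rounding of the printed inputs; the implied class means are roughly 69.8 for type A and 76.5 for type B.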

The DISCRIM Procedure

Observations    32    DF Total              31
Variables        1    DF Within Classes     30
Classes          2    DF Between Classes     1

Class Level Information

        Variable                                           Prior
type    Name        Frequency     Weight    Proportion    Probability
A       A                  17    17.0000      0.531250       0.500000
B       B                  15    15.0000      0.468750       0.500000

Pooled Covariance Matrix Information

Covariance     Natural Log of the Determinant
Matrix Rank    of the Covariance Matrix
1              2.90727

The DISCRIM Procedure

Pairwise Generalized Squared Distances Between Groups

$D^2(i \mid j) = (\bar{X}_i - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (\bar{X}_i - \bar{X}_j)$

Generalized Squared Distance to type

From type          A          B
A                  0    2.41065
B            2.41065          0

Linear Discriminant Function

$\text{Constant} = -0.5 \, \bar{X}_j' \, \mathrm{COV}^{-1} \, \bar{X}_j \qquad \text{Coefficient Vector} = \mathrm{COV}^{-1} \, \bar{X}_j$

Linear Discriminant Function for type

Variable               A             B
Constant      -133.15615    -159.69891
faceheight       3.81408       4.17695

The DISCRIM Procedure

Classification Summary for Calibration Data: WORK.SKULLS
Resubstitution Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_j)' \, \mathrm{COV}^{-1} \, (X - \bar{X}_j)$

Posterior Probability of Membership in Each type

$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$

Number of Observations and Percent Classified into type

From type         A         B     Total
A                12         5        17
              70.59     29.41    100.00
B                 3        12        15
              20.00     80.00    100.00
Total            15        17        32
              46.88     53.13    100.00
Priors          0.5       0.5

Error Count Estimates for type

                  A         B     Total
Rate         0.2941    0.2000    0.2471
Priors       0.5000    0.5000

The DISCRIM Procedure

Classification Summary for Calibration Data: WORK.SKULLS
Cross-validation Summary using Linear Discriminant Function

Generalized Squared Distance Function

$D_j^2(X) = (X - \bar{X}_{(X)j})' \, \mathrm{COV}_{(X)}^{-1} \, (X - \bar{X}_{(X)j})$

Posterior Probability of Membership in Each type

$\Pr(j \mid X) = \exp(-0.5 \, D_j^2(X)) \, / \, \sum_k \exp(-0.5 \, D_k^2(X))$

Number of Observations and Percent Classified into type

From type         A         B     Total
A                12         5        17
              70.59     29.41    100.00
B                 3        12        15
              20.00     80.00    100.00
Total            15        17        32
              46.88     53.13    100.00
Priors          0.5       0.5

Error Count Estimates for type

                  A         B     Total
Rate         0.2941    0.2000    0.2471
Priors       0.5000    0.5000

Display 15.4
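The generalized squared distance and posterior probability formulas printed in Display 15.4 can be sketched for the single-variable case as follows. This is a Python illustration, not SAS code. It assumes equal priors (as in the listing), takes the pooled variance as the exponential of the printed log determinant, and recovers the class means as coefficient times pooled variance; the two face heights being classified are hypothetical values, not observations from the skull data.

```python
import math

# Quantities taken from Display 15.4.
POOLED_VAR = math.exp(2.90727)            # pooled within-group variance (one variable)
COEF = {"A": 3.81408, "B": 4.17695}       # linear discriminant coefficients
MEANS = {t: c * POOLED_VAR for t, c in COEF.items()}  # recovered class means (assumption)


def squared_distance(x, t):
    """Generalized squared distance from x to the mean of type t."""
    return (x - MEANS[t]) ** 2 / POOLED_VAR


def posterior(x):
    """Posterior probability of each type for face height x, with equal priors."""
    weights = {t: math.exp(-0.5 * squared_distance(x, t)) for t in MEANS}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


def classify(x):
    """Allocate x to the type with the larger posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)


for x in (70.0, 78.0):  # hypothetical face heights
    print(x, classify(x), {t: round(p, 3) for t, p in posterior(x).items()})
```

With one variable and equal priors, the rule reduces to a cut-off at the midpoint of the two class means: values below it are allocated to type A and values above it to type B.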

Exercises

15.1 Use the posterr option in proc discrim to estimate error rates for the discriminant functions derived for the skull data. Compare these with those given in Displays 15.2 and 15.4.

15.2 Investigate the use of the nonparametric discriminant methods available in proc discrim for the skull data. Compare the results with those for the simple linear discriminant function given in the text.

Chapter 16

Correspondence Analysis: Smoking and Motherhood, Sex and the Single Girl, and European Stereotypes

16.1 Description of Data

Three sets of data are considered in this chapter, all of which arise in the form of two-dimensional contingency tables as met previously in Chapter 3. The three data sets are given in Displays 16.1, 16.2, and 16.3; details are as follows.

Display 16.1: These data involve the association between a girl’s age and her relationship with her boyfriend.

Display 16.2: These data show the distribution of birth outcomes by age of mother, length of gestation, and whether or not the mother smoked during the prenatal period. We consider the data as a two-dimensional contingency table with four row categories and four column categories.

Display 16.3: These data were obtained by asking a large number of people in the U.K. which of 13 characteristics they would associate with the nationals of the U.K.’s partner countries in the European Community. Entries in the table give the percentages of respondents agreeing that the nationals of a particular country possess the particular characteristic.

                                                Age Group
                                   Under 16   16–17   17–18   18–19   19–20
No boyfriend                             21      21      14      13       8
Boyfriend/No sexual intercourse           8       9       6       8       2
Boyfriend/Sexual intercourse              2       3       4      10      10

Display 16.1

                        Premature                Full-Term
                   Died in     Alive at     Died in     Alive at
                   1st year    year 1       1st year    year 1
Young mothers
  Non-smokers            50         315          24         4012
  Smokers                 9          40           6          459
Old mothers
  Non-smokers            41         147          14         1594
  Smokers                 4          11           1          124

Display 16.2
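Treating Display 16.2 as a two-dimensional table means crossing mother's age with smoking status to form four row categories, and gestation length with survival to form four column categories. A minimal Python sketch of that layout is given below; the label strings are illustrative wording, and the counts are copied from the display.

```python
# Four row categories: age of mother crossed with smoking status.
rows = ["Young non-smokers", "Young smokers", "Old non-smokers", "Old smokers"]

# Four column categories: length of gestation crossed with survival.
cols = ["Premature, died in 1st year", "Premature, alive at year 1",
        "Full-term, died in 1st year", "Full-term, alive at year 1"]

# Counts from Display 16.2, one inner list per row category.
table = [
    [50, 315, 24, 4012],
    [9, 40, 6, 459],
    [41, 147, 14, 1594],
    [4, 11, 1, 124],
]

row_totals = [sum(r) for r in table]
col_totals = [sum(c) for c in zip(*table)]
grand_total = sum(row_totals)

for label, total in zip(rows, row_totals):
    print(f"{label:<20} {total:>6}")
print(f"{'All mothers':<20} {grand_total:>6}")
```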

						Characteristic
Country	1	2	3	4	5	6	7	8	9	10	11	12	13
France	37	29	21	19	10	10	8	8	6	6	5	2	1
Spain	7	14	8	9	27	7	3	7	3	23	12	1	3
Italy	30	12	19	10	20	7	12	6	5	13	10	1	2
U.K.	9	14	4	6	27	12	2	13	26	16	29	6	25
Ireland	1	7	1	16	30	3	10	9	5	11	22	2	27
Holland	5	4	2	2	15	2	0	13	24	1	28	4	6
Germany	4	48	1	12	3	9	2	11	41	1	38	8	8

Note: Characteristics: (1) stylish; (2) arrogant; (3) sexy; (4) devious; (5) easy-going; (6) greedy; (7) cowardly; (8) boring; (9) efficient; (10) lazy; (11) hard working; (12) clever; (13) courageous.

Display 16.3

16.2 Displaying Contingency Table Data Graphically Using Correspondence Analysis

Correspondence analysis is a technique for displaying the associations among a set of categorical variables in a type of scatterplot or map, thus allowing a visual examination of the structure or pattern of these associations. A correspondence analysis should ideally be seen as an extremely useful supplement to, rather than a replacement for, the more formal inferential procedures generally used with categorical data (see Chapters 3 and 8). The aim when using correspondence analysis is nicely summarized in the following quotation from Greenacre (1992):

An important aspect of correspondence analysis which distinguishes it from more conventional statistical methods is that it is not a confirmatory technique, trying to prove a hypothesis, but rather an exploratory technique, trying to reveal the data content. One can say that it serves as a window onto the data, allowing researchers easier access to their numerical results and facilitating discussion of the data and possibly generating hypothesis which can be formally tested at a later stage.
