Mathematically, correspondence analysis can be regarded as either:

A method for decomposing the chi-squared statistic for a contingency table into components corresponding to different dimensions of the heterogeneity between its rows and columns, or

A method for simultaneously assigning a scale to rows and a separate scale to columns so as to maximize the correlation between the resulting pair of variables.

Quintessentially, however, correspondence analysis is a technique for displaying multivariate categorical data graphically, by deriving coordinate values to represent the categories of the variables involved, which can then be plotted to provide a “picture” of the data.

In the case of two categorical variables forming a two-dimensional contingency table, the required coordinates are obtained from the singular value decomposition (Everitt and Dunn [2001]) of a matrix E with elements eij given by:

$$e_{ij} = \frac{p_{ij} - p_{i+}\,p_{+j}}{\sqrt{p_{i+}\,p_{+j}}} \tag{16.1}$$

where $p_{ij} = n_{ij}/n$, with $n_{ij}$ being the number of observations in the $ij$th cell of the contingency table and $n$ the total number of observations. The total number of observations in row $i$ is represented by $n_{i+}$ and the corresponding value for column $j$ is $n_{+j}$. Finally, $p_{i+} = n_{i+}/n$ and $p_{+j} = n_{+j}/n$. The elements of E can be written in terms of the familiar “observed” (O) and “expected” (E) nomenclature used for contingency tables as:

$$e_{ij} = \frac{1}{\sqrt{n}}\,\frac{O - E}{\sqrt{E}} \tag{16.2}$$

Written in this way, it is clear that the terms are a form of residual from ﬁtting the independence model to the data.
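The equivalence of Eqs. (16.1) and (16.2) can be checked numerically. The following is a minimal sketch in plain Python, not part of the SAS analysis. The counts are an assumption: they are taken to be the boyfriends table of Display 16.1, which is not reproduced in this section (they do reproduce the row and column masses printed later in Display 16.4). The function names are illustrative only.

```python
from math import sqrt

# ASSUMPTION: these counts are taken to be the 3 x 5 table of Display 16.1,
# which is not reproduced in this section.
# Rows: NoBoy, NoSex, Both. Columns: the five age groups, youngest first.
COUNTS = [
    [21, 21, 14, 13, 8],
    [8, 9, 6, 8, 2],
    [2, 3, 4, 10, 10],
]


def residuals_from_proportions(counts):
    """Eq. (16.1): e_ij = (p_ij - p_i+ p_+j) / sqrt(p_i+ p_+j)."""
    n = sum(sum(row) for row in counts)
    row_p = [sum(row) / n for row in counts]
    col_p = [sum(col) / n for col in zip(*counts)]
    return [
        [(n_ij / n - p_i * p_j) / sqrt(p_i * p_j) for n_ij, p_j in zip(row, col_p)]
        for row, p_i in zip(counts, row_p)
    ]


def residuals_from_expected(counts):
    """Eq. (16.2): e_ij = (1 / sqrt(n)) * (O - E) / sqrt(E)."""
    n = sum(sum(row) for row in counts)
    row_n = [sum(row) for row in counts]
    col_n = [sum(col) for col in zip(*counts)]
    result = []
    for row, n_i in zip(counts, row_n):
        line = []
        for observed, n_j in zip(row, col_n):
            expected = n_i * n_j / n  # fitted count under independence
            line.append((observed - expected) / sqrt(expected) / sqrt(n))
        result.append(line)
    return result


E_16_1 = residuals_from_proportions(COUNTS)
E_16_2 = residuals_from_expected(COUNTS)

if __name__ == "__main__":
    # Print both versions side by side; the two halves of each line should agree.
    for r1, r2 in zip(E_16_1, E_16_2):
        print(" ".join(f"{a:8.4f}" for a in r1), "|", " ".join(f"{b:8.4f}" for b in r2))
```

Both functions return the same matrix up to floating-point error, which is the sense in which the elements of E are scaled residuals from the independence model.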

The singular value decomposition of E consists of ﬁnding matrices U, V, and ∆ (diagonal) such that:

$$E = U \Delta V' \tag{16.3}$$

where U contains the eigenvectors of $EE'$ and V the eigenvectors of $E'E$. The diagonal matrix $\Delta$ contains the ranked singular values $\delta_k$, so that the $\delta_k^2$ are the eigenvalues (in decreasing order) of either $EE'$ or $E'E$.

The coordinate of the $i$th row category on the $k$th coordinate axis is given by $\delta_k u_{ik}/\sqrt{p_{i+}}$, and the coordinate of the $j$th column category on the same axis is given by $\delta_k v_{jk}/\sqrt{p_{+j}}$, where $u_{ik}$, $i = 1, \ldots, r$, and $v_{jk}$, $j = 1, \ldots, c$, are, respectively, the elements of the $k$th column of U and the $k$th column of V.
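Read in reverse, the coordinate formula lets the elements of U be recovered from printed output. The sketch below assumes coordinates of the form $\delta_k u_{ik}/\sqrt{p_{i+}}$ and uses the first singular value, the Dim1 row coordinates, the row masses, and the Dim1 partial contributions that PROC CORRESP reports for the boyfriends table in Display 16.4. It also relies on two standard facts not stated in the text: each column of U has unit length, and $u_{ik}^2$ is the share of the inertia of dimension $k$ attributable to row $i$ (what SAS prints as a partial contribution). The helper name is illustrative.

```python
from math import sqrt

DELTA_1 = 0.37596  # first singular value, as printed in Display 16.4

# name: (Dim1 coordinate, mass p_i+, partial contribution to Dim1), from Display 16.4
ROWS = {
    "NoBoy": (-0.1933, 0.5540, 0.1465),
    "NoSex": (-0.1924, 0.2374, 0.0622),
    "Both": (0.7322, 0.2086, 0.7914),
}


def u_from_coordinate(coordinate, mass, delta):
    """Invert coordinate = delta * u / sqrt(mass) to recover the element of U."""
    return coordinate * sqrt(mass) / delta


U1 = {name: u_from_coordinate(c, p, DELTA_1) for name, (c, p, _) in ROWS.items()}

if __name__ == "__main__":
    for name, (_, _, reported) in ROWS.items():
        print(f"{name:6s} u = {U1[name]:8.4f}  u^2 = {U1[name] ** 2:.4f}  reported = {reported:.4f}")
    print("sum of u^2 =", round(sum(u * u for u in U1.values()), 4))
```

Up to the four-decimal rounding of the printed output, the squared values agree with the reported partial contributions and sum to one.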

To represent the table fully requires at most $R = \min(r, c) - 1$ dimensions, where r and c are the number of rows and columns of the table. R is the rank of the matrix E. The eigenvalues, $\delta_k^2$, are such that:

$$\operatorname{Trace}(EE') = \sum_{k=1}^{R} \delta_k^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} e_{ij}^2 = \frac{X^2}{n} \tag{16.4}$$

where X2 is the usual chi-squared test statistic for independence. In the context of correspondence analysis, X2/n is known as inertia. Correspondence analysis produces a graphical display of the contingency table from the columns of U and V, in most cases from the ﬁrst two columns, u1, u2, v1, v2, of each, since these give the “best” two-dimensional representation. It can be shown that the ﬁrst two coordinates give the following approximation to the eij:

$$e_{ij} \approx \delta_1 u_{i1} v_{j1} + \delta_2 u_{i2} v_{j2} \tag{16.5}$$

so that a large positive residual corresponds to uik and vjk for k = 1 or 2, being large and of the same sign. A large negative residual corresponds to uik and vjk, being large and of opposite sign for each value of k. When uik and vjk are small and their signs are not consistent for each k, the corresponding residual term will be small. The adequacy of the representation produced by the ﬁrst two coordinates can be informally assessed by calculating the percentages of the inertia they account for; that is

$$\text{Percentage inertia} = \frac{\delta_1^2 + \delta_2^2}{\sum_{k=1}^{R} \delta_k^2} \tag{16.6}$$

Values of 60% and above usually mean that the two-dimensional solution gives a reasonable account of the structure in the table.
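As a small illustration of Eq. (16.6), the sketch below squares a set of singular values and expresses each as a percentage of the total inertia. The two singular values are those that PROC CORRESP reports for the boyfriends table (Display 16.4); the function name is an assumption made for this example.

```python
SINGULAR_VALUES = [0.37596, 0.08297]  # as printed by PROC CORRESP in Display 16.4


def inertia_percentages(singular_values):
    """Return (principal inertias, total inertia, percentage of inertia per dimension)."""
    principal = [d * d for d in singular_values]  # delta_k squared
    total = sum(principal)                        # equals X^2 / n by Eq. (16.4)
    percent = [100.0 * value / total for value in principal]
    return principal, total, percent


PRINCIPAL, TOTAL, PERCENT = inertia_percentages(SINGULAR_VALUES)

if __name__ == "__main__":
    for k, (value, pct) in enumerate(zip(PRINCIPAL, PERCENT), start=1):
        print(f"dimension {k}: inertia {value:.5f}, {pct:.2f}% of total")
    print(f"total inertia {TOTAL:.5f}")
```

For this table R = 2, so the first two dimensions necessarily account for all of the inertia; the informative figure is that the first dimension alone accounts for roughly 95%.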

16.3 Analysis Using SAS

16.3.1 Boyfriends

Assuming that the 15 cell counts shown in Display 16.1 are in an ASCII ﬁle, tab separated, a suitable data set can be created as follows:

data boyfriends;
   infile 'n:\handbook2\datasets\boyfriends.dat' expandtabs;
   input c1-c5;
   if _n_=1 then rowid='NoBoy';
   if _n_=2 then rowid='NoSex';
   if _n_=3 then rowid='Both';
   label c1='under 16' c2='16-17' c3='17-18' c4='18-19' c5='19-20';
run;

The data are already in the form of a contingency table and can be simply read into a set of variables representing the columns of the table. The label statement is used to assign informative labels to these variables. More informative variable names could also have been used, but labels are more ﬂexible in that they may begin with numbers and include spaces. It is also useful to label the rows, and here the SAS automatic variable _n_ is used to set the values of a character variable rowid.

A correspondence analysis of this table can be performed as follows:

proc corresp data=boyfriends out=coor;
   var c1-c5;
   id rowid;
run;

The out= option names the data set that will contain the coordinates of the solution. By default, two dimensions are used and the dimens= option is used to specify an alternative.

The var statement speciﬁes the variables that represent the columns of the contingency table, and the id statement speciﬁes a variable to be used to label the rows. The latter is optional, but without it the rows will simply be labelled row1, row2, etc. The output appears in Display 16.4.

The CORRESP Procedure

Inertia and Chi-Square Decomposition

Singular   Principal      Chi-              Cumulative
   Value     Inertia    Square   Percent       Percent     19   38   57   76   95
                                                         ----+----+----+----+----+---
 0.37596     0.14135   19.6473     95.36         95.36   *************************
 0.08297     0.00688    0.9569      4.64        100.00   *
   Total     0.14823   20.6042    100.00

Degrees of Freedom = 8

Row Coordinates

	Dim1	Dim2
NoBoy	-0.1933	-0.0610
NoSex	-0.1924	0.1425
Both	0.7322	-0.0002

Summary Statistics for the Row Points

	Quality	Mass	Inertia
NoBoy	1.0000	0.5540	0.1536
NoSex	1.0000	0.2374	0.0918
Both	1.0000	0.2086	0.7546

Partial Contributions to Inertia for the Row Points

	Dim1	Dim2
NoBoy	0.1465	0.2996
NoSex	0.0622	0.7004
Both	0.7914	0.0000

Indices of the Coordinates that Contribute Most to Inertia for the Row Points

	Dim1	Dim2	Best
NoBoy	2	2	2
NoSex	0	2	2
Both	1	0	1

Squared Cosines for the Row Points

	Dim1	Dim2
NoBoy	0.9094	0.0906
NoSex	0.6456	0.3544
Both	1.0000	0.0000

Column Coordinates

	Dim1	Dim2
under 16	-0.3547	-0.0550
16-17	-0.2897	0.0003
17-18	-0.1033	0.0001
18-19	0.2806	0.1342
19-20	0.7169	-0.1234

Summary Statistics for the Column Points

	Quality	Mass	Inertia
under 16	1.0000	0.2230	0.1939
16-17	1.0000	0.2374	0.1344
17-18	1.0000	0.1727	0.0124
18-19	1.0000	0.2230	0.1455
19-20	1.0000	0.1439	0.5137

Partial Contributions to Inertia for the Column Points

	Dim1	Dim2
under 16	0.1985	0.0981
16-17	0.1410	0.0000
17-18	0.0130	0.0000
18-19	0.1242	0.5837
19-20	0.5232	0.3183

Indices of the Coordinates that Contribute Most to Inertia for the Column Points

	Dim1	Dim2	Best
under 16	1	0	1
16-17	1	0	1
17-18	0	0	1
18-19	0	2	2
19-20	1	1	1

Squared Cosines for the Column Points

	Dim1	Dim2
under 16	0.9765	0.0235
16-17	1.0000	0.0000
17-18	1.0000	0.0000
18-19	0.8137	0.1863
19-20	0.9712	0.0288

Display 16.4

To produce a plot of the results, a built-in SAS macro, plotit, can be used:

%plotit(data=coor,datatype=corresp,color=black,colors=black);

The resulting plot is shown in Display 16.5. Displaying the categories of a contingency table in a scatterplot in this way involves the concept of distance between the percentage proﬁles of row or column categories. The distance measure used in a correspondence analysis is known as the chi-squared distance. The calculation of this distance can be illustrated using the proportions of girls in age groups 1 and 2 for each relationship type in Display 16.1.

$$\text{Chi-squared distance} = \sqrt{\frac{(0.68 - 0.64)^2}{0.55} + \frac{(0.26 - 0.27)^2}{0.24} + \frac{(0.06 - 0.09)^2}{0.21}} = 0.09 \tag{16.7}$$

This is similar to ordinary “straight line” or Pythagorean distance, but differs by dividing each term by the corresponding average proportion.

In this way, the procedure effectively compensates for the different levels of occurrence of the categories. (More formally, the choice of the chi-squared distance for measuring relationships between profiles can be justified as a way of standardizing variables under a multinomial or Poisson distributional assumption; see Greenacre [1992].)
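The calculation in Eq. (16.7) can be written as a short function. This is a sketch in plain Python using only the rounded proportions quoted in that equation: the two age-group profiles and the average proportions used as divisors. The function and variable names are illustrative.

```python
from math import sqrt


def chi_squared_distance(profile_a, profile_b, average_profile):
    """Square root of the squared profile differences, each divided by the average proportion."""
    return sqrt(
        sum((a - b) ** 2 / m for a, b, m in zip(profile_a, profile_b, average_profile))
    )


# Rounded proportions as quoted in Eq. (16.7); order: the three relationship types
AGE_GROUP_1 = [0.68, 0.26, 0.06]
AGE_GROUP_2 = [0.64, 0.27, 0.09]
AVERAGE = [0.55, 0.24, 0.21]

DISTANCE = chi_squared_distance(AGE_GROUP_1, AGE_GROUP_2, AVERAGE)

if __name__ == "__main__":
    print(round(DISTANCE, 2))
```

Dropping the division by `m` turns the function into ordinary Euclidean distance between the two profiles, which makes the role of the weighting easy to see.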

The complete set of chi-squared distances for all pairs of the ﬁve age groups, calculated as shown above, can be arranged in a matrix as follows:

	1	2	3	4	5
1	0.00
2	0.09	0.00
3	0.26	0.19	0.00
4	0.66	0.59	0.41	0.00
5	1.07	1.01	0.83	0.51	0.00

The points representing the age groups in Display 16.5 give the two-dimensional representation of these distances, with the Euclidean distance between two points representing the chi-squared distance between the corresponding age groups. (Similarly for the points representing type of relationship.) For a contingency table with r rows and c columns, it can be shown that the chi-squared distances can be represented exactly in $\min\{r - 1, c - 1\}$ dimensions; here, since r = 3 and c = 5, two dimensions suffice, which means that the coordinates in Display 16.5 will lead to Euclidean distances that are identical to the chi-squared distances given above. For example, the correspondence analysis coordinates for age groups 1 and 2 taken from Display 16.4 are

Age Group	x	y

1	–0.355	–0.055
2	–0.290	0.000

The corresponding Euclidean distance is calculated as:

$$\sqrt{(-0.355 + 0.290)^2 + (-0.055 - 0.000)^2}$$

that is, a value of 0.09, agreeing with the chi-squared distance between the two age groups given previously (Eq. (16.7)).
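The same calculation can be repeated for every pair of age groups. The sketch below, in plain Python, takes the column coordinates printed in Display 16.4 and computes all pairwise Euclidean distances, which can then be set against the matrix of chi-squared distances. The numbering of the age groups 1 to 5 (youngest first) and the names used are assumptions made for this example.

```python
from itertools import combinations
from math import hypot

# Column coordinates (Dim1, Dim2) from Display 16.4; age groups numbered youngest first
COLUMN_COORDS = {
    1: (-0.3547, -0.0550),
    2: (-0.2897, 0.0003),
    3: (-0.1033, 0.0001),
    4: (0.2806, 0.1342),
    5: (0.7169, -0.1234),
}


def pairwise_distances(coords):
    """Euclidean distance between every pair of points, keyed by (lower, higher) label."""
    return {
        (a, b): hypot(coords[a][0] - coords[b][0], coords[a][1] - coords[b][1])
        for a, b in combinations(sorted(coords), 2)
    }


DISTANCES = pairwise_distances(COLUMN_COORDS)

if __name__ == "__main__":
    for (a, b), d in DISTANCES.items():
        print(f"age groups {a} and {b}: {d:.2f}")
```

Because the coordinates are printed to four decimals, agreement with the chi-squared distances can only be expected to about two decimal places.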

Of most interest in correspondence analysis solutions such as that graphed in Display 16.5 is the joint interpretation of the points representing the row and column categories. It can be shown that row and column coordinates that are large and of the same sign correspond to a large positive residual term in the contingency table. Row and column coordinates that are large but of opposite signs imply a cell in the table with a large negative residual. Finally, small coordinate values close to the origin correspond to small residuals. In Display 16.5, for example, age group 5 and boyfriend/sexual intercourse both have large positive coordinate values on the first dimension. Consequently, the corresponding cell in the table will have a large positive residual. Again, age group 5 and boyfriend/no sexual intercourse have coordinate values with opposite signs on both dimensions, implying a negative residual for the corresponding cell in the table.
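The link between coordinates and residuals can be made concrete. Assuming row and column coordinates of the form $f_{ik} = \delta_k u_{ik}/\sqrt{p_{i+}}$ and $g_{jk} = \delta_k v_{jk}/\sqrt{p_{+j}}$, substituting into $E = U \Delta V'$ gives $e_{ij} = \sqrt{p_{i+}\,p_{+j}} \sum_k f_{ik} g_{jk}/\delta_k$. The sketch below applies this to the masses, coordinates, and singular values printed in Display 16.4. It is an illustration under those assumptions, and the function name is hypothetical.

```python
from math import sqrt

SINGULAR_VALUES = (0.37596, 0.08297)  # from Display 16.4

# name: (mass, (Dim1, Dim2)), all as printed in Display 16.4
ROW_POINTS = {
    "NoBoy": (0.5540, (-0.1933, -0.0610)),
    "NoSex": (0.2374, (-0.1924, 0.1425)),
    "Both": (0.2086, (0.7322, -0.0002)),
}
COLUMN_POINTS = {
    "under 16": (0.2230, (-0.3547, -0.0550)),
    "16-17": (0.2374, (-0.2897, 0.0003)),
    "17-18": (0.1727, (-0.1033, 0.0001)),
    "18-19": (0.2230, (0.2806, 0.1342)),
    "19-20": (0.1439, (0.7169, -0.1234)),
}


def residual(row, column):
    """e_ij = sqrt(p_i+ * p_+j) * sum over k of f_ik * g_jk / delta_k."""
    p_row, f = ROW_POINTS[row]
    p_col, g = COLUMN_POINTS[column]
    return sqrt(p_row * p_col) * sum(
        f_k * g_k / delta for f_k, g_k, delta in zip(f, g, SINGULAR_VALUES)
    )


if __name__ == "__main__":
    for row in ROW_POINTS:
        print(row.ljust(6), " ".join(f"{residual(row, col):8.4f}" for col in COLUMN_POINTS))
```

In the printed matrix, the cell for boyfriend/sexual intercourse in the oldest age group is the largest positive entry, the cell for boyfriend/no sexual intercourse in the same age group is negative, and cells whose points lie near the origin are close to zero.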

Display 16.5

16.3.2 Smoking and Motherhood

Assuming the cell counts of Display 16.2 are in an ASCII ﬁle births.dat and tab separated, they may be read in as follows:

data births;
   infile 'n:\handbook2\datasets\births.dat' expandtabs;
   input c1-c4;
   length rowid $12.;
   select(_n_);
      when(1) rowid='Young NS';
      when(2) rowid='Young Smoker';
      when(3) rowid='Old NS';
      when(4) rowid='Old Smoker';
   end;
   label c1='Prem Died' c2='Prem Alive' c3='FT Died' c4='FT Alive';
run;

As with the previous example, the data are read into a set of variables corresponding to the columns of the contingency table, and labels assigned to them. The character variable rowid is assigned appropriate values, using the automatic SAS variable _n_ to label the rows. This is explicitly declared as a 12-character variable with the length statement. Where a character variable is assigned values as part of the data step, rather than reading them from a data ﬁle, the default length is determined from its ﬁrst occurrence. In this example, that would have been from rowid='Young NS'; and its length would have been 8 with longer values truncated. This example also shows the use of the select group as an alternative to multiple if-then statements. The expression in parentheses in the select statement is compared to those in the when statements and the rowid variable set accordingly. The end statement terminates the select group.

The correspondence analysis and plot are produced in the same way as for the ﬁrst example. The output is shown in Display 16.6 and the plot in Display 16.7.

proc corresp data=births out=coor; var c1-c4;