
MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA

Exhibit 9.8: The eigenvalues in the classical MDS of the Bray-Curtis dissimilarity indices of Exhibit 5.2, showing positive eigenvalues in green and negative ones in brown.

Performing nonmetric MDS on the same data gives a stress value of 13.5%, which is not a big improvement on the 16.3%, suggesting that the two resulting maps will not be as different as we found for the smaller data set of Jaccard indices. This is indeed the case, as shown by the quite similar maps in Exhibit 9.9.

In our experience, when there is a large number of samples (and by “large” we mean, as most statisticians do, 30 or more, as in this example), the metric and nonmetric approaches generally agree in their solutions. Where they disagree is in the quantification of the success of their results, with the stress measure always giving a more optimistic value because it does not measure the recovery of the proximities themselves, but their ordering in the map.

Adding count variables to MDS maps

The maps in Exhibit 9.9 emanate originally from abundance data on five species, so the question now is how to include these species on the map. We shall consider alternative ways of doing this in future chapters, but for the moment let us use the same approach as in Exhibit 9.7, where the species were positioned at the averages of the samples that contained them. The difference here is that we have abundance counts for the species across the samples, so we can position each species at its weighted average across the samples. For example, species a has abundances of 0, 26, 0, 0, 13, etc., and a total abundance of 0 + 26 + 0 + 0 + 13 + ... = 404, so the position of a is at a weighted average position of the 30 samples, with weights 26/404 = 0.064 on sample s2, 13/404 = 0.032 on sample s5, and so on. Exhibit 9.10 shows the species positions on the nonmetric MDS solution, showing, for example, that species a and b are relatively more abundant in the samples at lower left, while c is more associated with samples on the right. Similarly, even though the ordinal sediment types C (clay), S (sand) and G (gravel) have not been used in the mapping, they can be depicted at the averages of the subsets of samples corresponding to them. The samples thus appear to follow a trend from top right (more clay) to bottom left (more gravel).

Exhibit 9.9: Classical MDS map (upper) and nonmetric MDS map (lower) of the Bray-Curtis dissimilarities of Exhibit 5.2. Both maps plot the samples s1 to s30 on Dimension 1 (horizontal) against Dimension 2 (vertical).

Exhibit 9.10: Nonmetric MDS solution (lower map in Exhibit 9.9) with species a to e added by weighted averaging of sample points, and sediment types C, S and G by averaging. Axes are Dimension 1 (horizontal) and Dimension 2 (vertical).

SUMMARY: Multidimensional scaling

1. Multidimensional scaling (MDS) is a method that attempts to make a spatial map of a matrix of proximities, either distances or dissimilarities defined between sample units, so that the interpoint distances in the map come as close as possible to the given proximities according to the chosen fit criterion.

2. The fit criterion in metric MDS involves approximating the actual proximity values by the mapped distances, for example by least-squares.

3. Classical MDS is a particular form of metric MDS that relies on the eigenvalue-eigenvector decomposition of a square matrix. The eigenvalues give convenient measures of variance explained on each axis, and the dimensions of the solution are uncorrelated.

4. Nonmetric MDS has a more relaxed fit criterion in that it strives to match only the ordering of the proximities to the ordering of the mapped distances.

5. The error in classical MDS is quantified by the percentage of unexplained variance, while in nonmetric MDS the error is quantified by the stress.

6. The stress measure always gives a more optimistic result, because of the relaxation of approximating the proximity values in the map in favour of their rank ordering.

7. In most cases, however, when the size of the proximity data matrix is quite large, say for at least 30 sample units, the results of the two approaches will be essentially the same.

8. When the proximities are of a Euclidean type, it will be more useful to use the metric scaling approach because of the connection with methods such as principal component analysis (Chapter 12) and correspondence analysis (Chapter 13). There would be little advantage, for example, in applying nonmetric scaling to a matrix of chi-square distances.

9. When the proximities are non-Euclidean, the nonmetric approach avoids the dilemma that the triangle inequality is violated by concentrating on ordering of proximities rather than their actual values.
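The eigenvalue-eigenvector route of classical MDS (point 3) and the weighted-average placement of count variables used earlier in the chapter can be illustrated with a minimal Python sketch using NumPy. The function names `classical_mds` and `weighted_average_positions`, and the four-point toy data, are assumptions made for illustration only; they are not taken from the book or its data sets.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS of a symmetric n x n dissimilarity matrix D (illustrative helper)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred squared dissimilarities
    evals, evecs = np.linalg.eigh(B)             # eigh returns eigenvalues in ascending order
    order = np.argsort(evals)[::-1]              # re-sort to descending order
    evals, evecs = evals[order], evecs[:, order]
    # Coordinates on the first k axes; negative eigenvalues (which signal
    # non-Euclidean dissimilarities) are clipped to zero before the square root.
    coords = evecs[:, :k] * np.sqrt(np.clip(evals[:k], 0.0, None))
    return coords, evals

def weighted_average_positions(coords, counts):
    """Place each count variable (column of counts) at the weighted average of the sample points."""
    weights = counts / counts.sum(axis=0)        # each column of weights sums to 1
    return weights.T @ coords

# Toy example: four samples whose true configuration is a 3 x 4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
coords, evals = classical_mds(D)
D_map = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)

# One hypothetical species with counts 0, 26, 0, 13 in the four samples.
counts = np.array([[0.0], [26.0], [0.0], [13.0]])
species_pos = weighted_average_positions(coords, counts)
```

Because the toy dissimilarities are Euclidean distances in two dimensions, the two-dimensional map reproduces them exactly; with a non-Euclidean measure such as Bray-Curtis, some entries of `evals` would be negative, as in Exhibit 9.8.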


REGRESSION AND PRINCIPAL COMPONENT ANALYSIS


CHAPTER 10

Regression Biplots

In the previous chapter, displays of samples were obtained in a scatterplot with spatial properties (hence often called a map), approximating given distance or dissimilarity matrices. Then some types of variables were added to the display, specifically zero/one categorical variables (e.g., presences of species, sediment categories) and count variables (e.g., species abundances). In this chapter we continue with this theme of adding variables to a plot of samples, including continuous variables in their original form or in fuzzy-coded form. When samples and variables are displayed jointly in such a scatterplot, it is often called a biplot. This designation implies that a certain property holds between the two sets of points in the display in terms of the scalar products between the samples and variables. In this chapter we consider the simplest form of biplot, the regression biplot, which will serve two purposes: first, to give a different geometric interpretation of multiple regression; and second, to give a basic understanding of all the joint displays of samples and variables that will appear in the rest of this book.

Contents

Algebra of multiple linear regression
Geometry of multiple linear regression
Regression biplot
Generalized linear model biplots with categorical variables
Fuzzy-coded species abundances
More than two predictors
SUMMARY: Regression biplots

Algebra of multiple linear regression

The multiple linear regression model postulates that the expected value of a response variable Y (i.e., the mean of Y) is a linear combination of several explanatory variables $x_1, x_2, \ldots, x_p$:

$$E(Y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \qquad (10.1)$$

For example, using the data of Exhibit 1.1, consider the regression of species labelled d on depth, pollution and temperature. The model is estimated as:

$$E(d) = 6.271 + 0.148\,\text{depth} - 1.388\,\text{pollution} + 0.043\,\text{temperature} \qquad (10.2)$$

Notice that, for the moment, we do not comment on whether this type of linear model of a count variable on three environmental variables would be sensible or not, because d is not an interval variable; we will return to this point later.

Since the coefficients in (10.2) depend on the units of the variables, we prefer to consider the regression using all variables in comparable units. Usually this is done by standardization of the variables, so that they are all in units of standard deviation. Let us denote these standardized variables (i.e., centred and normalized) with an asterisk; the regression model then becomes:

$$E(d^*) = 0.347\,\text{depth}^* - 0.446\,\text{pollution}^* + 0.002\,\text{temperature}^* \qquad (10.3)$$

The constant term now vanishes and the coefficients, called standardized regression coefficients, can be compared with one another. Thus it seems that pollution has the strongest influence on the average level of species d, reducing it by 0.446 of a standard deviation for every increase of one standard deviation of pollution. The effect of temperature is minimal and, in fact, is statistically nonsignificant (p = 0.99), while depth and pollution are both significant (p = 0.039 and p = 0.010, respectively). We therefore drop temperature and consider just the regression on the other two variables, which maintains the values of the coefficients but gives slightly smaller p-values, p = 0.035 and p = 0.008, respectively:

$$E(d^*) = 0.347\,\text{depth}^* - 0.446\,\text{pollution}^* \qquad (10.4)$$

Geometry of multiple linear regression

When referring to the multiple regression model, it is often said that a hyperplane is being fitted to the data. For a single explanatory variable this reduces to a straight line in the familiar case of simple linear regression. When there are two explanatory variables, as in (10.4), the model is a two-dimensional plane in three dimensions, the third dimension being the response variable d*. A view of this plane in three dimensions is given in Exhibit 10.1, with standardized depth* and pollution* forming the two horizontal dimensions and d* the vertical one. Notice how the plane is going down in the direction of pollution, but going up in the direction of depth, according to the regression coefficients (see the web site of the book, which shows a video of this three-dimensional image). Notice too the lack of fit of the points to the plane: the value of $R^2$ for the regression is 0.442, which means that 44.2% of the variance of d is being explained, and 55.8% of the variance is unexplained and considered residual, or error, variance.

Exhibit 10.1: Regression plane defined by Equation (10.4) for standardized response d* and standardized explanatory variables pollution* and depth*. The view is from above the plane. The horizontal axes are Depth* and Pollution*, and the vertical axis is d*.
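The effect of standardization on the regression, and the split of variance into explained and residual parts, can be checked numerically. The following is a minimal sketch, assuming the model is fitted by ordinary least squares; the data are simulated with a fixed seed and are not the data of Exhibit 1.1, and the helper names `standardize` and `fit` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible
n = 30
depth = rng.uniform(50, 100, n)         # hypothetical predictor values
pollution = rng.uniform(1, 10, n)
d = 6.0 + 0.15 * depth - 1.4 * pollution + rng.normal(0, 3, n)  # hypothetical response

def standardize(v):
    """Centre to mean 0 and scale to unit standard deviation."""
    return (v - v.mean()) / v.std(ddof=1)

def fit(y, *predictors):
    """Least-squares fit with an intercept; returns (coefficients, R-squared)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    centred = y - y.mean()
    r2 = 1.0 - (residuals @ residuals) / (centred @ centred)
    return coef, r2

raw_coef, raw_r2 = fit(d, depth, pollution)
std_coef, std_r2 = fit(standardize(d), standardize(depth), standardize(pollution))

# A standardized slope equals the raw slope times sd(predictor) / sd(response).
sd_ratio = np.array([depth.std(ddof=1), pollution.std(ddof=1)]) / d.std(ddof=1)
rescaled = raw_coef[1:] * sd_ratio

unexplained = 1.0 - std_r2              # share of variance left as residual variance
```

In this sketch the intercept of the standardized fit is zero up to rounding error, the standardized slopes agree with the rescaled raw slopes, and $R^2$ is the same for both fits, so the percentage of unexplained variance does not depend on the units of the variables.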

The linearity of the plane means that predictions of the same mean values form parallel straight lines in the plane. From a mountaineer’s point of view, if you are standing on the plane and want to stay at the same height, you need to walk in a straight line. Projecting these parallel straight lines onto the depth-pollution plane gives the contours, also called isolines, as shown in Exhibit 10.2. Finally, the vector in the depth-pollution plane with coordinates equal to the regression coefficients, $[0.347,\ -0.446]$, called the gradient, indicates the direction of steepest ascent in the regression plane, and is perpendicular to the contours. Given the geometry of the regression plane in Exhibit 10.2, it follows that we can do away with the d* dimension, just as cartographers do, and consider just the depth-pollution plane and the contours of the regression plane, which are perpendicular to the gradient vector. Exhibit 10.3 shows this “ground view” of the model.
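The perpendicularity of gradient and contours can be verified directly. The sketch below assumes the coefficients of (10.4) with the signs implied by the text (depth positive, pollution negative); the starting point and the names `predict`, `contour_dir`, `along_contour` and `uphill` are arbitrary choices for illustration.

```python
import numpy as np

# Gradient vector: coefficients of depth* and pollution* in (10.4)
gradient = np.array([0.347, -0.446])

def predict(point):
    """Predicted mean of d* at a (depth*, pollution*) point; no intercept after standardization."""
    return point @ gradient

# Rotating the gradient by 90 degrees gives the direction of the contours.
contour_dir = np.array([-gradient[1], gradient[0]])

start = np.array([1.0, 0.5])                              # arbitrary starting point
along_contour = start + 2.0 * contour_dir                 # walk along a contour
uphill = start + gradient / np.linalg.norm(gradient)      # one unit step along the gradient

same_height = np.isclose(predict(start), predict(along_contour))
climb = predict(uphill) - predict(start)                  # equals the length of the gradient
```

Walking along `contour_dir` leaves the prediction unchanged, which is the mountaineer's straight-line path at constant height, whereas a unit step along the gradient raises the prediction by the length of the gradient vector, the steepest possible increase.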

The short arrow labelled d is the gradient vector. The dashed line through this vector is called the biplot axis for the variable d. Contour lines are perpendicular to the biplot axis. Exhibit 10.3(a) corresponds to the darker “shadow” in Exhibit 10.2 in the depth-pollution plane, where the contours are in units of standard