
MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA
Exhibit 9.8: The eigenvalues in the classical MDS of the Bray-Curtis dissimilarity indices of Exhibit 5.2, showing positive eigenvalues in green and negative ones in brown. [Figure: bar chart of eigenvalues, with an axis running from 0 to 15,000.]
Performing nonmetric MDS on the same data gives a stress value of 13.5%, which is not a big improvement on the 16.3%, suggesting that the two resulting maps will not be as different as we found for the smaller data set of Jaccard indices. This is indeed the case, as shown by the quite similar maps in Exhibit 9.9.
In our experience, when there is a large number of samples (and by “large” we mean, as most statisticians do, 30 or more, as in this example), the metric and nonmetric approaches generally agree in their solutions. Where they disagree is in the quantification of the success of their results, with the stress measure always giving a more optimistic value because it does not measure the recovery of the proximities themselves, but their ordering in the map.
Adding count variables to MDS maps
The maps in Exhibit 9.9 derive from abundance data on five species, so the question now is how to include these species on the map. We shall consider alternative ways of doing this in future chapters, but for the moment let us use the same approach as in Exhibit 9.7, where the species were positioned at the averages of the samples that contained them. The difference here is that we have abundance counts for the species across the samples, so we can position each species at its weighted average across the samples. For example, species a has abundances of 0, 26, 0, 0, 13, etc., and a total abundance of 0 + 26 + 0 + 0 + 13 + ... = 404, so the position of a is at a weighted average of the positions of the 30 samples, with weights 26/404 = 0.064 on sample s2, 13/404 = 0.032 on sample s5, and so on. Exhibit 9.10 shows the species positions on the nonmetric MDS solution, showing, for example, that species a and b are relatively more abundant in the samples at lower left, while c is more associated with samples on the right. Similarly, even though the ordinal sediment types C (clay), S (sand) and G (gravel) have not been used in the mapping, they can be depicted at the averages of the subsets of samples corresponding to them. The samples thus appear to follow a trend from top right (more clay) to bottom left (more gravel).
MULTIDIMENSIONAL SCALING
Exhibit 9.9: Classical MDS map (upper) and nonmetric MDS map (lower) of the Bray-Curtis dissimilarities of Exhibit 5.2. [Figure: samples s1 to s30 plotted with Dimension 1 on the horizontal axis and Dimension 2 on the vertical axis.]
Exhibit 9.10: Nonmetric MDS solution (lower map in Exhibit 9.9) with species a to e added by weighted averaging of sample points, and sediment types C, S and G by averaging. [Figure: samples s1 to s30, species a to e and sediment types C, S and G plotted with Dimension 1 on the horizontal axis and Dimension 2 on the vertical axis.]
SUMMARY: Multidimensional scaling
1. Multidimensional scaling (MDS) is a method that attempts to make a spatial map of a matrix of proximities, either distances or dissimilarities defined between sample units, so that the interpoint distances in the map come as close as possible to the given proximities according to the chosen fit criterion.
2. The fit criterion in metric MDS involves approximating the actual proximity values by the mapped distances, for example by least-squares.
3. Classical MDS is a particular form of metric MDS that relies on the eigenvalue-eigenvector decomposition of a square matrix. The eigenvalues give convenient measures of variance explained on each axis, and the dimensions of the solution are uncorrelated.
4. Nonmetric MDS has a more relaxed fit criterion in that it strives to match only the ordering of the proximities to the ordering of the mapped distances.
5. The error in classical MDS is quantified by the percentage of unexplained variance, while in nonmetric MDS the error is quantified by the stress.
6. The stress measure always gives a more optimistic result, because of the relaxation of approximating the proximity values in the map in favour of their rank ordering.
7. In most cases, however, when the size of the proximity data matrix is quite large, say for at least 30 sample units, the results of the two approaches will be essentially the same.
8. When the proximities are of a Euclidean type, it will be more useful to use the metric scaling approach because of the connection with methods such as principal component analysis (Chapter 12) and correspondence analysis (Chapter 13). There would be little advantage, for example, in applying nonmetric scaling to a matrix of chi-square distances.
9. When the proximities are non-Euclidean, the nonmetric approach avoids the dilemma that the triangle inequality is violated by concentrating on ordering of proximities rather than their actual values.
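Points 3 and 5 of the summary, together with the weighted averaging used to place species in Exhibit 9.10, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abundance counts are simulated rather than taken from Exhibit 1.1, the function names `bray_curtis` and `classical_mds` are hypothetical, the dissimilarities are kept on a 0-1 scale, and the unexplained variance is computed relative to the sum of the positive eigenvalues, which is one possible convention when negative eigenvalues occur.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical abundance counts: 30 samples (rows) by 5 species (columns).
counts = rng.poisson(lam=[12, 8, 15, 10, 14], size=(30, 5)).astype(float)


def bray_curtis(x):
    """Bray-Curtis dissimilarities between the rows of x, on a 0-1 scale."""
    diff = np.abs(x[:, None, :] - x[None, :, :]).sum(axis=2)
    total = (x[:, None, :] + x[None, :, :]).sum(axis=2)
    return diff / total


def classical_mds(d, k=2):
    """Classical MDS: eigen-decomposition of the double-centred matrix."""
    n = d.shape[0]
    centre = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centre @ (d ** 2) @ centre
    eigval, eigvec = np.linalg.eigh(b)
    order = np.argsort(eigval)[::-1]          # largest eigenvalue first
    eigval, eigvec = eigval[order], eigvec[:, order]
    coords = eigvec[:, :k] * np.sqrt(eigval[:k])
    return coords, eigval


d = bray_curtis(counts)
coords, eigval = classical_mds(d, k=2)

# Percentage of variance left unexplained by the two-dimensional map,
# relative to the sum of the positive eigenvalues (an assumed convention).
positive = eigval[eigval > 0]
unexplained = 100.0 * (1.0 - eigval[:2].sum() / positive.sum())

# Species positions: weighted averages of the sample points, with weights
# equal to each species' abundances divided by its total abundance.
weights = counts / counts.sum(axis=0)         # each column sums to 1
species_xy = weights.T @ coords
```

Because each species position is a weighted average with non-negative weights summing to 1, it necessarily falls inside the cloud of sample points, which is why the species in such a map lie closer to the centre than the samples themselves.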

REGRESSION AND PRINCIPAL COMPONENT ANALYSIS

CHAPTER 10
Regression Biplots
In the previous chapter, displays of samples were obtained in a scatterplot with spatial properties (hence often called a map), approximating given distance or dissimilarity matrices. Then some types of variables were added to the display, specifically zero/one categorical variables (e.g., presences of species, sediment categories) and count variables (e.g., species abundances). In this chapter we continue with this theme of adding variables to a plot of samples, including continuous variables in their original form or in fuzzy-coded form. When samples and variables are displayed jointly in such a scatterplot, it is often called a biplot. This designation implies that a certain property holds between the two sets of points in the display in terms of the scalar products between the samples and variables. In this chapter we consider the simplest form of biplot, the regression biplot, which will serve two purposes: first, to give a different geometric interpretation of multiple regression; and second, to give a basic understanding of all the joint displays of samples and variables that will appear in the rest of this book.
Contents
Algebra of multiple linear regression
Geometry of multiple linear regression
Regression biplot
Generalized linear model biplots with categorical variables
Fuzzy-coded species abundances
More than two predictors
SUMMARY: Regression biplots
Algebra of multiple linear regression
The multiple linear regression model postulates that the expected value of a response variable Y (i.e., the mean of Y) is a linear combination of several explanatory variables x1, x2, …, xp:
$$E(Y) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \qquad (10.1)$$
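As a minimal sketch of how a model of the form (10.1) can be fitted, the following uses ordinary least squares on simulated data. The data, the variable names and the "true" parameter values are all invented for illustration and are not those of Exhibit 1.1; least squares is assumed here as the estimation method.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Synthetic explanatory variables (hypothetical stand-ins, not the book's data).
x = rng.normal(size=(n, 3))

# Hypothetical "true" parameters: alpha and beta_1..beta_3 in the notation of (10.1).
alpha_true = 2.0
beta_true = np.array([0.5, -1.0, 0.0])

# Response = linear predictor + noise, so that E(Y) follows (10.1).
y = alpha_true + x @ beta_true + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the constant term alpha.
design = np.column_stack([np.ones(n), x])

# Least-squares estimates of (alpha, beta_1, ..., beta_p).
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
alpha_hat, beta_hat = coef[0], coef[1:]

# Fitted means E(Y) for each sample.
fitted = design @ coef
```

The estimates in `alpha_hat` and `beta_hat` play the role of the numerical coefficients reported in an estimated model, and `fitted` holds the estimated mean response for each sample.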

For example, using the data of Exhibit 1.1, consider the regression of species labelled d on depth, pollution and temperature. The model is estimated as:
$$E(d) = 6.271 + 0.148\,\text{depth} - 1.388\,\text{pollution} + 0.043\,\text{temperature} \qquad (10.2)$$
Notice that, for the moment, we do not comment on whether a linear model of a count variable on three environmental variables is sensible, given that d is not an interval variable; we will return to this point later.
Since the coefficients in (10.2) depend on the units of the variables, we prefer to consider the regression using all variables in comparable units. Usually this is done by standardization of the variables, so that they are all in units of standard deviation. Denoting these standardized variables (i.e., centred and normalized) with an asterisk, the regression model becomes:
$$E(d^*) = 0.347\,\text{depth}^* - 0.446\,\text{pollution}^* + 0.002\,\text{temperature}^* \qquad (10.3)$$
The constant term now vanishes and the coefficients, called standardized regression coefficients, can be compared with one another. Thus it seems that pollution has the strongest influence on the average level of species d, reducing it by 0.446 of a standard deviation for every increase of one standard deviation of pollution. The effect of temperature is minimal and, in fact, is statistically nonsignificant (p = 0.99), while depth and pollution are both significant (p = 0.039 and p = 0.010, respectively), so we drop temperature and consider just the regression on the other two variables, which maintains the values of the coefficients but gives slightly smaller p-values, p = 0.035 and p = 0.008, respectively:
$$E(d^*) = 0.347\,\text{depth}^* - 0.446\,\text{pollution}^* \qquad (10.4)$$
Geometry of multiple linear regression
When referring to the multiple regression model, it is often said that a hyperplane is being fitted to the data. For a single explanatory variable this reduces to a straight line in the familiar case of simple linear regression. When there are two explanatory variables, as in (10.4), the model is a two-dimensional plane in three dimensions, the third dimension being the response variable d*; a view of this plane in three dimensions is given in Exhibit 10.1, with standardized depth* and pollution* forming the two horizontal dimensions and d* the vertical one. Notice how the plane is going down in the direction of pollution, but going up in the direction of depth, according to the regression coefficients (see the web site of the book which shows a video of this three-dimensional image). Notice too the lack of fit of the points to the plane: the value of R² for the regression is 0.442, which means that 44.2% of the variance of d is explained, while 55.8% is unexplained and considered residual, or error, variance.
Exhibit 10.1: Regression plane defined by Equation (10.4) for standardized response d* and standardized explanatory variables pollution* and depth*. The view is from above the plane. [Figure: three-dimensional plot with axes Depth*, Pollution* and d*.]
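The effect of standardization and the meaning of R² can be checked numerically with a small sketch. It assumes simulated data with predictors on deliberately different scales (not the data of Exhibit 1.1) and a least-squares fit; the helper names `ols` and `standardize` are hypothetical. It illustrates three general facts: the constant term vanishes after standardization, each standardized coefficient is the raw coefficient multiplied by the ratio of the predictor's standard deviation to that of the response, and R² is the same for the raw and the standardized fit.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100

# Synthetic predictors on deliberately different scales (hypothetical data).
x = rng.normal(size=(n, 2)) * np.array([15.0, 2.0]) + np.array([70.0, 5.0])
y = 6.0 + 0.15 * x[:, 0] - 1.4 * x[:, 1] + rng.normal(scale=4.0, size=n)


def ols(x, y):
    """Return (intercept, slopes, R^2) of a least-squares fit of y on x."""
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - resid.var() / y.var()   # 1 - residual variance / total variance
    return coef[0], coef[1:], r2


def standardize(v):
    """Centre to mean 0 and scale to standard deviation 1, column by column."""
    return (v - v.mean(axis=0)) / v.std(axis=0, ddof=1)


a_raw, b_raw, r2_raw = ols(x, y)
a_std, b_std, r2_std = ols(standardize(x), standardize(y))

# Standardized coefficients are the raw ones rescaled by sd(x_j) / sd(y).
b_rescaled = b_raw * x.std(axis=0, ddof=1) / y.std(ddof=1)
```

Here `a_std` is zero up to rounding error, `b_std` agrees with `b_rescaled`, and `r2_raw` equals `r2_std`, so the percentage of explained variance does not depend on the units in which the variables are expressed.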
The linearity of the plane means that predictions of the same mean values form parallel straight lines in the plane. From a mountaineer’s point of view, if you are standing on the plane and want to stay at the same height, you need to walk in a straight line. Projecting these parallel straight lines onto the depth-pollution plane gives the contours, also called isolines, as shown in Exhibit 10.2. Finally, the vector in the depth-pollution plane with coordinates equal to the regression coefficients, [0.347 −0.446], called the gradient, indicates the direction of steepest ascent in the regression plane, and is perpendicular to the contours. Given the geometry of the regression plane in Exhibit 10.2, it follows that we can do away with the d* dimension, just as cartographers do, and consider just the depth-pollution plane and the contours of the regression plane, which are perpendicular to the gradient vector. Exhibit 10.3 shows this “ground view” of the model.
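The relationship between the gradient and the contours can be verified with a few lines of arithmetic. The sketch below takes the two coefficients of (10.4), with the pollution coefficient negative as the text describes, and assumes an arbitrary starting point in the plane; the names `predict`, `contour_dir` and `start` are hypothetical.

```python
import numpy as np

# Standardized regression coefficients from (10.4): depth*, pollution*.
gradient = np.array([0.347, -0.446])


def predict(point):
    """Mean of d* predicted by (10.4) at a (depth*, pollution*) point."""
    return float(gradient @ np.asarray(point))


# A direction along the contours: the gradient rotated by 90 degrees.
contour_dir = np.array([-gradient[1], gradient[0]])

start = np.array([0.5, 1.0])   # arbitrary starting point in the plane

# Walking along the contour direction leaves the prediction unchanged.
along_contour = predict(start + 2.0 * contour_dir) - predict(start)

# Walking one unit along the gradient gives the largest possible increase,
# equal to the length of the gradient vector.
unit_gradient = gradient / np.linalg.norm(gradient)
along_gradient = predict(start + unit_gradient) - predict(start)

# No other unit direction climbs more steeply than the gradient direction.
angles = np.linspace(0.0, 2.0 * np.pi, 361)
unit_dirs = np.column_stack([np.cos(angles), np.sin(angles)])
max_other = (unit_dirs @ gradient).max()
```

The value of `along_contour` is zero up to rounding error, while `along_gradient` equals the length of the gradient vector, which is the steepest rate of increase of the predicted mean per unit distance in the plane.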
The short arrow labelled d is the gradient vector. The dashed line through this vector is called the biplot axis for the variable d. Contour lines are perpendicular to the biplot axis. Exhibit 10.3(a) corresponds to the darker “shadow” in Exhibit 10.2 in the depth-pollution plane, where the contours are in units of standard