
MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA

Exhibit 18.8: Generalized additive modelling of diversity as smooth functions of depth and temperature: depth is diagnosed as having a significant quadratic relationship, while the slightly increasing linear relationship with temperature is nonsignificant. Both plots are centred vertically at mean diversity, so show estimated deviations from the mean. Confidence regions for the estimated relationships are also shown

[Figure: two panels, s(Depth,1.96) against Depth (200 to 450) and s(Temperature,1) against Temperature (0 to 4); both vertical axes run from −1.0 to 0.2.]

STATISTICAL MODELLING

 

 

 

 

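$$
\begin{array}{rcccccc}
\text{mean } H & = & -2.11 & + & 0.0213\,\text{depth} & - & 0.0000324\,\text{depth}^2 \\
 & & (0.80) & & (0.0048) & & (0.0000071) \\
 & & (p = 0.01) & & (p < 0.0001) & & (p < 0.0001)
\end{array}
\qquad (18.8)
$$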

in which case the AIC is 79.3, and explained variance is 19.9%.
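As a rough numerical check on the fitted quadratic, the sketch below evaluates the curve and locates its turning point. It assumes that equation (18.8) reads mean H = −2.11 + 0.0213 depth − 0.0000324 depth²: the signs and the decimal position of the depth coefficient are assumptions, inferred from the reported standard errors and from the concave shape of the depth effect. The function and variable names are illustrative only.

```python
# Coefficients ASSUMED from equation (18.8); the signs and the decimal
# placement are inferred from the standard errors and the concave depth effect.
B0 = -2.11        # constant
B1 = 0.0213       # coefficient of depth
B2 = -0.0000324   # coefficient of depth squared


def mean_diversity(depth):
    """Predicted mean diversity H at a given depth (m) under the assumed quadratic."""
    return B0 + B1 * depth + B2 * depth ** 2


# A concave parabola peaks where its derivative B1 + 2 * B2 * depth is zero.
peak_depth = -B1 / (2 * B2)
peak_value = mean_diversity(peak_depth)

for depth in (200, 300, 400, 450):
    print(f"depth {depth} m: predicted H = {mean_diversity(depth):.2f}")
print(f"peak at about {peak_depth:.0f} m, predicted H = {peak_value:.2f}")
```

Under these assumed values the curve peaks inside the 200 to 450 m range shown on the depth axis of Exhibit 18.8, which is what a concave depth effect would lead one to expect.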

As a final illustration of the power of GAMs, we can make a model with a smooth interaction of depth and temperature. This has an even lower AIC value of 70.7, and visualizing the diagnosed relationship now requires making a contour plot of the model values in the space of the depth and temperature variables, or making a perspective plot in three dimensions (see Exhibit 18.9). To test whether the interaction is significant we can compare the residual deviances for the model shown in Exhibit 18.8 (85.04) and the one in Exhibit 18.9 (83.67), i.e. a difference of only 1.37 units of deviance, which is not significant.² All these results and considerations lead us to the conclusion that the parametric model (18.8) with depth modelled as a quadratic is the one of choice: it has few parameters, is a function that can be easily interpreted and computed, and does almost as well as several competing models that are more complex. Here we have demonstrated how GAM can help to suggest a nonlinear model for a regression. We will return to GAM modelling in Chapter 20, where we show that it is a convenient and flexible approach for taking into account the effect of spatial position.

² Here we have not entered into the aspect of the degrees of freedom for this comparison of GAM models, nor how p-values are computed. In GAM the degrees of freedom are not integers, but estimates on a continuous scale. Hence, comparing models leads to differences in degrees of freedom that are also not whole numbers: in this particular case the degrees of freedom associated with the deviance difference of 1.37 are 1.01, close enough to 1 for all practical purposes.

Exhibit 18.9: Contour plot (upper) and perspective plot (lower) of the diagnosed interaction regression surface of depth and temperature, predicting the deviations from mean diversity. The concave relationship with depth is clearly seen as well as the slight relationship with temperature. The difference between the model with or without interactions is, however, not significant

[Figure: contour plot with Depth (200 to 450) on the horizontal axis and Temperature on the vertical axis; perspective plot with axes Depth, Temperature and s(Depth,Temperature,4.33).]

Classification trees

We close this chapter on statistical modelling by showing a completely different approach to modelling a continuous or categorical response variable, by constructing a type of decision tree with the goal of predicting the continuous response variable (regression trees) or categorical response category (classification trees). We consider the latter case first, and take as example the presence/absence of polar cod (Boreogadus saida) in a sample. In the data matrix there are 21 samples with polar cod and 68 without, so the response data consist of 21 ones and 68 zeros. Applying a classification tree algorithm, with two predictors, depth and temperature, produces the tree model of Exhibit 18.10.

Exhibit 18.10: Classification tree model for predicting the presence of polar cod. The one branch which predicts their presence gives the rule: temperature < 1.6 °C and depth ≥ 306 m. This rule correctly predicts the presence of polar cod in 16 samples but misclassifies 5 samples as having polar cod when they do not

Temperature ≥ 1.6
├─ yes (left): False, 51/0
└─ no (right): Depth < 305.5
   ├─ yes (left): False, 12/5
   └─ no (right): True, 5/16

(Counts at the terminal nodes are samples without polar cod / samples with polar cod.)

The 89 samples are notionally fed down the tree and are split by the decisions at each branch, where each decision indicates the subsample that goes to the left hand side. For example, samples going to the left at the top of the tree satisfy the condition that temperature is greater than or equal to 1.6 °C, while the others for which temperature is less than 1.6 °C go to the right. Of the 89 samples, 51 go to the left, and all of them have no polar cod, so the prediction is False (i.e., no polar cod). The remaining 38 samples that go to the right are optimally split into two groups according to depth, 305 m or less to the left, and 306 m or more to the right. Of 38 samples, 17 go to the left and of these 12 have no polar cod, so False is predicted, while 21 go to the right, and a majority has polar cod so polar cod is predicted (True). The final branches of the tree, where the final predictions are made, are called terminal nodes, and the objective is to make them as concentrated as possible into one category.

The beauty of this approach is that it copes with interactions in a natural way by looking for combinations of characteristics that explain the response. In this case the combination of lower temperature (lower than 1.6 °C) and greater depth (greater than or equal to 306 m) is a prediction rule for polar cod; otherwise no polar cod are predicted.
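The prediction rule can be written out as two nested decisions. The following is a minimal sketch, not the output of any tree-fitting software: the thresholds (1.6 and 305.5) and the terminal-node counts are those shown in Exhibit 18.10, while the function name, the dictionary layout and the example sample are hypothetical.

```python
def predict_polar_cod(temperature, depth):
    """Apply the two decisions of Exhibit 18.10; True means polar cod is predicted."""
    if temperature >= 1.6:   # first decision: warmer samples go to the left
        return False
    if depth < 305.5:        # second decision: shallower samples go to the left
        return False
    return True              # cold and deep: polar cod predicted


# Terminal-node counts from Exhibit 18.10 as (without polar cod, with polar cod).
terminal_nodes = {
    "temperature >= 1.6": (51, 0),
    "temperature < 1.6 and depth < 305.5": (12, 5),
    "temperature < 1.6 and depth >= 305.5": (5, 16),
}

# Each terminal node predicts its majority category.
node_predictions = {
    rule: present > absent for rule, (absent, present) in terminal_nodes.items()
}

# The nodes together must account for all 68 absences and 21 presences.
total_absent = sum(absent for absent, _ in terminal_nodes.values())
total_present = sum(present for _, present in terminal_nodes.values())

print(predict_polar_cod(temperature=0.5, depth=350))  # hypothetical sample
print(node_predictions)
print(total_absent, total_present)
```

Only the third node has a majority of presences, which is why a single branch of the tree predicts polar cod.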

As a comparison, let us perform a logistic regression predicting polar cod, using depth and temperature. Both variables are significant predictors but result in only 12 correct predictions of polar cod presence. The misclassification tables for the two approaches are given in Exhibit 18.11.
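The comparison between the two approaches comes down to simple arithmetic on the two misclassification tables. The sketch below uses the counts reported in Exhibit 18.11; the dictionary layout and the function name `summarize` are illustrative choices, not part of the original analysis.

```python
# Counts from Exhibit 18.11, keyed by (predicted, truth); True = polar cod.
tree = {(True, True): 16, (True, False): 5, (False, True): 5, (False, False): 63}
logistic = {(True, True): 12, (True, False): 6, (False, True): 9, (False, False): 62}


def summarize(table):
    """Return correct count, misclassification rate and share of presences detected."""
    total = sum(table.values())
    correct = table[(True, True)] + table[(False, False)]
    presences = table[(True, True)] + table[(False, True)]
    return {
        "correct": correct,
        "misclassification_rate": (total - correct) / total,
        "presences_detected": table[(True, True)] / presences,
    }


for name, table in (("classification tree", tree), ("logistic regression", logistic)):
    result = summarize(table)
    print(
        f"{name}: {result['correct']} of {sum(table.values())} correct, "
        f"misclassification rate {result['misclassification_rate']:.3f}, "
        f"presences detected {result['presences_detected']:.3f}"
    )
```

Both tables sum to the 89 samples, with 21 true presences in each, so the two methods are being compared on the same data.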

Exhibit 18.11: Comparison of misclassification rates for the classification tree of Exhibit 18.10 and for logistic regression, using the same predictors. The classification tree correctly predicts presence and absence in 79 of the 89 samples, while logistic regression correctly predicts 74

                   CLASSIFICATION TREE            LOGISTIC REGRESSION
                   Truth                          Truth
PREDICTED          Polar cod    No polar cod      Polar cod    No polar cod
Polar cod          16           5                 12           6
No polar cod       5            63                9            62

Regression trees

The same style of tree model can be constructed for a continuous response. In this case the idea is to arrive at terminal nodes with standard deviations (or any other appropriate measure of variability for the response) as low as possible. As an example, we return to the diversity response, this time choosing latitude and longitude coordinates as the predictors in order to classify the samples into regions of homogeneous diversity. The result is given in Exhibit 18.12.

The regression tree partitions the sampling area and can be drawn on the map in Exhibit 18.13. The most diverse area is in the north-west, while the least diverse is in the central western block.
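To make the idea of low variability within terminal nodes concrete, here is a minimal sketch of how a single split of a regression tree can be chosen: every candidate threshold on one predictor is tried, and the one giving the smallest total within-node sum of squares is kept. This is a generic illustration, not necessarily the algorithm used for Exhibit 18.12, and the latitude and diversity values are hypothetical numbers made up for the example.

```python
from statistics import fmean


def sum_sq(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, cost) for the split x < threshold that minimizes
    the total within-node sum of squares of y."""
    pairs = sorted(zip(x, y))
    best = (None, sum_sq(y))  # no split at all is the baseline
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        cost = sum_sq(left) + sum_sq(right)
        if cost < best[1]:
            best = (threshold, cost)
    return best


# Hypothetical latitudes and diversities, for illustration only.
latitude = [70.5, 71.0, 71.8, 72.4, 73.0, 73.3, 74.5, 75.0]
diversity = [1.3, 1.2, 1.5, 1.4, 0.7, 0.9, 1.5, 1.6]

threshold, cost = best_split(latitude, diversity)
print(f"best split: latitude < {threshold:.2f}, within-node sum of squares {cost:.3f}")
```

A full tree repeats this search within each resulting subset, over all predictors, until a stopping rule is met; each terminal node then reports the mean response of its samples.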

Exhibit 18.12: Regression tree predicting fish diversity from latitude and longitude of sample positions. The terminal nodes give the average diversity of the samples that fall into them. This tree yields the spatial classification of the sampling region given in Exhibit 18.13

Latitude < 74.09
├─ yes: Latitude ≥ 72.61
│  ├─ yes: Latitude < 73.61
│  │  ├─ yes: Longitude < 26.23
│  │  │  ├─ yes: 0.7244 (n = 10)
│  │  │  └─ no: 0.9954 (n = 12)
│  │  └─ no: 1.201 (n = 12)
│  └─ no: Latitude < 71.53
│     ├─ yes: 1.285 (n = 11)
│     └─ no: 1.444 (n = 16)
└─ no: Longitude ≥ 30.63
   ├─ yes: 1.289 (n = 10)
   └─ no: 1.553 (n = 18)


Exhibit 18.13: Map of Barents Sea showing the locations of the 89 sampling sites (see Exhibit 11.1) and the slicing up of the region according to the regression tree of Exhibit 18.12, and the average fish diversities in each block. Most of the slices divide the latitudes north to south, with just two east-west divisions of the longitudes. Dark locations show the 21 sites where polar cod were found

[Figure: map with longitude (15 to 40) on the horizontal axis and latitude (70 to 76) on the vertical axis; the blocks are labelled with average diversities 1.55, 1.29, 1.20, 0.72, 1.00, 1.44 and 1.29.]

SUMMARY: Statistical modelling

1. The family of generalized linear models (GLMs) includes multiple linear regression, Poisson regression, and logistic regression, when the response variable is continuous, count or categorical, respectively, for which the assumed conditional distributions, given a set of explanatory variables (or predictors), are normal, Poisson and binomial respectively.

2. Each of these models assumes that a transformation of the mean is a linear function of the explanatory variables. This transformation is called the link function. In multiple regression there is no transformation, and the link is thus the identity. In Poisson regression the link is the logarithm, and in logistic regression it is the logit function, or log-odds.

3. To take into account nonlinearities, polynomial functions of the explanatory variables or fuzzy coding into several categories can be used.

4. Generalized additive models (GAMs) are even more general than GLMs, allowing considerable flexibility in the form of the relationship of the response with the explanatory variables.

5. Both GLM and GAM environments allow interaction effects to be included and tested.

6. Classification and regression trees are an alternative that specifically looks at the interaction structure of the predictors and comes up with combinations of intervals that predict either categorical or continuous responses with minimum error.
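The three link functions named in point 2 can be illustrated with a few lines of code. This is a minimal sketch: the intercept, slope and predictor value are hypothetical, and the names `LINKS`, `logit` and `inverse_logit` are illustrative rather than taken from any statistical package.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(eta):
    """Map a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# (link, inverse link) for the three models named in the summary.
LINKS = {
    "multiple regression": (lambda mu: mu, lambda eta: eta),  # identity
    "Poisson regression": (math.log, math.exp),               # logarithm
    "logistic regression": (logit, inverse_logit),            # logit
}

# Hypothetical linear predictor a + b * x, chosen only for illustration.
a, b, x = -1.0, 0.5, 3.0
eta = a + b * x

# The same linear predictor implies a different mean under each link.
fitted_means = {model: inverse(eta) for model, (link, inverse) in LINKS.items()}
for model, mean in fitted_means.items():
    print(f"{model}: linear predictor {eta:.2f} -> mean {mean:.3f}")
```

Applying each link to its fitted mean recovers the linear predictor, which is the sense in which a transformation of the mean is a linear function of the explanatory variables.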


TWO CASE STUDIES


CHAPTER 19

Case Study 1: Temporal Trends and Spatial Patterns across a Large Ecological Data Set

The examples presented in previous chapters have generally been on small- to medium-sized data sets that are good for teaching and understanding the basic concepts of the methodologies used. We conclude with two chapters detailing much larger studies that take full advantage of multivariate analysis to synthesize complex phenomena in a format that is easier to interpret and from which it is easier to come to substantive conclusions. The two chapters treat the same set of data, a large set of samples of fish species in the Barents Sea over a six-year period, where the spatial location of each sample is known as well as additional environmental variables such as depth and water temperature. In the present chapter we shall study the temporal trends and spatial patterns of the fish compositions and also try to account for these patterns in terms of the environmental variables. But before applying multivariate analysis to data across time and space, we have to consider carefully the areal sampling across the years and reweight the observations to eliminate sampling bias.

Contents

Sampling bias
Data set “Barents fish trends”
Reweighting samples for fuzzy coded data
Correspondence analysis of reweighted data
Canonical correspondence analysis of reweighted data
Some permutation tests
Isolating the spatial part of the explained inertia
SUMMARY: Case Study 1: Temporal trends and spatial trends across a large ecological data set

Sampling bias

In this chapter we shall be considering samples in different regions over time over an area of interest. An important consideration is whether data have been collected in a balanced way over time in each region. This is important if one wants to summarize the data over the whole area and make temporal comparisons. If
