mean(H) = −2.11 + 0.0213 depth − 0.0000324 depth²                                    (18.8)

with estimated standard errors of 0.80, 0.0048 and 0.0000071 for the three coefficients respectively (p = 0.01, p < 0.0001 and p < 0.0001),
in which case the AIC is 79.3, and explained variance is 19.9%.
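As a minimal sketch of how such a model can be fitted in R, assuming a data frame called fish with the diversity values in a column H and the depths in a column depth (these names are illustrative, not the book's own):

    # Quadratic regression of diversity on depth, as in equation (18.8);
    # 'fish', 'H' and 'depth' are assumed names for the data and variables
    quad <- lm(H ~ depth + I(depth^2), data = fish)
    summary(quad)   # estimates, standard errors and p-values
    AIC(quad)       # information criterion, for comparison with other models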
As a final illustration of the power of GAMs, we can fit a model with a smooth interaction of depth and temperature. This has an even lower AIC value of 70.7, and visualizing the diagnosed relationship now requires a contour plot of the model values in the space of the depth and temperature variables, or a perspective plot in three dimensions – see Exhibit 18.9. To test whether the interaction is significant we can compare the residual deviances of the model shown in Exhibit 18.8 (85.04) and the one in Exhibit 18.9 (83.67), i.e. a difference of only 1.37 units of deviance, which is not significant.²

All these results and considerations lead us to the conclusion that the parametric model (18.8), with depth modelled as a quadratic, is the one of choice – it has few parameters, is a function that can be easily interpreted and computed, and does almost as well as several competing models that are more complex. Here we have demonstrated how a GAM can help to suggest a nonlinear model for a regression. We will return to GAM modelling in Chapter 20, where we show that it is a convenient and flexible approach for taking the effect of spatial position into account.
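Such a comparison can be sketched with the mgcv package in R, again assuming a data frame fish with columns H, depth and temperature; the additive specification below is only one possible form for the model of Exhibit 18.8:

    library(mgcv)
    # additive model: separate smooths of depth and of temperature
    gam_add <- gam(H ~ s(depth) + s(temperature), data = fish)
    # model with a smooth (bivariate) interaction of depth and temperature
    gam_int <- gam(H ~ s(depth, temperature), data = fish)
    AIC(gam_add, gam_int)                    # compare information criteria
    anova(gam_add, gam_int, test = "Chisq")  # difference in residual deviance
    # contour plot of the fitted surface in the depth-temperature plane
    vis.gam(gam_int, view = c("depth", "temperature"), plot.type = "contour")

A tensor-product smooth, te(depth, temperature), is an alternative specification of the interaction that is preferable when the two variables are measured on very different scales.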
Classification trees

We close this chapter on statistical modelling by showing a completely different approach to modelling a continuous or categorical response variable, by constructing a type of decision tree with the goal of predicting the continuous response variable (regression trees) or the categorical response category (classification trees). We consider the latter case first, and take as example the presence/absence of polar cod (Boreogadus saida) in a sample. In the data matrix there are 21 samples with polar cod and 68 without, so the response data consist of 21 ones and 68 zeros. Applying a classification tree algorithm, with two predictors, depth and temperature, produces the tree model of Exhibit 18.10. The 89 samples are notionally fed down the tree and are split by the decisions at each branch, where each decision indicates the subsample that goes to the left-hand side.
² Here we have not entered into the aspect of the degrees of freedom for this comparison of GAM models, nor how p-values are computed. In GAMs the degrees of freedom are not integers, but estimates on a continuous scale. Hence, comparing models leads to differences in degrees of freedom that are also not whole numbers – in this particular case the degrees of freedom associated with the deviance difference of 1.37 are 1.01, close enough to 1 for all practical purposes.
Exhibit 18.10: Classification tree model for predicting the presence of polar cod. The one branch which predicts their presence gives the rule: temperature < 1.6 ºC and depth ≥ 306 m. This rule correctly predicts the presence of polar cod in 16 samples but misclassifies 5 samples as having polar cod when they do not.
For example, samples going to the left at the top of the tree satisfy the condition that temperature is greater than or equal to 1.6 ºC, while the others, for which temperature is less than 1.6 ºC, go to the right. Of the 89 samples, 51 go to the left, and all of them have no polar cod, so the prediction is False (i.e., no polar cod). The remaining 38 samples that go to the right are optimally split into two groups according to depth: 305 m or less to the left, and 306 m or more to the right. Of these 38 samples, 17 go to the left, and of these 12 have no polar cod, so False is predicted, while 21 go to the right, where the majority have polar cod, so polar cod is predicted (True). The final branches of the tree, where the final predictions are made, are called terminal nodes, and the objective is to make them as concentrated as possible into one category.
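A tree of this kind can be grown, for example, with the rpart package in R; the sketch below assumes a data frame fish with a 0/1 presence column polarcod and columns temperature and depth (illustrative names), and rpart's default settings will not necessarily reproduce Exhibit 18.10 exactly:

    library(rpart)
    # classification tree for presence/absence of polar cod
    tree <- rpart(factor(polarcod) ~ temperature + depth,
                  data = fish, method = "class")
    print(tree)              # text version of the splits and terminal nodes
    plot(tree); text(tree)   # simple plot of the tree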
The beauty of this approach is that it copes with interactions in a natural way, by looking for combinations of characteristics that explain the response: in this case the combination of lower temperatures (below 1.6 ºC) and greater depths (306 m or more) is the prediction rule for the presence of polar cod; otherwise no polar cod are predicted.
As a comparison, let us perform a logistic regression predicting polar cod, using depth and temperature. Both variables are significant predictors but result in only 12 correct predictions of polar cod presence. The misclassification tables for the two approaches are given in Exhibit 18.11.
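Continuing the sketch in R with the assumed data frame fish and the tree object fitted above, the logistic regression and the two misclassification tables can be obtained as follows (the 0.5 cut-off for the logistic predictions is a conventional choice, not necessarily the one behind Exhibit 18.11):

    # logistic regression with the same two predictors
    logit <- glm(polarcod ~ depth + temperature, data = fish, family = binomial)
    summary(logit)
    # misclassification tables in the style of Exhibit 18.11
    pred_logit <- as.numeric(fitted(logit) > 0.5)
    table(observed = fish$polarcod, predicted = pred_logit)
    pred_tree  <- predict(tree, type = "class")
    table(observed = fish$polarcod, predicted = pred_tree)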
Regression trees

The same style of tree model can be constructed for a continuous response. In this case the idea is to arrive at terminal nodes with standard deviations (or
SUMMARY: Statistical modelling

4. Generalized additive models (GAMs) are even more general than GLMs, allowing considerable flexibility in the form of the relationship of the response with the explanatory variables.
5. Both the GLM and GAM environments allow interaction effects to be included and tested.
6. Classification and regression trees are an alternative that specifically looks at the interaction structure of the predictors and comes up with combinations of intervals that predict either categorical or continuous responses with minimum error.
CHAPTER 19
Case Study 1:
Temporal Trends and Spatial Patterns
across a Large Ecological Data Set
The examples presented in previous chapters have generally been on small- to medium-sized data sets that are good for teaching and understanding the basic concepts of the methodologies used. We conclude with two chapters detailing much larger studies that take full advantage of multivariate analysis to synthesize complex phenomena in a format that is easier to interpret and that leads to substantive conclusions. The two chapters treat the same set of data, a large set of samples of fish species in the Barents Sea over a six-year period, where the spatial location of each sample is known, as well as additional environmental variables such as depth and water temperature. In the present chapter we shall study the temporal trends and spatial patterns of the fish compositions and also try to account for these patterns in terms of the environmental variables. But before applying multivariate analysis to data across time and space, we have to consider carefully the areal sampling across the years and reweight the observations to eliminate sampling bias.
Contents
Sampling bias
Data set “Barents fish trends”
Reweighting samples for fuzzy coded data
Correspondence analysis of reweighted data
Canonical correspondence analysis of reweighted data
Some permutation tests
Isolating the spatial part of the explained inertia
SUMMARY: Case Study 1: Temporal trends and spatial trends across a large ecological data set
Sampling bias

In this chapter we shall be considering samples in different regions over time, across an area of interest. An important consideration is whether the data have been collected in a balanced way over time in each region. This is important if one wants to summarize the data over the whole area and make temporal comparisons. If