Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

1greenacre_m_primicerio_r_multivariate_analysis_of_ecological

.pdf
Скачиваний:
6
Добавлен:
19.11.2019
Размер:
7.36 Mб
Скачать

MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA

PERES-NETO P.R., D.A. JACKSON, and K.M. SOMERS. “Giving meaningful interpretation to ordination axes: assessing loading significance in principal component analysis”. Ecology 84 (2003): 2347-2363.

(The authors compare a variety of approaches for assessing the signiÞcance of eigenvector coefÞcients in terms of type I error rates and power).

PIELOU, E.C. The Interpretation of Ecological Data: a Primer on ClassiÞcation and Ordination. New York: John Wiley & Sons, Inc., 1984.

(An early review of ecological data analysis linking ecological and statistical concepts to ease the interpretation of results).

TER BRAAK, C.J.F. “Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis”. Ecology 67 (1986): 1167-1179.

(Another citation classic for an author that has made important contributions to the Þeld, ensuring a wide availability of the new methods via software development (Canoco, in collaboration with Šmilauer Ð see Lepš and Šmilauer 2003)).

TER BRAAK, C.J.F., and I.C. PRENTICE. “A theory of gradient analysis”. Advances in Ecological Research 18 (1988): 271-313.

(A classic introduction to gradient analysis theory and its ecological applications).

WIEDMANN, M., M. ASCHAN, G. CERTAIN, A. DOLGOV, M. GREENACRE, E. JO-

HANNESEN, B. PLANQUE, and R. PRIMICERIO. “Functional diversity of the Barents Sea fish community”. Marine Ecology Progress Series, in press, doi: 10.3354/ meps10558.

(A more extensive analysis of functional traits of Barents Sea Þsh species, compared to our case study of Chapter 20, using more Þsh species and a more recent time series).

YEE T.W. “Constrained additive ordination”. Ecology 87 (2006): 203-213.

(The paper introduces constrained additive ordination (CAO) models, described as Òloosely speaking, [É] generalized additive models Þtted to a very small number of latent variablesÓ. The paper provides the R code to implement the CAO methodology with some clear example applications).

ZUUR A.F., E.N. IENO, and C.S. ELPHICK. “A protocol for data exploration to avoid common statistical problems”. Methods in Ecology and Evolution 1 (2010): 3-14.

(The authors provide a protocol for data exploration, discussing Òcurrent tools to detect outliers, heterogeneity of variance, collinearity, dependence of observations, problems with interactions, double zeros in multivariate analysis, zero inßation in generalized linear modelling, and the correct type of relationships between dependent and independent variables; and []

300

BIBLIOGRAPHY AND WEB RESOURCES

provide advice on how to address these problems when they ariseÒ. The paper also provides

R code to implement the protocol).

ZUUR A.F., E.N. IENO, N.J. WALKER, A.A. SAVELIEV, and G.M. SMITH. Mixed Effects Models and Extensions in Ecology with R. New York: Springer, 2009.

(A very popular introduction to mixed effects models rich with relevant ecological examples based on the kind of ÒmessyÓ data ecologists need to cope with).

http://www.multivariatestatistics.org

Web resources

GREENACRE M.J. Multivariate books, data sets and R scripts.

(This web site supports three books published by the BBVA Foundation: La Práctica del Análisis de Correspondencias (Spanish translation of Correspondence Analysis in Practice, Second Edition), Biplots in Practice and the present book. Full text, data sets and R scripts available for free download).

http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf

OKSANEN J. Multivariate analyses of ecological communities in R: vegan tutorial. 43 p.

(This compact tutorial to the R package vegan is a very useful, brief introduction to the application of multivariate statistics in community ecology).

http://ordination.okstate.edu

PALMER M.W. Ordination methods for ecologists

(The web site is a very useful overview of ordination methods, providing links, references, and guides for self study, including a much needed glossary).

301

MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA

302

APPENDIX C

Computational Note

This appendix is a short summary of the software used for the analyses in this book, using packages from the R environment for statistical computing and graphics. Data sets and R code for reproducing the results are given online at the supporting website:

www.multivariatestatistics.org

As an introduction to the online code, we give here a list of some of the common

R functions and packages used in the computations of this book.

Contents

 

Functions from the base package in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

303

Package ca . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

305

Package vegan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

305

Packages maptools, mapdata and mapproj . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

306

Package mgcv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

306

Packages rpart and tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

306

Additional functions in supporting material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

306

The open-access R software has become the standard for statistical computing, especially for conducting research, thanks to its flexible programming environment. It is downloadable for free from the R project website

www.r-project.org

The simple installation process sets up R with what is called the base package, consisting of various functions that are commonly used (later we list more specialized packages that need to be downloaded and installed separately). Here we list some of the useful functions in the base package:

Functions from the base package in R

303

MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA

hist – takes a set of data on a continuous or count variable and makes a histogram; user can choose the interval boundaries; see Exhibits 1.2 and 1.3, for example.

qqnorm – takes a set of data on a continuous variable and plots the sample quantiles against the quantiles of a normal distribution; if the points follow the 45-degree diagonal line of the plot, the data can be regarded as normal, otherwise not; see Chapter 17.

shapiro.wilks – takes a set of data on a continuous variable and performs the Shapiro-Wilks test for normality; if the p -value is small then normality is rejected; see Chapter 17.

pairs – takes a rectangular data matrix as input and computes all bivariate scatterplots; see Exhibits 1.4 and 20.8.

boxplot – takes a set of data on a continuous variable and makes a box-and- whisker plot, optionally with a categorical variable that makes boxplots for each category alongside one another, with a common scale; see Exhibits 1.5 and 1.8.

scale – takes a set of data on a continuous variable and standardizes it by subtracting its mean (i.e., centering) and dividing by its standard deviation (i.e., normalization); centering or normalization can be switched off; see Chapter 3.

dist – takes a rectangular data matrix as input and computes a distance matrix between the rows, with several choices of distance functions; for example, see Exhibit 4.5.

cor – takes either two sets of data or a matrix of data with variables in columns and computes the single correlation in the former case, or the correlation matrix in the latter case; optionally computes Spearman rank correlations; see Exhibit 6.4.

table – takes a single categorical and counts the frequencies in each category; if two categorical variables are given the function counts the frequencies in the cross-tabulation; see Exhibit 6.6.

sample – takes a set of data and performs random sampling, without replacement (this re-arranges, or shuffles, the data set randomly) for permutation testing or with replacement for bootstrapping; see Chapter 18.

304

COMPUTATIONAL NOTE

hclust – takes a matrix of distance or dissimilarities (e.g., created by function dist) and performs hierarchical clustering; various clustering algorithms can be selected, including Ward clustering; see Chapters 7 and 8.

kmeans – takes a rectangular matrix of data and the specified number of groups and performs k-means nonhierarchical clustering; see Chapter 8.

cmdscale – takes a matrix of distances or dissimilarities (e.g., created by function dist) and performs classical multidimensional scaling; see Chapter 9.

lm – takes data on a response variable and one or more explanatory variables (or predictors) and performs least-squares linear regression; weights can be specified for weighted least-squares regression; see Chapters 10–20.

glm – takes data on a response variable and one or more explanatory variables (or predictors) and performs generalized linear modelling (GLM); several link functions and error distributions can be specified, giving linear regression, Poisson regression and logistic regression, for example; see Chapters 10 and 18.

prcomp and princomp – alternative functions for computing a principal component analysis on a rectangular data matrix, where rows are assumed to be sampling units and columns to be variables; see Chapter 12.

kruskal.test – takes a data set for a continuous variable and a grouping variable and performs the Kruskal-Wallis rank test of difference between groups (the nonparametric equivalent of a one-way ANOVA); see Chapter 17.

The ca package performs correspondence analysis (function ca) and multiple

Package ca

correspondence analysis (function mjca – this generalization of CA to multi-

 

 

variate categorical data, more used in the social sciences, is not discussed in this

 

book). Various graphical options are available using function plot.ca, includ-

 

ing plotting with contribution coordinates and three-dimensional visualization

 

of a CA solution with three principal axes, using function plot3d.ca, including

 

interaction with 3d display such as rotation and zooming. The 3d graphics uses

 

the R package rgl; see Chapter 13.

 

The vegan package performs a variety of multivariate analyses and includes

Package vegan

most of the methods treated in this book, and aimed at biologists (specifically

 

 

botanists, but the terminology can be equated to any biological application).

 

Methods that are not included in R’s base package described above are compu-

 

tation of Bray-Curtis and Gower dissimilarities (function vegdist with options

 

method="bray" or method="gower" respectively), various diversity measures

 

305

Packages maptools, mapdata and mapproj

MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA

(function diversity), nonmetric multidimensional scaling (function metaMDS), canonical correspondence analysis (function cca), redundancy analysis (function rda – like cca but for continuous response variables) and various permutation tests (e.g., function permutest); see Chapters 15 and 20.

Packages maptools and mapdata provide functions that allow drawing of geographical maps, with mapdata containing the outlines of all the world’s landmasses and several countries. The mapproj package performs a variety of map projections, using function mapproject, based on the latitude and longitude coordinates of a set of spatial locations. This is useful for obtaining coordinates on which Euclidean distances can be computed that approximate great circle distances; see Chapters 11 and 19.

Package mgcv This package performs generalized additive modelling (GAM), using a function gam that functions very similarly to glm for generalized linear modelling. Explanatory variables can be defined as smooth functions using the function s, for example s(x) for predictor x; see Chapters 18, 19 and 20.

Packages rpart and tree

These packages are alternatives for classification and regression trees, also called recursive partitioning (hence rpart). They define tree models in the same style as functions lm, glm and gam, as a response variable ~ sum of explanatory variables. Plotting the result using plot gives the tree plot; see Chapter 18.

Additional functions in supporting material

Several additional functions that are used in our applications are given in the supporting material on www.multivariatestatistics.org.

fuzzy.tri – takes a set of data on a continuous variable, with a specified number of categories, and transforms to fuzzy categories using triangular membership functions; hinges are by default defined as quantiles, but can be supplied by the user; see Exhibit 3.3.

chidist – takes a rectangular matrix of same-scale nonnegative data as input and computes the matrix of chi-square distances between rows or between columns; see Exhibit 4.7.

jaccard – takes a rectangular matrix of presence-absence data (ones and zeros) and computes the matrix of Jaccard dissimilarities between rows or between columns (this can also be achieved in the vegan package using function vegdist with method="jaccard"); see Chapter 5 and Exhibit 7.1.

306

LIST OF EXHIBITS

Exhibit 1.1: Typical set of multivariate biological and environmental data: the

 

species data are counts, whereas the environmental data are con-

 

tinuous measurements, with each variable on a different scale; the

 

last variable is a categorical variable classifying the sediment of the

 

sample as mainly C (=clay/silt), S (=sand) or G (=gravel/stone) ......

16

Exhibit 1.2: Histograms of three environmental variables and bar-chart of the

 

categorical variable ...............................................................................

17

Exhibit 1.3: Histograms of the five species, showing the usual high frequencies of

 

low values that are mostly zeros, especially in species e .....................

18

Exhibit 1.4: Pairwise scatterplots of the three continuous variables in the lower

 

triangle, showing smooth relationships (in brown, a type of moving

 

average) of the vertical variable with respect to the horizontal one;

 

for example, at the intersection of depth an pollution, pollution

 

defines the vertical (“y”) axis and depth the horizontal (“x”) one.

 

The upper triangle gives the correlation coefficients, with size of

 

numbers proportional to their absolute values ..................................

19

Exhibit 1.5: Box-and-whisker plots showing the distribution of each continu-

 

ous environmental variable within each of the three categories of

 

sediment (C = clay/silt, S = sand, G = gravel/stone). In each case the

 

central horizontal line is the median of the distribution, the boxes

 

extend to the first and third quartiles, and the dashed lines extend

 

to the minimum and maximum values ...............................................

20

Exhibit 1.6: Pairwise scatterplots of the five species abundances, showing in each

 

case the smooth relationship of the vertical variable with respect to

 

the horizontal one; the lower triangle gives the correlation coeffi-

 

cients, with size of numbers proportional to their absolute values ...

21

Exhibit 1.7: Pairwise scatterplots of the five groups of species with the three con-

 

tinuous environmental variables, showing the simple least-squares

 

regression lines and coefficients of determination (R2) ....................

22

Exhibit 1.8: Box-and-whisker plots showing the distribution of each count

 

variable across the three sediment types (C = clay/silt, S = sand, G =

 

gravel/stone) and the F-statistics of the respective ANOVAs .............

23

307

MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA

Exhibit 2.1: Schematic diagram of the two main types of situations in multivariate

 

analysis: on the left, a data matrix where a variable y is singled out as

 

being a response variable and can be partially explained in terms of

 

the variables in X. On the right, a data matrix Y with a set of response

 

variables but no observed predictors, where Y is regarded as being

 

explained by an unobserved, latent variable f ....................................

26

Exhibit 2.2: The four corners of multivariate analysis. Vertically, functional and

 

structural methods are distinguished. Horizontally, continuous and

 

discrete variables of interest are contrasted: the response variable(s)

 

in the case of functional methods, and the latent variable(s) in the

 

case of structural methods ...................................................................

27

Exhibit 3.1: A classification of data in terms of their measurement scales. A variable

 

can be categorical (nominal or ordinal) or continuous (ratio or interval).

 

Count data have a special place: they are usually thought of as ratio vari-

 

ables but the discreteness of their values links them to ordinal categorical

 

data. Compositional data are a special case of ratio data that are compo-

 

sitional in a collective sense because of their “unit-sum” constraint .........

34

Exhibit 3.2: The natural logarithmic transformation x' = log(x) and a few Box-

 

Cox power transformations, for powers = ½(square root), ¼(dou-

 

ble square root, or fourth root) and 0.05 ...........................................

37

Exhibit 3.3: Fuzzy coding of a continuous variable x into three categories, using

 

triangular membership functions. The minimum, median and maxi-

 

mum are used as hinge points. An example is given of a value x* just

 

below the median being fuzzy coded as [0.22 0.78 0] .......................

39

Exhibit 4.1: Pythagoras’ theorem in the familiar right-angled triangle, and the

 

monument to this triangle in the port of Pythagorion, Samos island,

 

Greece, with Pythagoras himself forming one of the sides. © Michael

 

Greenacre .............................................................................................

48

Exhibit 4.2: Pythagoras’ theorem applied to distances in two-dimensional space

49

Exhibit 4.3: Pythagoras’ theorem extended into three dimensional space ...........

50

Exhibit 4.4: Standardized values of the three continuous variables of Exhibit 1.1

52

Exhibit 4.5: Standardized Euclidean distances between the 30 samples, based on

 

the three continuous environmental variables, showing part of the

 

triangular distance matrix ....................................................................

53

Exhibit 4.6: Profiles of the sites, obtained by dividing the rows of counts in Ex-

 

hibit 1.1 by their respective row totals. The last row is the average

 

profile, computed in the same way, as proportions of the column

 

totals of the original table of counts ...................................................

56

308

LIST OF EXHIBITS

 

Exhibit 4.7: Chi-square distances between the 30 samples, based on the biological

 

count data, showing part of the triangular distance matrix ...............

57

Exhibit 5.1: Illustration of the triangle inequality for distances in Euclidean space

62

Exhibit 5.2: Bray-Curtis dissimilarities, multiplied by 100, between the 30

 

samples of Exhibit 1.1, based on the count data for species a to e.

 

Violations of the triangle inequality can be easily picked out: for

 

example, from s25 to s4 the Bray-Curtis is 93.9, but the sum of the

 

values “via s6” from s25 to s6 and from s6 to s4 is 18.6+69.2 = 87.8,

 

which is shorter ........................................................................................

64

Exhibit 5.3: Various dissimilarities and distances between pairs of sites (count

 

data from Exhibit 1.1). B-C-raw: Bray Curtis dissimilarities on raw

 

counts (usual definition and usage), chi2 raw: chi-square distances

 

on raw counts, B-C rel: Bray-Curtis dissimilarities on relative counts,

 

chi2 rel: chi-square distances on relative counts (usual definition and

 

usage) ....................................................................................................

65

Exhibit 5.4: Graphical comparison of Bray-Curtis dissimilarities and chi-square

 

distances for (a) raw counts, taking into account size and shape, and

 

(b) relative counts, taking into account shape only ...........................

66

Exhibit 5.5: Two-dimensional illustration of the L1 (city-block) and L2 (Euclidean)

 

distances between two points i and i': the L1 distance is the sum of

 

the absolute differences in the coordinates, while the L2 distance is

 

the square root of the sum of squared differences ............................

68

Exhibit 5.6: Presence–absence data of 10 species in 7 samples .............................

69

Exhibit 5.7: Distances between four stations based on the L1 distance between

 

their standardized and rescaled values, as described above. The

 

distances are shown equal to the part due to the categorical (CAT.)

 

variables plus the part due to the continuous (CONT.) variables .....

72

Exhibit 6.1: (a) Two variables measured in three samples (sites in this case), viewed

 

in three dimensions, using original scales; (b) Standardized values; (c)

 

Same variables plotted in three dimensions using standardized values.

 

Projections of some points onto the “floor” of the s2 s3 plane are

 

shown, to assist in understanding the three-dimensional positions of

 

the points ..............................................................................................

76

Exhibit 6.2: Triangle of pollution and depth vectors with respect to origin (O)

 

taken out of Exhibit 6.1(c) and laid flat .............................................

77

Exhibit 6.3: Same triangle as in Exhibit 6.2, but with variables having unit length

 

(i.e., unit variables. The projection of either variable onto the direc-

 

309