MULTIVARIATE ANALYSIS OF ECOLOGICAL DATA
PERES-NETO P.R., D.A. JACKSON, and K.M. SOMERS. “Giving meaningful interpretation to ordination axes: assessing loading significance in principal component analysis”. Ecology 84 (2003): 2347-2363.
(The authors compare a variety of approaches for assessing the significance of eigenvector coefficients in terms of type I error rates and power).
PIELOU, E.C. The Interpretation of Ecological Data: a Primer on Classification and Ordination. New York: John Wiley & Sons, Inc., 1984.
(An early review of ecological data analysis linking ecological and statistical concepts to ease the interpretation of results).
TER BRAAK, C.J.F. “Canonical correspondence analysis: a new eigenvector technique for multivariate direct gradient analysis”. Ecology 67 (1986): 1167-1179.
(Another citation classic by an author who has made important contributions to the field, ensuring a wide availability of the new methods via software development (Canoco, in collaboration with Šmilauer – see Lepš and Šmilauer 2003)).
TER BRAAK, C.J.F., and I.C. PRENTICE. “A theory of gradient analysis”. Advances in Ecological Research 18 (1988): 271-313.
(A classic introduction to gradient analysis theory and its ecological applications).
WIEDMANN, M., M. ASCHAN, G. CERTAIN, A. DOLGOV, M. GREENACRE, E. JOHANNESEN, B. PLANQUE, and R. PRIMICERIO. “Functional diversity of the Barents Sea fish community”. Marine Ecology Progress Series, in press, doi: 10.3354/meps10558.
(A more extensive analysis of functional traits of Barents Sea fish species, compared to our case study of Chapter 20, using more fish species and a more recent time series).
YEE T.W. “Constrained additive ordination”. Ecology 87 (2006): 203-213.
(The paper introduces constrained additive ordination (CAO) models, described as “loosely speaking, […] generalized additive models fitted to a very small number of latent variables”. The paper provides the R code to implement the CAO methodology with some clear example applications).
ZUUR A.F., E.N. IENO, and C.S. ELPHICK. “A protocol for data exploration to avoid common statistical problems”. Methods in Ecology and Evolution 1 (2010): 3-14.
(The authors provide a protocol for data exploration, discussing “current tools to detect outliers, heterogeneity of variance, collinearity, dependence of observations, problems with interactions, double zeros in multivariate analysis, zero inflation in generalized linear modelling, and the correct type of relationships between dependent and independent variables; and […] provide advice on how to address these problems when they arise”. The paper also provides R code to implement the protocol).
ZUUR A.F., E.N. IENO, N.J. WALKER, A.A. SAVELIEV, and G.M. SMITH. Mixed Effects Models and Extensions in Ecology with R. New York: Springer, 2009.
(A very popular introduction to mixed effects models rich with relevant ecological examples based on the kind of “messy” data ecologists need to cope with).
Web resources
http://www.multivariatestatistics.org
GREENACRE M.J. Multivariate books, data sets and R scripts.
(This web site supports three books published by the BBVA Foundation: La Práctica del Análisis de Correspondencias (Spanish translation of Correspondence Analysis in Practice, Second Edition), Biplots in Practice and the present book. Full text, data sets and R scripts available for free download).
http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf
OKSANEN J. Multivariate analyses of ecological communities in R: vegan tutorial. 43 p.
(This compact tutorial to the R package vegan is a very useful, brief introduction to the application of multivariate statistics in community ecology).
http://ordination.okstate.edu
PALMER M.W. Ordination methods for ecologists.
(The web site is a very useful overview of ordination methods, providing links, references, and guides for self study, including a much needed glossary).
APPENDIX C
Computational Note
This appendix is a short summary of the software used for the analyses in this book, using packages from the R environment for statistical computing and graphics. Data sets and R code for reproducing the results are given online at the supporting website:
www.multivariatestatistics.org
As an introduction to the online code, we give here a list of some of the common R functions and packages used in the computations of this book.
Contents
Functions from the base package in R
Package ca
Package vegan
Packages maptools, mapdata and mapproj
Package mgcv
Packages rpart and tree
Additional functions in supporting material
Functions from the base package in R
The open-access R software has become the standard for statistical computing, especially for conducting research, thanks to its flexible programming environment. It is downloadable for free from the R project website
www.r-project.org
The simple installation process sets up R with what is called the base package, consisting of various functions that are commonly used (later we list more specialized packages that need to be downloaded and installed separately). Here we list some of the useful functions in the base package:
hist – takes a set of data on a continuous or count variable and makes a histogram; user can choose the interval boundaries; see Exhibits 1.2 and 1.3, for example.
qqnorm – takes a set of data on a continuous variable and plots the sample quantiles against the quantiles of a normal distribution; if the points lie close to a straight line (the 45-degree diagonal when the data are standardized), the data can be regarded as approximately normal, otherwise not; see Chapter 17.
shapiro.test – takes a set of data on a continuous variable and performs the Shapiro-Wilk test for normality; if the p-value is small then normality is rejected; see Chapter 17.
pairs – takes a rectangular data matrix as input and draws all pairwise scatterplots of its columns; see Exhibits 1.4 and 20.8.
boxplot – takes a set of data on a continuous variable and makes a box-and-whisker plot, optionally with a categorical variable that makes boxplots for each category alongside one another, with a common scale; see Exhibits 1.5 and 1.8.
scale – takes a set of data on a continuous variable and standardizes it by subtracting its mean (i.e., centering) and dividing by its standard deviation (i.e., normalization); centering or normalization can be switched off; see Chapter 3.
dist – takes a rectangular data matrix as input and computes a distance matrix between the rows, with several choices of distance functions; for example, see Exhibit 4.5.
cor – takes either two sets of data or a matrix of data with variables in columns and computes the single correlation in the former case, or the correlation matrix in the latter case; optionally computes Spearman rank correlations; see Exhibit 6.4.
table – takes a single categorical variable and counts the frequencies in each category; if two categorical variables are given the function counts the frequencies in the cross-tabulation; see Exhibit 6.6.
sample – takes a set of data and performs random sampling, without replacement (this re-arranges, or shuffles, the data set randomly) for permutation testing or with replacement for bootstrapping; see Chapter 18.
hclust – takes a matrix of distances or dissimilarities (e.g., created by function dist) and performs hierarchical clustering; various clustering algorithms can be selected, including Ward clustering; see Chapters 7 and 8.
kmeans – takes a rectangular matrix of data and the specified number of groups and performs k-means nonhierarchical clustering; see Chapter 8.
cmdscale – takes a matrix of distances or dissimilarities (e.g., created by function dist) and performs classical multidimensional scaling; see Chapter 9.
lm – takes data on a response variable and one or more explanatory variables (or predictors) and performs least-squares linear regression; weights can be specified for weighted least-squares regression; see Chapters 10–20.
glm – takes data on a response variable and one or more explanatory variables (or predictors) and performs generalized linear modelling (GLM); several link functions and error distributions can be specified, giving linear regression, Poisson regression and logistic regression, for example; see Chapters 10 and 18.
prcomp and princomp – alternative functions for computing a principal component analysis on a rectangular data matrix, where rows are assumed to be sampling units and columns to be variables; see Chapter 12.
kruskal.test – takes a data set for a continuous variable and a grouping variable and performs the Kruskal-Wallis rank test of difference between groups (the nonparametric equivalent of a one-way ANOVA); see Chapter 17.
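As an illustration of how several of the functions listed above fit together, the following is a minimal sketch on simulated data. The variable names and values are hypothetical and are not taken from the book's data sets; cutree, which extracts groups from a clustering, is a further base function not listed above.

```r
# Simulated environmental data for 30 samples (hypothetical values)
set.seed(1)
env <- data.frame(depth       = runif(30, 50, 100),
                  pollution   = rlnorm(30, 1, 0.5),
                  temperature = rnorm(30, 3, 0.5))

Z   <- scale(env)                        # centre and divide by standard deviation
D   <- dist(Z)                           # Euclidean distances between samples
hc  <- hclust(D, method = "ward.D2")     # Ward hierarchical clustering
grp <- cutree(hc, k = 3)                 # cut the tree into three groups
mds <- cmdscale(D, k = 2)                # classical multidimensional scaling
pca <- prcomp(env, scale. = TRUE)        # PCA on standardized variables

shapiro.test(env$pollution)              # normality test for one variable
kruskal.test(env$depth ~ factor(grp))    # rank test of depth across groups

# Permutation test of a correlation, shuffling one variable with sample()
obs  <- cor(env$depth, env$pollution)
perm <- replicate(999, cor(env$depth, sample(env$pollution)))
pval <- (sum(abs(perm) >= abs(obs)) + 1) / (999 + 1)
pval
```

Because the distances here are Euclidean on standardized data, the first axis of cmdscale coincides with the first principal component from prcomp, up to a possible change of sign.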
Package ca
The ca package performs correspondence analysis (function ca) and multiple correspondence analysis (function mjca – this generalization of CA to multivariate categorical data, more used in the social sciences, is not discussed in this book). Various graphical options are available using function plot.ca, including plotting with contribution coordinates and three-dimensional visualization of a CA solution with three principal axes, using function plot3d.ca, including interaction with 3d display such as rotation and zooming. The 3d graphics uses the R package rgl; see Chapter 13.
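To make concrete the kind of computation that correspondence analysis involves, here is a base-R sketch using the standard definition of CA as a singular value decomposition of standardized residuals. It is not the code of the ca package, and the small table of counts is hypothetical; if the ca package is installed, ca(N) should report the same principal inertias.

```r
# Hypothetical abundance table: 4 sites (rows) by 3 species (columns)
N <- matrix(c(10, 0, 5,
               2, 8, 0,
               0, 3, 7,
               4, 4, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("s", 1:4), c("a", "b", "c")))

P     <- N / sum(N)      # correspondence matrix
rmass <- rowSums(P)      # row masses
cmass <- colSums(P)      # column masses

# Standardized residuals and their singular value decomposition
S   <- diag(1 / sqrt(rmass)) %*% (P - rmass %o% cmass) %*% diag(1 / sqrt(cmass))
dec <- svd(S)

inertia <- dec$d^2       # principal inertias (the last one is zero here)

# Principal coordinates of rows and columns
rowpc <- diag(1 / sqrt(rmass)) %*% dec$u %*% diag(dec$d)
colpc <- diag(1 / sqrt(cmass)) %*% dec$v %*% diag(dec$d)
rownames(rowpc) <- rownames(N)
rownames(colpc) <- colnames(N)

round(100 * inertia / sum(inertia), 1)   # percentages of inertia per axis
```

The total inertia, sum(inertia), equals the chi-square statistic of the table divided by its grand total, which gives a quick check on the computation.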
|
Package vegan
The vegan package performs a variety of multivariate analyses, includes most of the methods treated in this book, and is aimed at biologists (specifically botanists, but the terminology can be equated to any biological application). Methods that are not included in R’s base package described above are the computation of Bray-Curtis and Gower dissimilarities (function vegdist with options method="bray" or method="gower" respectively), various diversity measures (function diversity), nonmetric multidimensional scaling (function metaMDS), canonical correspondence analysis (function cca), redundancy analysis (function rda – like cca but for continuous response variables) and various permutation tests (e.g., function permutest); see Chapters 15 and 20.
Packages maptools, mapdata and mapproj
Packages maptools and mapdata provide functions that allow drawing of geographical maps, with mapdata containing the outlines of all the world’s landmasses and several countries. The mapproj package performs a variety of map projections, using function mapproject, based on the latitude and longitude coordinates of a set of spatial locations. This is useful for obtaining coordinates on which Euclidean distances can be computed that approximate great circle distances; see Chapters 11 and 19.
Package mgcv
This package performs generalized additive modelling (GAM), using a function gam that works very similarly to glm for generalized linear modelling. Explanatory variables can be defined as smooth functions using the function s, for example s(x) for predictor x; see Chapters 18, 19 and 20.
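The parallel between glm and gam can be seen in a short sketch on simulated count data. The variables depth and count and the form of the simulated relationship are assumptions made purely for illustration.

```r
library(mgcv)

# Simulated counts with a nonlinear dependence on depth (hypothetical data)
set.seed(2)
d <- data.frame(depth = runif(100, 50, 100))
d$count <- rpois(100, lambda = exp(1 + sin(d$depth / 10)))

# Poisson GLM with a linear term, and the corresponding GAM with a smooth term
fit_glm <- glm(count ~ depth,    family = poisson, data = d)
fit_gam <- gam(count ~ s(depth), family = poisson, data = d)

summary(fit_gam)         # effective degrees of freedom of the smooth, etc.
plot(fit_gam)            # estimated smooth function of depth
AIC(fit_glm, fit_gam)    # compare the linear and smooth fits
```

Only the model formula changes between the two calls: depth enters linearly in glm and as s(depth) in gam.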
Packages rpart and tree
These packages are alternatives for classification and regression trees, also called recursive partitioning (hence rpart). They define tree models in the same style as functions lm, glm and gam, as a response variable ~ sum of explanatory variables. Plotting the result using plot gives the tree plot; see Chapter 18.
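A minimal sketch of the formula interface for trees, using rpart and the iris data set that is distributed with R (not one of the book's data sets):

```r
library(rpart)

# Classification tree: response ~ all other variables as predictors
fit <- rpart(Species ~ ., data = iris, method = "class")

plot(fit, margin = 0.1)    # draw the tree
text(fit, use.n = TRUE)    # label splits and show counts in the leaves

# Cross-tabulate observed classes against those predicted by the tree
pred <- predict(fit, iris, type = "class")
table(observed = iris$Species, predicted = pred)
```

For a continuous response the same call with method = "anova" would fit a regression tree instead.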
Additional functions in supporting material
Several additional functions that are used in our applications are given in the supporting material on www.multivariatestatistics.org.
fuzzy.tri – takes a set of data on a continuous variable, with a specified number of categories, and transforms to fuzzy categories using triangular membership functions; hinges are by default defined as quantiles, but can be supplied by the user; see Exhibit 3.3.
chidist – takes a rectangular matrix of same-scale nonnegative data as input and computes the matrix of chi-square distances between rows or between columns; see Exhibit 4.7.
jaccard – takes a rectangular matrix of presence-absence data (ones and zeros) and computes the matrix of Jaccard dissimilarities between rows or between columns (this can also be achieved in the vegan package using function vegdist with method="jaccard"); see Chapter 5 and Exhibit 7.1.
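For readers who want to see what such distance functions involve, the following base-R sketch computes chi-square distances between rows and Jaccard dissimilarities on presence-absence data. The function chidist_sketch is a hypothetical stand-in written for this illustration and is not the authors' chidist; the table of counts is also hypothetical.

```r
# Hypothetical stand-in for chidist (rows only); assumes nonnegative data
# with no all-zero rows or columns
chidist_sketch <- function(x) {
  x        <- as.matrix(x)
  profiles <- x / rowSums(x)           # row profiles
  cmass    <- colSums(x) / sum(x)      # average profile (column masses)
  # Euclidean distance after dividing each column by sqrt of its mass
  dist(sweep(profiles, 2, sqrt(cmass), "/"))
}

# Hypothetical counts: 4 sites by 3 species
counts <- matrix(c(10, 0, 5,
                    2, 8, 0,
                    0, 3, 7,
                    4, 4, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("s", 1:4), c("a", "b", "c")))

chi <- chidist_sketch(counts)          # chi-square distances between sites

# Jaccard dissimilarity on presence-absence data: base R's "binary" distance
pa  <- (counts > 0) * 1
jac <- dist(pa, method = "binary")
jac
```

Distances between columns can be obtained by applying the same functions to the transposed matrix, t(counts).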
LIST OF EXHIBITS
Exhibit 1.1: Typical set of multivariate biological and environmental data: the species data are counts, whereas the environmental data are continuous measurements, with each variable on a different scale; the last variable is a categorical variable classifying the sediment of the sample as mainly C (=clay/silt), S (=sand) or G (=gravel/stone)
Exhibit 1.2: Histograms of three environmental variables and bar-chart of the categorical variable
Exhibit 1.3: Histograms of the five species, showing the usual high frequencies of low values that are mostly zeros, especially in species e
Exhibit 1.4: Pairwise scatterplots of the three continuous variables in the lower triangle, showing smooth relationships (in brown, a type of moving average) of the vertical variable with respect to the horizontal one; for example, at the intersection of depth and pollution, pollution defines the vertical (“y”) axis and depth the horizontal (“x”) one. The upper triangle gives the correlation coefficients, with size of numbers proportional to their absolute values
Exhibit 1.5: Box-and-whisker plots showing the distribution of each continuous environmental variable within each of the three categories of sediment (C = clay/silt, S = sand, G = gravel/stone). In each case the central horizontal line is the median of the distribution, the boxes extend to the first and third quartiles, and the dashed lines extend to the minimum and maximum values
Exhibit 1.6: Pairwise scatterplots of the five species abundances, showing in each case the smooth relationship of the vertical variable with respect to the horizontal one; the lower triangle gives the correlation coefficients, with size of numbers proportional to their absolute values
Exhibit 1.7: Pairwise scatterplots of the five groups of species with the three continuous environmental variables, showing the simple least-squares regression lines and coefficients of determination (R²)
Exhibit 1.8: Box-and-whisker plots showing the distribution of each count variable across the three sediment types (C = clay/silt, S = sand, G = gravel/stone) and the F-statistics of the respective ANOVAs
Exhibit 2.1: Schematic diagram of the two main types of situations in multivariate analysis: on the left, a data matrix where a variable y is singled out as being a response variable and can be partially explained in terms of the variables in X. On the right, a data matrix Y with a set of response variables but no observed predictors, where Y is regarded as being explained by an unobserved, latent variable f
Exhibit 2.2: The four corners of multivariate analysis. Vertically, functional and structural methods are distinguished. Horizontally, continuous and discrete variables of interest are contrasted: the response variable(s) in the case of functional methods, and the latent variable(s) in the case of structural methods
Exhibit 3.1: A classification of data in terms of their measurement scales. A variable can be categorical (nominal or ordinal) or continuous (ratio or interval). Count data have a special place: they are usually thought of as ratio variables but the discreteness of their values links them to ordinal categorical data. Compositional data are a special case of ratio data that are compositional in a collective sense because of their “unit-sum” constraint
Exhibit 3.2: The natural logarithmic transformation x' = log(x) and a few Box-Cox power transformations, for powers = ½ (square root), ¼ (double square root, or fourth root) and 0.05
Exhibit 3.3: Fuzzy coding of a continuous variable x into three categories, using triangular membership functions. The minimum, median and maximum are used as hinge points. An example is given of a value x* just below the median being fuzzy coded as [0.22 0.78 0]
Exhibit 4.1: Pythagoras’ theorem in the familiar right-angled triangle, and the monument to this triangle in the port of Pythagorion, Samos island, Greece, with Pythagoras himself forming one of the sides. © Michael Greenacre
Exhibit 4.2: Pythagoras’ theorem applied to distances in two-dimensional space
Exhibit 4.3: Pythagoras’ theorem extended into three dimensional space
Exhibit 4.4: Standardized values of the three continuous variables of Exhibit 1.1
Exhibit 4.5: Standardized Euclidean distances between the 30 samples, based on the three continuous environmental variables, showing part of the triangular distance matrix
Exhibit 4.6: Profiles of the sites, obtained by dividing the rows of counts in Exhibit 1.1 by their respective row totals. The last row is the average profile, computed in the same way, as proportions of the column totals of the original table of counts
Exhibit 4.7: Chi-square distances between the 30 samples, based on the biological count data, showing part of the triangular distance matrix
Exhibit 5.1: Illustration of the triangle inequality for distances in Euclidean space
Exhibit 5.2: Bray-Curtis dissimilarities, multiplied by 100, between the 30 samples of Exhibit 1.1, based on the count data for species a to e. Violations of the triangle inequality can be easily picked out: for example, from s25 to s4 the Bray-Curtis is 93.9, but the sum of the values “via s6” from s25 to s6 and from s6 to s4 is 18.6+69.2 = 87.8, which is shorter
Exhibit 5.3: Various dissimilarities and distances between pairs of sites (count data from Exhibit 1.1). B-C raw: Bray-Curtis dissimilarities on raw counts (usual definition and usage), chi2 raw: chi-square distances on raw counts, B-C rel: Bray-Curtis dissimilarities on relative counts, chi2 rel: chi-square distances on relative counts (usual definition and usage)
Exhibit 5.4: Graphical comparison of Bray-Curtis dissimilarities and chi-square distances for (a) raw counts, taking into account size and shape, and (b) relative counts, taking into account shape only
Exhibit 5.5: Two-dimensional illustration of the L1 (city-block) and L2 (Euclidean) distances between two points i and i': the L1 distance is the sum of the absolute differences in the coordinates, while the L2 distance is the square root of the sum of squared differences
Exhibit 5.6: Presence–absence data of 10 species in 7 samples
Exhibit 5.7: Distances between four stations based on the L1 distance between their standardized and rescaled values, as described above. The distances are shown equal to the part due to the categorical (CAT.) variables plus the part due to the continuous (CONT.) variables
Exhibit 6.1: (a) Two variables measured in three samples (sites in this case), viewed in three dimensions, using original scales; (b) Standardized values; (c) Same variables plotted in three dimensions using standardized values. Projections of some points onto the “floor” of the s2 – s3 plane are shown, to assist in understanding the three-dimensional positions of the points
Exhibit 6.2: Triangle of pollution and depth vectors with respect to origin (O) taken out of Exhibit 6.1(c) and laid flat
Exhibit 6.3: Same triangle as in Exhibit 6.2, but with variables having unit length (i.e., unit variables). The projection of either variable onto the direc-
