- brief contents
- contents
- preface
- acknowledgments
- about this book
- What’s new in the second edition
- Who should read this book
- Roadmap
- Advice for data miners
- Code examples
- Code conventions
- Author Online
- About the author
- about the cover illustration
- 1 Introduction to R
- 1.2 Obtaining and installing R
- 1.3 Working with R
- 1.3.1 Getting started
- 1.3.2 Getting help
- 1.3.3 The workspace
- 1.3.4 Input and output
- 1.4 Packages
- 1.4.1 What are packages?
- 1.4.2 Installing a package
- 1.4.3 Loading a package
- 1.4.4 Learning about a package
- 1.5 Batch processing
- 1.6 Using output as input: reusing results
- 1.7 Working with large datasets
- 1.8 Working through an example
- 1.9 Summary
- 2 Creating a dataset
- 2.1 Understanding datasets
- 2.2 Data structures
- 2.2.1 Vectors
- 2.2.2 Matrices
- 2.2.3 Arrays
- 2.2.4 Data frames
- 2.2.5 Factors
- 2.2.6 Lists
- 2.3 Data input
- 2.3.1 Entering data from the keyboard
- 2.3.2 Importing data from a delimited text file
- 2.3.3 Importing data from Excel
- 2.3.4 Importing data from XML
- 2.3.5 Importing data from the web
- 2.3.6 Importing data from SPSS
- 2.3.7 Importing data from SAS
- 2.3.8 Importing data from Stata
- 2.3.9 Importing data from NetCDF
- 2.3.10 Importing data from HDF5
- 2.3.11 Accessing database management systems (DBMSs)
- 2.3.12 Importing data via Stat/Transfer
- 2.4 Annotating datasets
- 2.4.1 Variable labels
- 2.4.2 Value labels
- 2.5 Useful functions for working with data objects
- 2.6 Summary
- 3 Getting started with graphs
- 3.1 Working with graphs
- 3.2 A simple example
- 3.3 Graphical parameters
- 3.3.1 Symbols and lines
- 3.3.2 Colors
- 3.3.3 Text characteristics
- 3.3.4 Graph and margin dimensions
- 3.4 Adding text, customized axes, and legends
- 3.4.1 Titles
- 3.4.2 Axes
- 3.4.3 Reference lines
- 3.4.4 Legend
- 3.4.5 Text annotations
- 3.4.6 Math annotations
- 3.5 Combining graphs
- 3.5.1 Creating a figure arrangement with fine control
- 3.6 Summary
- 4 Basic data management
- 4.1 A working example
- 4.2 Creating new variables
- 4.3 Recoding variables
- 4.4 Renaming variables
- 4.5 Missing values
- 4.5.1 Recoding values to missing
- 4.5.2 Excluding missing values from analyses
- 4.6 Date values
- 4.6.1 Converting dates to character variables
- 4.6.2 Going further
- 4.7 Type conversions
- 4.8 Sorting data
- 4.9 Merging datasets
- 4.9.1 Adding columns to a data frame
- 4.9.2 Adding rows to a data frame
- 4.10 Subsetting datasets
- 4.10.1 Selecting (keeping) variables
- 4.10.2 Excluding (dropping) variables
- 4.10.3 Selecting observations
- 4.10.4 The subset() function
- 4.10.5 Random samples
- 4.11 Using SQL statements to manipulate data frames
- 4.12 Summary
- 5 Advanced data management
- 5.2 Numerical and character functions
- 5.2.1 Mathematical functions
- 5.2.2 Statistical functions
- 5.2.3 Probability functions
- 5.2.4 Character functions
- 5.2.5 Other useful functions
- 5.2.6 Applying functions to matrices and data frames
- 5.3 A solution for the data-management challenge
- 5.4 Control flow
- 5.4.1 Repetition and looping
- 5.4.2 Conditional execution
- 5.5 User-written functions
- 5.6 Aggregation and reshaping
- 5.6.1 Transpose
- 5.6.2 Aggregating data
- 5.6.3 The reshape2 package
- 5.7 Summary
- 6 Basic graphs
- 6.1 Bar plots
- 6.1.1 Simple bar plots
- 6.1.2 Stacked and grouped bar plots
- 6.1.3 Mean bar plots
- 6.1.4 Tweaking bar plots
- 6.1.5 Spinograms
- 6.2 Pie charts
- 6.3 Histograms
- 6.4 Kernel density plots
- 6.5 Box plots
- 6.5.1 Using parallel box plots to compare groups
- 6.5.2 Violin plots
- 6.6 Dot plots
- 6.7 Summary
- 7 Basic statistics
- 7.1 Descriptive statistics
- 7.1.1 A menagerie of methods
- 7.1.2 Even more methods
- 7.1.3 Descriptive statistics by group
- 7.1.4 Additional methods by group
- 7.1.5 Visualizing results
- 7.2 Frequency and contingency tables
- 7.2.1 Generating frequency tables
- 7.2.2 Tests of independence
- 7.2.3 Measures of association
- 7.2.4 Visualizing results
- 7.3 Correlations
- 7.3.1 Types of correlations
- 7.3.2 Testing correlations for significance
- 7.3.3 Visualizing correlations
- 7.4 T-tests
- 7.4.3 When there are more than two groups
- 7.5 Nonparametric tests of group differences
- 7.5.1 Comparing two groups
- 7.5.2 Comparing more than two groups
- 7.6 Visualizing group differences
- 7.7 Summary
- 8 Regression
- 8.1 The many faces of regression
- 8.1.1 Scenarios for using OLS regression
- 8.1.2 What you need to know
- 8.2 OLS regression
- 8.2.1 Fitting regression models with lm()
- 8.2.2 Simple linear regression
- 8.2.3 Polynomial regression
- 8.2.4 Multiple linear regression
- 8.2.5 Multiple linear regression with interactions
- 8.3 Regression diagnostics
- 8.3.1 A typical approach
- 8.3.2 An enhanced approach
- 8.3.3 Global validation of linear model assumption
- 8.3.4 Multicollinearity
- 8.4 Unusual observations
- 8.4.1 Outliers
- 8.4.3 Influential observations
- 8.5 Corrective measures
- 8.5.1 Deleting observations
- 8.5.2 Transforming variables
- 8.5.3 Adding or deleting variables
- 8.5.4 Trying a different approach
- 8.6 Selecting the “best” regression model
- 8.6.1 Comparing models
- 8.6.2 Variable selection
- 8.7 Taking the analysis further
- 8.7.1 Cross-validation
- 8.7.2 Relative importance
- 8.8 Summary
- 9 Analysis of variance
- 9.1 A crash course on terminology
- 9.2 Fitting ANOVA models
- 9.2.1 The aov() function
- 9.2.2 The order of formula terms
- 9.3.1 Multiple comparisons
- 9.3.2 Assessing test assumptions
- 9.4 One-way ANCOVA
- 9.4.1 Assessing test assumptions
- 9.4.2 Visualizing the results
- 9.6 Repeated measures ANOVA
- 9.7 Multivariate analysis of variance (MANOVA)
- 9.7.1 Assessing test assumptions
- 9.7.2 Robust MANOVA
- 9.8 ANOVA as regression
- 9.9 Summary
- 10 Power analysis
- 10.1 A quick review of hypothesis testing
- 10.2 Implementing power analysis with the pwr package
- 10.2.1 t-tests
- 10.2.2 ANOVA
- 10.2.3 Correlations
- 10.2.4 Linear models
- 10.2.5 Tests of proportions
- 10.2.7 Choosing an appropriate effect size in novel situations
- 10.3 Creating power analysis plots
- 10.4 Other packages
- 10.5 Summary
- 11 Intermediate graphs
- 11.1 Scatter plots
- 11.1.3 3D scatter plots
- 11.1.4 Spinning 3D scatter plots
- 11.1.5 Bubble plots
- 11.2 Line charts
- 11.3 Corrgrams
- 11.4 Mosaic plots
- 11.5 Summary
- 12 Resampling statistics and bootstrapping
- 12.1 Permutation tests
- 12.2 Permutation tests with the coin package
- 12.2.2 Independence in contingency tables
- 12.2.3 Independence between numeric variables
- 12.2.5 Going further
- 12.3 Permutation tests with the lmPerm package
- 12.3.1 Simple and polynomial regression
- 12.3.2 Multiple regression
- 12.4 Additional comments on permutation tests
- 12.5 Bootstrapping
- 12.6 Bootstrapping with the boot package
- 12.6.1 Bootstrapping a single statistic
- 12.6.2 Bootstrapping several statistics
- 12.7 Summary
- 13 Generalized linear models
- 13.1 Generalized linear models and the glm() function
- 13.1.1 The glm() function
- 13.1.2 Supporting functions
- 13.1.3 Model fit and regression diagnostics
- 13.2 Logistic regression
- 13.2.1 Interpreting the model parameters
- 13.2.2 Assessing the impact of predictors on the probability of an outcome
- 13.2.3 Overdispersion
- 13.2.4 Extensions
- 13.3 Poisson regression
- 13.3.1 Interpreting the model parameters
- 13.3.2 Overdispersion
- 13.3.3 Extensions
- 13.4 Summary
- 14 Principal components and factor analysis
- 14.1 Principal components and factor analysis in R
- 14.2 Principal components
- 14.2.1 Selecting the number of components to extract
- 14.2.2 Extracting principal components
- 14.2.3 Rotating principal components
- 14.2.4 Obtaining principal components scores
- 14.3 Exploratory factor analysis
- 14.3.1 Deciding how many common factors to extract
- 14.3.2 Extracting common factors
- 14.3.3 Rotating factors
- 14.3.4 Factor scores
- 14.4 Other latent variable models
- 14.5 Summary
- 15 Time series
- 15.1 Creating a time-series object in R
- 15.2 Smoothing and seasonal decomposition
- 15.2.1 Smoothing with simple moving averages
- 15.2.2 Seasonal decomposition
- 15.3 Exponential forecasting models
- 15.3.1 Simple exponential smoothing
- 15.3.3 The ets() function and automated forecasting
- 15.4 ARIMA forecasting models
- 15.4.1 Prerequisite concepts
- 15.4.2 ARMA and ARIMA models
- 15.4.3 Automated ARIMA forecasting
- 15.5 Going further
- 15.6 Summary
- 16 Cluster analysis
- 16.1 Common steps in cluster analysis
- 16.2 Calculating distances
- 16.3 Hierarchical cluster analysis
- 16.4 Partitioning cluster analysis
- 16.4.2 Partitioning around medoids
- 16.5 Avoiding nonexistent clusters
- 16.6 Summary
- 17 Classification
- 17.1 Preparing the data
- 17.2 Logistic regression
- 17.3 Decision trees
- 17.3.1 Classical decision trees
- 17.3.2 Conditional inference trees
- 17.4 Random forests
- 17.5 Support vector machines
- 17.5.1 Tuning an SVM
- 17.6 Choosing a best predictive solution
- 17.7 Using the rattle package for data mining
- 17.8 Summary
- 18 Advanced methods for missing data
- 18.1 Steps in dealing with missing data
- 18.2 Identifying missing values
- 18.3 Exploring missing-values patterns
- 18.3.1 Tabulating missing values
- 18.3.2 Exploring missing data visually
- 18.3.3 Using correlations to explore missing values
- 18.4 Understanding the sources and impact of missing data
- 18.5 Rational approaches for dealing with incomplete data
- 18.6 Complete-case analysis (listwise deletion)
- 18.7 Multiple imputation
- 18.8 Other approaches to missing data
- 18.8.1 Pairwise deletion
- 18.8.2 Simple (nonstochastic) imputation
- 18.9 Summary
- 19 Advanced graphics with ggplot2
- 19.1 The four graphics systems in R
- 19.2 An introduction to the ggplot2 package
- 19.3 Specifying the plot type with geoms
- 19.4 Grouping
- 19.5 Faceting
- 19.6 Adding smoothed lines
- 19.7 Modifying the appearance of ggplot2 graphs
- 19.7.1 Axes
- 19.7.2 Legends
- 19.7.3 Scales
- 19.7.4 Themes
- 19.7.5 Multiple graphs per page
- 19.8 Saving graphs
- 19.9 Summary
- 20 Advanced programming
- 20.1 A review of the language
- 20.1.1 Data types
- 20.1.2 Control structures
- 20.1.3 Creating functions
- 20.2 Working with environments
- 20.3 Object-oriented programming
- 20.3.1 Generic functions
- 20.3.2 Limitations of the S3 model
- 20.4 Writing efficient code
- 20.5 Debugging
- 20.5.1 Common sources of errors
- 20.5.2 Debugging tools
- 20.5.3 Session options that support debugging
- 20.6 Going further
- 20.7 Summary
- 21 Creating a package
- 21.1 Nonparametric analysis and the npar package
- 21.1.1 Comparing groups with the npar package
- 21.2 Developing the package
- 21.2.1 Computing the statistics
- 21.2.2 Printing the results
- 21.2.3 Summarizing the results
- 21.2.4 Plotting the results
- 21.2.5 Adding sample data to the package
- 21.3 Creating the package documentation
- 21.4 Building the package
- 21.5 Going further
- 21.6 Summary
- 22 Creating dynamic reports
- 22.1 A template approach to reports
- 22.2 Creating dynamic reports with R and Markdown
- 22.3 Creating dynamic reports with R and LaTeX
- 22.4 Creating dynamic reports with R and Open Document
- 22.5 Creating dynamic reports with R and Microsoft Word
- 22.6 Summary
- afterword Into the rabbit hole
- appendix A Graphical user interfaces
- appendix B Customizing the startup environment
- appendix C Exporting data from R
- Delimited text file
- Excel spreadsheet
- Statistical applications
- appendix D Matrix algebra in R
- appendix E Packages used in this book
- appendix F Working with large datasets
- F.1 Efficient programming
- F.2 Storing data outside of RAM
- F.3 Analytic packages for out-of-memory data
- F.4 Comprehensive solutions for working with enormous datasets
- appendix G Updating an R installation
- G.1 Automated installation (Windows only)
- G.2 Manual installation (Windows and Mac OS X)
- G.3 Updating an R installation (Linux)
- references
- index
- 23.1 The lattice package
- 23.2 Conditioning variables
- 23.3 Panel functions
- 23.4 Grouping variables
- 23.5 Graphic parameters
- 23.6 Customizing plot strips
- 23.7 Page arrangement
- 23.8 Going further
16 Cluster analysis
This chapter covers
■ Identifying cohesive subgroups (clusters) of observations
■ Determining the number of clusters present
■ Obtaining a nested hierarchy of clusters
■ Obtaining discrete clusters
Cluster analysis is a data-reduction technique designed to uncover subgroups of observations within a dataset. It allows you to reduce a large number of observations to a much smaller number of clusters or types. A cluster is defined as a group of observations that are more similar to each other than they are to the observations in other groups. This isn’t a precise definition, and that fact has led to an enormous variety of clustering methods.
Cluster analysis is widely used in the biological and behavioral sciences, marketing, and medical research. For example, a psychological researcher might cluster data on the symptoms and demographics of depressed patients, seeking to uncover subtypes of depression. The hope would be that finding such subtypes might lead to more targeted and effective treatments and a better understanding of the disorder. Marketing researchers use cluster analysis as a customer-segmentation strategy.
Customers are arranged into clusters based on the similarity of their demographics and buying behaviors. Marketing campaigns are then tailored to appeal to one or more of these subgroups. Medical researchers use cluster analysis to help catalog gene-expression patterns obtained from DNA microarray data. This can help them to understand normal growth and development and the underlying causes of many human diseases.
The two most popular clustering approaches are hierarchical agglomerative clustering and partitioning clustering. In agglomerative hierarchical clustering, each observation starts as its own cluster. Clusters are then combined, two at a time, until all clusters are merged into a single cluster. In the partitioning approach, you specify K: the number of clusters sought. Observations are then randomly divided into K groups and reshuffled to form cohesive clusters.
Within each of these broad approaches, there are many clustering algorithms to choose from. For hierarchical clustering, the most popular are single linkage, complete linkage, average linkage, centroid, and Ward’s method. For partitioning, the two most popular are k-means and partitioning around medoids (PAM). Each clustering method has advantages and disadvantages, which we’ll discuss.
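To make the distinction concrete, here is a minimal sketch of how these algorithms are typically invoked in R. It is not the chapter’s worked example: the data frame, the number of clusters k, and the particular linkage method are stand-ins chosen only for illustration.

```r
library(cluster)                              # provides pam()

df <- scale(mtcars[, c("mpg", "hp", "wt")])   # any scaled numeric data will do
k  <- 3                                       # placeholder number of clusters

# Hierarchical agglomerative clustering: compute distances, then merge
d <- dist(df)                                 # Euclidean distances by default
fit.avg <- hclust(d, method = "average")      # or "single", "complete",
plot(fit.avg, hang = -1)                      #    "centroid", "ward.D2"
clusters.h <- cutree(fit.avg, k = k)          # cut the tree into k groups

# Partitioning: k-means and partitioning around medoids (PAM)
set.seed(1234)
fit.km  <- kmeans(df, centers = k, nstart = 25)
fit.pam <- pam(df, k = k)
table(clusters.h, fit.km$cluster)             # compare the two solutions
```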
The examples in this chapter focus on food and wine (I suspect my friends aren’t surprised). Hierarchical clustering is applied to the nutrient dataset contained in the flexclust package to answer the following questions:
■ What are the similarities and differences among 27 types of fish, fowl, and meat, based on 5 nutrient measures?
■ Is there a smaller number of groups into which these foods can be meaningfully clustered?
Partitioning methods will be used to evaluate 13 chemical analyses of 178 Italian wine samples. The data are contained in the wine dataset available with the rattle package. Here, the questions are as follows:
■ Are there subtypes of wine in the data?
■ If so, how many subtypes are there, and what are their characteristics?
In fact, the wine samples represent three varietals (recorded as Type). This will allow you to evaluate how well the cluster analysis recovers the underlying structure.
Although there are many approaches to cluster analysis, they usually follow a similar set of steps. These common steps are described in section 16.1. Hierarchical agglomerative clustering is described in section 16.3, and partitioning methods are covered in section 16.4. Some final advice and cautionary statements are provided in section 16.5. In order to run the examples in this chapter, be sure to install the cluster, NbClust, flexclust, fMultivar, ggplot2, and rattle packages. The rattle package will also be used in chapter 17.
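If any of these packages are missing from your library, they can be installed from CRAN in a single call:

```r
# Install the packages used in this chapter (skip any already present)
install.packages(c("cluster", "NbClust", "flexclust", "fMultivar",
                   "ggplot2", "rattle"))
```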
16.1 Common steps in cluster analysis
Like factor analysis (chapter 14), an effective cluster analysis is a multistep process with numerous decision points. Each decision can affect the quality and usefulness of
the results. This section describes the 11 typical steps in a comprehensive cluster analysis:
1. Choose appropriate attributes. The first (and perhaps most important) step is to select variables that you feel may be important for identifying and understanding differences among groups of observations within the data. For example, in a study of depression, you might want to assess one or more of the following: psychological symptoms; physical symptoms; age at onset; number, duration, and timing of episodes; number of hospitalizations; functional status with regard to self-care; social and work history; current age; gender; ethnicity; socioeconomic status; marital status; family medical history; and response to previous treatments. A sophisticated cluster analysis can’t compensate for a poor choice of variables.
2. Scale the data. If the variables in the analysis vary in range, the variables with the largest range will have the greatest impact on the results. This is often undesirable, and analysts scale the data before continuing. The most popular approach is to standardize each variable to a mean of 0 and a standard deviation of 1. Other alternatives include dividing each variable by its maximum value or subtracting the variable’s mean and dividing by the variable’s median absolute deviation. The three approaches are illustrated with the following code snippets:
df1 <- apply(mydata, 2, function(x){(x - mean(x))/sd(x)})
df2 <- apply(mydata, 2, function(x){x/max(x)})
df3 <- apply(mydata, 2, function(x){(x - mean(x))/mad(x)})
In this chapter, you’ll use the scale() function to standardize the variables to a mean of 0 and a standard deviation of 1. This is equivalent to the first code snippet (df1).
3. Screen for outliers. Many clustering techniques are sensitive to outliers, which can distort the cluster solutions obtained. You can screen for (and remove) univariate outliers using functions from the outliers package. The mvoutlier package contains functions that can be used to identify multivariate outliers. An alternative is to use a clustering method that is robust to the presence of outliers. Partitioning around medoids (section 16.4.2) is an example of the latter approach.
4. Calculate distances. Although clustering algorithms vary widely, they typically require a measure of the distance among the entities to be clustered. The most popular measure of the distance between two observations is the Euclidean distance, but the Manhattan, Canberra, asymmetric binary, maximum, and Minkowski distance measures are also available (see ?dist for details). In this chapter, the Euclidean distance is used throughout. Calculating Euclidean distances is covered in section 16.2.
5. Select a clustering algorithm. Next, you select a method of clustering the data. Hierarchical clustering is useful for smaller problems (say, 150 observations or fewer) and where a nested hierarchy of groupings is desired. The partitioning method can handle much larger problems but requires that the number of clusters be specified in advance.
Once you’ve chosen the hierarchical or partitioning approach, you must select a specific clustering algorithm. Again, each has advantages and disadvantages. The most popular methods are described in sections 16.3 and 16.4. You may wish to try more than one algorithm to see how robust the results are to the choice of methods.
6. Obtain one or more cluster solutions. This step uses the method(s) selected in step 5.
7. Determine the number of clusters present. In order to obtain a final cluster solution, you must decide how many clusters are present in the data. This is a thorny problem, and many approaches have been proposed. It usually involves extracting various numbers of clusters (say, 2 to K) and comparing the quality of the solutions. The NbClust() function in the NbClust package provides 30 different indices to help you make this decision (elegantly demonstrating how unresolved this issue is). NbClust is used throughout this chapter; a brief sketch tying this step to scaling, extraction, and interpretation follows this list.
8. Obtain a final clustering solution. Once the number of clusters has been determined, a final clustering is performed to extract that number of subgroups.
9. Visualize the results. Visualization can help you determine the meaning and usefulness of the cluster solution. The results of a hierarchical clustering are usually presented as a dendrogram. Partitioning results are typically visualized using a bivariate cluster plot.
10. Interpret the clusters. Once a cluster solution has been obtained, you must interpret (and possibly name) the clusters. What do the observations in a cluster have in common? How do they differ from the observations in other clusters? This step is typically accomplished by obtaining summary statistics for each variable by cluster. For continuous data, the mean or median for each variable within each cluster is calculated. For mixed data (data that contain categorical variables), the summary statistics will also include modes or category distributions.
11. Validate the results. Validating the cluster solution involves asking the question, “Are these groupings in some sense real, and not a manifestation of unique aspects of this dataset or statistical technique?” If a different cluster method or different sample is employed, would the same clusters be obtained? The fpc, clv, and clValid packages each contain functions for evaluating the stability of a clustering solution.
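The following sketch ties several of these steps together: scaling (step 2), choosing the number of clusters (step 7), extracting a final solution (step 8), and profiling the clusters (step 10). It is only an illustration under assumptions: mydata stands for a hypothetical numeric data frame, k-means is just one of the algorithms you might pick in step 5, and the choice of 3 clusters is a placeholder.

```r
library(NbClust)

df <- scale(mydata)                       # step 2: standardize each variable
set.seed(1234)                            # k-means uses random starting points
nc <- NbClust(df, min.nc = 2, max.nc = 15, method = "kmeans")   # step 7
# The printed summary reports how many indices favor each candidate number
# of clusters; suppose the majority points to 3.

fit <- kmeans(df, centers = 3, nstart = 25)            # step 8: final solution
aggregate(mydata, by = list(cluster = fit$cluster),    # step 10: cluster
          FUN = mean)                                  #          profiles
```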
Because the calculation of distances between observations is such an integral part of cluster analysis, it’s described next in some detail.
16.2 Calculating distances
Every cluster analysis begins with the calculation of a distance, dissimilarity, or proximity between each entity to be clustered. The Euclidean distance between two observations is given by
d_{ij} = \sqrt{\sum_{p=1}^{P} (x_{ip} - x_{jp})^2}
where i and j are observations, P is the number of variables, and x_{ip} is the value of the pth variable for observation i.
Consider the nutrient dataset provided with the flexclust package. The dataset contains measurements on the nutrients of 27 types of meat, fish, and fowl. The first few observations are given by
> data(nutrient, package="flexclust")
> head(nutrient, 4)
             energy protein fat calcium iron
BEEF BRAISED    340      20  28       9  2.6
HAMBURGER       245      21  17       9  2.7
BEEF ROAST      420      15  39       7  2.0
BEEF STEAK      375      19  32       9  2.6
and the Euclidean distance between the first two (beef braised and hamburger) is
d = \sqrt{(340 - 245)^2 + (20 - 21)^2 + (28 - 17)^2 + (9 - 9)^2 + (2.6 - 2.7)^2} = 95.64
The dist() function in the base R installation can be used to calculate the distances between all rows (observations) of a matrix or data frame. The format is dist(x, method=), where x is the input data and method="euclidean" by default. The function returns a lower triangle matrix by default, but the as.matrix() function can be used to access the distances using standard bracket notation. For the nutrient data frame,
> d <- dist(nutrient)
> as.matrix(d)[1:4,1:4]
             BEEF BRAISED HAMBURGER BEEF ROAST BEEF STEAK
BEEF BRAISED          0.0      95.6       80.9       35.2
HAMBURGER            95.6       0.0      176.5      130.9
BEEF ROAST           80.9     176.5        0.0       45.8
BEEF STEAK           35.2     130.9       45.8        0.0
Larger distances indicate larger dissimilarities between observations. The distance between an observation and itself is 0. As expected, the dist() function provides the same distance between beef braised and hamburger as the hand calculations.
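As a quick check (not part of the original listing), the hand calculation can be reproduced directly from the data frame:

```r
# Recompute the beef braised vs. hamburger distance from the raw values
sqrt(sum((nutrient[1, ] - nutrient[2, ])^2))    # 95.64, matching dist(nutrient)
```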
Cluster analysis with mixed data types
Euclidean distances are usually the distance measure of choice for continuous data. But if other variable types are present, alternative dissimilarity measures are required. You can use the daisy() function in the cluster package to obtain a dissimilarity matrix among observations that have any combination of binary, nominal, ordinal, and continuous attributes. Other functions in the cluster package can use these dissimilarities to carry out a cluster analysis. For example, agnes() offers agglomerative hierarchical clustering, and pam() provides partitioning around medoids.
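Here is a hedged sketch of that workflow; mixed.df is a hypothetical data frame containing a mix of numeric, factor, and ordered columns, and the choice of three clusters is a placeholder.

```r
library(cluster)

d <- daisy(mixed.df, metric = "gower")      # Gower dissimilarity handles mixed types
fit.agnes <- agnes(d, method = "average")   # agglomerative hierarchical clustering
fit.pam   <- pam(d, k = 3)                  # partitioning around medoids
```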
Note that distances in the nutrient data frame are heavily dominated by the contribution of the energy variable, which has a much larger range. Scaling the data will help to equalize the impact of each variable.
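A brief sketch of that point (an illustration, not the chapter’s listing): compare the variables’ spreads, then standardize with scale() before computing distances.

```r
sapply(nutrient, range)             # energy spans hundreds of units; iron only a few

nutrient.scaled <- scale(nutrient)  # mean 0, standard deviation 1 for each column
d.scaled <- dist(nutrient.scaled)   # distances now weight the variables more equally
```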