CHAPTER 20 Advanced programming
creating new applications. In chapter 21, you’ll have an opportunity to put these skills into practice by creating a useful package from start to finish.
20.1 A review of the language
R is an object-oriented, functional, array programming language in which objects are specialized data structures, stored in RAM, and accessed via names or symbols. Names of objects consist of uppercase and lowercase letters, the digits 0–9, the period, and the underscore. Names are case-sensitive and can’t start with a digit, and a period is treated as a simple character without special meaning.
Unlike in languages such as C and C++, you can’t access memory locations directly. Data, functions, and just about everything else that can be stored and named are objects. Additionally, the names and symbols themselves are objects that can be manipulated. All objects are stored in RAM during program execution, which has significant implications for the analysis of massive datasets.
Every object has attributes: meta-information describing the characteristics of the object. Attributes can be listed with the attributes() function and set with the attr() function. A key attribute is an object’s class. R functions use information about an object’s class in order to determine how the object should be handled. The class of an object can be read and set with the class() function. Examples will be given throughout this chapter and the next.
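As a brief illustration of these functions, the following sketch attaches a custom attribute to a vector and then reads and sets its class. The object name heights, the attribute name "units", and the class label "measurement" are made up for this example and don't come from any package.

```r
# Hypothetical example object; all names here are illustrative only
heights <- c(150, 162, 171)

attributes(heights)                 # NULL: a plain vector has no attributes yet
attr(heights, "units") <- "cm"      # set a custom attribute
attr(heights, "units")              # read it back

class(heights)                      # "numeric" (the implicit class)
class(heights) <- "measurement"     # set a class attribute
names(attributes(heights))          # "units" and "class" are now both stored
```

Functions that know about the "measurement" class could use it to decide how to handle the object; functions that don't will fall back to their default behavior.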
20.1.1 Data types
There are two fundamental data types: atomic vectors and generic vectors. Atomic vectors are arrays that contain a single data type. Generic vectors, also called lists, are collections of atomic vectors. Lists are recursive in that they can also contain other lists. This section considers both types in some detail.
Unlike in many languages, you don’t have to declare an object’s data type or allocate space for it. The type is determined implicitly from the object’s contents, and the size grows or shrinks automatically depending on the type and number of elements the object contains.
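A minimal sketch of this behavior follows. The vector v is an arbitrary example: its type is inferred from its contents, it grows when you assign past its end, and its type changes when a character value is added.

```r
v <- c(1, 2, 3)      # no declaration needed: the type is inferred from the values
typeof(v)            # "double"

v[5] <- 10           # assigning past the end grows the vector; v[4] is filled with NA
length(v)            # 5

v <- c(v, "six")     # adding a string converts every element to character
typeof(v)            # "character"
```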
ATOMIC VECTORS
Atomic vectors are arrays that contain a single data type (logical, real, complex, character, or raw). For example, each of the following is a one-dimensional atomic vector:
passed <- c(TRUE, TRUE, FALSE, TRUE)
ages <- c(15, 18, 25, 14, 19)
cmplxNums <- c(1+2i, 0+1i, 39+3i, 12+2i)
names <- c("Bob", "Ted", "Carol", "Alice")
Vectors of type "raw" hold raw bytes and aren’t discussed here.
Many R data types are atomic vectors with special attributes. For example, R doesn’t have a scalar type. A scalar is an atomic vector with a single element. So k <- 2 is a shortcut for k <- c(2).
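You can check this equivalence directly. The short sketch below uses only base R functions; k is the same example object as in the text.

```r
k <- 2
identical(k, c(2))   # TRUE: both forms create the same one-element vector
length(k)            # 1
is.vector(k)         # TRUE
k[1]                 # a "scalar" can be indexed like any other vector
```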
A matrix is an atomic vector that has a dimension attribute, dim, containing two elements (number of rows and number of columns). For example, start with a one-dimensional numeric vector x:
> x <- c(1,2,3,4,5,6,7,8)
> class(x)
[1] "numeric"
> print(x)
[1] 1 2 3 4 5 6 7 8
Then add a dim attribute:
> attr(x, "dim") <- c(2,4)
The object x is now a 2 × 4 matrix of class matrix:
> print(x)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> class(x)
[1] "matrix"
> attributes(x)
$dim
[1] 2 4
Row and column names can be attached by adding a dimnames attribute:
> attr(x, "dimnames") <- list(c("A1", "A2"),
                              c("B1", "B2", "B3", "B4"))
> print(x)
   B1 B2 B3 B4
A1  1  3  5  7
A2  2  4  6  8
Finally, the matrix can be returned to a one-dimensional vector by removing the dim attribute:
> attr(x, "dim") <- NULL
> class(x)
[1] "numeric"
> print(x)
[1] 1 2 3 4 5 6 7 8
An array is an atomic vector with a dim attribute that has three or more elements. Again, you set the dimensions with the dim attribute, and you can attach labels with the dimnames attribute. Like one-dimensional vectors, matrices and arrays can be of type logical, numeric, character, complex, or raw. But you can’t mix types in a single matrix or array.
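The same steps used for the matrix carry over to arrays. The sketch below builds a 2 × 3 × 4 array from the integers 1 to 24; the object name a and the row, column, and layer labels are invented for this example.

```r
a <- 1:24
dim(a) <- c(2, 3, 4)                      # three dimension sizes make an array
dimnames(a) <- list(c("r1", "r2"),
                    c("c1", "c2", "c3"),
                    paste0("layer", 1:4))

is.array(a)                               # TRUE
a["r2", "c3", "layer1"]                   # 6: elements are filled column by column
```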
The attr() function allows you to create arbitrary attributes and associate them with an object. Attributes store additional information about an object and can be used by functions to determine how they’re processed.
There are a number of special functions for setting attributes, including dim(), dimnames(), names(), row.names(), class(), and tsp(). The latter is used to create time series objects. These special functions have restrictions on the values that can be set. Unless you’re creating custom attributes, it’s always a good idea to use these special functions. Their restrictions and the error messages they produce make coding errors less likely and more obvious.
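To see one of these restrictions in action, the sketch below asks dim() for a 3 × 3 shape on an eight-element vector. The tryCatch() wrapper is there only to capture the error message for display; it isn't required in normal use.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)

# A 3 x 3 matrix needs 9 elements, so the request is refused and x is unchanged
msg <- tryCatch({
  dim(x) <- c(3, 3)
  "no error"
}, error = function(e) conditionMessage(e))
msg                  # the error message explains the mismatch

dim(x) <- c(2, 4)    # a shape that matches the length is accepted
is.matrix(x)         # TRUE
```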
GENERIC VECTORS OR LISTS
Lists are collections of atomic vectors and/or other lists. Data frames are a special type of list, where each atomic vector in the collection has the same length. Consider the iris data frame that comes with the base R installation. It describes four physical measures taken on each of 150 plants, along with their species (setosa, versicolor, or virginica):
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
This data frame is actually a list containing five atomic vectors. It has a names attribute (a character vector of variable names), a row.names attribute (a numeric vector identifying individual plants), and a class attribute with the value "data.frame". Each vector represents a column (variable) in the data frame. This can be easily seen by printing the data frame with the unclass() function and obtaining the attributes with the attributes() function:
unclass(iris)
attributes(iris)
The output is omitted here to save space.
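If you'd rather not scroll through the full output, a few short checks confirm the same structure. This is only a sketch using base R functions on the built-in iris data.

```r
is.list(iris)              # TRUE: a data frame is a list
length(iris)               # 5 components, one per column
sapply(iris, length)       # every component has 150 elements
names(attributes(iris))    # the names, class, and row.names attributes
```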
It’s important to understand lists because R functions frequently return them as values. Let’s look at an example using a cluster-analysis technique from chapter 16. Cluster analysis uses a family of methods to identify naturally occurring groupings of observations.
You’ll apply k-means cluster analysis (section 16.4.1) to the iris data. Assume that there are three clusters present in the data, and observe how the observations (rows) become grouped. You’ll ignore the species variable and use only the physical measures of each plant to form the clusters. The required code is
set.seed(1234)
fit <- kmeans(iris[1:4], 3)
What information is contained in the object fit? The help page for kmeans() indicates that the function returns a list with nine components. The str() function displays the object’s structure, and the unclass() function can be used to examine the object’s contents directly. The length() function indicates how many components the object contains, and the names() function provides the names of these components. You can use the attributes() function to examine the attributes of the object. The contents of the object returned by kmeans() are explored here:
> names(fit)
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"         "iter"
[9] "ifault"
> unclass(fit)
$cluster
  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [29] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 3 2 2 2
 [57] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2
 [85] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3 3 2 3 3 3 3 3
[113] 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 3 3 3 3 2 3
[141] 3 3 2 3 3 3 2 3 3 2

$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

$totss
[1] 681.4

$withinss
[1] 15.15 39.82 23.88

$tot.withinss
[1] 78.85

$betweenss
[1] 602.5

$size
[1] 50 62 38

$iter
[1] 2

$ifault
[1] 0
Executing sapply(fit, class) returns the class of each component in the object:
> sapply(fit, class)
     cluster      centers        totss     withinss tot.withinss
   "integer"     "matrix"    "numeric"    "numeric"    "numeric"
   betweenss         size         iter       ifault
   "numeric"    "integer"    "integer"    "integer"
In this example, cluster is an integer vector containing the cluster memberships, and centers is a matrix containing the cluster centroids (means on each variable for each cluster). The size component is an integer vector containing the number of plants in each of the three clusters. To learn about the other components, see the Value section of help(kmeans).
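The exploration functions mentioned earlier can be tried in a few lines. The sketch below refits the same model so that it runs on its own; the exact cluster labels you get may differ from the printed output, depending on your version of R.

```r
set.seed(1234)
fit <- kmeans(iris[1:4], 3)

is.list(fit)               # TRUE: the result is a list
length(fit)                # the number of components
str(fit, max.level = 1)    # one line of structure per component
sum(fit$size)              # the cluster sizes add up to the 150 plants
class(fit)                 # "kmeans", the class attribute of the result
```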
INDEXING
Learning to unpack the information in a list is a critical R programming skill. The elements of any data object can be extracted via indexing. Before diving into a list, let’s look at extracting elements from an atomic vector.
Elements are extracted using object[index], where object is the vector and index is an integer vector. If the elements of the atomic vector have been named, index can also be a character vector with these names. Note that in R, indices start with 1, not 0 as in many other languages.
Here is an example, using this approach for an atomic vector without named elements:
> x <- c(20, 30, 40)
> x[3]
[1] 40
> x[c(2,3)]
[1] 30 40
For an atomic vector with named elements, you could use
> x <- c(A=20, B=30, C=40)
> x[c(2,3)]
 B  C
30 40
> x[c("B", "C")]
 B  C
30 40
For lists, components (atomic vectors or other lists) can be extracted using object[index], where index is an integer vector. The following uses the fit object from the kmeans example that appears a little later, in listing 20.1:
> fit[c(2,7)]
$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

$size
[1] 50 62 38
Note that components are returned as a list.
To get just the elements in the component, use object[[integer]]:
> fit[2]
$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

> fit[[2]]
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071
In the first case, a list is returned. In the second case, a matrix is returned. The difference can be important, depending on what you do with the results. If you want to pass the results to a function that requires a matrix as input, you would want to use the double-bracket notation.
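The sketch below makes the difference concrete by passing each result to colMeans(), a base R function that expects an array-like argument. The object names as_list and as_matrix are chosen only for this example.

```r
set.seed(1234)
fit <- kmeans(iris[1:4], 3)

as_list   <- fit[2]      # single brackets: a list of length 1
as_matrix <- fit[[2]]    # double brackets: the component itself

is.list(as_list)         # TRUE
is.matrix(as_matrix)     # TRUE

colMeans(as_matrix)      # works: the input is a matrix
# colMeans(as_list)      # would fail, because a list is not an array
```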
To extract a single named component, you can use the $ notation. In this case, object[[integer]] and object$name are equivalent:
> fit$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071
This also explains why the $ notation works with data frames. Consider the iris data frame. The data frame is a special case of a list, where each variable is represented as a component. This is why iris$Sepal.Length returns the 150-element vector of sepal lengths.
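A quick check, sketched here with base R only, shows that the three list notations reach the same column of iris. The object names by_dollar, by_name, and by_position are illustrative.

```r
by_dollar   <- iris$Sepal.Length
by_name     <- iris[["Sepal.Length"]]
by_position <- iris[[1]]

identical(by_dollar, by_name)          # TRUE
identical(by_dollar, by_position)      # TRUE
length(by_dollar)                      # 150

is.data.frame(iris["Sepal.Length"])    # TRUE: single brackets return a one-column data frame
```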
Notations can be combined to obtain the elements within components. For example,
> fit[[2]][1,]
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246
extracts the second component of fit (a matrix of means) and returns the first row (the means for the first cluster on each of the four variables).
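Other combinations work the same way. The following sketch mixes the $ and double-bracket notations with matrix indexing by row number and column name; it refits the model so that it runs on its own.

```r
set.seed(1234)
fit <- kmeans(iris[1:4], 3)

fit$centers[1, ]                      # the same row as fit[[2]][1,]
fit$centers[1, "Petal.Length"]        # one mean, selected by row number and column name
fit[["centers"]][, "Sepal.Width"]     # sepal-width means for all three clusters

identical(fit[[2]][1, ], fit$centers[1, ])   # TRUE
```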
By extracting the components and elements of lists returned by functions, you can take the results and go further. For example, to plot the cluster centroids with a line graph, you can use the following code.
Listing 20.1 Plotting the centroids from a k-means cluster analysis
> set.seed(1234)
> fit <- kmeans(iris[1:4], 3)
> means <- fit$centers                 # Obtains the cluster means
> library(reshape2)
> dfm <- melt(means)                   # Reshapes the data to long form
> names(dfm) <- c("Cluster", "Measurement", "Centimeters")
> dfm$Cluster <- factor(dfm$Cluster)
> head(dfm)
  Cluster  Measurement Centimeters
1       1 Sepal.Length       5.006
2       2 Sepal.Length       5.902
3       3 Sepal.Length       6.850
4       1  Sepal.Width       3.428
5       2  Sepal.Width       2.748
6       3  Sepal.Width       3.074