
CHAPTER 16

Cluster analysis

 

 

 

 

 

 

 

 

Listing 16.5 Partitioning around medoids for the wine data

> library(cluster)
> set.seed(1234)
> fit.pam <- pam(wine[-1], k=3, stand=TRUE)         # Clusters standardized data
> fit.pam$medoids                                   # Prints the medoids
     Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
[1,]    13.5  1.81 2.41       20.5       100    2.70       2.98
[2,]    12.2  1.73 2.12       19.0        80    1.65       2.03
[3,]    13.4  3.91 2.48       23.0       102    1.80       0.75
     Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
[1,]          0.26            1.86   5.1 1.04     3.47     920
[2,]          0.37            1.63   3.4 1.00     3.17     510
[3,]          0.43            1.41   7.3 0.70     1.56     750
> clusplot(fit.pam, main="Bivariate Cluster Plot")  # Plots the cluster solution

Note that the medoids are actual observations contained in the wine dataset. In this case, they’re observations 36, 107, and 175, and they have been chosen to represent the three clusters. The bivariate plot is created by plotting the coordinates of each observation on the first two principal components (see chapter 14) derived from the 13 assay variables. Each cluster is represented by an ellipse with the smallest area containing all its points.
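The pam object also records the row indices of the chosen medoids in its id.med component, so you can confirm which observations were selected:

> fit.pam$id.med
[1]  36 107 175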

Also note that PAM didn’t perform as well as k-means in this instance:

> ct.pam <- table(wine$Type, fit.pam$clustering)
> ct.pam
     1  2  3
  1 59  0  0
  2 16 53  2
  3  0  1 47
> randIndex(ct.pam)
[1] 0.699

The adjusted Rand index has decreased from 0.9 (for k-means) to 0.7.
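If the fit.km object from the earlier k-means analysis is still in your workspace (an assumption; re-run that listing if not), you can also cross-tabulate the two solutions to see exactly where they diverge. The randIndex() function again comes from the flexclust package loaded earlier:

> ct.km.pam <- table(fit.km$cluster, fit.pam$clustering)
> randIndex(ct.km.pam)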

16.5 Avoiding nonexistent clusters

Before I finish this discussion, a word of caution is in order. Cluster analysis is a methodology designed to identify cohesive subgroups in a dataset. It’s very good at doing this. In fact, it’s so good that it can find clusters where none exist.

Consider the following code:

library(fMultivar)
set.seed(1234)
df <- rnorm2d(1000, rho=.5)
df <- as.data.frame(df)
plot(df, main="Bivariate Normal Distribution with rho=0.5")

The rnorm2d() function in the fMultivar package is used to sample 1,000 observations from a bivariate normal distribution with a correlation of 0.5. The resulting graph is displayed in figure 16.7. Clearly there are no clusters in this data.
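If you'd rather not install fMultivar, a comparable sample can be drawn with mvrnorm() from the recommended MASS package. This is an alternative sketch, not the book's code, and a given seed won't reproduce exactly the same points as rnorm2d():

library(MASS)
set.seed(1234)
sigma <- matrix(c(1, 0.5, 0.5, 1), nrow=2)   # unit variances, correlation 0.5
df <- as.data.frame(mvrnorm(1000, mu=c(0, 0), Sigma=sigma))
names(df) <- c("V1", "V2")                   # match the rnorm2d() column names
plot(df, main="Bivariate Normal Distribution with rho=0.5")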

Figure 16.7 Bivariate normal data (n = 1000). There are no clusters in this data.

The wssplot() and NbClust() functions are then used to determine the number of clusters present:

wssplot(df)
library(NbClust)
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
dev.new()
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")

The results are plotted in figures 16.8 and 16.9.

Figure 16.8 Plot of within-groups sums of squares vs. number of k-means clusters for bivariate normal data


 

 

Figure 16.9 Number of clusters recommended for bivariate normal data by criteria in the NbClust package. Two or three clusters are suggested.

The wssplot() function suggests that there are three clusters, whereas many of the criteria returned by NbClust() suggest between two and three clusters. If you carry out a two-cluster analysis with PAM,

library(ggplot2)
library(cluster)
fit <- pam(df, k=2)
df$clustering <- factor(fit$clustering)
ggplot(data=df, aes(x=V1, y=V2, color=clustering, shape=clustering)) +
  geom_point() +
  ggtitle("Clustering of Bivariate Normal Data")

you get the two-cluster plot shown in figure 16.10. (The ggplot() statement is part of the comprehensive graphics package ggplot2. Chapter 19 covers ggplot2 in detail.)

Figure 16.10 PAM cluster analysis of bivariate normal data, extracting two clusters. Note that the clusters are an arbitrary division of the data.

 

 

 

 


 

Figure 16.11 CCC plot for bivariate normal data. It correctly suggests that no clusters are present.

Clearly the partitioning is artificial. There are no real clusters here. How can you avoid this mistake? Although it isn’t foolproof, I have found that the Cubic Clustering Criterion (CCC) reported by NbClust can often help to uncover situations where no structure exists. The code is

plot(nc$All.index[,4], type="o", ylab="CCC", xlab="Number of clusters", col="blue")

and the resulting graph is displayed in figure 16.11. When the CCC values are all negative and decreasing for two or more clusters, the distribution is typically unimodal.
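If you want a rough programmatic version of that visual check, you can test the same column of All.index that the plot draws from. This is a heuristic of my own, not a formal test, and the strict-monotonicity condition may need relaxing in practice:

ccc <- nc$All.index[, 4]                  # column 4 holds the CCC values
if (all(ccc < 0) && all(diff(ccc) < 0)) {
  cat("CCC is negative and decreasing: the data may be unimodal\n")
}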

The ability of cluster analysis (or your interpretation of it) to find erroneous clusters makes validation an important step. If you’re trying to identify clusters that are “real” in some sense (rather than a convenient partitioning), be sure the results are robust and repeatable. Try different clustering methods, and replicate the findings with new samples; one such check is sketched below. If the same clusters are consistently recovered, you can be more confident in the results.
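For example, on the wine data you could compare the PAM solution against an average-linkage hierarchical solution and measure their agreement. This sketch assumes the wine data frame and the flexclust package from earlier in the chapter are still loaded:

library(cluster)
library(flexclust)
df.std <- scale(wine[-1])                  # standardize the 13 assay variables
pam.cl <- pam(df.std, k=3)$clustering
hc.cl  <- cutree(hclust(dist(df.std), method="average"), k=3)
randIndex(table(pam.cl, hc.cl))            # values near 1 indicate strong agreement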

16.6 Summary

In this chapter, we reviewed some of the most common approaches to clustering observations into cohesive groups. First we outlined the general steps in a comprehensive cluster analysis. Next, common methods for hierarchical and partitioning clustering were described. Finally, I reinforced the need to validate the resulting clusters in situations where you seek more than a convenient partitioning.

Cluster analysis is a broad topic, and R has some of the most comprehensive facilities for applying this methodology currently available. To learn more about these capabilities, see the CRAN Task View for Cluster Analysis & Finite Mixture Models (http://cran.r-project.org/web/views/Cluster.html). Additionally, Tan, Steinbach, & Kumar (2006) have an excellent book on data-mining techniques. It contains a lucid


chapter on cluster analysis that you can download freely (www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf). Finally, Everitt, Landau, Leese, & Stahl (2011) have written a practical and highly regarded textbook on this subject.

Cluster analysis is a methodology for discovering cohesive subgroups of observations in a dataset. In the next chapter, we’ll consider situations where the groups have already been defined and your goal is to find an accurate method of classifying observations into them.
