

to equalize the impact of each variable. In the next section, you’ll apply hierarchical cluster analysis to this dataset.

16.3 Hierarchical cluster analysis

As stated previously, in agglomerative hierarchical clustering, each case or observation starts as its own cluster. Clusters are then combined two at a time until all clusters are merged into a single cluster. The algorithm is as follows:

1. Define each observation (row, case) as a cluster.

2. Calculate the distances between every cluster and every other cluster.

3. Combine the two clusters that have the smallest distance. This reduces the number of clusters by one.

4. Repeat steps 2 and 3 until all clusters have been merged into a single cluster containing all observations. (A small R sketch of steps 2 and 3 follows this list.)
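To make steps 2 and 3 concrete, here is a minimal sketch on made-up data; the object names (pts, d.pts) are illustrative only and not part of the chapter's example.

set.seed(1234)
pts <- matrix(rnorm(10), ncol=2)                  # five toy observations = five starting clusters
d.pts <- as.matrix(dist(pts))                     # step 2: distance between every pair of clusters
diag(d.pts) <- Inf                                # ignore each cluster's zero distance to itself
which(d.pts == min(d.pts), arr.ind=TRUE)[1, ]     # step 3: the pair of clusters merged first

The hclust() function described below repeats this search-and-merge cycle, with the between-cluster distance defined by the chosen method, until a single cluster remains.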

The primary difference among hierarchical clustering algorithms is their definitions of cluster distances (step 2). Five of the most common hierarchical clustering methods and their definitions of the distance between two clusters are given in table 16.1.

Table 16.1 Hierarchical clustering methods (definition of the distance between two clusters)

Single linkage: Shortest distance between a point in one cluster and a point in the other cluster.

Complete linkage: Longest distance between a point in one cluster and a point in the other cluster.

Average linkage: Average distance between each point in one cluster and each point in the other cluster (also called UPGMA [unweighted pair group mean averaging]).

Centroid: Distance between the centroids (vector of variable means) of the two clusters. For a single observation, the centroid is the variable's values.

Ward: The ANOVA sum of squares between the two clusters added up over all the variables.

Single-linkage clustering tends to find elongated, cigar-shaped clusters. It also commonly displays a phenomenon called chaining—dissimilar observations are joined into the same cluster because they’re similar to intermediate observations between them. Complete-linkage clustering tends to find compact clusters of approximately equal diameter. It can also be sensitive to outliers. Average-linkage clustering offers a compromise between the two. It’s less likely to chain and is less susceptible to outliers. It also has a tendency to join clusters with small variances.

Ward’s method tends to join clusters with small numbers of observations and tends to produce clusters with roughly equal numbers of observations. It can also be sensitive to outliers. The centroid method offers an attractive alternative due to its simple and easily understood definition of cluster distances. It’s also less sensitive to outliers than other hierarchical methods. But it may not perform as well as the average-linkage or Ward method.


Hierarchical clustering can be accomplished with the hclust() function. The format is hclust(d, method=), where d is a distance matrix produced by the dist() function and methods include "single", "complete", "average", "centroid", and "ward".
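As a quick, self-contained sketch (toy data, not the nutrient dataset; names such as d.toy are made up for illustration), the same distance matrix can be passed to hclust() with different method values and the resulting trees compared side by side. Note that recent versions of R spell Ward's method "ward.D" or "ward.D2".

set.seed(1234)
toy <- matrix(rnorm(20), ncol=2)                  # ten toy observations
d.toy <- dist(toy)
methods <- c("single", "complete", "average", "centroid")
fits <- lapply(methods, function(m) hclust(d.toy, method=m))
opar <- par(mfrow=c(2, 2))                        # one dendrogram per linkage method
for (f in fits) plot(f, hang=-1, cex=.8, main=paste(f$method, "linkage"))
par(opar)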

In this section, you’ll apply average-linkage clustering to the nutrient data introduced in section 16.1.1. The goal is to identify similarities, differences, and groupings among 27 food types based on nutritional information. The code for carrying out the clustering is provided in the following listing.

Listing 16.1 Average-linkage clustering of the nutrient data

data(nutrient, package="flexclust")
row.names(nutrient) <- tolower(row.names(nutrient))
nutrient.scaled <- scale(nutrient)

d <- dist(nutrient.scaled)

fit.average <- hclust(d, method="average")

plot(fit.average, hang=-1, cex=.8, main="Average Linkage Clustering")

First the data are imported, and the row names are set to lowercase (because I hate UPPERCASE LABELS). Because the variables differ widely in range, they’re standardized to a mean of 0 and a standard deviation of 1. Euclidean distances between each of the 27 food types are calculated, and an average-linkage clustering is performed. Finally, the results are plotted as a dendrogram (see figure 16.1). The hang option in the plot() function justifies the observation labels (causing them to hang down from 0).

[Figure 16.1 Average-linkage clustering of nutrient data: dendrogram of the 27 food types, produced by hclust(*, "average"); y-axis: Height]


The dendrogram displays how items are combined into clusters and is read from the bottom up. Each observation starts as its own cluster. Then the two observations that are closest (beef braised and smoked ham) are combined. Next, pork roast and pork simmered are combined, followed by chicken canned and tuna canned. In the fourth step, the beef braised/smoked ham cluster and the pork roast/pork simmered cluster are combined (and this cluster now contains four food items). This continues until all observations are combined into a single cluster. The height dimension indicates the criterion value at which clusters are joined. For average-linkage clustering, this criterion is the average distance between each point in one cluster and each point in the other cluster.
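The merge sequence and the heights displayed on the dendrogram are stored in the fitted hclust object itself; the following short sketch uses fit.average from listing 16.1 (output not shown).

fit.average$merge[1:4, ]     # first four merges; negative entries denote single observations
fit.average$height[1:4]      # average-linkage distances at which those merges occur
fit.average$labels[1:3]      # row names used to label the dendrogram leaves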

If your goal is to understand how food types are similar or different with regard to their nutrients, then figure 16.1 may be sufficient. It creates a hierarchical view of the similarity/dissimilarity among the 27 items. Canned tuna and chicken are similar, and both differ greatly from canned clams. But if the end goal is to assign these foods to a smaller number of (hopefully meaningful) groups, additional analyses are required to select an appropriate number of clusters.

The NbClust package offers numerous indices for determining the best number of clusters in a cluster analysis. There is no guarantee that they will agree with each other. In fact, they probably won’t. But the results can be used as a guide for selecting possible candidate values for K, the number of clusters. Input to the NbClust() function includes the matrix or data frame to be clustered, the distance measure and clustering method to employ, and the minimum and maximum number of clusters to consider. It returns each of the clustering indices, along with the best number of clusters proposed by each. The next listing applies this approach to the average-linkage clustering of the nutrient data.

Listing 16.2 Selecting the number of clusters

> library(NbClust)
> devAskNewPage(ask=TRUE)
> nc <- NbClust(nutrient.scaled, distance="euclidean",
                min.nc=2, max.nc=15, method="average")
> table(nc$Best.n[1,])

 0  2  3  4  5  9 10 13 14 15
 2  4  4  3  4  1  1  2  1  4

> barplot(table(nc$Best.n[1,]),
          xlab="Number of Clusters", ylab="Number of Criteria",
          main="Number of Clusters Chosen by 26 Criteria")

Here, four criteria each favor two clusters, four criteria favor three clusters, and so on. The results are plotted in figure 16.2.

You could try the number of clusters (2, 3, 5, and 15) with the most “votes” and select the one that makes the most interpretive sense. The following listing explores the five-cluster solution.

 

[Figure 16.2 Recommended number of clusters using 26 criteria provided by the NbClust package (bar chart; x-axis: Number of Clusters, y-axis: Number of Criteria)]

Listing 16.3 Obtaining the final cluster solution

> clusters <- cutree(fit.average, k=5)     # b Assigns cases
> table(clusters)
clusters
 1  2  3  4  5
 7 16  1  2  1

> aggregate(nutrient, by=list(cluster=clusters), median)     # c Describes clusters
  cluster energy protein fat calcium iron
1       1  340.0      19  29       9 2.50
2       2  170.0      20   8      13 1.45
3       3  160.0      26   5      14 5.90
4       4   57.5       9   1      78 5.70
5       5  180.0      22   9     367 2.50

> aggregate(as.data.frame(nutrient.scaled), by=list(cluster=clusters),
            median)
  cluster energy protein    fat calcium    iron
1       1  1.310   0.000  1.379  -0.448  0.0811
2       2 -0.370   0.235 -0.487  -0.397 -0.6374
3       3 -0.468   1.646 -0.753  -0.384  2.4078
4       4 -1.481  -2.352 -1.109   0.436  2.2709
5       5 -0.271   0.706 -0.398   4.140  0.0811

> plot(fit.average, hang=-1, cex=.8,     # d Plots results
       main="Average Linkage Clustering\n5 Cluster Solution")
> rect.hclust(fit.average, k=5)

The cutree() function is used to cut the tree into five clusters b. The first cluster has 7 observations, the second cluster has 16 observations, and so on. The aggregate() function is then used to obtain the median profile for each cluster c. The results are reported in both the original metric and in standardized form. Finally, the dendrogram is replotted, and the rect.hclust() function is used to superimpose the five-cluster solution on it d.
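To see which food types fall into each group, one option (a sketch assuming the clusters vector and nutrient data frame created in listings 16.1 and 16.3) is to split the row names by cluster membership:

split(row.names(nutrient), clusters)     # food types listed by cluster
row.names(nutrient)[clusters == 3]       # members of a single cluster, here cluster 3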
