
[Figure 16.5 Recommended number of clusters using 26 criteria provided by the NbClust package. Bar chart titled "Number of Clusters Chosen by 26 Criteria"; x-axis: Number of Clusters; y-axis: Number of Criteria.]

Because the cluster centroids are based on standardized data, the aggregate() function is used along with the cluster memberships to determine variable means for each cluster in the original metric.
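A minimal sketch of that call, assuming the wine data frame holds the varietal label Type in its first column and the k-means solution is stored in fit.km:

> aggregate(wine[-1], by=list(cluster=fit.km$cluster), mean)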

How well did k-means clustering uncover the actual structure of the data contained in the Type variable? A cross-tabulation of Type (wine varietal) and cluster membership is given by

> ct.km <- table(wine$Type, fit.km$cluster)
> ct.km

     1  2  3
  1 59  0  0
  2  3 65  3
  3  0  0 48

You can quantify the agreement between type and cluster using an adjusted Rand index, provided by the flexclust package:

> library(flexclust)
> randIndex(ct.km)
[1] 0.897

The adjusted Rand index provides a measure of the agreement between two partitions, adjusted for chance. It ranges from -1 (no agreement) to 1 (perfect agreement). Agreement between the wine varietal type and the cluster solution is 0.9. Not bad—shall we have some wine?

16.4.2 Partitioning around medoids

Because it’s based on means, the k-means clustering approach can be sensitive to outliers. A more robust solution is provided by partitioning around medoids (PAM). Rather than representing each cluster using a centroid (a vector of variable means), each cluster is identified by its most representative observation (called a medoid). Whereas k-means uses Euclidean distances, PAM can be based on any distance measure. It can therefore accommodate mixed data types and isn’t limited to continuous variables.


The PAM algorithm is as follows:

1. Randomly select K observations (call each a medoid).
2. Calculate the distance/dissimilarity of every observation to each medoid.
3. Assign each observation to its closest medoid.
4. Calculate the sum of the distances of each observation from its medoid (total cost).
5. Select a point that isn't a medoid, and swap it with its medoid.
6. Reassign every point to its closest medoid.
7. Calculate the total cost.
8. If this total cost is smaller, keep the new point as a medoid.
9. Repeat steps 5-8 until the medoids don't change.

A good worked example of the underlying math in the PAM approach can be found at http://en.wikipedia.org/wiki/k-medoids (I don’t usually cite Wikipedia, but this is a great example).
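To make the steps above concrete, here is a toy R implementation of the swap logic. It's an illustrative sketch only: the function name simple_pam is hypothetical, it's restricted to Euclidean distance, and cluster::pam() uses a more refined and far more efficient build/swap procedure.

simple_pam <- function(x, k) {
  d <- as.matrix(dist(x))               # pairwise Euclidean distances (x must be numeric)
  n <- nrow(d)
  medoids <- sample(n, k)               # step 1: random initial medoids
  total.cost <- function(m)             # steps 2-4: sum of distances to nearest medoid
    sum(apply(d[, m, drop=FALSE], 1, min))
  cost <- total.cost(medoids)
  repeat {
    improved <- FALSE
    for (i in seq_len(k)) {             # steps 5-8: try swapping each medoid
      for (o in setdiff(seq_len(n), medoids)) {
        candidate <- medoids
        candidate[i] <- o
        new.cost <- total.cost(candidate)
        if (new.cost < cost) {          # keep the swap only if it lowers total cost
          medoids <- candidate
          cost <- new.cost
          improved <- TRUE
        }
      }
    }
    if (!improved) break                # step 9: stop when the medoids no longer change
  }
  clusters <- apply(d[, medoids, drop=FALSE], 1, which.min)
  list(medoids=medoids, clusters=clusters, cost=cost)
}

For example, simple_pam(scale(wine[-1]), 3) would partition the standardized wine measurements into three groups.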

You can use the pam() function in the cluster package to partition around medoids. The format is pam(x, k, metric="euclidean", stand=FALSE), where x is a data matrix or data frame, k is the number of clusters, metric is the type of distance/dissimilarity measure to use, and stand is a logical value indicating whether the variables should be standardized before calculating this metric. PAM is applied to the wine data in the following listing; see figure 16.6.
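The listing itself did not survive extraction; a minimal sketch of the call sequence it describes, assuming the same wine data frame as before (the seed value and the stand=TRUE option are assumptions):

> library(cluster)
> set.seed(1234)
> fit.pam <- pam(wine[-1], k=3, stand=TRUE)
> fit.pam$medoids
> clusplot(fit.pam, main="Bivariate Cluster Plot")

Here fit.pam$medoids lists the representative observation for each cluster in the original metric, and clusplot() projects the solution onto the first two principal components, producing the plot in figure 16.6.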

[Figure: scatter plot titled "Bivariate Cluster Plot"; x-axis: Component 1; y-axis: Component 2; annotation: "These two components explain 55.41 % of the point variability."]

Figure 16.6 Cluster plot for the three-group PAM clustering of the Italian wine data
