
Cluster analysis

This chapter covers

Identifying cohesive subgroups (clusters) of observations

Determining the number of clusters present

Obtaining a nested hierarchy of clusters

Obtaining discrete clusters

Cluster analysis is a data-reduction technique designed to uncover subgroups of observations within a dataset. It allows you to reduce a large number of observations to a much smaller number of clusters or types. A cluster is defined as a group of observations that are more similar to each other than they are to the observations in other groups. This isn’t a precise definition, and that fact has led to an enormous variety of clustering methods.

Cluster analysis is widely used in the biological and behavioral sciences, marketing, and medical research. For example, a psychological researcher might cluster data on the symptoms and demographics of depressed patients, seeking to uncover subtypes of depression. The hope would be that finding such subtypes might lead to more targeted and effective treatments and a better understanding of the disorder. Marketing researchers use cluster analysis as a customer-segmentation strategy.

Customers are arranged into clusters based on the similarity of their demographics and buying behaviors. Marketing campaigns are then tailored to appeal to one or more of these subgroups. Medical researchers use cluster analysis to help catalog gene-expression patterns obtained from DNA microarray data. This can help them to understand normal growth and development and the underlying causes of many human diseases.

The two most popular clustering approaches are hierarchical agglomerative clustering and partitioning clustering. In agglomerative hierarchical clustering, each observation starts as its own cluster. Clusters are then combined, two at a time, until all clusters are merged into a single cluster. In the partitioning approach, you specify K: the number of clusters sought. Observations are then randomly divided into K groups and reshuffled to form cohesive clusters.

Within each of these broad approaches, there are many clustering algorithms to choose from. For hierarchical clustering, the most popular are single linkage, complete linkage, average linkage, centroid, and Ward’s method. For partitioning, the two most popular are k-means and partitioning around medoids (PAM). Each clustering method has advantages and disadvantages, which we’ll discuss.
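As a quick illustration of the difference between the two approaches, here is a minimal sketch using synthetic data; the functions themselves are covered in detail later in the chapter, and the choice of four clusters is purely illustrative.

set.seed(1234)
x <- matrix(rnorm(100 * 2), ncol=2)          # 100 synthetic observations on 2 variables

# Hierarchical agglomerative clustering: build the full merge tree,
# then cut it afterward at any desired number of groups
hfit <- hclust(dist(x), method="complete")
groups.h <- cutree(hfit, k=4)

# Partitioning (k-means): the number of clusters must be specified up front
kfit <- kmeans(x, centers=4, nstart=25)
groups.k <- kfit$cluster

table(groups.h, groups.k)                    # compare the two four-group solutions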

The examples in this chapter focus on food and wine (I suspect my friends aren’t surprised). Hierarchical clustering is applied to the nutrient dataset contained in the flexclust package to answer the following questions:

What are the similarities and differences among 27 types of fish, fowl, and meat, based on 5 nutrient measures?

Is there a smaller number of groups into which these foods can be meaningfully clustered?

Partitioning methods will be used to evaluate 13 chemical analyses of 178 Italian wine samples. The data are contained in the wine dataset available with the rattle package. Here, the questions are as follows:

Are there subtypes of wine in the data?

If so, how many subtypes are there, and what are their characteristics?

In fact, the wine samples represent three varietals (recorded as Type). This will allow you to evaluate how well the cluster analysis recovers the underlying structure.
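If you'd like a quick look at these data before the formal analysis (a minimal sketch; the clustering itself comes later, in the partitioning section), you can load the dataset and inspect the Type variable:

data(wine, package="rattle")
head(wine, 3)        # 13 chemical measurements plus the Type variable
table(wine$Type)     # number of samples of each of the three varietals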

Although there are many approaches to cluster analysis, they usually follow a similar set of steps. These common steps are described in section 16.1. Hierarchical agglomerative clustering is described in section 16.3, and partitioning methods are covered in section 16.4. Some final advice and cautionary statements are provided in section 16.5. In order to run the examples in this chapter, be sure to install the cluster, NbClust, flexclust, fMultivar, ggplot2, and rattle packages. The rattle package will also be used in chapter 17.
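If any of these packages are missing from your installation, they can be installed from CRAN in the usual way:

install.packages(c("cluster", "NbClust", "flexclust", "fMultivar", "ggplot2", "rattle"))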

16.1 Common steps in cluster analysis

Like factor analysis (chapter 14), an effective cluster analysis is a multistep process with numerous decision points. Each decision can affect the quality and usefulness of the results. This section describes the 11 typical steps in a comprehensive cluster analysis:

1. Choose appropriate attributes. The first (and perhaps most important) step is to select variables that you feel may be important for identifying and understanding differences among groups of observations within the data. For example, in a study of depression, you might want to assess one or more of the following: psychological symptoms; physical symptoms; age at onset; number, duration, and timing of episodes; number of hospitalizations; functional status with regard to self-care; social and work history; current age; gender; ethnicity; socioeconomic status; marital status; family medical history; and response to previous treatments. A sophisticated cluster analysis can’t compensate for a poor choice of variables.

2. Scale the data. If the variables in the analysis vary in range, the variables with the largest range will have the greatest impact on the results. This is often undesirable, and analysts scale the data before continuing. The most popular approach is to standardize each variable to a mean of 0 and a standard deviation of 1. Other alternatives include dividing each variable by its maximum value or subtracting the variable’s mean and dividing by the variable’s median absolute deviation. The three approaches are illustrated with the following code snippets:

df1 <- apply(mydata, 2, function(x){(x - mean(x))/sd(x)})    # standardize to mean 0, sd 1
df2 <- apply(mydata, 2, function(x){x/max(x)})               # divide by the maximum value
df3 <- apply(mydata, 2, function(x){(x - mean(x))/mad(x)})   # center and divide by the MAD

In this chapter, you’ll use the scale() function to standardize the variables to a mean of 0 and a standard deviation of 1. This is equivalent to the first code snippet (df1).

3. Screen for outliers. Many clustering techniques are sensitive to outliers, which can distort the cluster solutions obtained. You can screen for (and remove) univariate outliers using functions from the outliers package. The mvoutlier package contains functions that can be used to identify multivariate outliers. An alternative is to use a clustering method that is robust to the presence of outliers. Partitioning around medoids (section 16.4.2) is an example of the latter approach.

4. Calculate distances. Although clustering algorithms vary widely, they typically require a measure of the distance among the entities to be clustered. The most popular measure of the distance between two observations is the Euclidean distance, but the Manhattan, Canberra, asymmetric binary, maximum, and Minkowski distance measures are also available (see ?dist for details). In this chapter, the Euclidean distance is used throughout. Calculating Euclidean distances is covered in section 16.2.

5. Select a clustering algorithm. Next, you select a method of clustering the data. Hierarchical clustering is useful for smaller problems (say, 150 observations or fewer) and where a nested hierarchy of groupings is desired. The partitioning method can handle much larger problems but requires that the number of clusters be specified in advance.

Once you’ve chosen the hierarchical or partitioning approach, you must select a specific clustering algorithm. Again, each has advantages and disadvantages. The most popular methods are described in sections 16.3 and 16.4. You may wish to try more than one algorithm to see how robust the results are to the choice of methods.

6. Obtain one or more cluster solutions. This step uses the method(s) selected in step 5.

7. Determine the number of clusters present. In order to obtain a final cluster solution, you must decide how many clusters are present in the data. This is a thorny problem, and many approaches have been proposed. It usually involves extracting various numbers of clusters (say, 2 to K) and comparing the quality of the solutions. The NbClust() function in the NbClust package provides 30 different indices to help you make this decision (elegantly demonstrating how unresolved this issue is). NbClust is used throughout this chapter; a brief, illustrative sketch of this and the surrounding steps appears after this list.

8. Obtain a final clustering solution. Once the number of clusters has been determined, a final clustering is performed to extract that number of subgroups.

9. Visualize the results. Visualization can help you determine the meaning and usefulness of the cluster solution. The results of a hierarchical clustering are usually presented as a dendrogram. Partitioning results are typically visualized using a bivariate cluster plot.

10. Interpret the clusters. Once a cluster solution has been obtained, you must interpret (and possibly name) the clusters. What do the observations in a cluster have in common? How do they differ from the observations in other clusters? This step is typically accomplished by obtaining summary statistics for each variable by cluster. For continuous data, the mean or median for each variable within each cluster is calculated. For mixed data (data that contain categorical variables), the summary statistics will also include modes or category distributions.

11. Validate the results. Validating the cluster solution involves asking the question, “Are these groupings in some sense real, and not a manifestation of unique aspects of this dataset or statistical technique?” If a different cluster method or different sample is employed, would the same clusters be obtained? The fpc, clv, and clValid packages each contain functions for evaluating the stability of a clustering solution.
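To make these steps concrete, here is a compact sketch that walks the nutrient data (introduced in section 16.2) through several of them. It is illustrative only: the choices of average linkage and a five-cluster solution are assumptions made for the example, not recommendations.

data(nutrient, package="flexclust")
nutrient.scaled <- scale(nutrient)                       # step 2: standardize to mean 0, sd 1

d <- dist(nutrient.scaled)                               # step 4: Euclidean distances

fit <- hclust(d, method="average")                       # steps 5-6: average-linkage hierarchical clustering

library(NbClust)                                         # step 7: let many indices suggest a number of clusters
nc <- NbClust(nutrient.scaled, distance="euclidean",
              min.nc=2, max.nc=15, method="average")

clusters <- cutree(fit, k=5)                             # step 8: extract a five-cluster solution

plot(fit, hang=-1, main="Average-Linkage Clustering")    # step 9: dendrogram
rect.hclust(fit, k=5)                                    # outline the five clusters

aggregate(nutrient, by=list(cluster=clusters), median)   # step 10: median nutrient profile per cluster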

Because the calculation of distances between observations is such an integral part of cluster analysis, it’s described next in some detail.

16.2 Calculating distances

Every cluster analysis begins with the calculation of a distance, dissimilarity, or proximity between each pair of entities to be clustered. The Euclidean distance between two observations is given by

$d_{ij} = \sqrt{\sum_{p=1}^{P}(x_{ip} - x_{jp})^2}$

where i and j are observations and P is the number of variables.

Consider the nutrient dataset provided with the flexclust package. The dataset contains measurements on the nutrients of 27 types of meat, fish, and fowl. The first few observations are given by

> data(nutrient, package="flexclust")
> head(nutrient, 4)
             energy protein fat calcium iron
BEEF BRAISED    340      20  28       9  2.6
HAMBURGER       245      21  17       9  2.7
BEEF ROAST      420      15  39       7  2.0
BEEF STEAK      375      19  32       9  2.6

and the Euclidean distance between the first two (beef braised and hamburger) is

$d = \sqrt{(340-245)^2 + (20-21)^2 + (28-17)^2 + (9-9)^2 + (2.6-2.7)^2} = 95.64$

The dist() function in the base R installation can be used to calculate the distances between all rows (observations) of a matrix or data frame. The format is dist(x, method=), where x is the input data and method="euclidean" by default. The function returns the lower triangle of the distance matrix by default, but the as.matrix() function can be used to access the distances using standard bracket notation. For the nutrient data frame,

> d <- dist(nutrient)
> as.matrix(d)[1:4,1:4]
             BEEF BRAISED HAMBURGER BEEF ROAST BEEF STEAK
BEEF BRAISED          0.0      95.6       80.9       35.2
HAMBURGER            95.6       0.0      176.5      130.9
BEEF ROAST           80.9     176.5        0.0       45.8
BEEF STEAK           35.2     130.9       45.8        0.0

Larger distances indicate larger dissimilarities between observations. The distance between an observation and itself is 0. As expected, the dist() function provides the same distance between beef braised and hamburger as the hand calculations.
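As a quick check of the formula (illustrative only, not one of the book’s listings), you can reproduce the hand calculation directly in R:

data(nutrient, package="flexclust")
sqrt(sum((nutrient[1, ] - nutrient[2, ])^2))   # distance between beef braised and hamburger, about 95.64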

Cluster analysis with mixed data types

Euclidean distances are usually the distance measure of choice for continuous data. But if other variable types are present, alternative dissimilarity measures are required. You can use the daisy() function in the cluster package to obtain a dissimilarity matrix among observations that have any combination of binary, nominal, ordinal, and continuous attributes. Other functions in the cluster package can use these dissimilarities to carry out a cluster analysis. For example, agnes() offers agglomerative hierarchical clustering, and pam() provides partitioning around medoids.
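For example, a minimal sketch of this workflow on a tiny invented data frame (the variables and the choice of two clusters are made up purely for illustration) might look like this:

library(cluster)
df <- data.frame(
  age    = c(25, 34, 58, 45, 63, 29),                                    # continuous
  smoker = factor(c("yes", "no", "no", "yes", "no", "yes")),             # nominal
  income = ordered(c("low", "high", "high", "medium", "low", "medium"),
                   levels=c("low", "medium", "high"))                    # ordinal
)

d <- daisy(df, metric="gower")                       # Gower dissimilarities handle mixed variable types
fit.agnes <- agnes(d, diss=TRUE, method="average")   # agglomerative hierarchical clustering
fit.pam   <- pam(d, k=2, diss=TRUE)                  # partitioning around two medoids
fit.pam$clustering                                   # cluster membership for each observation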

Note that distances in the nutrient data frame are heavily dominated by the contribution of the energy variable, which has a much larger range. Scaling the data will help to equalize the contribution of each variable.
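One way to see the effect (a quick illustrative follow-up, not one of the book’s listings) is to recompute the distances on standardized data and compare them with the matrix above:

d.scaled <- dist(scale(nutrient))
as.matrix(d.scaled)[1:4, 1:4]      # energy no longer dominates the distances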
