
384					CHAPTER 16	Cluster analysis

	Listing 16.5 Partitioning around medoids for the wine data
	> library(cluster)
	> set.seed(1234)
	> fit.pam <- pam(wine[-1], k=3, stand=TRUE)          # Clusters standardized data
	> fit.pam$medoids                                    # Prints the medoids
	     Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids
	[1,]    13.5  1.81 2.41       20.5       100    2.70       2.98
	[2,]    12.2  1.73 2.12       19.0        80    1.65       2.03
	[3,]    13.4  3.91 2.48       23.0       102    1.80       0.75
	     Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
	[1,]          0.26            1.86   5.1 1.04     3.47     920
	[2,]          0.37            1.63   3.4 1.00     3.17     510
	[3,]          0.43            1.41   7.3 0.70     1.56     750
	> clusplot(fit.pam, main="Bivariate Cluster Plot")   # Plots the cluster solution

Note that the medoids are actual observations contained in the wine dataset. In this case, they’re observations 36, 107, and 175, and they have been chosen to represent the three clusters. The bivariate plot is created by plotting the coordinates of each observation on the first two principal components (see chapter 14) derived from the 13 assay variables. Each cluster is represented by an ellipse with the smallest area containing all its points.
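To make the point about medoids concrete, here is a minimal sketch. It assumes the built-in iris measurements as stand-in data (the chapter itself clusters wine[-1]), and the object names x, medoids_are_rows, and pc_scores are illustrative. It checks that each medoid returned by pam() is an observed row, and it computes scores on the first two principal components, which should be close to the projection that clusplot() draws.

```r
library(cluster)  # recommended package distributed with R

# Stand-in data (assumption): the chapter clusters wine[-1] instead
x <- iris[, 1:4]
fit <- pam(x, k = 3, stand = TRUE)

# Row numbers of the medoids within x
fit$id.med

# TRUE when every medoid is an observed row of x
medoids_are_rows <- isTRUE(all.equal(
  unname(as.matrix(x[fit$id.med, ])),
  unname(fit$medoids)
))
medoids_are_rows

# Scores on the first two principal components of the standardized variables
pc <- prcomp(x, scale. = TRUE)
pc_scores <- pc$x[, 1:2]
head(pc_scores)
```

The id.med component is how you would look up which observations serve as medoids in your own data.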

Also note that PAM didn’t perform as well as k-means in this instance:

	> ct.pam <- table(wine$Type, fit.pam$clustering)
	> ct.pam
	     1  2  3
	  1 59  0  0
	  2 16 53  2
	  3  0  1 47
	> randIndex(ct.pam)
	[1] 0.699

The adjusted Rand index has decreased from 0.9 (for k-means) to 0.7.
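If you want to see where that number comes from, the adjusted Rand index can be computed by hand from the cross-tabulation. The sketch below uses the standard pair-counting formula, (observed agreement - expected agreement) / (maximum agreement - expected agreement). The helper name adjusted_rand() is hypothetical and is not part of any package used in this chapter; the counts are the ones printed in the table above.

```r
# Cross-tabulation of wine type (rows) by PAM cluster (columns), from the text
ct <- matrix(c(59,  0,  0,
               16, 53,  2,
                0,  1, 47),
             nrow = 3, byrow = TRUE)

# Hypothetical helper: adjusted Rand index from a contingency table
adjusted_rand <- function(tab) {
  pairs_cells <- sum(choose(tab, 2))           # pairs grouped together in both partitions
  pairs_rows  <- sum(choose(rowSums(tab), 2))  # pairs grouped together by type
  pairs_cols  <- sum(choose(colSums(tab), 2))  # pairs grouped together by cluster
  expected    <- pairs_rows * pairs_cols / choose(sum(tab), 2)
  maximum     <- (pairs_rows + pairs_cols) / 2
  (pairs_cells - expected) / (maximum - expected)
}

ari <- adjusted_rand(ct)
round(ari, 3)
```

The result should agree with the value of roughly 0.7 reported by randIndex().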

16.5 Avoiding nonexistent clusters

Before I finish this discussion, a word of caution is in order. Cluster analysis is a methodology designed to identify cohesive subgroups in a dataset. It’s very good at doing this. In fact, it’s so good, it can find clusters where none exist.

Consider the following code:

library(fMultivar)
set.seed(1234)
df <- rnorm2d(1000, rho=.5)
df <- as.data.frame(df)
plot(df, main="Bivariate Normal Distribution with rho=0.5")

The rnorm2d() function in the fMultivar package is used to sample 1,000 observations from a bivariate normal distribution with a correlation of 0.5. The resulting graph is displayed in figure 16.7. Clearly there are no clusters in this data.
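If fMultivar isn't installed, comparable data can be simulated in base R. This is a sketch of a swapped-in technique, not what rnorm2d() does internally: it builds the second variable as a weighted sum of two independent standard normal variables so that the population correlation is rho. The names sim, z1, z2, and sample_cor are illustrative.

```r
set.seed(1234)
n   <- 1000
rho <- 0.5

# Two independent standard normal variables
z1 <- rnorm(n)
z2 <- rnorm(n)

# V2 has unit variance and correlation rho with V1
sim <- data.frame(V1 = z1,
                  V2 = rho * z1 + sqrt(1 - rho^2) * z2)

sample_cor <- cor(sim$V1, sim$V2)
sample_cor
```

The sample correlation should land near 0.5, although the exact value depends on the seed.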

Figure 16.7 Bivariate normal data (n = 1000). There are no clusters in this data.

The wssplot() and NbClust() functions are then used to determine the number of clusters present:

wssplot(df)
library(NbClust)
nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans")
dev.new()
barplot(table(nc$Best.n[1,]),
        xlab="Number of Clusters", ylab="Number of Criteria",
        main="Number of Clusters Chosen by 26 Criteria")

The results are plotted in figures 16.8 and 16.9.
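The wssplot() function is defined earlier in the chapter. For readers who don't have it at hand, the following is a hypothetical stand-in, wss_by_k(), that computes the same kind of quantity: the within-groups sum of squares for 1 through 15 k-means clusters. The simulated matrix named unstructured is illustrative data with no clusters, and the nstart value is an assumption.

```r
# Hypothetical stand-in for the chapter's wssplot(): returns the
# within-groups sum of squares for 1 to max_k k-means clusters
wss_by_k <- function(data, max_k = 15, seed = 1234) {
  data <- as.matrix(data)
  wss <- numeric(max_k)
  # With one cluster, WSS is the total sum of squares around the column means
  wss[1] <- sum(scale(data, scale = FALSE)^2)
  for (k in 2:max_k) {
    set.seed(seed)
    wss[k] <- kmeans(data, centers = k, nstart = 10)$tot.withinss
  }
  wss
}

set.seed(1234)
unstructured <- matrix(rnorm(2000), ncol = 2)  # illustrative data with no clusters
wss <- wss_by_k(unstructured)

plot(seq_along(wss), wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within groups sum of squares")
```

Because adding clusters generally lowers the within-groups sum of squares even for unstructured data, a bend in this curve is weak evidence on its own.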

Figure 16.8 Plot of within-groups sums of squares vs. number of k-means clusters for bivariate normal data

Figure 16.9 Number of clusters recommended for bivariate normal data by criteria in the NbClust package. Two or three clusters are suggested.

The wssplot() function suggests that there are three clusters, whereas many of the criteria returned by NbClust() suggest two or three clusters. If you carry out a two-cluster analysis with PAM,

library(ggplot2)
library(cluster)
fit <- pam(df, k=2)
df$clustering <- factor(fit$clustering)
ggplot(data=df, aes(x=V1, y=V2, color=clustering, shape=clustering)) +
    geom_point() + ggtitle("Clustering of Bivariate Normal Data")

you get the two-cluster plot shown in figure 16.10. (The ggplot() statement is part of the comprehensive graphics package ggplot2. Chapter 19 covers ggplot2 in detail.)

Figure 16.10 PAM cluster analysis of bivariate normal data, extracting two clusters. Note that the clusters are an arbitrary division of the data.

Figure 16.11 CCC plot for bivariate normal data. It correctly suggests that no clusters are present.

Clearly the partitioning is artificial. There are no real clusters here. How can you avoid this mistake? Although it isn’t foolproof, I have found that the Cubic Clustering Criterion (CCC) reported by NbClust can often help to uncover situations where no structure exists. The code is

plot(nc$All.index[,4], type="o", ylab="CCC", xlab="Number of clusters", col="blue")

and the resulting graph is displayed in figure 16.11. When the CCC values are all negative and decreasing for two or more clusters, the distribution is typically unimodal.
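The rule of thumb above can be written as a small check. This is only a sketch: the helper ccc_suggests_no_clusters() is hypothetical, and the two vectors are made-up illustrative values, not output from the chapter.

```r
# Hypothetical helper: TRUE when every CCC value is negative and the
# values fall steadily as the number of clusters grows
ccc_suggests_no_clusters <- function(ccc) {
  ccc <- ccc[!is.na(ccc)]
  all(ccc < 0) && all(diff(ccc) < 0)
}

# Illustrative values only, not taken from the chapter's output
flat_data      <- c(-8.1, -9.4, -11.0, -12.7, -14.2, -16.0)
clustered_data <- c(-2.0, 5.3, 4.1, 3.0, 2.2, 1.5)

ccc_suggests_no_clusters(flat_data)
ccc_suggests_no_clusters(clustered_data)
```

With the objects from this section, you would pass nc$All.index[,4] to the helper. Treat the result as a prompt to look at the plot, not as a formal test.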

The ability of cluster analysis (or your interpretation of it) to find erroneous clusters makes the validation step of cluster analysis important. If you’re trying to identify clusters that are “real” in some sense (rather than a convenient partitioning), be sure the results are robust and repeatable. Try different clustering methods, and replicate the findings with new samples. If the same clusters are consistently recovered, you can be more confident in the results.
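One way to act on this advice is sketched below, under stated assumptions. The data are simulated with two well-separated groups (they aren't from the chapter), and make_sample() and agreement() are hypothetical helpers. Each of two independent samples is clustered with both k-means and PAM, and the share of observations on which the two methods agree is reported.

```r
library(cluster)

set.seed(1234)

# Hypothetical data generator: two well-separated groups of n points each
make_sample <- function(n) {
  rbind(matrix(rnorm(2 * n, mean = 0), ncol = 2),
        matrix(rnorm(2 * n, mean = 6), ncol = 2))
}

# Share of observations on which two 2-cluster solutions agree,
# allowing for the arbitrary numbering of the clusters
agreement <- function(a, b) max(mean(a == b), mean(a != b))

# Cluster one sample with both methods and compare the assignments
compare_methods <- function(x) {
  km <- kmeans(x, centers = 2, nstart = 25)$cluster
  pm <- pam(x, k = 2)$clustering
  agreement(km, pm)
}

agree1 <- compare_methods(make_sample(100))  # first sample
agree2 <- compare_methods(make_sample(100))  # independent replication
c(agree1, agree2)
```

When real structure is present, both methods should recover nearly the same groups in each sample. Low or unstable agreement is a sign that the partition may be arbitrary.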

16.6 Summary

In this chapter, we reviewed some of the most common approaches to clustering observations into cohesive groups. First we reviewed the general steps for a comprehensive cluster analysis. Next, common methods for hierarchical and partitioning clustering were described. Finally, I reinforced the need to validate the resulting clusters in situations where you seek more than convenient partitioning.

Cluster analysis is a broad topic, and R has some of the most comprehensive facilities for applying this methodology currently available. To learn more about these capabilities, see the CRAN Task View for Cluster Analysis & Finite Mixture Models (http://cran.r-project.org/web/views/Cluster.html). Additionally, Tan, Steinbach, & Kumar (2006) have an excellent book on data-mining techniques. It contains a lucid chapter on cluster analysis that you can freely download (www-users.cs.umn.edu/~kumar/dmbook/ch8.pdf). Finally, Everitt, Landau, Leese, & Stahl (2011) have written a practical and highly regarded textbook on this subject.

Cluster analysis is a methodology for discovering cohesive subgroups of observations in a dataset. In the next chapter, we’ll consider situations where the groups have already been defined and your goal is to find an accurate method of classifying observations into them.
