
Chapter 13

research community, is crucial for the implementation of advanced clustering-based studies. A cluster evaluation framework may have a major impact on the generation of relevant and valid results. This chapter shows how it may also support or guide biomedical knowledge discovery tasks. The clustering and validation techniques presented in this chapter may be applied to expression data of higher sample and feature set dimensionality.

A general approach to developing clustering applications may consist of the comparison, synthesis and validation of results obtained from different algorithms. For instance, in the case of hierarchical clustering there are tools that can support the combination of results into consensus trees (Bremer, 1990). However, additional methods will be required to automatically compare different partitions based on validation indices and/or graphical representations.

Other problems that deserve further research are the development of clustering techniques based on the direct correlation between subsets of samples and features, multiple-membership clustering, and context-oriented visual tools for clustering support (Azuaje, 2002b). Furthermore, there is the need to improve, adapt and expand the use of statistical techniques to assess uncertainty and significance in genomic expression experiments.

ACKNOWLEDGEMENTS

This contribution was partly supported by the Enterprise Ireland Research Innovation Fund 2001.

REFERENCES

Alizadeh A.A., Eisen M.B., Davis R.E., Ma C., Lossos I.S., Rosenwald A., Boldrick J.C., Sabet H., Tran T., Yu X., Powell J.I., Yang L., Marti G.E., Moore T., Hudson J., Lu L., Lewis D.B., Tibshirani R., Sherlock G., Chan W.C., Greiner T.C., Weisenburger D.D., Armitage J.O., Warnke R., Levy R., Wilson W., Grever M.R., Byrd J.C., Botstein D., Brown P.O., Staudt L.M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403:503-511.

Azuaje F. (2001a). An unsupervised neural network approach to discovering gene expression patterns in B-cell lymphoma. Online Journal of Bioinformatics 1:23-41.

Azuaje F. (2001b). A computational neural approach to support the discovery of gene function and classes of cancer. IEEE Transactions on Biomedical Engineering 48:332-339.

Azuaje F. (2002a). A cluster validity framework for genome expression data. Bioinformatics 18:319-320.

Azuaje F. (2002b). In silico approaches to microarray-based disease classification and gene function discovery. Annals of Medicine 34.

Bezdek J.C., Pal N.R. (1998). Some new indexes of cluster validity. IEEE Transactions on Systems, Man and Cybernetics, Part B 28:301-315.


Bittner M., Meltzer P., Chen Y., Jiang Y., Seftor E., Hendrix M., Radmacher M., Simon R., Yakhini Z., Ben-Dor A., Sampas N., Dougherty E., Wang E., Marincola F., Gooden C., Lueders J., Glatfelter A., Pollock P., Carpten J., Gillanders E., Leja D., Dietrich K., Beaudry C., Berens M., Alberts D., Sondak V., Hayward N., Trent J. (2000). Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406:536-540.

Bremer K. (1990). Combinable component consensus. Cladistics 6:369-372.

Cheng Y., Church G.M. (2000). Biclustering of expression data. Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB); 2000 August 19-23; La Jolla, California.

Dhanasekaran S.M., Barrette T., Ghosh D., Shah R., Varambally S., Kurachi K., Pienta K., Rubin M., Chinnaiyan A. (2001). Delineation of prognostic biomarkers in prostate cancer. Nature 412:822-826.

Eisen M.B., Spellman P., Brown P.O., Botstein D. (1998). Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863-14868.

Everitt B. (1993). Cluster Analysis. London: Edward Arnold.

Fisher L., Van Ness J.W. (1971). Admissible clustering procedures. Biometrika 58:91-104.

Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M.L., Downing J.R., Caligiuri M.A., Bloomfield C.D., Lander E.S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286:531-537.

Hansen P., Jaumard B. (1997). Cluster analysis and mathematical programming. Mathematical Programming 79:191-215.

Ideker T., Thorsson V., Ranish J.A., Christmas R., Buhler J., Eng J.K., Bumgarner R., Goodlett D.R., Aebersold R., Hood L. (2001). Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292:929-933.

Kasuba T. (1993). Simplified fuzzy ARTMAP. AI Expert 8:19-25.

Perou C.M., Sorlie T., Eisen M.B., Van de Rijn M., Jeffrey S.S., Rees C.A., Pollack J.R., Ross D.T., Johnsen H., Akslen L.A., Fluge O., Pergamenschikov A., Williams C., Zhu S.X., Lonning P.E., Borresen-Dale A.L., Brown P.O., Botstein D. (2000). Molecular portraits of human breast tumours. Nature 406:747-752.

Quackenbush J. (2001). Computational analysis of microarray data. Nature Reviews Genetics 2:418-427.

Rousseeuw P.J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics 20:53-65.

Tamayo P., Slonim D., Mesirov J., Zhu Q., Kitareewan S., Dmitrovsky E., Lander E.S., Golub T.R. (1999). Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96:2907-2912.

Chapter 14

CLUSTERING OR AUTOMATIC CLASS DISCOVERY: HIERARCHICAL METHODS

Derek C. Stanford, Douglas B. Clarkson, Antje Hoering

Insightful Corporation, 1700 Westlake Avenue North, Seattle, WA, 98109, USA e-mail: {Stanford, clarkson, hoering}@insightful.com

1. INTRODUCTION

Given a set of data, a hierarchical clustering algorithm attempts to find naturally occurring groups or clusters in the data. It is an exploratory technique that can give valuable insight into underlying relationships not otherwise easily displayed or found in multidimensional data. Microarray data sets present several challenges for hierarchical clustering: the need to scale the algorithms to a very large number of genes, the selection of an appropriate clustering criterion, the choice of an appropriate distance measure (a metric), the need for methods which deal with missing values, the screening out of unimportant genes, and the selection of the number of clusters, to name a few. This chapter discusses standard methods for hierarchical agglomerative clustering, including single linkage, average linkage, complete linkage, and model-based clustering; it also presents adaptive single linkage clustering, a new clustering method designed to meet the challenges of microarray data.

There are two general forms of hierarchical clustering. In the agglomerative method (bottom-up approach), each data point initially forms a cluster, and the two “closest” clusters are merged in each step. The “closest” clusters are defined by a clustering criterion (see Section 3). The divisive method (top-down approach) starts with one large cluster that contains all data points and splits off a cluster at each step. This chapter concentrates on the agglomerative method, the most commonly used method in practice.
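For illustration, the agglomerative loop just described can be written out directly. The sketch below is a naive O(N³) Python implementation using a single linkage criterion; real software relies on far more efficient algorithms, but the merge logic is the same.

```python
def agglomerate(points, dist):
    """Naive agglomerative clustering: repeatedly merge the two
    'closest' clusters (single linkage) and record each merge."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest inter-cluster distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Toy 1-D example: two tight pairs of points far apart.
pts = [0.0, 0.1, 5.0, 5.2]
history = agglomerate(pts, lambda x, y: abs(x - y))
print(history[0])  # the closest pair (points 0 and 1) merges first
```

Running this produces N-1 = 3 merges, the last of which joins the two pairs at their single linkage distance of 4.9.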

Applications of hierarchical clustering in microarray data are diverse. One application is to group together genes with similar regulation. The underlying idea is that genes with similar expression levels might code for the same protein or for proteins that exhibit similar functions on a molecular level. This clustering information helps researchers to better understand biological processes. For example, suppose that a time series of microarrays is observed and the main interest is to find clusters of genes forming genetic networks or regulatory circuits. The number of networks is not usually known and may be quite large; hierarchical clustering is particularly useful here because it allows for any number of gene clusters. Applications in this area range from searching for genetic networks in the yeast cell cycle, to the development of the central nervous system in the rat, to neurogenesis in Drosophila (Erb et al., 1999).

Among other uses, hierarchical clustering of microarray data has been used to find cancer diagnostics (Ramaswamy et al., 2001), to investigate cancer tumorigenesis mechanisms (Welcsh et al., 2002), and to identify cancer subtypes (Golub et al., 1999). In the case of B-cell lymphoma, for example, hierarchical cluster analysis has been used to discover a new subtype which laboratory work was unable to detect (Alizadeh et al., 2000).

In addition to identifying groups of related genes, hierarchical clustering methods also offer tools to screen out uninteresting genes (see Section 3.3 for discussion and Section 4.2 for an example). Removal of uninteresting genes is especially important in microarray data, where a very large number of genes may be observed.

Section 2 discusses underlying issues and challenging aspects of the hierarchical clustering of microarray data such as scalability, metrics, and missing data. Section 3 presents several hierarchical clustering methods, including adaptive single linkage clustering, which is a new method designed to provide adaptive cluster detection while maintaining scalability. Section 4 provides examples using both simulated and real data. A brief discussion, including mention of existing hierarchical clustering software, is given in Section 5.

2. SCALABILITY, METRICS, AND MISSING DATA

Scalability issues must be addressed both for a large number of data points (N) and for a large number of dimensions (P). Microarray data sets routinely make use of thousands of genes analyzed across hundreds or thousands of samples. Nonparametric hierarchical clustering methods usually make use of a distance (or similarity) matrix, which contains all inter-point distances for the data. With N data points, the distance matrix has N(N-1)/2 entries. When N is large (e.g., N > 10,000) the size of the distance matrix becomes prohibitive in terms of both memory usage and speed; a nonparametric algorithm can fail if it requires an explicit computation of the distance matrix, as is the case with most standard hierarchical clustering software. Model-based clustering methods use a probability density function to model the location, size, and shape of each cluster. A potentially serious problem with model-based clustering, especially in gene expression data, is that the number of parameters can be quite large unless extensive constraints are placed on the model density. For example, a multivariate normal (or Gaussian) density is often used to model each cluster; with no constraints, this requires P parameters for the mean vector for each cluster and P(P+1)/2 parameters for the covariance matrix for each cluster. Estimation of this many parameters is impractical when P is large (e.g., P > 50).
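The memory cost of the nonparametric approach is easy to verify in practice; as a sketch, SciPy's condensed distance matrix (`pdist`) stores exactly N(N-1)/2 inter-point distances, which is what grows prohibitively with N.

```python
import numpy as np
from scipy.spatial.distance import pdist

# N points in P dimensions; pdist returns the condensed (upper-triangle)
# distance matrix with N*(N-1)/2 entries.
rng = np.random.default_rng(0)
N, P = 1000, 20
X = rng.standard_normal((N, P))

d = pdist(X, metric="euclidean")
print(len(d))  # N*(N-1)/2 = 499500
assert len(d) == N * (N - 1) // 2
```

At N = 10,000 the same array would hold about 50 million distances, which illustrates why algorithms that avoid materializing the full matrix matter.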

The choice of a metric appropriate for use on a particular microarray data set is inherently problem specific, depending on both the goals of the analysis and the properties of the data. The metric chosen can have a significant impact on both clustering results and computational speed. Hierarchical clustering software usually provides a selection of metrics, and may also allow users to define their own metrics. Aside from choosing a metric based on obvious data characteristics (e.g., discrete or continuous), researchers must consider their own perceptions of what it means to say that two observations are similar.

The use of different metrics can lead to quite different clustering results, as the following simple example illustrates. Suppose we have four genes (numbered from one to four) and two experiments. In the first experiment, only the first two genes are expressed, while in the second experiment only the even numbered genes are expressed. Then a metric giving positive weight to only the first experiment results in two clusters (genes 1 and 2, and genes 3 and 4), while a metric giving positive weight to the second experiment results in two very different clusters (genes 2 and 4 and genes 1 and 3). The metric must also account for scale – if the expression levels in experiment 1 have larger magnitude than in experiment 2, then a simple Euclidean metric will yield the experiment 1 clustering results.
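The four-gene example above can be reproduced numerically. In the sketch below, the expression matrix and the per-experiment weights are illustrative choices, not data from the chapter:

```python
import numpy as np

# Expression matrix: rows = genes 1-4, columns = experiments 1-2.
# Experiment 1 expresses genes 1 and 2; experiment 2 expresses genes 2 and 4.
X = np.array([[1., 0.],
              [1., 1.],
              [0., 0.],
              [0., 1.]])

def weighted_dist(x, y, w):
    """Euclidean distance with per-experiment weights."""
    return np.sqrt(np.sum(w * (x - y) ** 2))

w1 = np.array([1., 0.])   # weight only experiment 1
w2 = np.array([0., 1.])   # weight only experiment 2

# Under w1, genes 1 and 2 coincide (distance 0), as do genes 3 and 4.
assert weighted_dist(X[0], X[1], w1) == 0.0
assert weighted_dist(X[2], X[3], w1) == 0.0
# Under w2, the pairing flips: genes 2 and 4 coincide, as do genes 1 and 3.
assert weighted_dist(X[1], X[3], w2) == 0.0
assert weighted_dist(X[0], X[2], w2) == 0.0
```

The same mechanism handles the scale issue: down-weighting a high-magnitude experiment keeps it from dominating the distance.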

Popular metrics for clustering microarray data are Euclidean distance and metrics based upon correlation measures. One correlation-based metric is computed as one minus the correlation coefficient. This metric yields a distance measure between zero and two, with a correlation of one yielding a distance of zero, and a correlation of minus one yielding the largest possible distance of two. A variation of this is the use of one minus the absolute correlation. In this case, a correlation of either one or minus one yields a distance of zero, while a correlation of zero yields the largest possible distance of one. This allows clustering of genes responding to a stimulus in the same or opposite ways. For binary (0 or 1) or categorical data, metrics based upon probabilities are common, e.g., the probability, over all experiments, that both genes are expressed above some threshold.
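The correlation-based metrics described here are straightforward to implement; a minimal Python version (the two gene profiles are toy data):

```python
import numpy as np

def corr_dist(x, y):
    """1 - Pearson correlation: ranges over [0, 2]."""
    return 1.0 - np.corrcoef(x, y)[0, 1]

def abs_corr_dist(x, y):
    """1 - |correlation|: ranges over [0, 1]; anti-correlated genes
    (opposite response to a stimulus) get distance 0."""
    return 1.0 - abs(np.corrcoef(x, y)[0, 1])

up   = np.array([1., 2., 3., 4.])   # gene up-regulated over time
down = -up                          # gene down-regulated the same way

assert abs(corr_dist(up, up)) < 1e-9         # r = 1  -> distance 0
assert abs(corr_dist(up, down) - 2) < 1e-9   # r = -1 -> distance 2
assert abs(abs_corr_dist(up, down)) < 1e-9   # |r| = 1 -> distance 0
```

Under the absolute-correlation metric, the up- and down-regulated genes land in the same cluster, which is exactly the behavior the text describes.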


In addition to the metric, one must also consider the genes and experiments used in the analysis. Inclusion of variables that do not differentiate (i.e., which are purely noise) can lead to poor clustering results. Removal of these prior to clustering can significantly improve clustering performance.

A significant difficulty with microarray data sets is the high rate of missing values. This reduces accuracy, and it also creates problems for computational efficiency. This problem is sometimes resolved by simply deleting the rows and columns with missing data. This can not only yield biased estimates, it can also eliminate nearly all of the data. A common alternative is to define a metric that accepts missing values, usually by averaging or up-weighting of the observed data. For example, missing values can be handled by the pairwise exclusion of missing observations in distance calculations or, when mean values are required, by using all available data. Weighting schemes that account for the missing dimensions may also be desirable. Results obtained from metrics defined in this way can be misleading since there is no guarantee that the observed data behaves in the same manner as the missing data – it is easy to find simple examples showing that the metric chosen for dealing with missing data can have a large impact on the estimated distance. Often, imputation of missing data (see Chapter 3) is more appropriate. Imputation methodology, and missing data in general, is an open research topic with no easy solution, especially when the location of the missing data may be causally related to the unknown clusters.
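As a concrete, simplified example of a metric that accepts missing values, the sketch below excludes unobserved dimensions pairwise and up-weights the remainder; the rescaling scheme is one possible choice, not a prescription from the chapter.

```python
import numpy as np

def pairwise_euclidean(x, y):
    """Euclidean distance using only dimensions observed in BOTH
    profiles, rescaled to account for the excluded dimensions."""
    mask = ~(np.isnan(x) | np.isnan(y))
    if not mask.any():
        return np.nan           # no shared observations at all
    d2 = np.sum((x[mask] - y[mask]) ** 2)
    # up-weight by the fraction of dimensions that were usable
    return np.sqrt(d2 * len(x) / mask.sum())

x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([1.0, 2.0, 3.0,    np.nan])
# Only the first two dimensions are shared; they agree exactly,
# so the rescaled distance is 0 -- whatever the missing values were.
assert pairwise_euclidean(x, y) == 0.0
```

The assertion also illustrates the caveat in the text: the metric silently assumes the missing dimensions behave like the observed ones, which need not hold.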

3. HIERARCHICAL CLUSTERING METHODS

The basic algorithm in hierarchical agglomerative clustering is to begin with each data point as a separate cluster and then iteratively merge the two “closest” clusters until only a single cluster remains. Here “close” is defined by the clustering criterion, which defines how to determine the distance between two clusters. In nonparametric clustering, this criterion consists of two parts: the distance measure or metric, which specifies how to compute the distance between two points; and the linkage, which specifies how to combine these distances to obtain the between-cluster distance. In model-based clustering, the clustering criterion is based on the likelihood of the data given the model (see Section 3.1).

With N data points, this approach of iteratively merging the two closest clusters provides a nested sequence of clustering results, with one result for each number of clusters from N to 1. Hierarchical clustering does not seek to globally optimize a criterion; instead, it proceeds in a stepwise fashion in which a merge performed at one step cannot be undone at a later step.


The clustering results are typically displayed in a dendrogram showing the cluster structure; for example, see Figure 14.1. Clusters or nodes forming lower on the dendrogram are closer together, while upper nodes represent merges of clusters that are farther apart. Since each data point begins as a single cluster, the leaves (terminal nodes at the bottom of the dendrogram) each represent one data point, while interior nodes represent clusters of more than one data point. The top node of the dendrogram denotes the entire data set as a single cluster. The y-axis is usually the merge height, the distance between two clusters when they are merged. Some methods do not have an explicit height associated with each merge; for example, model-based clustering chooses each merge by seeking to maximize a clustering criterion based on the likelihood. In these cases, a dendrogram can still be constructed, but the y-axis may represent the value of a clustering criterion or simply the order of clustering. The x-axis is arbitrary; the sequence of data points along the x-axis is generally chosen to avoid crossing lines in the display of the dendrogram.
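With SciPy, the merge table underlying such a dendrogram can be computed and laid out as follows; the data here are an arbitrary two-group toy set, not the chapter's example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
# two well-separated groups of 5 points each in 2-D
X = np.vstack([rng.normal(0, 0.1, (5, 2)),
               rng.normal(5, 0.1, (5, 2))])

# (N-1) x 4 merge table; each row records the two clusters merged,
# the merge height (the dendrogram y-axis), and the new cluster size.
Z = linkage(X, method="average")

# compute the dendrogram layout without plotting; pass the result to
# matplotlib to draw it
info = dendrogram(Z, no_plot=True)

assert Z.shape == (9, 4)
assert np.all(np.diff(Z[:, 2]) >= 0)  # average linkage heights are monotone
```

The monotone merge heights are what make the "lower nodes are closer together" reading of the dendrogram valid for this linkage.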

3.1 Clustering Criteria and Linkage

Three common nonparametric approaches to hierarchical clustering are single linkage, complete linkage, and average linkage; a comprehensive review is given by Gordon (1999). In single linkage clustering, the distance between any two clusters of points is defined as the smallest distance between any point in the first cluster and any point in the second cluster. The single linkage approach is related to the minimum spanning tree (MST) of the data set (Gower and Ross, 1969). This leads to a significant advantage of single linkage clustering: efficient algorithms can be used to obtain the single linkage clustering result without allocating the order N^2 memory units usually required by other nonparametric hierarchical clustering algorithms.

Complete linkage defines the inter-cluster distance as the largest distance between any point in the first cluster and any point in the second cluster. Average linkage is often perceived as a compromise between single and complete linkage because it uses the average of all pair-wise distances between points in the first cluster and points in the second cluster. This is also called group average linkage (Sokal and Michener, 1958). Weighted average linkage (Sokal and Sneath, 1963) is defined in terms of its updating method: when a new cluster is created by merging two smaller clusters, the distance from the new cluster to any other cluster is computed as the average of the distances of the two smaller clusters. Thus, the two smaller clusters receive equal weight in the distance calculation, as opposed to average linkage, which accords equal weight to each data point.
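The three linkage criteria differ only in how the pairwise distances between two clusters are combined, as this small numeric check illustrates (the two clusters are toy data):

```python
import numpy as np

A = np.array([[0.0, 0.0], [0.0, 1.0]])   # cluster A: two points
B = np.array([[3.0, 0.0], [3.0, 1.0]])   # cluster B: two points

# all pairwise distances between a point in A and a point in B
pair = np.array([[np.linalg.norm(a - b) for b in B] for a in A])

single   = pair.min()    # single linkage: smallest pairwise distance
complete = pair.max()    # complete linkage: largest pairwise distance
average  = pair.mean()   # average linkage: mean of all pairwise distances

assert single == 3.0
assert abs(complete - np.sqrt(10)) < 1e-12
assert single <= average <= complete
```

The final assertion holds in general, which is why average linkage is perceived as a compromise between the other two.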


Ward's method (Ward, 1963) examines the sum of squared distances from each point to the centroid or mean of its cluster, and merges the two clusters yielding the smallest increase in this sum of squares criterion. This is equivalent to modeling the clusters as multivariate normal densities with different means and a single hyper-spherical covariance matrix.
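The sum-of-squares increase that Ward's method minimizes at each merge can be computed directly; the one-dimensional clusters below are illustrative toy data.

```python
import numpy as np

def ssq(C):
    """Sum of squared distances from each point to its cluster mean."""
    return np.sum((C - C.mean(axis=0)) ** 2)

def ward_cost(A, B):
    """Increase in the sum-of-squares criterion if A and B are merged."""
    return ssq(np.vstack([A, B])) - ssq(A) - ssq(B)

A = np.array([[0.0], [1.0]])
B = np.array([[10.0], [11.0]])
C = np.array([[2.0], [3.0]])

# merging the two nearby clusters costs far less than merging distant ones,
# so Ward's method would merge A with C first
assert ward_cost(A, C) < ward_cost(A, B)
```

Because the criterion penalizes spread around a common mean equally in all directions, it favors the hyper-spherical clusters mentioned above.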

Ward’s method is a special case of model-based clustering (Banfield and Raftery, 1993). Model-based clustering is based on an assumption of a within-cluster probability density as a model for the data. If the model chosen is incorrect or inappropriate, erroneous results will be obtained. Though models can be based on any density, the most common choice is the multivariate normal; this density is parametrized by a mean vector and a covariance matrix for each cluster. The mean vector determines the location of the cluster, while the covariance matrix specifies its shape and size. At each step, two clusters are chosen for a merge by maximizing the likelihood of the data given the model. The likelihood is the value of the probability density model evaluated using the observed data (Arnold, 1990). The likelihoods of all data points are combined into an overall likelihood; various approaches exist for this, such as a mixture likelihood or a classification likelihood. Stanford (1999) gives two theorems linking optimal choice of the form of the overall likelihood to the goals of the clustering procedure.

When choosing a hierarchical clustering method, some consideration should be given to the type of clusters expected. Complete linkage algorithms tend to yield compact clusters similar to the multivariate normal point clouds modeled in Ward’s method, while single linkage clusters can be “stringy” or elongated, adapting well to any pattern of closely spaced points. In model-based clustering, the chosen density will have a strong impact on the resulting cluster shapes. For example, if the covariance matrix is constrained to be the same for all clusters, then clusters with the same size and shape will most likely be observed. If the covariance matrices are constrained to be multiples of the identity matrix, then hyper-spherical clusters (as in Ward’s method) will be found.

3.2 Choosing the Final Clusters

The traditional approach for determining the final set of clusters is to specify the number of clusters desired and then cut the dendrogram at the height which yields this number. This procedure only works well if the merges near the top of the dendrogram have large children, i.e., when the final agglomeration steps involve large subsets of the data. Single linkage clustering only exhibits this structure if the clusters are all well separated; otherwise, it tends to show a chaining effect, in which many of the upper dendrogram nodes are merely merges of distant points with a main group.
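Cutting a dendrogram to a desired number of clusters is a one-line operation in most software; as a sketch, with SciPy's `fcluster` on toy two-group data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# two well-separated groups of 20 points each
X = np.vstack([rng.normal(0, 0.1, (20, 2)),
               rng.normal(4, 0.1, (20, 2))])

Z = linkage(X, method="complete")
# cut the tree at the height that yields exactly 2 clusters
labels = fcluster(Z, t=2, criterion="maxclust")

assert set(labels) == {1, 2}
# with well-separated groups, the cut recovers them exactly
assert len(set(labels[:20])) == 1 and len(set(labels[20:])) == 1
```

With the chaining scenario described next (two touching clusters plus one distant point), the same two-cluster cut of a single linkage tree would instead isolate the lone point.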


For example, if the data consist of two large clusters near each other and a single distant data point, then the two-cluster result from single linkage clustering will give the single distant point as one cluster and everything else as the other cluster. Dendrograms based on other criteria, such as average and complete linkage, are not as prone to the chaining effect, but they also generally have nodes with large children near the top even when only one cluster is present in the data. For model-based clustering, the number of clusters can be assessed by examining the likelihood, though this requires severe assumptions about the data.

The chaining effect in a single linkage dendrogram contains important information: it indicates that the clusters are not well separated. Adaptive single linkage clustering utilizes the information in the single linkage dendrogram, but it uses a better method for determining the final clusters.

3.3 Adaptive Single Linkage Clustering

Adaptive single linkage (ASL) clustering (McKinney, 1995) begins with a single linkage dendrogram but extracts clusters in a bottom-up rather than a top-down manner. Generally, desirable clusters represent modal regions, regions of the data with higher point densities than the surrounding regions. To find these modal regions, each node is analyzed to determine its runt value (Hartigan and Mohanty, 1992), where the runt value of a node is defined as the size of its smaller child. Large runt values provide evidence against unimodality because they indicate the merge of two large subgroups. Clusters are found by selecting nodes with runt values larger than a specified threshold. The threshold provides a lower bound on the cluster size, and also determines the number of clusters found. For microarray analysis, small threshold values (e.g., 5) can be used to identify small, highly similar clusters; this might be suited to finding potential regulatory pathways for further analysis. Larger threshold values (e.g., 30) are more appropriate for finding larger groups of genes, such as groups involved in large-scale cellular activities or responses to experimental conditions.
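Runt values are easy to compute from a standard linkage matrix. The sketch below illustrates the runt-value idea on SciPy's single linkage output; it is not an implementation of the full ASL algorithm, and the two-group data are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def runt_values(Z):
    """Runt value of each merge: the size of the smaller child.
    Z is a SciPy linkage matrix for N points; indices below N are
    original points (size 1), index N + k refers to the cluster
    created at row k, whose size is stored in Z[k, 3]."""
    N = Z.shape[0] + 1
    size = lambda idx: 1 if idx < N else int(Z[int(idx) - N, 3])
    return np.array([min(size(i), size(j)) for i, j, _, _ in Z])

rng = np.random.default_rng(3)
# two modal regions of 10 points each: the final merge joins them
X = np.vstack([rng.normal(0, 0.1, (10, 2)),
               rng.normal(5, 0.1, (10, 2))])
Z = linkage(X, method="single")
rv = runt_values(Z)

assert rv[-1] == 10          # large runt value: evidence of two modes
assert rv[:-1].max() <= 5    # earlier merges all have a small child
```

Thresholding these values at, say, 5 or 30 is then the cluster-extraction step described above.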

Adaptive single linkage clustering presents at least two advantages over traditional methods. First, as noted above, fast and memory-efficient algorithms exist for computing the single linkage dendrogram – all that is required is that the data fit into memory. Second, because nodes with size less than the threshold can be regarded as “noise” or “fluff”, the method can be used to automatically eliminate a large number of genes (or experiments) from further consideration. This use as a screening tool is illustrated in the example in Section 4.2. Further details on the algorithms underlying adaptive single linkage clustering can be found in (Glenny et al., 1995).


4. EXAMPLES

We provide two examples of cluster analysis. The first is a small simulation comparing average linkage with adaptive single linkage. The second uses a real microarray data set to demonstrate several analyses.

4.1 Simulated Data

We begin by presenting a simulation that demonstrates the utility of adaptive single linkage clustering; this example uses two-dimensional data to allow visual inspection of the clustering results. Our data consist of two spherical Gaussian clusters located at [–1,–1] and [1,1] with 100 points each, and 100 points of Poisson background noise over the rectangle from [–5,–5] to [5,5]. We use a Euclidean metric and compare results from adaptive single linkage clustering and average linkage clustering. We examine a range of average linkage results with up to 7 clusters, as well as the adaptive single linkage result with 2 clusters (the unclassified noise points in adaptive single linkage might be considered to be a third cluster).
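A data set of this form can be generated as follows. The unit cluster variance is an assumption (the chapter does not state it), and spatially uniform points on the square play the role of the Poisson background noise:

```python
import numpy as np

rng = np.random.default_rng(4)

# two spherical Gaussian clusters of 100 points at (-1,-1) and (1,1);
# the unit standard deviation is an assumed value
c1 = rng.normal(loc=[-1, -1], scale=1.0, size=(100, 2))
c2 = rng.normal(loc=[1, 1],   scale=1.0, size=(100, 2))

# 100 points of spatially uniform ("Poisson") background noise on [-5,5]^2
noise = rng.uniform(-5, 5, size=(100, 2))

X = np.vstack([c1, c2, noise])
assert X.shape == (300, 2)
```

With unit variance the two clusters overlap substantially, which is what makes this a touching-clusters-with-noise test case.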

The average linkage dendrogram (Figure 14.1) shows a confusing amount of structure; the dendrogram suggests that it might be reasonable to have several clusters. We must drill down to 7 clusters (Figure 14.2) before the large central group finally splits into a reasonable approximation of the original two Gaussian clusters.

In contrast, the adaptive single linkage approach detects clusters through an analysis of the cluster merges rather than cutting the dendrogram from the top. The dendrogram (Figure 14.3) shows two main clusters surrounded by many outlying points. These two clusters provide a close approximation of the true location of the two underlying Gaussian clusters (Figure 14.4). This example illustrates several points: different clustering methods can lead to very different results, the method for choosing the clusters has a significant impact on the interpretation of results and the ease of cluster detection, and adaptive single linkage clustering can provide reasonable results even for touching clusters with background noise (regarded as a difficult case for traditional single linkage).
