
Berrar D. et al. - Practical Approach to Microarray Data Analysis


Chapter 7

5.2 Test Set Estimation

Suppose that a test set of labeled observations sampled independently from the same population as the learning set is available. In such a case, an unbiased estimate of the classification error rate may be obtained by running the test set observations through the classifier built from the learning set and recording the proportion of test cases with discordant predicted and actual class labels.

In the absence of a genuine test set, cases in the learning set $\mathcal{L}$ may be divided into two sets, a training set $\mathcal{L}_1$ and a validation set $\mathcal{L}_2$. The classifier is built using $\mathcal{L}_1$ and the error rate is computed for $\mathcal{L}_2$. It is important to ensure that observations in $\mathcal{L}_1$ and $\mathcal{L}_2$ can be viewed as i.i.d. samples from the population of interest. This can be achieved in practice by randomly dividing the original learning set into two subsets. In addition, to reduce variability in the estimated error rates, this procedure may be repeated a number of times (e.g., 50) and error rates averaged (Breiman, 1998). A general limitation of this approach is that it reduces effective sample size for training purposes. This is an issue for microarray datasets, which have a limited number of observations. There are no widely accepted guidelines for choosing the relative size of these artificial training and validation sets. A possible choice is to leave out a randomly selected 10% of the observations to use as a validation set. However, for comparing the error rates of different classifiers, validation sets containing only 10% of the data are often not sufficiently large to provide adequate discrimination. Increasing validation set size to one third of the data provides better discrimination in the microarray context.
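The repeated random-split procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`nearest_neighbor_predict`, `repeated_split_error`), the 1-nearest-neighbor rule used as the classifier, and the toy data are all hypothetical choices, not part of the chapter.

```python
import random

def nearest_neighbor_predict(train_x, train_y, x):
    """1-nearest-neighbor prediction with squared Euclidean distance (illustrative classifier)."""
    best = min(range(len(train_x)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
    return train_y[best]

def repeated_split_error(xs, ys, val_fraction=1 / 3, repeats=50, seed=0):
    """Average validation error over repeated random training/validation splits."""
    rng = random.Random(seed)
    n = len(xs)
    n_val = max(1, round(n * val_fraction))
    errors = []
    for _ in range(repeats):
        idx = list(range(n))
        rng.shuffle(idx)                      # random division of the learning set
        val, train = idx[:n_val], idx[n_val:]
        tx = [xs[i] for i in train]
        ty = [ys[i] for i in train]
        wrong = sum(nearest_neighbor_predict(tx, ty, xs[i]) != ys[i] for i in val)
        errors.append(wrong / n_val)
    return sum(errors) / repeats              # error rates averaged over the repeats

# Toy data (hypothetical): two well-separated classes, two "genes" per sample.
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
ys = [0, 0, 0, 1, 1, 1]
print(repeated_split_error(xs, ys))
```

The defaults mirror the values mentioned in the text (one third of the data held out, 50 repeats); both are parameters so that other choices can be compared.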

5.3 Cross-Validation Estimation

In V-fold cross-validation (CV), cases in the learning set $\mathcal{L}$ are randomly divided into V sets $\mathcal{L}_v$, v = 1, ..., V, of as nearly equal size as possible. Classifiers are built on training sets $\mathcal{L} \setminus \mathcal{L}_v$, error rates are computed for the validation sets $\mathcal{L}_v$, and averaged over v. There is a bias-variance trade-off in the selection of V: small V’s typically give a larger bias, but a smaller variance and mean squared error.

A commonly used form of CV is leave-one-out cross-validation (LOOCV), where V = n. LOOCV often results in low bias but high variance estimates of classification error. However, for stable (low variance) classifiers such as k-NN, LOOCV provides good estimates of generalization error rates. For large learning sets, LOOCV carries a high computational burden, as it requires n applications of the training procedure.
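As a rough illustration of V-fold cross-validation, with LOOCV obtained by setting V equal to the number of cases, the following Python sketch may help. The helper names (`v_fold_indices`, `cv_error`, `one_nn`) and the toy data are assumptions made for this example; any classifier exposing the same `fit_predict(train_x, train_y, test_x)` interface could be substituted.

```python
import random

def v_fold_indices(n, V, seed=0):
    """Randomly divide indices 0..n-1 into V sets of as nearly equal size as possible."""
    if not 1 < V <= n:
        raise ValueError("V must satisfy 1 < V <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[v::V] for v in range(V)]      # fold sizes differ by at most one

def cv_error(xs, ys, fit_predict, V, seed=0):
    """Average, over the V folds, of the error rate on each held-out fold."""
    fold_errors = []
    for val in v_fold_indices(len(xs), V, seed):
        held_out = set(val)
        train = [i for i in range(len(xs)) if i not in held_out]
        preds = fit_predict([xs[i] for i in train],
                            [ys[i] for i in train],
                            [xs[i] for i in val])
        fold_errors.append(sum(p != ys[i] for p, i in zip(preds, val)) / len(val))
    return sum(fold_errors) / V

def one_nn(train_x, train_y, test_x):
    """Illustrative classifier: 1-nearest neighbor with squared Euclidean distance."""
    def predict(x):
        j = min(range(len(train_x)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)))
        return train_y[j]
    return [predict(x) for x in test_x]

# Toy data (hypothetical): two well-separated classes.
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
ys = [0, 0, 0, 1, 1, 1]
print("3-fold CV error:", cv_error(xs, ys, one_nn, V=3))
print("LOOCV error:   ", cv_error(xs, ys, one_nn, V=len(xs)))  # V = n
```

LOOCV calls the training procedure n times, which is the computational burden noted in the text.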

7. Introduction to Classification in Microarray Experiments


5.4 General Issues in Performance Assessment

The use of cross-validation (or any other estimation method) is intended to provide accurate estimates of classification error rates. It is important to note that these estimates relate only to the experiment that was (cross-)validated. There is a common practice in microarray classification of doing feature selection using all of the learning set and then using cross-validation only on the classifier-building portion of the process. In that case, inference can only be applied to the latter portion of the process. However, in most cases, the important features are unknown and the intended inference includes feature selection. Then, CV estimates as above tend to suffer from a downward bias and inference is not warranted. Features should be selected only on the basis of the samples in the training sets for CV estimation. This applies to other error rate estimation methods (e.g., test set error and out-of-bag estimation), and also to other aspects of the classifier training process, such as variable standardization and parameter selection. Examples of classifier parameters that should be included in cross-validation are: the number of predictor variables, the number of neighbors k for k-nearest neighbor classifiers, and the choice of kernel for SVMs. The issue of “honest” cross-validation analysis is discussed in West et al. (2001).
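The difference between selecting features on the whole learning set and selecting them inside each training set can be made concrete with a small sketch. Everything below is an assumption made for illustration: the gene-ranking score (absolute difference of class means), the 1-nearest-neighbor classifier, the function names, and the simulated pure-noise data in which no gene carries class information. On such data one would expect the honest estimate to stay near chance, while the estimate based on genes chosen from all samples tends to look better than it should.

```python
import random

def select_top_genes(xs, ys, m):
    """Rank genes by absolute difference of class means; return indices of the top m."""
    def score(j):
        g0 = [x[j] for x, y in zip(xs, ys) if y == 0]
        g1 = [x[j] for x, y in zip(xs, ys) if y == 1]
        return abs(sum(g0) / len(g0) - sum(g1) / len(g1))
    return sorted(range(len(xs[0])), key=score, reverse=True)[:m]

def nn_predict(train_x, train_y, x, genes):
    """1-nearest neighbor restricted to the selected genes."""
    best = min(range(len(train_x)),
               key=lambda i: sum((train_x[i][j] - x[j]) ** 2 for j in genes))
    return train_y[best]

def loocv_error(xs, ys, m, honest):
    """LOOCV error; if honest, gene selection is redone inside every training set."""
    n = len(xs)
    # Not honest: genes are chosen once using ALL samples, including each held-out case.
    fixed = None if honest else select_top_genes(xs, ys, m)
    wrong = 0
    for i in range(n):
        tx, ty = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        genes = select_top_genes(tx, ty, m) if honest else fixed
        wrong += nn_predict(tx, ty, xs[i], genes) != ys[i]
    return wrong / n

# Simulated noise data (hypothetical): 20 samples, 200 genes, labels unrelated to expression.
rng = random.Random(1)
xs = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(20)]
ys = [0] * 10 + [1] * 10
print("selection on all data:", loocv_error(xs, ys, m=10, honest=False))
print("selection inside CV:  ", loocv_error(xs, ys, m=10, honest=True))
```

The same pattern applies to standardization and parameter choices such as k: anything tuned on the data belongs inside the loop.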

The approaches described above can also be extended to reflect differential misclassification costs; in such situations, performance is assessed based on the general definition of risk in Equation 7.1. In the case of unequal representation of the classes, some form of stratified sampling may be needed to ensure balance across important classes in all subsamples. In addition, for complex experimental designs, such as factorial or time-course designs, the resampling mechanisms used for computational inference should reflect the design of the experiment.
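One simple form of stratified sampling for cross-validation is to deal the cases of each class into the folds in turn, so that every fold receives roughly the same class proportions as the full learning set. The sketch below is one possible implementation, not a method prescribed by the chapter; the function name `stratified_folds` and the class labels "A" and "B" are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, V, seed=0):
    """Split indices into V folds while keeping class proportions roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(V)]
    k = 0                                   # running counter keeps overall fold sizes balanced
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)                # random order within each class
        for i in members:
            folds[k % V].append(i)          # deal cases round-robin across folds
            k += 1
    return folds

# Toy labels (hypothetical): an unbalanced two-class problem, 8 vs. 4 cases.
labels = ["A"] * 8 + ["B"] * 4
for fold in stratified_folds(labels, V=4):
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

With these labels each of the four folds holds two cases of class "A" and one of class "B".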

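Finally, note that in machine learning, a frequently employed alternative to simple accuracy-based measures is the lift. The lift of a given class k is computed from a test set as the proportion of correct class k predictions divided by the proportion of class k test cases, i.e.,

$$\text{lift}_k = \frac{\sum_i I(\hat{y}_i = k,\ y_i = k) \,/\, \sum_i I(\hat{y}_i = k)}{\sum_i I(y_i = k) \,/\, n},$$

where $y_i$ and $\hat{y}_i$ denote the actual and predicted class labels of test case i, n is the number of test cases, and $I(\cdot)$ is the indicator function.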

In general, the greater the lift, the better the classifier.
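A minimal sketch of the lift computation follows. It assumes that "the proportion of correct class k predictions" means the fraction of test cases predicted as class k that truly belong to class k; the function name `lift` and the toy labels are hypothetical.

```python
def lift(y_true, y_pred, k):
    """Lift of class k: (fraction of class-k predictions that are correct) / (fraction of class-k cases)."""
    n = len(y_true)
    truths_of_predicted_k = [t for t, p in zip(y_true, y_pred) if p == k]
    n_class_k = sum(t == k for t in y_true)
    if not truths_of_predicted_k or n_class_k == 0:
        raise ValueError("lift is undefined without class-k predictions and class-k test cases")
    correct_fraction = sum(t == k for t in truths_of_predicted_k) / len(truths_of_predicted_k)
    prevalence = n_class_k / n
    return correct_fraction / prevalence

# Toy test set (hypothetical): class 1 makes up 2 of 8 cases.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
print(lift(y_true, y_pred, k=1))   # (2/3) / (2/8)
```

A lift of 1 corresponds to predictions of class k that are no more often correct than labeling cases as k at random.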


6. DISCUSSION

Classification is an important question in microarray experiments, for purposes of classifying biological samples and predicting clinical or other outcomes using gene expression data. In this chapter, we have discussed the statistical foundations of classification and described two basic classification approaches, nearest neighbor and linear discriminant analysis (including the special case of DLDA, also known as naive Bayes classification). We have addressed the important issues of feature selection and honest classifier performance assessment, which takes into account gene screening and other training decisions in error rate estimation procedures such as cross-validation.

The reader is referred to Dudoit et al. (2002) and Dudoit and Fridlyand (2002) for a more detailed discussion and comparison of classification methods for microarray data. The classifiers examined in these two studies include linear and quadratic discriminant analysis, nearest neighbor classifiers, classification trees, and SVMs. Resampling methods such as bagging and boosting were also considered, including random forests and LogitBoost for tree stumps. Simple methods such as nearest neighbors and naive Bayes classification were found to perform remarkably well compared to more complex approaches, such as aggregated classification trees or SVMs. Dudoit and Fridlyand (2002) also discussed the general questions of feature selection, standardization, distance function, loss function, biased sampling of classes, and binary vs. polychotomous classification. Decisions concerning all these issues can have a large impact on the performance of the classifier; they should be made in conjunction with the choice of classifier and included in the assessment of classifier performance.

Although classification is by no means a new subject in the statistical literature, the large and complex multivariate datasets generated by microarray experiments raise new methodological and computational challenges. These include building accurate classifiers in a “large p, small n” situation and obtaining honest estimates of classifier performance. In particular, better predictions may be obtained by inclusion of other predictor variables such as age or sex. In addition to accuracy, a desirable property of a classifier is its ability to yield insight into the predictive structure of the data, that is, identify individual genes and sets of interacting genes that are related to class distinction. Further investigation of the resulting genes may improve our understanding of the biological mechanisms underlying class distinction and eventually lead to marker genes to be used in a clinical setting for predicting outcomes such as survival and response to treatment.


ACKNOWLEDGMENTS

We are most grateful to Leo Breiman for many insightful conversations on classification. We would also like to thank Robert Gentleman for valuable discussions on classification in microarray experiments while designing a short course on this topic. Finally, we have appreciated the editors’ careful reading of the chapter and very helpful suggestions.

REFERENCES

Alizadeh A. A., Eisen M. B., Davis R. E., Ma C., Lossos I. S., Rosenwald A., Boldrick J. C., Sabet H., Tran T., Yu X., Powell J. I., Yang L., Marti G. E., Moore T., Hudson J. Jr., Lu L., Lewis D. B., Tibshirani R., Sherlock G., Chan W. C., Greiner T. C., Weisenburger D. D., Armitage J. O., Warnke R., Levy R., Wilson W., Grever M. R., Byrd J. C., Botstein D., Brown P. O., and Staudt L. M. (2000). Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511.

Alon U., Barkai N., Notterman D. A., Gish K., Ybarra S., Mack D., and Levine A. J. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci., 96:6745-6750.

Bittner M., Meltzer P., Chen Y., Jiang Y., Seftor E., Hendrix M., Radmacher M., Simon R., Yakhini Z., Ben-Dor A., Sampas N., Dougherty E., Wang E., Marincola F., Gooden C., Lueders J., Glatfelter A., Pollock P., Carpten J., Gillanders E., Leja D., Dietrich K., Beaudry C., Berens M., Alberts D., Sondak V., Hayward N., and Trent J. (2000). Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406:536-540.

Bø T. H. and Jonassen I. (2002). New feature subset selection procedures for classification of expression profiles. Genome Biology, 3(4): 1-11.

Boldrick J. C., Alizadeh A. A., Diehn M., Dudoit S., Liu C. L., Belcher C. E., Botstein D., Staudt L. M., Brown P. O., and Relman D. A. (2002). Stereotyped and specific gene expression programs in human innate immune responses to bacteria. Proc. Natl. Acad. Sci., 99(2):972-977.

Breiman L. (1998). Arcing classifiers. Annals of Statistics, 26:801-824.

Breiman L. (1999). Random forests - random features. Technical Report 567, Department of Statistics, U.C. Berkeley.

Breiman L., Friedman J.H., Olshen R., and Stone C.J. (1984). Classification and regression trees. The Wadsworth statistics/probability series. Wadsworth International Group.

Chen X., Cheung S.T., So S., Fan S.T., Barry C., Higgins J., Lai K.-M., Ji J., Dudoit S., Ng I. O. L., van de Rijn M., Botstein D., and Brown P.O. (2002). Gene expression patterns in human liver cancers. Molecular Biology of the Cell, 13(6):1929-1939.

Chow M.L., Moler E.J., and Mian I.S. (2001). Identifying marker genes in transcription profiling data using a mixture of feature relevance experts. Physiological Genomics, 5:99-111.

Costello J. F., Fruehwald M.C., Smiraglia D.J., Rush L. J., Robertson G.P., Gao X., Wright F.A., Feramisco J.D., Peltomäki P., Lang J.C., Schuller D.E., Yu L., Bloomfield C.D., Caligiuri M.A., Yates A., Nishikawa R., Huang H.J.S., Petrelli N.J., Zhang X., O’Dorisio M.S., Held W.A., Cavenee W.K., and Plass C. (2000). Aberrant CpG-island methylation has non-random and tumour-type-specific patterns. Nature Genetics, 24:132-138.

Dettling M. and Bühlmann P. (2002). How to use boosting for tumor classification with gene expression data. Available at http://stat.ethz.ch/~dettling/boosting.

Dudoit S. and Fridlyand J. (2002). Classification in microarray experiments. In Speed, T. P., editor, Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall/CRC. (To appear).

Dudoit S., Fridlyand J., and Speed T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77-87.

Fix E. and Hodges J. (1951). Discriminatory analysis, nonparametric discrimination: consistency properties. Technical report, Randolph Field, Texas: USAF School of Aviation Medicine.

Friedman J.H. (1994). Flexible metric nearest neighbor classification. Technical report, Department of Statistics, Stanford University.

Friedman J.H. (1996). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Technical report, Department of Statistics, Stanford University.

Golub T.R., Slonim D.K., Tamayo P., Huard C., Gaasenbeek M., Mesirov J.P., Coller H., Loh M., Downing J.R., Caligiuri M.A., Bloomfield C.D., and Lander E.S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531-537.

Hastie T., Tibshirani R., and Friedman J.H. (2001). The Elements of Statistical Learning : Data Mining, Inference, and Prediction. Springer Verlag.

Jain A.N., Chin K., Børresen-Dale A., Erikstein B.K., Eynstein L.P., Kaaresen R., and Gray J.W. (2001). Quantitative analysis of chromosomal CGH in human breast tumors associates copy number abnormalities with p53 status and patient survival. Proc. Natl. Acad. Sci., 98:7952-7957.

Mardia K.V., Kent J.T., and Bibby J.M. (1979). Multivariate Analysis. Academic Press, Inc., San Diego.

McLachlan G.J. (1992). Discriminant analysis and statistical pattern recognition. Wiley, New York.

Perou C.M., Jeffrey S.S., van de Rijn M., Rees C.A., Eisen M.B., Ross D.T., Pergamenschikov A., Williams C.F., Zhu S.X., Lee J.C.F., Lashkari D., Shalon D., Brown, P.O., and Botstein D. (1999). Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci., 96:9212-9217.

Pollack J.R., Perou C.M., Alizadeh A.A., Eisen M.B., Pergamenschikov A., Williams C.F., Jeffrey S.S., Botstein D., and Brown P.O. (1999). Genome-wide analysis of DNA copy-number changes using cDNA microarrays. Nature Genetics, 23:41-46.

Pomeroy S.L., Tamayo P., Gaasenbeek M., Sturla L.M., Angelo M., McLaughlin M.E., Kim J.Y., Goumnerova L.C., Black P.M., Lau C., Allen J.C., Zagzag D., Olson J., Curran T., Wetmore C., Biegel J.A., Poggio T., Mukherjee S., Rifkin R., Califano A., Stolovitzky G., Louis D.N., Mesirov J.P., Lander E.S., and Golub T.R. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(24):436-442. (and supplementary information).


Ripley B. D. (1996). Pattern recognition and neural networks. Cambridge University Press, Cambridge, New York.

Ross D. T., Scherf U., Eisen M. B., Perou C. M., Spellman P., Iyer V., Jeffrey S. S., van de Rijn M., Waltham M., Pergamenschikov A., Lee J. C. F., Lashkari D., Shalon D., Myers T. G., Weinstein J. N., Botstein D., and Brown P. O. (2000). Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24:227-234.

Sørlie T., Perou C. M., Tibshirani R., Aas T., Geisler S., Johnsen H., Hastie T., Eisen M. B., van de Rijn M., Jeffrey S. S., Thorsen T., Quist H., Matese J. C., Brown P. O., Botstein D., Lønning P. E., and Børresen-Dale A. L. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci., 98(19):10869-10874.

Tibshirani R., Hastie T., Narasimhan B., and Chu G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl. Acad. Sci., 99:6567-6572.

West M., Blanchette C., Dressman H., Huang E., Ishida S., Spang R., Zuzan H., Marks J.R., and Nevins J.R. (2001). Predicting the clinical status of human breast cancer using gene expression profiles. Proc. Natl. Acad. Sci., 98:11462-11467.

Chapter 8

BAYESIAN NETWORK CLASSIFIERS FOR GENE EXPRESSION ANALYSIS

Byoung-Tak Zhang and Kyu-Baek Hwang

Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University, Seoul 151-742, Korea

e-mail: {btzhang, kbhwang}@bi.snu.ac.kr

1. INTRODUCTION

The recent advent of DNA chip technologies has made it possible to measure the expression level of thousands of genes in the cell population. The parallel view on gene expression profiles offers a novel opportunity to broaden the knowledge about various life phenomena. For example, the microarray samples from normal and cancer tissues are accumulated for the study of differentially expressed genes in the malignant cell (Golub et al., 1999; Alon et al., 1999; Slonim et al., 2000; Khan et al., 2001). The eventual knowledge acquired by such a study could aid in discriminating between carcinoma cells and normal ones based on the gene expression pattern. One of the main objectives of machine learning is to build a discriminative (classification) model from data, automatically.

Various kinds of machine learning models have been deployed for the classification task, e.g., k-nearest neighbor (kNN) models (Li et al., 2002), decision trees (Dubitzky et al., 2002), artificial neural networks (Khan et al., 2001; Dubitzky et al., 2002), and Bayesian networks (Hwang et al., 2002). These models differ mostly in the way they represent the learned knowledge. The kNN methods simply store the learning examples in computer memory. When a new example without a class label is encountered, a set of k similar examples is retrieved from memory and used for classification. Decision trees represent the learned knowledge in the form of a set of ‘if-then’ rules. Neural networks learn the functional relationships between the class variable and input attributes. Bayesian networks represent the joint probability distribution over the variables of interest. The kNN model is the simplest among the above classification models. The Bayesian network might be the most complicated and flexible one. In general, a more complicated model requires more elaborate and complex learning techniques. Nonetheless, the above classification models can achieve comparable classification performance, regardless of their representation power. Then, what is the reason for using more complex models? The answer might be that they enable the acquisition of more flexible and comprehensive knowledge, and the Bayesian network is probably the most suitable model for such purposes. Thus, it has been employed for sample classification (Hwang et al., 2002) as well as for genetic network analysis (Friedman et al., 2000; Hartemink et al., 2001) with microarray data.
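To make the contrast with the simplest model concrete, the kNN procedure mentioned above (store the examples, retrieve the k most similar ones, classify from them) might be sketched as follows. The function name `knn_classify`, the use of squared Euclidean distance and majority voting, and the toy expression values are assumptions for illustration only.

```python
from collections import Counter

def knn_classify(memory, query, k=3):
    """Classify `query` by majority vote among the k stored examples closest to it."""
    # memory: list of (expression_vector, class_label) pairs kept as-is, with no training step
    neighbors = sorted(
        memory,
        key=lambda example: sum((a - b) ** 2 for a, b in zip(example[0], query)),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy memory (hypothetical): two genes per sample, two classes.
memory = [
    ([0.0, 0.1], "normal"), ([0.2, 0.0], "normal"), ([0.1, 0.3], "normal"),
    ([5.0, 5.1], "tumor"), ([5.2, 4.9], "tumor"), ([4.8, 5.0], "tumor"),
]
print(knn_classify(memory, [4.9, 5.0], k=3))
```

Nothing is learned beyond the stored examples, which is why the kNN model offers little insight into relationships among the genes themselves.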

This chapter deals with the Bayesian network for the classification of microarray data and is organized as follows. In Section 2, we give a simple explanation of the Bayesian network model. Methods of data preprocessing and learning Bayesian networks from data are provided in Section 3. In Section 4, the advantages of the Bayesian networks as well as the difficulties in applying them to the classification task are described. Some techniques for improving the classification performance of the Bayesian network are also presented. In Section 5, we compare the classification accuracy of the Bayesian network with other state-of-the-art techniques on two microarray data sets. The use of Bayesian networks for knowledge discovery is also illustrated. Finally, we give some concluding remarks in Section 6.

2. BAYESIAN NETWORKS

The Bayesian network (Heckerman, 1999; Jensen, 2001) is a kind of probabilistic graphical model, which represents the joint probability distribution over a set of variables of interest.[1] In the framework of probabilistic graphical models, the conditional independence is exploited for the efficient representation of the joint probability distribution. For three sets of variables $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$,[2] $\mathbf{X}$ is conditionally independent from $\mathbf{Y}$ given the value of $\mathbf{Z}$, if $P(\mathbf{x} \mid \mathbf{y}, \mathbf{z}) = P(\mathbf{x} \mid \mathbf{z})$ for all $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$ whenever $P(\mathbf{y}, \mathbf{z}) > 0$. The Bayesian network structure encodes various conditional independencies among the variables. Formally, a Bayesian network assumes a directed acyclic graph (DAG) structure where a node corresponds to a variable[3] and an edge denotes the direct probabilistic dependency between two connected nodes. The DAG structure asserts that each node is independent from all of its non-descendants conditioned on its parent nodes. By these assertions, the Bayesian network over a set of N variables, $\mathbf{X} = \{X_1, \ldots, X_N\}$, represents the joint probability distribution as

$$P(\mathbf{X}) = \prod_{i=1}^{N} P(X_i \mid \mathbf{Pa}_i), \qquad (1)$$

where $\mathbf{Pa}_i$ denotes the set of parents of $X_i$, and $P(X_i \mid \mathbf{Pa}_i)$ is called the local probability distribution of $X_i$. The local probability distribution describes the conditional probability distribution of each node given the values of its parents. The appropriate local probability distribution model is chosen according to the variable type. When all the variables are discrete, the multinomial model is used. When all the variables are continuous, the linear Gaussian model[4] can be used.[5]

[1] When applying Bayesian networks to microarray data analysis, each gene or the experimental condition is regarded as a variable. The value of the gene variable corresponds to its expression level. The experimental conditions include the characteristics of tissues, cell cycles, and others.

[2] Following the standard notation, we represent a random variable as a capital letter (e.g., X, Y, and Z) and a set of variables as a boldface capital letter (e.g., $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$). The corresponding lowercase letters denote the instantiation of the variable (e.g., x, y, and z) or all the members of the set of variables (e.g., $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$), respectively.

[3] Because each node in the Bayesian network is one-to-one correspondent to a variable, ‘node’ and ‘variable’ denote the same object in this paper. We use both terms interchangeably according to the context.

[4] In the linear Gaussian model, a variable is normally distributed around a mean that depends linearly on the values of its parent nodes. The variance is independent of the parent nodes.

[5] The hybrid case, in which the discrete variables and the continuous variables are mixed, could also exist. Such a case is not dealt with in this paper.

Figure 8.1 is an example Bayesian network for cancer classification. This Bayesian network represents the joint probability distribution over eight gene variables and the ‘Class’ variable. Unlike other machine learning models for classification, the Bayesian network does not discriminate between the class variable and the input attributes. The class variable is simply regarded as one of the data attributes. When classifying a sample without a class label using the Bayesian network in Figure 8.1, we calculate the conditional probability of the ‘Class’ variable given the values of the eight gene variables as follows:

$$P(\text{Class} \mid \text{Gene A}, \ldots, \text{Gene H}) = \frac{P(\text{Class}, \text{Gene A}, \ldots, \text{Gene H})}{\sum_{\text{Class}} P(\text{Class}, \text{Gene A}, \ldots, \text{Gene H})}, \qquad (2)$$

where the summation is taken over all the possible states of the ‘Class’ variable. Among the possible cancer class labels, the one with the highest conditional probability value might be selected as an answer.[6] The joint probability in the numerator and denominator of Equation (2) can be decomposed into a product of the local probability of each node based on the DAG structure in Figure 8.1.[7] In addition to the classification, the Bayesian network represents the probabilistic relationships among variables in a comprehensible DAG format. For example, the Bayesian network in Figure 8.1 asserts that the expression of ‘Gene D’ might affect the expression of both ‘Gene G’ and ‘Gene H’ (‘Gene D’ is the common parent of ‘Gene G’ and ‘Gene H’).[8]
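The classification rule above, a joint probability built as a product of local probabilities and then normalized over the class states, can be illustrated on a much smaller network than the one in Figure 8.1. The three-node structure (Class → GeneA → GeneB), the discretized states, and every probability value below are invented for the example and are not taken from the chapter.

```python
from itertools import product

# Hypothetical DAG: Class -> GeneA -> GeneB (each node lists its parents).
parents = {"Class": (), "GeneA": ("Class",), "GeneB": ("GeneA",)}

# Hypothetical multinomial local distributions: cpt[node][parent_values][value].
cpt = {
    "Class": {(): {"normal": 0.5, "cancer": 0.5}},
    "GeneA": {("normal",): {"low": 0.8, "high": 0.2},
              ("cancer",): {"low": 0.3, "high": 0.7}},
    "GeneB": {("low",): {"low": 0.9, "high": 0.1},
              ("high",): {"low": 0.4, "high": 0.6}},
}

def joint(assignment):
    """Joint probability as the product of each node's local probability given its parents."""
    prob = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[p] for p in pa)
        prob *= cpt[node][pa_values][assignment[node]]
    return prob

def class_posterior(evidence):
    """P(Class | genes): joint probability for each class state, normalized over the class states."""
    scores = {c: joint({**evidence, "Class": c}) for c in cpt["Class"][()]}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

posterior = class_posterior({"GeneA": "high", "GeneB": "high"})
print(posterior)
print("predicted class:", max(posterior, key=posterior.get))
```

In this toy structure GeneB is conditionally independent of Class given GeneA, so observing GeneB does not change the class posterior once GeneA is known; the DAG makes that kind of relationship directly readable.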

3. APPLYING BAYESIAN NETWORKS TO THE CLASSIFICATION OF MICROARRAY DATA

Figure 8.2 shows the overall procedure of applying Bayesian networks to the classification of microarray data. First, an appropriate number of genes are selected and the expression level of each gene is transformed into a discrete value.[9] After the discretization and selection process, a Bayesian network is learned from the reduced microarray data set, which has only the selected genes and the ‘Class’ variable as its attributes. Finally, the learned

[6] In the case of a tie, the answer can be selected randomly or just ‘unclassified’.

[7] The local probability distribution of each node is estimated from the data in the procedure of Bayesian network learning.

[8] An edge in the Bayesian network just means the probabilistic dependency between two connected nodes. In Figure 8.1, ‘Gene D’ depends on ‘Gene G’ and vice versa. The probabilistic dependency does not always denote the causal relationship but the possibility of its existence.

[9] The discretization of gene expression level is related to the choice of the local probability distribution model for each gene node. It is not compulsory in microarray data analysis with Bayesian networks.
