Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

Genomics and Proteomics Engineering in Medicine and Biology - Metin Akay

.pdf
Скачиваний:
50
Добавлен:
10.08.2013
Размер:
9.65 Mб
Скачать

94GENE REGULATION BIOINFORMATICS OF MICROARRAY DATA

G.Pavesi, G. Pesole, M. Regnier, N. Simonis, S. Sinha, G. Thijs, J. van Helden,

M.Vandenbogaert, Z. Weng, C. Workman, C. Ye, and Z. Zhu, “Assessing computational tools for the discovery of transcription factor binding sites,” Nature Biotechnol., 23(1): 137–144, 2005.

48.G. Pavesi, P. Mereghetti, G. Mauri, and G. Pesole, “Weeder Web: Discovery of transcription factor binding sites in a set of sequences from co-regulated genes,” Nucleic Acids Res., 32(2, Suppl.): W199–203, 2004.

49.J. van Helden, B. Andre´, and L. Collado-Vides, “Extracting regulatory sites from upstream region of yeast genes by computational analysis of oligonucleotide frequencies,” J. Mol. Biol., 281: 827–842, 1998.

50.T. D. Schneider and R. M. Stephens, “Sequence logos: A new way to display consensus sequences,” Nucleic Acids Res., 18(20): 6097–6100, 1990.

51.T. Werner, “Models for prediction and recognition of eukaryotic promoters,” Mamm. Genome, 10: 71–80, 1999.

52.M. Tompa, “An exact method for finding short motifs in sequences, with application to the ribosome binding site problem,” Proc. Int. Conf. Intell. Sys. Mol. Biol. (Heidelberg, Germany), 7: 262–271, 1999.

53.P. A. Pevzner and S. H. Sze, “Combinatorial approaches to finding subtle signals in dna sequences,” Proc. Int. Conf. Intell. Sys. Mol. Biol., 8: 269–278, 2000.

54.U. Keich and P. A. Pevzner, “Subtle motifs: Defining the limits of motif finding algorithms,” Bioinformatics, 18(10): 1382–1390, 2002.

55.U. Keich and P. A. Pevzner, “Finding motifs in the twilight zone,” Bioinformatics, 18(10): 1374–1381, 2002.

56.J. Buhler and M. Tompa, “Finding motifs using random projections,” J. Comput. Biol., 9(2): 225–242, 2002.

57.L. Marsan and M.-F. Sagot, “Algorithms for extracting structured motifs using a suffix tree with application to promoter and regulatory site consensus identification,” J. Comp. Biol., 7: 345–360, 2000.

58.A. Vanet, L. Marsan, A. Labigne, and M. F. Sagot, “Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori sigma80 family of promoter signals,”

J.Mol. Biol., 297(2): 335–353, 2000.

59.G. Z. Hertz, G. W. Hartzell, and G. D. Stormo, “Identification of consensus patterns in unaligned DNA sequences known to be functionally related,” Comput. Appl. Biosci., 6: 81–92, 1990.

60.G. Z. Hertz and G. D. Stormo, “Identifying DNA and protein patterns with statistically significant alignments of multiple sequences,” Bioinformatics, 15(7/8): 563–577, 1999.

61.C. E. Lawrence and A. A. Reilly, “An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences,” Proteins, 7: 41–51, 1990.

62.T. L. Bailey and C. Elkan, “Unsupervised learning of multiple motifs in biopolymers using expectation maximization,” Machine Learning, 21: 51–80, 1995.

63.T. L. Bailey and C. Elkan, “The value of prior knowledge in discovering motifs with

MEME,” Proc. Int. Conf. Intell. Sys. Mol. Biol., 3: 21–29, 1995.

64.L. R. Cardon and G. D. Stormo, “Expectation maximization for identifying proteinbinding sites with variable lengths from unaligned DNA fragments,” J. Mol. Biol.,

223:159–170, 1992.

REFERENCES 95

65.C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton, “Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment,” Science, 262: 208–214, 1993.

66.J. S. Liu, A. F. Neuwald, and C. E. Lawrence, “Bayesian models for multiple local sequence alignment and Gibbs sampling strategies,” J. Am. Stat. Assoc., 90(432): 1156–1170, 1995.

67.J. D. Hughes, P. W. Estep, S. Tavazoie, and G. M. Church, “Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae,” J. Mol. Biol., 296: 1205–1214, 2000.

68.G. Thijs, M. Lescot, K. Marchal, S. Rombauts, B. De Moor, P. Rouze´, and Y. Moreau, “A higher order background model improves the detection of promoter regulatory elements by Gibbs sampling,” Bioinformatics, 17(12): 1113–1122, 2001.

69.G. Thijs, K. Marchal, M. Lescot, S. Rombauts, B. De Moor, P. Rouze´, and Y. Moreau, “A Gibbs sampling method to detect over-represented motifs in the upstream regions of co-expressed genes,” J. Comput. Biol., 9(2): 447–464, 2002.

70. K. Marchal, G. Thijs, S. De Keersmaecker, P. Monsieurs, B. De Moor, and J. Vanderleyden, “Genome-specific higher-order background models to improve motif detection,” Trends Microbiol., 11(2): 61–66, 2003.

71.X. Liu, D. L. Brutlag, and J. S. Liu, “BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes,” Proc. Pacific Symp. Biocomput., 6: 127–138, 2001.

72.C. T. Workman and G. D. Stormo, “Ann-spec: A method for discovering transcription binding sites with improved specificity,” Proc. Pacific Symp. Biocomput. (Honolulu, Hawai), 5: 464–475, 2000.

73.D. GuhaThakurta and G. D. Stormo, “Identifying target sites for cooperatively binding factors,” Bioinformatics, 17: 608–621, 2001.

74.B. P. Berman, Y. Nibu, B. D. Pfeiffer, P. Tomancak, S. E. Celniker, M. Levine, G. M. Rubin, and M. B. Eisen, “Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila,” Proc Natl. Acad. Sci. USA, 99(2): 757–762, 2002.

75.M. S. Halfon, Y. Grad, G. M. Church, and A. M. Michelson, “Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model,” Genome Res., 12(7): 1019–1028, 2002.

76.O. Johansson, W. Alkema, W. W. Wasserman, and J. Lagergren, “Identification of functional clusters of transcription factor binding motifs in genome sequences: The mscan algorithm,” Bioinformatics, 19(1, Suppl.): I169–I176, 2003.

77.M. Markstein and M. Levine, “Decoding cis-regulatory dnas in the Drosophila genome,” Curr. Opin. Genet. Dev., 12(5): 601–606, 2002.

78.M. Rebeiz, N. L. Reeves, and J. W. Posakony, “SCORE: A computational approach to the identification of cis-regulatory modules and target genes in whole-genome sequence data. Site clustering,” Proc. Natl. Acad. Sci. USA, 99(15): 9888–9893, 2002.

79.E. M. Crowley, K. Roeder, and M. Bina, “A statistical model for locating regulatory regions in genomic DNA,” J. Mol. Biol., 268: 8–14, 1997.

80.N. Rajewsky, M. Vergassola, U. Gaul, and E. D. Siggia, “Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo,” BMC Bioinformatics, 3(1): 30, 2002.

96GENE REGULATION BIOINFORMATICS OF MICROARRAY DATA

81.M. C. Frith, U. Hansen, and Z. Weng, “Detection of cis-element clusters in higher eukaryotic dna,” Bioinformatics, 17(10): 878–879, 2001.

82.M. C. Frith, J. L. Spouge, U. Hansen, and Z. Weng, “Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences,” Nucleic Acids Res., 30(14): 3214–3224, 2002.

83.T. L. Bailey and W. S. Noble, “Searching for statistically significant regulatory modules,” Bioinformatics, 19(2, Suppl.): II16–II25, 2003.

84.S. Sinha, E. Van Nimwegen, and E. D. Siggia, “A probabilistic method to detect regulatory modules,” Bioinformatics, 19(1, Suppl.): I292–I301, 2003.

85.S. Aerts, P. Van Loo, G. Thijs, Y. Moreau, and B. De Moor, “Computational detection of cis-regulatory modules,” Bioinformatics, 19(2, Suppl.): II5–II14, 2003.

86.L. A. McCue, W. Thompson, C. S. Carmack, M. P. Ryan, J. S. Liu, V. Derbyshire, and

C.E. Lawrence, “Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes,” Nucleic Acids Res., 29: 774–782, 2001.

87.J. S. Liu and C. E. Lawrence, “Bayesian inference on biopolymer models,” Bioinformatics, 15: 38–52, 1999.

88.M. Blanchette, B. Schwikowski, and M. Tompa, “Algorithms for phylogenetic footprinting,” J. Comput. Biol., 9(2): 211–223, 2002.

89.M. Blanchette and M. Tompa, “Discovery of regulatory elements by a computational method for phylogenetic footprinting,” Genome Res., 12(5): 739–748, 2002.

90.M. Blanchette and M. Tompa, “Footprinter: A program designed for phylogenetic footprinting,” Nucleic Acids Res., 31(13): 3840–3842, 2003.

91.T. Wang and G. D. Stormo, “Combining phylogenetic data with coregulated genes to identify regulatory motifs,” Bioinformatics, 19(18): 2369–2380, 2003.

92.E. Berezikov, V. Guryev, R. H. Plasterk, and E. Cuppen, “CONREAL: Conserved regulatory elements anchored alignment algorithm for identification of transcription factor binding sites by phylogenetic footprinting,” Genome Res., 14(1): 170–178, 2004.

93.P. Cliften, P. Sudarsanam, A. Desikan, L. Fulton, B. Fulton, J. Majors, R. Waterston,

B.A. Cohen, and M. Johnston, “Finding functional features in Saccharomyces genomes by phylogenetic footprinting,” Science, 301(5629): 71–76, 2003.

94.B. Lenhard, A. Sandelin, L. Mendoza, P. Engstrom, N. Jareborg, and W. W. Wasserman, “Identification of conserved regulatory elements by comparative genome analysis,”

J.Biol., 2(2): 13, 2003.

95.K. Marchal, S. De Keersmaecker, P. Monsieurs, N. van Boxel, K. Lemmens, G. Thijs,

J.Vanderleyden, and B. De Moor, “In silico identification and experimental validation of pmrab targets in Salmonella typhimurium by regulatory motif detection,” Genome Biol., 5(2): R9, 2004.

96.J. van Helden, B. Andre, and J. Collado-Vides, “Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies,” J. Mol. Biol., 281(5): 827–842, 1998.

97.M. Blanchette, B. Schwikowski, and M. Tompa, “An exact algorithm to identify motifs in orthologous sequences from multiple species,” Proc. Int. Conf. Intell. Sys. Mol. Biol.,

8:37–45, 2000.

98.H. J. Bussemaker, H. Li, and E. D. Siggia, “Regulatory element detection using a probabilistic segmentation model,” Proc. Int. Conf. Intell. Sys. Mol. Biol., 8: 67–74, 2000.

REFERENCES 97

99.M. Lescot, P. De´hais, G. Thijs, K. Marchal, Y. Moreau, Y. Van de Peer, P. Rouze´, and

S.Rombauts, “PlantCARE, a database of plant cis-acting regulatory elements and a portal to tools for in silico analysis of promoter sequences,” Nucleic Acids Res., 30: 325–327, 2002.

100.W. Chen, N. J. Provart, J. Glazebrook, F. Katagiri, H.-S. Chang, T. Eulgem, F. Mauch,

S.Luan, G. Zou, S. A. Whitham, P. R. Budworth, Y. Tao, Z. Xie, X. Chen, S. Lam, J. A. Kreps, J. F. Harper, A. Si-Ammour, B. Mauch-Mani, M. Heinlein, K. Kobayashi,

T.Hohn, J. L. Dangl, X. Wang, and T. Zhu, “Expression profile matrix of Arabidopsis transcription factor genes suggests their putative functions in response to environmental stresses,” Plant Cell, 14(3): 559–574, 2002.

101.E. J. Klok, I. W. Wilson, D. Wilson, S. C. Chapman, R. M. Ewing, S. C. Somerville, W. J. Peacock, R. Dolferus, and E. S. Dennis, “Expression profile analysis of the low-oxygen response in Arabidopsis root cultures,” Plant Cell, 14: 2481–2494, 2002.

102.S. Le Crom, F. Devaux, P. Marc, X. Zhang, W. S. Moye-Rowley, and C. Jacq, “New insights into the pleiotropic drug resistance network from genome-wide characterization of the YRR1 transcription factor regulation system,” Mol. Cell. Biol., 22(8): 2642– 2649, 2002.

103.I. M. Martinez and M. J. Chrispeels, “Genomic analysis of unfolded protein response in Arabidopsis shows its connection to important cellular processess” Plant Cell, 15(2): 561–576, 2003.

104.U. Ohler, G. Liao, H. Niemann, and G. M. Rubin, “Computational analysis of core promoters in the Drosophila genome,” Genome Biol., 3(12): 1–12, 2002.

105.J. B. Rossel, I. W. Wilson, and B. J. Pogson, “Global changes in gene expression in response to high light in Arabidopsis,” Plant Physiol., 130: 1109–1120, 2002.

106.M. T. Laub, H. H. McAdams, T. Feldblyum, C. M. Fraser, and L. Shapiro, “Global analysis of the genetic network controlling a bacterial cell cycle,” Science, 290(5499): 2144–2148, 2000.

107.T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon, J. Zeitlinger, E. G. Jennings,

H.L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J. B. Tagne, T. L. Volkert,

E.Fraenkel, D. K. Gifford, and R. A. Young, “Transcriptional regulatory networks in

Saccharomyces cerevisiae,” Science, 298(5594): 799–804, 2002.

108.J. R. Pollack, C. M. Perou, A. A. Alizadeh, M. B. Eisen, A. Pergamenschikov, C. F. Williams, S. S. Jeffrey, D. Botstein, and P. O. Brown, “Genome-wide analysis of DNA copy-number changes using cDNA microarrays,” Nature Genet., 23(1): 41–46, 1999.

109.H. J. Bussemaker, H. Li, and E. D. Siggia, “Regulatory element detection using correlation with expression,” Nature Genet., 27(2): 167–171, 2001.

110.C. Roven and H. J. Bussemaker, “Reduce: An online tool for inferring cis-regulatory elements and transcriptional module activities from microarray data,” Nucleic Acids Res., 31(13): 3487–3490, 2003.

111.E. M. Conlon, X. S. Liu, J. D. Lieb, and J. S. Liu, “Integrating regulatory motif discovery and genome-wide expression analysis,” Proc. Natl. Acad. Sci. USA, 100(6): 3339–3344, 2003.

112.M. Lapidot and Y. Pilpel, “Comprehensive quantitative analyses of the effects of promoter sequence elements on mrna transcription,” Nucleic Acids Res., 31(13): 3824–3828, 2003.

98 GENE REGULATION BIOINFORMATICS OF MICROARRAY DATA

113.Z. Bar-Joseph, G. K. Gerber, T. I. Lee, N. J. Rinaldi, J. Y. Yoo, F. Robert, D. B. Gordon, E. Fraenkel, T. S. Jaakkola, R. A. Young, and D. K. Gifford, “Computational discovery of gene modules and regulatory networks,” Nature Biotechnol., 21(11): 1337–1342, 2003.

114.X. S. Liu, D. L. Brutlag, and J. S. Liu, “An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments,” Nature Biotechnol., 20(8): 835–839, 2002.

115.O. Alter, P. O. Brown, and D. Botstein, “Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms,” Proc. Natl. Acad. Sci. USA, 100(6): 3351–3356, 2003.

116.J. Suykens, T. van Gestel, J. De Brabander, B. De Moor, and J. Vandewalle, Least Square Support Vector Machines, World Scientific Publishing, Singapore, 2002.

&CHAPTER 4

Robust Methods for

Microarray Analysis

GEORGE S. DAVIDSON, SHAWN MARTIN, KEVIN W. BOYACK, BRIAN N. WYLIE, JUANITA MARTINEZ, ANTHONY ARAGON,

´

MARGARET WERNER-WASHBURNE, MONICA MOSQUERA-CARO, and CHERYL WILLMAN

4.1. INTRODUCTION

The analysis of a complex system within an environment that is only subject to incomplete control is nearly impossible without some way to measure a large fraction of the system’s internal state information. As a result, it is only with the recent advent of high-throughput measurement technologies (able to simultaneously measure tens of thousands of molecular concentrations) that systems biology is really a possibility. As an example of the scope of this problem, consider that eukaryotic cells typically have on the order of tens of thousands of genes, each of which is likely to have several alternative splicing variants coding for the protein building blocks of the cell. These proteins undergo posttranslational modifications and have multiple phosphorylations such that there are likely to be hundreds of thousands, possibly millions, of variants. Hence the future of systems biology relies critically on high-throughput instruments, such as microarrays and dual mass spectrometers. The research reported here addresses the need for improved informatics to deal with the large volume of information from these techniques.

At present, microarray data are notoriously noisy. Hopefully this technology will improve in the future, but for the immediate present, it is important that the analysis tools and informatics systems developed for microarray analysis admit and adjust to significant levels of noise. In particular, such methods should be stable in the presence of noise and should assign measures of confidence to their own results. In this chapter we describe our efforts to implement and assess reliable methods for microarray analysis. We begin with the structure of the typical microarray experiment, normalization of the resulting data, and ways to find relationships. We then

Genomics and Proteomics Engineering in Medicine and Biology. Edited by Metin Akay Copyright # 2007 the Institute of Electrical and Electronics Engineers, Inc.

99

100 ROBUST METHODS FOR MICROARRAY ANALYSIS

proceed to discuss the tools themselves as well as methods to assess the output of such tools.

Much of the work in this chapter is based on the VxInsight visualization software [1]. However, we have tried to minimize overlap with existing published results and have emphasized new work.

Throughout each section we will present results and examples from our research to motivate the specific approaches, algorithms, and analysis methods we have developed. The data sets we have analyzed will not be emphasized, although they were of course critical to the work. Most of the data sets we have used have been published, and these will be cited in the course of the chapter. The data sets so far unpublished have been provided by the University of New Mexico Cancer Research and Treatment Center. These data sets include an infant leukemia data set, a precursor-B childhood leukemia data set, and an adult leukemia data set, all of which have been presented at the American Society of Hematology conferences in 2002 and/or 2003. Publications concerning the biological implications of these data sets are forthcoming. Here we discuss methods only.

Finally, we emphasize that this chapter is by no means a survey of techniques for microarray analysis. Indeed, the literature on microarray analysis is vast, and new papers appear frequently. We will mention, however, some of the seminal papers, including [2, 3], which describe the original technology, as well as [4], which gives an overview of the earlier work on microarray technology. Of course, microarrays would not be terribly useful without computational analysis, and some of the original work on microarray analysis includes hierarchical clustering of the yeast cell cycle [5, 6], an analysis of leukemia microarray data using an original method of gene selection [7], and an application of the singular-value decomposition to microarray data [8]. In addition to innumerable additional papers, there are a variety of books available which emphasize different aspects of the microarrays and microarray analysis. A few of the more recent books include [9–12].

4.2. MICROARRAY EXPERIMENTS AND ANALYSIS METHODS

4.2.1. Microarray Experiments

Microarray experiments and their analyses seek to detect effects in gene expression under different treatments or natural conditions with the goal of clarifying the cellular mechanisms involved in the cells’ differential responses. Uncertainty is the rule rather than the exception in these analyses. First, the underlying systems (cells and/or tissues) are incredibly complex whether viewed from the dynamic process perspective or their physical realizations in space and time. Second, there is abundant variability between cells experiencing exactly the same conditions as a result of genetic polymorphisms as well as the stochastic nature of these chemical systems. Third, the collection and initial preservation of these cells are not a precisely controlled process. For example, when a surgeon removes a cancer, the tissue may not be frozen or otherwise processed for minutes to hours, all the

4.2. MICROARRAY EXPERIMENTS AND ANALYSIS METHODS

101

while the cells continue to respond to these unnatural conditions. Further, the tissues, or partially processed extracts, are often sent to another laboratory several hours or even days away from the original collection site, all of which offers opportunities for chemical changes. Fourth, these measurements are not easy to make; they involve many processing steps with a wide variety of chemicals, and at every step variability arises. The processing is often (necessarily) divided across several days and among several technicians, with inherently different skills and training. Further, the chemicals are never exactly the same; they are created in different batches and age at different rates. All of these affect the laboratory yields and the quality of the processing. Finally, the arrays themselves are technological objects subject to all sorts of variability in their creation, storage, and final use. In effect, the simple measurement of mRNA concentrations that we would like to make is confounded by huge uncertainties. To be able to make good measurements, it is essential that all of the mentioned steps be subject to careful statistical process control, monitoring, and systematic improvement. Further, the actual experiments should be designed to avoid, randomize, or otherwise balance the confounding effects for the most important experimental measurements [13, 14]. These are best practices. Unfortunately, they are not often followed in the laboratory.

By the time the data are ready for analysis, they are typically presented in a numeric table recording a measurement for each gene across several microarrays. For example, one might be analyzing 400 arrays each with 20,000 gene measurements, which would be presented in a table with 20,000 rows and 400 columns. Often the table will have missing values resulting from scratched arrays, poor hybridizations, or scanner misalignments, to name just a few (from among a host) of the possible problems. The analysis methods should be able to gracefully deal with these incomplete data sets, while the analyst should approach these data with great skepticism and humility considering the complexity of the cellular processes and the error-prone nature of our microarray technology. Despite all of these issues, statisticians and informaticians, unlike mathematicians, are expected to say something about the structure and meaning of these data. Because, as Thompson has said, “[statisticians] should be concerned with a reality beyond data-free formalism. That is why statistics is not simply a subset of mathematics” [15, p. xv]. Here, we attempt to follow Thompson in discussing implications of, as well as our approaches to, analyzing these experiments, including considerable detail about the algorithms and the way the data are handled.

4.2.2. Preprocessing and Normalization

As discussed in the previous section, microarray data typically have a large number of missing data, or values otherwise deemed to be nonpresent. We typically drop genes with too many missing values, assigning a threshold for this purpose that is under the control of the analyst. The raw values are then scaled to help with the processing.

The distributions of microarray measurement values typically have extremely long tails. There are a few genes with very large expressions, while most of the

102 ROBUST METHODS FOR MICROARRAY ANALYSIS

others are quite small. Tukey and Mosteller suggested a number of transforms to make data from such distributions less extreme and more like the normal Gaussian distribution [16]. In particular, taking logarithms of the raw data is a common practice to make microarray data more symmetric and to shorten the extreme tail (see Fig. 4.1a).

However, we frequently use another transform to compress the extreme values, which is due to [17]. This rank-order-based score is an increasing function of expression level, for which the smaller values are compressed to be very nearly the same. This is particularly useful with array data, as many of these smaller values are due purely to noise. The Savage score is computed as follows: If the absolute expression levels are rank ordered from smallest to largest, X(1) X(2) X(n), then the score for X(k) is given by

Xk 1

SS(X(k)) ¼ j¼1 n þ 1 j

Figure 4.1b shows how this score affects the resulting distribution. In particular, Savage scoring compresses the extreme tail and emphasizes the intermediate and large counts. In this case, about 60% of the savage scores are below 1.

The use of this scoring has two advantages over correlations using raw counts. First, because it is based on rank ordering, data from arrays processed with very different scales can still be compared. Second, because the noisiest fraction of the measurements is aggressively forced toward zero, the effect of the noise is suppressed without completely ignoring the information (it has been taken into account during the sorting phase). Finally, large differences in rank order will still be strong enough to be detected.

The normalization of array data is controversial, although some form of centering and variance normalization is often applied [18, 19]. However, it has been argued

FIGURE 4.1. (a) Distribution of log-transformed data from typical Affymetrix U94A microarray. (b) Distribution of the same data after Savage scores.

4.3. UNSUPERVISED METHODS

103

that for many experiments there is no intrinsic reason to expect the underlying mRNA concentrations to have the same mean (or median) values, and hence variance adjustment is suspect. Nevertheless, the analyst has the option to do such normalizations, if desired. In general, we avoid this issue by working with order statistics and Savage scores.

4.2.3. Overview of Basic Analysis Method

After preprocessing the measurements with thresholding, rescaling, and various other transforms, we often perform our analysis as follows. First, we compare the genes or arrays under investigation by computing pairwise similarities with several techniques. These similarities are used to cluster (or assign spatial coordinates to) the genes in ways that bring similar genes closer together. These clusters are then visualized with VxInsight [1, 20].

Although we typically use genes in our analyses, we often use arrays as well. In fact, the analysis is the same, as all our techniques use either the gene matrix initially described, or the transpose of the gene matrix. Hence, throughout the remainder of this chapter, we will use the terms genes and arrays interchangeably.

Next, the clusters are tested with statistical methods to identify genes and groups of genes that are differentially expressed in the identified clusters or genes otherwise identified with respect to experimental questions and hypotheses. The expression values for the identified genes are plotted and tested for stability. Those genes which seem particularly diagnostic or differentiating are studied in detail by reading the available information in online databases and in the original literature. Each of these analysis steps will be presented in an order approximately following the analysis order we use in practice.

This type of analysis is quite typical for microarrays and is usually divided into two categories. The method first described for clustering genes is known as cluster analysis, or unsupervised learning. The term unsupervised is used because we are attempting to divine groups within the data set while avoiding the use of a priori knowledge. In contrast, the second method described for the identification of gene lists is usually called supervised learning. The term supervised refers to the fact that we are using prior information in our analysis, in this case attempting to discover which genes are differentially expressed between known groups. An unsupervised method asks, in essence: Are there groups in the data set and, if so, what are they? A supervised method asks: Given groups, can we predict group membership and can we learn what is most important in distinguishing the groups?

4.3. UNSUPERVISED METHODS

4.3.1. Overview of Clustering for Microarray Data

Organizing large groups of data into clusters is a standard and powerful practice in exploratory data analysis. It has also become a standard practice in microarray