
Berrar D. et al. - Practical Approach to Microarray Data Analysis
Chapter 17
ACKNOWLEDGEMENTS
The authors thank Jennifer Shoemaker and Patrick McConnell for valuable discussions.
REFERENCES
Agrawal R., Mannila H., Srikant R., Toivonen H., and Verkamo A. I. (1996). Fast discovery of association rules. In "Advances in knowledge discovery and data mining" (U. M. Fayyad, Ed.), pp. 307-328, AAAI Press/MIT Press, Menlo Park, CA.
Agrawal R., and Shafer J. C. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering 8: 962-969.
Aussem A., and Petit J.-M. (2002). Epsilon-functional dependency inference: application to DNA microarray expression data. In Proceedings of BDA'02 (French Database Conference), Evry, France.
Berrar D., Dubitzky W., Granzow M., and Eils R. (2002). Analysis of Gene Expression and Drug Activity Data by Knowledge-based Association Mining. In Proceedings of CAMDA 02, Durham, NC, http://www.camda.duke.edu/CAMDA01/papers.asp.
Berrar D., Granzow M., Dubitzky W., Stilgenbauer S., Wilgenbus K. D. H., Lichter P., and Eils R. (2001). New Insights in Clinical Impact of Molecular Genetic Data by Knowledge-driven Data Mining. In Proc. 2nd Int'l Conference on Systems Biology, pp. 275-281, Omnipress.
Brin S., Motwani R., Ullman J. D., and Tsur S. (1997). Dynamic itemset counting and implication rules for market basket data. In SIGMOD Record (ACM Special Interest Group on Management of Data).
Chang J.-H., Hwang K.-B., and Zhang B.-T. (2002). Analysis of Gene Expression Profiles and Drug Activity Patterns by Clustering and Bayesian Network Learning. In Methods of microarray data analysis II (S. M. Lin, and K. F. Johnson, Eds.), Kluwer Academic Publishers.
Chen R., Jiang Q., Yuan H., and Gruenwald L. (2001). Mining association rules in analysis of transcription factors essential to gene expressions. In Proceedings of CBGIST 2001, Durham, NC.
Eisen M. B., Spellman P. T., Brown P. O., and Botstein D. (1998). Cluster analysis and display of genome-wide expression patterns, Proc Natl Acad Sci USA 95:14863-8.
Glymour C. N., and Cooper G. F. (1999). Computation, causation, and discovery. MIT Press, Cambridge, Mass.
Han J., and Kamber M. (2001). Data mining: concepts and techniques. Morgan Kaufmann Publishers, San Francisco.
Han J., Pei J., and Yin Y. (2000). Mining frequent patterns without candidate generation. In ACM SIGMOD Intl. Conference on Management of Data, ACM Press.
Hipp J., Güntzer U., and Nakhaeizadeh G. (2000). Algorithms for Association Rule Mining - A General Survey and Comparison. In Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Hughes T. R., Marton M. J., Jones A. R., Roberts C. J., Stoughton R., Armour C. D., Bennett H. A., Coffey E., Dai H., He Y. D., Kidd M. J., King A. M., Meyer M. R., Slade D., Lum P. Y., Stepaniants S. B., Shoemaker D. D., Gachotte D., Chakraburtty K., Simon J., Bard M., and Friend S. H. (2000). Functional discovery via a compendium of expression profiles. Cell 102: 109-26.
Ihaka R., and Gentleman R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 5:299-314.
Jeha S., Luo X. N., Beran M., Kantarjian H., and Atweh G. F. (1996). Antisense RNA inhibition of phosphoprotein p18 expression abrogates the transformed phenotype of leukemic cells. Cancer Res 56:1445-50.
Klemettinen M., Mannila H., Ronkainen P., Toivonen H., and Verkamo A. I. (1994). Finding interesting rules from large sets of discovered association rules. In Third International Conference on Information and Knowledge Management (CIKM' 94), pp. 401-407, ACM Press.
Lindlof A., and Olsson B. (2002). Could correlation-based methods be used to derive genetic association networks? In Proceedings of the 6th Joint Conference on Information Sciences, pp. 1237-1242, Association for Intelligent Machinery, RTP, NC.
Park J. S., Chen M. S., and Yu P. S. (1997). Using a hash-based method with transaction trimming for mining association rules. IEEE Transactions on Knowledge and Data Engineering 9:813-825.
Roos G., Brattsand G., Landberg G., Marklund U., and Gullberg M. (1993). Expression of oncoprotein 18 in human leukemias and lymphomas. Leukemia 7:1538-46.
Scherf U., Ross D. T., Waltham M., Smith L. H., Lee J. K., Tanabe L., Kohn K. W., Reinhold W. C., Myers T. G., Andrews D. T., Scudiero D. A., Eisen M. B., Sausville E. A., Pommier Y., Botstein D., Brown P. O., and Weinstein J. N. (2000). A gene expression database for the molecular pharmacology of cancer. Nat Genet 24:236-44.
Sheskin D. (2000). Handbook of parametric and nonparametric statistical procedures. Chapman & Hall/CRC, Boca Raton.
Siegel S., and Castellan N. J. (1988). Nonparametric statistics for the behavioral sciences. McGraw-Hill, New York.
Silverstein C., Brin S., and Motwani R. (1998). Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery 2:39-68.
Taniguchi M., Miura K., Iwao H., and Yamanaka S. (2001). Quantitative assessment of DNA microarrays – comparison with Northern blot analyses. Genomics 71:34-9.
Waddell P. J., and Kishino H. (2000). Cluster inference methods and graphical models evaluated on NCI60 microarray gene expression data. Genome Inform Ser Workshop Genome Inform 11:129-40.
Zaki M. J. (2000). Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering 12:372-390.
Zaki M. J., Parthasarathy S., Ogihara M., and Li W. (1997). New algorithms for fast discovery of association rules. In Proceedings of the Third International Conference on Knowledge Discovery and Data Mining (KDD-97).
Zhou Y., Gwadry F. G., Reinhold W. C., Miller L. D., Smith L. H., Scherf U., Liu E. T., Kohn K. W., Pommier Y., and Weinstein J. N. (2002). Transcriptional regulation of mitotic genes by camptothecin-induced DNA damage: microarray analysis of dose- and time-dependent effects. Cancer Res 62:1688-95.
Chapter 18
GLOBAL FUNCTIONAL PROFILING OF GENE EXPRESSION DATA
Sorin Draghici¹ and Stephen A. Krawetz²
¹ Dept. of Computer Science, Karmanos Cancer Institute and the Institute for Scientific Computing, Wayne State University, 431 State Hall, Detroit, MI, 48202
e-mail: sod@cs.wayne.edu
² Dept. of Obstetrics and Gynecology, Center for Molecular Medicine and Genetics, and the Institute for Scientific Computing, Wayne State University
e-mail: steve@compbio.med.wayne.edu
1. CHALLENGES IN TODAY’S BIOLOGICAL RESEARCH
Molecular biology and genetics are currently at the center of an informational revolution. Data gathering capabilities have greatly surpassed data analysis techniques. If we were to imagine the Holy Grail of life sciences, we might envision a technology that would allow us to fully understand the data at the speed at which it is collected. Sequencing, localization of new genes, functional assignment, pathway elucidation, and understanding the regulatory mechanisms of the cell and organism should be seamless. Ideally, we would like knowledge manipulation to become tomorrow what goods manufacturing is today: a highly automated process producing more goods, of higher quality and in a more cost-effective manner than manual production. In a sense, knowledge manipulation is now reaching the pre-industrial age. Our farms of sequencing machines and legions of robotic arrayers can now produce massive amounts of data, but using it to manufacture highly processed pieces of knowledge still requires skilled masters painstakingly working through small pieces of raw data one at a time. The ultimate goal is to automate this knowledge discovery process.
Data collection is easy; data interpretation is difficult. Typical examples of high-throughput techniques able to produce data at a phenomenal rate include shotgun sequencing (Bankier, 2001; Venter et al., 2001) and gene expression microarrays (Lockhart et al., 1996; Schena et al., 1995; Shalon et al., 1996). Researchers in structural genomics have at their disposal sequencing machines able to determine the sequence of approximately 100 samples every 3 hours (see, for instance, the ABI 3700 DNA analyzer from Applied Biosystems). The machines can be set up to work in a continuous flow, which means data can be produced at a theoretical rate of approximately 800 sequences per day per machine. Considering a typical sequence segment length of about 500 base pairs, it follows that one machine alone can sequence approximately 400,000 nucleotides per day. This enormous throughput enabled impressive accomplishments such as the sequencing of the human genome (Lander et al., 2001; Venter et al., 2001). Recent estimates indicate there are 306 prokaryotic and 195 eukaryotic genome projects currently being undertaken in addition to 93 published complete genomes (Bernal et al., 2001). Currently, our understanding of the role played by various genes seems to be lagging far behind their sequencing. Yeast is an illustrative example. Although the 6,600 genes of its genome have been known since 1997, only approximately 40% of them have known or inferred functions.
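The sequencing throughput estimate quoted above can be checked with a few lines of arithmetic. This is only a sketch of the calculation in the text; the constant names are illustrative, and continuous 24-hour operation is the same idealized assumption the text makes.

```python
# Figures taken from the text: ~100 samples per 3-hour run, ~500 bp per segment.
SAMPLES_PER_RUN = 100
HOURS_PER_RUN = 3
BASES_PER_SEGMENT = 500

# Idealized continuous operation: 24 hours of back-to-back runs.
runs_per_day = 24 // HOURS_PER_RUN
sequences_per_day = SAMPLES_PER_RUN * runs_per_day
bases_per_day = sequences_per_day * BASES_PER_SEGMENT

print(f"{sequences_per_day} sequences/day, {bases_per_day} nucleotides/day")
```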
A second widely used high-throughput genomic technique is the DNA microarray technology (Eisen et al., 1998; Golub et al., 1999; Lockhart et al., 1996; Schena et al., 1995). In its most general form, the DNA array is a substrate (nylon membrane, glass or plastic) on which DNA is deposited in localized regions arranged in a regular, grid-like pattern. The DNA array is subsequently probed with complementary DNA (cDNA) obtained by a reverse transcriptase reaction from the mRNA extracted from a tissue sample. The cDNA is fluorescently labeled with a dye, and subsequent illumination with an appropriate light source provides an image of the array. (Alternative detection techniques include using radioactive labels.) After an image processing step is completed, the result is a large number of expression values. Typically, one DNA chip will provide expression values for thousands of genes. For instance, the recently released Affymetrix chip HGU133A contains 22,283 genes. A typical experiment will involve several chips and generate hundreds of thousands of numerical values in a few days.
The continuous use of such high-throughput data collection techniques over the years has produced a large amount of heterogeneous data. Many types of genetic data (sequence, protein, EST, etc.) are stored in many different databases. The existing data is neither perfect nor complete, but reliable information can be extracted from it. The first challenge faced by today’s researchers is to develop effective ways of analyzing the huge amount of data that has been and will continue to be collected (Eisenberg et al., 2000; Lockhart et al., 2000; Vukmirovic et al., 2000). In other words, there is a need for global, high-throughput data analysis techniques able to keep pace with the available high-throughput data collection techniques.
The second challenge focuses on the type of discoveries we should be seeking. The current frontiers of knowledge span two orthogonal directions. Vertically, there are different levels of abstraction such as genes, pathways and organisms. Horizontally, at each level of abstraction there are known, hypothesized and unknown entities. For instance, at the gene level, there are genes with a known function, genes with an inferred function, genes with an unknown function and completely unknown genes. In any given pathway, there are known interactions, inferred interactions and completely unknown interactions. However, the vertical connections between the levels are, in many cases, limited to the membership relationships of genes associated with known pathways.
Most available techniques focus on the horizontal direction, trying to expand the knowledge frontier from known entities to unknown entities or trying to identify the specific entities involved in a given condition. For instance, there are many approaches to identifying the genes that are differentially expressed in a specific condition. Such techniques include fold-change (DeRisi, 1997; ter Linde et al., 1999; Wellmann et al., 2000), unusual ratio (Tao et al., 1999; Schena et al., 1995; Schena et al., 1996), ANOVA (Aharoni et al., 1975; Brazma et al., 2000; Draghici et al., 2001; Draghici et al., 2002; Kerr et al., 2000; Kerr and Churchill, 2001a; Kerr and Churchill, 2001b), model-based maximum likelihood (Chen et al., 1997; Lee et al., 2000; Sapir et al., 2000), hierarchical models (Newton et al., 2001), univariate statistical tests (Audic and Claverie, 1997; Claverie et al., 1999; Dudoit et al., 2000), clustering (Aach et al., 2000; Ewing et al., 1999; Heyer et al., 1999; Proteome, 2002; Tsoka et al., 2000; van Helden et al., 2000; Zhu and Zhang, 2000), principal component analysis (Eisen et al., 1998; Hilsenbeck et al., 1999; Raychaudhuri et al., 2000), singular value decomposition (Alter et al., 2000), independent component analysis (Liebermeister, 2001), and gene shaving (Hastie et al., 2000). However, the task of establishing vertical relationships, such as translating sets of differentially regulated genes into an understanding of the complex interactions that take place at pathway level, is much more difficult, although techniques for it have started to appear (e.g., inferring gene networks (DeRisi et al., 1997; D’haeseleer et al., 2000; Roberts et al., 2000; Wu et al., 2002), function prediction (Fleischmann et al., 1999; Gavin et al., 2002; Kretschmann et al., 2001; Wu et al., 2002), etc.).
Thus, the second challenge is to establish advanced methods and techniques able to make such vertical inferences or at least to propose such potential inferences for human validation. In other words, the challenge is to extract system level information from component level data (Ideker et al., 2001).
2. FUNCTIONAL INTERPRETATION OF HIGH-THROUGHPUT GENE EXPRESSION EXPERIMENTS
Microarrays enable the simultaneous interrogation of thousands of genes. Using such tools, researchers often aim at constructing gene expression profiles that characterize various pathological conditions such as cancer (Golub et al., 1999; Perou et al., 2000; van’t Veer et al., 2002). Various technologies, such as cDNA and oligonucleotide arrays, are now available together with a plethora of methods for analyzing the expression data produced by the chips. Independent of the platform and the analysis methods used, the result of a microarray experiment is, in most cases, a list of genes found to be differentially expressed between two or more conditions under study. The challenge faced by the researcher is to translate this list of differentially regulated genes into a better understanding of the underlying biological phenomena. The translation from a list of differentially expressed genes to a functional profile able to offer insight into the cellular mechanisms is a very tedious task if performed manually. Typically, one would take each accession number corresponding to a regulated gene, search various public databases and compile a list with, for instance, the biological processes that the gene is involved in. This task can be performed repeatedly, for each gene, in order to construct a master list of all biological processes in which at least one gene was involved. Further processing of this list can provide a list of those biological processes that are common between several of the regulated genes. It is expected that those biological processes that occur more frequently in this list would be more relevant to the condition studied. The same type of analysis could be carried out for other functional categories such as biochemical function, cellular role, etc.
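The manual procedure described above (look up each regulated gene, collect its biological processes, then tally how often each process occurs) can be sketched in a few lines. This is a minimal illustration only: the gene identifiers and the `ANNOTATIONS` table are hypothetical stand-ins for the results of database lookups, not real annotation data.

```python
from collections import Counter

# Hypothetical annotation table: gene identifier -> biological processes.
# In practice each entry would come from searching public databases.
ANNOTATIONS = {
    "GENE_A": ["mitosis", "oncogenesis"],
    "GENE_B": ["mitosis"],
    "GENE_C": ["glucose transport"],
    "GENE_D": [],  # a gene with no known annotation
}

def functional_profile(regulated_genes, annotations):
    """Count, for each process, how many regulated genes are involved in it."""
    profile = Counter()
    for gene in regulated_genes:
        # set() ensures a gene is counted at most once per process.
        profile.update(set(annotations.get(gene, [])))
    return profile

profile = functional_profile(["GENE_A", "GENE_B", "GENE_C", "GENE_D"], ANNOTATIONS)
for process, count in profile.most_common():
    print(f"{process}: {count} gene(s)")
```

Processes shared by several regulated genes rise to the top of the tally, which is the "master list" the text describes.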
3. FUNCTIONAL PROFILING WITH ONTO-EXPRESS
Onto-Express (OE) is a tool designed to facilitate this process. This is accomplished by mining known data and compiling a functional profile of the experiment under study. OE constructs a functional profile for each of the Gene Ontology (GO) categories (Ashburner et al., 2000): cellular component, biological process and molecular function as well as biochemical function and cellular role, as defined by Proteome (Proteome, 2002). The precise definitions for these categories and the other terms used in OE’s output can be found in GO (Ashburner et al., 2000). As biological processes can be regulated within a local chromosomal region (e.g. imprinting), an additional profile is constructed for the chromosome location. OE uses a database with a proprietary schema implemented and maintained in our laboratory (Draghici and Khatri, 2002). We use data from GenBank, UniGene, LocusLink, PubMed, and Proteome.
The current version of Onto-Express is implemented as a typical 3-tier architecture. The back end is a relational database implemented in Oracle 9i and running on a Sun Fire V880 (4 CPUs, 8 GB RAM, 200 GB) accessing a 500 GB RAID array and a tape jukebox backup. The application performing the data mining and statistical analysis is written in Java and runs on a separate server (Dell PowerEdge). The front end is a Java applet served by a Tomcat/Apache web server running on a Sun Fire V100 web server appliance.
OE’s input is a list of genes found to be regulated in a specific condition. Such a list may be constructed using any technology: microarrays, SAGE, Western blots (e.g., high-throughput PowerBlots (Biosciences, 2002)), Northern blots, etc. This is why the utility of this application goes well beyond the needs of microarray users.
The genes in the input list can be specified by accession number, Affymetrix probe ID or UniGene cluster ID. At present, the Onto-Express database contains human and mouse data. More organisms will be added as more annotation data become available. A particular functional category can be assigned to a gene based on specific experimental evidence or by theoretical inference (e.g., similarity with a protein having a known function). Onto-Express explicitly shows how many genes in a category are supported by experimental evidence (labelled “experimented”) and how many are predicted (“predicted”). Those genes for which it is not known whether they were assigned to the given functional category based on a prediction or on experimental evidence are reported as “non-recorded”. The results are provided in graphical form and emailed to the user on request. By default, the functional categories are sorted in decreasing order of number of genes, as shown in Figure 18.1. The functional categories can also be sorted by confidence (see details about the computation of the p-values below), with the exception of the results for chromosomes, which are always displayed in chromosome order. There is one graph for each of the biochemical function, biological process, cellular role, cellular component and molecular function categories. A specific graph can be requested by choosing the desired category from the pull-down menu and subsequently clicking the “Draw graph” button. Clicking on a category displays a hyperlinked list of the genes in that category. The list contains the UniGene cluster IDs uniquely identifying the genes. Clicking on a specific gene provides more information about that gene.
The following example will illustrate OE’s functionality. Let us consider an array containing 1,000 genes used to investigate the effect of a substance X. Using classical statistical and data analysis methods, we decide that 100 of these genes are differentially regulated by substance X. Let us assume that the 100 differentially regulated genes are involved in the following biological processes: 80 of the 100 genes are involved in positive control of cell proliferation, 40 in oncogenesis, 30 in mitosis and 20 in glucose transport (a gene may be involved in more than one process, so the counts need not sum to 100). These results are tremendously useful since they save the researcher the inordinate amount of effort required to go through each of the 100 genes, compile lists with all the biological processes and then cross-compare those biological processes to determine how many genes are in each process (Khatri et al., 2002). In comparison, a manual extraction of this information would take several weeks and would be less reliable and less rigorous.
The large number of genes involved in cell proliferation, oncogenesis and mitosis in the functional profile above might suggest that substance X affects a cancer pathway. However, a reasonable question is: what would happen if all genes on the array were involved in cell proliferation? Would the presence of cell proliferation at the top of the list be significant? Clearly, the answer is no. If most or all genes on the array are involved in a certain process, then the fact that this particular process appears at the top is not significant. To correct for this, the current version of the software allows the user to specify the array type used in the microarray experiment. Based on the genes present on this array, OE calculates the expected number of occurrences of a certain category.
Now, the data mining results are as in Table 18.1 and the interpretation of the functional profile appears to be completely different.
There are indeed 80 cell proliferation genes but, in spite of this being the largest number, we actually expected 80 such genes, so this is not significant. The same holds true for oncogenesis. Mitosis starts to be interesting because we expected 10 genes and we observed 30, which is 3 times more than expected. However, the most interesting category is glucose transport. We expected only 5 genes and we observed 20, i.e. 4 times more than expected. The emerging picture changes radically: instead of generating the hypothesis that substance X is a potential carcinogen, we may consider the hypothesis that X is correlated with diabetes.
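The comparison of observed and expected counts can be sketched as follows. Under random selection, the expected number of genes from a category among K selected genes is K·M/N, where M is the number of genes on the array in that category. The per-category M values below are not stated in the text; they are assumptions back-calculated from the expected counts quoted above (80, 40, 10 and 5 with N = 1,000 and K = 100).

```python
N = 1000  # genes on the array (from the example)
K = 100   # differentially regulated genes (from the example)

# category -> (M = genes on the array in the category [assumed], x = observed)
CATEGORIES = {
    "positive control of cell proliferation": (800, 80),
    "oncogenesis": (400, 40),
    "mitosis": (100, 30),
    "glucose transport": (50, 20),
}

def expected_count(m_on_array, n_total, k_selected):
    """Expected number of category genes among k randomly selected genes."""
    return k_selected * m_on_array / n_total

ratios = {}
for name, (m, x) in CATEGORIES.items():
    expected = expected_count(m, N, K)
    ratios[name] = x / expected
    print(f"{name}: observed {x}, expected {expected:.0f}, ratio {ratios[name]:.1f}")
```

Sorting by this ratio rather than by raw count puts glucose transport first, which is the reversal the text describes. The ratio alone does not say whether the excess could arise by chance.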
The problem is that an event such as observing 30 genes when we expect 10 can still occur just by chance. The next section explains how the significance of each category is calculated from four quantities: the number of genes on the array that belong to the category, M; the total number of genes on the array, N; the number of differentially regulated genes that belong to the category, x; and the total number of differentially regulated genes, K. The statistical confidence thus calculated will allow us to distinguish between significant events and possibly random events.

3.1 Statistical Approaches
Several different statistical approaches can be used to calculate a p-value for each functional category F. Let us consider there are N genes on the chip used. Any given gene is either in category F or not. In other words, the N genes are of two categories: F and non-F (NF). This is similar to having an urn filled with N balls of two colors such as red (F) and green (not in F). M of these balls are red and N - M are green. The researcher uses their choice of data analysis methods to select which genes are regulated in their experiments. Let us assume that they picked a subset of K genes. We find that x of these K genes are red and we want to determine the probability of this happening by chance.
So, our problem is: given N balls (genes) of which M are red and N - M are green, we pick randomly K balls and we ask what is the probability of having picked exactly x red balls. This is sampling without replacement because once we pick a gene from the chip, we cannot pick it again.
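The probability that a category occurs exactly x times just by chance in the list of differentially regulated genes is appropriately modeled by a hypergeometric distribution with parameters (N, M, K) (Casella, 2002):

$$P(X = x \mid N, M, K) = \frac{\binom{M}{x}\binom{N-M}{K-x}}{\binom{N}{K}}$$

Based on this, the probability of having x or fewer genes in F can be calculated by summing the probabilities of picking 0 or 1 or ... or x - 1 or x genes of category F (Tavazoie et al., 1999):

$$p = \sum_{i=0}^{x} \frac{\binom{M}{i}\binom{N-M}{K-i}}{\binom{N}{K}}$$

This corresponds to a one-sided test in which small p-values correspond to under-represented categories. The p-value for over-represented categories, i.e. the probability of picking x or more genes of category F, can be calculated as

$$p = 1 - \sum_{i=0}^{x-1} \frac{\binom{M}{i}\binom{N-M}{K-i}}{\binom{N}{K}}$$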
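The two one-sided tail probabilities described above can be sketched with exact integer arithmetic from the Python standard library. The function names are illustrative, and this is not Onto-Express's implementation. The numbers in the usage lines reuse the substance X example, with the per-category M values (100 for mitosis, 800 for cell proliferation) being assumptions back-calculated from the expected counts in the text.

```python
from math import comb

def hypergeom_pmf(x, N, M, K):
    """P(exactly x category genes among K drawn from N, of which M are in the category)."""
    return comb(M, x) * comb(N - M, K - x) / comb(N, K)

def p_under(x, N, M, K):
    """P(X <= x): small values indicate an under-represented category."""
    return sum(hypergeom_pmf(i, N, M, K) for i in range(0, x + 1))

def p_over(x, N, M, K):
    """P(X >= x): small values indicate an over-represented category.

    Summing the upper tail directly equals 1 - P(X <= x - 1) and avoids
    losing precision when the result is very small.
    """
    return sum(hypergeom_pmf(i, N, M, K) for i in range(x, K + 1))

# Mitosis in the example: 30 observed where 10 were expected.
p_mitosis = p_over(30, N=1000, M=100, K=100)
# Cell proliferation: 80 observed where 80 were expected.
p_proliferation = p_over(80, N=1000, M=800, K=100)
print(f"mitosis: {p_mitosis:.3g}, cell proliferation: {p_proliferation:.3g}")
```

With these assumed inputs, the mitosis p-value is very small while the cell proliferation p-value is unremarkable, matching the qualitative argument in the example.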
The hypergeometric distribution is difficult to calculate when the number of genes is large (e.g., arrays such as Affymetrix HGU133A contain 22,283 genes). However, when N is large, the hypergeometric distribution tends to the binomial distribution (Casella, 2002). A similar approach was used by Cho et al. to discern whether hierarchical clusters were enriched in specific functional categories (Cho et al., 2001).
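The closeness of the two distributions for a large array can be illustrated numerically. In the sketch below, N is the HGU133A size quoted in the text, while M, K and x are hypothetical values chosen only for illustration. The binomial uses K trials with success probability M/N, i.e. sampling with replacement.

```python
from math import comb

def hypergeom_upper_tail(x, N, M, K):
    """Exact P(X >= x) when sampling K of N genes without replacement."""
    numerator = sum(comb(M, i) * comb(N - M, K - i) for i in range(x, min(K, M) + 1))
    return numerator / comb(N, K)

def binomial_upper_tail(x, n, p):
    """P(X >= x) for n independent trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(x, n + 1))

N = 22283  # genes on the array (HGU133A, from the text)
M = 500    # hypothetical: genes on the array in category F
K = 200    # hypothetical: differentially regulated genes
x = 12     # hypothetical: regulated genes found in category F

h = hypergeom_upper_tail(x, N, M, K)
b = binomial_upper_tail(x, K, M / N)
print(f"hypergeometric: {h:.3e}, binomial: {b:.3e}")
```

Because K is a small fraction of N here, removing a drawn gene barely changes the odds for the next draw, which is why the two tail probabilities come out close.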