Berrar D. et al. - Practical Approach to Microarray Data Analysis


Nominal variables tell us which of several unordered categories a thing belongs to. For example, we can say a tumor is of type or category A, B, or C. Such variables exhibit the lowest degree of organization, since the set of values such a variable may assume possesses no systematic intrinsic organization or order. The only relation between the values of nominal variables is the identity relation. Because of the lack of an order relation, it is not possible to tell if one attribute value is greater than another or whether one value is closer to a certain value than another. However, we can tell if two values are equal or not equal. Given relevant background knowledge (human-based or computerized), it is possible to define more complex relations on nominal variables.

Ordinal variables or scales allow us to put things in order, because the set of values associated with an ordinal variable exhibits an intrinsic organization, which is defined by a total order relation. Therefore, we can tell if one value is bigger or smaller than another, but we normally cannot quantify or measure the degree of difference or distance between two values. For example, if the observations x, y, and z are ranked 5, 6, and 7, respectively, we can tell that x < y < z, but not whether (z – y) < (y – x).

Interval-scaled variables exhibit an intrinsic organization, which not only allows us to establish if a value is smaller or greater than another (total order relation) but also to determine a meaningful difference or distance between two values or to add and subtract values. This property of a meaningful difference is particularly important for distance-based or similarity-based approaches like clustering.

The values of real-scaled variables show a higher level of intrinsic organization than ordinal-scaled and interval-scaled variables. Besides order or rank information and the possibility to meaningfully add and subtract values, real variables also allow us to multiply and divide, as they are measured against a meaningful zero point. Typical examples of real variables include weight, age, length, volume, and so on. In microarray experiments involving controls or references we can define a natural zero point, against which overexpression and underexpression can be defined. Therefore, it is perfectly legal to say that gene a is twice as highly expressed as gene b.

Although not a completely new category, binary or dichotomous variables could be considered as a special case. Binary variables can assume two possible values. For microarrays, this could be two categories, one for overexpression and one for underexpression, or one for overexpression and one for the absence of overexpression (lumping together balanced and underexpressed values). Often binary variables are coded with the numeric values 1 and 0. The advantage of this coding scheme is that binary variables can be treated as interval variables or attributes. However, treating binary variables as if they were interval-scaled implies that they are assumed to be symmetric (i.e., each value is equally important). If they are symmetric, then interchanging the codes will still result in the same score, for example, a degree of dissimilarity measured by the Euclidean distance measure. If the two entities, e.g., expression profiles, match on either code they are perceived to have something in common, and in case of a mismatch they are considered different (to the same extent, independently of the coding). However, this could be a problem if the two values are not equally important. If one, for instance, uses the code 1 for overexpression and 0 for the absence of overexpression, it may be safe to say that two expression profiles have something in common if they match on a variable on code 1. This is not so clear if they match on the same variable on code 0. In the case of asymmetric binary variables, it is recommended to use non-invariant coefficients, such as the Jaccard coefficient, to calculate similarities or dissimilarities (Kaufman and Rousseeuw, 1990).
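To make the distinction concrete, the following Python sketch (not from the book; it uses hypothetical binary profiles where 1 codes overexpression and 0 its absence) contrasts the symmetric simple matching coefficient with the Jaccard coefficient, which ignores 0-0 matches.

def simple_matching(a, b):
    """Fraction of positions where both profiles agree (1-1 or 0-0 matches)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def jaccard(a, b):
    """Fraction of 1-1 matches among positions where at least one profile is 1;
    0-0 matches carry no information about overexpression and are ignored."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0

profile_1 = [1, 0, 0, 0, 0, 1, 0, 0]
profile_2 = [1, 0, 0, 0, 0, 0, 0, 0]

print(simple_matching(profile_1, profile_2))  # 0.875 - inflated by shared 0s
print(jaccard(profile_1, profile_2))          # 0.5   - based only on overexpression

The simple matching score is dominated by the many shared 0s, whereas the Jaccard coefficient reflects only the shared overexpression call, which is the behavior recommended for asymmetric binary variables.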

4.2 Process

Once the raw digitized images have been produced by Steps 1 to 4 of the overall analysis process (Figure 1.4), a highly iterative inner loop of processing kicks in. The analysis steps and information flow of this process are depicted in Figure 1.6. What characterizes this process is its computerized nature, meaning that most of the involved data and analytical steps are realized on a computer. This makes it possible to explore and refine the different processing alternatives until a satisfactory conclusion is reached.


4.2.1 Pre-Processing – On the Rawness of Raw Data

People involved in microarray analysis often speak of “raw” data. The rawness characterizes the indigestible state of some data before a processing step, normally one that reduces the data volume. However, depending on the person you are dealing with, the precise meaning of “raw” may differ. For an image analyst, raw data probably refers to the analog signal produced by a laser scanner. A statistician might consider the digitized image or the integrated numerical data matrix as raw data. For a biologist, raw data might be manifest in the descriptions of a discovered pattern or a constructed model (e.g., cluster definitions or regression coefficients). In the jet-set-paced world of a senior investigator, summary statistics and diagrams may fall into the raw-data category. And finally, the editor-in-chief of a scientific journal may refer to the initial submission of a paper reporting on the findings of a microarray experiment as raw data.

Here, we use the term raw data to denote the collection of digitized images – one image per hybridization experiment – arising out of Step 5 of the overall analysis process (Figure 1.4). These data must be computationally gathered, processed, and integrated with other relevant information.

First, each probe spot on the images must be analyzed to establish quantitative estimates of the red and green content. Here, normalization methods may be applied to compensate for systematic measurement errors due to equipment imperfection. This procedure summarizes the raw image data and stores them in an intermediate, more compact representation where each spot is described by a set of numeric values.

Second, if multiple spots on the image represent multiple expression measurements for the same gene, these measurements must be combined to obtain a single expression level estimate for the gene. This is where normalization procedures come in, to address issues of measurement variation due to instrumentation imperfection and biological variation. At this point one may also decide to represent each measurement by a set of absolute values, a red-green ratio, a fold change, or some other format.
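As a minimal illustration of this step, the following Python sketch combines hypothetical replicate spot intensities for a single gene into one red/green ratio; the use of the median as the combining rule is an assumption for illustration, not a procedure prescribed by the text.

import statistics

# Replicate spots for one gene on one array (hypothetical intensities).
red_intensities   = [1520.0, 1480.0, 1610.0]   # Cy5 channel
green_intensities = [ 760.0,  800.0,  790.0]   # Cy3 channel

# One ratio per replicate spot, then a single robust estimate per gene.
ratios = [r / g for r, g in zip(red_intensities, green_intensities)]
expression_estimate = statistics.median(ratios)
print(f"combined red/green ratio: {expression_estimate:.2f}")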

Third, the sets of measurements from each hybridization experiment must be integrated into a single data matrix, similar to the one presented in Figure 1.5. To compensate for array-to-array measurement variation and to facilitate comparison between different hybridization experiments, a normalization procedure called standardization is employed at this point. An example of the resulting integrated data matrix is depicted in Table 1.2.


Now the data is almost “well-done”, ready to be devoured by the numerical analysis methods further down the processing stream. However, to facilitate a more effective and efficient performance of the subsequent processing steps of the inner loop (Figure 1.6), a further data manipulation step called transformation may be necessary. The objective of data transformation is to reduce the complexity of the data matrix and to represent the information in a different, more useful format.

The first four steps of the inner loop shown in Figure 1.6 depict these data manipulation operations – image analysis, normalization and standardization, data matrix synthesis, and transformation. Collectively, we refer to these as pre-processing. The following subsections discuss some of the issues and concepts of pre-processing in more detail. However, for predictive modeling, an important conceptual issue has to be considered. In order to closely “simulate” a realistic prediction scenario, the first thing to do after image analysis is to set aside randomly selected cases. These data are called the validation data or validation set, with which the final prediction model is to be validated. Generally speaking, 10-20% of the whole data set is sufficient as validation data. All pre-processing steps considered hereafter have to be performed on the validation data separately, just as would be the case for new cases (application set) that have to be predicted in a real-life prediction scenario.
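A minimal Python sketch of this hold-out step, with hypothetical sample identifiers, might look as follows; the 20% split and the random seed are illustrative choices.

import random

random.seed(0)
samples = [f"patient_{i}" for i in range(1, 11)]   # e.g., 10 hybridizations

n_validation = max(1, round(0.2 * len(samples)))   # set aside 10-20% of the cases
validation_set = random.sample(samples, n_validation)
working_set = [s for s in samples if s not in validation_set]

# All subsequent pre-processing is carried out on working_set; the
# validation_set is processed separately, as new cases would be.
print(validation_set)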

4.2.1.1 Missing Values

There are many reasons why the measurement of a gene’s expression level for an individual sample-gene combination may have failed. In this case, the best that the data matrix can offer is to simply flag the failed measurement, that is, to report a missing value. Table 1.1 and Table 1.2 illustrate missing values in the context of our four-gene experiment. There are a number of strategies for dealing with missing values. In a way, the missing value handling approaches could be considered as a hybrid between normalization and transformation (discussed below).

The first obvious, albeit drastic, choice for dealing with such errors is to remove the affected expression profile (gene or array profile) from the data matrix altogether. The drawback of this radical measure is that it also removes other valuable data. In the worst case, this approach can lead to the removal of N × M – min(N,M) valid expression values in the presence of only min(N,M) missing values, leaving little left to be analyzed.

The second approach is to ignore the problem and leave the data matrix as it is. Perhaps one wants to use a special code to indicate the missing value, so that it cannot be confused with a valid measurement. There are many analytical methods that can inherently deal with missing values. Decision trees, for example, do have a “natural” mechanism for handling missing values by focusing on the existing measurements. Intuitively, this approach could be likened to computing an arithmetic mean of N values, M of which are flagged as missing. We compute the sum of the N – M valid values and divide the sum by N – M. Clearly, the “ignore” approach works only if the proportion of missing values is within acceptable limits.

A clever third way of correcting errors due to missing values is to “replace” the offending missing item with a reasonable or plausible substitute value. The methods for this kind of error correction are collectively referred to as missing value imputation. A simplistic approach to missing value imputation is to assign some average value in place of the missing value. For example, one could replace the missing value of gene b for patient 5 (Table 1.1) by the average of the expression levels of gene b across the patients in condition tumor A, the condition to which patient 5 belongs. Another straightforward approach is to use the level representing balanced expression (red-green ratio equals 1.00) in place of the missing values. This is the method we use to replace the two missing values of our four-gene experiment. Notice that missing value imputation may be carried out after some other transformations have been completed.
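The following Python sketch, using hypothetical ratio values rather than the numbers of Table 1.1, illustrates both the averaging strategy and the balanced-expression strategy for a single gene profile with one missing value.

import math
import statistics

gene_b = [1.10, math.nan, 0.95, 1.05, 0.90]   # expression ratios across samples

# Strategy 1: replace the missing value by the average of the valid values.
valid = [v for v in gene_b if not math.isnan(v)]
imputed_mean = [statistics.mean(valid) if math.isnan(v) else v for v in gene_b]

# Strategy 2: replace the missing value by the balanced-expression ratio 1.00,
# the approach used for the four-gene example in the text.
imputed_balanced = [1.00 if math.isnan(v) else v for v in gene_b]

print(imputed_mean)
print(imputed_balanced)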

4.2.1.2 Normalization and Standardization

Taken together, the technical gear, laboratory protocol, and human element employed to measure transcript abundance can be viewed as a complex scientific “instrument”. In such a system errors creep in from all directions at the same time, and in all shapes and sizes. These errors are due to imperfections in the instrument and the processes and materials involved in using it. This abstract view of a gene expression instrument considers sample selection, sample preparation, hybridization, and so on, as integral parts of the instrument. Given this view, we are able to distinguish measurement errors due to instrumentation imperfection from measurement variations due to biological variations within the studied specimens. Although measurement variation due to biological fluctuations has implications for microarray analysis, such variations are not measurement “errors” in the sense of a deviation from a true expression level. They are simply a part of reality. If nature has decided that the height of 12-year-olds varies in some fashion, one cannot get rid of this variation by measuring height more precisely or by somehow trying to correct it.

Ideally, a numerical value in the expression matrix reflects the true level of transcript abundance (or some abundance ratio) in the measured genesample combination. However, as our instrument is inherently imperfect, the measured value, measurement, that we obtain from the instrument usually deviates from the true expression level, truth, by some amount called error.


The relationship between the true and measured values and the error is described by Equation 1.2:

measurement = truth + error (1.2)

Generally, a measurement error can be attributed to two distinct causes or error components – bias and variance; the relationship between bias, variance, and error is described by Equation 1.3:

error = bias + variance (1.3)

In this error model, bias describes a systematic tendency of the expression measurement instrument to detect either too low or too high values. The amount of this consistent deviation is more or less constant. We know, for example, that the emitted light intensities from the dyes Cy3 and Cy5 are not the same for the same reporter molecule quantities. This introduces a systematic measurement error that, if understood properly, can be compensated for at a later stage, for example, at the image-processing step.

The measurement error due to variance is often normally distributed, meaning that wrong measurements in either direction are equally frequent, and that small deviations are more frequent than large ones. Examples of this kind of error include a whole range of lab processes and conditions and manufacturing variations. A standard way of addressing this class of error is experiment replication.

With these considerations and the concepts presented in Equations 1.2 and 1.3, we summarize our general measurement error model as follows:

measurement = truth + error = truth + bias + variance

Notice that the variance term in Equation 1.3 refers to the variations affecting our readout based on the fluctuations in the instrument (and the involved processes). There is no need to incorporate variations occurring in the underlying specimen into this error model, as the term truth reflects the real quantity, no matter how much this quantity varies from sample to sample. Clearly, as replication is often employed to address measurement variation, the task of quantitatively disentangling the two elements of variation (instrumentation imperfections and biological fluctuations) may be difficult.

Normalization and standardization are numerical methods designed to deal with measurement errors (both bias and variance) and with measurement variations due to biological variation. In contrast, transformation refers to a class of numerical methods that aim to represent the expression levels calculated by normalization and standardization in a format facilitating more effective and efficient analysis downstream.


We differentiate between two broad classes of measurement variations caused by instrumentation imperfection and biological variation: (a) those that affect individual measurements, and (b) those that affect an entire array or parts of an array.

There are numerous sources for measurement variation that affect single measurements, including biological variation, probe imperfection, the tendency of low expression levels to vary more than high levels, and so forth. The principal approach to dealing with single-measurement variations is replication. But beware! Instead of carving up the same mouse over and over again, you should go all the way back when you replicate an experiment. This may be time-consuming, expensive, and ethically questionable. However, with multiple measurements it is possible to apply statistical methods to estimate the true quantity more accurately and to judge the error of this estimate. Simple t-statistics, for example, can be applied to accomplish this task. With this approach one could, for example, state a measured expression level as: “1.35 ± 0.03, confidence = 99%”. Besides replication, improvement in experimental design is also an effective approach to addressing the variability issues of single measurements.
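As a sketch of how replication supports such statements, the following Python code (using hypothetical replicate values and SciPy's t-distribution helpers) computes a 99% confidence interval around the mean of repeated measurements.

import numpy as np
from scipy import stats

replicates = np.array([1.33, 1.36, 1.34, 1.37, 1.35])   # repeated measurements

mean = replicates.mean()
sem = stats.sem(replicates)                              # standard error of the mean
low, high = stats.t.interval(0.99, len(replicates) - 1, loc=mean, scale=sem)

# Prints something like "1.35 +/- 0.03, confidence = 99%".
print(f"{mean:.2f} +/- {high - mean:.2f}, confidence = 99%")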

Each array or chip measures the gene expression levels of many genes for a single sample (under a certain experimental condition). In a way, the hybridization of a single microarray element could be considered as an experiment in its own right. A typical microarray study carries out many such experiments, several tens to several hundreds. Due to laboratory problems, manufacturing imperfections, multiple investigators carrying out the individual experiments, biological variation, and other sources, array-to-array measurement variations are bound to creep in, more than one would ideally like. For studies (like predictive toxicology) that require the comparison of array profiles, this global measurement variation is a problem. Clearly, replication of the entire array is an approach to consider here. For instance, by hybridizing two arrays per sample, you could arrange the probe spots differently, so as to compensate for certain structural biases. In any case, even if you can afford to do multiple arrays per sample, you still need to address the issue of inter-array variability. Numerical procedures known as global normalization or standardization are designed to address this problem. As these methods attempt to place each array on a comparable scale, this process is also called scaling.

The simplest approach to scaling involves the multiplication of each value on an array with an array-specific factor, m, such that the resulting mean is the same for all arrays. Another procedure along these lines rescales each value on an array so that the mean for each array equals 0 and the standard deviation equals 1. More sophisticated approaches fit a straight line to the data points of an array, and then use the line’s parameters to transform the data points so that they spread about a “normal” line with a slope of 45 degrees. This procedure is applied to all the arrays involved, so that the 45-degree line is used as a common scale for all arrays. Other, yet more sophisticated approaches exist. All these methods make certain assumptions about the data. As the measurement of gene expression is a highly complex process, it is sometimes difficult to say if these assumptions are justified or not (Branca and Goodman, 2001; Quackenbush, 2001; Wu, 2002).
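A minimal Python sketch of the second of these procedures, applied to a small hypothetical data matrix with arrays as rows and genes as columns, might look as follows.

import numpy as np

# Hypothetical log-ratio values: rows = arrays (samples), columns = genes.
data = np.array([
    [0.8, 1.9, 3.1, 0.4],
    [1.2, 2.5, 2.8, 0.6],
    [0.9, 2.2, 3.5, 0.5],
])

means = data.mean(axis=1, keepdims=True)
stds = data.std(axis=1, ddof=1, keepdims=True)
scaled = (data - means) / stds        # each array now has mean 0 and sd 1

print(scaled.mean(axis=1).round(6))   # ~0 for every array
print(scaled.std(axis=1, ddof=1))     # 1 for every array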

The data matrix generated by our hypothetical four-gene experiment (Table 1.1) contains the “semi-raw” measurements obtained from the image analysis stage. This data has not been normalized. To illustrate normalization, we will transform the data in a three-step process.

First, we replace the two missing values by 1s, indicating a balanced expression.

Second, we log-transform each value using the base-2 logarithm. Strictly speaking, this is a form of data transformation (as opposed to normalization). The reason for performing this operation is the positive skewness of the ratio data. This means that a large proportion of the measured values is confined to the lower end of the observed scale. As we use a ratio representation, values representing underexpression are all crammed into the interval between 0 and 1, whereas values denoting overexpression are able to roam free in the range of 1 to 6. This situation is shown in the left part of Figure 1.7. There are statistical reasons why a more even or normal spread of the data is preferred by analytical methods. This desired property is called normality, and the underlying concept for describing it is the normal distribution. The base-2 log-transformation redistributes the skewed data in the desired manner (right part of Figure 1.7). For more realistic microarray data sets, the visual difference between the shapes of the skewed and log-transformed data is much more pronounced.
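A small Python example with hypothetical ratios illustrates the effect: after the base-2 log-transformation, underexpression and overexpression become symmetric around zero.

import math

ratios = [0.25, 0.5, 1.0, 2.0, 4.0]        # red/green ratios (hypothetical)
log_ratios = [math.log2(r) for r in ratios]

print(log_ratios)   # [-2.0, -1.0, 0.0, 1.0, 2.0]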


Third, we rescale the log-transformed data by applying a mean centering method to each of the 10 array profiles (in this case, sample or patient data sets). This method proceeds as follows. For each log-transformed array profile, we (a) compute the mean and the standard deviation of the profile, (b) subtract the mean from each expression value, which centers the already more or less symmetrically distributed values around 0, and (c) standardize the zero-centered values in terms of standard-deviation units relative to the mean by dividing each shifted value by the standard deviation. So a new expression value is derived from an original value using:

new value = (original value – mean) / standard deviation

Applying this three-step procedure to the four-gene expression matrix shown in Table 1.1 produces the normalized matrix depicted in Table 1.2.
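The following Python sketch applies the three steps to a single hypothetical array profile; the numbers are illustrative and not those of Table 1.1.

import math
import statistics

# (1) replace missing values by the balanced-expression ratio 1.0,
# (2) take the base-2 logarithm,
# (3) mean-center and divide by the standard deviation of the profile.
array_profile = [2.10, math.nan, 0.45, 1.30]   # red/green ratios for one sample

step1 = [1.0 if math.isnan(v) else v for v in array_profile]
step2 = [math.log2(v) for v in step1]

mean = statistics.mean(step2)
sd = statistics.stdev(step2)
normalized = [(v - mean) / sd for v in step2]

print([round(v, 2) for v in normalized])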

After having performed the data transformation steps on our four-gene experiment data, we are now ready to have a closer look at the data. Figure 1.8 visualizes the expression levels for the gene profiles of genes a and b. By means of this visualization we can confirm our previous hunch that gene a is differentially expressed across the two experimental conditions, whereas gene b is not.


One way of approaching this decision more formally is to hypothesize that the expression levels of a gene, g, observed across two conditions, A and B, are in fact coming from the same population, and that the observed differences and variations are indeed within the limits of what we would expect in this case. Casting this into a testable hypothesis, we assume two different, normally distributed, equal-variance populations, one for condition A and one for B, with the corresponding population means μA and μB, respectively. In Figure 1.8 these populations are depicted by the bell-shaped curves at the right-hand side of the diagram. Then, we formulate the null hypothesis H0: μA = μB and proceed to test it with, for example, a standard t-test.
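A minimal Python sketch of such a test, using SciPy's two-sample t-test on hypothetical normalized expression values for the two conditions, might look as follows.

import numpy as np
from scipy import stats

# Normalized expression values for one gene under conditions A and B (hypothetical).
condition_A = np.array([1.21, 0.98, 1.35, 1.10, 1.05])
condition_B = np.array([-0.85, -1.10, -0.95, -1.20, -0.90])

# Two-sided, equal-variance two-sample t-test of H0: equal population means.
t_statistic, p_value = stats.ttest_ind(condition_A, condition_B)

# At a 95% confidence level, reject H0 if the p-value falls below 0.05.
print(f"t = {t_statistic:.2f}, p = {p_value:.4f}")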

Carrying out the two-tailed t-test calculation for the two gene profiles of genes a and b shown in Figure 1.8 yields the following results (confidence level = 95%;
