Physics of Bio-Molecules and Cells
a library enriched for the mRNAs present in one tissue but not in another; it also generates a lot of noise.
A company called Affymetrix has a broad patent covering a range of techniques having to do with putting lots of different DNAs on solid surfaces. The patents are so broad that no one else has been able to sell DNA on a solid surface, so Affymetrix enjoys a dominant position in the marketplace. Affymetrix manufactures RNA hybridization arrays under the trade name GeneChip; they are the other popular array, perhaps the most popular by “arrays used” count. The technology is a photolithographic adaptation of the solid-phase synthesis in broad use for sequence-directed oligonucleotide synthesis. Because photolithography works in parallel, the number of spots (called features or probes in this context) is not a concern. But the synthesis is directed by sequence, and so the sequence of every spot must be known in advance so that the photolithographic masks can be laid out. Also, since about 4 masks are needed per position, the sequence can’t get too long: a couple dozen letters is a practical limit. This poses a conundrum: short DNA segments are not expected to have the sharp complementary-sequence specificity of longer DNA fragments, and cross-hybridization is expected to happen. To solve this, Affymetrix uses a two-fold approach in GeneChip arrays. First, a differential signal is generated, by taking the difference between a “perfect match” sequence (PM) and a “single mismatch” (MM) obtained by replacing the middle letter in the PM sequence by its opposite letter. Both probes together are called a probe pair. The rationale behind this construct is that the MM will bind the target sequence less well, but will get at full strength all of the cross-hybridization and other physical noise sources, so the difference between the PM and MM should eliminate cross-hybridization. Second, redundancy is introduced, by tiling the target gene sequence with several (sometimes overlapping) PMs.
Current chip versions use 16 to 20 PMs of length 25 base pairs; letter number 13 is then changed to its complement to generate an equal number of MMs. The whole set of probe pairs tiling a target is called a probeset. The sequences are considered by Affymetrix to be proprietary information and are not disclosed to the public, like us.
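The probe-pair construction just described can be sketched in a few lines. I am assuming that "opposite letter" means the Watson–Crick complement of the central base (letter number 13 of a 25-mer), and the sequence below is made up, since the real probe sequences are proprietary:

```python
# Sketch of the PM/MM probe-pair construction described above.
# Assumption: the "opposite letter" is the Watson-Crick complement of
# the central base (letter number 13 of a 25-mer, counting from 1).

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def mismatch_probe(pm: str) -> str:
    """Return the MM probe: the PM sequence with its middle base complemented."""
    if len(pm) % 2 == 0:
        raise ValueError("PM probe should have odd length (e.g. 25)")
    mid = len(pm) // 2  # index 12 for a 25-mer, i.e. letter number 13
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

pm = "ACGTACGTACGTACGTACGTACGTA"  # hypothetical 25-mer (real probes are proprietary)
mm = mismatch_probe(pm)           # differs from pm only at the central letter
```

A probeset is then just 16–20 such pairs tiled along the target sequence.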
M.O. Magnasco: Three Lectures on Biological Networks

Thus, in GeneChip arrays, we get between 32 and 40 numbers (the brightnesses at each probe in the probeset), out of which we need to reconstruct a single number, the mRNA concentration. There are obviously infinitely many different functions of 40 variables returning identical values for “ideal” measurements, but having inequivalent noise-rejection properties on imperfect or noisy data. One standard algorithm is provided in the Affymetrix software suite, and many researchers are completely unaware of its shortcomings, or even that it may be bypassed and your own favourite algorithm used instead. I will describe the problems we encountered when studying this issue in the last section.
2.3 Analysis of array data
So, how is this data then used? There are two prototypical experimental designs: time series and condition clustering. Time series works as follows. A culture of cells (say, fibroblasts, or yeast cells, or...) is “synchronized”, i.e., all cells are brought to an appropriately similar state. In the case of fibroblasts, they may be starved for a particular serum growth factor; or yeast cells may be arrested at a given stage in their cell cycle. The cultures are then given the appropriate “start” signal, be it the growth factor or nutrient. Samples of the culture are taken periodically, their RNA extracted, amplified and fluorescently labeled, and then hybridized on the chips. The output of such an experiment is an N × M table of numbers, where N is the number of spots on the array or genes being probed, and M is the number of time points; gene expression patterns are then grouped by similarity using clustering techniques (along the N gene direction), and sorted by the relative time order in which activation or repression happens. The hope here is to observe a cascade of transcriptional events unfold.
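As a toy illustration of clustering along the gene direction, one can correlate time-course profiles and group genes whose profiles agree. The data, the Pearson-correlation similarity measure, and the 0.9 threshold below are all illustrative choices, not a prescription from the text:

```python
import numpy as np

# Toy N x M expression table: 3 gene profiles over 8 time points
# (synthetic data; real tables are ~10^4 genes by ~10^1-10^2 points).
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 8)
data = np.vstack([
    np.sin(t),                                   # gene 0
    np.sin(t) + 0.05 * rng.standard_normal(8),   # gene 1: same pattern, noisy
    np.cos(t),                                   # gene 2: different pattern
])

# Pearson correlation between gene profiles, i.e. along the N gene direction.
corr = np.corrcoef(data)

# Naive grouping with an illustrative similarity threshold of 0.9.
similar = corr[0, 1] > 0.9      # genes 0 and 1 end up in the same cluster
dissimilar = corr[0, 2] > 0.9   # genes 0 and 2 do not
```

Real analyses use hierarchical or other clustering schemes on the full correlation matrix; the point here is only the direction along which the grouping is done.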
An apparently simpler design is one in which a number of dissimilar samples are thrown together. For instance, one may collect a number of clinical tissue biopsies, say polyps from colon biopsies and nearby healthy tissue. In this case, not only are genes clustered together by similar expression profiles, but the experiments also get clustered together. Here one is trying to get a transcriptional signature for a classification of colon tumors: hopefully the top level of the clustering will divide healthy from cancerous tissue, and subsequent branches of the clustering will reveal distinct tumor types. Clustering analysis was fairly well established already before gene chips, but gene chips have provided a strong impetus, and so a flurry of new methods has appeared [42].
Analysis of array data differs strongly from established time-series analysis methods, because the data has the wrong aspect ratio for proper time-series analysis. For example, a flurry of methods for dynamical system identification was created in the “chaos” boom of the seventies and eighties. Most of them require a number of time slices which increases exponentially with the dimension of the attractor to be reconstructed [23]. This is because high-dimensional spaces are exponentially large: they require very many points to be “filled” so that their volume is sampled throughout. (Consider how many “corners” a hypercube has: 2^d of them in d dimensions.) But in array analysis, the aspect ratio of the data is all wrong: the best cases we know of involve about 10^4 genes in 10^2 experiments. Notice that the number of experiments is not only smaller than the number of genes, but is smaller than the square root
of the number of genes. Because of this, it has become apparent recently that the full N × M data set may be too large an object to analyze together. Clearly, even if we had uncorrelated, Gaussian white noise as our only noise source, as we cluster along the “genes” direction, the residual noise grows like the square root of the vector dimension; so we get a disadvantageous signal-to-noise ratio, S/N ≪ 1, simply because of the geometry of the data! So methods have evolved to find and cluster submatrices of the full thing. This is also in keeping with the notion that gene expression is there for many purposes in addition to the one we’re looking at in the experiment–transcriptional regulation of colon tissue may respond to the kind of diet the patient had before the biopsy, for instance. In addition, gene expression is so labile that any small change imprints itself on the data: there are experimental artifacts one would like to avoid. (Tumor data sets have been known to result in clustering by the surgeon performing the biopsy, for instance.) Thus, a proper way of selecting a smaller subset of the genes and experiments for analysis is extremely important [43–45].
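The geometric point about residual noise can be checked with a one-minute simulation: summing D uncorrelated unit-variance noise terms leaves residual noise of size √D. This is a sketch with synthetic Gaussian noise, not array data:

```python
import numpy as np

# The residual noise of a sum over D uncorrelated, unit-variance Gaussian
# terms grows like sqrt(D); a fixed signal does not. Hence the poor S/N
# when clustering along a ~10^4-dimensional gene axis.
rng = np.random.default_rng(1)

def residual_noise(dim, trials=20000):
    """Empirical standard deviation of the sum of `dim` noise terms."""
    return rng.standard_normal((trials, dim)).sum(axis=1).std()

n100 = residual_noise(100)   # close to sqrt(100) = 10
n400 = residual_noise(400)   # close to sqrt(400) = 20
```

Quadrupling the dimension doubles the residual noise, which is why submatrix methods that keep the gene dimension small fare better.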
I hope I’ve been able to convey the impression that the analysis of array data is by now a thriving subject, and that the interested reader should immerse herself in the growing literature. I will now first describe some of the unwritten concerns that a physicist may want to carry into the subject, and then describe a much more basic problem: that of actually getting the numbers to do clustering on. Most researchers are happy to use whatever numbers the available software spews out without giving them much thought, be it software like ScanAlyze for spot arrays, or the Affymetrix software suite for GeneChip arrays. However, close inspection shows that there are so many unsolved issues at the level of the measurement process that substantial improvements to the quality of the data could be made just by treating the raw measurement data more carefully.
2.4 Some simplifying assumptions
Here we shall detail a few of the common simplifying assumptions that lurk about in this subject, ready to catch the unwary analyst. Once again, I would like to stress that, in many cases, it is known that the assumption is wrong; it is just the best shot one has at a problem, and it is otherwise unknown whether it is a “safe” wrong assumption or a deadly one.
One way in which simplifying assumptions ruin an otherwise good piece of science is by creeping into the null hypothesis. Any quantitative analysis in this subject must be validated by an estimate of its statistical validity, since these experiments generate copious noise together with the signal (whose “copiousness” is unknown a priori). No test of statistical validity operates in a vacuum; it works by distinguishing the observed experimental data from a null hypothesis. If the null hypothesis is highly
artificial, then it is worthless to assure us that the observed experiments are, with high statistical confidence, different from the artificial null hypothesis, because we already knew they were. There is a widespread tendency to accept otherwise unacceptably dumb null hypotheses, because the tests to establish significance against any more refined model are extremely difficult to carry out, and researchers do not agree on a standardized null hypothesis. The result is an escalation of the “significance scores” that are expected. For instance, sequence alignment algorithms are supposed to tell us what the optimal alignment of their input sequence is against the sequences in a database, and then tell us the probability that this alignment arose by pure chance alone. What do we mean by pure chance alone, exactly? The usual procedure is to test against a scrambled sequence of similar length; but of course, a scrambled sequence is spectrally white, while all biological sequences have prominent correlations. A test against a random sequence of similar composition and correlation structure would require people to agree on which feature of the correlation structure is the important one, and would be much more difficult to carry out. As a result, the simpler null hypothesis gets ingrained, and researchers just expect astronomically small significance scores. But it should always be borne in mind that significance scores against a more refined null hypothesis are not necessarily monotonic with respect to the scores of a simpler hypothesis.
It is not known what a proper null hypothesis would be for expression array data, and it is a matter of current debate. Researchers have by and large used the arrays as sieves, trying to catch low-hanging fruit, to be verified by more conventional methods. Since biology labs have in the past struck gold by discovering and then studying the right molecule for a given process, a list of the 20 top candidates to be “the right molecule” is worth a lot to a biology lab–even though it hardly makes a piece of finished science. In this regard, the current climate favours sensitivity over accuracy: the people running the sieves are worried about catching some fruit, and so prefer to get a bigger list with many false positives and all of the right candidates over a smaller list with no false positives but important puzzle pieces missing. So it has been hard to get anyone worked up about the appropriate null hypothesis to quantify things like clustering analysis.
One simplifying assumption is that RNA concentration is a proxy for RNA transcription (just 90 degrees out of phase). The importance of RNA degradation rates cannot be overstressed, since they are roughly homogeneous within a given family of proteins but vary by orders of magnitude across families. Thus, any clustering of the observed raw signals will likely cluster together genes with similar RNA degradation rates even if they are unrelated–just on the basis of a similar spectral profile inducing spurious correlations against the “spectrally white” null hypothesis.
Another is that averaging over cells preserves the appropriate correlations one is trying to use. Instances are known of Boolean processes which have a graded population distribution, and any population average is incapable of distinguishing such a thing from a graded response at the cellular level. See [28], where kinase cascades which had been shown to be graded in vitro turned out to be Boolean at the single-cell level. Even for graded responses, instances are known in which the exact population distribution carries an enormous amount of information (see the third part of this course). This is a matter of much current concern for many people, who are trying to push array technology to the single-cell detection limit.
Finally, gene expression is extremely labile, and it responds to everything a living being is in contact with. As discussed above, artifacts are easy to come by. The assumption that the observed gene expression patterns carry only information related to the parameters the experimentalist is attempting to control has already been shown to be quite dangerous.
2.5 Probeset analysis
In theory, theory and practice are the same, but in practice they aren’t.
– Attributed to Yogi Berra
If one takes an experimental sample of RNA extracted from tissue, divides it into two identical vials, and then carries out all of the amplification, fluorescence labeling, hybridization to the microarray, washing, fluorescence laser scanning and image analysis of the scanned image required to get the brightnesses at each probe, one observes a curious feature. The probe brightnesses are, by and large, repeatable with a high degree of certainty–for bright probes, within a few percent. There are about 20 probes per probeset, so averaging over them should make things better by a factor of √20 ≈ 4. So the technology holds the promise that, somehow within it, there is the possibility of making measurements precise to two decimal digits. Yet, in actual practice, the final numbers coming out of analysis can scarcely be trusted for changes smaller than a factor of two–“times/over two” is the standard error line for chip data. One may argue that the differential design is of course sensitive to noise, but still there is evidently something wrong, and we shall now explore what.
The rationale behind using PM and MM sequences is that sequence-specific hybridization will definitely notice a change of one letter, while cross-hybridization, or any other nonspecific effect, won’t. A statement of this thinking, which I will call the standard hybridization model, is
PM = I_S + I_NS + B
MM = α I_S + I_NS + B
where I_S is the brightness due to the binding of the specific target, I_NS the nonspecific binding, B a background brightness of “physical” origin (photodetector dark current or reflections in the glass), and 0 < α < 1 is the loss of binding strength due to the single-letter change.
The Affymetrix software suite then analyzes these numbers as follows. ∆ = PM − MM = (1 − α)I_S, so the nonspecific and background contributions should have been obliterated from the probe-pair differentials. There are 16–20 such ∆, one per probe pair, per gene being probed. In order to discard outliers, the top and bottom scores are discarded; the rest are then algebraically averaged, as is sometimes practiced in the scoring of some Olympic sports. This would eliminate the influence of defective probes.
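The averaging procedure as described can be sketched as follows. This is a reconstruction of the description above, not the actual Affymetrix code, and the brightnesses are made up:

```python
# Reconstruction of the averaging scheme described above (not the actual
# Affymetrix implementation): per-pair differentials Delta = PM - MM,
# Olympic-style trimming of the extremes, then an algebraic mean.

def probeset_average(pm_values, mm_values):
    """Trimmed mean of the probe-pair differentials PM - MM."""
    deltas = sorted(p - m for p, m in zip(pm_values, mm_values))
    trimmed = deltas[1:-1]  # discard the top and bottom scores
    return sum(trimmed) / len(trimmed)

# Made-up brightnesses with one defective, far-too-bright probe:
pm = [120.0, 95.0, 110.0, 3000.0, 105.0]
mm = [40.0, 35.0, 50.0, 45.0, 60.0]
avg = probeset_average(pm, mm)  # the defective pair's huge delta is discarded
```

The trimming does reject the single defective probe here; the problems come from the assumptions listed next.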
The implicit assumptions that would allow such an averaging procedure to work are:
1. That I_S = k[RNA], i.e., that the relationship between brightness and RNA concentration be linear. This implies a conversion constant k that translates from concentrations to brightness. Similarly, there should be a constant p relating the nonspecific portion of the brightness with the RNA concentration which causes it;
2. That the conversion constant k be the same for PM and MM, and similarly that p should be the same for PM and MM. Otherwise subtraction does not cancel them out;
3. That 0 < α < 1;
4. That α and k be relatively homogeneous in magnitude throughout the probeset.
Grabbing a dataset and doing some statistics quickly belies these assumptions. In fact, such an exercise shows that all the assumptions are violated. The most visible violation (and the one that was first noted by researchers in the subject) is that (3) implies ∆ > 0, or PM > MM, an assertion that is easy to check. It turns out that probe pairs for which MM > PM were noticed quite early on, since they lead to negative concentrations in the Affy software suite.
Let me repeat that MM > PM is a very evident violation of the hybridization assumptions; evident since no more numerical analysis than a subtraction and checking for a minus sign is necessary. Felix Naef and I have been working on a number of large datasets, including an 86-sample set from human blood from rheumatoid arthritis patients, from Nila Patil at Perlegen Inc., formerly the human genetics division of Affymetrix; 36 Drosophila chips from M. Young’s lab at Rockefeller; and 24 mouse chips from mouse brain tissue by Dan Lim et al. We have found that across
different chips, different kinds of tissue, different people carrying out the reactions, etc., all chips with the single exception of yeast chips (arguably a different beast altogether) show pretty much the same statistics: about 30% of all probe pairs show MM > PM. Thirty percent is a figure hard to dismiss as negligible or small. We have checked whether these probe pairs are clustered in any way we can figure out. The first naive idea would be that at low intensity levels, noise becomes percentage-wise higher, and so it might make some probe pairs cross the line. It isn’t so: 27% of all probe pairs in the top quartile of intensity show MM > PM. The “bad” probe pairs are not concentrated into bad or problematic probesets either: 97% of all probesets have at least one such bad guy, and 60% have in excess of 5. (See the table in [27].)
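The check itself really is just a subtraction and a sign count. On synthetic probe-pair intensities (lognormal toy numbers, with parameters invented only so that a sizable minority of pairs violates PM > MM):

```python
import numpy as np

# Checking the MM > PM fraction needs nothing beyond a subtraction and a
# sign count. Synthetic probe-pair intensities stand in for a real chip;
# the lognormal parameters are invented, chosen only so that a sizable
# minority of pairs violates PM > MM.
rng = np.random.default_rng(2)
pm = rng.lognormal(mean=6.0, sigma=1.0, size=10000)
mm = pm * rng.lognormal(mean=-0.5, sigma=0.7, size=10000)

fraction_bad = np.mean(mm > pm)  # fraction of probe pairs violating the model
```

On real chips, the text reports, this fraction comes out near 30% almost regardless of organism, tissue or experimenter.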
This raises the question of whether there is any interesting feature in the joint probability distribution of PM and MM. Figure 9 shows a gray-coded two-dimensional histogram with quite an interesting structure.
Fig. 9. Histograms of log PM vs. log MM for two different datasets: a) 86 human chips (HG-U95A), human blood, and b) 20 mouse chips (Mu11K/A), mouse brain tissue. The difference in overall brightness scale reflects a change in the standard scanning protocol; data set a) is more recent, and was scanned at lower laser power. From [27].
Notice that the joint probability distribution forks out into two branches, leaving a little “button”-like structure at the center of the branching structure. The lower branch and half of the button are completely below the PM = MM diagonal. This plot not only belies the standard model above by showing the deviations to be meaningful–it also indicates that the deviations are likely interesting, since they appear as an elaborate structure. Unfortunately, it is impossible to check the obvious assumptions about sequence specificity, since the sequences are considered proprietary
information. Clearly the single-mismatch binding is a much more complex process than naively thought, and a great deal of care should be exercised with exactly how to construct a differential discriminator from the match/mismatch game. Obvious culprits are secondary structure both in the probes and the targets, sequence-specific stacking interactions, and fabrication efficiencies, which are strongly letter-specific and so evidently accumulate exponentially through the 100-odd mask processes that the chips are subject to during fab.
So the MM probes are not doing what they were expected to do. A simple method to deal with this problem has been presented in [27]: separate factors are fitted to PM and MM by a singular-value decomposition process.
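The idea of fitting separate multiplicative factors by SVD can be sketched on synthetic data: if brightness factorizes as (probe affinity) × (per-chip concentration), the intensity matrix is rank one, and the leading singular vectors recover both factors. This is only a sketch in the spirit of [27], with made-up numbers, not the published method:

```python
import numpy as np

# Sketch in the spirit of [27], with made-up numbers: if brightness
# factorizes as (probe affinity) x (per-chip concentration), the intensity
# matrix is rank one, and the leading singular vectors of an SVD recover
# both factors despite order-of-magnitude spreads in affinity.
rng = np.random.default_rng(3)
n_probes, n_chips = 16, 8
affinity = rng.lognormal(0.0, 1.0, n_probes)  # spans orders of magnitude
conc = rng.lognormal(5.0, 0.5, n_chips)       # per-chip RNA concentration
noise = np.exp(0.05 * rng.standard_normal((n_probes, n_chips)))
intensity = np.outer(affinity, conc) * noise  # nearly rank-one data

U, s, Vt = np.linalg.svd(intensity, full_matrices=False)
explained = s[0] ** 2 / (s ** 2).sum()        # leading component dominates

# Recovered per-chip factor, up to an overall scale (sign fixed positive):
chip_factor = Vt[0] * np.sign(Vt[0].sum())
```

Because the fit never touches the MM values, it sidesteps the question of what the mismatch probes are actually measuring.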
Assumption (4) in the list above is easily belied too. Brightnesses within a probeset vary by orders of magnitude. The histogram of log(max PM/min PM) has its mode around 300, so the typical probeset spans two and a half orders of magnitude. The distribution of intensities within probesets can be assayed by normalizing all intensities to the median of the probeset (which is always well defined); since there is an abundance of data, we can build separate histograms for bright, medium and dim probesets.
Fig. 10. Histograms of log(PM/median(PM)) for the dataset of Figure 9a. Three distinct intensity ranges have been histogrammed separately; this allows one to verify that it is not the low end of the data that contaminates the histograms.
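A probeset spanning orders of magnitude makes the algebraic mean hostage to its largest member, while a mean taken in log space (a geometric mean) stays put. Five made-up intensities suffice to see this:

```python
import math

# Five made-up probe intensities spanning four orders of magnitude, as the
# max/min histogram above suggests is not unusual within a probeset.
values = [0.01, 0.1, 1.0, 10.0, 100.0]

arith = sum(values) / len(values)  # algebraic mean, dragged up by the maximum
geom = math.exp(sum(math.log(v) for v in values) / len(values))  # mean in log space

dominance = max(values) / sum(values)  # share of the sum held by one probe
```

One probe holds about 90% of the sum, so the algebraic mean reports on that single probe; the geometric mean sits at the center of the distribution.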
Because of the exponentially-distributed nature of the data, it is clear that algebraic averages do not converge. An average over quantities that vary on an exponential scale is dominated by the largest value, which is with high probability an outlier. Simply replacing the algebraic average with a geometric mean does wonders for the reliability of the data, as we showed in [26]. Yet it is not sufficient for a high-quality method, for it has no built-in way to reject cross-hybridization. A simple way of doing cross-hyb rejection without resorting to the MM was also shown in [26]
for the ratio comparison between two chips: a sum of two exponentially varying quantities looks mostly like the maximum of the quantities, except in the narrow diagonal range where they are of comparable size. Thus one may assume that a given probe is displaying either mostly specific signal or mostly cross-hyb. So, if we compare all PM probes from one experiment to the corresponding probes in the second experiment, their ratios are likely to be showing either the real ratio between the two RNA concentrations, or nonsense. A histogram (if one had enough data to build one) would show the superposition of two distinct distributions: a sharp “specific” peak on a broad “nonsense” background. Robust estimators to fish the signal out can be built on a maximum-likelihood basis or with any of many known statistical methods.
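A toy version of this two-population picture, with invented numbers: a majority of probes carry the true log ratio in a sharp peak, a minority carry broad nonsense, and a robust location estimate (here simply the median of the log ratios, standing in for a proper maximum-likelihood fit) recovers the fold change where the plain mean does not reliably:

```python
import numpy as np

# Toy version of the two-population picture: per-probe log ratios between
# two chips are either "specific" (a sharp peak at the true log fold
# change) or "nonsense" (a broad background). All numbers are invented.
rng = np.random.default_rng(5)
true_log_ratio = np.log(3.0)  # a true three-fold change

specific = true_log_ratio + 0.1 * rng.standard_normal(60)  # sharp peak
background = rng.uniform(-4.0, 4.0, 40)                    # broad nonsense
log_ratios = np.concatenate([specific, background])

# A robust location estimate survives the background; the plain mean
# is pulled around by it.
estimate = float(np.exp(np.median(log_ratios)))
naive = float(np.exp(np.mean(log_ratios)))
```

As long as the specific peak holds the majority of the probes, the median lands inside it and the background is ignored.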
Finally, we should like to observe that we have clear indications that assumption (1) is false as well. The change in protocol alluded to in the caption to Figure 9 was introduced because of widespread complaints by GeneChip users that their data was showing saturation, and that highly expressed genes which were known from blot assays to vary quite a bit were showing up as unchanging in the Affy data. That individual probe pairs were showing optical saturation is clear from Figure 9b–just notice the top and right borders. But a much more interesting problem is that many probes show (to the careful observer) evidence of chemical saturation: some probes become chemically saturated even when they are not optically saturated.
2.6 Discussion
In order to build better methods for extracting the RNA concentrations from this data, clearly a close look at the data is necessary. We have now seen the data and some of its problems, and I hope to have succeeded in making the case that, in all likelihood, no single method will be able to mine all relevant information from the data. This is because complex methods are very hard to validate, while simple methods fail to capture all of the complexities. We believe this should be so, and that analysis of this kind of data shall benefit from many methods in existence, rather than few–just like the existence of several different clustering techniques enriches the arsenal of the analyst rather than complicating life. Only the purveyors of proprietary software, and the harried scientist who can’t be bothered to use anything other than prêt-à-porter solutions, can think otherwise–yet it has been quite problematic for people to, for instance, publish in this area.
3 Neural and gene expression networks: Song-induced gene expression in the canary brain
This lecture is more of a story, partly because the underlying material is largely unfinished and ill understood; it is just the beginning of a long tale to be unraveled over many more years. It’s a story that happens at the busy intersection between two large avenues of exploration: the corner of “gene expression networks” and “neural networks”. This is the place where perception becomes memory. There’s too much happening here, and it’s a lot of effort to tease apart the pieces of the picture, but it’s an interesting and exciting place nonetheless.
As we discussed in the previous lecture, gene expression is regulated; in fact, the point of it is to be regulated. As a response to changes in the environment, transcriptional programs are put in motion that effect long-lasting adaptations to cope with those changes. The nervous system is no exception: in fact, it is the tissue in which these changes are most prominent, varied and clear. All of the consolidations of long-term memory, for instance long-lasting synaptic change, involve transcriptional regulation. The marvel is the swiftness of the response and the ease with which it is put in motion.
So the outline of our story is thus. Imagine a canary sitting in a cage. You place a tape recorder next to it, and you press PLAY. The tape contains a recording of another canary’s song–one our particular canary hadn’t heard before. Hearing the song causes a blush of gene expression in some auditory nuclei of the canary brain: a transient, yet vigorous response, easily excited. Studying the blush reveals it to be topographically organized, so that different song elements cause geometrically different blushes. In fact, within a small family of stimuli, we were able to invert the map: we could say what was on the tape based on the shape and “color” of the blush. And this is the response to doing nothing more than playing sounds to the canary–we can scarcely say we have “done” anything to the bird, and yet there is a discernible response to just one or two playbacks of the song; the response is visible within 5 min, and lasts for hours.
This level of resolution cannot be achieved with any other technique currently in existence, short of large-scale electrophysiological recording. We were able to dissect extremely important differences between similar-sounding natural and artificial sounds. Yet the story has so many open threads that it’s hard to foresee how it shall go on.
The story has three main characters: a bird, a song, and a gene; they act on a stage, the brain. I need to introduce these characters now.
