

Figure 15.23 Results of two actual comparisons between polyoma and SV40 viral DNA sequences, computed as described for the hypothetical example in Figure 15.22. Dots show DNA windows with more than 55% identity within a window (top) or more than 66% (bottom). Provided by Rhonda Harrison.

 

 

 

 


Regions of strong homology show up as a series of dots along a diagonal in these plots. One can piece together the set of diagonals found to construct a model for a global alignment involving all the residues. However, it is not easy to show whether a particular global alignment reached in this manner is the best one possible.
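The windowed comparison that produces plots like Figure 15.23 is straightforward to reproduce. The sketch below is only a minimal illustration of the idea, not the program used for the figure; the window length and identity threshold are adjustable parameters chosen here for the example, corresponding in spirit to the 55% and 66% cutoffs mentioned in the caption.

```python
def dot_matrix(seq1, seq2, window=10, min_identity=0.55):
    """Return (i, j) positions where two windows exceed an identity threshold.

    A dot at (i, j) means seq1[i:i+window] and seq2[j:j+window] are more than
    `min_identity` identical. Runs of dots along a diagonal indicate regions
    of strong homology.
    """
    dots = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            matches = sum(a == b for a, b in zip(seq1[i:i + window],
                                                 seq2[j:j + window]))
            if matches / window > min_identity:
                dots.append((i, j))
    return dots

# Example: two short sequences sharing a conserved block.
print(dot_matrix("ACGTACGTTTGCA", "GGACGTACGTTTA", window=6, min_identity=0.66))
```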

 

 

 

 

 

 

DYNAMIC PROGRAMMING

 

 

 

 

 

 

Dynamic programming is a method that improves on the dot matrix approach just described because it has the power, in principle, to find an optimal sequence alignment. The method is computationally intensive, but not beyond the capabilities of current supercomputers or specialized computer architectures or hardware designed to perform such tasks rapidly and efficiently (Shpaer et al., 1996). The basic notion in dynamic programming is illustrated in Figure 15.24. Here an early stage in the comparison of two sequences is shown. Gaps are allowed, but a statistical penalty (a loss in total similarity score) is paid each time a gap is initiated. (In more complex schemes the penalty can depend on the size of the gap.) We already discussed the fiercely difficult combinatorial problem that results if one tries out all possible gaps. However, with dynamic programming, in the example shown in Figure 15.24, one argues that the best alignment achievable at point n (n is the total number of residues plus gaps that have been inserted into either of the sequences since the start of the comparison) must contain the best alignment at point n - 1, plus the best scoring of the following three possibilities:

1. Align the two residues at position n.
2. Gap the top sequence.
3. Gap the bottom sequence.
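The recurrence just described can be written down directly. The following sketch is a minimal Needleman-Wunsch style global alignment score, assuming a simple match/mismatch scheme and a fixed per-gap penalty rather than the scoring matrices and length-dependent gap penalties used in practice; it fills each cell from its three possible predecessors.

```python
def global_alignment_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Score matrix for a global alignment: cell (i, j) holds the best score
    for aligning s1[:i] against s2[:j]."""
    n, m = len(s1), len(s2)
    # Initialize the first row and column with cumulative gap penalties.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = F[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            up = F[i - 1][j] + gap    # s1[i-1] aligned to a gap
            left = F[i][j - 1] + gap  # s2[j-1] aligned to a gap
            F[i][j] = max(diag, up, left)  # best of the three possibilities
    return F

F = global_alignment_score("GATTACA", "GCATGCA")
print(F[-1][-1])  # optimal global alignment score
```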

The argument behind this approach is not totally rigorous. There is no reason why the optimal alignment of two sequences should be co-linear. DNA rearrangements can change the order of exons, invert short exons, or otherwise scramble the linear order of genetic information. Dynamic programming, as conventionally used, can only find the optimal co-linear alignment. However, one can circumvent this problem, in principle, by attempting the alignments starting at selected internal positions in both possible orientations, and determining if this alters the best match obtained.

Figure 15.24 Basic methodology used in comparison of two sequences by dynamic programming.


 

 

Sequence alignment by dynamic programming is conveniently viewed by a plot as shown in Figure 15.25. The coordinates of the plot are the two sequences, just as in the dot matrix plot discussed earlier. However, what is plotted at each position is not a score from the comparison of two sequence windows; instead, it is a cumulative score for a path through the two sequences that represents a particular comparison. In the simple example in Figure 15.25, the scoring is just for identities. In a real case, the elements of a scoring matrix would be used. From the example in Figure 15.25a it can be seen that a particular alignment is just a path of arrows between adjacent residues. The three possible steps listed above at a given point in the comparison just correspond to:

1. Advancing along a diagonal
2. Advancing horizontally
3. Advancing vertically

In dynamic programming one must consider all possible paths through the matrix. A particular comparison is a continuous set of arrows. The best comparisons will give the highest scores. Usually there are multiple paths that give equal, optimal alignments. These can be found by starting with the point in the plot with the highest score and tracing back toward the beginning of the sequence comparison to find the various routes that can lead to this final point (Fig. 15.25b).
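Tracing back through such a matrix recovers the alignments themselves. Below is a minimal sketch, assuming the score matrix F and the scoring parameters from the previous example; it follows a single optimal path, whereas enumerating every tied predecessor at each step would yield all equally optimal alignments.

```python
def traceback(F, s1, s2, match=1, mismatch=-1, gap=-2):
    """Follow one optimal path from the bottom-right cell back to the origin."""
    a1, a2 = [], []
    i, j = len(s1), len(s2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + \
                (match if s1[i - 1] == s2[j - 1] else mismatch):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-"); i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1]); j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2))

F = global_alignment_score("GATTACA", "GCATGCA")  # from the previous sketch
print(*traceback(F, "GATTACA", "GCATGCA"), sep="\n")
```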

Rigorous dynamic programming is a very computationally intensive process because there are so many possible paths to be tested. Remember that for each new protein sequence, one wishes to test alignments with all previously known sequences. This is a very large number of dynamic programming comparisons. A number of different procedures can be used to speed this up. Some pay the price of a loss of rigor, so no guarantee exists that the very best alignment will be found.

Figure 15.25 A simple example of dynamic programming. (a) Matrix used to describe the scores of various alignments between two sequences. Note that successive horizontal and vertical moves are forbidden, since this introduces a gap on both strands, which is equivalent to a diagonal move, not a gap at all. In each cell only the route with the highest possible score is shown. In practice, each cell can be approached in three different ways: diagonally, horizontally, and vertically. (b) The trace back through the matrix from the highest scoring configuration to find all paths (alignments) consistent with that final configuration. Only the path with the highest score is kept.

 

 

 

 

 

 

 

 

A very popular program for global database searching has been developed by David Lipman and his coworkers. This is called FASTA for nucleic acid comparisons and FASTP for protein (or translated ORF) comparisons. The test protein is broken into k-tuple words. These are searched through the entire database, looking for exact matches and scoring them. Some of the best-scoring comparisons will form diagonal regions of dot plots with a high density of exact matches. The 10 best of these regions are rescored with a more accurate comparison matrix. They are trimmed to remove residues not contributing to the good score. Then nearby regions are joined, using gap penalties for the inevitable gaps required in this process. The resulting sets of comparisons are examined to find those with the best, provisional fits. Then each of these is examined by dynamic programming within a relatively narrow band of 32 residues around the sites of each of the provisional matches. Even faster algorithms exist, like Lipman's BLAST, which looks only at short exact fits but makes a rigorous statistical assessment of their significance. This is used to winnow down the vast array of possible comparisons before more rigorous, and time-consuming, methods are employed.
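The word-matching stage that makes this approach fast can be illustrated with a short sketch. This is not the FASTA implementation, just the underlying idea under simple assumptions: index every k-tuple of the query, then record which diagonals of the comparison plot accumulate many exact word hits; those diagonals correspond to the candidate regions that would go on to rescoring, trimming, joining, and banded dynamic programming.

```python
from collections import defaultdict

def diagonal_hits(query, subject, k=4):
    """Count exact k-tuple matches on each diagonal (offset = i - j)."""
    words = defaultdict(list)            # k-tuple -> positions in the query
    for i in range(len(query) - k + 1):
        words[query[i:i + k]].append(i)
    diagonals = defaultdict(int)         # diagonal offset -> number of word hits
    for j in range(len(subject) - k + 1):
        for i in words.get(subject[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals

hits = diagonal_hits("MSTNPKPQRKTKRNTNRRPQDVKFPGG", "GGMSTNPKPQRKTKRN", k=4)
# The most heavily populated diagonals are the candidates for closer scoring.
print(sorted(hits.items(), key=lambda kv: -kv[1])[:3])
```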

 

 

 

 

 

 

 

 

A key point in favor of the dynamic programming method is that for each comparison cell, as shown in Figure 15.26, one only needs to consider three immediate precursors. This allows the computation to be broken into a set of steps that can be computed in parallel. The comparison can proceed down the diagonal of the plot in waves, with each cell computed independently of what is happening in the other cells along that diagonal (Fig. 15.26). Thus parallel processing computers are well adapted to the dynamic programming method. The result is an enormous increase in our power to do global sequence comparisons. Using a parallel architecture, it is possible, with the existing protein database, to do a complete inversion. That is, all sequences are compared with all sequences by dynamic programming. The resulting set of scores allows the database to be factored into sequence families, and new, unsuspected sets of families have been found in this way. As an alternative to parallel processing, specialized computer chips have been designed and built that perform dynamic programming sequence comparisons, but they do this very quickly because each sequence needs to be passed through the chip only once.

 

 

 

 

 

 

 

 

 

 

Figure 15.26 Parallel processing in dynamic programming. Only three preceding cells are needed to compute the contents of each new cell. Computation can be carried out on all cells of the next diagonal, simultaneously.
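The independence of cells along an anti-diagonal, illustrated in Figure 15.26, is easy to see in code. The sketch below is a serial illustration of the idea, not an actual parallel implementation: it visits the matrix wave by wave, and every cell in the inner loop could in principle be computed simultaneously because each depends only on cells from the two previous waves.

```python
def wavefront_scores(s1, s2, match=1, mismatch=-1, gap=-2):
    """Fill the alignment matrix anti-diagonal by anti-diagonal."""
    n, m = len(s1), len(s2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for wave in range(2, n + m + 1):            # cells with i + j == wave
        for i in range(max(1, wave - m), min(n, wave - 1) + 1):
            j = wave - i
            diag = F[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            # Both predecessors lie on earlier waves, so cells on this wave
            # never depend on one another and could be computed in parallel.
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F

print(wavefront_scores("GATTACA", "GCATGCA")[-1][-1])
```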


GAINING ADDITIONAL POWER IN SEQUENCE COMPARISONS

With increases in available computation power occurring almost continuously, it is undeniably tempting to carry sequence comparisons even further to make them more sensitive. The goal is to identify, as related, two or more sequences that lie at the edge of statistical certainty by ordinary methods. There are at least two different ways to attempt to do this. One can try to introduce information about possible protein three-dimensional structure, for the reasons described earlier in this chapter. Alternatively, one can attempt to align more than two sequences at a time.

 

 

If two proteins are compared at the amino acid sequence level, but one of these proteins is of known three-dimensional structure, the sensitivity of the scoring used can be improved considerably, as we hinted earlier. One can postulate that if the two proteins are truly related, they should have a similar three-dimensional structure. In the interior of this structure, spatial constraints are very important. Insertions and deletions there will be very disruptive, and thus they should be assigned very costly comparison weights. The effects will be much less serious on the protein surface. Changes in any charged residues on the interior of the protein are also likely to be very costly. Interchanges among internal hydrophobic residues are likely to be tolerated, especially if there are compensating increases in size at some points and decreases in size elsewhere. Changes that convert internal hydrophobic residues to charged residues will be devastating. A small number of external hydrophobic residues may not be too serious. Changes in sequence that seriously destabilize prominent helices and beta sheets are obviously also costly.

A systematic approach to such structure-based sequence comparisons is called threading. Here the sequence to be tested is pulled through the structure of the comparison sequence. At each stage some structure variations are allowed to see if the new sequence is compatible in this alignment with the known structure. The effectiveness of threading procedures in improving homolog matching and structure prediction is still somewhat controversial. However, in principle, this appears to be a very powerful method as more and more proteins with known three-dimensional structures become available.

 

 

 

As originally implemented by Christian Sander, by attempting to fit proteins of unknown structure to the various known protein structures as just described, one gained 5% to 10% in the fraction of the database that showed significant protein similarity matches. Clearly this approach will improve significantly as the body of known protein structures increases in size. Another aspect of the protein structure fitting problem deserves mention here. As more and more proteins with known three-dimensional structure become available, it is of interest to ask whether any pairs of proteins have similar three-dimensional structure, outside of any considerations about whether they have similar amino acid sequences. Sander has constructed algorithms that can look for similar structures, essentially by comparing plots of sets of residues that approach each other to within less than some critical distance. Such comparisons have revealed cases where two very similar three-dimensional structures had not been recognized before because they were being viewed from different angles that obscured visual recognition of their similarity. In the future we must anticipate that the number of structures will be so great that no visual comparison will suffice, unless there is first a detailed way to catalog or group the structures into potentially similar objects.
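The kind of structure-to-structure comparison described here starts from contact maps: for each structure, the set of residue pairs whose representative atoms lie within some cutoff distance. The sketch below is only a cartoon of that first step; it is not Sander's algorithm, and real programs compare maps while allowing for insertions, deletions, and different residue numbering. The cutoff, the toy coordinates, and the simple overlap score are all assumptions chosen for illustration.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Set of residue pairs (i, j), i < j, closer than `cutoff` angstroms."""
    contacts = set()
    for i in range(len(coords)):
        for j in range(i + 2, len(coords)):   # skip trivially adjacent residues
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts

def map_overlap(map_a, map_b):
    """Fraction of contacts shared by two equal-length chains (Jaccard index)."""
    union = map_a | map_b
    return len(map_a & map_b) / len(union) if union else 1.0

# Toy coordinates for two short, similar chains (purely illustrative numbers).
chain_a = [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (7.6, 3.8, 0), (3.8, 3.8, 0)]
chain_b = [(0, 0, 0), (3.8, 0.2, 0), (7.5, 0, 0), (7.4, 3.9, 0), (3.9, 3.7, 0)]
print(map_overlap(contact_map(chain_a), contact_map(chain_b)))
```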

The second approach to enhanced protein sequence comparison is multiple alignments. This allows weak homology to be recognized if a family of related proteins is assembled. Suppose, for example, that one saw a sequence pattern like

 

 



 

 

 

 

 

—P—AHQ—L—

in two proteins. Here we use the convenient one-letter code for amino acids summarized in Table 15.6. Such a pattern would be far too small to be scored as statistically significant. However, seeing this pattern in many proteins would make the case for all of them much stronger. Currently most multiple alignments are done by making separate pairwise comparisons among the proteins of interest. The extreme version of this is the database inversion described previously. The use of pairwise comparisons throws away a considerable amount of useful information. However, it saves massive amounts of computer time. The alternative, for rigorous algorithms, would be dynamic programming in multiple dimensions. This is not a task to be approached by the faint of heart (or short of budget) with current computer speeds. Future computer capabilities could totally change this.

TABLE 15.6 One-Letter Code for Amino Acids

One-Letter Code   Three-Letter Code   Amino Acid Name   Mnemonic
A                 ala                 alanine           Alanine
C                 cys                 cysteine          Cysteine
D                 asp                 aspartate         asparDate
E                 glu                 glutamate         glutamatE
F                 phe                 phenylalanine     Fenylalanine
G                 gly                 glycine           Glycine
H                 his                 histidine         Histidine
I                 ile                 isoleucine        Isoleucine
K                 lys                 lysine            K near L
L                 leu                 leucine           Leucine
M                 met                 methionine        Methionine
N                 asn                 asparagine        asparagiNe
P                 pro                 proline           Proline
Q                 gln                 glutamine         "cute" amine
R                 arg                 arginine          aRginine
S                 ser                 serine            Serine
T                 thr                 threonine         Threonine
V                 val                 valine            Valine
W                 trp                 tryptophan        tWyptophan
Y                 tyr                 tyrosine          tYrosine
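A pattern such as —P—AHQ—L— translates directly into a pattern search over one-letter-code sequences. The sketch below is a minimal illustration; the spacer lengths in the regular expression and the toy sequences are invented for the example, since the pattern as written does not specify them. Finding the same short motif in many members of a candidate family is what lends it statistical weight.

```python
import re

# Hypothetical rendering of the -P-AHQ-L- pattern: a proline, then 1-5 arbitrary
# residues, the tripeptide AHQ, then 1-5 arbitrary residues, then a leucine.
MOTIF = re.compile(r"P.{1,5}AHQ.{1,5}L")

# Toy one-letter-code sequences standing in for a candidate protein family.
family = [
    "MKTPWSAHQGDLAAR",
    "GGPQAAHQTTLKKE",
    "MSSSVVNNNQQQRRR",   # no motif
]

for seq in family:
    hit = MOTIF.search(seq)
    print(seq, "->", hit.group() if hit else "no match")
```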

DOMAINS AND MOTIFS

From an examination of the proteins of known three-dimensional structure, we know that much of these structures can be broken down into smaller elements. These are domains (independently folded units) and motifs (structural elements or patterns that describe the basic folding of a section of polypeptide chain, to form either a structural backbone or framework for a domain or a binding site for a ligand). Examples are various types of barrels formed by multiple-stranded beta sheets, zinc fingers and helix-turn-helix motifs (both of which are nucleic acid binding elements), Rossmann folds and other ways of binding common small molecules like NAD and ATP, serine esterase active sites, transmembrane helices, kringles, and calcium binding sites. A few of these motifs are illustrated schematically in Figure 15.27. The motifs are not exact, but many can be recognized at the level of the amino acid sequence by conserved patterns of particular residues like cysteines or hydrophobic side chains.

Several aspects of the existence of protein motifs should aid our ability to analyze the function of newly discovered sequences once we understand these motifs better. First, there may well be just a finite number of different protein motifs, perhaps just several hundred, and eventually we will learn to recognize them all. Our current catalog of motifs derives from a rather biased set of protein structures. The current protein data base is composed mostly of materials that are stable, easy to crystallize, and obtainable in large quantities. Eventually the list should broaden considerably.

Each protein motif has a conserved three-dimensional structure, and so once we suspect a new protein of possessing a particular motif, we can use that structure for enhanced sequence comparisons as described above. If, as now seems likely, proteins are mostly composed of fixed motifs, any new sequence can be dissected into its motifs, and these can be analyzed one at a time. This will greatly simplify the process of categorizing new structures. However, the key aspect of motifs that makes them an attractive path from sequences to biological understanding is that motifs are not just structural elements, they are functional elements. There must be a relatively finite set of rules that determines what combinations of motifs can live together within a protein domain. Once we know these rules, our ability to make intelligent guesses about function from structural information alone should improve markedly.

Figure 15.27 Some of the striking structural and functional motifs found in protein structures. (a) Helix-turn-helix. (b) Zinc finger. (c) Beta barrel. (d) Transmembrane helical domain.

Figure 15.28 An example of how motifs (found in transportation vehicles) can be used to infer or relate the structures of composites of motifs.

One way to view motifs is to consider them as words in the language of protein function. What we do not yet know, in addition to many of the words in our dictionary, is the grammar that combines them. Another way to view protein motifs is the far-fetched, but perhaps still useful, analogy shown in Figure 15.28. This portrays various transportation vehicles and some of the structure-function motifs that are contained in them. If we knew the function of the individual motifs, and had some idea about how they could be combined, we might be able to infer the likely function of a new combination of motifs, even though we had never seen it before; for example, we should certainly be able to make a good de novo guess about the function of a seaplane.

INTERPRETING NONCODING SEQUENCE

Most of the genome is noncoding, and arguments about whether this is garbage, junk, or potential oil fields have been presented before. The parts of the noncoding region that we are most anxious to be able to interpret are the control sequences, which determine when, where, and how much of a given gene product will be made. Our current ability to do this is minimal, and it is confounded by the fact that most of the functional control elements appear to be very short stretches of DNA sequence. These sequences are not translated, so we have only the four bases to use for comparisons or analyses. This leads to a situation with very poor signal to noise. In eukaryotes, control elements appear to be used in complex combinations like the example shown schematically in Figure 15.29. To analyze a stretch of DNA sequence for all possible combinations of small elements that might be able to come together in such a complex seems intractable by current methods. Perhaps, after we have learned much more about the properties of some of these control regions, we will be able to do a better job. The true magnitude of this problem, currently, is well illustrated by the globin gene family, where one key, short control element required for correct tissue-specific gene expression was found (experimentally, and after much effort) to be located more than 20 kb upstream from the start of transcription. Thus finding control elements is definitely not like shooting fish in a barrel.

DIVERSITY OF DNA SEQUENCES

 

We cannot be interested in just the DNA sequence of a single individual. All of the intellectual interest in studying both inherited human diseases and normal human variations will come from comparisons among the sequences of different individuals. This is already done when searching for disease genes. Its ultimate manifestations may be the kind of large-scale diagnostic DNA sequencing postulated in Chapter 13.

How different are two humans? From recent DNA sequencing projects (hopefully on representative regions, but there is no way we can be sure of this), differences between any two homologous chromosomes occur at a frequency of about 1 in 500. Scaled to the entire human genome, this is 6 × 10⁶ differences between any two chromosomes of a single individual, and potentially four times this number when two individuals are compared, since each homolog in one genome must be compared with each pair of homologs in the other. To simply make a catalog of all of the differences for any potential individual against an arbitrary reference haploid genome would require 12 × 10⁶ entries.

Figure 15.29 Structure of a typical upstream regulatory region for a eukaryotic gene. (a) Transcription factor-binding sequences. (b) Linear complex with transcription factors. (c) Three-dimensional structure of the transcription factor-DNA complex that generates a binding site for RNA polymerase.

 

 

 

 

 

 

 

 


The current population of the earth is around 5 × 10⁹ individuals. Thus, to record all of current human diversity would require a database with 12 × 10⁶ × 5 × 10⁹ = 6 × 10¹⁶ entries. By today's mass storage standards, this is a large database. The entire sequence of one human haploid genome could be stored on a single CD-ROM. On this scale, storage of the catalog of human diversity would require 2 × 10⁷ CD-ROMs, hardly something one could transport conveniently. However, computer mass storage has been improving at least by 10⁴ per decade, and there is no sign that this rate of increase is slackening. As a result, by the time we have the tools to acquire the complete record of human diversity, some 15 to 20 years from now, we should be able to store it on whatever is the equivalent, then, of a few CD-ROMs. Genome analysis is unlikely to be limited now, or in the future, by computer power.
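The arithmetic behind these estimates is simple enough to check directly. The sketch below uses one set of assumptions that reproduces the order of magnitude quoted above: roughly two bits per cataloged difference (enough to record which base replaced the reference base, with positions implied by the reference) and about 650 MB per CD-ROM. Both figures are assumptions chosen for illustration, not values given in the text.

```python
differences_per_person = 12e6          # 12 x 10^6 entries per individual
population = 5e9                       # ~5 x 10^9 people
entries = differences_per_person * population
print(f"catalog entries: {entries:.0e}")          # ~6e16

bits_per_entry = 2                     # assumed compact encoding (see above)
cd_capacity_bytes = 650e6              # assumed ~650 MB per CD-ROM
cds = entries * bits_per_entry / 8 / cd_capacity_bytes
print(f"CD-ROMs needed: {cds:.1e}")               # order of 10^7
```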

 

 

 

 

 

 

Why would anyone want to view all of human diversity? There are two basic features that make this information compellingly interesting. First is the complexity of a number of interesting phenotypes that are undoubtedly controlled by many genes. Included are such items as personality, affect, and facial features. Given the limited ability for human experimentation, if we want to dissect some of the genetic contributions to these aspects of our human characteristics, we will have to be able to compare DNA sequences with a complex set of observations. The more sequence we have, and the broader the range of individuals represented, the more likely it seems we should be able to make progress on problems that today seem totally intractable.

 

 

 

 

 

The second motivation for studying the vast expanse of human diversity is a burgeoning field called molecular anthropology. A considerable amount of human prehistory has left its traces in our DNA. Within contemporary sequences should be a record of who mated with whom as far back as our species goes. Such a record, if we could untangle it, would provide information about our past that probably cannot be reconstructed so easily and accurately in any other way. Included will be information about mass migrations, like the colonization of the New World, population expansions in prehistory, and catastrophes, like plagues or major climatic fluctuations. A tool that is useful for such studies is called principal component analysis. It looks at DNA or protein sequence variations across geographic regions and attempts to factor out particular variants with localized geographic distribution, or with gradients of geographic distribution.

 

 

 

 

 

Many of the results obtained thus far in molecular anthropology are controversial and open to alternative explanations. As the body of sequence data that can be fed into such analyses grows, the robustness of the method and acceptance of its results will probably increase. Junk DNA may be especially useful for these kinds of analyses. It is much more variable between individuals than coding sequences, and thus the amount of information it contains about human prehistory should be all the greater. One final, intriguing source of information is available from occasional DNA samples found still intact in mummies, human remains trapped in ice, or other environments that preserve DNA molecules. Such information will provide benchmarks by which to test the extrapolations that otherwise have to be made from contemporary DNA samples.

 

 

 

 

 

SOURCES AND ADDITIONAL READINGS

 

 

 

 

 

Adams, M. D., Kelley, J. M., Gocayne, J. D., Dubnick, M., Polymeropoulos, M. H., et al. 1991. Complementary DNA sequencing: Expressed sequence tags and human genome project. Science 252: 1651–1656.

 
