
Genomics- The Science and Technology Behind the Human Genome Project. Charles R. Cantor, Cassandra L / genomics11-15 / 15


546	RESULTS AND IMPLICATIONS OF LARGE-SCALE DNA SEQUENCING
picked up bacterial DNA rather than the intended mammalian insert. A number of simple schemes exist that can help to find such errors. Putative genomic or cDNA sequences should be screened against all known common vector sequences. A very frequent error is to include pieces of vector inadvertently as part of the supposed insert. The presence of common repeats like Alu should be searched for in putative cDNAs or exons. Except in the rarest cases, these sequences should not be present there; finding them suggests that a cloning artifact may have occurred.
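The vector screen described above can be sketched in a few lines. This is a minimal illustration, not a production contamination filter: the function name `shared_kmers`, the word length `k = 12`, and all sequences are assumptions invented for the example. It reports every word of the putative insert that also occurs in the vector or in its reverse complement.

```python
def shared_kmers(insert, vector, k=12):
    """Return (position, k-mer) for each k-mer of `insert` found in `vector` on either strand."""
    complement = str.maketrans("ACGT", "TGCA")
    reverse_complement = vector.translate(complement)[::-1]
    vector_kmers = set()
    for strand in (vector, reverse_complement):
        for i in range(len(strand) - k + 1):
            vector_kmers.add(strand[i:i + k])
    return [(i, insert[i:i + k])
            for i in range(len(insert) - k + 1)
            if insert[i:i + k] in vector_kmers]

# Hypothetical sequences, invented for illustration only.
vector = "GGATCCTCTAGAGTCGACCTGCAGGCATGCAAGCTTGGCACTGGCC"
clean_insert = "ATGAAACCCGGGTTTAAACCCGGGATATATCGCGCG"
# Simulate the frequent error: a piece of vector carried along with the insert.
contaminated_insert = clean_insert + vector[10:30]

hits_clean = shared_kmers(clean_insert, vector)
hits_contaminated = shared_kmers(contaminated_insert, vector)
print(len(hits_clean), len(hits_contaminated))
```

An exact-word screen of this kind is fast but will miss vector fragments that themselves contain sequencing errors; a real screen would tolerate mismatches.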

Rearrangements in cosmid clones and YACs are fairly common. The best way to find these errors at the DNA sequence level is to compare the sequences with other clones in available contigs. A major justification for the additional DNA sequencing required to examine a tiling set of cosmids is that there will be frequent overlaps, which can help catch errors caused by rearrangements. Small sequencing errors still occur at a rate of about 1% in both automated and manual sequencing. In many past efforts, considerable amounts of data were entered into sequence databases manually. It is vital that such data be verified by a process of double entry and comparison. If not, except in the hands of the most compulsively careful individuals, typographical errors will abound.

When a single base is miscalled, either by misreading raw sequence data or by mistranscription in manipulating that data, the error is extremely difficult to detect. However, when a base is inserted or deleted, especially within an ORF, the error is sometimes easily caught. One way to do this is a procedure developed by Janos Posfai and Richard Roberts. In the course of searching a DNA database, to examine possible homology between a new sequence and all preexisting sequences, one can ask whether potential strong sequence homology (usually after the DNA has been translated into protein) is blocked by a frame shift. Where this occurs, a DNA sequencing error is almost always responsible. Several examples of the power of this approach in spotting sequencing errors are shown in Figure 15.11.
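The idea of a match "blocked by a frame shift" can be illustrated with a toy example. The sketch below is a deliberate simplification and not the Posfai and Roberts procedure itself: it uses exact substring matching instead of a similarity search, looks only at the three forward reading frames, and works on an invented ten-residue protein. The helper names are hypothetical. It asks whether the two halves of a known protein are found in different reading frames of the new DNA.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(dna, frame):
    """Translate `dna` in forward reading frame 0, 1, or 2."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)

def frames_matching(dna, peptide):
    """Frames whose translation contains `peptide` exactly."""
    return [f for f in range(3) if peptide in translate(dna, f)]

def frameshift_suspected(dna, known_protein):
    """True when both halves of the protein are found, but never in a common frame."""
    half = len(known_protein) // 2
    head = frames_matching(dna, known_protein[:half])
    tail = frames_matching(dna, known_protein[half:])
    return bool(head) and bool(tail) and not set(head) & set(tail)

# Hypothetical protein and its coding sequence, invented for illustration.
protein = "MKLVDEWTGH"
dna = "ATGAAACTGGTTGATGAATGGACCGGTCAT"
# Simulate a sequencing error: one spurious base inserted after the fifth codon.
dna_with_error = dna[:15] + "C" + dna[15:]

print(frameshift_suspected(dna, protein))
print(frameshift_suspected(dna_with_error, protein))
```

In the erroneous sequence the first half of the protein is read in frame 0 and the second half in frame 1, which is the signature the text describes.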

An unsolved problem is how to alert the community when errors are found. Given the size of the community and the complexity of the queries it makes against the sequence databases, this is an enormous problem. At some point the databases will have to be intelligent enough to be able to evaluate the effect of corrections on past queries and alert the initiators of those queries that might now be subject to altered outcomes. If this cannot be done, inevitably people will begin to repeat queries over and over again to guard against the effects of errors.

A second potential unsolved problem is how to deal with fraudulent sequences. Research journals are increasingly reluctant to publish DNA sequence results, and it is almost impossible to publish the raw data supporting DNA sequencing results. Because of this, much sequence data are submitted directly to databases without editorial review of the actual experimental data. This entails the risk that databases might become contaminated willfully or accidentally by the deposit of sequences marred by artifacts or totally artificial. Just how these sequences could be detected and removed remains a serious dilemma. Ultimately it may be necessary to link the databases to archives of raw data so that validation of a suspected artifact is feasible.

Figure 15.11 Finding frameshift errors by comparing a new sequence with sequences preexisting in the databases. Adapted from Posfai and Roberts (1992).

SEARCHING FOR THE BIOLOGICAL FUNCTION OF DNA SEQUENCES

The major thrust of biological research is to understand function. From the viewpoint of the genome, this search for function can occur at two very different levels: individual genes or patterns of gene organization. We first discuss the genome from this latter vantage point. An overview of the arrangement of sequences in the genome may provide patterns of information that offer a clue to global aspects of function. These may be domains of gene activity or gene type that reflect biological processes we have not yet discovered. For example, most similar or related genes are not clustered. Some small clusters are seen, such as the globin genes (Fig. 2.10). The pattern of arrangement of the genes in these clusters presumably reflects an ancient gene duplication, which separated the alpha and beta families, and more recent duplications that evolved the more closely related members of these families. What is striking, and not yet explained, is that the order of the genes in each of these families accurately corresponds to the temporal order in which the genes are expressed during human development.

Another example of intriguing patterns of gene arrangement is the hox gene family in man and the mouse, shown in Figure 15.12. The genes in this family code for factors that determine the segmental pattern of organization of the developing embryo. The family is complex, and dispersed on a number of different chromosomes. Two aspects of the organization of the family are striking. First, it is so well conserved between the two species. Second, the spatial order of the genes within the family is the same as the order of the segments in the embryo that these genes affect. It is as if, for some totally unexplained reason, the map of the structure of the gene family is an image of the map of the function of that family.

Figure 15.12 Organization of homeobox (hox) genes in the mammalian genomes. All genes in all four clusters are transcribed from left to right.

A final example of functionally interesting gene arrangements is seen in a number of the members of the immunoglobulin superfamily including the light and heavy chains of antibodies, and several chains of the T-cell receptor. Here large numbers of related genes are grouped together, mostly in a single continuous segment of a chromosome. The reason for this is probably to assist the rearrangement of these genes, which takes place by DNA splicing to form mature expressed genes for antigen-specific proteins. If other regions of the genome are found with very large clusters of similar genes, one may well suspect that somatic DNA rearrangement or some other unusual biological mechanism will be at play with these genes.
A totally different view of global function afforded by complete physical maps and DNA sequences is the ability to compare these physical structures of DNA with the genetic map. An example is shown in Figure 15.13 for yeast chromosome III. There are clearly some regions where meiotic recombination is much more frequent than average and others where it is greatly suppressed. We do not yet understand the origin of these effects. One possibility is just the presence or absence of local DNA sequences that constitute recombination hot spots. However, there are other more global possibilities. Recombination may correlate with overall transcriptional activity, since highly transcribed chromatin is more open and accessible to all types of enzymes including those responsible for recombination. Thus there may be positional relationships between gene function and recombination, and thus gene evolution, that we still know nothing about today.

SEARCHING FOR THE BIOLOGICAL FUNCTION OF GENES

Most biologists, when they think of biological function in the context of the genome project, are referring to the function of individual genes. A common criticism of the genome project is that it is relatively useless to know the DNA sequences of genes without strong prior hints about their function. Most traditional molecular genetics begins with a function of interest and attempts to find the genes that determine or affect that function. This traditional view of biology is contrasted with the challenge posed by genome research in Figure 14.1, which can well serve as a paradigm for all of biology. In genome research we will discover DNA sequences with no a priori known function. Our current ability to translate these DNA sequences correctly into protein sequences is excellent, as we showed earlier, by using GRAIL or other powerful algorithms. Our current ability to take these protein sequences and draw immediate inferences about their possible function is well illustrated by the example in Figure 15.14a. Except for those rare readers of this book who are conversant in Dutch, this passage is largely unreadable. However, the frustrating aspect is that the passage is not totally unreadable. Because a number of scientific terms are cognates in Dutch and English, certain features stand out; one knows the passage has something to do with protein structure, but the full impact of the message is completely lost.


Figure 15.13 A comparison of the genetic and physical map of the yeast S. cerevisiae.

Ingewikkelde en grote biologische macro-moleculen kunnen spontaan in hun meest stabiele conformatie vouwen. Helaas, ontbreekt ons de kennis om dit proces te voorspellen want de gevouwen strutuur kan belangrijke aanwijzingen over die functie van het molecuul bevatten.

(a)

We know that large biological molecules can fold into their most stable state spontaneously, but we really have little ability at present to predict this folding. Our ignorance is most unfortunate since the folded structure may contain important clues on how the molecule functions.

(b)

Figure 15.14 An analogy for the current (a) and desired future (b) ability to interpret DNA sequence in terms of its likely biological function.

When the passage in Figure 15.14a is translated into English, it points to one direction that can help find functional clues (Fig. 15.14b). Considerable experience to date shows that protein three-dimensional structures are better conserved during evolution than protein sequences. A great deal of current research effort is being devoted to improving our ability to infer possible protein structures from the sequences of sets of related proteins, provided that at least one of them has a known three-dimensional structure. As our ability to do this improves, and as more of the different classes of protein structures come to have one or more members successfully studied at high resolution by X-ray crystallographic or nuclear magnetic resonance techniques, the prospects of stepping quickly from a sequence to a realistic, if not exact, model of the structure should improve markedly. However, just knowing a three-dimensional structure does not immediately provide definitive clues to function. It simply makes comparisons between a protein of unknown function and the set of proteins of known function more powerful and more likely to yield useful insights.
Today, when a new segment of DNA sequence is determined, the first thing that is almost always done with it is to compare it to all other known DNA sequences. The purpose is to see if it is related to anything already known. By related, we mean that there is a statistically significant similarity to one or more preexisting DNA sequences. The definition of what statistically significant means in the context of sequence comparisons is not universally accepted despite decades of work in this area. Obviously, at one extreme, one may find that a new sequence is virtually identical to a preexisting one. Unless the two sequences derive from very similar but not identical organisms, the finding of near identity means true identity with the differences due to sequencing errors, or a new member of a gene family, or an example of proteins very strongly conserved in evolution, like the histones. At the other extreme, a new sequence may match nothing to within whatever local standards of minimal homology are considered operative.
Most often, however, when a new DNA sequence is compared with the current database of more than 1000 Mb of DNA, some slight or significant sequence homology is found. For coding sequences, it is usually much more powerful to search after translation of DNA to protein. This translation loses very little functional information; it gains considerable statistical power because the noise caused by the degeneracy of the genetic code is blanked out. Thus consider, for example, two arginine codons like AGG and CGA in a corresponding place on two sequences; the only evidence for similarity is the G in position 2, which has roughly one chance in four of occurring randomly. In contrast, posing an arginine opposite an arginine at the same place in a protein sequence has, very crudely, only one chance in 20 of occurring randomly. (In reality the statistical differences are not this great because amino acids with six possible codons, like arginine, also tend to occur much more often than average.)
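The AGG versus CGA example can be checked directly. The sketch below builds the standard genetic code and compares the two codons at the DNA level and at the protein level. The final figure, the share of sense codons that encode arginine, assumes all 61 sense codons are used equally often; that is an illustrative assumption, not a measured amino acid frequency.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def base_identities(codon_a, codon_b):
    """Number of positions at which two codons carry the same base."""
    return sum(x == y for x, y in zip(codon_a, codon_b))

codon_1, codon_2 = "AGG", "CGA"
same_bases = base_identities(codon_1, codon_2)
same_residue = CODON_TABLE[codon_1] == CODON_TABLE[codon_2]

arginine_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "R")
sense_codons = [c for c, aa in CODON_TABLE.items() if aa != "*"]
# Crude share of arginine if every sense codon were equally likely (an assumption).
arginine_share = len(arginine_codons) / len(sense_codons)

print(same_bases, same_residue, arginine_codons, round(arginine_share, 3))
```

Under that crude assumption arginine would account for roughly one residue in ten rather than one in 20, which is the direction of the correction noted in the parenthetical remark.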
A statistically significant match between a new sequence and some preexisting sequence implies some or all of the following possibilities: similar function, similar structure, or evolutionary relatedness. It is not easy to sort out these different effects. However, one encouraging feature of such global sequence searches is that their effectiveness appears to be increasing markedly and rapidly as the database grows. Ten years ago Russell Doolittle noted that a new protein sequence had a 25% chance of matching something else in the databases. Currently the odds are considerably better than this. From the first bacterial sequencing projects described earlier in this chapter, between 54% and 78% of the ORFs found showed hints of homology in structure or function with something else in the database. With the S. cerevisiae ORFs on chromosome III, 42% gave hints of homologous structure or function, of which 14% were deemed really quite strong. In the case of C. elegans, where more extensive data are available, 45% of the ORFs were reported to be relatable to existing databases. It seems likely that in a few years it will be the odd new sequence that does not immediately match something known. While it is too early to be sure how rapidly this goal will be achieved, there is room for considerable optimism at present.
METHODS FOR COMPARING SEQUENCES

Entire books have been written about the relative merits of different approaches to aligning sequences and testing their relatedness (Waterman, 1995; Gribskov and Devereux, 1991). The topic is actually quite complex because the nonrandom nature of natural DNA sequences greatly confounds attempts to construct simple statistical tests of relatedness. Here our goal will be to present the basic notions of how sequences are compared and what these comparisons mean. Sequences are strings of symbols. Any two strings can be compared by direct alignment and the use of scoring criteria for similarity. For two strings of length $n$ and $m$ there are $2(n + m - 1)$ possible continuous alignments, by which we mean that no gaps are allowed in either string. Of course many of these alignments are fairly trivial and uninteresting because the strings will barely overlap. The moment gaps are allowed on one or both strings, the number of alignments rises in a combinatorial manner to reach heights that can test the power of the fastest existing supercomputers if the problem is not handled intelligently.
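Gap-free comparison is simple enough to enumerate exhaustively. The following sketch slides one string past the other and scores each placement by counting identities. For one fixed orientation of the two strings there are n + m - 1 placements with at least one overlapping position. The function names and the two test strings are hypothetical, and identity counting is only the crudest possible scoring rule.

```python
def ungapped_alignments(s, t):
    """Yield (offset, aligned pairs) for every placement of t against s with some overlap.

    `offset` is the position in s that lines up with t[0]; it may be negative.
    """
    for offset in range(-(len(t) - 1), len(s)):
        start = max(0, offset)
        stop = min(len(s), offset + len(t))
        yield offset, [(s[i], t[i - offset]) for i in range(start, stop)]

def best_ungapped(s, t):
    """Return (identities, offset) for the placement with the most identical positions."""
    return max((sum(a == b for a, b in pairs), offset)
               for offset, pairs in ungapped_alignments(s, t))

# Hypothetical strings, invented for illustration.
s = "ACGTACGTTT"
t = "TACGT"

placements = list(ungapped_alignments(s, t))
print(len(placements), best_ungapped(s, t))
```

Most of the placements overlap by only a few symbols, which is why so many of them are uninteresting.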
An example of a very simple case in which two very similar DNA sequences are aligned is shown in Figure 15.15. In this case the alignment needed to maximize the apparent similarity between the two sequences is obvious. What is less obvious is the sort of score to give such an alignment. The simplest scoring scheme is black and white: Grade all identities the same and all differences the same. However, this makes little sense from either a biological or a statistical vantage point. As far as biology is concerned, if, for example, we are looking at the functional relatedness of proteins coded for by these sequences, or if we are looking at possible evolutionary relationships between them, transversions (interchange of a purine and a pyrimidine) should be weighted as more consequential differences than transitions (interchange between two pyrimidines or two purines). This is because the rate of transversion mutations is much less than the rate of transitions, and the genetic code appears to have evolved so that effects of transversions on the resulting amino acids are more functionally disruptive than the effect of transitions. For example, many synonymous codons are related by a transition in their third position. But the example goes much deeper; for example, codons for different hydrophobic amino acids are also related mostly by transitions.

Figure 15.15 A simple example of a comparison between two putatively related nucleic acid sequences and two ways in which their relatedness could be scored, S = transition and V = transversion.
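The two scoring rules can be compared on a small aligned pair. In this sketch each aligned position is classified as a match, a transition, or a transversion, and the counts are then weighted. The sequences and the weights are assumptions chosen only to show that the same alignment receives different scores under the black-and-white rule and under a rule that penalizes transversions more heavily.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify(x, y):
    """Classify one aligned pair of bases."""
    if x == y:
        return "match"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"
    return "transversion"

def score_alignment(seq_a, seq_b, weights):
    """Count each class of aligned pair and return (counts, weighted total)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to equal length")
    counts = {"match": 0, "transition": 0, "transversion": 0}
    for x, y in zip(seq_a, seq_b):
        counts[classify(x, y)] += 1
    total = sum(weights[kind] * n for kind, n in counts.items())
    return counts, total

# Hypothetical aligned sequences and weights, invented for illustration.
seq_a = "ACGTACGTAC"
seq_b = "ACATACCTTC"
black_and_white = {"match": 1, "transition": 0, "transversion": 0}
weighted = {"match": 1, "transition": -1, "transversion": -2}

counts, simple_total = score_alignment(seq_a, seq_b, black_and_white)
_, weighted_total = score_alignment(seq_a, seq_b, weighted)
print(counts, simple_total, weighted_total)
```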


To take statistical factors into account in estimating the significance of a mismatch or a match purely at the DNA level, we have to consider the relative frequency of each residue in the strings being compared. For example, sequences rich in A's will show large numbers of A's matched with A's, just by chance. In order to take this into account, and to add issues like transitions and transversions, one needs to employ a scoring matrix. This is illustrated in Figure 15.16a. The $4 \times 4$ scoring matrix for nucleic acid comparisons allows for any possible weight to be assigned to a particular set of bases at an alignment position. Generally, the same scoring matrix is used for every alignment position, although there is no reason why one should have to do this, nor is there any reason why it is desirable except for simplicity. Think ahead to the alignment of protein sequences where residues on exterior loops can be quite variable without perturbing the overall structure. Therefore, if one had some way of knowing a priori that a residue was in a loop as opposed to a helix or sheet, one could adjust the weighting factors accordingly. This example illustrates the complex interplay between sequence and structure information that really has to occur in very robust comparison algorithms.
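The effect of base composition on chance matches is easy to quantify. If positions are treated as independent draws from each sequence's own base frequencies (a simplifying assumption), the expected fraction of identical positions is the sum over the four bases of the product of their frequencies in the two sequences. The sequences below are invented for illustration.

```python
from collections import Counter

def base_frequencies(seq):
    """Relative frequency of each of the four bases in `seq`."""
    counts = Counter(seq)
    return {base: counts[base] / len(seq) for base in "ACGT"}

def chance_identity(seq_a, seq_b):
    """Expected fraction of identical positions if bases were paired at random."""
    freq_a = base_frequencies(seq_a)
    freq_b = base_frequencies(seq_b)
    return sum(freq_a[base] * freq_b[base] for base in "ACGT")

# Hypothetical sequences, invented for illustration.
balanced = "ACGT" * 5
a_rich = "AAAAAAAACG" + "AAAAAAAATT"

print(chance_identity(balanced, balanced))
print(chance_identity(a_rich, a_rich))
```

Two sequences that are each 80% A are expected to agree at about 65% of positions by chance alone, against 25% for sequences of even composition, so a raw identity count overstates their relatedness.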
The simplest possible DNA scoring matrix, corresponding to the rule used in Figure 15.15, is just a set of identities with no correction for overall base composition (Fig. 15.16b). The general case would consist of a set of elements $a_{ij}$ that are all different, except that the matrix should be symmetrical; each $a_{ij} = a_{ji}$ since we have no way, in comparing just two proteins, to favor one sequence over another. The elements $a_{ij}$ must incorporate all of our biological and statistical prejudices. When protein sequences are compared, the scoring matrices can become more complicated. First of all, the matrix must be $20 \times 20$ instead of $4 \times 4$. It can be as simple as an identity matrix, just as in the case for nucleic acids, but a much more accurate picture will incorporate statistical information about the relative frequency of amino acids. This immediately raises one serious problem: Does one use the amino acid composition of the two proteins in question to construct the scoring matrix, or does one use the amino acid compositions of all known proteins, or all known proteins from the particular species involved? One can elaborate the problem even further by asking whether the nonrandomness of dipeptide frequencies should be considered in making statistical evaluations for the scoring matrix. There are no simple answers to these questions.
Most commonly, with protein sequence comparisons, one incorporates information about amino acid physical properties into the values of the elements of the scoring matrix. Thus, for example, interchanges among ile, leu, and val, or ser and thr, among proteins known to be related in structure and function are very commonly seen and are presumably mostly innocuous. Examples of two real scoring matrices are shown in Figure 15.17.

Figure 15.16 Comparison matrices between two nucleic acid sequences. (a) A general matrix. (b) The simplest possible matrix.



Figure 15.17 Examples of actual scoring matrices for protein sequences that take into account the similar properties of certain types of amino acids. (a) The BLOSUM62 matrix used by BLAST (Henikoff and Henikoff, 1993). (b) The structural (STTR) matrix of Shpaer et al. (1996).


The values of these elements obviously vary over a wide range. However, despite their different origins, the two matrices are fairly similar.
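A toy protein scoring matrix shows how physical similarity can be folded into the elements. The sketch below is not either of the matrices in Figure 15.17; it merely rewards identities, gives a smaller reward to the interchanges named in the text (ile, leu, val and ser, thr), and penalizes everything else. The three score values and the test peptides are arbitrary assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Interchanges described in the text as commonly seen: ile/leu/val and ser/thr.
CONSERVATIVE_GROUPS = [set("ILV"), set("ST")]

def build_matrix(identity=2, conservative=1, other=-1):
    """Build a symmetric 20 x 20 matrix as a dict keyed by residue pairs."""
    matrix = {}
    for x in AMINO_ACIDS:
        for y in AMINO_ACIDS:
            if x == y:
                value = identity
            elif any(x in group and y in group for group in CONSERVATIVE_GROUPS):
                value = conservative
            else:
                value = other
            matrix[x, y] = value
    return matrix

def score(seq_a, seq_b, matrix):
    """Sum the matrix elements over the aligned residue pairs."""
    return sum(matrix[x, y] for x, y in zip(seq_a, seq_b))

matrix = build_matrix()
# Hypothetical peptides, invented for illustration.
conservative_score = score("MILKS", "MVLKT", matrix)
disruptive_score = score("MILKS", "MDLKW", matrix)
print(conservative_score, disruptive_score)
```

Both test peptides differ from MILKS at the same two positions, yet the conservative substitutions score 8 and the disruptive ones score 4.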

There is still one additional complication that must be dealt with. This is especially serious when one wishes to estimate the evolutionary relatedness of two proteins or DNAs. Here a yardstick that is often used as a time scale for evolutionary divergence is the probable average number of mutations needed to convert one sequence into the other. Such comparisons among very similar proteins or nucleic acids are relatively simple. Differences seen are presumably real, and similarities are also presumed real. However, when more distant sequences are compared, an apparent similarity has an increasing chance of just being a statistical event, or a reversion. For example, as shown in Figure 15.18, two matching A's could be a true identity (no mutations) or a reversion (a minimum of two mutations). The more distantly related the two sequences, the more the latter possibility has to be weighted. Ways of doing this for simple identity comparison matrices were developed several decades ago by Jukes and Cantor, and later elaborated considerably to take into account statistical effects and similarities in residue properties. The kind of matrix needed in a very simple case is shown in Figure 15.19. It adjusts the relative weights of comparisons as a function of the average extent of differences between the two sequences. The problem of choosing an ideal comparison matrix, which deals with all of these interrelated issues, is still not a simple one.
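The distance-dependent matrix can be written out explicitly. The sketch below uses the off-diagonal element a = (1/4)(1 - e^(-4d/3)) given with Figure 15.19. It assumes the diagonal element is 1 - 3a, so that each row sums to one; that assumption reproduces the identity matrix as d approaches 0 and a matrix of uniform 1/4 entries as d grows large. The inverse function, which recovers d from the observed fraction of differing sites, follows algebraically from the same formula.

```python
import math

def jukes_cantor_matrix(d):
    """4 x 4 matrix of pair weights for evolutionary distance d (mutations per site)."""
    a = 0.25 * (1.0 - math.exp(-4.0 * d / 3.0))
    # Diagonal taken as 1 - 3a so that each row sums to one (an assumption).
    return {(x, y): (1.0 - 3.0 * a if x == y else a)
            for x in "ACGT" for y in "ACGT"}

def corrected_distance(p):
    """Distance d implied by an observed fraction p of differing sites (p < 3/4)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

identical = jukes_cantor_matrix(0.0)
diverged = jukes_cantor_matrix(50.0)

# Round trip: the expected fraction of differing sites at d = 0.3 is 3a.
p_observed = 3 * jukes_cantor_matrix(0.3)["A", "C"]
print(round(p_observed, 4), round(corrected_distance(p_observed), 4))
```

At d = 0.3 the expected fraction of differing sites is about 0.247, noticeably less than 0.3, because some sites have mutated more than once or reverted.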

Figure 15.18 Difficulties in sequence comparisons when the goal is to estimate the probable number of mutations that have occurred to derive one sequence from another (or both from a common ancestor).

Figure 15.19 A simple scoring matrix that takes into account the average differences between two sequences and allows for the possibility of revertants, where $a = \tfrac{1}{4}\left(1 - e^{-4d/3}\right)$. The parameter $d$ is a measure of the true evolutionary distance between two sequences being compared. It is the average number of mutations per site that separate one sequence from the other. In the limit $d \to 0$ the matrix becomes equal to the right-hand panel of Figure 15.16. In the limit $d \to \infty$ all of the elements of the matrix become equal to 1/4. This means that the sequences have diverged so much that one is essentially comparing two random strings.

Once a comparison matrix is chosen, it can be used to evaluate the relative similarity seen in all possible alignments between two strings. When gaps (caused by a putative insertion or deletion, or a pure statistical artifact) are allowed, the problem of actually enumerating and testing all possible comparisons becomes computationally extremely demanding. Figure 15.20 shows a very simple example. The issue is how to test the likelihood that the postulated gap results in a statistically significant improvement in the alignment score of the two sequences. Obviously there must be a statistical penalty attached to the use of such a gap, since it greatly increases the number of possible comparisons, and thus the chance of finding, at random, a comparison with a score better than some arbitrary value.
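The trade-off between a gap and its penalty can be seen in a small case. The sketch below tries a single gap at every position of the shorter sequence, scores each gapped alignment by identities, and subtracts a fixed penalty. A flat penalty is an assumption made for illustration; the text requires only that some statistical penalty be attached. The sequences are hypothetical, with the shorter one lacking two bases of the longer.

```python
def identity_score(a, b):
    """Number of identical positions when a and b are laid side by side from the left."""
    return sum(x == y for x, y in zip(a, b))

def best_single_gap(short, long_, gap_penalty):
    """Return (penalized score, gap position) for the best single-gap alignment."""
    gap_length = len(long_) - len(short)
    best = None
    for position in range(len(short) + 1):
        gapped = short[:position] + "-" * gap_length + short[position:]
        penalized = identity_score(gapped, long_) - gap_penalty
        if best is None or penalized > best[0]:
            best = (penalized, position)
    return best

# Hypothetical sequences, invented for illustration.
long_seq = "ACGTTTGACCA"
short_seq = "ACGTGACCA"

ungapped = identity_score(short_seq, long_seq)
mild = best_single_gap(short_seq, long_seq, gap_penalty=2)
harsh = best_single_gap(short_seq, long_seq, gap_penalty=6)
print(ungapped, mild, harsh)
```

With a penalty of 2 the gapped alignment (score 7) beats the ungapped one (score 4); with a penalty of 6 it does not, so the choice of penalty decides whether the gap is accepted.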
From a practical point of view, it is impossible to test all possible gap numbers and locations. One way to deal with this problem is to compare two sequences through smaller windows, sets of successive residues, rather than globally (Fig. 15.21). With two strings of length $n$ and $m$, and a window of length $L$, there are $(n - L + 1)(m - L + 1)$ possible comparisons to be done. This is not a major task for strings the sizes of typical genes. For each choice of window, two substrings of length $L$ are compared, without gaps. The score for this comparison is calculated as the sum over the matrix elements $a_{ij}$ for each of the $L$ residue pairs. To provide a visual overview of the comparison, it is usually convenient to plot all scores above some threshold value as a dot in a rectangular field formed by writing one sequence along the horizontal axis and the other along the vertical axis. Any point in the field corresponds to an alignment of $L$ residues positioned at particular residue positions in the two sequences. This kind of dot matrix plot is shown, schematically, in Figure 15.22, and a real example of a sequence comparison at the DNA level for two closely related viruses, SV40 and polyoma, is given in Figure 15.23. Any regions with

Figure 15.20 A simple case of two sequences potentially related by an insertion or a deletion.

Figure 15.21 Window selection on a single sequence assists in comparisons.

Figure 15.22 An example of the comparison of two proteins or DNAs using windows on each, evaluated with a scoring matrix. Shown as dots are all comparisons that score above a selected threshold.
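The windowed comparison behind a dot matrix plot can be sketched as follows. Every window of length L on one sequence is compared, without gaps, to every window on the other, and a dot is recorded wherever the score reaches the threshold. Identity counting stands in for a general scoring matrix here, and the sequences, window length, and threshold are assumptions chosen for illustration.

```python
def dot_matrix(seq_a, seq_b, window, min_score):
    """Return (i, j) for each pair of windows whose identity count is at least min_score.

    i and j are the start positions of the windows in seq_a and seq_b.
    """
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            pairs = zip(seq_a[i:i + window], seq_b[j:j + window])
            if sum(x == y for x, y in pairs) >= min_score:
                dots.append((i, j))
    return dots

# Hypothetical sequences, invented for illustration.
seq_a = "ACGTACGTTT"
seq_b = "TACGT"
window = 4

comparisons = (len(seq_a) - window + 1) * (len(seq_b) - window + 1)
dots = dot_matrix(seq_a, seq_b, window, min_score=4)
print(comparisons, dots)
```

Dots that share the same value of i - j lie on one diagonal of the plot and mark a continuous run of similarity; here (3, 0) and (4, 1) form such a run.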
