
RESULTS AND IMPLICATIONS OF LARGE-SCALE DNA SEQUENCING

picked up bacterial DNA rather than the intended mammalian insert. A number of simple schemes exist that can help to find such errors. Putative genomic or cDNA sequences should be screened against all known common vector sequences. A very frequent error is to include pieces of vector inadvertently as part of the supposed insert. The presence of common repeats like Alu should be searched for in putative cDNAs or exons. Except in the rarest cases, these sequences should not be present there; finding them suggests that a cloning artifact may have occurred.
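The vector screen described above can be sketched as a search for short words shared between a putative insert and known vector sequences. This is only a minimal illustration under stated assumptions: the function name `find_vector_hits`, the word size `k`, and the toy sequences are all invented for the example, and a real screen would use an alignment program against a curated vector library.

```python
def find_vector_hits(insert, vectors, k=12):
    """Return (vector_name, insert_position) pairs where a k-base word of
    the putative insert also occurs in a known vector sequence."""
    insert = insert.upper()
    hits = []
    for name, vec in vectors.items():
        vec = vec.upper()
        # All k-base words present in this vector.
        vector_words = {vec[i:i + k] for i in range(len(vec) - k + 1)}
        for pos in range(len(insert) - k + 1):
            if insert[pos:pos + k] in vector_words:
                hits.append((name, pos))
    return hits


# Toy sequences chosen only for illustration (hypothetical data).
vectors = {"toy_vector": "GGATCCTCTAGAGTCGACCTGCAGGCATGC"}
clean = "ATGGCCAAGTTTCCGGAACTGAAGTAA"
contaminated = "TCTAGAGTCGACCTGCAG" + clean  # vector piece joined to the insert

print(find_vector_hits(clean, vectors))         # no shared words
print(find_vector_hits(contaminated, vectors))  # shared words at the 5' end
```

Exact word matching is the simplest possible criterion; it would miss vector fragments that themselves carry sequencing errors, which is one reason an alignment-based screen is preferable in practice.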

Rearrangements in cosmid clones and YACs are fairly common. The best way to find these errors at the DNA sequence level is to compare the sequences with other clones in available contigs. A major justification for the additional DNA sequencing required to examine a tiling set of cosmids is that there will be frequent overlaps which can help catch errors caused by rearrangements. Small sequencing errors are still about 1% in automated or manual sequencing. In many past efforts, considerable amounts of data were entered into sequence databases manually. It is vital that this be verified by a process of double entry and comparison. If not, except in the hands of the most compulsively careful individuals, typographical errors will abound.

When a single base is miscalled, either by misreading raw sequence data or by mistranscription in manipulating that data, the error is extremely difficult to detect. However, when a base is inserted or deleted, especially within an ORF, the error is sometimes easily caught. One way to do this is a procedure developed by Janos Posfai and Richard Roberts. In the course of searching a DNA database, to examine possible homology between a new sequence and all preexisting sequences, one can ask whether potential strong sequence homology (usually after the DNA has been translated into protein) is blocked by a frame shift. Where this occurs, a DNA sequencing error is almost always responsible. Several examples of the power of this approach in spotting sequencing errors are shown in Figure 15.11.
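The idea of homology being "blocked by a frame shift" can be illustrated with a small sketch. It is not the Posfai and Roberts procedure itself, only a toy version of the underlying idea: translate the new DNA in all three forward frames and ask whether matches to a known protein switch from one frame to another. The function names, the exact-word matching with word size `word`, and the toy protein are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna, frame=0):
    """Translate one forward reading frame (0, 1, or 2)."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(frame, len(dna) - 2, 3))


def frames_matching(dna, known_protein, word=4):
    """Map each frame to the positions in known_protein that are matched
    exactly by some peptide word from that frame's translation."""
    result = {}
    for frame in range(3):
        pep = translate(dna, frame)
        hits = sorted({known_protein.find(pep[i:i + word])
                       for i in range(len(pep) - word + 1)} - {-1})
        if hits:
            result[frame] = hits
    return result


def suspect_frameshift(dna, known_protein, word=4):
    """Flag the sequence if similarity is spread over more than one frame."""
    return len(frames_matching(dna, known_protein, word)) > 1


# Hypothetical protein and a DNA sequence that encodes it.
known = "MKTAYIAKQRQISFVK"
good = "ATGAAAACCGCGTATATTGCGAAACAGCGTCAGATTAGCTTTGTGAAA"
bad = good[:24] + good[25:]  # simulate a single-base deletion

print(frames_matching(good, known))  # similarity confined to frame 0
print(frames_matching(bad, known))   # similarity jumps from frame 0 to frame 2
```

In the damaged sequence the amino-terminal matches lie in one frame and the carboxy-terminal matches in another, which is the signature that points to an inserted or deleted base near the junction.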

 

 

 

 

Figure 15.11 Finding frameshift errors by comparing a new sequence with sequences preexisting in the databases. Adapted from Posfai and Roberts (1992).

An unsolved problem is how to alert the community when errors are found. Given the size of the community and the complexity of the queries it makes against the sequence databases, this is an enormous problem. At some point the databases will have to be intelligent enough to be able to evaluate the effect of corrections on past queries and alert the initiators of those queries that might now be subject to altered outcomes. If this cannot be done, inevitably people will begin to repeat queries over and over again to guard against the effects of errors. A second potential unsolved problem is how to deal with fraudulent sequences. Research journals are increasingly reluctant to publish DNA sequence results, and it is almost impossible to publish the raw data supporting DNA sequencing results. Because of this, much sequence data are submitted directly to databases without editorial review of the actual experimental data. This entails the risk that databases might become contaminated willfully or accidentally by the deposit of sequences marred by artifacts or totally artificial. Just how these sequences could be detected and removed remains a serious dilemma. Ultimately it may be necessary to link the databases to archives of raw data so that validation of a suspected artifact is feasible.

SEARCHING FOR THE BIOLOGICAL FUNCTION OF DNA SEQUENCES

The major thrust of biological research is to understand function. From the viewpoint of the genome, this search for function can occur at two very different levels: individual genes or patterns of gene organization. We first discuss the genome from this latter vantage point. An overview of the arrangement of sequences in the genome may provide patterns of information that offer a clue to global aspects of function. These may be domains of gene activity or gene type that reflect biological processes we have not yet discovered. For example, most similar or related genes are not clustered. Some small clusters are seen, such as the globin genes (Fig. 2.10). The pattern of arrangement of the genes in these clusters presumably reflects an ancient gene duplication, which separated the alpha and beta families, and more recent duplications that evolved the more closely related members of these families. What is striking, and not yet explained, is that the order of the genes in each of these families accurately corresponds to the temporal order in which the genes are expressed during human development.

Another example of intriguing patterns of gene arrangement is the hox gene family in man and the mouse, shown in Figure 15.12. The genes in this family code for factors that determine the segmental pattern of organization of the developing embryo. The family is complex, and dispersed on a number of different chromosomes. Two aspects of the organization of the family are striking. First, it is so well conserved between the two species. Second, the spatial order of the genes within the family is the same as the order of the segments in the embryo that these genes affect. It is as if, for some totally unexplained reason, the map of the structure of the gene family is an image of the map of the function of that family.

Figure 15.12 Organization of homeobox (hox) genes in the mammalian genomes. All genes in all four clusters are transcribed from left to right.

A final example of functionally interesting gene arrangements is seen in a number of the members of the immunoglobulin superfamily including the light and heavy chains of antibodies, and several chains of the T-cell receptor. Here large numbers of related genes are grouped together, mostly in a single continuous segment of a chromosome. The reason for this is probably to assist the rearrangement of these genes, which takes place by DNA splicing to form mature expressed genes for antigen-specific proteins. If other regions of the genome are found with very large clusters of similar genes, one may well suspect that somatic DNA rearrangement or some other unusual biological mechanism will be at play with these genes.

 

A totally different view of global function afforded by complete physical maps and DNA sequences is the ability to compare these physical structures of DNA with the genetic map. An example is shown in Figure 15.13 for yeast chromosome III. There are clearly some regions where meiotic recombination is much more frequent than average and others where it is greatly suppressed. We do not yet understand the origin of these effects. One possibility is just the presence or absence of local DNA sequences that constitute recombination hot spots. However, there are other more global possibilities. Recombination may correlate with overall transcriptional activity, since highly transcribed chromatin is more open and accessible to all types of enzymes including those responsible for recombination. Thus there may be positional relationships between gene function and recombination, and thus gene evolution, that we still know nothing about today.

 

SEARCHING FOR THE BIOLOGICAL FUNCTION OF GENES

Most biologists, when they think of biological function in the context of the genome project, are referring to the function of individual genes. A common criticism of the genome project is that it is relatively useless to know the DNA sequences of genes without strong prior hints about their function. Most traditional molecular genetics begins with a function of interest and attempts to find the genes that determine or affect that function. This traditional view of biology is contrasted with the challenge posed by genome research in Figure 14.1, which can well serve as a paradigm for all of biology. In genome research we will discover DNA sequences with no a priori known function. Our current ability to translate these DNA sequences correctly into protein sequences is excellent, as we showed earlier, by using GRAIL or other powerful algorithms. Our current ability to take these protein sequences and draw immediate inferences about their possible function is well illustrated by the example in Figure 15.14a. Except for those rare readers of this book who are conversant in Dutch, this passage is largely unreadable. However, the frustrating aspect is that the passage is not totally unreadable. Because a number of scientific terms are cognates in Dutch and English, certain features stand out: one knows the passage has something to do with protein structure, but the full impact of the message is completely lost.

Figure 15.13 A comparison of the genetic and physical map of the yeast S. cerevisiae.

(a) Ingewikkelde en grote biologische macro-moleculen kunnen spontaan in hun meest stabiele conformatie vouwen. Helaas, ontbreekt ons de kennis om dit proces te voorspellen want de gevouwen structuur kan belangrijke aanwijzingen over die functie van het molecuul bevatten.

(b) We know that large biological molecules can fold into their most stable state spontaneously but we really have little ability at present to predict this folding. Our ignorance is most unfortunate since the folded structure may contain important clues on how the molecule functions.

Figure 15.14 An analogy for the current (a) and desired future (b) ability to interpret DNA sequence in terms of its likely biological function.

 

When the passage in Figure 15.14a is translated into English, it provides an important clue to one direction that can help find functional clues (Fig. 15.14b). Considerable experience to date shows that protein three-dimensional structures are better conserved during evolution than protein sequences. A great deal of current research effort is being devoted to improving our ability to infer possible protein structures from the sequences of sets of related proteins, provided that at least one of them has a known three-dimensional structure. As our ability to do this improves, and as more of the different classes of protein structures come to have one or more members successfully studied at high resolution by X-ray crystallographic or nuclear magnetic resonance techniques, the prospects of stepping quickly from a sequence to a realistic, if not exact, model of the structure should improve markedly. However, just knowing a three-dimensional structure does not immediately provide definitive clues to function. It simply makes comparisons between a protein of unknown function and the set of proteins of known function more powerful and more likely to yield useful insights.

 

 

Today, when a new segment of DNA sequence is determined, the first thing that is almost always done with it is to compare it to all other known DNA sequences. The purpose is to see if it is related to anything already known. By related, we mean that there is a statistically significant similarity to one or more preexisting DNA sequences. The definition of what statistically significant means in the context of sequence comparisons is not universally accepted despite decades of work in this area. Obviously, at one extreme, one may find that a new sequence is virtually identical to a preexisting one. Unless the two sequences derive from very similar but not identical organisms, the finding of near identity means true identity with the differences due to sequencing errors, or a new member of a gene family, or an example of proteins very strongly conserved in evolution, like the histones. At the other extreme, a new sequence may match nothing to within whatever local standards of minimal homology are considered operative.

Most often, however, when a new DNA sequence is compared with the current database of more than 1000 Mb of DNA, some slight or significant sequence homology is found. For coding sequences, it is usually much more powerful to search after translation of DNA to protein. This translation loses very little functional information; it gains considerable statistical power because the noise caused by the degeneracy of the genetic code is blanked out. Thus consider, for example, two arginine codons like AGG and CGA in a corresponding place on two sequences; the only evidence for similarity is the G in position 2, which has roughly one chance in four of occurring randomly. In contrast, posing an arginine opposite an arginine at the same place in a protein sequence has, very crudely, only one chance in 20 of occurring randomly. (In reality the statistical differences are not this great because amino acids with six possible codons, like arginine, also tend to occur much more often than average.)
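The arginine example can be checked directly with a few lines of code. The sketch below builds the standard genetic code and compares the two codons at the DNA level and after translation; the helper names `translate` and `identity_count` are assumptions made for the illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA string read in frame from its first base."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity_count(a, b):
    """Number of aligned positions at which two strings agree."""
    return sum(x == y for x, y in zip(a, b))


codon_1, codon_2 = "AGG", "CGA"
print(identity_count(codon_1, codon_2))                        # DNA level: 1 of 3 bases agree
print(translate(codon_1), translate(codon_2))                  # both encode R (arginine)
print(identity_count(translate(codon_1), translate(codon_2)))  # protein level: 1 of 1 agree

# Arginine is one of the amino acids with six codons.
arginine_codons = sorted(c for c, a in CODON_TABLE.items() if a == "R")
print(arginine_codons)
```

At the DNA level only the middle base matches, whereas the translated comparison records a full identity, which is the gain in signal the paragraph describes.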

 

 

A statistically significant match between a new sequence and some preexisting sequence implies some or all of the following possibilities: similar function, similar structure, or evolutionary relatedness. It is not easy to sort out these different effects. However, one encouraging feature of such global sequence searches is that their effectiveness appears to be increasing markedly and rapidly as the database grows. Ten years ago Russell Doolittle noted that a new protein sequence had a 25% chance of matching something else in the databases. Currently the odds are considerably better than this. From the first bacterial sequencing projects described earlier in this chapter, between 54% and 78% of the ORFs found showed hints of homology in structure or function with something else in the database. With the S. cerevisiae ORFs on chromosome III, 42% gave hints of homologous structure or function of which 14% were deemed really quite strong. In the case of C. elegans, where more extensive data are available, 45% of the ORFs were reported to be relatable to existing databases. It seems likely that in a few years it will be the odd new sequence that does not immediately match something known. While it is too early to be sure how rapidly this goal will be achieved, there is room for considerable optimism at present.

 

 

 

 

 

 

 

 

 

 

 

METHODS FOR COMPARING SEQUENCES

 

 

 

Entire books have been written about the relative merits of different approaches to aligning sequences and testing their relatedness (Waterman, 1995; Gribskov and Devereux, 1991). The topic is actually quite complex because the nonrandom nature of natural DNA sequences greatly confounds attempts to construct simple statistical tests of relatedness. Here our goal will be to present the basic notions of how sequences are compared and what these comparisons mean. Sequences are strings of symbols. Any two strings can be compared by direct alignment and the use of scoring criteria for similarity. For two strings of length n and m there are 2(n + m - 1) possible continuous alignments, by which we mean that no gaps are allowed in either string. Of course many of these alignments are fairly trivial and uninteresting because the strings will barely overlap. The moment gaps are allowed on one or both strings, the number of alignments rises in a combinatorial manner to reach heights that can test the power of the fastest existing supercomputers if the problem is not handled intelligently.
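Enumerating the gap-free alignments is simple enough to sketch. For one fixed orientation, two strings of lengths n and m can be slid past each other in n + m - 1 ways that overlap by at least one position; counting the second string in both orientations would double this, which is one possible reading of the factor of 2 in the text (an assumption here). The function names and the identity-count criterion are likewise choices made for the example.

```python
def ungapped_alignments(s, t):
    """Yield (offset, overlap_of_s, overlap_of_t) for every gap-free
    placement of t against s that overlaps by at least one position.
    offset is the position in s that lines up with t[0]."""
    n, m = len(s), len(t)
    for offset in range(-(m - 1), n):
        start_s = max(offset, 0)
        start_t = max(-offset, 0)
        length = min(n - start_s, m - start_t)
        yield offset, s[start_s:start_s + length], t[start_t:start_t + length]


def best_ungapped(s, t):
    """Return the placement with the largest number of identities."""
    return max(ungapped_alignments(s, t),
               key=lambda a: sum(x == y for x, y in zip(a[1], a[2])))


s, t = "ACGTAC", "GTA"
placements = list(ungapped_alignments(s, t))
print(len(placements))      # n + m - 1 placements in one orientation
print(best_ungapped(s, t))  # the placement where GTA lines up with GTA
```

Most of the placements overlap by only one or two residues, which is why the text calls them trivial; the cost only becomes serious once gaps are allowed.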

 

 

An example of a very simple case in which two very similar DNA sequences are aligned is shown in Figure 15.15. In this case the alignment needed to maximize the apparent similarity between the two sequences is obvious. What is less obvious is the sort of score to give such an alignment. The simplest scoring scheme is black and white: grade all identities the same and all differences the same. However, this makes little sense from either a biological or a statistical vantage point. As far as biology is concerned, if, for example, we are looking at the functional relatedness of proteins coded for by these sequences, or if we are looking at possible evolutionary relationships between them, transversions (interchange of a purine and a pyrimidine) should be weighted as more consequential differences than transitions (interchange between two pyrimidines or two purines). This is because the rate of transversion mutations is much less than the rate of transitions, and the genetic code appears to have evolved so that effects of transversions on the resulting amino acids are more functionally disruptive than the effect of transitions. For example, many synonymous codons are related by a transition in their third position. But the pattern goes much deeper; for example, codons for different hydrophobic amino acids are also related mostly by transitions.
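The difference between the black-and-white scheme and one that distinguishes transitions from transversions can be sketched as follows. The numerical weights and the two toy sequences are hypothetical values picked for illustration; they are not taken from Figure 15.15.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify(x, y):
    """Classify an aligned base pair as identity, transition, or transversion."""
    if x == y:
        return "identity"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"    # purine <-> purine or pyrimidine <-> pyrimidine
    return "transversion"      # purine <-> pyrimidine


def score(seq1, seq2, weights):
    """Sum the weight of each aligned pair (no gaps)."""
    return sum(weights[classify(x, y)] for x, y in zip(seq1, seq2))


# Hypothetical weights: the first scheme treats all differences alike,
# the second penalizes transversions more heavily than transitions.
BLACK_AND_WHITE = {"identity": 1, "transition": 0, "transversion": 0}
WEIGHTED = {"identity": 1, "transition": 0, "transversion": -1}

seq1 = "ACGTACGT"
seq2 = "ATGTCCGA"
print([classify(x, y) for x, y in zip(seq1, seq2)])
print(score(seq1, seq2, BLACK_AND_WHITE))
print(score(seq1, seq2, WEIGHTED))
```

The two schemes rank this pair differently only because two of its three differences are transversions; a pair differing by transitions alone would score the same under both.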

 

 

 

Figure 15.15 A simple example of a comparison between two putatively related nucleic acid sequences and two ways in which their relatedness could be scored, S = transition and V = transversion.

To take statistical factors into account in estimating the significance of a mismatch or a match purely at the DNA level, we have to consider the relative frequency of each residue in the strings being compared. For example, sequences rich in A's will show large numbers of A's matched with A's, just by chance. In order to take this into account, and to add issues like transitions and transversions, one needs to employ a scoring matrix. This is illustrated in Figure 15.16a. The 4 × 4 scoring matrix for nucleic acid comparisons allows for any possible weight to be assigned to a particular set of bases at an alignment position. Generally, the same scoring matrix is used for every alignment position, although there is no reason why one should have to do this, nor is there any reason why it is desirable except for simplicity. Think ahead to the alignment of protein sequences where residues on exterior loops can be quite variable without perturbing the overall structure. Therefore, if one had some way of knowing a priori that a residue was in a loop as opposed to a helix or sheet, one could adjust the weighting factors accordingly. This example illustrates the complex interplay between sequence and structure information that really has to occur in very robust comparison algorithms.
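The effect of base composition on chance matches can be made concrete with a short calculation. The sketch below assumes the simplest possible null model, in which the bases at each position are drawn independently according to each sequence's overall composition; the function names and toy sequences are assumptions for the example.

```python
from collections import Counter


def base_frequencies(seq):
    """Fraction of A, C, G, and T in a DNA string."""
    counts = Counter(seq.upper())
    total = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / total for b in "ACGT"}


def chance_identity(seq1, seq2):
    """Expected fraction of identical aligned positions if the two
    sequences were unrelated and positions were independent."""
    f1, f2 = base_frequencies(seq1), base_frequencies(seq2)
    return sum(f1[b] * f2[b] for b in "ACGT")


balanced = "ACGTACGTACGTACGT"
a_rich = "AAAAAAACAAAAAAAC"
print(chance_identity(balanced, balanced))  # one in four for equal base usage
print(chance_identity(a_rich, a_rich))      # far higher for A-rich strings
```

A scoring matrix that ignores this background would credit two unrelated A-rich sequences with a large number of apparently meaningful identities.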

 

 

 

 

The simplest possible DNA scoring matrix, corresponding to the rule used in Figure 15.15, is just a set of identities with no correction for overall base composition (Fig. 15.16b). The general case would consist of a set of elements a_ij that are all different, except that the matrix should be symmetrical; each a_ij = a_ji since we have no way, in comparing just two sequences, to favor one sequence over another. The elements a_ij must incorporate all of our biological and statistical prejudices. When protein sequences are compared, the scoring matrices can become more complicated. First of all, the matrix must be 20 × 20 instead of 4 × 4. It can be as simple as an identity matrix, just as in the case for nucleic acids, but a much more accurate picture will incorporate statistical information about the relative frequency of amino acids. This immediately raises one serious problem: Does one use the amino acid composition of the two proteins in question to construct the scoring matrix, or does one use the amino acid compositions of all known proteins, or all known proteins from the particular species involved? One can elaborate the problem even further by asking whether the nonrandomness of dipeptide frequencies should be considered in making statistical evaluations for the scoring matrix. There are no simple answers to these questions.
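A symmetric scoring matrix is conveniently stored by giving one value per unordered pair of residues and filling in both a_ij and a_ji from it. The sketch below does this for the identity matrix; the function names are assumptions for the example, and the same code would accept a 20-letter amino acid alphabet.

```python
def symmetric_matrix(alphabet, pair_scores):
    """Expand scores given once per unordered pair into a full lookup
    table a[i][j] that satisfies a[i][j] == a[j][i]."""
    a = {i: {} for i in alphabet}
    for (i, j), value in pair_scores.items():
        a[i][j] = value
        a[j][i] = value
    missing = [(i, j) for i in alphabet for j in alphabet if j not in a[i]]
    if missing:
        raise ValueError(f"no score given for {missing}")
    return a


def alignment_score(seq1, seq2, a):
    """Sum the matrix element for each aligned pair (no gaps)."""
    return sum(a[x][y] for x, y in zip(seq1, seq2))


# The simplest case: 1 for an identity, 0 for any difference.
DNA = "ACGT"
IDENTITY = symmetric_matrix(
    DNA, {(i, j): int(i == j) for k, i in enumerate(DNA) for j in DNA[k:]})

print(IDENTITY["A"]["C"], IDENTITY["C"]["A"])     # symmetric by construction
print(alignment_score("ACGT", "ACGA", IDENTITY))  # three identities
```

Storing one value per unordered pair makes it impossible to enter an asymmetric matrix by mistake, which matches the requirement that neither sequence be favored.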

 

 

 

 

 

 

 

Most commonly, with protein sequence comparisons, one incorporates information about amino acid physical properties into the values of the elements of the scoring matrix. Thus, for example, interchanges among ile, leu, and val, or ser and thr, among proteins known to be related in structure and function are very commonly seen and are presumably mostly innocuous. Examples of two real scoring matrices are shown in Figure 15.17.

Figure 15.16 Comparison matrices between two nucleic acid sequences. (a) A general matrix. (b) The simplest possible matrix.

Figure 15.17 Examples of actual scoring matrices for protein sequences that take into account the similar properties of certain types of amino acids. (a) The BLOSUM62 matrix used by BLAST (Henikoff and Henikoff, 1993). (b) The structural (STTR) matrix of Shpaer et al. (1996).

The values of these elements obviously vary over a wide range. However, despite their different origins, the two matrices are fairly similar.

There is still one additional complication that must be dealt with. This is especially serious when one wishes to estimate the evolutionary relatedness of two proteins or DNAs. Here a yardstick that is often used as a time scale for evolutionary divergence is the probable average number of mutations needed to convert one sequence into the other. Such comparisons among very similar proteins or nucleic acids are relatively simple. Differences seen are presumably real, and similarities are also presumed real. However, when more distant sequences are compared, an apparent similarity has an increasing chance of just being a statistical event, or a reversion. For example, as shown in Figure 15.18, two matching A's could be a true identity (no mutations) or a reversion (a minimum of two mutations). The more distantly related the two sequences, the more the latter possibility has to be weighted. Ways of doing this for simple identity comparison matrices were developed several decades ago by Jukes and Cantor, and later elaborated considerably to take into account statistical effects and similarities in residue properties. The kind of matrix needed in a very simple case is shown in Figure 15.19. It adjusts the relative weights of comparisons as a function of the average extent of differences between the two sequences. The problem of choosing an ideal comparison matrix, which deals with all of these interrelated issues, is still not a simple one.
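The distance-dependent matrix can be sketched numerically. The code below uses the off-diagonal element a = (1/4)(1 - e^(-4d/3)) given with Figure 15.19, where d is the average number of mutations per site, and assumes that each diagonal element is 1 - 3a so that every row sums to 1. That diagonal is the standard Jukes-Cantor form and is an assumption about the figure, which is not reproduced here; the function name is also invented for the example.

```python
import math


def jukes_cantor_matrix(d):
    """4 x 4 matrix for sequences separated by d mutations per site.
    Off-diagonal elements are a; diagonal elements are 1 - 3a (assumed)."""
    a = 0.25 * (1.0 - math.exp(-4.0 * d / 3.0))
    bases = "ACGT"
    return {i: {j: (1.0 - 3.0 * a) if i == j else a for j in bases}
            for i in bases}


for d in (0.0, 0.1, 1.0, 100.0):
    m = jukes_cantor_matrix(d)
    print(d, round(m["A"]["A"], 4), round(m["A"]["C"], 4))
```

At d = 0 the matrix reduces to the simple identity matrix, and for very large d every element approaches 1/4, so a match between widely diverged sequences carries no more weight than a match between random strings.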

Figure 15.18 Difficulties in sequence comparisons when the goal is to estimate the probable number of mutations that have occurred to derive one sequence from another (or both from a common ancestor).

Figure 15.19 A simple scoring matrix that takes into account the average differences between two sequences and allows for the possibility of revertants, where a = (1/4)(1 - e^(-4d/3)). The parameter d is a measure of the true evolutionary distance between two sequences being compared. It is the average number of mutations per site that separate one sequence from the other. In the limit d → 0 the matrix becomes equal to the right-hand panel of Figure 15.16. In the limit d → ∞ all of the elements of the matrix become equal to 1/4. This means that the sequences have diverged so much that one is essentially comparing two random strings.

Once a comparison matrix is chosen, it can be used to evaluate the relative similarity seen in all possible alignments between two strings. When gaps (caused by a putative insertion or deletion, or a pure statistical artifact) are allowed, the problem of actually enumerating and testing all possible comparisons becomes computationally extremely demanding. Figure 15.20 shows a very simple example. The issue is how to test the likelihood that the postulated gap results in a statistically significant improvement in the alignment score of the two sequences. Obviously there must be a statistical penalty attached to the use of such a gap, since it greatly increases the number of possible comparisons, and thus the chance of finding, at random, a comparison with a score better than some arbitrary value.
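The trade-off between the score gained by opening a gap and the penalty charged for it can be illustrated with a deliberately restricted sketch: a single gap block is tried at every position of the shorter sequence, and the penalized score is compared with the ungapped score. The function names, the toy sequences, and the penalty value of 3 are hypothetical; choosing a statistically justified penalty is exactly the difficulty the text describes.

```python
def identity_score(a, b):
    """Count identical aligned residues; a gap symbol never counts."""
    return sum(x == y and x != "-" for x, y in zip(a, b))


def best_single_gap(long_seq, short_seq, gap_penalty):
    """Try one gap block at every position of the shorter sequence and
    return (penalized_score, gapped_short_seq) for the best placement."""
    gap_len = len(long_seq) - len(short_seq)
    best = None
    for pos in range(len(short_seq) + 1):
        gapped = short_seq[:pos] + "-" * gap_len + short_seq[pos:]
        score = identity_score(long_seq, gapped) - gap_penalty
        if best is None or score > best[0]:
            best = (score, gapped)
    return best


long_seq = "ACGTTTGACCA"
short_seq = "ACGTGACCA"   # long_seq with two T's removed
ungapped = identity_score(long_seq, short_seq)
gapped_score, gapped = best_single_gap(long_seq, short_seq, gap_penalty=3)
print(ungapped)               # score with no gap allowed
print(gapped_score, gapped)   # score after paying the gap penalty
```

Here the gap is worth introducing because the penalized score still exceeds the ungapped one; with a sufficiently large penalty the ungapped alignment would be preferred instead.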

 

 

 

 

 

 

 

 

 

 

From a practical point of view, it is impossible to test all possible gap numbers and locations. One way to deal with this problem is to compare two sequences through smaller windows, sets of successive residues, rather than globally (Fig. 15.21). With two strings of length n and m, and a window of length L, there are (n - L + 1)(m - L + 1) possible comparisons to be done. This is not a major task for strings the sizes of typical genes. For each choice of window, two substrings of length L are compared, without gaps. The score for this comparison is calculated as the sum over the matrix elements a_ij for each of the L residue pairs. To provide a visual overview of the comparison, it is usually convenient to plot all scores above some threshold value as a dot in a rectangular field formed by writing one sequence along the horizontal axis and the other along the vertical axis. Any point in the field corresponds to an alignment of L residues positioned at particular residue positions in the two sequences. This kind of dot matrix plot is shown schematically in Figure 15.22, and a real example of a sequence comparison at the DNA level for two closely related viruses, SV40 and polyoma, is given in Figure 15.23. Any regions with

Figure 15.20 A simple case of two sequences potentially related by an insertion or a deletion.

Figure 15.21 Window selection on a single sequence assists in comparisons.

Figure 15.22 An example of the comparison of two proteins or DNAs using windows on each, evaluated with a scoring matrix. Shown as dots are all comparisons that score above a selected threshold.
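The windowed comparison and the resulting dot matrix plot can be sketched in a few lines. The code scores every pair of length-L windows with a scoring matrix and prints a text version of the plot, with one sequence along the horizontal axis and the other along the vertical axis. The function names, the use of a simple identity matrix, and the toy sequences are assumptions made for the illustration.

```python
def window_scores(s, t, L, a):
    """Score every pair of length-L windows (no gaps) with matrix a.
    Keys are (i, j): window start in s and window start in t."""
    scores = {}
    for i in range(len(s) - L + 1):
        for j in range(len(t) - L + 1):
            scores[(i, j)] = sum(a[x][y]
                                 for x, y in zip(s[i:i + L], t[j:j + L]))
    return scores


def dot_plot(s, t, L, a, threshold):
    """Text dot plot: s runs horizontally, t vertically; '*' marks
    window pairs whose score is above the threshold."""
    scores = window_scores(s, t, L, a)
    rows = []
    for j in range(len(t) - L + 1):
        rows.append("".join("*" if scores[(i, j)] > threshold else "."
                            for i in range(len(s) - L + 1)))
    return "\n".join(rows)


IDENTITY = {x: {y: int(x == y) for y in "ACGT"} for x in "ACGT"}

s, t, L = "ACGTACGT", "CGTA", 4
scores = window_scores(s, t, L, IDENTITY)
print(len(scores))   # (n - L + 1)(m - L + 1) window pairs
print(dot_plot(s, t, L, IDENTITY, threshold=3))
```

With longer sequences, related regions show up as runs of dots along a diagonal, and an insertion or deletion appears as a shift from one diagonal to a neighboring one.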
