
536 RESULTS AND IMPLICATIONS OF LARGE-SCALE DNA SEQUENCING

We know that such networks can be trained (i.e., adjusted) to respond to signals or stimuli and to integrate the input from many different sources or sensors. Here the basic properties of neural nets will be illustrated, and then examples of how they have been applied to the analysis of DNA sequences will be shown.

 

The basic element in a neural net is a node, as shown in Figure 15.3a. This node receives input from one or more sensors, and it delivers output to one or more other nodes or a detector. The behavior of nodes is quantized. The signal input from each sensor is continuously scanned. It is recorded as positive if it is above some threshold; otherwise, it is scored as negative (Fig. 15.3b). An input can be stimulatory or inhibitory. A node receiving a stimulatory input will send out a signal of the same sign. A node receiving an inhibitory input will send out a signal of the opposite sign. By analogy, a nerve cell receiving a stimulatory impulse fires, while one receiving an inhibitory impulse does not fire.

 

Neural nets are collections of nodes wired in particular ways. They are generalizations of simple logical circuits. The variables in a neural net are the signal thresholds and the nature of the response of the nodes. We will illustrate this with three cases of increasing complexity. Consider the simple two-input node shown in Figure 15.3. Suppose that it operates under the following rules: If both sensors are positive, the node sends a positive output. Otherwise, it sends a negative output. This node is operating as the logical “and” function. It is behaving like a neuron that needs two simultaneous positive inputs in order to fire.

 

As a second case, consider the same node in Figure 15.3, but now imagine that the node sends a positive output if either input or both inputs are positive. The only way the node sends a negative output is if both sensors are reading negative. This node is acting like the logical “and/or” function. It simulates a nerve cell that needs only one positive stimulus to fire.

 

The third case we will consider is a node that sends a positive signal if either input sensor is positive but not if both sensor inputs are positive. It is difficult to represent this behavior by a single node with simple binary logical properties. Instead, we can represent the behavior by a slightly more complex network with three nodes, as shown in Figure 15.4. Here the two sensors input their signals directly to two of the nodes. Each of these nodes views one input as stimulatory and the other input as inhibitory. Thus each node will fire if and only if its stimulatory input is positive and its inhibitory input is negative. The two nodes feed stimulatory inputs into the third node. This node will be directed to fire if it receives a positive input from either one of the two nodes that precede it. One way to view the structure of the simple neural network shown in Figure 15.4 is that there is a hidden layer of nodes between the sensors and the final output node. In this particular case the hidden layer has a very simple structure; yet it is already capable of executing a complicated logical operation.

Figure 15.3 The simplest possible neural net. This net can perform the logical operations “and” and “and/or.” (a) Coupling of two inputs to a single output. (b) Effect of sensor threshold on signal value.

NEURAL NET ANALYSIS OF DNA SEQUENCES

Figure 15.4 A more complex neural net, which can perform the logical operation “either but not both.”
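The three cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: signals are coded as +1 (positive) and -1 (negative), and the names `node`, `and_node`, `or_node`, and `xor_net` are hypothetical helpers, not anything taken from the text or from GRAIL.

```python
def node(inputs, signs, needed):
    """Quantized node: return +1 (fire) or -1 (do not fire).

    inputs -- sensor readings, each +1 or -1
    signs  -- +1 for a stimulatory connection, -1 for an inhibitory one
    needed -- how many effective positive inputs are required to fire
    """
    # An inhibitory connection passes on the opposite sign of its input.
    effective = [x * s for x, s in zip(inputs, signs)]
    positives = sum(1 for e in effective if e > 0)
    return 1 if positives >= needed else -1


def and_node(a, b):
    # Case 1: fires only when both sensors are positive.
    return node([a, b], [1, 1], needed=2)


def or_node(a, b):
    # Case 2: fires when either sensor or both are positive.
    return node([a, b], [1, 1], needed=1)


def xor_net(a, b):
    # Case 3: two hidden nodes, each with one stimulatory and one
    # inhibitory input, feed a final node that fires if either one fires.
    hidden_1 = node([a, b], [1, -1], needed=2)  # fires only for a=+1, b=-1
    hidden_2 = node([a, b], [-1, 1], needed=2)  # fires only for a=-1, b=+1
    return node([hidden_1, hidden_2], [1, 1], needed=1)


if __name__ == "__main__":
    for a in (-1, 1):
        for b in (-1, 1):
            print(a, b, and_node(a, b), or_node(a, b), xor_net(a, b))
```

The only things that differ between the three cases are the connection signs and the firing requirement, which is the sense in which thresholds and node responses are the variables of the net.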

 

 

 

 

 

 

To use a neural net, one constructs a fairly general set of nodes and connections with one or more hidden layers, as shown in Figure 15.5. This is trained on sequences with known properties. The net is cycled through the training set of data, and weighting factors for each of the connections are adjusted to try to achieve the highest positive output scores for desired input characteristics and the lowest ones for undesired characteristics. A neural net could be used to examine DNA sequence directly, but this would take a very complex net, and the resulting training period would be computationally very intensive. Instead, what works quite satisfactorily is to use as sensor inputs not individual bases but the seven sequence analysis algorithms described in the previous section. These sensors are each allowed to scan the DNA sequence over 10-base intervals. The net result of each scan is computed in a 99-base window. This is the length of sequence that is scanned and input into the net. Then the sequence is frameshifted by one base, and the analysis is repeated. The result is scaled, and then the output of each sensor is fed into the neural net.
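The windowed scan described above can be sketched as follows. This is a minimal illustration, not GRAIL's code: the 99-base window and one-base shift come from the text, while `sliding_windows`, `scan`, and the G+C fraction used as a stand-in sensor are assumptions chosen only to keep the example self-contained.

```python
def sliding_windows(sequence, window=99, step=1):
    """Yield (start, segment) for every full window along the sequence."""
    for start in range(0, len(sequence) - window + 1, step):
        yield start, sequence[start:start + window]


def gc_fraction(segment):
    """Stand-in sensor (an assumption): fraction of G and C bases."""
    return sum(segment.count(base) for base in "GC") / len(segment)


def scan(sequence, sensor=gc_fraction, window=99):
    """Score every window, shifting by one base each time."""
    return [(start, sensor(segment))
            for start, segment in sliding_windows(sequence, window)]


if __name__ == "__main__":
    demo = "ACGT" * 30  # a 120-base toy sequence
    for start, score in scan(demo)[:3]:
        print(start, round(score, 3))
```

Any function that maps a window of sequence to a scaled score could be passed as `sensor`; a real gene finder would supply one such function per sensor.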

The actual net structure used is shown in Figure 15.6. It consists of the 7 input sensors, 14 hidden nodes in a first layer, 5 hidden nodes in a second layer, and a single output node.
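To make the 7-14-5-1 layout concrete, here is a hedged sketch of a single forward pass through a net of that shape. The layer sizes come from the text; everything else is an assumption for illustration: the weights are random rather than trained, a sigmoid replaces the quantized node response, and `make_weights` and `forward` are hypothetical names.

```python
import math
import random

LAYER_SIZES = [7, 14, 5, 1]  # sensors, hidden layer 1, hidden layer 2, output


def make_weights(sizes, seed=0):
    """Random connection weights; a trained net would have fitted values."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
             for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]


def forward(sensor_values, weights):
    """Propagate the seven scaled sensor values through the net."""
    activations = list(sensor_values)
    for layer in weights:
        # Each node sums its weighted inputs and squashes the result.
        activations = [
            1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, activations))))
            for row in layer
        ]
    return activations[0]  # the single output node


if __name__ == "__main__":
    weights = make_weights(LAYER_SIZES)
    print(forward([0.5] * 7, weights))
```

A net of this shape has 7 x 14 + 14 x 5 + 5 x 1 = 173 connections, each of which carries one adjustable weight during training.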

Edward Uberbacher and Robert Mural at Oak Ridge National Laboratory trained the neural net shown in Figure 15.6 on 240 kb of human DNA sequence data, adjusting thresholds, signs, and weighting until the performance of the net appeared to be optimum (1991). The result is a sequence analysis program called GRAIL. The detailed pattern of input into GRAIL from each of seven sensors for a particular DNA sequence is shown in Figure 15.7. Each plot shows the relative probability that the given 99-base window is an exon with coding potential. It is apparent that some sensors, like coding six-tuple in-frame preferences, have much more powerful discrimination than others. However, when the input from all seven sensors is combined by the neural net, the result is a truly striking pattern of prediction of clear exons and introns. This is shown in Figure 15.8. GRAIL works on the genes for many different types of human proteins that were not included in the original training set. A number of examples are shown in Figure 15.9. Some caution is needed, however, because not all human genomic sequence is handled well by GRAIL. For example, the human T-cell receptor gene cluster is not readily amenable to GRAIL analysis. The program also has difficulty in finding very small exons, which is not surprising in view of the 99-base window used.

Figure 15.5 A still more complex neural net, with several hidden layers.

Figure 15.6 The actual neural net used in GRAIL analysis of DNA sequences. Adapted from Uberbacher and Mural (1991).

Neural net approaches similar to GRAIL appear to have great promise in other complex problems in biological and chemical analysis. These include prediction of protein secondary and tertiary structure, correction of DNA sequencing errors, and analysis of mass spectrometric chemical fragmentation data. Note, however, that neural nets are only one of a number of different types of algorithmic approaches applicable to such problems, and the jury is still out on which will eventually turn out to be the most effective for particular classes of analysis. However, for the past half-decade, GRAIL has proved to be an extremely useful tool for most applications to human DNA sequence analysis, and it is readily accessible via computer networks to all interested users.

Since the introduction of GRAIL, improvements have been made on the original algorithms to produce GRAIL 2. Other approaches to gene finding have been proposed, including a linear discriminant method (Solovyev et al., 1994) and, most recently, a quadratic discriminant method (Zhang, 1997). These methods take into account additional factors like the compatibility of the reading frames of adjacent exons and consensus sequences for the intron segment that forms a branched structure as an intermediate step in splicing. When tested on a large number of sequences, the three algorithms all perform well, but they are still far from perfect (Table 15.4).

Figure 15.7 Performance of each of the seven sensors of the net shown in Figure 15.6 on one particular DNA sequence. The vertical axis indicates the probability that each sliding segment of DNA sequence is a coding exon. Taken from Uberbacher and Mural (1991).

Figure 15.8 The output of the neural net, based on its optimal evaluation of the sensor results shown in Figure 15.7. Adapted from Uberbacher and Mural (1991).

Figure 15.9 Examples of the performance of the neural net of Figure 15.6 on a set of different genomic DNA sequences. Adapted from Uberbacher and Mural (1991).

TABLE 15.4 Success of Exon Prediction: Exons Found by Three Different Schemes

Scheme                             Sensitivity TP/(TP + FN)    Specificity TP/(TP + FP)
GRAIL 2                            0.53                        0.60
Linear discriminant analysis       0.73                        0.75
Quadratic discriminant analysis    0.78                        0.86

Source: Adapted from Hong (1997)

Note: True positives (TP) are true positives correctly predicted. False positives (FP) are true negatives predicted to be positive. False negatives (FN) are true positives predicted to be negative. Sensitivity is the fraction of true positives found. Specificity is the fraction of positives found that is true.
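The two measures defined in the note can be written out directly. In this small sketch the function names are illustrative and the counts in the usage lines are hypothetical, chosen only to show the arithmetic; they are not data from Table 15.4.

```python
def sensitivity(tp, fn):
    """Fraction of true exons that were found: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predicted exons that are real: TP / (TP + FP).

    This follows the table's definition; elsewhere the same quantity
    is often called precision or positive predictive value.
    """
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical counts: 78 exons found, 22 missed, 13 false predictions.
    print(round(sensitivity(78, 22), 2))
    print(round(specificity(78, 13), 2))
```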

 

 


SURVEY OF PAST LARGE-SCALE DNA SEQUENCING PROJECTS

 

 

 

 

 

 

Most early large-scale DNA sequencing projects involved a pre-selected gene of particular interest. An example is the gene for the enzyme HPRT (57 kb). These projects are milestones in the history of DNA sequencing, but it is difficult to extrapolate the results of such projects to the situation that will apply in most genomic sequencing efforts. In such efforts, which will form the overwhelming bulk of the human genome project, one will be faced with large expanses of relatively uncharted DNA. While the regions selected may contain a few mapped genes and many cDNA fragments, much of the rationale for looking at the particular region will have to come a posteriori, after the sequence has been completed. To get some impression of the difficulties in assembling the sequence and making a first pass at its interpretation, it is useful to examine the first few efforts at sequencing segments of DNA without a strong functional pre-selection. Here we summarize results from seven projects: the complete sequences of H. influenzae and M. genitalium; partial sequences of E. coli, S. cerevisiae, C. elegans, and D. melanogaster; and several human cosmid DNAs. These sequence data and all other genomic sequence data currently reside in a set of publicly accessible databases. A description of these valuable resources, and how they can be accessed, is provided in the Appendix. A summary of all complete genome sequences publicly available in February 1997 is given in Table 15.5.

 

The complete DNA sequences of Haemophilus influenzae and Mycoplasma genitalium both correspond to relatively small bacterial genomes. As expected, they are very rich in genes, and they are especially rich in genes whose function can be surmised by comparison to other sequences in the available genome databases. M. genitalium has a 580,070 bp genome with 470 ORFs. These occur on average one per 1235 bp. The average ORF is 1040 bp. Overall the genome is 80% coding. Seventy-three percent of the ORFs correspond to previously known genes.

 

H. influenzae has a genome size of 1,830,137 bp. This contains 1743 coding regions, an average of one every 1042 bp. The average gene is 900 bp long. Overall, 85% of the genome is coding. Currently 1007 (58%) of the coding regions can be assigned a functional role. Of the remainder, 385 are new genes that show no significant matches to the databases, while the others match known sequences of unknown function. At an average direct cost of $0.48 per base, this project is probably representative of other large-scale efforts using similar technology.

 

TABLE 15.5 Completed Genome Sequences

Species              DNA Molecules    kb DNA    Largest DNA (kb)    Open Reading Frames    Genes for RNA
M. genitalium        1                580       580                 470                    38
M. pneumoniae        1                816       816                 677                    39
M. jannaschii        3                1740      1665                1738                   45
H. influenzae        1                1830      1830                1743                   76
Synechocystis sp.    1                3573      3573                3168                   ?
E. coli              1                4639      4639                4200                   ?
S. cerevisiae        16               12,068    1532                5885                   455

Both the H. influenzae and M. genitalium sequencing projects were carried out at a single location totally by automated fluorescent DNA sequencing. In contrast, one of the efforts to sequence major sections of the E. coli genome, directed by Fred Blattner in Madison, Wisconsin, started as basically low-technology, manual DNA sequencing, employing a large number of relatively unskilled workers and concentrating on relatively simple protocols. The initial result was a 91.4 kb contig. The region contained 82 predicted ORFs, or roughly one per kb. The ORFs constituted about 84% of the total sequence. If we scale the properties of this region to the entire 4.7 Mb E. coli genome, we can predict that

 

\[ 4.7\ \text{Mb} \times \frac{82\ \text{ORFs}}{0.0914\ \text{Mb}} \approx 4200\ \text{genes} \]

 

This is larger than estimates of the number of genes in E. coli based on the appearance of protein spots in two-dimensional electrophoretic separations. Past sampling of E. coli regions has revealed fairly uniform gene density except for areas around the terminus of replication. Hence the preliminary sequencing results on E. coli suggest that a significant number of new and interesting genes remain to be discovered. A more recent report of additional E. coli sequences is quite consistent with the earlier observations: within a 338,500-base contig, 319 ORFs were found, or one per 1060 bases. Of these, 46% are potentially new genes. The complete E. coli DNA sequence has just become available, and it contains 4300 genes in 4.54 Mb, quite consistent with predictions based on partial sequencing results.
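The scaling argument used here, and again later for yeast and for the Huntington’s disease region, is a single proportion: genes in the sampled region divided by region size, multiplied by genome size. A minimal sketch follows; the function name `scale_gene_count` is hypothetical, and the inputs are the figures quoted in the text, which reports its own results in rounded form.

```python
def scale_gene_count(genome_mb, region_mb, region_genes):
    """Extrapolate a regional gene count to a whole genome.

    Assumes gene density is uniform, which the text itself notes is
    only roughly true for bacteria and not true for mammalian genomes.
    """
    return genome_mb * region_genes / region_mb


if __name__ == "__main__":
    # (genome size in Mb, sequenced region in Mb, genes or ORFs found)
    cases = {
        "E. coli": (4.7, 0.0914, 82),
        "S. cerevisiae": (12.1, 0.315, 176),
        "Huntington's disease region": (2.5, 0.225, 13),
    }
    for name, (genome, region, genes) in cases.items():
        print(name, round(scale_gene_count(genome, region, genes)))
```

Because the estimate is linear in the regional density, any error in the sampled density carries over to the genome-wide figure in the same proportion.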

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

The early major accomplishments in S. cerevisiae sequencing derive from a very different organizational model than the work on E. coli. The approach was still mostly very low technology. It was mostly the result of a dispersed European effort among more than 30 different laboratories, coordinated through a common data collection center in France. The complete DNA sequence of one of the smallest S. cerevisiae chromosomes, number III, was the first one determined. At 315 kb it represented the longest continuous stretch of DNA sequence known at the time. The chromosome III sequence was originally reported to contain 182 ORFs. After this was corrected by a more rigorous examination, carried out by Christian Sander in Heidelberg, 176 ORFs remained. These occur at roughly one per 2 kb, or half of the density seen in the three bacteria discussed above. The ORFs cover 70% of the DNA sequence; this is not too much lower than the total density of coding sequence in E. coli. We can make a rough estimate of the number of genes in S. cerevisiae by scaling these results to the 12.1 Mb total size of the yeast genome. The result is

 

\[ 12.1\ \text{Mb} \times \frac{176\ \text{ORFs}}{0.315\ \text{Mb}} \approx 6760\ \text{ORFs} \]

 

The total number of genes in S. cerevisiae will be slightly less than the number of ORFs because occasional genes in yeast consist of more than one exon. In addition, for both bacteria and yeast, we have to add in genes for rRNAs, tRNAs, and other nontranslated species (Table 15.5).

 

The complete DNA sequences of several other S. cerevisiae chromosomes reported were consistent with the results for chromosome III. For example, chromosome VIII has 562,698 bp. It contains 269 ORFs, or 1 per 2 kb. Of these, 124 (46%) correspond to genes of known function. Chromosome VI has 270 kb. It contains 129 ORFs, again about 1 per 2 kb. Of these, 76 (59%) correspond to genes with previously known function. The total sequence of S. cerevisiae is now completed. First estimates place the number of ORFs at 5885; doubtless this will change with further analysis.

 


 

In the case of C. elegans DNA sequencing, we are dealing not with continuous genomic sequence but with the sequence of selected cosmids. The effort, directed by John Sulston of Cambridge, England, and Robert Waterston of St. Louis, Missouri, also uses state-of-the-art fluorescent DNA sequencing technology with a great deal of automation. The strategy is mostly shotgun, with directed sequencing relegated mostly to closure of gaps between contigs. The first 21.14 Mb of C. elegans DNA sequence reported contained a total of 3980 genes, or 1 per 4.8 kb on the autosomes and 1 per 6.6 kb on the X chromosome. Only 46% of these matched sequences already in the DNA databases. About 28% of the total DNA is coding; 50% of C. elegans is genes, including both exons and introns. This is a sharp drop from the density of coding sequences in simple organisms. The total number of genes in the nematode genome is estimated to be 13,000 ± 500. This is a number close to most contemporary expectations for the sizes of the genomes of typical multicellular, highly differentiated organisms like the nematode.

 

The remaining two DNA sequencing projects that we will discuss illustrate some of the frustrations in dealing with the genomes of higher organisms. The complete DNA sequence of a 338,234 bp region of D. melanogaster, containing the bithorax complex, important in development, has been reported by groups at Caltech and Berkeley. This region is less than 2% coding. It contains only six genes. The final sequencing project we will discuss is a relatively early effort that involved several cosmids from the tip of the short arm of human chromosome 4, a region known to contain the gene responsible for Huntington’s disease. The region is band 4p16.3. It is estimated to contain a total of 2.5 Mb of DNA. A 225-kb subset of this region was sequenced. This yielded 13 transcripts in 225 kb, or one per 18 kb on average. Another estimate of gene density could be obtained by determining the number of HTF islands in the region. This will be a minimum estimate for the number of genes, since perhaps only half to two-thirds of all genes have HTF islands nearby. In fact, in the 225 kb region, one HTF island was found on average per 28 kb. By comparison, when HTF islands were mapped to a different section of chromosome 4, a 460 kb region near the marker D4S111, the frequency of occurrence of these gene-associated sequences was one per 30 kb. All of these estimates of gene density are remarkably consistent. If we scale these expected gene densities to the entire Huntington’s disease region, we obtain an estimate of

 

\[ 2.5\ \text{Mb} \times \frac{13\ \text{genes}}{0.225\ \text{Mb}} \approx 143\ \text{genes} \]

 

This makes it clear why finding the gene for Huntington’s disease was not an easy task.

 

The first DNA sequencing effort in band 4p16.3 was carried out in Bethesda, Maryland, under the direction of Craig Venter. It involved a total of 58 kb of DNA sequence in three cosmids. Three genes were found; each has an HTF island. The average gene density in this relatively small region is one per 19 kb, which is quite consistent with expectations. Less than 10% of the region is coding sequence. The number of Alu repeats in the region is 62, or roughly one per kb. This is comparable to what has been seen in the DNA sequence of two other gene-rich, G+C-rich regions. In the human growth hormone region 0.7 Alu repeats were found per kb; in the HPRT region 0.9 Alu repeats were found per kb. In stark contrast, in the globin region, which is G+C poor, there are only 0.1 Alu repeats per kb. These results illustrate the mosaic nature of the human genome rather dramatically.

 


Unlike simple genomes, with relatively uniform DNA compositions, mammalian genomes have mosaic compositions, which are reflected in chromosome banding patterns. Scaling of a regional gene density to estimate the total number of genes must take into account regional characteristics. Long before large-scale DNA sequencing or genome mapping was underway, Giorgio Bernardi developed a method of fractionating genomes into regions with various G+C content. This was done by equilibrium ultracentrifugation in density gradients (Chapter 5). The resulting fractions were called isochores. Altogether, Bernardi obtained evidence for five distinct human DNA classes; these could be divided into three easily separated and manipulated fractions. Their properties are summarized below:

 

 

 

 

 

 

 

 

CLASS     GENE DENSITY    GENOME FRACTION    LOCATION
L1,L2     1               62%                Dark bands
H1,H2     2               31%                Light bands
H3        16              7%                 Telomeric light bands

Several aspects of these results deserve comment. Gene density means the relative number of genes, based on cDNA library comparisons. The genome fraction is estimated from the total amount of material in the density-separated fractions. The telomeric light bands have very special properties that we have alluded to before. Figure 15.10 illustrates the actual locations seen when DNAs from Bernardi’s fraction H3 are mapped by FISH. The preferential location of these sequences on just a small subset of human chromosomal regions is really remarkable.

 

 

 

The Huntington’s disease region is known to be a gene-rich light band, so we can pretty much exclude the L1 and L2 classes from consideration. In the Huntington’s region, there is one gene on average per 18 kb. If this region is an H3 region, then we can estimate the number of genes in the human genome as

H3        11,700 genes
H1,H2      6500 genes
L1,L2      6500 genes

for a total of 24,700 genes. This estimate is less than twice the number of genes in C. elegans, which seems far too low. If we assume that the Huntington’s disease region is an H1,H2 region, then the estimate of the number of genes in the human genome becomes

H3        92,000 genes
H1,H2     51,100 genes
L1,L2     51,100 genes

for a total of 194,200 genes.

 

This is a depressingly large number, much larger than previous estimates. This example illustrates how difficult it is to know from very fragmentary data what the real target size of the human genome project is. Perhaps the Huntington’s disease region is somewhere between the properties of the H3 and the H1 plus H2 fractions, and the gene number somewhere mercifully between the two rather upsetting extremes we have computed. More recent estimates of the number of human genes range from 65,000 to 150,000, which is not too different from the average of our original estimates.
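The two isochore-based estimates can be reproduced approximately with a short calculation. The sketch below rests on an assumption that is not stated in this section: a total human genome size of 3,000,000 kb. With that value the totals come out near, but not exactly equal to, the rounded figures quoted above. The names `ISOCHORES` and `estimate_genes` are illustrative.

```python
# Relative gene density and genome fraction for each class, from the
# isochore summary table.
ISOCHORES = {
    "L1,L2": (1, 0.62),
    "H1,H2": (2, 0.31),
    "H3": (16, 0.07),
}


def estimate_genes(reference_class, kb_per_gene=18.0, genome_kb=3_000_000):
    """Estimate genes per isochore class.

    reference_class -- the class the Huntington's region is assumed to be
    kb_per_gene     -- observed spacing in that region (one gene per 18 kb)
    genome_kb       -- total genome size in kb (an assumption, see above)
    """
    reference_density = ISOCHORES[reference_class][0]
    counts = {}
    for name, (relative_density, fraction) in ISOCHORES.items():
        # Scale the observed density by the ratio of relative densities.
        genes_per_kb = (relative_density / reference_density) / kb_per_gene
        counts[name] = fraction * genome_kb * genes_per_kb
    return counts


if __name__ == "__main__":
    for reference in ("H3", "H1,H2"):
        counts = estimate_genes(reference)
        print(reference, {k: round(v) for k, v in counts.items()},
              round(sum(counts.values())))
```

The roughly eightfold gap between the two totals comes entirely from the choice of reference class, since H3 is taken to be eight times as gene dense as H1,H2.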

 

 


Figure 15.10 Distribution of extremely G+C-rich sequences in the human genome. Solid bars show relative hybridization of the H3 dark fraction. Open bars show rRNA-encoding DNA. Taken from Saccone et al. (1992).

 

FINDING ERRORS IN DNA SEQUENCES

 

Quite a few different kinds of errors contaminate data in existing DNA sequence banks. As the amount of data escalates, it will become increasingly important to audit these data continuously. Suspect data need to be flagged before they propagate and affect the results of many sequence comparisons or experimental scientific efforts. For example, an error in one of the earliest complete DNA sequences, the plasmid pBR322, produced a spurious stop codon in one of the proteins coded for by this plasmid. This confounded many researchers who were using this plasmid as a cloning and expression system, since a protein band with an unexplainable size was frequently seen.

Some common errors in DNA sequence data are quite easy to find and correct; others are almost impossible. A major class of error is incorporation of a totally inappropriate sequence. This can come about if, as is not uncommon, DNA samples are mixed up in the laboratory prior to sequencing. It can arise from cloning artifacts. A clone may have
