File: Genomics- The Science and Technology Behind the Human Genome Project. Charles R. Cantor, Cassandra L / genomics11-15 / 15.pdf


536 RESULTS AND IMPLICATIONS OF LARGE-SCALE DNA SEQUENCING

NEURAL NET ANALYSIS OF DNA SEQUENCES

We know that such networks can be trained (i.e., adjusted) to respond to signals or stimuli and to integrate the input from many different sources or sensors. Here the basic properties of neural nets will be illustrated, and then examples of how they have been applied to the analysis of DNA sequences will be shown.

The basic element in a neural net is a node, as shown in Figure 15.3a. This node receives input from one or more sensors, and it delivers output to one or more other nodes or a detector. The behavior of nodes is quantized. The signal input from each sensor is continuously scanned. It is recorded as positive if it is above some threshold; otherwise, it is scored as negative (Fig. 15.3b). An input can be stimulatory or inhibitory. A node receiving a stimulatory input will send out a signal of the same sign. A node receiving an inhibitory input will send out a signal of the opposite sign. By analogy, a nerve cell receiving a stimulatory impulse fires, while one receiving an inhibitory impulse does not fire.

Neural nets are collections of nodes wired in particular ways. They are generalizations of simple logical circuits. The variables in a neural net are the signal thresholds and the nature of the response of the nodes. We will illustrate this with three cases of increasing complexity. Consider the simple two-input node shown in Figure 15.3. Suppose that it operates under the following rules: If both sensors are positive, the node sends a positive output. Otherwise, it sends a negative output. This node is operating as the logical “and” function. It is behaving like a neuron that needs two simultaneous positive inputs in order to fire.

As a second case, consider the same node in Figure 15.3, but now imagine that the node sends a positive output if either input or both inputs are positive. The only way the node sends a negative output is if both sensors are reading negative. This node is acting like the logical “and/or” function. It simulates a nerve cell that needs only one positive stimulus to fire.

Figure 15.3 The simplest possible neural net. This net can perform the logical operations “and” and “and/or.” (a) Coupling of two inputs to a single output. (b) Effect of sensor threshold on signal value.

The third case we will consider is a node that sends a positive signal if either input sensor is positive but not if both sensor inputs are positive. It is difficult to represent this behavior by a single node with simple binary logical properties. Instead, we can represent the behavior by a slightly more complex network with three nodes, as shown in Figure 15.4. Here the two sensors input their signal directly to two of the nodes. Each of these nodes views one input as stimulatory and the other input as inhibitory. Thus each node will fire if and only if it receives one positive and one negative signal. The two nodes feed stimulatory inputs into the third node. This node will be directed to fire if it receives a positive input from either one of the two nodes that precede it. One way to view the structure of the simple neural network shown in Figure 15.4 is that there is a hidden layer of nodes between the sensors and the final output node. In this particular case the hidden layer has a very simple structure; yet it is already capable of executing a complicated logical operation.

Figure 15.4 A more complex neural net which can perform the logical operation “either but not both.”
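The three cases above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: signals are coded as +1 (positive) and -1 (negative), each node sums its sign-adjusted inputs and compares the sum with a threshold, and the names `node`, `logical_and`, `logical_or`, and `either_but_not_both` are hypothetical, not taken from the text.

```python
def node(inputs, signs, threshold):
    """Return +1 (fire) if the sign-adjusted input sum reaches the threshold, else -1.

    inputs: +1/-1 readings from sensors or upstream nodes
    signs: +1 for a stimulatory connection, -1 for an inhibitory one
    threshold: assumed firing threshold on the summed signal
    """
    total = sum(s * x for s, x in zip(signs, inputs))
    return 1 if total >= threshold else -1


def logical_and(a, b):
    # Both stimulatory inputs must be positive (sum == 2) for the node to fire.
    return node((a, b), (1, 1), threshold=2)


def logical_or(a, b):
    # One positive stimulatory input is enough (sum >= 0).
    return node((a, b), (1, 1), threshold=0)


def either_but_not_both(a, b):
    # Hidden layer: each node treats one sensor as stimulatory, the other as inhibitory.
    h1 = node((a, b), (1, -1), threshold=2)   # fires only for a = +1, b = -1
    h2 = node((a, b), (-1, 1), threshold=2)   # fires only for a = -1, b = +1
    # Output node fires if either hidden node fires.
    return logical_or(h1, h2)


if __name__ == "__main__":
    for a in (1, -1):
        for b in (1, -1):
            print(a, b, logical_and(a, b), logical_or(a, b), either_but_not_both(a, b))
```

The single-node cases differ only in their threshold, while the third case needs the two hidden nodes, which mirrors the hidden layer described for Figure 15.4.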
To use a neural net, one constructs a fairly general set of nodes and connections with one or more hidden layers, as shown in Figure 15.5. This is trained on sequences with known properties. The net is cycled through the training set of data, and weighting factors for each of the connections are adjusted to try to achieve the highest positive output scores for desired input characteristics and the lowest ones for undesired characteristics. A neural net could be used to examine DNA sequence directly, but this would take a very complex net, and the resulting training period would be computationally very intensive. Instead, what works quite satisfactorily is to use as sensor inputs, not individual bases, but the seven sequence analysis algorithms described in the previous section. These sensors are each allowed to scan the DNA sequence over 10-base intervals. The net result of each scan is computed in a 99-base window. This is the length of sequence that is scanned and input into the net. Then the sequence is frameshifted by one base, and the analysis is repeated. The result is scaled, and then each sensor is fed into the neural net.

The actual net structure used is shown in Figure 15.6. It consists of the 7 input sensors, 14 hidden nodes in a first layer, 5 hidden nodes in a second layer, and a single output node.
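The scanning scheme and the 7-14-5-1 layout can be sketched as follows. This is a rough illustration, not the GRAIL implementation: the seven `toy_sensors` are placeholder base-composition scores standing in for the real sequence analysis algorithms, the sigmoid activation is an assumption, and the weights are random and untrained. Only the window length (99) and the layer sizes come from the text.

```python
import math
import random

WINDOW = 99               # window length given in the text
LAYERS = [7, 14, 5, 1]    # sensors, two hidden layers, one output node


def toy_sensors(window):
    """Return 7 placeholder scores in [0, 1]; these are NOT the real GRAIL sensors."""
    n = len(window)

    def frac(chars):
        return sum(window.count(c) for c in chars) / n

    cpg = window.count("CG") / (n - 1)
    return [frac("A"), frac("C"), frac("G"), frac("T"), frac("GC"), frac("AG"), cpg]


def init_weights(layers, seed=0):
    """Random (untrained) weights; each node gets one extra weight used as a bias."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(layers, layers[1:])]


def forward(weights, inputs):
    """Propagate the sensor values through the layers; assumed sigmoid activation."""
    activ = list(inputs)
    for layer in weights:
        activ = [1 / (1 + math.exp(-(w[-1] + sum(wi * a for wi, a in zip(w, activ)))))
                 for w in layer]
    return activ[0]


def scan(sequence, weights):
    """Score every 99-base window, shifting by one base each time."""
    return [forward(weights, toy_sensors(sequence[i:i + WINDOW]))
            for i in range(len(sequence) - WINDOW + 1)]


if __name__ == "__main__":
    scores = scan("ACGT" * 50, init_weights(LAYERS))
    print(len(scores), "window scores")
```

Training would consist of adjusting the numbers produced by `init_weights` against sequences with known properties; that step is omitted here.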

Edward Uberbacher and Robert Mural at Oak Ridge National Laboratory (1991) trained the neural net shown in Figure 15.6 on 240 kb of human DNA sequence data, adjusting thresholds, signs, and weighting until the performance of the net appeared to be optimum. The result is a sequence analysis program called GRAIL. The detailed pattern of input into GRAIL from each of seven sensors for a particular DNA sequence is shown in Figure 15.7. Each plot shows the relative probability that the given 99-base window is an exon with coding potential. It is apparent that some sensors, like coding six-tuple in-frame preferences, have much more powerful discrimination than others. However, when the input from all seven sensors is combined by the neural net, the result is a truly striking pattern of prediction of clear exons and introns. This is shown in Figure 15.8. GRAIL works on genes for many different types of human proteins that were not included in the original training set. A number of examples are shown in Figure 15.9. Some caution is needed, however, because not all human genomic sequence is handled well by GRAIL. For example, the human T-cell receptor gene cluster is not readily amenable to GRAIL analysis. The program also has difficulty in finding very small exons, which is not surprising in view of the 99-base window used.

Figure 15.5 A still more complex neural net, with several hidden layers.

Figure 15.6 The actual neural net used in GRAIL analysis of DNA sequences. Adapted from Uberbacher and Mural (1991).

Neural net approaches similar to GRAIL appear to have great promise in other complex problems in biological and chemical analysis. These include prediction of protein secondary and tertiary structure, correction of DNA sequencing errors, and analysis of mass spectrometric chemical fragmentation data. Note, however, that neural nets are only one of a number of different types of algorithmic approaches applicable to such problems, and the jury is still out on which will eventually turn out to be the most effective for particular classes of analysis. However, for the past half-decade, GRAIL has proved to be an extremely useful tool for most applications to human DNA sequence analysis, and it is readily accessible via computer networks to all interested users.

Since the introduction of GRAIL, improvements have been made on the original algorithms to produce GRAIL 2. Other approaches to gene finding have been proposed, including a linear discriminant method (Solovyev et al., 1994) and, most recently, a quadratic discriminant method (Zhang, 1997). These methods take into account additional factors like the compatibility of the reading frames of adjacent exons and consensus sequences for the intron segment that forms a branched structure as an intermediate step in splicing. When tested on a large number of sequences, the three algorithms all perform well, but they are still far from perfect (Table 15.4).

Figure 15.7 Performance of each of the seven sensors of the net shown in Figure 15.6 on one particular DNA sequence. The vertical axis indicates the probability that each sliding segment of DNA sequence is a coding exon. Taken from Uberbacher and Mural (1991).

Figure 15.8 The output of the neural net, based on its optimal evaluation of the sensor results shown in Figure 15.7. Adapted from Uberbacher and Mural (1991).

Figure 15.9 Examples of the performance of the neural net of Figure 15.6 on a set of different genomic DNA sequences. Adapted from Uberbacher and Mural (1991).

TABLE 15.4 Success of Exon Prediction: Exons Found by Three Different Schemes

Scheme                            Sensitivity TP/(TP + FN)   Specificity TP/(TP + FP)
GRAIL 2                           0.53                       0.60
Linear discriminant analysis      0.73                       0.75
Quadratic discriminant analysis   0.78                       0.86

Source: Adapted from Hong (1997)

Note: True positives (TP) are true positives correctly predicted. False positives (FP) are true negatives predicted to be positive. False negatives (FN) are true positives predicted to be negative. Sensitivity is the fraction of true positives found. Specificity is the fraction of positives found that is true.
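The two measures defined in the note to Table 15.4 can be computed as in the sketch below. The function names and the example counts are hypothetical and are not data from the table. Note that TP/(TP + FP), called specificity here, is the quantity often called precision or positive predictive value elsewhere.

```python
def sensitivity(tp, fn):
    """Fraction of true positives found: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tp, fp):
    """Fraction of predicted positives that are true, as defined in Table 15.4:
    TP / (TP + FP)."""
    return tp / (tp + fp)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    tp, fn, fp = 60, 20, 15
    print(sensitivity(tp, fn), specificity(tp, fp))
```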

SURVEY OF PAST LARGE-SCALE DNA SEQUENCING PROJECTS
Most early large-scale DNA sequencing projects involved a pre-selected gene of particular interest. An example is the gene for the enzyme HPRT (57 kb). These projects are milestones in the history of DNA sequencing, but it is difficult to extrapolate the results of such projects to the situation that will apply in most genomic sequencing efforts. In such efforts, which will form the overwhelming bulk of the human genome project, one will be faced with large expanses of relatively uncharted DNA. While the regions selected may contain a few mapped genes, and many cDNA fragments, much of the rationale for looking at the particular region will have to come a posteriori, after the sequence has been completed. To try to get some impression of the difficulties in assembling the sequence, and making a first pass at its interpretation, it is useful to examine the first few efforts at sequencing segments of DNA without a strong functional pre-selection. Here we summarize results from seven projects: the complete sequences of H. influenzae and M. genitalium, partial sequences of E. coli, S. cerevisiae, C. elegans, and D. melanogaster, and several human cosmid DNAs. These sequence data and all other genomic sequence data currently reside in a set of publicly accessible databases. A description of these valuable resources, and how they can be accessed, is provided in the Appendix. A summary of all complete genome sequences publicly available in February 1997 is given in Table 15.5.
The complete DNA sequences of Haemophilus influenzae and Mycoplasma genitalium both correspond to relatively small bacterial genomes. As expected, they are very rich in genes, and they are especially rich in genes whose function can be surmised by comparison to other sequences in the available genome databases. M. genitalium has a 580,070 bp genome with 470 ORFs. These occur on average one per 1235 bp. The average ORF is 1040 bp. Overall the genome is 80% coding. Seventy-three percent of the ORFs correspond to previously known genes.
H. influenzae has a genome size of 1,830,137 bp. This contains 1743 coding regions, an average of one every 1042 bp. The average gene is 900 bp long. Overall, 85% of the genome is coding. Currently 1007 (58%) of the coding regions can be assigned a functional role. Of the remainder, 385 are new genes that show no significant matches to the databases, while the others match known sequences of unknown function. At an average direct cost of $0.48 per base this project is probably representative of other large-scale efforts using similar technology.
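The density figures quoted for the two bacterial genomes follow from simple ratios, sketched below. The function names are hypothetical, and the inputs are the genome sizes, ORF counts, and average ORF lengths given in the text. Because those inputs are themselves rounded, the computed values may differ slightly from the rounded figures quoted above.

```python
def bp_per_orf(genome_bp, n_orfs):
    """Average spacing of ORFs along the genome, in bp per ORF."""
    return genome_bp / n_orfs


def coding_fraction(genome_bp, n_orfs, mean_orf_bp):
    """Approximate fraction of the genome covered by ORFs."""
    return n_orfs * mean_orf_bp / genome_bp


if __name__ == "__main__":
    genomes = {
        "M. genitalium": (580_070, 470, 1040),
        "H. influenzae": (1_830_137, 1743, 900),
    }
    for name, (size, orfs, mean_len) in genomes.items():
        print(name, round(bp_per_orf(size, orfs)), round(coding_fraction(size, orfs, mean_len), 2))
```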
TABLE 15.5 Completed Genome Sequences

Species             DNA molecules   Total DNA (kb)   Largest DNA (kb)   Open reading frames   Genes for RNA
M. genitalium       1               580              580                470                   38
M. pneumoniae       1               816              816                677                   39
M. jannaschii       3               1740             1665               1738                  45
H. influenzae       1               1830             1830               1743                  76
Synechocystis sp.   1               3573             3573               3168                  ?
E. coli             1               4639             4639               4200                  ?
S. cerevisiae       16              12,068           1532               5885                  455

Both the H. influenzae and M. genitalium sequencing projects were carried out at a single location totally by automated fluorescent DNA sequencing. In contrast, one of the efforts to sequence major sections of the E. coli genome, directed by Fred Blattner in Madison, Wisconsin, started as basically low-technology, manual DNA sequencing, employing a large number of relatively unskilled workers, and concentrated on relatively simple protocols. The initial result was a 91.4 kb contig. The region contained 82 predicted ORFs, or roughly one per kb. The ORFs constituted about 84% of the total sequence. If we scale the properties of this region to the entire 4.7 Mb E. coli genome, we can predict that

\[ \frac{4.7\ \text{Mb}}{0.0914\ \text{Mb}} \times 82\ \text{ORFs} \approx 4200\ \text{genes} \]

This is larger than estimates of the number of genes in E. coli based on the appearance of protein spots in two-dimensional electrophoretic separations. Past sampling of E. coli regions has revealed fairly uniform gene density except for areas around the terminus of replication. Hence the preliminary sequencing results on E. coli suggest that a significant number of new and interesting genes remain to be discovered. A more recent report of additional E. coli sequences is quite consistent with the earlier observations. Within a 338,500 base contig, 319 ORFs were found, or one per 1060 bases. Of these, 46% are potentially new genes. The complete E. coli DNA sequence has just become available, and it contains 4300 genes in 4.64 Mb, quite consistent with predictions based on partial sequencing results.
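The scaling used for the E. coli estimate above is a simple proportional extrapolation, which can be sketched as follows. The function name is hypothetical; the inputs are the 4.7 Mb genome size, the 91.4 kb contig, and its 82 ORFs from the text. The approach assumes the sampled region has the same gene density as the rest of the genome.

```python
def extrapolate_gene_count(genome_mb, region_mb, region_orfs):
    """Scale the ORF count of a sequenced region up to the whole genome,
    assuming uniform gene density."""
    return genome_mb / region_mb * region_orfs


if __name__ == "__main__":
    ecoli = extrapolate_gene_count(4.7, 0.0914, 82)
    print(round(ecoli))  # about 4217, which the text rounds to 4200
```

The same function applies to any region and genome size, but the result is only as good as the uniform-density assumption.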

The early major accomplishments in S. cerevisiae sequencing derive from a very different organizational model than the work on E. coli. The approach was still mostly very low technology. It was mostly the result of a dispersed European effort among more than 30 different laboratories, coordinated through a common data collection center in France. The complete DNA sequence of one of the smallest S. cerevisiae chromosomes, number III, was the first one determined. At 315 kb it represented the longest continuous stretch of DNA sequence known at the time. The chromosome III sequence was originally reported to contain 182 ORFs. After this was corrected by a more rigorous examination, carried out by Christian Sander in Heidelberg, 176 ORFs remained. These occur at roughly one per 2 kb, or half of the density seen in the three bacteria discussed above. The ORFs cover 70% of the DNA sequence; this is not too much lower than the total density of coding sequence in E. coli. We can make a rough estimate of the number of genes in S. cerevisiae by scaling these results to the 12.1 Mb total size of the yeast genome. The result is

\[ \frac{12.1\ \text{Mb}}{0.315\ \text{Mb}} \times 176\ \text{ORFs} \approx 6760\ \text{ORFs} \]

The total number of genes in S. cerevisiae will be slightly less than the number of ORFs because occasional genes in yeast consist of more than one exon. In addition, for both bacteria and yeast, we have to add in genes for rRNAs, tRNAs, and other nontranslated species (Table 15.5).

The complete DNA sequences of several other S. cerevisiae chromosomes reported were consistent with the results for chromosome III. For example, chromosome VIII has 562,698 bp. It contains 269 ORFs, or 1 per 2 kb. Of these, 124 (46%) correspond to genes of known function. Chromosome VI has 270 kb. It contains 129 ORFs, again about 1 per 2 kb. Of these, 76 (59%) correspond to genes with previously known function. The total sequence of S. cerevisiae is now completed. First estimates place the number of ORFs at 5885; doubtless this will change with further analysis.


In the case of C. elegans DNA sequencing, we are dealing not with continuous genomic sequence but with the sequence of selected cosmids. The effort, directed by John Sulston of Cambridge, England, and Robert Waterston of St. Louis, Missouri, also uses state-of-the-art fluorescent DNA sequencing technology with a great deal of automation. The strategy is mostly shotgun, with directed sequencing relegated mostly to closure of gaps between contigs. The first 21.14 Mb of C. elegans DNA sequence reported contained a total of 3980 genes, or 1 per 4.8 kb on the autosomes and 1 per 6.6 kb on the X chromosome. Only 46% of these matched sequences already in the DNA databases. About 28% of the total DNA is coding; 50% lies within C. elegans genes, including both exons and introns. This is a sharp drop from the density of coding sequences in simple organisms. The total number of genes in the nematode genome is estimated at 13,000 ± 500. This number [...] most contemporary expectations for the sizes of the genomes of typical multicellular, highly differentiated organisms like the nematode.

The remaining two DNA sequencing projects that we will discuss illustrate some of the frustrations in dealing with the genomes of higher organisms. The complete DNA sequence of a 338,234 bp region of D. melanogaster, containing the bithorax complex, important in development, has been reported by groups at Caltech and Berkeley. This region is less than 2% coding. It contains only six genes. The final sequencing project we will discuss is a relatively early effort that involved several cosmids from the tip of the short arm of human chromosome 4, a region known to contain the gene responsible for Huntington’s disease. The region is band 4p16.3. It is estimated to contain a total of 2.5 Mb of DNA. A 225-kb subset of this region was sequenced. This yielded 13 transcripts in 225 kb, or one per 18 kb on average. Another estimate of gene density could be obtained by determining the number of HTF islands in the region. This will be a minimum estimate for the number of genes, since perhaps only half to two-thirds of all genes have HTF islands nearby. In fact, in the 225 kb region, one HTF island was found on average per 28 kb. By comparison, when HTF islands were mapped to a different section of chromosome 4, a 460 kb region near the marker D4S111, the frequency of occurrence of these gene-associated sequences was one per 30 kb. All of these estimates of gene density are remarkably consistent. If we scale these expected gene densities to the entire Huntington’s disease region, we obtain an estimate of

\[ \frac{2.5\ \text{Mb}}{0.225\ \text{Mb}} \times 13\ \text{genes} \approx 143\ \text{genes} \]

This makes it clear why finding the gene for Huntington’s disease was not an easy task.

The first DNA sequencing effort in band 4p16.3 was carried out in Bethesda, Maryland, under the direction of Craig Venter. It involved a total of 58 kb of DNA sequence in three cosmids. Three genes were found, and each has an HTF island. The average gene density in this relatively small region is one per 19 kb, which is quite consistent with expectations. Less than 10% of the region is coding sequence. The number of Alu repeats in the region is 62, or roughly one per kb. This is comparable to what has been seen in the DNA sequence of two other gene-rich, G+C-rich regions. In the human growth hormone region 0.7 Alu’s were found per kb; in the HPRT region 0.9 Alu’s were found per kb. In stark contrast, in the globin region, which is G+C poor, there are only 0.1 Alu’s per kb. These results illustrate the mosaic nature of the human genome rather dramatically.

Unlike simple genomes, with relatively uniform DNA compositions, mammalian genomes have mosaic compositions that are reflected in chromosome banding patterns. Scaling of a regional gene density to estimate the total number of genes must take into account regional characteristics. Long before large-scale DNA sequencing or genome mapping was underway, Giorgio Bernardi developed a method of fractionating genomes into regions with various G+C content. This was done by equilibrium ultracentrifugation in density gradients (Chapter 5). The resulting fractions were called isochores. Altogether, Bernardi obtained evidence for five distinct human DNA classes; these could be divided into three easily separated and manipulated fractions. Their properties are summarized below:
		CLASS    GENE DENSITY   GENOME FRACTION   LOCATION
		L1,L2    1              62%               Dark bands
		H1,H2    2              31%               Light bands
		H3       16             7%                Telomeric light bands
Several aspects of these results deserve comment. Gene density means the relative number of genes, based on cDNA library comparisons. The genome fraction is estimated from the total amount of material in the density-separated fractions. The telomeric light bands have very special properties that we have alluded to before. Figure 15.10 illustrates the actual locations seen when DNAs from Bernardi’s fraction H3 are mapped by FISH. The preferential location of these sequences on just a small subset of human chromosomal regions is really remarkable.
The Huntington’s disease region is known to be a gene-rich light band, so we can pretty much exclude the L1 and L2 classes from consideration. In the Huntington’s region, there is one gene on average per 18 kb. If this region is an H3 region, then we can estimate the number of genes in the human genome as

		H3       11,700 genes
		H1,H2    6500 genes
		L1,L2    6500 genes

for a total of 24,700 genes. This estimate is less than twice the number of genes in C. elegans, which seems far too low. If we assume that the Huntington’s disease region is an H1,H2 region, then the estimate of the number of genes in the human genome becomes

		H3       92,000 genes
		H1,H2    51,100 genes
		L1,L2    51,100 genes

for a total of 194,200 genes. This is a depressingly large number, much larger than previous estimates. This example illustrates how difficult it is to know from very fragmentary data what the real target size of the human genome project is. Perhaps the Huntington’s disease region is somewhere between the properties of the H3 and the H1 plus H2 fractions, and the gene number somewhere mercifully between the two rather upsetting extremes we have computed. More recent estimates of the number of human genes range from 65,000 to 150,000, which is not too different from the average of our original estimates.
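The two estimates above can be approximately reproduced by weighting each isochore class by its genome fraction and relative gene density. The sketch below assumes a haploid genome of 3 x 10^9 bp, which is not stated in this passage, and uses hypothetical names throughout. With that assumption the totals come out close to, but not identical with, the rounded figures in the text.

```python
GENOME_BP = 3.0e9  # assumed human genome size; not given in this passage

# class name -> (relative gene density, genome fraction), from the isochore table
ISOCHORES = {
    "L1,L2": (1, 0.62),
    "H1,H2": (2, 0.31),
    "H3": (16, 0.07),
}


def estimate_genes(reference_class, kb_per_gene=18.0):
    """Estimate gene counts per class, assuming the region with one gene per
    `kb_per_gene` kb is typical of `reference_class`."""
    ref_density = ISOCHORES[reference_class][0]
    counts = {}
    for name, (density, fraction) in ISOCHORES.items():
        # Gene density of this class, scaled relative to the reference class.
        genes_per_bp = (density / ref_density) / (kb_per_gene * 1000)
        counts[name] = fraction * GENOME_BP * genes_per_bp
    return counts


if __name__ == "__main__":
    for ref in ("H3", "H1,H2"):
        counts = estimate_genes(ref)
        print(ref, {k: round(v) for k, v in counts.items()}, round(sum(counts.values())))
```

Changing only the class assigned to the reference region shifts the total by a factor of eight, which is the sensitivity the passage is pointing out.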


Figure 15.10 Distribution of extremely G+C-rich sequences in the human genome. Solid bars show relative hybridization of the H3 dark fraction. Open bars show rRNA-encoding DNA. Taken from Saccone et al. (1992).
FINDING ERRORS IN DNA SEQUENCES
Quite a few different kinds of errors contaminate data in existing DNA sequence banks. As the amount of data escalates, it will become increasingly important to audit these data continuously. Suspect data need to be flagged before they propagate and affect the results of many sequence comparisons or experimental scientific efforts. For example, an error in one of the earliest complete DNA sequences, the plasmid pBR322, produced a spurious stop codon in one of the proteins coded for by this plasmid. This confounded many researchers who were using this plasmid as a cloning and expression system, since a protein band with an unexplainable size was frequently seen.

Some common errors in DNA sequence data are quite easy to find and correct; others are almost impossible. A major class of error is incorporation of a totally inappropriate sequence. This can come about if, as is not uncommon, DNA samples are mixed up in the laboratory prior to sequencing. It can arise from cloning artifacts. A clone may have
