536 RESULTS AND IMPLICATIONS OF LARGE-SCALE DNA SEQUENCING
We know that such networks can be trained (i.e., adjusted) to respond to signals or stimuli and to integrate the input from many different sources or sensors. Here the basic properties of neural nets will be illustrated, and then examples of how they have been applied to the analysis of DNA sequences will be shown.
The basic element in a neural net is a node, as shown in Figure 15.3a. This node receives input from one or more sensors, and it delivers output to one or more other nodes or a detector. The behavior of nodes is quantized. The signal input from each sensor is continuously scanned. It is recorded as positive if it is above some threshold; otherwise, it is scored as negative (Fig. 15.3b). An input can be stimulatory or inhibitory. A node receiving a stimulatory input will send out a signal of the same sign. A node receiving an inhibitory signal will send out a signal of the opposite sign. By analogy, a nerve cell receiving a stimulatory impulse fires, while one receiving an inhibitory impulse does not fire.
Neural nets are collections of nodes wired in particular ways. They are generalizations of simple logical circuits. The variables in a neural net are the signal thresholds and the nature of the response of the nodes. We will illustrate this with three cases of increasing complexity. Consider the simple two-input node shown in Figure 15.3. Suppose that it operates under the following rules: If both sensors are positive, the node sends a positive output. Otherwise, it sends a negative output. This node is operating as the logical “and” function. It is behaving like a neuron that needs two simultaneous positive inputs in order to fire.
As a second case, consider the same node in Figure 15.3, but now imagine that the node sends a positive output if either input or both inputs are positive. The only way the node sends a negative output is if both sensors are reading negative. This node is acting like the logical “and/or” function. It simulates a nerve cell that needs only one positive stimulus to fire.
The third case we will consider is a node that sends a positive signal if either input sensor is positive but not if both sensor inputs are positive. It is difficult to represent this behavior by a single node with simple binary logical properties. Instead, we can represent the behavior by a slightly more complex network with three nodes, as shown in Figure 15.4. Here the two sensors input their signals directly to two of the nodes. Each of these nodes views one input as stimulatory and the other input as inhibitory. Thus each node will fire if and only if it receives one positive and one negative signal. The two nodes feed stimulatory inputs into the third node. This node will be directed to fire if it receives a positive input from either one of the two nodes that precede it. One way to view the structure of the simple neural network shown in Figure 15.4 is that there is a hidden layer of nodes between the sensors and the final output node. In this particular case the hidden layer has a very simple structure; yet it is already capable of executing a complicated logical operation.

Figure 15.3 The simplest possible neural net. This net can perform the logical operations “and” and “and/or.” (a) Coupling of two inputs to a single output. (b) Effect of sensor threshold on signal value.
Figure 15.4 A more complex neural net which can perform the logical operation “either but not both.”

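The three cases just described can be written out as a short Python sketch. It is only an illustration under stated assumptions: signals are coded as +1 (positive) and -1 (negative), and the helper names (`node`, `and_node`, `or_node`, `xor_net`) and the counting rule inside `node` are hypothetical choices made here, not definitions taken from the text or from any neural net software.

```python
def node(inputs, signs, threshold):
    """Return +1 (fire) or -1 (do not fire) for one quantized node.

    inputs    -- +1 / -1 signals arriving from sensors or other nodes
    signs     -- +1 for a stimulatory connection, -1 for an inhibitory one
    threshold -- how many sign-adjusted inputs must be positive to fire
    (This counting rule is an assumption made for the sketch.)
    """
    positives = sum(1 for x, s in zip(inputs, signs) if x * s > 0)
    return 1 if positives >= threshold else -1


def and_node(a, b):
    # Case 1: both sensors must be positive.
    return node((a, b), (1, 1), threshold=2)


def or_node(a, b):
    # Case 2: one positive sensor is enough ("and/or").
    return node((a, b), (1, 1), threshold=1)


def xor_net(a, b):
    # Case 3: hidden layer of two nodes, each treating one sensor as
    # stimulatory and the other as inhibitory, feeding one output node.
    h1 = node((a, b), (1, -1), threshold=2)   # fires only for a=+1, b=-1
    h2 = node((a, b), (-1, 1), threshold=2)   # fires only for a=-1, b=+1
    return node((h1, h2), (1, 1), threshold=1)


# Print the full truth table for the three cases.
for a in (1, -1):
    for b in (1, -1):
        print(a, b, and_node(a, b), or_node(a, b), xor_net(a, b))
```

The only difference between the first two cases is the threshold, while the third needs the extra layer; this is the sense in which thresholds and wiring are the variables of the net.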
To use a neural net, one constructs a fairly general set of nodes and connections with one or more hidden layers, as shown in Figure 15.5. This is trained on sequences with known properties. The net is cycled through the training set of data, and weighting factors for each of the connections are adjusted to try to achieve the highest positive output scores for desired input characteristics and the lowest ones for undesired characteristics. A neural net could be used to examine DNA sequence directly, but this would take a very complex net, and the resulting training period would be computationally very intensive. Instead, what works quite satisfactorily is to use as sensor inputs not individual bases but the seven sequence analysis algorithms described in the previous section. These sensors are each allowed to scan the DNA sequence over 10-base intervals. The net result of each scan is computed in a 99-base window. This is the length of sequence that is scanned and input into the net. Then the window is shifted by one base, and the analysis is repeated. The result is scaled, and then each sensor is fed into the neural net.
The actual net structure used is shown in Figure 15.6. It consists of the 7 input sensors, 14 hidden nodes in a first layer, 5 hidden nodes in a second layer, and a single output node.
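To make the data flow concrete, the sketch below slides a 99-base window along a sequence one base at a time and passes seven sensor values through a 7-14-5-1 layered net. Only the window length and the layer sizes come from the text. Everything else is an assumption made for illustration: the weights are random and untrained, the tanh response function is a stand-in, and `toy_sensors` is a hypothetical placeholder that returns simple base-composition numbers rather than the seven real sensor algorithms. It is not the GRAIL implementation.

```python
import math
import random

WINDOW = 99              # window length quoted in the text
LAYERS = (7, 14, 5, 1)   # sensors, two hidden layers, one output node


def windows(seq, size=WINDOW):
    """Yield (start, window) pairs, moving one base at a time."""
    for start in range(len(seq) - size + 1):
        yield start, seq[start:start + size]


def make_net(layers=LAYERS, seed=0):
    """Build random, untrained weights: one (weights, bias) pair per node."""
    rng = random.Random(seed)
    return [
        [([rng.uniform(-1, 1) for _ in range(n_in)], rng.uniform(-1, 1))
         for _ in range(n_out)]
        for n_in, n_out in zip(layers, layers[1:])
    ]


def forward(net, sensor_scores):
    """Propagate seven scaled sensor scores through the layers."""
    activations = list(sensor_scores)
    for layer in net:
        activations = [
            math.tanh(sum(w * a for w, a in zip(weights, activations)) + bias)
            for weights, bias in layer
        ]
    return activations[0]   # the single output node


def toy_sensors(window):
    """Hypothetical stand-in for the seven sensors (NOT the real ones)."""
    n = len(window)
    gc = (window.count("G") + window.count("C")) / n
    return [gc, 1 - gc] + [window.count(b) / n for b in "ACGT"] + [1.0]


net = make_net()
sequence = "ATGGCGTACGTTAGC" * 10          # 150 bases of made-up sequence
scores = [forward(net, toy_sensors(w)) for _, w in windows(sequence)]
print(len(scores), "window scores computed")
```

Training would consist of adjusting the weights in `net` until the output is high for windows known to be exons and low for the rest; that step is omitted here.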
Edward Uberbacher and Robert Mural at Oak Ridge National Laboratory trained the neural net shown in Figure 15.6 on 240 kb of human DNA sequence data, adjusting thresholds, signs, and weighting until the performance of the net appeared to be optimum (1991). The result is a sequence analysis program called GRAIL. The detailed pattern of input into GRAIL from each of seven sensors for a particular DNA sequence is shown in
Figure 15.7. Each plot shows the relative probability that the given 99-base window is an exon with coding potential. It is apparent that some sensors, like coding six-tuple in-frame preferences, have much more powerful discrimination than others. However, when the input from all seven sensors is combined by the neural net, the result is a truly striking pattern of prediction of clear exons and introns. This is shown in Figure 15.8. GRAIL works on many different types of human proteins that were not included in the original training set. A number of examples are shown in Figure 15.9. Some caution is needed, however, because not all human genomic sequence is handled well by GRAIL. For example, the human T-cell receptor gene cluster is not readily amenable to GRAIL analysis. The program also has difficulty in finding very small exons, which is not surprising in view of the 99-base window used.

Figure 15.5 A still more complex neural net, with several hidden layers.
Figure 15.6 The actual neural net used in GRAIL analysis of DNA sequences. Adapted from Uberbacher and Mural (1991).

Neural net approaches similar to GRAIL appear to have great promise in other complex problems in biological and chemical analysis. These include prediction of protein secondary and tertiary structure, correction of DNA sequencing errors, and analysis of mass spectrometric chemical fragmentation data. Note, however, that neural nets are only one of a number of different types of algorithmic approaches applicable to such problems, and the jury is still out on which will eventually turn out to be the most effective for particular classes of analysis. However, for the past half-decade, GRAIL has proved to be an extremely useful tool for most applications to human DNA sequence analysis, and it is readily accessible via computer networks to all interested users.
Since the introduction of GRAIL, improvements have been made on the original algorithms to produce GRAIL 2. Other approaches to gene finding have been proposed, including a linear discriminant method (Solovyev et al., 1994) and, most recently, a quadratic discriminant method (Zhang, 1997). These methods take into account additional factors like the compatibility of the reading frames of adjacent exons and consensus sequences for the intron segment that forms a branched structure as an intermediate step in splicing. When tested on a large number of sequences, the three algorithms all perform well, but they are still far from perfect (Table 15.4).

Figure 15.7 Performance of each of the seven sensors of the net shown in Figure 15.6 on one particular DNA sequence. The vertical axis indicates the probability that each sliding segment of DNA sequence is a coding exon. Taken from Uberbacher and Mural (1991).
Figure 15.8 The output of the neural net, based on its optimal evaluation of the sensor results shown in Figure 15.7. Adapted from Uberbacher and Mural (1991).
Figure 15.9 Examples of the performance of the neural net of Figure 15.6 on a set of different genomic DNA sequences. Adapted from Uberbacher and Mural (1991).

TABLE 15.4 Success of Exon Prediction: Exons Found by Three Different Schemes

Scheme                             Sensitivity      Specificity
                                   TP/(TP + FN)     TP/(TP + FP)
GRAIL 2                            0.53             0.60
Linear discriminant analysis       0.73             0.75
Quadratic discriminant analysis    0.78             0.86

Source: Adapted from Hong (1997).
Note: True positives (TP) are true positives correctly predicted. False positives (FP) are true negatives predicted to be positive. False negatives (FN) are true positives predicted to be negative. Sensitivity is the fraction of true positives found. Specificity is the fraction of positives found that is true.
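
The two measures defined in the note to Table 15.4 can be computed directly from counts of predicted exons. The sketch below uses the table's definitions; the function name `exon_prediction_scores` and the example counts are hypothetical and are not data from any of the cited studies.

```python
def exon_prediction_scores(tp, fp, fn):
    """Return (sensitivity, specificity) as defined in Table 15.4.

    tp -- true exons that were predicted
    fp -- predicted exons that are not real
    fn -- true exons that were missed
    """
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("need at least one true exon and one prediction")
    sensitivity = tp / (tp + fn)   # fraction of true exons found
    specificity = tp / (tp + fp)   # fraction of predictions that are true
    return sensitivity, specificity


# Made-up counts: 100 real exons, 78 of them found, plus 13 false calls.
sens, spec = exon_prediction_scores(tp=78, fp=13, fn=22)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Note that specificity as used here is what is often called positive predictive value elsewhere; true negatives do not enter either ratio.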

SURVEY OF PAST LARGE-SCALE DNA SEQUENCING PROJECTS

Most early large-scale DNA sequencing projects involved a pre-selected gene of particular interest. An example is the gene for the enzyme HPRT (57 kb). These projects are milestones in the history of DNA sequencing, but it is difficult to extrapolate the results of such projects to the situation that will apply in most genomic sequencing efforts. In such efforts, which will form the overwhelming bulk of the human genome project, one will be faced with large expanses of relatively uncharted DNA. While the regions selected may contain a few mapped genes and many cDNA fragments, much of the rationale for looking at the particular region will have to come a posteriori, after the sequence has been completed. To try to get some impression of the difficulties in assembling the sequence and making a first pass at its interpretation, it is useful to examine the first few efforts at sequencing segments of DNA without a strong functional pre-selection. Here we summarize results from seven projects: the complete sequences of H. influenzae and M. genitalium; partial sequences of E. coli, S. cerevisiae, C. elegans, and D. melanogaster; and several human cosmid DNAs. These sequence data and all other genomic sequence data currently reside in a set of publicly accessible databases. A description of these valuable resources, and how they can be accessed, is provided in the Appendix. A summary of all complete genome sequences publicly available in February 1997 is given in Table 15.5.
The complete DNA sequences of Haemophilus influenzae and Mycoplasma genitalium both correspond to relatively small bacterial genomes. As expected, they are very rich in genes, and they are especially rich in genes whose function can be surmised by comparison to other sequences in the available genome databases. M. genitalium has a 580,070 bp genome with 470 ORFs. These occur on average one per 1235 bp. The average ORF is 1040 bp. Overall the genome is 80% coding. Seventy-three percent of the ORFs correspond to previously known genes.
H. influenzae has a genome size of 1,830,137 bp. This contains 1743 coding regions, an average of one every 1042 bp. The average gene is 900 bp long. Overall, 85% of the genome is coding. Currently 1007 (58%) of the coding regions can be assigned a functional role. Of the remainder, 385 are new genes that show no significant matches to the databases, while the others match known sequences of unknown function. At an average direct cost of $0.48 per base, this project is probably representative of other large-scale efforts using similar technology.

TABLE 15.5 Completed Genome Sequences

Species              DNA          kb DNA    Largest DNA    Open Reading    Genes for
                     Molecules              (kb)           Frames          RNA
M. genitalium        1            580       580            470             38
M. pneumoniae        1            816       816            677             39
M. jannaschii        3            1740      1665           1738            45
H. influenzae        1            1830      1830           1743            76
Synechocystis sp.    1            3573      3573           3168            ?
E. coli              1            4639      4639           4200            ?
S. cerevisiae        16           12,068    1532           5885            455

Both the H. influenzae and M. genitalium sequencing projects were carried out at a single location, entirely by automated fluorescent DNA sequencing. In contrast, one of the efforts to sequence major sections of the E. coli genome, directed by Fred Blattner in Madison, Wisconsin, started as basically low-technology, manual DNA sequencing, employing a large number of relatively unskilled workers and concentrating on relatively simple protocols. The initial result was a 91.4 kb contig. The region contained 82 predicted ORFs, or roughly one per kb. The ORFs constituted about 84% of the total sequence. If we scale the properties of this region to the entire 4.7 Mb E. coli genome, we can predict that

$$\frac{4.7\ \text{Mb} \times 82\ \text{ORFs}}{0.0914\ \text{Mb}} \approx 4200\ \text{genes}$$

This is larger than estimates of the number of genes in E. coli based on the appearance of protein spots in two-dimensional electrophoretic separations. Past sampling of E. coli regions has revealed fairly uniform gene density except for areas around the terminus of replication. Hence the preliminary sequencing results on E. coli suggest that a significant number of new and interesting genes remain to be discovered. A more recent report of additional E. coli sequences is quite consistent with the earlier observations: within a 338,500-base contig, 319 ORFs were found, one per 1060 bases. Of these, 46% are potentially new genes. The complete E. coli DNA sequence has just become available, and it contains 4300 genes in 4.54 Mb, quite consistent with predictions based on partial sequencing results.
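The extrapolation used here is simple proportional scaling: the number of ORFs in a sequenced sample is multiplied by the ratio of genome size to sample size. A minimal sketch follows, using the figures quoted in the text; the function name `scale_gene_count` is a hypothetical label, and the calculation assumes, as the text does, that gene density is roughly uniform across the genome.

```python
def scale_gene_count(genes_in_sample, sample_mb, genome_mb):
    """Extrapolate a gene count from a sample to the whole genome,
    assuming uniform gene density (an approximation)."""
    return genes_in_sample * genome_mb / sample_mb


# First E. coli contig: 82 ORFs in 91.4 kb, scaled to 4.7 Mb.
first_contig = scale_gene_count(82, 0.0914, 4.7)

# Later contig: 319 ORFs in 338,500 bases, scaled to the same genome size.
second_contig = scale_gene_count(319, 0.3385, 4.7)

print(f"first contig estimate:  {first_contig:.0f} genes")
print(f"second contig estimate: {second_contig:.0f} genes")
```

Both samples give estimates of a little over 4000 genes, which is the sense in which the later report is consistent with the earlier one. The same function applies unchanged to the other scaling estimates in this section.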
The early major accomplishments in S. cerevisiae sequencing derive from a very different organizational model than the work on E. coli. The approach was still mostly very low technology. It was mostly the result of a dispersed European effort among more than 30 different laboratories, coordinated through a common data collection center in France. The complete DNA sequence of one of the smallest S. cerevisiae chromosomes, number III, was the first one determined. At 315 kb it represented the longest continuous stretch of DNA sequence known at the time. The chromosome III sequence was originally reported to contain 182 ORFs. After this was corrected by a more rigorous examination, carried out by Christian Sander in Heidelberg, 176 ORFs remained. These occur at roughly one per 2 kb, or half of the density seen in the three bacteria discussed above. The ORFs cover 70% of the DNA sequence; this is not too much lower than the total density of coding sequence in E. coli. We can make a rough estimate of the number of genes in S. cerevisiae by scaling these results to the 12.1 Mb total size of the yeast genome. The result is

$$\frac{12.1\ \text{Mb} \times 176\ \text{ORFs}}{0.315\ \text{Mb}} \approx 6760\ \text{ORFs}$$

The total number of genes in S. cerevisiae will be slightly less than the number of ORFs because occasional genes in yeast consist of more than one exon. In addition, for both bacteria and yeast, we have to add in genes for rRNAs, tRNAs, and other nontranslated species (Table 15.5).
The complete DNA sequences reported for several other S. cerevisiae chromosomes were consistent with the results for chromosome III. For example, chromosome VIII has 562,698 bp. It contains 269 ORFs, or 1 per 2 kb. Of these, 124 (46%) corresponded to genes of known function. Chromosome VI has 270 kb. It contains 129 ORFs, again about 1 per 2 kb. Of these, 76 (59%) correspond to genes with previously known function. The total sequence of S. cerevisiae is now completed. First estimates place the number of ORFs at 5885; doubtless this will change with further analysis.
In the case of C. elegans DNA sequencing, we are dealing not with continuous genomic sequence but with the sequence of selected cosmids. The effort, directed by John Sulston of Cambridge, England, and Robert Waterston of St. Louis, Missouri, also uses state-of-the-art fluorescent DNA sequencing technology with a great deal of automation. The strategy is mostly shotgun, with directed sequencing relegated mostly to closure of gaps between contigs. The first 21.14 Mb of C. elegans DNA sequence reported contained a total of 3980 genes, or 1 per 4.8 kb on the autosomes and 1 per 6.6 kb on the X chromosome. Only 46% of these matched sequences already in the DNA databases. About 28% of the total DNA is coding; 50% of the C. elegans sequence consists of genes, including both exons and introns. This is a sharp drop from the density of coding sequences in simple organisms. The total number of genes in the nematode genome is estimated to be 13,000 ± 500. This is a number close to most contemporary expectations for the sizes of the genomes of typical multicellular, highly differentiated organisms like the nematode.
The remaining two DNA sequencing projects that we will discuss illustrate some of the frustrations in dealing with the genomes of higher organisms. The complete DNA sequence of a 338,234 bp region of D. melanogaster containing the bithorax complex, important in development, has been reported by groups at Caltech and Berkeley. This region is less than 2% coding. It contains only six genes. The final sequencing project we will discuss is a relatively early effort that involved several cosmids from the tip of the short arm of human chromosome 4, a region known to contain the gene responsible for Huntington’s disease. The region is band 4p16.3. It is estimated to contain a total of 2.5 Mb of DNA. A 225-kb subset of this region was sequenced. This yielded 13 transcripts in 225 kb, or one per 18 kb on average. Another estimate of gene density could be obtained by determining the number of HTF islands in the region. This will be a minimum estimate for the number of genes, since perhaps only half to two-thirds of all genes have HTF islands nearby. In fact, in the 225 kb region, one HTF island was found on average per 28 kb. By comparison, when HTF islands were mapped to a different section of chromosome 4, a 460 kb region near the marker D4S111, the frequency of occurrence of these gene-associated sequences was one per 30 kb. All of these estimates of gene density are remarkably consistent. If we scale these expected gene densities to the entire Huntington’s disease region, we obtain an estimate of

$$\frac{2.5\ \text{Mb} \times 13\ \text{genes}}{0.225\ \text{Mb}} \approx 143\ \text{genes}$$

This makes it clear why finding the gene for Huntington’s disease was not an easy task.
The first DNA sequencing effort in band 4p16.3 was carried out in Bethesda, Maryland, under the direction of Craig Venter. It involved a total of 58 kb of DNA sequence in three cosmids. Three genes were found, each with an HTF island. The average gene density in this relatively small region is one per 19 kb, which is quite consistent with expectations. Less than 10% of the region is coding sequence. The number of Alu repeats in the region is 62, or roughly one per kb. This is comparable to what has been seen in the DNA sequence of two other gene-rich, G + C-rich regions. In the human growth hormone region 0.7 Alu’s were found per kb; in the HPRT region 0.9 Alu’s were found per kb. In stark contrast, in the globin region, which is G + C poor, there are only 0.1 Alu’s per kb. These results illustrate the mosaic nature of the human genome rather dramatically.
Unlike simple genomes, with relatively uniform DNA compositions, mammalian genomes have mosaic compositions, which are reflected in chromosome banding patterns. Scaling of a regional gene density to estimate the total number of genes must take into account regional characteristics. Long before large-scale DNA sequencing or genome mapping was underway, Giorgio Bernardi developed a method of fractionating genomes into regions with various G + C content. This was done by equilibrium ultracentrifugation in density gradients (Chapter 5). The resulting fractions were called isochores. Altogether, Bernardi obtained evidence for five distinct human DNA classes; these could be divided into three easily separated and manipulated fractions. Their properties are summarized below:

CLASS     GENE DENSITY    GENOME FRACTION    LOCATION
L1,L2     1               62%                Dark bands
H1,H2     2               31%                Light bands
H3        16              7%                 Telomeric light bands

Several aspects of these results deserve comment. Gene density means the relative number of genes, based on cDNA library comparisons. The genome fraction is estimated from the total amount of material in the density-separated fractions. The telomeric light bands have very special properties that we have alluded to before. Figure 15.10 illustrates the actual locations seen when DNAs from Bernardi’s fraction H3 are mapped by FISH. The preferential location of these sequences on just a small subset of human chromosomal regions is really remarkable.
The Huntington’s disease region is known to be a gene-rich light band, so we can pretty much exclude the L1 and L2 classes from consideration. In the Huntington’s region, there is one gene on average per 18 kb. If this region is an H3 region, then we can estimate the number of genes in the human genome as

H3        11,700 genes
H1,H2     6500 genes
L1,L2     6500 genes

for a total of 24,700 genes. This estimate is less than twice the number of genes in C. elegans, which seems far too low. If we assume that the Huntington’s disease region is an H1,H2 region, then the estimate of the number of genes in the human genome becomes

H3        92,000 genes
H1,H2     51,100 genes
L1,L2     51,100 genes

for a total of 194,200 genes. This is a depressingly large number, much larger than previous estimates. This example illustrates how difficult it is to know from very fragmentary data what the real target size of the human genome project is. Perhaps the Huntington’s disease region is somewhere between the properties of the H3 and the H1 plus H2 fractions, and the gene number somewhere mercifully between the two rather upsetting extremes we have computed. More recent estimates of the number of human genes range from 65,000 to 150,000, which is not too different from the average of our original estimates.
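
The two isochore-based estimates can be retraced with a short calculation. The sketch below takes the relative gene densities and genome fractions from the isochore table and the figure of one gene per 18 kb from the text. The total genome size of 3 × 10^9 bp is an assumption introduced here, since the text does not state the value it used, and the names `ISOCHORES` and `estimate_genes` are hypothetical. The results come out close to, but not identical with, the rounded figures quoted above.

```python
GENOME_BP = 3.0e9   # assumed human genome size; not given in the text

# class: (relative gene density, fraction of genome), from the isochore table
ISOCHORES = {
    "L1,L2": (1, 0.62),
    "H1,H2": (2, 0.31),
    "H3": (16, 0.07),
}


def estimate_genes(reference_class, kb_per_gene=18.0, genome_bp=GENOME_BP):
    """Estimate genes per isochore class, given that the reference class
    has one gene per `kb_per_gene` kb and the others scale by relative
    gene density."""
    ref_density, _ = ISOCHORES[reference_class]
    genes_per_bp_ref = 1.0 / (kb_per_gene * 1000.0)
    estimate = {}
    for name, (density, fraction) in ISOCHORES.items():
        genes_per_bp = genes_per_bp_ref * density / ref_density
        estimate[name] = fraction * genome_bp * genes_per_bp
    return estimate


for reference in ("H3", "H1,H2"):
    result = estimate_genes(reference)
    total = sum(result.values())
    print(f"Huntington's region treated as {reference}: total {total:,.0f}")
    for name, genes in result.items():
        print(f"  {name:6s} {genes:10,.0f}")
```

Because the H3 density is eight times the H1,H2 density, moving the reference region from one class to the other changes every line of the estimate by the same factor of eight, which is why the two totals differ so widely.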
Figure 15.10 Distribution of extremely G + C-rich sequences in the human genome. Solid bars show relative hybridization of the H3 dark fraction. Open bars show rRNA-encoding DNA. Taken from Saccone et al. (1992).

FINDING ERRORS IN DNA SEQUENCES

Quite a few different kinds of errors contaminate data in existing DNA sequence banks.
As the amount of data escalates, it will become increasingly important to audit these data continuously. Suspect data need to be flagged before they propagate and affect the results of many sequence comparisons or experimental scientific efforts. For example, an error in one of the earliest complete DNA sequences, the plasmid pBR322, produced a spurious stop codon in one of the proteins coded for by this plasmid. This confounded many researchers who were using this plasmid as a cloning and expression system, since a protein band with an unexplainable size was frequently seen.
Some common errors in DNA sequence data are quite easy to find and correct; others
are almost impossible. A major class of error is incorporation of a totally inappropriate sequence. This can come about if, as is not uncommon, DNA samples are mixed up in the laboratory prior to sequencing. It can arise from cloning artifacts. A clone may have