Genomics: The Science and Technology Behind the Human Genome Project. |
Charles R. Cantor, Cassandra L. Smith |
|
Copyright © 1999 John Wiley & Sons, Inc. |
|
ISBNs: 0-471-59908-5 (Hardback); 0-471-22056-6 (Electronic) |
11 Strategies for Large-Scale DNA Sequencing
WHY |
STRATEGIES ARE |
NEEDED |
|
|
|
|
|
Consider the task of determining |
the sequence of a 40-kb cosmid insert. Suppose |
that |
|||||
each sequence read generates 400 |
base pairs of raw |
DNA sequence. At |
the |
minimum |
|
||
two-fold redundancy needed to assemble a completed sequence, 200 distinct DNA sam- |
|
||||||
ples will have to be prepared, sequenced, and analyzed. At a more typical redundancy of |
|||||||
ten, |
1000 samples |
will have to |
be processed. The |
chain of events |
that |
takes |
the cosmid, |
converts it into these samples, and provides whatever sequencing primers are needed is the upstream part of a DNA sequencing strategy. The collation of the data, their analysis, and their insertion into databases is the downstream part of the strategy. It is obvious that this number of samples and the amount of data involved cannot be handled manually, and
the project cannot be managed |
in an unsystematic fashion. A key part of |
the |
strategy will |
be developing the protocols |
needed to name samples generated along |
the |
way, retain- |
ing an audit of their history, and developing procedures that minimize lost samples, unnecessary duplication of samples or sequence data, and prevent samples from becoming mixed up.
There are several well-established, basic strategies for sequencing DNA targets on the scale of cosmids. It is still not yet totally clear how these can best be adopted for Mb scale sequencing, but it is daunting to realize that this is an increase in scale by a factor of 50, which is not something to be undertaken lightly. The basic strategies range from shotgun, which is a purely random approach, to highly directed walking methods. In this chapter the principles behind these strategies will be illustrated, and some of their relative advantages and disadvantages will be described. In addition we will discuss the relative
merits |
of sequencing genomes and sequencing just the genes, by the use of cDNA li- |
braries. Current approaches for both of these objectives will be described. |
|
SHOTGUN |
DNA SEQUENCING |
This is the method that has been used, up to now, for virtually all large-scale DNA sequencing projects. Typically the genomic (or cosmid) DNA is broken by random shear.
The DNA fragments are trimmed to make sure they have blunt ends, and as such they are subcloned into an M13 sequencing vector. By using shear breakage, one can generate fragments of reasonably uniform size, and all regions of the target should be equally represented. At first, randomly selected clones are used for sequencing. When this becomes
inefficient, because the same clones start |
to |
appear often, |
the set |
of clones already se- |
||||
quenced can |
be pooled |
and hybridized back |
to |
the library |
to |
exclude |
clones that |
come |
from regions |
that have |
already been over sampled (a similar |
procedure |
is described |
later |
361
362 STRATEGIES FOR LARGE-SCALE DNA SEQUENCING
in the chapter for cDNA sequencing; see Fig. 11.17). The use of a random set of small
targets ensures that overlapping sequence data will be obtained, |
across the entire target, |
||
and that both strands will |
be sampled. An obvious complication |
that this |
causes is that |
one does not know a priori |
to which strand a given segment of sequence |
corresponds. |
|
This has to be sorted out by software as part of problem of assembling the shotgun clones |
|||
into finished sequence. |
|
|
|
Shotgun sequencing has |
a number of distinct advantages. It |
works, it |
uses only tech- |
nology that is fully developed and tested, and this technology is all fairly easy to automate. Shotgun sequencing is very easy at the start. As redundancy builds, it catches some
of the errors. The number of primers needed is very small. If one-color sequencing or dye terminator sequencing is used, two primers suffice for the entire project if both sides of each clone are examined; only one primer is needed if a single side of each clone is examined. With four-color dye primer sequencing the number of primers needed are eight
and four, respectively.
The disadvantages of shotgun sequencing fall into three categories. The redundancy is very high, definitely higher than more orderly methods, and so, in principle, optimized shotgun sequencing will be less efficient than optimized directed methods. The assembly
of shotgun sequence data is computationally very intensive. Algorithms exist that do the job effectively, such as Phil Green’s phrap program, but these use much computer time. The reason is that one must try, in assembly, to match each new sequence obtained with
all of the other sequences and their complements in order to compensate for the lack of prior knowledge of where in the target and which strand the sequence derives from. Repeats can confuse the assembly process. A simple example is shown in Figure 11.1. Suppose that two sequence fragments have been identified with a nearly identical repeat.
One possibility is that the two form a contig; they derive from two adjacent clones that overlap at the repeat, and the nonidentity is due to sequencing errors. The other possibility is that the two sequences are not a true contig, and they should not be assembled. The software must keep track of all these ambiguities and try to resolve them, by appropriate statistical tests, as the amount of data accumulates.
Closure is a considerable problem in most shotgun sequencing. Because of the statistics of random picking and the occurrence of uncloneable sequences, even at high redun-
dancy, and even if hybridization is used to help select new clones, there will be gaps. These have to be filled by an alternate strategy; since the gaps are usually small, PCRbased sequencing across them is usually quite effective. A typical shotgun approach to se-
Figure 11.1 Effect of a repeated sequence on sequence assembly.
|
|
DIRECTED SEQUENCING WITH WALKING PRIMERS |
363 |
||||||||||
quencing an entire cosmid will require a total of 400 kb of sequence data (redundancy of |
|
|
|
||||||||||
10) and 1000 four-color sequencing gel lanes. If 32 samples are analyzed at once, and |
|
||||||||||||
three runs can be done per day, this will require 10 working days with a single automated |
|
|
|||||||||||
instrument. In practice, around 20% sample failure can be anticipated, and most laborato- |
|
|
|||||||||||
ries can only manage two runs a day. Therefore a more conservative time estimate is that |
|
|
|||||||||||
a typical laboratory, with one automated instrument at efficient production speed, will re- |
|
|
|||||||||||
quire about 18 working days to sequence a cosmid. This is a dramatic advance from the |
|
|
|||||||||||
rates |
of data acquisition in the very first DNA sequencing |
projects. However, it is |
not yet |
|
|||||||||
a speed at which the entire human genome can be done efficiently. At 8 |
|
|
|
|
104 |
cosmid |
|||||||
equivalents, the human genome would require 4 |
|
|
|
|
|
103 |
laboratory years of effort at this |
|
|||||
rate, even assuming seven-day work weeks. It is not an inconceivable |
task, |
but |
the |
|
|||||||||
prospects will be much more attractive once an additional factor of 10 |
in throughput |
can |
|
|
|||||||||
be achieved. This is basically achievable by the optimization of existing gel-based meth- |
|
||||||||||||
ods described in Chapter 10. |
|
|
|
|
|
|
|
|
|
|
|
||
DIRECTED SEQUENCING WITH WALKING PRIMERS |
|
|
|
|
|
|
|
|
|
|
|
||
This strategy is the opposite of shotgun sequencing, in that nothing is left to chance. The |
|
||||||||||||
basic idea is illustrated in Figure 11.2. A primer is selected in the vector arm, adjacent to |
|
||||||||||||
the genomic insert. This is used to read as far along the sequence as |
one can in a single |
|
|||||||||||
pass. |
Then the end of the newly determined sequence |
is |
used |
to |
synthesize |
another |
|
||||||
primer, and this |
in turn serves to generate the |
next |
patch of |
|
sequence |
data. |
One |
can |
|
||||
walk in from both ends of the clone, and in practice, this will allow both strands to be |
|
||||||||||||
examined. |
|
|
|
|
|
|
|
|
|
|
|
|
|
The immediate, obvious advantage of primer walking is that only about a twofold re- |
|
|
|||||||||||
dundancy is required. There are also no assembly or closure problems. With good sensi- |
|
|
|||||||||||
tivity in the sequence reading, it is possible to perform walking |
primer sequencing |
di- |
|
|
|||||||||
rectly on bacteriophage or cosmid DNA. This eliminates almost all of |
the steps involved |
|
|
||||||||||
in subcloning and DNA sample preparation. These advantages are |
|
all |
significant, |
and |
|
|
|||||||
would |
make directed priming the method of choice for large-scale |
DNA |
projects if |
a |
|
|
|
||||||
number of the current disadvantages in this approach could be overcome. |
|
|
|
|
|
|
|||||||
The major, obvious, disadvantage in primer walking is that a great deal of DNA syn- |
|
||||||||||||
thesis is required. If 20-base-long primers are used, to guarantee that the |
sequence se- |
|
|||||||||||
lected |
is unique |
in the genome, one must synthesize |
5% of |
the |
total sequence for |
each |
|
Figure 11.2 Basic design of a walking primer sequencing strategy.
364 STRATEGIES FOR LARGE-SCALE DNA SEQUENCING
strand, or 10% of the overall sequence. By accepting a small failure rate, the size of the primer can be reduced. For example, in 45 kb of DNA sequence, one can estimate by Poisson statistics that the probability of a primer sequence not occurring elsewhere, and only at the site of interest is
Primer length |
10 |
11 |
12 |
Unique |
0.918 |
0.979 |
0.998 |
Thus primers less than 12 bases long are not going to be reliable. With 12 mers, one still has to synthesize 6% of the total sequence to examine both strands. This is a great deal of effort, and considerable expense is involved with the current technology for automated oligonucleotide synthesis.
The problem is that existing automated oligonucleotide synthesizers make a minimum of a thousand times the amount of primer actually needed for a sequence determination.
The major cost is in the chemicals used for the synthesis. Since the primers are effectively unique, they can be used only once, and therefore all of the cost involved in their synthesis is essentially wasted. Methods are badly needed to scale down existing automated
DNA synthesis. |
The first multiplex synthesizers built could only make 4 to 10 samples at |
||
a time; this |
is far less than would be needed for large-scale DNA sequencing by primer |
||
walking. Recently |
instruments |
have been described that can make 100 primers at a time |
|
in amounts 10% |
or less of |
what is now done conventionally. However, such instruments |
are not yet widely available. In the near |
future primers will be made thousands at a time |
by in situ synthesis in an array format (see |
Chapter 12). |
A second major problem with directed |
sequencing is that each sample must be done |
serially. This is illustrated in Figure 11.3. Until the first patch of sequence is determined, one does not know which primer to synthesize for the next patch of sequence. What this
means is that it is terribly inefficient to sequence only a single sample at once by primer
walking. |
Instead, one must |
work with at least as many samples as there are |
DNA |
se- |
||
quencing |
lanes |
available, |
and ideally one would be processing two to three |
times |
this |
|
amount so |
that |
synthesis |
is |
never a bottleneck. This means that the whole project has to |
||
be orchestrated carefully |
so |
that an efficient flow of samples occurs. Directed |
sequencing |
is not appropriate for a single laboratory interested in only one target. Its attractiveness is for a laboratory that is interested in many different targets simultaneously, or that works on a system that can be divided into many targets that can be handled in parallel.
At present, a sensible compromise is to start a sequencing project with the shotgun strategy. This will generate a fairly large number of contigs. Then at some point one switches to a directed strategy to close all of the gaps in the sequence. It ought to be possible to make this switch at an earlier stage than has been done in the past as walking becomes more efficient and can cover larger regions. In such a hybrid scheme the initial
Figure 11.3 Successive cycles of primer synthesis and sequence determination lead to a potential bottleneck in directed sequencing strategies.
|
PRIMING WITH MIXTURES OF SHORT OLIGONUCLEOTIDES |
365 |
||||||
shotgunning serves to accumulate large numbers of samples, which are then extended by |
|
|||||||
parallel directed sequencing. |
|
|
|
|
|
|
|
|
Three practical considerations enter the |
selection |
of |
particular primer |
sequences. |
|
|||
These can be dealt with by existing software packages that scan the sequence and suggest |
|
|||||||
potentially good primers. The primer must be as specific as possible for the template. |
|
|||||||
Ideally a sequence is chosen where slight mismatches elsewhere on the template are not |
|
|||||||
possible. The primer sequence must occur once and only once in the template. Finally the |
|
|||||||
primer should not be capable of forming stable secondary structure with itself—intramol- |
|
|||||||
ecularly, as a hairpin, or intermolecularly—as a dimer. The former category is particularly |
|
|||||||
troublesome, since, as illustrated in Chapter 10, there are occasional DNA sequences that |
|
|||||||
form unexpectedly stable hairpins, and we do not yet understand these well enough to |
|
|||||||
generalize and be able to identify and reject them a priori. |
|
|
|
|
|
|||
PRIMING WITH MIXTURES OF SHORT OLIGONUCLEOTIDES |
|
|
|
|
|
|||
Several groups have recently |
developed schemes |
to circumvent the difficulties required |
|
|||||
by custom synthesis of large numbers of unique DNA sequencing primers. The relative |
|
|||||||
merits of these different schemes are still not |
fully evaluated, but on the surface, they |
|
||||||
seem quite powerful. The first notion, put forth by William Studier, was to select, arbitrar- |
|
|||||||
ily, a subset of the 4 |
9 possible 9-mers |
and make all of these compounds in advance. The |
|
|||||
chances that a particular 9-mer would be present in a 45-kb cosmid are given below, cal- |
|
|||||||
culated from the Poisson distribution P( |
|
|
|
n ), where |
n |
is the expected number of |
occur- |
|
rences: |
|
|
|
|
|
|
|
|
|
P(O) 0.709 |
P(1) 0.244 |
P( |
1) 0.047 |
|
|||
Therefore, with no prior knowledge about the |
sequence, |
a |
randomly |
selected |
9-mer |
|
||
primer will give a readable sequence about 20% of the time. If it does, as shown in Figure |
|
|||||||
11.4, its complement will also give readable sequence. Thus the original subset of 4 |
9 |
|||||||
|
||||||||
compounds should be constructed of 9-mers and their complements. |
|
|
|
|
||||
One advantage of this approach is that no sequence information about the sample is |
|
|||||||
needed initially. The same set of primers will be used over and over again. As sequence |
|
|||||||
data accumulate, some will contain sites that allow |
priming |
from members |
of the |
pre- |
|
|||
made 9-mer subset. Actual simulations indicate that the whole process could be rather ef- |
|
|||||||
ficient. It is also parallelizable in an unlimited fashion because one can carry as many dif- |
|
|||||||
ferent targets along as one has sequencing gel lanes |
available. An obvious and |
serious |
|
Figure 11.4 Use of two complementary 9-base primers allows sequence reading in two directions.
366 STRATEGIES FOR LARGE-SCALE DNA SEQUENCING
disadvantage of this method is the high fraction of wasted reactions. There are possible ways to improve this, although they have not yet been tried in practice. For example one could pre-screen target-oligonucleotide pairs by hybridization, or by DNA polymerasecatalyzed total incorporation of dpppN’s. This would allow those combinations in which the primer is not present in the target to be eliminated before the time-consuming steps of running and analyzing a sequencing reaction are undertaken.
At about the same time that Studier proposed and began to test his scheme, an alterna-
tive solution to this problem |
was put forth by Wacslaw Szybalski. His original scheme |
was to make all possible 6-mers |
using current automated DNA synthesis techniques. With |
current methods this is quite |
a manageable task. Keeping track of 4096 similar com- |
pounds is feasible with microtitre plate handling robots, although this is at the high end of the current capacity of commercial equipment. To prime DNA polymerase, three adjacent
6-mers are used in tandem, as illustrated in Figure 11.5. They are bound to the target, covalently linked together with DNA ligase, and then DNA polymerase and its substrates
are added. The principle behind this approach is that end stacking (see Chapter 12) will favor adjacent binding of the 6-mers, even though there will be competing sites elsewhere
on the template. Ligation will help ensure a preference for sequence-specific, adjacent 6- mers because ligase requires this for its own function. The beauty of this approach is that the same 6-mer set will be used ad infinitum. Otherwise, the basic scheme is ordinary di-
rected sequencing by primer walking. After each step the new sequence |
must |
be |
scanned |
|
|
to determine which combination of 6-mers should |
be selected as the |
optimal |
choice |
for |
|
the next primer construction. |
|
|
|
|
|
The early attempts to carry out tandem |
6-mer priming seemed |
discouraging. The |
|||
method appeared to be noisy and unreliable. Too |
many of the lanes |
gave unreadable |
se- |
quence. This prompted several variants to be developed, as described below. More recently, however, conditions have been found which make the ligation-mediated 6-mer priming very robust and dependable. This approach, or one of the variants below, is worth considering seriously for some of the next generation large-scale DNA sequencing pro-
jects. The key issues that need to be resolved are its generality and the ultimate cost savings it affords against any possible loss in throughput. Smaller-scale, highly multiplexed synthesis of longer primers will surely be developed. These larger primers are always
Figure 11.5 Directed sequencing by ligation-mediated synthesis of a primer, directed by the very template that constitutes the sequencing target.
|
|
|
|
|
PRIMING WITH MIXTURES OF SHORT OLIGONUCLEOTIDES |
367 |
||
likely to perform slightly better than ligated mixtures of smaller compounds because the |
|
|||||||
possibility of using the 6-mers in unintended orders, or using only two of the three, is |
|
|||||||
avoided. There is enough experience with longer primers to know how to design them to |
|
|||||||
minimize chances of failure. Similar experience must be gained with 6-mers before an ac- |
|
|||||||
curate evaluation of their relative effectiveness can be made. |
|
|||||||
Two more recent attempts to use 6-base oligonucleotide mixtures as primers avoid the |
|
|||||||
ligation step. This saves time, and it also saves any potential complexities arising from |
|
|||||||
ligase side reactions. Studier’s approach is to add the 6-mers to DNA in the presence of a |
|
|||||||
nearly |
saturating |
amount |
of |
single-stranded DNA binding protein (ssb). This protein |
|
|||
binds to DNA single strands selectively and cooperatively. Its effect on the binding of sets |
|
|||||||
of small complementary short |
oligonucleotides is shown schematically in Figure 11.6. |
|
||||||
The key point is the high cooperativity |
of ssb to single-stranded DNA. What this does is |
|
||||||
to make configurations in which 6-mers are bound at separate sites less favorable, ener- |
|
|||||||
getically, than configurations in which they are stacked together. In essence, ssb coopera- |
|
|||||||
tivity is translated into 6-mer DNA binding cooperativity. This cuts down on false starts |
|
|||||||
from individual or |
mispaired |
6-mers. An unresolved issue is how much mispairing still |
|
|||||
can go on under the conditions used. We have faced before the problem in evaluating this; |
|
|||||||
there are many possible 6-mers, |
and many |
more possible sets of 6-mer mixtures, so only |
|
|||||
a minute fraction of these possibilities has yet been tried experimentally. One simplifica- |
|
|||||||
tion is the use of primers with 5-mC and 2-aminoA, instead of C and A, respectively. This |
|
|||||||
allows five base primers to be substituted for six base primers, reducing the complexity of |
|
|||||||
the overall set of compounds needed to only 1024. |
|
|||||||
An |
alternative approach |
to |
the problem has been developed by Levy Ulanovsky. He |
|
||||
has shown that conditions can |
be |
found where it is possible to eliminate both the ligase |
|
|||||
and the ssb and still see highly |
reliable priming. Depending on the particular sequence, |
|
||||||
some five base primers can be substituted for six base primers. The effectiveness of the |
|
|||||||
approach appears to derive from the intrinsic preference of DNA polymerase for long |
|
|||||||
primers, so a stacked set is preferentially utilized even though individual members of the |
|
|||||||
set may bind to the DNA template |
elsewhere alone or in pairs. It remains to be seen, as |
|
||||||
each of |
these approaches |
is developed further, which will prove to be the most efficient |
|
|||||
and accurate. Note that in any directed priming scheme, for a 40-kb cosmid, one must se- |
|
|||||||
quence 80 kb of DNA. If a net of 400 new bases of high-quality DNA sequence can be |
|
|||||||
completed per run, it will take 200 steps |
to |
complete the walk. As a worst-case analysis, |
|
|||||
one may have to search the 3 |
|
|
|
-terminal 100 bases of a new sequence to find a target suit- |
|
|||
able for priming. Around |
500 |
base pairs |
of |
sequence must actually be accumulated per |
|
run to achieve this efficiency. Thus a directed priming cosmid sequencing project is still a considerable amount of work.
Figure 11.6 Use of single-stranded binding protein (ssb) to improve the effectiveness of tandem, short sequencing primers.
368 STRATEGIES FOR LARGE-SCALE DNA SEQUENCING
ORDERED SUBDIVISION OF DNA TARGETS
There are a number of strategies for DNA sequencing that lie between the extreme randomness of the shotgun approach and the totally planned order of primer walking. These
schemes |
all have in common a set of cloning or physical purification steps that divides |
the DNA |
target into samples that can be sequenced and that also provides information |
about their location. In this way a large number of samples can be prepared for sequenc- |
|
ing in parallel, which circumvents some of the potential log jams that are undesirable fea- |
|
tures of |
primer walking. At the same time from the known locations one can pick clones |
to sequence in an intelligent, directed way and thus cut down substantially on the high re- |
dundancy of shotgun sequencing. One must keep in mind, however, that the cloning or separation steps needed for ordered subdivision of the target are themselves quite labor intensive and not yet easily automated. Thus there is no current consensus on which is the
best approach. One strategy used recently |
to |
sequence |
a number |
of |
small |
bacterial |
|
genomes is the creation of a dense (highly redundant) set of clones in a sequence-ready |
|||||||
vector—that is, a vector like a cosmid which allows direct sequencing without additional |
|||||||
subcloning. The sequence at both ends of each of the |
clones is |
determined |
by |
using |
|||
primers from known vector sequence. As more |
primary DNA |
sequence |
data |
become |
|
||
known and available in databases, the chances that one or both ends |
of each clone |
will |
|||||
match previously determined sequence increase considerably. Such database matches, as |
|
||||||
well as any matches among the set of clones |
themselves, |
accomplish |
two |
purposes |
at |
||
once. They order the clones, thus obviating much or all of the need for prior mapping ef- |
|||||||
forts. In addition they serve as staging points |
for the construction |
of |
additional |
primers |
|||
for directed sequencing. Extensions of the use of sequence-ready clones |
are discussed |
||||||
later in the chapter. |
|
|
|
|
|
|
|
Today, there is no proven efficient strategy to subdivide a mega-YAC for DNA se- |
|||||||
quencing. Some strategies that have proved effective for subdividing smaller clones are |
|||||||
described in the next few sections. A megabase YAC is, at an absolute minimum, going to |
|||||||
correspond to 25 cosmids. However, there is |
no reason to think that this is |
the |
optimal |
||||
way to subdivide it. Even if direct genomic sequencing from YACs proves feasible, one |
|||||||
will want to have a way of knowing, at least approximately, where in the YAC a given |
|||||||
primer or clone is located, to avoid becoming hopelessly confused by interspersed repeat- |
|||||||
ing sequences. However, for large YACs there is an additional problem. As described |
|||||||
elsewhere, these clones are plagued by rearrangements and unwanted insertions of yeast |
|||||||
genomic DNA. For sequencing, the former problem may be tolerable, but |
not |
the latter. |
|||||
To use mega-YACs for direct sequencing, we would first need to develop methods to re- |
|||||||
move or ignore contaminating yeast genomic DNA. |
|
|
|
|
|
|
|
TRANSPOSON-MEDIATED DNA SEQUENCING
The structure of a typical transposon is shown in Figure 11.7. It has terminal repeated se-
quences, a selectable marker, and codes for a transposase that allows |
the transposon to |
hop from one DNA molecule (or location) to another. The ideal transposon for DNA se- |
|
quencing will have recognition sites for a rare restriction enzyme, |
like Not I, located at |
both ends, and it will have different, known, unique DNA sequences close to each end. A |
|
scheme for using transposons to assist the DNA sequencing of clones in |
|
in Figure 11.8. The clone to be sequenced is transformed into |
a transposon-carrying |
E. coli |
is shown |
E.
TRANSPOSON-MEDIATED DNA SEQUENCING |
369 |
Figure 11.7 |
Structure of a typical transposon useful for DNA sequencing. |
N is a Not I site; |
a and |
b are unique transposon sequences. Arrow heads show the terminal inverted repeats of the transpo- |
|
|
|
son. |
|
|
|
coli. |
It is present in an |
original vector containing a single infrequent |
cutting site |
(which |
|||
can be the same as the ones |
on the transposon) and two unique known DNA sequences |
|
|||||
near to the vector-insert boundary. The transposon may jump into the cloned DNA. If it |
|
||||||
does so, this appears to occur approximately at random. The nonselectivity of transposon |
|
||||||
hopping is a key assumption |
in the potential effectiveness of this strategy. From results |
||||||
that have been obtained thus far, the randomness of transposon insertions seems quite suf- |
|
||||||
ficient to make this approach viable. |
|
|
|
||||
|
DNA from cells in which transposon jumping has potentially occurred can be trans- |
|
|||||
ferred to a new host cell that does not have a resident transposon, and recovered by selec- |
|||||||
tion. The transfer can be done by mating or by transformation. Since the transposon car- |
|
||||||
ries a selectable marker, and the vector |
which originally contained the |
clone carries |
a |
||||
second selectable marker, there will be no difficulty in capturing the desired transposon |
|
||||||
insertions. The goal now is to find a large set of different insertions and map them within |
|||||||
the |
original cloned |
insert |
as efficiently as possible. The infrequent cutting sites placed |
||||
into |
the |
transposon |
and the |
vector greatly |
facilitate this analysis. DNA is |
prepared from |
all of the putative transposon insertions, cut with the infrequent cutter, and analyzed by ordinary agarose gel electrophoresis, or PFG, depending on the size of the clones. Each insert should give two fragments. Their sizes give the position of the insert. There is an ambiguity about which fragment is on the left and which is on the right. This can be resolved if the separated DNAs are blotted and hybridized with one of the two unique se-
quences originally placed in the vector (labeled 1 and 2 in Fig. 11.8).
Figure 11.8 |
One strategy for DNA sequencing by transposon hopping. |
a, b, and N are as defined |
in Figure 11.7; 1 and 2 are vector unique sequences.
370 STRATEGIES FOR LARGE-SCALE DNA SEQUENCING
Figure 11.9 |
DNA sequencing from a |
set of transposon insertions. |
(a ) Chosing a set of properly |
spaced transposons. ( |
b ) Use of |
transposon sequences as primers. |
|
Now the position of the transposon insertion is known. One can select samples that ex- |
|
||||||
tend through the target at intervals than |
can be spanned by sequencing, |
as |
shown |
in |
|
||
Figure 11.9. To actually determine the sequence, one can use the two unique sequences |
|
||||||
within |
the transposon as primers. These will always allow sequence to be |
obtained |
in |
|
|||
both directions extending out from the site of the transposon insertion (Fig. 11.9). One |
|
||||||
does not a priori know the orientation of the two sequences primed from the transposon |
|
|
|||||
relative to the initial clone. However, this |
can be determined, if necessary, by |
reprobing |
|
||||
the infrequent-cutter digest fingerprint of the clone with one of the transposon unique se- |
|
||||||
quences. In practice, though, if a sufficiently dense set of transposon inserts is selected, |
|
||||||
there will be sufficient sequence overlap to assemble the sequence and confirm the origi- |
|
||||||
nal map of the transposon insertion sites. |
|
|
|
|
|
||
DELTA |
RESTRICTION |
CLONING |
|
|
|
|
|
This |
is a relatively straightforward procedure that allows DNA sequences |
from |
vector |
|
|||
arms to be used multiple times as sequencing primers. It is a particularly efficient strategy |
|
||||||
for sampling the DNA sequence of a target in numerous places and for subsequent, paral- |
|
|
|||||
lel, directed primer sequencing. Delta restriction cloning requires that the target of inter- |
|
||||||
est be contained near a multicloning site in the vector, as shown in Figure 11.10. The var- |
|
||||||
ious enzyme recognition sequences in that site are used, singly, to digest the clone, and all |
|
||||||
of the resulting digests are analyzed by gel |
electrophoresis. One looks for |
enzymes |
that |
|
|||
cut the clone into two or more fragments. These enzymes will have cutting |
sites within |
|
|||||
the insert. From the size of the fragments |
generated, one can determine approximately |
|
|||||
where in the insert the internal cleavage site is located. Those cuts that occur near regions |
|
||||||
of the insert where sequence is desired can be converted to targets for sequencing by di- |
|
||||||
luting the cut clones, religating them, and transforming them back into |
|
|
E. coli. |
The effect |
|||
of the cleavage of religation is to drop out from the clone all of the DNA between the |
|
||||||
multicloning site |
and the internal restriction |
site. This brings the vector arm containing |
|