

Figure 11.10 Delta restriction subcloning. Top panel shows restriction digests of the target plasmid. Bottom panel shows sequences read from delta subclones. Adapted from Ansorge et al. (1996).

the multicloning site adjacent to what was originally an internal segment of the insert and allows vector sequence to be used as a primer to obtain this internal sequence. In test cases about two-thirds of a 2- to 3-kb insert can be sequenced by testing 10 enzymes that cut within the polylinker. Only a few of these will need to be used for actual subcloning. The problem with this approach is that one is at the mercy of an unknown distribution of restriction sites, and at present, it is not clear how the whole process could be automated to the point where human intervention becomes unnecessary.

NESTED DELETIONS

 

 

 

 

This is a more systematic variant of the type of delta restriction cloning approach just described. Here a clone is systematically truncated from one or both ends by the use of exonucleases. The original procedure, developed by Stephen Henikoff, is illustrated in Figure 11.11. A DNA target is cut with two different restriction nucleases. One yields a 3′-overhang; the other yields a 5′-overhang. The enzyme E. coli exonuclease III degrades a 3′-overhang very inefficiently, while it degrades the 3′-strand in a 5′-overhang very efficiently. The result is degradation from only a single end of the DNA. After exonuclease treatment, the ends of the shortened insert must be trimmed to produce cloneable blunt ends. The DNA target is then taken up in a new vector and sequenced using primers from that new vector. In principle, this process ought to be quite efficient. In practice, while this proposed strategy has been known for many years, it does not seem to have found many adherents.

372 STRATEGIES FOR LARGE-SCALE DNA SEQUENCING

Figure 11.11 Preparation of nested deletion clones. A and B are restriction enzyme cleavage sites that give the overhanging ends indicated. Adapted from Henikoff (1984).

A variation on the original exonuclease III procedure for generating nested deletions has been described by Chuan Li and Philip Tucker. In this method, termed exoquence DNA sequencing, a DNA fragment with different overhangs at its ends is produced and one end is selectively degraded with exonuclease III. At various time points the reaction is stopped, and the resulting template-primer complexes are treated separately with one of several different restriction enzymes and then subjected to Sanger sequencing reactions, as shown in Figure 11.12. The final DNA sequencing ladders are examined directly by gel electrophoresis. Thus no cloning is required. If the restriction enzymes are chosen well, and the times at which the reactions are stopped are spaced sufficiently closely, sufficient sequence data will be revealed to generate overlaps that will allow the reconstruction of contiguous sequence. This is an attractive method in principle; it remains to be seen whether it will prove more generally appealing than the original nested deletion cloning approach.
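The requirement that stop times be "spaced sufficiently closely" can be illustrated with a rough calculation. The sketch below assumes a constant exonuclease digestion rate and a fixed read length; the function name, the rate of 200 bp per minute, and the other numbers are hypothetical illustration values, not figures from Li and Tucker. It also ignores the contribution of the restriction enzymes.

```python
from typing import List


def stop_times_min(insert_bp: int, read_bp: int, overlap_bp: int,
                   rate_bp_per_min: float) -> List[float]:
    """Return stop times (minutes) so that successive reads overlap.

    Assumes the exonuclease removes DNA at a constant rate, so the
    deletion end point advances by (read_bp - overlap_bp) between stops.
    """
    step_bp = read_bp - overlap_bp
    if step_bp <= 0:
        raise ValueError("overlap must be shorter than the read length")
    times = []
    deleted_bp = 0
    # Keep adding time points until the deletions span the whole insert.
    while deleted_bp < insert_bp:
        times.append(deleted_bp / rate_bp_per_min)
        deleted_bp += step_bp
    return times


# Hypothetical example: 3-kb insert, 500-base reads, 100-base overlaps.
times = stop_times_min(insert_bp=3000, read_bp=500, overlap_bp=100,
                       rate_bp_per_min=200.0)
print(times)
```

Under these assumptions a shorter read or a larger required overlap increases the number of time points, which is one way to see why the spacing of the stops matters.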

 


Figure 11.12 Strategy for exoquence DNA sequencing. Shown is a relatively simple case; there are more elaborate cases if the restriction enzymes used cut more frequently. A and B are restriction enzyme sites as in Figure 11.11; R is an additional restriction enzyme cleavage site. Taken from Li and Tucker (1993).

PRIMER JUMPING

This strategy has been discussed quite a bit. However, there are as yet no reported examples of its implementation. The basic notion is outlined in Figure 11.13. It is similar in many ways to delta subcloning, but it differs in a number of significant features. PCR is used, rather than subcloning. A very specific set of restriction enzymes is used: one rare cutter, which can have any cleavage pattern, and an additional pair of restriction enzymes consisting of an eight-base cutter and a four- or six-base cutter; the members of this pair have to produce the same set of complementary single-stranded ends. Examples are Not I (GC/GGCCGC) and Sse8387 I (CCTGCA/GG) for the eight-base cutters, and Eae I (Y/GGCCR) and Nsi I (ATGCA/T) or Pst I (CTGCA/G), respectively, as more frequent cutters. In principle, the approach shown in Figure 11.13 ought to be applicable to much larger DNA than delta subcloning, based on the past success at making reasonably large jumping libraries (Chapter 8).
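The requirement that the paired enzymes leave the same single-stranded ends can be checked directly from the recognition sequences quoted above. The following is a minimal sketch, assuming each site is palindromic so that the cut on the bottom strand mirrors the cut on the top strand; the function name is hypothetical, and the slash notation is the one used in the text.

```python
from typing import Tuple


def overhang(site_with_cut: str) -> Tuple[str, str]:
    """Return (overhang type, overhang bases) for a palindromic site.

    The site is written 5'->3' with '/' marking the top-strand cut,
    for example 'GC/GGCCGC'. The bottom-strand cut is assumed to sit
    at the mirror-image position.
    """
    cut = site_with_cut.index("/")
    site = site_with_cut.replace("/", "")
    mirror = len(site) - cut
    if cut == mirror:
        return ("blunt", "")
    lo, hi = sorted((cut, mirror))
    # A cut left of center leaves a 5' overhang, right of center a 3' one.
    kind = "5'" if cut < mirror else "3'"
    return (kind, site[lo:hi])


enzymes = {
    "Not I": "GC/GGCCGC",
    "Eae I": "Y/GGCCR",
    "Sse8387 I": "CCTGCA/GG",
    "Nsi I": "ATGCA/T",
    "Pst I": "CTGCA/G",
}
for name, site in enzymes.items():
    print(name, overhang(site))
```

With this model Not I and Eae I both give a 5′ GGCC extension, while Sse8387 I, Nsi I, and Pst I all give a 3′ TGCA extension, which is the compatibility the scheme relies on.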

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

For primer jumping the insert is cloned next to a vector fragment containing any desired infrequent cleavage site between two known unique DNA sequences, shown as a and b in Figure 11.13. The vector fragment is constructed so that it contains no cleavage sites for the 4-base or 6-base cutting enzyme between the unique sequences and the insert, but it does contain a second infrequent cleavage site, N in Figure 11.13, as close to the upstream unique sequence as possible. Ideally the vector will be one arm of a YAC, and the other arm could be treated in a similar fashion. The clone is cut at the distal rare cleavage site to completion and then partially digested with the frequently cutting enzyme. The resulting fragments are separated by length, and the separation gel is sliced into pieces. The resulting DNA fragments are diluted to very low concentration and ligated. This will produce DNA circles in which the vector sequence, including segments a and b in Figure 11.13, is now located next to each site in the partial digest which was cleaved by the frequently cleaving enzyme. Thus the known sequence can now be used for starting a primer walk. The approximate position of the walk within the large clone will be known from the size of the fragment. With the 800-bp to 1-kb sequence reads now being achieved under good circumstances, it is conceivable that one would be able to sequence from the cleaved site up to the next equivalent restriction site without the need to make additional primers, in most cases. If this were the case, one could do a directed walking strategy on a large DNA target using only two primers, one for each vector arm.

Figure 11.13 Primer jumping, an untested but potentially attractive method for directed DNA sequencing. Restriction enzyme site N must produce an end that is complementary to the end generated by the four-base cutter. Sites a and b can be anything so long as they are not present in the target, but there must be a site for infrequent cleavage between them.
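The expectation that a single read will usually reach the next equivalent restriction site can be estimated with a simple random-sequence model. This is only a sketch: it assumes equal base frequencies and independent positions, which real genomes do not obey, and the function name is hypothetical.

```python
def prob_next_site_within(read_length_bp: int, site_length_bp: int) -> float:
    """Chance that the next site of a given length lies within one read.

    Assumes a random sequence with equal base frequencies, so a specific
    site of n bases starts at any position with probability 0.25 ** n.
    """
    p_site = 0.25 ** site_length_bp
    return 1.0 - (1.0 - p_site) ** read_length_bp


for site_length in (4, 6):
    for read_length in (800, 1000):
        p = prob_next_site_within(read_length, site_length)
        print(f"{site_length}-base cutter, {read_length}-base read: {p:.3f}")
```

Under this model the "in most cases" expectation holds comfortably for a four-base cutter but not for a six-base cutter, whose sites are on average 4096 bp apart.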

 

 

 

 

 

 

 

 

 

 

 

In a similar vein to primer jumping, if single-sided PCR ever works well enough (Chapter 4), these methods could be used for directed cycle sequencing by the approaches just described.


PRIMER MULTIPLEXING

This is a potentially very powerful strategy for large-scale DNA sequencing. It was developed by George Church and has been elaborated, independently, by Raymond Gesteland. A number of features set multiplexing apart from many other approaches. A major peculiarity of the method is that it does not scale down efficiently, so it is best suited for fairly massive projects, typically several hundred kb of DNA sequence or more.

The basic scheme for primer multiplexing is shown in Figure 11.14. In the particular case shown, a multiplexing of 40 is used. Forty different vectors are constructed; each has a unique 20-base sequence on each side of the cloning site. The DNA target of interest is shotgun cloned, separately, into all 40 vectors. This produces 40 different libraries. Pools are constructed by selecting one clone from each of the libraries and mixing them. These 40-clone pools are the samples on which DNA sequencing is performed. The pools are subjected to standard DNA sequencing chemistry to generate a mixture of 40 different ladders, but no radioactivity or other label is introduced into the DNA at this stage. The mixture is fractionated by polyacrylamide electrophoresis and blotted onto a membrane. A particularly convenient way to do this is by the bottom wiper described in Chapter 10. The blotted DNA is crosslinked onto the filter by UV irradiation to attach it very stably. This is a key step, since the filters will be reused many times.

To read the DNA sequence from each pool of clones, the filter is hybridized with a probe corresponding to one of the 40 unique 20-base sequences. By this indirect end-labeling method (introduced in Chapter 8), only one of the 40 clones in the sequence ladder is visible. The probe is removed from the filter by washing, and then the hybridization and washing are repeated successively for each of the other unique sequence primers. By this multiplexing approach, most stages of the project are streamlined by a factor of 40. The exceptions are the hybridization, autoradiography or other color detection, and washing. Thus great care must be taken to automate these steps in the most efficient way.

Figure 11.14 Basic scheme used for primer multiplexing: a, b, c, and so on, represent unique vector primer sequences.

Recently a fairly successful demonstration of the efficiency that can be achieved with primer multiplexing, combined with transposon jumping, was reported by Robert Weiss and Raymond Gesteland.
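The bookkeeping behind primer multiplexing can be sketched in a few lines. The code below is an illustration only: the tag and clone names are hypothetical, and the model simply records which clone each pool contributes for each vector tag, so that probing a filter with one tag picks out one clone's ladder.

```python
import random
from typing import Dict, List


def build_pools(libraries: Dict[str, List[str]], seed: int = 0) -> List[Dict[str, str]]:
    """Form pools containing exactly one clone from each tagged library."""
    rng = random.Random(seed)
    # Shuffle each library so that clones are picked at random.
    shuffled = {tag: rng.sample(clones, len(clones))
                for tag, clones in libraries.items()}
    n_pools = min(len(clones) for clones in shuffled.values())
    return [{tag: shuffled[tag][i] for tag in shuffled} for i in range(n_pools)]


def probe_filter(pool: Dict[str, str], probe_tag: str) -> str:
    """Probing with one tag reveals the ladder of a single clone in the pool."""
    return pool[probe_tag]


def pooled_samples(n_clones: int, multiplex: int = 40) -> int:
    """Number of pooled samples needed to process n_clones (ceiling division)."""
    return -(-n_clones // multiplex)


# Hypothetical example: 40 tagged libraries with 3 clones each.
libraries = {f"tag{t:02d}": [f"tag{t:02d}_clone{c}" for c in range(3)]
             for t in range(40)}
pools = build_pools(libraries)
print(len(pools), "pools of", len(pools[0]), "clones")
print(probe_filter(pools[0], "tag07"))
print(pooled_samples(4000), "pooled samples for 4000 clones")
```

The sketch makes the trade-off visible: the number of samples that go through sequencing chemistry, electrophoresis, and blotting falls by the multiplexing factor, while every filter still needs one hybridization and wash cycle per tag.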

 

 

 

MULTIPLEX GENOMIC WALKING

A different approach to multiplex sequencing has been suggested by Walter Gilbert. This is designed to be used for the sequencing of entire small genomes like E. coli where direct genomic DNA sequencing is feasible. The method is illustrated in Figure 11.15a. The great appeal of this method is that absolutely no cloning is required. The total genome is digested separately with a set of different restriction enzymes. The products of this digestion are loaded onto polyacrylamide gels in adjacent lanes and fractionated. A highly labeled probe with an arbitrary sequence is selected (with a length chosen to occur on average once per genome) and used to hybridize with a blot of the separated fragments (Fig. 11.15b). In most of the lanes this probe will give a readable sequence. Suppose that the probe lies 60 bp upstream from a given restriction site. The first 60 bases of sequence will be unreadable because data will extend in both directions (Fig. 11.15c). However, longer regions of the ladder will be interpretable, since they must lie in the direction away from the nearby restriction site. In general, one will expect to get a number of usable reads in both directions from the probe, just by the fortuitous occurrence of useful restriction sites. These reads are assembled into a segment of DNA sequence. Next, probes are designed from the most distal regions of the segment, and these are used to continue the genomic walk.

Figure 11.15 Multiplex genomic walking. (a) Basic outline of the experiment. (b) Restriction map in a typical region, and resulting segments of sequence, A, B, C, revealed by hybridization with one specific probe. (c) Sections of readable and unreadable sequence on a particular restriction fragment. The probe is located L bases from the end of the fragment.
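Two quantities in the description of multiplex genomic walking lend themselves to a quick calculation: the probe length needed for roughly one occurrence per genome, and the part of a ladder that is readable when the probe lies L bases from a fragment end. The sketch below is a simplified model; the function names, the genome size of about 4.6 Mb for E. coli, and the 400-base ladder limit are assumptions made for illustration.

```python
from typing import Optional, Tuple


def min_probe_length(genome_bp: int) -> int:
    """Shortest n with 4**n >= genome_bp.

    In a random sequence an n-mer is then expected at most about once
    per genome (one strand considered).
    """
    n = 1
    while 4 ** n < genome_bp:
        n += 1
    return n


def readable_window(distance_to_site_bp: int,
                    max_ladder_bp: int) -> Optional[Tuple[int, int]]:
    """Ladder positions that can be read unambiguously.

    Positions up to distance_to_site_bp are treated as unreadable because
    signal extends in both directions from the probe; longer fragments
    can only extend away from the nearby restriction site.
    """
    if distance_to_site_bp >= max_ladder_bp:
        return None
    return (distance_to_site_bp + 1, max_ladder_bp)


print(min_probe_length(4_600_000))  # assumed genome size for E. coli
print(readable_window(60, 400))     # probe 60 bp from the restriction site
```

The second function shows why a lane fails when the probe lies too far from the restriction site: the unreadable stretch then consumes the whole resolvable ladder.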

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

In principle, multiplex genomic walking is a very elegant and spartan approach to DNA sequencing. One has a choice at any time whether to use additional arbitrary probes, and so increase the number of parallel sequencing thrusts, or whether to focus on directed walking. Thus one has a method with some of the advantages of both random and directed strategies. A potential weakness is the relatively high fraction of failed lanes that will occur unless the probe has a single binding site in the genome. Another problem is the technical demands that genomic sequencing makes. It is also not obvious how easy this strategy will be to automate. It does work, but the overall efficiency needs to be established before the method can be compared quantitatively with others.

GLOBAL STRATEGIES

A basic issue that has confronted the human genome project since its conception is not how to sequence but what to sequence. From a purely biological standpoint, the most interesting sequencing targets are genes. The choice of genes depends on the sorts of biological questions one is interested in. An evolutionary biologist may want to sequence one homologous gene in a wide variety of organisms. Cell biologists or physiologists may want to focus on a set of functionally related genes or gene families within just a few organisms. However, from the point of view of whole genome studies, the purpose of sequencing is really to find genes and make them available for subsequent biological studies. This puts a very different tilt on the issues that affect the choice of sequencing targets.

For simple gene-rich organisms like bacteria and yeasts, there is little doubt that complete genomic sequencing is desired and worth doing even with existing DNA sequencing technology. Indeed sequencing projects have been completed on many bacteria including H. influenzae, Mycoplasma genitalium, Mycoplasma pneumoniae, Methanococcus jannaschii, Synechocystis strain PCC6803, and Escherichia coli, and one yeast, S. cerevisiae (see Chapter 15). Additional projects are well underway with a number of other microorganisms, including the bacterium Mycobacterium tuberculosis and the yeast S. pombe. E. coli is an obvious choice as the focus of much of our fundamental studies in prokaryotic molecular biology. Mycoplasmas represent the smallest known free-living genomes. Mycobacterium tuberculosis is important because of the current medical crisis with drug-resistant tuberculosis. The two yeasts account for most of our current knowledge and technical power in fungal genetics. They are also very different from each other, so much will be learned from comparisons between them. The real issue that will have to be faced in the future is at what stage in DNA sequencing technology it becomes desirable and affordable to sequence the genomes of many other simple organisms.


There are a number of more advanced organisms that appear to have relatively high coding percentages of DNA. These include a simple plant, Arabidopsis thaliana; a much more economically important plant, rice; the fruit fly, Drosophila melanogaster; and the nematode, Caenorhabditis elegans. There are strong arguments in favor of obtaining complete DNA sequences on these organisms rapidly. They all are systems where a great deal of past genetics has been done, and a great deal of ongoing interest in biological studies remains. Certain primitive fishes may also have small genomes, as does the puffer fish. Here the argument in favor of sequencing is that it will be relatively easy to find most of the genes. However, these organisms are currently pretty much in a biological vacuum.

For more complex, gene-dilute organisms, the selection of sequencing targets is, not surprisingly, also more complex. Here there is little debate that Homo sapiens and the mouse, Mus musculus, are the obvious first choices. It is much less clear what should come after this. Do we target other primates because they will be most useful in understanding the very large fraction of human genes that are believed to be central nervous system specific? Do we examine genomes of organisms that have long been the focus of physiological studies, like rats, dogs, and cats? Or do we aim for a much broader representation of evolutionary diversity? Alternatively, how important should the commercial value of potential genome targets be? Cows, horses, pine trees, maize, and salmon have a much more important economic role than Arabidopsis or C. elegans. These questions are interesting to ponder, but they really do not require answers at the present time. If sufficiently inexpensive DNA sequencing methods are developed in the future, we will want to sequence every genome of biological interest. For the present, technology pretty much limits us to a few choices.

With most complex organisms, only a few percent of the genome is known to be coding sequence. The function of the rest, which we earlier termed junk, is unknown today. With limited resources, and relatively slow sequencing technology, most involved groups are choosing to focus on selectively sequencing genes from human or other sources. There are two ways to go about this. One approach is to find a gene-rich region in a genome and sequence it completely. Regions that have been selected include the T-cell receptor loci, immunoglobulin gene families, and the major histocompatibility complex. All of these regions are of intense interest in understanding the function of the immune system. Another region of interest is the Huntington’s disease region because it is very gene rich, and in the process of finding the particular gene responsible for the disease a large set of cloned DNA samples from this region has become available.

An alternative to genomic sequencing in a gene-rich region is to sequence cDNAs, DNA copies of expressed mRNAs. These are relatively easy to produce, and many cDNA libraries are available. Each represents the pattern of gene expression of the particular tissue or sample from which the original mRNA was obtained. In sequencing a cDNA, one knows one is dealing with an expressed gene, therefore a functional gene. This is a considerable advantage over genomic sequencing where one has no knowledge a priori that a particular gene found at the DNA level is actually ever used by the organism. With cDNA sequencing, one is always examining genes or nearby flanking sequences. This is another great advantage over genomic sequencing where, even in the best of cases, most of the sequence will not be coding. However, there are some potential difficulties with projects to examine massive numbers of cDNA sequences, as we will demonstrate.


SEQUENCE-READY LIBRARIES

Today, the notion of sequencing an entire human chromosome from left to right telomere is being considered seriously at a number of Genome Centers. In some cases the plans are based on a preexisting minimum tiling set of clones. Here, as long as the set is complete and exists in a vector like a cosmid or a BAC that allows direct sequencing, the strategy is predetermined. The clones are selected and sequenced one by one by whatever method is deemed optimal at the time for 50- to 150-kb clones.

Suppose, however, that, with sequencing as the eventual goal, one wishes to create an optimal library to facilitate subsequent sequencing of any particular region deemed interesting. There are two basically similar strategies for achieving this objective. If a dense ordered library already exists in an appropriate vector, one can sequence the ends of all of the clones in a relatively easy and cost-effective manner. Since vector priming can be used, the goal is to read into the cloned insert as far as possible in a single pass of raw DNA sequencing. If this is done for all the clones, the result is a sampling of the genomic sequence (Smith et al., 1994). For example, suppose that the initial library is a 20-fold redundant set of 50-kb cosmids. A cosmid end on average would occur every 1.25 kb. A 700-base sequence read at each end would generate a total of 28 kb of sequence for every 50 kb of target. When realistic failure rates and some inevitable overlaps are considered, the result would still be roughly half the total sequence. This is sufficiently dense that almost any cDNA sequence from the region would be represented in some of the available genomic DNA sequence. Thus all sequenced cDNAs could be mapped by software sequence comparisons without the need for any additional experiments. The average spacing between sequenced genomic regions would be short enough so that PCR primers could be designed to close any of the gaps by cycle sequencing.
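The cosmid arithmetic can be reproduced with a short calculation. The sketch below uses a hypothetical helper function and takes one 50-kb stretch of target as the unit, which is an assumption about how the 28-kb total is meant to be read; it reports raw numbers only, before any allowance for failed reads or overlapping reads.

```python
from typing import Dict


def end_sampling(target_bp: int, insert_bp: int,
                 redundancy: float, read_bp: int) -> Dict[str, float]:
    """Raw statistics for sequencing both ends of every clone in a library."""
    n_clones = redundancy * target_bp / insert_bp
    n_ends = 2 * n_clones
    raw_bp = n_ends * read_bp
    return {
        "clones": n_clones,
        "end_spacing_bp": target_bp / n_ends,   # average distance between ends
        "raw_read_bp": raw_bp,                  # before failures and overlaps
        "raw_fraction": raw_bp / target_bp,
    }


# 20-fold redundant 50-kb cosmids, 700-base reads, per 50 kb of target.
cosmids = end_sampling(target_bp=50_000, insert_bp=50_000,
                       redundancy=20, read_bp=700)
print(cosmids)
```

The raw fraction of 0.56 is what shrinks to "roughly half" once failures and overlapping reads are taken into account.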

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

For many targets, however, there is no existing clone map. The effort to create one de novo is considerable, even by the enhanced methods described in Chapter 9. For this reason, as automated DNA sequencing becomes more and more efficient, strategies that avoid the construction of a map altogether become attractive. One recent proposal for such a scheme also relies on the sequencing of the ends of the clones (Venter et al., 1996). Consider, for example, an ordered tenfold redundant BAC library of the human genome. With 150-kb inserts, 200,000 BACs are required. If each of these is sequenced for 500 bp from both ends, the resulting data set will contain 400,000 sequence reads encompassing 200 Mb of DNA. On average, the density of DNA sequence is a 500-bp block every 7.5 kb. Once created, such a resource would serve two functions. Many cDNAs would still match up with a segment of BAC sequence, and they could serve to correlate the BAC library with other existing genome resources and information. The utility of the BACs in this regard could be improved if, for example, they were created so that their ends had a bias to occur in coding sequence. However, even in the absence of cDNA information, the BACs will serve as a starting point for the genomic sequencing of any region of interest. One could choose any BAC that corresponds to the region of interest and sequence it completely. Then, by inspection, the BACs in the library that overlapped least with the first sequenced BAC could be picked out and used for the next round of sequencing. The process would continue until the region of interest was completed. In this way the sequencing project itself would create the minimum tiling set of BACs needed for the region.
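Both the library arithmetic and the "least overlap" walk can be sketched briefly. The code below is an illustration under stated assumptions: a 3-Gb genome (implied by the numbers in the text), hypothetical BAC names, and known clone coordinates. In a real project the positions would not be known in advance; a BAC would be recognized as overlapping because one of its end sequences matches the BAC just sequenced.

```python
from typing import Dict, List, Tuple


def library_stats(genome_bp: int, insert_bp: int,
                  redundancy: int, read_bp: int) -> Dict[str, int]:
    """Numbers for an end-sequenced BAC library."""
    n_bacs = redundancy * genome_bp // insert_bp
    n_reads = 2 * n_bacs
    return {"bacs": n_bacs, "reads": n_reads,
            "read_bp_total": n_reads * read_bp,
            "block_spacing_bp": genome_bp // n_reads}


def tiling_path(bacs: Dict[str, Tuple[int, int]], seed: str,
                region_end: int) -> List[str]:
    """Walk rightward from a seed BAC, always taking the least-overlapping BAC."""
    path = [seed]
    cur_start, cur_end = bacs[seed]
    while cur_end < region_end:
        # Candidates start inside the current BAC and extend beyond its end.
        candidates = [(name, s, e) for name, (s, e) in bacs.items()
                      if cur_start < s <= cur_end and e > cur_end]
        if not candidates:
            break  # gap in coverage; the walk cannot continue
        # The largest start coordinate gives the smallest overlap.
        name, cur_start, cur_end = max(candidates, key=lambda c: c[1])
        path.append(name)
    return path


stats = library_stats(3_000_000_000, 150_000, 10, 500)
print(stats)

# Hypothetical clone coordinates in kb along a region of interest.
bacs = {"A": (0, 150), "B": (50, 200), "C": (140, 290),
        "D": (100, 250), "E": (280, 430), "F": (200, 350)}
path = tiling_path(bacs, seed="A", region_end=400)
print(path)
```

The greedy choice is only one reasonable reading of "overlapped least"; it yields a small tiling set here but makes no claim to being optimal for every clone layout.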

 

 


SEQUENCING cDNA LIBRARIES

Usually cDNA libraries are made by a scheme like that shown in Figure 11.16. To prepare high-quality cDNAs, it is important to start with a population of intact mRNAs. This is not always easy; mRNAs are very susceptible to cleavage by endogenous cellular ribonucleases, and some tissues or samples are very rich in these enzymes. Most eukaryotic mRNAs have several hundred bases of A at their 3′-end. This poly A tail can be used to capture these mRNAs and remove contaminating rRNA, tRNA, and other small cytoplasmic and nuclear RNAs. Unfortunately, one also loses that fraction of mRNAs that lack a poly A tail. An oligo-T primer can then be used with reverse transcriptase to make a DNA copy of the mRNA strand. Alternatively, random primers can be used to copy the mRNAs, or specific primers can be used if one is searching for a particular mRNA or class of mRNAs. There are two general methods to convert the resulting RNA-DNA duplexes into cDNAs. Left to their own devices, some reverse transcriptases will, once the RNA strand is displaced or degraded, continue synthesis, after making a hairpin, until they have copied the entire DNA strand of the duplex. As shown in Figure 11.16a, S1 nuclease can then be used to cleave the hairpin and generate a cloneable end. Unfortunately, the S1 nuclease treatment can also destroy some of the ends of the cDNA. An alternative procedure is to use RNase H to nick the RNA strand of the duplex. The resulting nicks can serve as primers for DNA polymerases like E. coli DNA polymerase I. This eventually leads to a complete DNA copy except for a few nicks which can be sealed by DNA lig-

Figure 11.16 Approaches to the construction of cDNA libraries: Use of S1 nuclease to generate clonable inserts.
