Добавил:

fench Опубликованный материал нарушает ваши авторские права? Сообщите нам.

Вуз:

Сумский государственный университет

Предмет:

Генетика

Файл:

Genomics- The Science and Technology Behind the Human Genome Project. Charles R. Cantor, Cassandra L / genomics11-15 / 11

.pdf

Скачиваний:

Добавлен:

17.08.2013

Размер:

277.66 Кб

Скачать

☆

<<< < Предыдущая 1 23 / 43 4 > Следующая >>>

			DEALING	WITH UNEVEN	cDNA DISTRIBUTION	381
ase. This procedure is generally preferred over the S1 nuclease method because it tends to
produce longer, more intact cDNAs.
An unfortunate fact about many cDNA clones is that they are biased toward the 3						-end
of the original message because of the poly A capture and the oligo-T priming used to
prepare them. The true 5		-end is often missing and needs to be found in other clones or in
the genome. Some attempts have been made to take advantage of the specialized cap
structure at the 5	-end of eukaryotic mRNAs to purify intact molecules. One possibility is
to try to produce high-afﬁnity monoclonal antibodies speciﬁc for this cap structure. More
effective has been the use of an enzyme called tobacco pyrophosphatase						. This cleaves off
the cap to leave an ordinary 5		-phosphate-terminated DNA strand that then can serve as a
substrate in a ligation reaction, which can be used to add a known sequence. This known
sequence will serve as the staging site for subsequent PCR ampliﬁcation. Several differ-
ent Japanese groups have recently perfected such strategies to the point where					5	-end-
containing cDNA libraries can now be made quite reliably.
DEALING WITH UNEVEN cDNA DISTRIBUTION
With relatively rare exceptions like rDNAs, genes in the genome are in approximately a
1:1 ratio. In contrast, the relative amount of mRNAs present in a typical cell extends over
a range of more than 10		5. This leads to very serious biases in most cDNA libraries. These
will tend to be overrepresented		with a relatively small numbers of different			high-fre-
quency clones. In addition existing cloning methods will tend to bias the libraries toward
short mRNAs. If one attempts to sequence cDNAs at random from a library, in most cases
the high copy number clones will be re-sequenced over and over again, while most rare
mRNAs will never be sampled. It is important to stress that the problems of random se-
lection and library biases seriously interfere with			genomic DNA	sequencing projects,
even though one is starting with an almost uniform sample of the genome. With cDNAs
these problems are much more serious and must be dealt with directly and forcefully.
One simple approach to systematic sequencing of cDNA libraries is shown in Figure
11.17. One starts with an arrayed library. A small number of clones, say 100, are selected
and sequenced. All of the	sequenced	clones are pooled,	labeled, and hybridized back to


Figure 11.17		Basic scheme for sequencing an arrayed cDNA library, and periodically screening
the	library to detect repeats of clones that have already been sequenced. The schematic array shown
has	only 56 clones; a real array would have tens to hundreds of times more.

382

STRATEGIES FOR LARGE-SCALE DNA SEQUENCING

the library. Any clones that are detected by this hybridization are not included in the next

set of 100 to be sequenced. By continuing in this way, most duplication can be avoided.

Unfortunately, there will also be a tendency

discard

cDNAs from

gene

families,

many of the members of these families will be underrepresented or missed. As an alterna-

tive to handling the clones as arrays, one can carry out

a solution

hybridization

of the

entire cDNA library with an excess of sequenced clones, discard all the samples that

hybridize, and regrow the remainder. This effectively replaces a screen by a physical

selection process.

A more complex approach to compensating for uneven cDNA distribution is to try to

equalize or normalize the library. The distribution of mRNAs in a typical cell is shown in

Table 11.1. Roughly speaking there are three classes of messages: a few very common

species, then approximately equal total amounts of species

times

rare,

and

species another factor of 20 rarer still. The goal is to try to even out these differences. The

approach used is to anneal the library to itself and remove all the double-stranded species

that are formed. We will do this by allowing the reannealing to occur at a very high

0 t:

Typically

0 t 250 is

used. From

Chapter 3,

can

write

for

the fraction

single

strand remaining in a hybridization:

C 0 t1/2

1 n

t k /2N

1 C

t/C

t1/2 C

t1/2

where all the quantities in this equation have

been deﬁned in Chapter

3. When

C 0 t

1/2

, we can approximate this result as

C 0 t1/2

Note

that

the

C 0 t1/2value for a

given

sequence

depends

the

ratio of genome

size,

and the number of times the sequence is represented,

For a cDNA library,

is the total

complexity of the DNA sequences represented in the library, and

n is the number of times

a given sequence is represented. Thus, for highly frequent cDNAs,

/n will be

small so

that

the

C 0 t1/2will be small, and these species will

renature relatively more rapidly. Note

that the amount of a particular cDNA remaining after extensive annealing will be propor-

tional to its original abundance

and to

its

hybridization

rate,

which

will

scale as

/n.

Thus, at very long times in the reaction, a relatively even distribution of cDNAs should be

produced. We can evaluate the expected results for an attempt

normalize

the typical

cell mRNAs shown in Table 11.1. This is given in Table 11.2.

TABLE 11.1	Distribution of mRNA in a Typical Cell
Species	Percent	Number of	Relative	C	0	t1/2
		Species	Frequency		0
		Species	Frequency

Common	10	10	330	0.08
Medium	45	1000	15	1.7
Rare	45	15,000	1	25.

					DEALING WITH UNEVEN cDNA	DISTRIBUTION	383
TABLE 11.2	Effect of Self-Annealing a cDNA Library

Class	f at C		0	t 250	Initial Frequency	Equalized Frequency
	s		0
Common	3	10 4			330	9.9	10 2
Medium	7	10 3			15	1.0	10 1
Rare	9	10 2			1	9.0	10 2

The predictions in Table 11.2 look very encouraging. However, a serious potential problem is that one has to discard most of the library in order to achieve this result. PCR or efﬁcient recloning must be used to recover the cDNA clones which have not self-annealed.

An actual scheme for efﬁcient cDNA normalization is shown in Figure 11.18. This has been developed by Bento Soares, Argiris Efstratiadis, and their collaborators. It is designed to avoid the preferential loss of long cDNA clones during the self-annealing, and

also to avoid the loss of cDNAs from closely related gene families. Long clones would be

Figure 11.18 A relatively elaborate scheme for cDNA normalization that attempts to prevent a bias against shorter cDNAs and the loss of cDNAs from gene families. Adapted from Soares et al.

(1994).

384			STRATEGIES FOR LARGE-SCALE DNA SEQUENCING
	TABLE	11.3	cDNA Library Normalization

	Probes			Original Library	Normalized (HAP-FT)	HAP-Bound

	Human C		0 t 1 DNA	10%	2%	2.6%
	Elongation factor 1a			4.6%	0.04%	3.7%
	a-Tubulin			3.7%–4.4%	0.045%	6%
	b-Tubulin			2.9%	0.4%	0.85%
	Mitochondrial 16S rRNA			1.3%	1%
	Myelin basic protein			1%	0.09%
	g-Actin			0.35%	0.1%	1.3%
	Hsp 89			0.4%	0.05%	0.14%
	CH13-cDNA#8			0.009%	0.035%

	Source:	Data adapted from Soares et al. (1994).

lost if entire cDNA sequences were used

for

hybridization

because,

described

Chapter 3, their rate constants for duplex formation (

k ’s) are larger and because there are

more places to nucleate potential duplex. The trick used in the scheme of Figure 11.18 is

to start with single-stranded cDNA clones and produce

a short

duplex

the 3

-end of

each cDNA clone by primer extension in the presence

chain terminators. By

focusing

on this region, one

will ensure that the new

strands

synthesized

preferentially come from

3 noncoding ﬂanking regions where even closely related genes have signiﬁcant diver-

gence, since the sequences are not translated and presumably

have little

function. Any

cDNAs that have not successfully templated the synthesis of a short duplex are discarded

by chromatography on hydroxyapatite, which speciﬁcally binds only duplexes, under the

conditions used. These duplex-containing clones are then eluted, melted, and allowed to

self-anneal to high

C 0 t. Now any

clones

with duplexes are

removed,

and the

clones that

have remained as single strands represent the normalized library. These are then ampliﬁed

and sequenced.

Some actual results using the scheme of Figure 11.18 are given in Table 11.3. It is ap-

parent that the equalization is far from perfect. However, it represents a major improve-

ment over nonnormalized libraries, and materials made in this

way

are

currently being

used extensively for cDNA sequencing. Two additional schemes for cDNA normalization

are described in Box 11.1. It is not clear at the present time just which schemes will ulti-

mately be widely adopted.

LARGE-SCALE cDNA SEQUENCING

In the past three years at least ﬁve separate efforts have

been made to collect massive

amounts of cDNA sequence. One of these is a collaboration between the Institute for

Genome Research (TIGR) and Human Genome Sciences, Inc. At least initially, this effort

took an anatomical approach. Libraries of cDNAs from as many different major tissues as

possible were collected, and large numbers of clones from each of these were sequenced.

The second approach was orchestrated by Incyte Pharmaceuticals, Inc. Here the emphasis

was on cell physiology. Sets of cDNA libraries were collected

from

pairs

cells

known, related physiological states, such

activated or

unactivated

macrophages.

ﬁxed number of cDNAs, 5000 in the earliest studies, was randomly selected

for

each of

the cell pairs and sequenced. In this way information was obtained

about the frequencies

of common cDNAs in

addition to the sequence

information from all

classes

cDNAs.

LARGE-SCALE cDNA SEQUENCING

385

BOX 11.1

ALTERNATE SCHEMES FOR NORMALIZATION OF cDNA LIBRARIES

Two different schemes for the production of normalized cDNA libraries have been described. The ﬁrst, proposed by Sherman Weissman and coworkers, is shown schemati-

cally in Figure 11.19. First PCR is used to amplify cloned cDNAs. Then, as in the Soares and Efstratiadis scheme described in the text, hydroxylapatite fractionation is

used to deplete a reaction mixture of double-stranded products. Next a nested set of PCR primers is used to amplify the single-stranded material that survives hydroxyapatite. Finally this material is cloned to make the normalized library. A survey of typical results is given in Table 11.4.

The scheme developed by Michio Oishi is quite different (Fig. 11.20). Here cDNA immobilized on microbeads is annealed to a vast excess of mRNA from the same

source. Under these conditions the kinetics of hybridization become pseudo–ﬁrst order as described in Chapter 3. The highly overrepresented components in the mRNA will actually deplete the corresponding cDNAs below the level of normalization. The resulting cDNA library will be enriched for rare cDNA sequences. A survey of typical results is given in Table 11.5.

Figure 11.19

A relatively simple scheme for cDNA normalization. Adapted from

Sankhavaram et al. (1991).

(continued)

386 STRATEGIES FOR LARGE-SCALE DNA SEQUENCING

BOX 11.1	(Continued)
TABLE 11.4	Effect of Normalization on a cDNA Library

		Number of Clones Identiﬁed per 100,000 Plaques

Probe	STH	NSTH I	NSTH II

R-DNA	30,000	94	12
Blur-8	800	450	360
-actin	110	37	NT
HLA-H	104	80	10
CD4	28	37	12
CD8	15	55	12
Oct-1	9	NT	8
-globin	7	NT	10
c-myc	5	NT	11
TCR	5	NT	8
TNF-	5	NT	6
-fodrin	3	NT	9

Source:	From Patanjali et al. (1991).
Note: cDNAs present at various levels of abundance in STH library become almost identically abundant in
the normalized (NSTH) libraries. Increased reassociation times, as indicated by the increased			C 0 t value, ren-
der better normalized libraries. NT, not tested. For NSTH I the		C 0 t value was 41.7 mol-s /liter, and for NSTH
II the C 0 t value was 59.0 mol-s /liter.

Figure 11.20 A scheme for the preferential enrichment of rare cDNAs. Adapted from Sasaki et al. (1994a).

(continued)

X 11.1

(Continued)

T ABLE	11.5	Change of the Pr	oportion of cDN		A Clones Bef	or e and After Self-Hybridization
Probe			Input	a	Before	Percentage	After	Percentage	After /Before

Rabbit -globin		1	111/10,500	1.067	5/55,000
X174 Hae	III 0.6 kb	0.01	2 /30,000	0.0067	8/35,000
X174 Hae	III 0.9 kb	0.01	1/30,000	0.0033	6/30,000
neo r		0.0001	0/250,000	0.0004	2 /250,000
-actin			54 /10,000	0.54	2 /10,000
IL-4			0/320,000	0.0003	3/320,000
IL-2			0/320,000	0.0003	0/320,000

0.009	0.086
0.023	3.43
0.02	6
0.0008	2
0.02	0.037
0.0009	3
0.0003	—


Sour ce:	Adapted from Sasaki et al. (1994).
a Percent of total RN		A (w/ w).
b The positi	ve clones were conﬁrmed by sequencing approximately 300 bp of the inserts.

387

388 STRATEGIES FOR LARGE-SCALE DNA SEQUENCING

The frequency information, when pairs of cells are compared, is often quite interesting,

and it suggests potential functional roles for				a number of newly discovered genes in the
libraries. Incyte has	termed	such	comparisons transcript imaging. A third large-scale
cDNA sequencing effort	is the	Image	consortium	involving several academic or govern-

ment laboratories and Merck, Inc. Here normalized libraries are serving as the source of

clones for sequencing, and the goal is to collect	at	least one	representative cDNA se-
quence from all human genes.
At present, the sequencing of human cDNAs is being carried out in large laboratory ef-
forts like the three just described as well as many	smaller, more focused efforts. Within
each laboratory the amount of duplication seen thus	far	has been	relatively small. Thus

the early course of the cDNA sequencing strategy appears to be very effective. At what point it will peter out into a morass of duplicate clones is unknown. It is also really unclear what fraction of the total amount of genes will actually be found ﬁrst through their cDNAs. The tissues used for the majority of these studies are those where large numbers

of different genes are expected to be active. These include early embryos, hytidaform moles, which are differentiated but disordered tumors with many different tissue types, liver, and a number of parts of the brain. Whether many specialized tissues will have to be looked at to get genes expressed only in these tissues, or whether there is a broad enough low-level synthesis of almost any mRNA in one or more of the common tissues to let all genes be found there, is an issue that has not yet been answered. One way to try to extend the cDNA approach to ﬁnd all of the human genes is described in Box 11.2

BOX 11.2

PREPARATION AND USE OF hncDNAs

A major purpose of making ordered libraries is to assist the ﬁnding and mapping of genes. Eugene Sverdlov and his colleagues in Moscow have developed an efﬁcient procedure for preparing chromosome-speciﬁc hncDNA libraries. Their method is an elaboration of the scheme originally described by Corbo et al. (1990). The procedureis

outlined in Figure 11.21. It uses an initial Alu-primed		PCR	reaction to make an
hncDNA copy of the hnRNA produced in a hybrid cell containing	just	the chromo-
some of interest. (See Chapter 14 for details about the Alu	repeat and Alu-speciﬁc
PCR primers.) The resulting DNA is equipped with an oligo-G		tail, and then a ﬁrst
round PCR is carried out using an oligo-C containing primer and an Alu primer. Then
a second round of PCR is done with a nested Alu primer. The PCR primers are also
designed so that the ﬁrst round introduces one restriction site and the second round an-
other. The resulting products are then directionally cloned into a vector requiring both
restriction enzyme cleavage sites. In studies to date, Sverdlov and his coworkers have
found that this scheme produces a diverse set of highly enriched human cDNAs. Since
these come from Alu’s in hnRNA, they will contain introns, but they can be used to lo-
cate genes on the chromosome equally well if not better than conventional cDNAs.
Note that the hncDNA clones as produced by the Sverdlov method actually contain
substantial amounts of intronic regions. This means that they	will	be more	effective

than the ordinary cDNAs in dealing with gene families and in avoiding cross-hy- bridization with conserved exonic sequences in rodent-human cell hybrids.

(continued)

WHAT IS MEANT BY A COMPLETE GENOME SEQUENCE?

389

BOX 11.2

(Continued)

Figure 11.21	Steps involved in making an hncDNA library. Interspersed repeat elements in
hnRNA are represented in the upper line by solid boxes		(R ). Arrows	indicate	primers. Vertical
lines crossing the arrows symbolise the primers with sites for		Eco	R I	and	Bam	H I restriction
endonucleases (	E ) or ( B ). EC is 5 GAGAATTC(C)203	. The open boxes with		similar	symbols

represent sequences corresponding to primers that are included in PCR products P-1 and P-2. Provided by Eugene Sverdlov.

WHAT IS MEANT BY A COMPLETE GENOME SEQUENCE?

If the strictest deﬁnition is used for a complete genome sequence, namely every base on every chromosome in a cell has been identiﬁed, then it is probably safe to say that we will never accomplish this. This is not to say that the task couldn’t be accomplished in principle; it could be, but for several reasons it is a foolish task, at least for the human genome. The human genome is quite variable. This will be discussed in more detail later. Sufﬁce it to say here that there are millions of differences in DNA sequence between the set of two homologous chromosomes in a diploid cell. Unless one could separate these into sepa-

rate, cloned libraries, inevitable confusion will develop as to which homolog one is on. Hybrid cells make these separations for us, and libraries made from chromosomes of hy-

390 STRATEGIES FOR LARGE-SCALE DNA SEQUENCING

brid cells are major candidates for eventual large-scale sequencing. However, because of the history behind the construction of such hybrids, we rarely have separate clones of two homologous chromosomes from a single individual. Even more important, most different

single chromosome hybrid cell lines have			been made from different individuals. So the
real answer to the often asked question “Who will be sequenced in the human genome
project” is that the sequence will inevitably represent a mosaic of many individuals. This
is probably quite appropriate, given the global nature and implications of the project.
There are other bars to total genome sequencing, or even total sequencing of a given
chromosome. We have indicated many times already that closure in mapping is a difﬁcult
task; closure in large-scale sequencing projects will also be extremely difﬁcult. For what-
ever reason, there are bound to be a few regions in any chromosome that cannot be cloned
by any of	our existing methods, or that may not be approachable even by PCR or ge-
nomic sequencing. Sequences with very peculiar secondary structures or sequences toxic
to the enzymes or cells we must rely on			could lead to this kind of problem. The issue of
how much effort should be devoted to a			few missing stretches has not yet been forced
upon us, but placed in any kind of reasonable perspective, it cannot have high priority rel-
ative to more productive use of large-scale sequencing facilities.
Finally, some regions of chromosomes are either very variable or dull, at least at the
level of ﬁne details. Examples are long simple sequence or tandemly repeating sequence
repeats. Human centromeres appear to have millions of base pairs of such repeats. Other
heterochromatic regions are occasionally seen on certain chromosomes. Some of these re-
gions show quite signiﬁcant size variation within the human population. For example, a
case is known of		an apparently healthy individual with one copy of chromosome 21 that
is 50% longer than average. Surely we will not select these extra long variants for initial
mapping or sequencing projects. However, the key point is that extensive, large-scale se-
quencing of simple repeats does not seem to be justiﬁed at the present time by any hints
that this large amount of sequencing data will be interesting or interpretable. Furthermore
our current methods are actually incapable of dealing with such regions of the genome.
Thus we will almost certainly have to claim completeness, missing the centromeres and
certain other unmanageable genome regions.
SEQUENCING	THE	FIFTH BASE
When we have sequenced all of the cloned DNAs from each human single chromosome
library, we will not have the complete			DNA sequence of an individual, for the reasons
cited above. Some of the troublesome regions are almost certainly not in our libraries or,
if they are present, they probably represent badly rearranged remnants of what was actu-
ally in the genome. If we ever want to			look at such sequences in their native state, we
may have to sequence them directly from			the genome. For the reasons cited above, this
may not be a terribly interesting or useful thing to do. However, there is a tremendously
important additional reason to perfect methods for direct genomic mammalian sequenc-
ing. This is to look at the ﬁfth base,			m C, which is lost in all common current cloning sys-

tems. It is also lost in PCR ampliﬁcation. Therefore, to ﬁnd the location of malian or other higher eukaryotic DNA sequences, it is necessary to immortalize the positions of these residues before any ampliﬁcation.

PCR can be used to determine DNA sequence directly from genomic samples with mammalian complexity by a ligation technique that is shown in Figure 11.22

mC in mam-

a. The ge-

<<< < Предыдущая 1 23 / 43 4 > Следующая >>>

Соседние файлы в папке genomics11-15

#
17.08.2013277.66 Кб4711.pdf
#
17.08.2013510.17 Кб4612.pdf
#
17.08.2013311.59 Кб4613.pdf
#
17.08.2013577.75 Кб4514.pdf
#
17.08.2013499.07 Кб4715.pdf
#
17.08.201326.85 Кб46appendix databases.pdf