Genomics: The Science and Technology Behind the Human Genome Project.
Charles R. Cantor, Cassandra L. Smith
Copyright © 1999 John Wiley & Sons, Inc.
ISBNs: 0-471-59908-5 (Hardback); 0-471-22056-6 (Electronic)

8 Physical Mapping
WHY HIGH-RESOLUTION PHYSICAL MAPS ARE NEEDED
Physical maps are needed because ordinary human genetic maps are not detailed enough to allow the DNA that corresponds to particular genes to be isolated efficiently. Physical maps are also needed as the source of the DNA samples that serve as the actual substrate for large-scale DNA sequencing projects. Genetic linkage mapping provides a set of ordered markers. In experimental organisms this set can be almost as dense as one wishes. In humans one is much more limited because of the inability to control breeding and produce large numbers of offspring. Distances that emerge from human genetic mapping efforts are imprecise because meiotic recombination events are unevenly distributed across the genome. Cytogenetic mapping was, until recently, low resolution; it is improving now, and if it were simpler to automate, it could well supplant most other methods. Unfortunately, current conventional approaches to cytogenetic mapping seem difficult to automate, and they are slower than many other approaches that do not involve image analysis.
It is worth noting here that in some other organisms, cytogenetics is more powerful than in the human. For example, in Drosophila the presence of polytene salivary gland chromosomes provides a tool of remarkable power and simplicity. The greater extension and higher-resolution imaging of polytene chromosomes allow bands to be seen at a typical resolution of 50 kb (Fig. 8.1). This is more than 20 times the resolution of the best human metaphase FISH. Furthermore the large number of DNA copies in the polytene salivary chromosomes means that microdissection and cloning are much more powerful here than they are in the human. It would be nice if there were a convenient way to place large continuous segments of human DNA into Drosophila and then use the power of the cytogenetics in this simple organism to map the human DNA inserts.
Radiation hybrids offer, in principle, a way to measure distances between markers accurately. However, in practice, the relative distances in radiation hybrid maps appear to be distorted (Chapter 7). In addition there are a considerable number of unknown features of these cells; for example, the types of unseen DNA rearrangements that may be present need to be characterized. Thus it is not yet clear that the use of radiation hybrids alone can produce a complete, accurate map or yield a set of DNA samples worth subcloning and characterizing at higher resolution. Instead, other methods must currently be used to accomplish the three major goals in physical mapping:
1. Provide an ordered set of all of the DNA of a chromosome (or genome).
2. Provide accurate distances between a dense set of DNA markers.
3. Provide a set of DNA samples from which direct DNA sequencing of the chromosome or genome is possible. This is sometimes called a sequence-ready map.
Figure 8.1 Banding seen in polytene Drosophila salivary gland chromosomes. Shown is part of chromosome 3, to which a radiolabeled probe for the gene for heat-shock protein hsp70 has been hybridized. A strong and a weak site are seen. (Figure kindly provided by Mary Lou Pardue.)
Most workers in the field focus their efforts on particular chromosomes or chromosome regions. Usually a variety of different methods are brought to bear on the problems of isolating clones and probes and using these to order each other and ultimately produce finished maps. However, some research efforts have produced complete genomic maps with a restricted number of approaches. The resulting samples are then turned over to communities interested in particular regions or chromosomes, which finish high-resolution maps of the regions of interest.
There are basically two kinds of physical maps commonly used in genome studies: restriction maps and ordered clone banks or ordered libraries. We will discuss the basic methodologies used in both approaches separately, and then, in Chapter 9, we will show how the two methods can be merged in more powerful second-generation strategies. First we outline the relative advantages and disadvantages of each type of map.
RESTRICTION MAPS
A typical restriction map is shown in Figure 8.2. It consists of an ordered set of DNA fragments that can be generated from a chromosome by cleavage with restriction enzymes individually or in pairs. Distances along the map are known as precisely as the lengths of the DNA fragments generated by the enzymes can be measured. In practice, the lengths are measured by electrophoresis in virtually all currently used methods; they are accurate to a single base pair for DNAs up to around 1 kb in size. Lengths can be measured with better than 1% accuracy for fragments up to 10 kb, and to within a few percent for fragments up to 1 Mb in size. Above this length, measurements today are still fairly qualitative, and it is always best to try to subdivide a target into pieces less than 1 Mb before any quantitative claims are made about its true total size.
Figure 8.2 Typical section of a restriction map generated by digestion of genomic or cloned DNA with two enzymes with different recognition sites, A and N.
In an ideal restriction map each DNA fragment is pinned to markers on other maps. Note that if this is done with randomly chosen probes, the locations of these probes within each DNA fragment are generally unknown (Fig. 8.3). Probes that correspond to the ends of the DNA fragments are more useful, when they are available, because their position on the restriction map is known precisely. Originally probes consisted of unsequenced DNA segments. However, the power of PCR has increasingly favored the use of sequenced DNA segments.
A major advantage of a restriction map is that accurate lengths are known between sets of reference points, even at very early stages in the construction of the map. A second advantage is that most restriction mapping can be carried out using a top-down strategy that preserves an overview of the target and that reaches a nearly complete map relatively quickly. A third advantage of restriction mapping is that one is working with genomic DNA fragments rather than cloned DNA. Thus all of the potential artifacts that can arise from cloning procedures are avoided. Filling in the last few small pieces is always a chore in restriction mapping, but the overall map is a useful tool long before this is accomplished, and experience has shown that restriction maps can be accurately and completely constructed in reasonably short time periods.
In top-down mapping one successively divides a chromosome target into finer regions and orders these (Fig. 8.4). Usually a chromosome is selected by choosing a hybrid cell in which it is the only material of interest. There have been some concerns about the use of
hybrid cells as the source of DNA for mapping projects. In a typical hybrid cell there is no compelling reason for the structure of most of the human DNA to remain intact. The biological selection that is used to retain the human chromosome is actually applicable to
only a single gene on it. However, the available results, at least for chromosome 21, indicate that there are no significant differences in the order of DNA markers in a selected set
of hybrid and human cell lines (see Fig. 8.49). In a way this is not surprising; even in a human cell line most of the human genome is silent, and if loss or rearrangement of DNA
were facile under these circumstances, it should have been observed. Of course, for model organisms with small genomes, there is no need to resort to a hybrid cell at all; their genomes can be studied intact, or the chromosomes can be purified in bulk by PFG.
In a typical restriction mapping effort, any preexisting genetic map information can be used as a framework for constructing the physical map. Alternatively, the chromosome of interest can be divided into regions by cytogenetic methods or low-resolution FISH.
Figure 8.3 Ambiguity in the location of a hybridization probe on a DNA fragment.
Figure 8.4 Schematic illustration of methods used in physical mapping. (a) Top-down strategy used in restriction mapping. (b) Bottom-up strategy used in making an ordered library.
Large DNA fragments are produced by cutting the chromosome with restriction enzymes with very rare recognition sites. The fragments are separated by size and assigned to regions by hybridization with genetically or cytogenetically mapped DNA probes. Then the fragments are assembled into contiguous blocks, by methods that will be described later in this chapter. The result at this point is called a macrorestriction map. The fragments may average 1 Mb in size. For a simple genome this means that only 25 fragments will have to be ordered. This is relatively straightforward. For an intact human genome of about 3000 Mb, the corresponding number is 3000 fragments. This is an unthinkable task unless the fragments are first sorted into individual chromosomes.
If a finer map is desired, it can be constructed by taking the ordered fragments one at a time and dissecting these with more frequently cutting restriction nucleases. An advantage of this reductionist mapping approach is that the finer maps can be made only in those regions where there is sufficient interest to justify this much more arduous task.
The major disadvantage of most restriction mapping efforts is that they do not produce the DNA in a convenient, immortal form where it can be distributed or sequenced by available methods. One could try to clone the large DNA fragments that compose the macrorestriction map, and there has been some progress in developing the vectors and techniques needed to do this. One could also use PCR to generate segments of these large fragments (see Chapter 14). For a small genome, most of the macrorestriction fragments it contains can usually be separated and purified by a single PFG fractionation. An example is shown in Figure 8.5. In cases like this, one really does possess the genome, but it is not in a form where it is easy to handle by most existing techniques. For a large genome, PFG does not have sufficient resolution to resolve individual macrorestriction fragments. If one starts instead with a hybrid cell, containing only a single human chromosome or chromosome fragment, most of the human macrorestriction fragments will be separable from one another. But each will still be contaminated by many other background fragments from the rodent host.
Figure 8.5 Example of the fractionation of a restriction enzyme digest of an entire small genome by PFG. Above: Not I digest of the 4.6 Mb E. coli genome shown in lane 5; Sfi I digest in lane 6. Other lanes show enzymes that cut too frequently to be useful for mapping. (Adapted from Smith et al., 1987.) Left: Structure of the ethidium cation used to stain the DNA fragments.
ORDERED LIBRARIES
Most genomic libraries are made by partial digestion with a relatively frequently cutting restriction enzyme, size selection of the fragments to provide a fairly uniform set of DNA inserts, and then cloning these into a vector appropriate for the size range of interest. Because of the method by which they were produced, the cloned fragments are a nearly random set of DNA pieces. Within the library, a given small DNA region will be present on many different clones. These extend to varying degrees on both sides of the particular region (Fig. 8.6). Because the clones contain overlapping regions of the genome, it is possible to detect these overlaps by various fingerprinting methods that examine patterns of sequence on particular clones. The random nature of the cloned fragments means that many more clones exist than the minimum set necessary to cover the genome. In practice, the redundancy of the library is usually set at five- to tenfold in order to ensure that almost all regions of the genome will have been sampled at least once (as discussed in Chapter 2). From this vast library the goal is to assemble and to order the minimum set of clones that covers the genome in one contiguous block. This set is called the tiling path.
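The choice of five- to tenfold redundancy can be illustrated with a short calculation. The sketch below is not taken from this chapter: it assumes that clone positions are uniformly random, so that the number of clones covering any given point is approximately Poisson distributed and the expected unsampled fraction is exp(-redundancy). The function name is hypothetical.

```python
import math

def fraction_unsampled(redundancy: float) -> float:
    """Expected fraction of the genome present on no clone.

    Assumption (not from the text): clone positions are uniformly random,
    so coverage at a point is roughly Poisson with mean `redundancy`.
    """
    return math.exp(-redundancy)

for c in (1, 5, 10):
    print(f"{c:>2}-fold redundancy: {fraction_unsampled(c):.5f} of the genome unsampled")
```

Under this idealized model a fivefold library is expected to miss less than 1% of the target. Real libraries are not perfectly random, so the actual gaps are likely to be somewhat larger.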
Clone libraries have usually been ordered by a bottom-up approach. Here individual clones are initially selected from the library at random. Usually the library is handled as an array of samples so that each clone has a unique location on a set of microtitre plates, and the chances of accidentally confusing two different clones can be minimized. Each clone is fingerprinted by hybridization, by restriction mapping, or by determining bits of DNA sequence. Eventually clones appear that share some or all of the same fingerprint pattern (Fig. 8.7). These are clearly overlapping, if not identical, and they are assembled into overlapping sets called contigs, which is short for contiguous blocks. There are several obvious advantages to this approach. Since the DNA is handled as clones, the map is built up of immortal samples that are easily distributed and that are potentially suitable for direct sequencing. The maps are usually fairly high resolution when small clones are used, and some forms of fingerprinting provide very useful internal information about each clone.
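The grouping of clones into contigs by shared fingerprints can be sketched in a few lines. This is only an illustration under simplifying assumptions: each fingerprint is treated as a set of exactly measured restriction fragment sizes, and two clones are linked when they share at least `min_shared` sizes. The clone names, sizes, and threshold are hypothetical, and a practical method would have to allow for measurement error and chance matches.

```python
from itertools import combinations

def build_contigs(fingerprints, min_shared=2):
    """Group clones into contigs.

    Two clones are linked if their fingerprints share at least `min_shared`
    features; a contig is a connected set of linked clones (union-find).
    """
    parent = {clone: clone for clone in fingerprints}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b in combinations(fingerprints, 2):
        if len(fingerprints[a] & fingerprints[b]) >= min_shared:
            parent[find(a)] = find(b)

    contigs = {}
    for clone in fingerprints:
        contigs.setdefault(find(clone), set()).add(clone)
    return sorted(sorted(members) for members in contigs.values())

# Hypothetical fingerprints: sets of restriction fragment sizes in kb.
clones = {
    "c1": {1.2, 3.4, 5.0, 7.7},
    "c2": {5.0, 7.7, 2.1, 9.3},
    "c3": {2.1, 9.3, 4.4},
    "c4": {6.6, 8.8, 0.9},
}
print(build_contigs(clones))  # [['c1', 'c2', 'c3'], ['c4']]
```

Here c1 and c3 share no fragments directly but fall into one contig through c2, which is how contigs grow as more clones are fingerprinted. The sketch says nothing about the order or orientation of clones within a contig.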
Figure 8.6 Example of a dense library of clones. (a) The large number of clones ensures that a given DNA probe or region (vertical dashed line) will occur on quite a few different clones in the library. (b) The minimum tiling set is the smallest number of clones that can be selected to span the entire sample of DNA.
Figure 8.7 Example of a bottom-up fingerprinting strategy to order a dense set of clones. (a) A clone is selected at random and fingerprinted. (b) Two clones that share an overlapping fingerprint pattern are assembled into a contig. (c) Longer contigs are assembled as more overlapping clones are found.
There are a number of disadvantages to bottom-up mapping. While this process is easy to automate, no overview of the chromosome or genome is provided by the fingerprinting and contig building. Additional experiments have to be done to place contigs on a lower-resolution framework map, and most approaches do not necessarily allow the orientation of each contig to be determined easily. A more serious limitation of pure bottom-up mapping strategies is that they do not reach completion. Even if the original library is a perfectly even representation of the genome, there is a statistical problem associated with the random clone picking used. After a while, most new clones that are picked will fall into contigs that are already saturated. No new information will be gained from the characterization of these new clones. As the map proceeds, a progressively smaller fraction of new clones will add any useful information. This problem becomes much more serious if the original library is an uneven sample of the chromosome or genome. The problem can be alleviated somewhat if mapped contigs are used to screen new clones prior to selection to try to discard those that cannot yield new information. A final problem with bottom-up maps is shown in Figure 8.8. Usually the overlap distance between two adjacent members of a contig is not known with much precision. Therefore distances on a typical bottom-up map are not well defined.
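The diminishing return from random clone picking can be made concrete with a small calculation. The sketch below rests on an idealized assumption that is not stated in the text: clones of equal length are placed uniformly at random on a circular target, so each earlier clone leaves any given point uncovered with probability 1 - L/G. The expected new DNA contributed by the k-th clone is then (L/G)(1 - L/G)^(k-1) of the target. The clone and chromosome sizes are illustrative only.

```python
def expected_new_fraction(k: int, clone_kb: float, genome_kb: float) -> float:
    """Expected fraction of the target newly covered by the k-th random clone.

    Assumes equal-length clones at uniformly random positions on a circular
    target (an idealization; real libraries are less even).
    """
    p = clone_kb / genome_kb
    return p * (1.0 - p) ** (k - 1)

# Hypothetical example: 40-kb cosmids picked from a 150-Mb chromosome.
clone_kb, genome_kb = 40.0, 150_000.0
one_fold = round(genome_kb / clone_kb)  # clones per 1-fold redundancy
for fold in (0, 1, 3, 5):
    k = fold * one_fold + 1
    gain_kb = expected_new_fraction(k, clone_kb, genome_kb) * genome_kb
    print(f"clone #{k:>6}: about {gain_kb:5.2f} kb of new DNA expected")
```

In this model the expected gain per clone falls roughly exponentially with the redundancy already reached, which is one way to see why pure random picking approaches completion so slowly.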
The number of samples or clones that must be handled in top-down or bottom-up mapping projects can be daunting. This number also scales linearly with the resolution desired. To gain some perspective on the problem, consider the task of mapping a 150-Mb chromosome.
Figure 8.8 Ambiguity in the degree of clone overlap resulting from most fingerprinting or clone ordering methods.
This is the average size of a human chromosome. The numbers of samples needed are shown below:

RESOLUTION    RESTRICTION MAP     ORDERED LIBRARY (5× REDUNDANCY)
1 Mb          150 fragments       750 clones
0.1 Mb        1500 fragments      7500 clones
0.01 Mb       15,000 fragments    75,000 clones

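The entries in this tabulation follow from simple division, as the sketch below shows. It treats the resolution as the average fragment or clone size, which is a simplification, and the function name is hypothetical.

```python
def samples_needed(chromosome_mb: float, resolution_mb: float, redundancy: int = 1) -> int:
    """Fragments or clones of average size `resolution_mb` needed to span
    `chromosome_mb`, multiplied by the library redundancy."""
    return round(chromosome_mb / resolution_mb * redundancy)

for res in (1, 0.1, 0.01):
    fragments = samples_needed(150, res)
    clones = samples_needed(150, res, redundancy=5)
    print(f"{res:>5} Mb: {fragments:>6} fragments, {clones:>6} clones")
```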
With existing methods the current convenient range for constructing physical maps of entire human chromosomes allows a resolution of somewhere between 50 kb (for the most arduous bottom-up approaches attempted) and 1 Mb (for much easier restriction mapping or large-insert clone contig building).
The resolution desired in a map will determine the sorts of clones that are conveniently used in bottom-up approaches. Among the possibilities currently available are

Bacteriophage lambda                       10 kb inserts
Cosmids                                    40 kb inserts
P1 clones                                  80 to 100 kb inserts
Bacterial artificial chromosomes (BACs)    100 to 400 kb inserts
Yeast artificial chromosomes (YACs)        100 to 1300 kb inserts

The first three types of clones can be easily grown in large numbers of copies per cell. This greatly simplifies DNA preparation, hybridization, and other analytical procedures. The last two types of clones are usually handled as single copies per host cell, although some methods exist for amplifying them. Thus they are more difficult to work with, individually, but their larger insert size makes low-resolution mapping much more rapid and efficient.
RESTRICTION NUCLEASE GENOMIC DIGESTS
Generating DNA fragments of the desired size range is critical for producing useful libraries and genomic restriction maps. If genomes were statistically random collections of the four bases, simple binomial statistics would allow us to estimate the average fragment length that would result from a given restriction nuclease recognition site in a total digest. For a genome where each of the four bases is equally represented, the probability of occurrence of a particular site of size n is $4^{-n}$; therefore the average fragment length generated by that enzyme will be $4^{n}$. In practice, this becomes

SITE SIZE (n)    AVERAGE FRAGMENT LENGTH (kb)
4                0.25
6                4
8                64
10               1000

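The tabulated values are rounded powers of four, which a short sketch can confirm. It assumes, as the text does at this point, a random sequence with each base at 25%; the function name is hypothetical.

```python
def mean_fragment_kb(site_size: int) -> float:
    """Mean spacing, in kb, between occurrences of one specific site of
    `site_size` bases in a random sequence with 25% of each base."""
    return 4 ** site_size / 1000

for n in (4, 6, 8, 10):
    print(f"{n:>2}-base site: {mean_fragment_kb(n):10.3f} kb")
```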
This tabulation indicates that enzymes with four- or six-base sites are convenient for the construction of partial digest small-insert libraries. Enzymes with sites ten bases long would be the preferred choice for low-resolution macrorestriction mapping, but such enzymes are unknown. Enzymes with eight-base sites would be most useful for currently achievable large-insert cloning. More accurate schemes for predicting cutting frequencies are discussed in Box 8.1.
A list of the enzymes currently available that have relatively rare cutting sites in mammalian genomes is given in Table 8.1. Unfortunately, there are only a few known restriction enzymes with eight-base specificity, and most of these have sites that are not well predicted by random statistics. A few enzymes are known with larger recognition sequences. In most cases there is not yet convincing evidence that the available preparations of these enzymes have low enough contamination with random nonspecific nucleases to allow them to be used for a complete digest to generate a discrete nonoverlapping set of large DNA fragments. Several enzymes exist that have sites so rare that they will not occur at all in natural genomes. To use these enzymes, one must employ strategies in which the sites are introduced into the genome, their location is determined, and then cutting at the site is used to generate fragments containing segments of the region where the site was inserted. Such strategies are potentially quite powerful, but they are still in their infancy.
The unfortunate conclusion, from experimental studies on the pattern of DNA fragment lengths generated by genomic restriction nuclease digestion, is that most currently available eight-base-specific enzymes are not useful for generating fragments suitable for macrorestriction mapping. Either the average fragment length is too short, or the digestion is not complete, leading to an overly complicated set of reaction products. However, mammalian genomes are very poorly modeled by binomial statistics, and thus some enzymes, which statistically might be thought to be useless because they would generate fragments that are too small, in fact generate fragments in useful size ranges. As a specific example, consider the first two enzymes known with eight-base recognition sequences:

Sfi I    GGCCNNNN^NGGCC
Not I    GC^GGCCGC

Here the symbol N indicates that any of the four bases can occupy this site; the caret (^) indicates the cleavage site on the strand shown; there is a corresponding site on the second strand.
Human DNA is A–T rich; like that of other mammals it contains approximately 60% A+T. When this is factored into the predictions (Box 8.1), on the basis of base composition alone, these enzymes would be expected to cut human DNA every 100 to 300 kb. In fact Sfi I digestion of the human genome yields DNA fragments that predominantly range in size from 100 to 300 kb. In contrast, Not I generates DNA fragments that average closer to 1 Mb in size. The size range generated by Sfi I would make it potentially useful for some applications, but the cleavage specificity of Sfi I leads to some unfortunate complications. This enzyme cuts within an unspecified sequence. Thus the fragments it generates cannot be directly cloned in a straightforward manner because the three-base overhang generated by Sfi I is a mixture of 64 different sequences. Another problem, introduced by the location of the Sfi I cutting site, is that different sequences turn out to be cut at very different rates. This makes it difficult to achieve total digests efficiently, and as described later, it also makes it very difficult to use Sfi I in mapping strategies that depend on the analysis of partial digests.
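The two recognition sequences can be located in a sequence with ordinary pattern matching, and the 64 possible Sfi I overhangs can be enumerated directly. The sketch below is an illustration only: the function names and the test sequence are hypothetical, and no attempt is made to model methylation or cutting rates. Because both sites are palindromic in their specified bases, scanning one strand is sufficient.

```python
import re
from itertools import product

# Recognition sites; N (any base) is written as [ACGT]. A lookahead is used
# so that overlapping occurrences are also reported.
SITES = {
    "NotI": re.compile(r"(?=GCGGCCGC)"),
    "SfiI": re.compile(r"(?=GGCC[ACGT]{5}GGCC)"),
}

def find_sites(seq: str, enzyme: str) -> list:
    """0-based start positions of every recognition site in `seq`."""
    return [m.start() for m in SITES[enzyme].finditer(seq.upper())]

def sfi_overhangs() -> list:
    """All possible three-base overhang sequences left by Sfi I cleavage."""
    return ["".join(p) for p in product("ACGT", repeat=3)]

# Hypothetical test sequence containing one site for each enzyme.
seq = "AAGCGGCCGCTTGGCCATGCAGGCCAA"
print(find_sites(seq, "NotI"))   # [2]
print(find_sites(seq, "SfiI"))   # [12]
print(len(sfi_overhangs()))      # 64
```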
BOX 8.1
PREDICTION OF RESTRICTION ENZYME-CUTTING FREQUENCIES
An accurate prediction of cutting frequencies requires an accurate statistical estimation of the probability of occurrence of the enzyme recognition site. To accomplish this, one must take into account two factors: First, the sample of interest, that is, the human genome, is unlikely to have a base composition that is precisely 50% G+C, 50% A+T. Second, the frequencies of particular dinucleotide sequences often vary quite substantially from those predicted by simple binomial statistics based on the base composition. A rigorous treatment should also take the mosaicism into account (see Chapters 2 and 15).
Base composition effects alone can be considered, for double-stranded DNA, with a single variable: $X_{G+C} = 1 - X_{A+T}$, where X is a mole fraction. Then the expected frequency of occurrence of a site with n G's or C's and m A's or T's is just

$$\left(\frac{1}{2}\right)^{n+m} (X_{G+C})^{n} (1 - X_{G+C})^{m}$$

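The base composition formula can be evaluated directly. The sketch below is illustrative: the function names are hypothetical, and the value of 0.40 for the G+C mole fraction is simply taken from the approximately 60% A+T quoted in the text. The result is very sensitive to this value, so the printed numbers should be read as rough guides only.

```python
def site_frequency(n_gc: int, m_at: int, x_gc: float) -> float:
    """Expected frequency of a site with n_gc G's or C's and m_at A's or T's,
    from base composition alone: (1/2)^(n+m) * X_GC^n * (1 - X_GC)^m."""
    return 0.5 ** (n_gc + m_at) * x_gc ** n_gc * (1.0 - x_gc) ** m_at

def mean_spacing_kb(n_gc: int, m_at: int, x_gc: float) -> float:
    """Average distance between sites, in kb, as the reciprocal frequency."""
    return 1.0 / site_frequency(n_gc, m_at, x_gc) / 1000.0

# Not I (GCGGCCGC) and Sfi I (GGCCNNNNNGGCC) each specify eight G/C positions;
# the N positions of Sfi I match any base and contribute a factor of 1.
for x_gc in (0.50, 0.40):
    print(f"X_GC = {x_gc:.2f}: one site per {mean_spacing_kb(8, 0, x_gc):.1f} kb")
```

Lowering the G+C fraction from 50% to 40% makes an all-G/C eight-base site about sixfold rarer in this model.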
To take base sequence into account, Markov statistics can be used. In Markov chain statistics the probability of the nth event in a series can be influenced by the specific outcome of the prior events such as the (n − 1)th and (n − 2)th events. Thus Markov chain statistics can take into account known frequency information about the occurrences of sequences of events. In other words, this kind of statistics is ideally suited for the analysis of sequences.
Suppose that the frequencies of particular dinucleotide sequences are known for the sample of interest. There are 16 possible dinucleotide sequences: The frequency $X_{A,C}$, for example, indicates the fraction of dinucleotides that have an AC sequence on one strand base paired to a complementary GT on the other. The sum of these frequencies is 1. Only 10 of the 16 dinucleotide sequences are distinct unless we consider the two strands separately. On each strand, we can relate dinucleotide and mononucleotide frequencies by four equations such as

$$X_{A,C} + X_{A,A} + X_{A,T} + X_{A,G} = X_{A}$$

since base A must always be followed by some other base (the X's indicate mole fractions). The expected frequency of occurrence of a particular sequence string of n bases, based on these nearest-neighbor frequencies, is just

$$X_{1,2} \prod_{i=2}^{n-1} \frac{X_{i,i+1}}{X_{i}}$$

where the product is taken successively over the mole fractions $X_{i,i+1}$ and $X_{i}$ of all successive dinucleotides i,i+1 and mononucleotides i, respectively, in the sequence of interest. Predictions done in this way are usually more accurate than predictions based solely on the base composition. Where methylation occurs that can block the cutting of certain restriction nucleases, the issue becomes much more complex, as discussed in the text.
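The nearest-neighbor estimate can be sketched as follows. The dinucleotide tables here are hypothetical and are not measured frequencies: one is uniform, and the other lowers CG and TA and raises CA and TG by the same amount, so that every mononucleotide frequency stays at 0.25. Mononucleotide frequencies are obtained as row sums of the dinucleotide table, following the relation given above, and the site is assumed to be at least two bases long.

```python
from itertools import product

BASES = "ACGT"

def mono_from_di(di: dict) -> dict:
    """Mononucleotide frequencies as row sums, e.g. X_A = X_AA + X_AC + X_AG + X_AT."""
    return {a: sum(di[a + b] for b in BASES) for a in BASES}

def site_frequency(site: str, di: dict) -> float:
    """Nearest-neighbor estimate: X_{1,2} * prod_{i=2}^{n-1} X_{i,i+1} / X_i."""
    mono = mono_from_di(di)
    p = di[site[0:2]]
    for i in range(1, len(site) - 1):
        p *= di[site[i:i + 2]] / mono[site[i]]
    return p

# Uniform table: every dinucleotide at 1/16.
uniform = {a + b: 1 / 16 for a, b in product(BASES, repeat=2)}

# Hypothetical CG-depleted table with unchanged base composition.
depleted = dict(uniform)
depleted.update({"CG": 1 / 64, "TA": 1 / 64, "CA": 7 / 64, "TG": 7 / 64})

not_i = "GCGGCCGC"
for name, table in (("uniform", uniform), ("CG-depleted", depleted)):
    f = site_frequency(not_i, table)
    print(f"{name:>12}: one Not I site per {1 / f / 1000:.0f} kb expected")
```

With the uniform table the estimate reduces to the simple 4^(-n) rule. In the depleted table each CG step is fourfold rarer, and because the Not I site contains two such steps, its expected frequency drops 16-fold even though the base composition is unchanged.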