
Data Structures and Algorithms: CHAPTER 11: Data Structures and Algorithms for External Storage

Removing record 10 from the B-tree of Fig. 11.11 results in the B-tree of Fig. 11.12. Here, the block containing 10 is discarded. Its parent now has only two children, and the right sibling of the parent has the minimum number, three. Thus we combine the parent with its sibling, making one node with five children.

Fig. 11.11. B-tree after insertion.

Fig. 11.12. B-tree after deletion.

Time Analysis of B-tree Operations

Suppose we have a file with n records organized into a B-tree of order m. If each leaf contains b records on the average, then the tree has about ⌈n/b⌉ leaves. The longest possible paths in such a tree occur if each interior node has the fewest children possible, that is, m/2 children. In this case there will be about 2⌈n/b⌉/m parents of leaves, 4⌈n/b⌉/m^2 parents of parents of leaves, and so on.

If there are j nodes along the path from the root to a leaf, then 2^(j-1)⌈n/b⌉/m^(j-1) ≥ 1, or else there would be fewer than one node at the root's level. Therefore, ⌈n/b⌉ ≥ (m/2)^(j-1), and j ≤ 1 + log_(m/2)⌈n/b⌉. For example, if n = 10^6, b = 10, and m = 100, then j ≤ 3.9. Note that b is not the maximum number of records we can put in a block, but an average or expected number. However, by redistributing records among neighboring blocks whenever one gets less than half full, we can ensure that b is at least half the maximum value. Also note that we have assumed in the above that each interior node has the minimum possible number of children. In practice, the average interior node will have more than the minimum, and the above analysis is therefore conservative.
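The bound is easy to evaluate numerically. A small sketch (Python; the function name is ours, the values are those of the example above):

```python
import math

def max_path_length(n, b, m):
    """Upper bound j <= 1 + log_(m/2) ceil(n/b) on the number of nodes
    along a root-to-leaf path in a B-tree of order m whose leaves hold
    b records on the average."""
    return 1 + math.log(math.ceil(n / b), m / 2)

# The example from the text: n = 10^6, b = 10, m = 100.
print(round(max_path_length(10**6, 10, 100), 1))  # 3.9
```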

For an insertion or deletion, j block accesses are needed to locate the appropriate leaf. The exact number of additional block accesses needed to accomplish the insertion or deletion, and to ripple its effects through the tree, is difficult to compute. Most of the time only one block, the leaf storing the record of interest, needs to be rewritten. Thus, 2 + log_(m/2)⌈n/b⌉ can be taken as the approximate number of block accesses for an insertion or deletion.

Comparison of Methods

We have discussed hashing, sparse indices, and B-trees as possible methods for organizing external files. It is interesting to compare, for each method, the number of block accesses involved in a file operation.

Hashing is frequently the fastest of the three methods, requiring two block accesses on average for each operation (excluding the block accesses required to search the bucket table), if the number of buckets is sufficiently large that the typical bucket uses only one block. With hashing, however, we cannot easily access the records in sorted order.

A sparse index on a file of n records allows the file operations to be done in about 2 + log_2(n/bb′) block accesses using binary search; here b is the number of records that fit on a block, and b′ is the number of key-pointer pairs that fit on a block of the index file. B-trees allow file operations in about 2 + log_(m/2)⌈n/b⌉ block accesses, where m, the maximum degree of the interior nodes, is approximately b′. Both sparse indices and B-trees allow records to be accessed in sorted order.
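To make the comparison concrete, here is a sketch (Python) evaluating both estimates; the parameter values below are illustrative assumptions, not figures from the text:

```python
import math

def sparse_index_accesses(n, b, b_prime):
    """Approximate block accesses per operation with a sparse index
    searched by binary search: 2 + log_2(n / (b * b_prime))."""
    return 2 + math.log2(n / (b * b_prime))

def btree_accesses(n, b, m):
    """Approximate block accesses per operation with a B-tree of
    order m: 2 + log_(m/2) ceil(n/b)."""
    return 2 + math.log(math.ceil(n / b), m / 2)

# Assumed parameters: one million records, 10 records per block,
# 100 key-pointer pairs per index block (so m is about b').
n, b, b_prime = 10**6, 10, 100
print(round(sparse_index_accesses(n, b, b_prime), 1))  # 12.0
print(round(btree_accesses(n, b, b_prime), 1))         # 4.9
```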

All of these methods are remarkably good compared to the obvious sequential scan of a file. The timing differences among them, however, are small and difficult to determine analytically, especially considering that the relevant parameters such as the expected file length and the occupancy rates of blocks are hard to predict in advance.

It appears that the B-tree is becoming increasingly popular as a means of accessing files in database systems. Part of the reason lies in its ability to handle queries asking for records with keys in a certain range (which benefit from the fact that the records appear in sorted order in the main file). The sparse index also handles such queries efficiently, but is almost sure to be less efficient than the B-tree. Intuitively, the reason B-trees are superior to sparse indices is that we can view a B-tree as a sparse index on a sparse index on a sparse index, and so on. (Rarely, however, do we need more than three levels of indices.)

B-trees also perform relatively well when used as secondary indices, where "keys" do not really define a unique record. Even if the records with a given value for the designated fields of a secondary index extend over many blocks, we can read them all with a number of block accesses just equal to the number of blocks holding these records plus the number of their ancestors in the B-tree. In comparison, if these records plus another group of similar size happen to hash to the same bucket, then retrieval of either group from a hash table would require a number of block accesses about double the number of blocks on which either group would fit. There are possibly other reasons for favoring the B-tree, such as its performance when several processes are accessing the structure simultaneously, that are beyond the scope of this book.

Exercises

11.1 Write a program concatenate that takes a sequence of file names as arguments and writes the contents of the files in turn onto the standard output, thereby concatenating the files.

11.2 Write a program include that copies its input to its output except when it encounters a line of the form #include filename, in which case it is to replace this line with the contents of the named file. Note that included files may also contain #include statements.
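Not a prescribed solution, but a minimal sketch of such an include program (Python), assuming the directive is a line beginning `#include ` followed by a file name; recursion handles nested includes:

```python
import sys

def expand(path, out):
    """Copy the file at `path` to `out`, replacing each line of the
    form `#include filename` with the named file's contents,
    recursively expanding includes inside included files."""
    with open(path) as f:
        for line in f:
            if line.startswith("#include "):
                expand(line.split(maxsplit=1)[1].strip(), out)
            else:
                out.write(line)

if __name__ == "__main__" and len(sys.argv) > 1:
    expand(sys.argv[1], sys.stdout)
```

The sketch makes no attempt to detect a file that includes itself; it would simply recurse until Python's recursion limit is hit.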

11.3 How does your program for Exercise 11.2 behave when a file includes itself?

11.4 Write a program compare that will compare two files record-by-record to determine whether the two files are identical.

*11.5 Rewrite the file comparison program of Exercise 11.4 using the LCS algorithm of Section 5.6 to find the longest common subsequence of records in both files.

11.6 Write a program find that takes two arguments consisting of a pattern string and a file name, and prints all lines of the file containing the pattern string as a substring. For example, if the pattern string is "ufa" and the file is a word list, then find prints all words containing the trigram "ufa."

11.7 Write a program that reads a file and writes on its standard output the records of the file in sorted order.

11.8 What are the primitives Pascal provides for dealing with external files? How would you improve them?


*11.9 Suppose we use a three-file polyphase sort, where at the ith phase we create a file with r_i runs of length l_i. At the nth phase we want one run on one of the files and none on the other two. Explain why each of the following must be true:

a. l_i = l_(i-1) + l_(i-2) for i ≥ 1, where l_0 and l_(-1) are taken to be the lengths of runs on the two initially occupied files.

b. r_i = r_(i-2) - r_(i-1) (or equivalently, r_(i-2) = r_(i-1) + r_i) for i ≥ 1, where r_0 and r_(-1) are the number of runs on the two initial files.

c. r_n = r_(n-1) = 1, and therefore r_n, r_(n-1), . . . , r_1 forms a Fibonacci sequence.

*11.10 What additional condition must be added to those of Exercise 11.9 to make a polyphase sort possible

a. with initial runs of length one (i.e., l_0 = l_(-1) = 1),

b. running for k phases, but with initial runs other than one allowed.

Hint. Consider a few examples, like l_n = 50, l_(n-1) = 31, or l_n = 50, l_(n-1) = 32.

**11.11 Generalize Exercises 11.9 and 11.10 to polyphase sorts with more than three files.

**11.12 Show that:

a. Any external sorting algorithm that uses only one tape as external storage must take Ω(n^2) time to sort n records.

b. O(n log n) time suffices if there are two tapes to use as external storage.
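Part (c) of Exercise 11.9 can be checked mechanically; a sketch (Python; the function name is ours) that runs the recurrence r_(i-2) = r_(i-1) + r_i backward from the final phase:

```python
def initial_run_counts(n):
    """Run counts r_n, r_(n-1), ..., r_1 implied by the recurrence
    r_(i-2) = r_(i-1) + r_i with r_n = r_(n-1) = 1 (Exercise 11.9c):
    working backward from the final phase yields a Fibonacci sequence."""
    counts = [1, 1]  # r_n, r_(n-1)
    for _ in range(n - 2):
        counts.append(counts[-1] + counts[-2])  # r_(i-2) = r_(i-1) + r_i
    return counts

print(initial_run_counts(8))  # [1, 1, 2, 3, 5, 8, 13, 21]
```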


11.13 Suppose we have an external file of directed arcs x → y that form a directed acyclic graph. Assume that there is not enough space in internal memory to hold the entire set of vertices or edges at one time.

a. Write an external topological sort program that prints out a linear ordering of the vertices such that if x → y is a directed arc, then vertex x appears before vertex y in the linear order.

b. What is the time and space complexity of your program as a function of the number of block accesses?

c. What does your program do if the directed graph is cyclic?

d. What is the minimum number of block accesses needed to topologically sort an externally stored dag?

11.14 Suppose we have a file of one million records, where each record takes 100 bytes. Blocks are 1000 bytes long, and a pointer to a block takes 4 bytes. Devise a hashed organization for this file. How many blocks are needed for the bucket table and the buckets?

11.15 Devise a B-tree organization for the file of Exercise 11.14.

11.16 Write programs to implement the operations RETRIEVE, INSERT, DELETE, and MODIFY on

a. hashed files,

b. indexed files,

c. B-tree files.


11.17 Write a program to find the kth largest element in

a. a sparse-indexed file,

b. a B-tree file.

11.18 Assume that it takes a + bm milliseconds to read a block containing a node of an m-ary search tree. Assume that it takes c + d log_2 m milliseconds to process each node in internal memory. If there are n nodes in the tree, we need to read about log_m n nodes to locate a given record. Therefore, the total time taken to find a given record in the tree is

(log_m n)(a + bm + c + d log_2 m) = (log_2 n)((a + c + bm)/log_2 m + d)

milliseconds. Make reasonable estimates for the values of a, b, c, and d and plot this quantity as a function of m. For what value of m is the minimum attained?
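A sketch (Python) of the experiment the exercise asks for, with wholly assumed constants a, b, c, d (placeholders, not measured values); it searches for the minimizing m numerically rather than plotting:

```python
import math

def lookup_time(m, n, a, b, c, d):
    """Total time (ms) to find a record in an n-node m-ary search
    tree: (log_m n) * (a + b*m + c + d*log_2 m), as in Exercise 11.18."""
    return (math.log2(n) / math.log2(m)) * (a + b * m + c + d * math.log2(m))

# Assumed constants (not from the text): a = 40 ms seek plus latency,
# b = 0.01 ms per child pointer transferred, c = 0.5 ms fixed CPU
# cost per node, d = 0.2 ms per level of binary search within a node.
n, a, b, c, d = 10**6, 40.0, 0.01, 0.5, 0.2
best_m = min(range(2, 5000), key=lambda m: lookup_time(m, n, a, b, c, d))
print(best_m, round(lookup_time(best_m, n, a, b, c, d), 1))
```

Large seek cost a pushes the optimum toward large m (fewer, fatter nodes); a larger transfer cost b pulls it back down.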

*11.19 A B*-tree is a B-tree in which each interior node is at least 2/3 full (rather than just 1/2 full). Devise an insertion scheme for B*-trees that delays splitting interior nodes until two sibling nodes are full. The two full nodes can then be divided into three, each 2/3 full. What are the advantages and disadvantages of B*-trees compared with B-trees?

*11.20 When the key of a record is a string of characters, we can save space by storing only a prefix of the key as the key separator in each interior node of the B-tree. For example, "cat" and "dog" could be separated by the prefix "d" or "do" of "dog." Devise a B-tree insertion algorithm that uses prefix key separators that at all times are as short as possible.


*11.21 Suppose that the operations on a certain file are insertions and deletions a fraction p of the time, and the remaining 1 - p of the time are retrievals where exactly one field is specified. There are k fields in records, and a retrieval specifies the ith field with probability q_i. Assume that a retrieval takes a milliseconds if there is no secondary index for the specified field, and b milliseconds if the field has a secondary index. Also assume that an insertion or deletion takes c + sd milliseconds, where s is the number of secondary indices. Determine, as a function of a, b, c, d, p, and the q_i's, which secondary indices should be created for the file in order that the average time per operation be minimized.

*11.22 Suppose that keys are of a type that can be linearly ordered, such as real numbers, and that we know the probability distribution with which keys of given values will appear in the file. We could use this knowledge to outperform binary search when looking for a key in a sparse index. One scheme, called interpolation search, uses this statistical information to predict where in the range of index blocks B_i, . . . , B_j to which the search has been limited, a key x is most likely to lie. Give

a. an algorithm to take advantage of statistical knowledge in this way, and

b. a proof that O(log log n) block accesses suffice, on the average, to find a key.
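For intuition, a sketch of interpolation search over an in-core sorted list (Python); in the exercise's setting each probe would be one block access, and the O(log log n) expected bound holds for uniformly distributed keys:

```python
def interpolation_search(keys, x):
    """Search sorted list `keys` for `x`, probing where x is expected
    to lie under a uniform key distribution, rather than at the
    midpoint as binary search does. Returns an index of x, or -1."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= x <= keys[hi]:
        if keys[hi] == keys[lo]:
            pos = lo
        else:
            # Interpolate x's likely position within keys[lo..hi].
            pos = lo + int((hi - lo) * (x - keys[lo]) / (keys[hi] - keys[lo]))
        if keys[pos] == x:
            return pos
        if keys[pos] < x:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1

print(interpolation_search(list(range(0, 1000, 7)), 700))  # 100
```

On the uniformly spaced keys above, the very first probe lands on the answer; binary search would need about seven.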

11.23 Suppose we have an external file of records, each consisting of an edge of a graph G and a cost associated with that edge.

a. Write a program to construct a minimum-cost spanning tree for G, assuming that there is enough memory to store all the vertices of G in core but not all the edges.

b. What is the time complexity of your program as a function of the number of vertices and edges?

Hint. One approach to this problem is to maintain a forest of currently connected components in core. Each edge is read and processed as follows: if the next edge has ends in two different components, add the edge and merge the components. If the edge creates a cycle in an existing component, add the edge and remove the highest-cost edge from that cycle (which may be the current edge). This approach is similar to Kruskal's algorithm but does not require the edges to be sorted, an important consideration in this problem.
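A sketch of the hint (Python; the names and the in-core forest representation are our assumptions). Edges arrive one at a time, as if read from the external file; only the current spanning forest is kept in core:

```python
from collections import defaultdict

def external_mst(edge_stream):
    """Process edges (u, v, cost) one at a time, keeping only the
    spanning forest in core. An edge joining two components is added;
    one closing a cycle is added and the costliest cycle edge removed."""
    adj = defaultdict(dict)  # adj[u][v] = cost of forest edge (u, v)

    def path(u, v):
        """Tree edges on the forest path from u to v, or [] if none."""
        stack, prev = [u], {u: None}
        while stack:
            x = stack.pop()
            if x == v:
                break
            for y in adj[x]:
                if y not in prev:
                    prev[y] = x
                    stack.append(y)
        edges = []
        while prev.get(v) is not None:
            edges.append((prev[v], v))
            v = prev[v]
        return edges

    for u, v, cost in edge_stream:
        if u in adj and v in adj and path(u, v):
            # Cycle: drop the costliest edge among the path plus (u, v).
            cycle = path(u, v) + [(u, v)]
            adj[u][v] = adj[v][u] = cost
            a, b = max(cycle, key=lambda e: adj[e[0]][e[1]])
            del adj[a][b], adj[b][a]
        else:
            adj[u][v] = adj[v][u] = cost  # new edge merges two components
    return {(min(u, v), max(u, v), c) for u in adj for v, c in adj[u].items()}

# (a,b,4) closes a cycle with (b,c,1) and (a,c,2) and is evicted.
print(external_mst([("a", "b", 4), ("b", "c", 1), ("a", "c", 2)]))
```

The DFS to find the cycle makes this quadratic in the worst case; a production version would use a smarter dynamic-tree structure, but the sketch matches the hint's logic.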

11.24 Suppose we have a file containing a sequence of positive and negative numbers a_1, a_2, . . . , a_n. Write an O(n) program to find a contiguous subsequence a_i, a_(i+1), . . . , a_j that has the largest sum a_i + a_(i+1) + · · · + a_j of any such subsequence.
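One well-known O(n) solution is the running-sum scan often credited to Kadane; a sketch (Python, in-core for simplicity, though the same single pass works reading the file sequentially):

```python
def max_subsequence_sum(a):
    """Single O(n) pass: returns (best_sum, i, j) such that
    a[i..j] is a contiguous subsequence of maximum sum."""
    best, best_i, best_j = a[0], 0, 0
    cur, cur_i = a[0], 0
    for j in range(1, len(a)):
        if cur < 0:
            cur, cur_i = a[j], j   # a fresh start beats a negative prefix
        else:
            cur += a[j]            # extend the current subsequence
        if cur > best:
            best, best_i, best_j = cur, cur_i, j
    return best, best_i, best_j

print(max_subsequence_sum([3, -4, 5, -2, 6, -8, 4]))  # (9, 2, 4)
```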

Bibliographic Notes

For additional material on external sorting see Knuth [1973]. Further material on external data structures and their use in database systems can be found there and in Ullman [1982] and Wiederhold [1982]. Polyphase sorting is discussed by Shell [1971]. The six-buffer merging scheme in Section 11.2 is from Friend [1956] and the four-buffer scheme from Knuth [1973].

Secondary index selection, of which Exercise 11.21 is a simplification, is discussed by Lum and Ling [1970] and Schkolnick [1975]. B-trees originated with Bayer and McCreight [1972]. Comer [1979] surveys the many variations, and Gudes and Tsur [1980] discuss their performance in practice.

Information about Exercise 11.12, one- and two-tape sorting, can be found in Floyd and Smith [1973]. Exercise 11.22 on interpolation search is discussed in detail by Yao and Yao [1976] and Perl, Itai, and Avni [1978].

An elegant implementation of the approach suggested in Exercise 11.23 to the external minimum-cost spanning tree problem was devised by V. A. Vyssotsky around 1960 (unpublished). Exercise 11.24 is due to M. I. Shamos.

† It is tempting to assume that if (1) and (2) take the same time, then selection could never catch up with reading; if the whole block were not yet read, we would select from the first records of the block, those that had the lower keys, anyway. However, the nature of reading from disks is that a long period elapses before the block is found and anything at all is read. Thus our only safe assumption is that nothing of the block being read in a stage is available for selection during that stage.


If these are not the first runs from each file, then this initialization can be done after the previous runs were read and the last 4b records from these runs are being merged.

This strategy is the simplest of a number of responses that can be made to the situation where a block has to be split. Some other choices, providing higher average occupancy of blocks at the cost of extra work with each insertion, are mentioned in the exercises.

We can use a variety of strategies to prevent leaf blocks from ever becoming completely empty. In particular, we describe below a scheme for preventing interior nodes from getting less than half full, and this technique can be applied to the leaves as well, with a value of m equal to the largest number of records that will fit in one block.

