Добавил:
Опубликованный материал нарушает ваши авторские права? Сообщите нам.
Вуз: Предмет: Файл:

Beginning Algorithms (2006)

.pdf
Скачиваний:
256
Добавлен:
17.08.2013
Размер:
9.67 Mб
Скачать

Chapter 11

private void maintainLoad() { if (loadFactorExceeded()) {

resize();

}

}

private boolean loadFactorExceeded() {

return size() > _buckets.length * _loadFactor;

}

private void resize() { BucketingHashtable copy =

new BucketingHashtable(_buckets.length * 2, _loadFactor);

for (int i = 0; i < _buckets.length; ++i) { if (_buckets[i] != null) {

copy.addAll(_buckets[i].iterator());

}

}

_buckets = copy._buckets;

}

private void addAll(Iterator iterator) {

assert iterator != null : “iterator can’t be null”;

for (iterator.first(); !iterator.isDone(); iterator.next()) { add(iterator.current());

}

}

The method bucketIndexFor() determines which bucket a given value should be stored in. Just like you did for LinearProbingHashtable, the hashCode() method is called, and the remainder after dividing by the number of buckets is taken. This ensures you have an index that falls within the bounds of the bucket array:

private int bucketIndexFor(Object value) { assert value != null : “value can’t be null”;

return Math.abs(value.hashCode() % _buckets.length);

}

The bucketFor() method obtains the appropriate bucket for a specified value. Ordinarily you would just use a direct array lookup, but the bucketFor() method also guarantees that if no bucket exists yet at the appropriate position, then one is created:

private List bucketFor(Object value) {

int bucketIndex = bucketIndexFor(value);

List bucket = _buckets[bucketIndex]; if (bucket == null) {

bucket = new LinkedList(); _buckets[bucketIndex] = bucket;

284

Hashing

}

return bucket;

}

The add() method obtains the appropriate bucket and the value added only if it doesn’t already exist. Again, this ensures that two equal values — those for which equals would return true — can’t exist in the hash table simultaneously:

public void add(Object value) { List bucket = bucketFor(value);

if (!bucket.contains(value)) { bucket.add(value); maintainLoad();

}

}

The contains() method is also very simple. First find the appropriate bucket and then return true if a bucket exists and contains the specified value:

public boolean contains(Object value) {

List bucket = _buckets[bucketIndexFor(value)]; return bucket != null && bucket.contains(value);

}

Finally, the size() method adds the number of values in each bucket to calculate the total size:

public int size() { int size = 0;

for (int i = 0; i < _buckets.length; ++i) { if (_buckets[i] != null) {

size += _buckets[i].size();

}

}

return size;

}

Assessing Performance

Now that you have two hash table implementations, it’s time to see how well they perform, not only individually, but also against one another. To evaluate the performance of each, in the next Try It Out section you develop tests that exercise the add() and contains() methods to see how many times equals() is called on the stored values: The smaller the number, the more efficiently the implementation finds a suitable location for the value.

285

Chapter 11

Try It Out

Creating the Tests

Create a test class as follows:

package com.wrox.algorithms.hashing;

import junit.framework.TestCase;

public class HashtableCallCountingTest extends TestCase { private static final int TEST_SIZE = 1000;

private static final int INITIAL_CAPACITY = 17;

private int _counter; private Hashtable _hashtable;

public void testLinearProbingWithResizing() {

_hashtable = new LinearProbingHashtable(INITIAL_CAPACITY); runAll();

}

public void testLinearProbingNoResizing() {

_hashtable = new LinearProbingHashtable(TEST_SIZE); runAll();

}

public void testBucketsLoadFactor100Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 1.0f); runAll();

}

public void testBucketsLoadFactor75Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 0.75f); runAll();

}

public void testBuckets50Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 0.50f); runAll();

}

public void testBuckets25Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 0.25f); runAll();

}

public void testBuckets150Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 1.50f); runAll();

}

public void testBuckets200Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 2.0f); runAll();

}

286

Hashing

private void runAll() { runAdd(); runContains();

}

private void runAdd() { _counter = 0;

for (int i = 0; i < TEST_SIZE; ++i) { _hashtable.add(new Value(i));

}

reportCalls(“add”);

}

private void runContains() { _counter = 0;

for (int i = 0; i < TEST_SIZE; ++i) { _hashtable.contains(new Value(i));

}

reportCalls(“contains”);

}

private void reportCalls(String method) {

System.out.println(getName() + “(“ + method + “): “ + _counter + “ calls”);

}

private final class Value { private final String _value;

public Value(int value) {

_value = String.valueOf(Math.random() * TEST_SIZE);

}

public int hashCode() { return _value.hashCode();

}

public boolean equals(Object object) { ++_counter;

return object != null && _value.equals(((Value) object)._value);

}

}

}

How It Works

The HashtableCallCountingTest extends TestCase, making it easy to run. The class holds a hash table instance for the current test, and a counter for recording the number of times the equals() method is called:

package com.wrox.algorithms.hashing;

import junit.framework.TestCase;

public class HashtableCallCountingTest extends TestCase { private static final int TEST_SIZE = 1000;

287

Chapter 11

private static final int INITIAL_CAPACITY = 17;

private int _counter; private Hashtable _hashtable;

...

}

The Value inner class enables you to intercept and count calls to equals(). If you were to store simple strings, there would be no way to know when the equals() method had been called. Moreover, the String class is marked final, so there is no way to extend it and override the equals() method directly. Instead, you created your own class, which wraps a string and increments _counter anytime equals() is called. Notice how the constructor randomly assigns an underlying value to ensure that the results aren’t skewed due to the insertion of ordered data:

private final class Value { private final String _value;

public Value() {

_value = String.valueOf(Math.random() * TEST_SIZE);

}

public int hashCode() { return _value.hashCode();

}

public boolean equals(Object object) { ++_counter;

return object != null && _value.equals(((Value) object)._value);

}

}

The method reportCalls() enables you to report the number of times equals() has been called, in the form test-name(method): #### calls (where method will be either “add” or “contains”, depending on which part of the test is being reported at the time):

private void reportCalls(String method) {

System.out.println(getName() + “(“ + method + “): “ + _counter + “ calls”);

}

The methods runAdd() and runContains() reset the counter before running TEST_SIZE iterations of the add() and contains() methods, respectively, and finally reporting the results:

private void runAdd() { _counter = 0;

for (int i = 0; i < TEST_SIZE; ++i) { _hashtable.add(new Value());

}

reportCalls(“add”);

}

private void runContains() { _counter = 0;

288

Hashing

for (int i = 0; i < TEST_SIZE; ++i) { _hashtable.contains(new Value());

}

reportCalls(“contains”);

}

The method runAll() is a convenience to enable the test cases to run both parts with a single call:

private void runAll() { runAdd(); runContains();

}

Now we get into the actual test cases. The first set of test cases is for linear probing. There aren’t that many different configurations to try — only two, in fact — as the only configurable option for LinearProbingHashtable is the initial capacity: The first creates a hash table with an initial capacity that is smaller than the data set’s size, hopefully leading to a number of resize operations. The second test has exactly the right capacity to ensure that no resizing occurs at all:

public void testLinearProbingWithResizing() {

_hashtable = new LinearProbingHashtable(INITIAL_CAPACITY); runAll();

}

public void testLinearProbingNoResizing() {

_hashtable = new LinearProbingHashtable(TEST_SIZE); runAll();

}

The next set of tests exercises the bucketing version. These not only demonstrate the performance relative to linear probing, but also give you an idea of how performance varies depending on the initial configuration. Each case creates a hash table with an initial capacity small enough to guarantee that a number of resize operations will be performed. The difference between each lies in when that resize will occur. Notice the varying load factor values for each test:

public void testBucketsLoadFactor100Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 1.0f); runAll();

}

public void testBucketsLoadFactor75Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 0.75f); runAll();

}

public void testBuckets50Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 0.50f); runAll();

}

public void testBuckets25Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 0.25f); runAll();

}

289

Chapter 11

public void testBuckets150Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 1.50f); runAll();

}

public void testBuckets200Percent() {

_hashtable = new BucketingHashtable(INITIAL_CAPACITY, 2.0f); runAll();

}

Running the performance comparison should produce output similar to the following. Keep in mind that the actual results will be slightly different due to the random nature of the tests.

testLinearProbingWithResizing(add): 14704 calls testLinearProbingWithResizing(contains): 1088000 calls testLinearProbingNoResizing(add): 18500 calls testLinearProbingNoResizing(contains): 1000000 calls testBucketsLoadFactor100Percent(add): 987 calls testBucketsLoadFactor100Percent(contains): 869 calls testBucketsLoadFactor75Percent(add): 832 calls testBucketsLoadFactor75Percent(contains): 433 calls testBuckets50Percent(add): 521 calls testBuckets50Percent(contains): 430 calls testBuckets25Percent(add): 262 calls testBuckets25Percent(contains): 224 calls testBuckets150Percent(add): 1689 calls testBuckets150Percent(contains): 903 calls testBuckets200Percent(add): 1813 calls testBuckets200Percent(contains): 1815 calls

In this form, the numbers are a bit hard to interpret, so they have been summarized in Table 11-1.

Table 11-1: Calls to equals() for 1,000 Iterations Each of add() and contains()*

Configuration

add

contains

Total

Average

Linear Probing - Resizing

14,704

1,088,000

1,102,704

551.35

Linear Probing – No resizing

18,500

1,000,000

1,018,500

509.25

Buckets – 100% Load

987

869

1,856

0.93

Buckets – 75% Load

832

433

1,265

0.63

Buckets – 50% Load

521

430

951

0.48

Buckets – 25% Load

262

224

486

0.24

Buckets – 150% Load

1,689

903

2,592

1.30

Buckets – 200% Load

1,813

1,815

3,628

1.81

 

 

 

 

 

* Actual results may vary due to the random nature of the tests

290

Hashing

The most striking thing about these results is the obvious difference between linear probing and buckets. The last column in the table — Average — shows that linear probing performs generally no better than a linked list — O(N). Using buckets, however, appears to work remarkably well. Even in the worst case, where the hash table didn’t resize until the load reached 200%, the number of calls to equals() still averaged under 2! In the best case, the average was 0.24, or one call for every four values. Of course, in this case, the hash table is only ever 25% populated, leading to a lot of wasted space. In all cases, though, the buckets clearly outperform linear probing by several orders of magnitude.

There also seems to be a direct correlation between the bucket load factor and the number of calls: 100% load leads to around one call per value; 75% load results in a call for around 60% of the values, and so on. The really interesting feature, though, is that no matter what the load factor, performance still remains amazingly close to O(1).

From this, you can conclude that a hash table implementation that uses buckets provides excellent overall performance, possibly the best so far, for storing and retrieving values. However, achieving such performance is contingent on finding a good hash function.

Summar y

In this chapter, you learned the following:

Hashing acts as a kind of randomizing function, destroying any sense of order within the data.

A perfect hash function is one that causes no collisions; however, this is hard to achieve.

The particular hashing function to use is largely determined by the nature and characteristics of the input data, which in many cases is difficult, if not impossible, to know in advance. Therefore, finding a hash function that minimizes the number of collisions, rather than eliminates them altogether, is more attainable.

Increasing the table size can reduce the number of collisions at the expense of wasted memory, as can using a prime number for the table size.

Linear probing degenerates into a linked list.

Buckets coupled with a good hash function can achieve O(1) search times.

291

Chapter 11

Exercises

1.Modify BucketingHashtable to always use a prime number of buckets. What effect (if any) does this have on performance?

2.Modify LinearProbingHashtable to maintain the number of values in the table, rather than calculate it every time.

3.Modify BucketingHashtable to maintain the number of values in the table, rather than calculate it every time.

4.Create an iterator that provides access to all of the entries in a BucketingHashtable.

292

12

Sets

Sets are collections that hold only distinct values; a set guarantees that an item will not be added more than once. They are particularly useful in scientific applications but often provide a more sensible structure than lists for holding data when duplicate values are not needed. More often than not when a list is used, a set is probably what is intended.

This chapter discusses the following topics:

The basic operations of a set

A set implementation designed for small amounts of data, the list set

Another implementation that efficiently manages large amounts of unordered data, the hash set

A third type of set that has predictable iteration order, the tree set

Understanding Sets

Think of a set as an unordered pool of data containing no duplicates. This differs from a list, which, like an array, maintains the order of insertion and allows duplicates. Figure 12-1 depicts the set of letters A through K. Notice that there is no explicit ordering of the values.

Sets typically support the operations shown in Table 12-1.