
Beginning Algorithms (2006)


Chapter 16

The brute-force algorithm works by scanning from left to right one position at a time until a match is found. Given that in the worst case you must compare every character of the pattern with almost every character of the text, the worst-case running time is O(NM), which is particularly nasty! The ideal scenario for the brute-force approach is one in which the first character comparison fails every time, right up until a successful match at the end of the text. The running time of this best case is therefore O(N + M).
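A minimal sketch of the brute-force scan described above, assuming the goal is the index of the first match. The class and method names are illustrative and are not taken from the book's code:

```java
final class BruteForceSearchSketch {
    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(CharSequence text, CharSequence pattern) {
        int n = text.length();
        int m = pattern.length();
        // Try every alignment of the pattern against the text, moving one position at a time.
        for (int start = 0; start + m <= n; start++) {
            int i = 0;
            // Compare left to right until a mismatch or the pattern is exhausted.
            while (i < m && text.charAt(start + i) == pattern.charAt(i)) {
                i++;
            }
            if (i == m) {
                return start; // every pattern character matched
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("a needle in a haystack", "needle")); // prints 2
        System.out.println(indexOf("haystack", "needle"));               // prints -1
    }
}
```

The nested loops make the O(NM) worst case visible: the outer loop runs over roughly N alignments and the inner loop can run up to M comparisons for each.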

The Boyer-Moore algorithm performs character comparisons from the right to the left of the pattern, and skips multiple character positions each time. It has a worst-case running time that is as bad as or slightly worse than (due to the overhead of the initial pattern processing) the brute-force algorithm. In practice, however, it performs remarkably better than the brute-force algorithm and can achieve a best-case running time of O(N/M) when it can continually skip the entire pattern right up until the end.
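The right-to-left comparison and the skipping can be sketched as follows. This is a simplified variant that uses only a last-occurrence table (the "bad character" rule) and omits the good-suffix rule of the full Boyer-Moore algorithm; the names are illustrative and it is not presented as the implementation from Chapter 16:

```java
import java.util.HashMap;
import java.util.Map;

final class BoyerMooreSketch {
    /** Returns the index of the first occurrence of pattern in text, or -1 if there is none. */
    static int indexOf(CharSequence text, CharSequence pattern) {
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return 0;
        }

        // Initial pattern processing: the last index at which each character occurs.
        Map<Character, Integer> last = new HashMap<>();
        for (int i = 0; i < m; i++) {
            last.put(pattern.charAt(i), i);
        }

        int start = 0;
        while (start + m <= n) {
            // Compare from the right end of the pattern toward the left.
            int i = m - 1;
            while (i >= 0 && text.charAt(start + i) == pattern.charAt(i)) {
                i--;
            }
            if (i < 0) {
                return start;
            }
            // Shift so the mismatched text character lines up with its last
            // occurrence in the pattern, or past it if it does not occur at all.
            int lastIndex = last.getOrDefault(text.charAt(start + i), -1);
            start += Math.max(1, i - lastIndex);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(indexOf("a needle in a haystack", "needle")); // prints 2
    }
}
```

When the mismatched text character does not occur in the pattern at all and the mismatch happens on the first comparison, the shift is the full pattern length, which is where the O(N/M) best case comes from.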

You can implement an iterator that avoids the cumbersome state management associated with performing repeated searches. Because the iterator depends only on the StringSearcher interface, you can use it with any string searcher. For example, if you need to search through different types of text whose characteristics require sophisticated and varied string searching algorithms, you can plug a new algorithm into the iterator, provided it conforms to the StringSearcher interface, while leaving all your application code as is, oblivious to the change in search technique.
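As a rough illustration of that idea, the sketch below wraps a searcher in a standard java.util.Iterator. The StringSearcher signature shown here is a hypothetical stand-in (this section does not reproduce the real interface), and the class names are invented for the example:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class MatchIteratorSketch {
    /** Hypothetical stand-in; the actual StringSearcher signature may differ. */
    interface StringSearcher {
        /** Returns the index of the first match at or after from, or -1 if there is none. */
        int search(CharSequence text, int from);
    }

    /** Yields the start index of each successive match, hiding the position bookkeeping. */
    static final class MatchIterator implements Iterator<Integer> {
        private final StringSearcher searcher;
        private final CharSequence text;
        private int next;

        MatchIterator(StringSearcher searcher, CharSequence text) {
            this.searcher = searcher;
            this.text = text;
            this.next = searcher.search(text, 0);
        }

        @Override
        public boolean hasNext() {
            return next >= 0;
        }

        @Override
        public Integer next() {
            if (next < 0) {
                throw new NoSuchElementException();
            }
            int current = next;
            // Resume one position past the current match so overlapping matches are found.
            next = searcher.search(text, current + 1);
            return current;
        }
    }

    static List<Integer> allMatches(StringSearcher searcher, CharSequence text) {
        List<Integer> result = new ArrayList<>();
        for (Iterator<Integer> it = new MatchIterator(searcher, text); it.hasNext();) {
            result.add(it.next());
        }
        return result;
    }

    public static void main(String[] args) {
        // Any searcher will do; this one delegates to String.indexOf for brevity.
        StringSearcher searcher = (text, from) -> text.toString().indexOf("an", from);
        System.out.println(allMatches(searcher, "banana bandana")); // prints [1, 3, 8, 11]
    }
}
```

Swapping in a different search algorithm means changing only the object passed to the iterator; the loop that consumes the matches stays the same.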

You compared the two algorithms by searching for various English words in a relatively large (~3MB) text file. Obviously, real-world results will vary depending on the type of text to search, the make-up and length of the pattern, and so on. Overall, it is hoped that you can see how, with a little thinking and a little effort, you can achieve almost an order of magnitude improvement in performance between the brute-force and the Boyer-Moore search algorithms.

There are many other well-known string searching algorithms that we haven’t discussed, with Rabin-Karp [Cormen, 2001] and Knuth-Morris-Pratt [Cormen, 2001] being the first to spring to mind. Neither of these performs nearly as well as Boyer-Moore in most applications, and they can often be no better than the brute-force approach for plain-text searching. Rabin-Karp, which uses a clever hashing scheme, is useful for searching multiple patterns at once. Whatever the application, the important thing is to analyze the type of text you are searching through and identify the characteristics that will enable you to avoid many of the obviously unnecessary comparisons.


17

String Matching

Chapter 16 concentrated on efficient techniques for finding one string within another. This chapter focuses on matching whole strings, and, in particular, attempting to find matches between nonidentical yet similar strings. This can be very useful for detecting duplicate entries in a database, spell-checking documents, and even searching for genes in DNA.

This chapter discusses the following topics:

Understanding Soundex

Understanding Levenshtein word distance

Understanding Soundex

Soundex encoding is one of a class of algorithms known as phonetic encoding algorithms. Phonetic encoding takes strings and converts similar sounding words into the same encoded value (much like a hash function).

Soundex, patented by Robert C. Russell in 1918 and later used to index U.S. census records, is also known as the Russell Soundex algorithm. It has been used in its original form and with many variations in numerous applications, ranging from human resource management to genealogy and, of course, census taking, in an attempt to eliminate data duplication that occurs because of differences in the spelling of people’s surnames.

In 1970, Robert L. Taft, working as part of the New York State Identification and Intelligence System (NYSIIS) project, published a paper titled “Name Search Techniques,” in which he presented findings on two phonetic encoding schemes. One of these was Soundex, the other an algorithm developed by the NYSIIS project based on extensive statistical analysis of real data. The project concluded that Soundex was 95.99% accurate with a selectivity of 0.213% per search, whereas the new system (not presented here) was 98.72% accurate with a selectivity of 0.164% per search.

Other phonetic encoding schemes include Metaphone, Double-Metaphone, and many variations on the original Soundex.

Chapter 17

The Soundex algorithm is quite straightforward and fairly simple to understand. It involves a number of rules for processing an input string. The input string, usually a surname or the like, is processed from left to right, with a transformation applied to each character to produce a four-character code of the form LDDD, where L represents a letter and D represents a decimal digit in the range 0 to 6.

Each input character is transformed according to one or more of the following rules (look for the relationships within each group of letters):

1. All characters are processed as if they were uppercase.

2. Always use the first letter.

3. Drop all other characters if they are A, E, I, O, U, H, W, or Y.

4. Translate the remaining characters as follows:

   B, F, P, and V to 1

   C, G, J, K, Q, S, X, and Z to 2

   D and T to 3

   L to 4

   M and N to 5

   R to 6

5. Drop consecutive letters having the same code.

6. Pad with zeros if necessary.

After taking the first letter, you drop all the vowels. In English, it is often still possible to read most words after all of the vowels have been removed. Notice that H, W, and Y are also ignored, as their pronunciation is often the same as a vowel sound.

The letters B, F, P, and V are also similar, not only in pronunciation but also in the shape of your mouth when making the sound. Try saying B followed by P. The same can be said for T and D as well as M and N, and so on.

Also notice that you ignore consecutive letters with the same code. This makes sense because double letters in English often sound the same as a single letter.
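The six rules listed earlier can be sketched in a few lines of Java. This is a minimal illustration rather than the encoder developed later in the chapter: the class name is invented, the input is assumed to be non-empty, and treating any character outside A to Z as dropped is an assumption of this sketch:

```java
final class SoundexRulesSketch {
    // Digit for each letter A..Z; '0' marks the letters that rule 3 drops.
    private static final String CODES = "01230120022455012623010202";

    /** Encodes a non-empty name as a four-character code of the form LDDD. */
    static String encode(String name) {
        StringBuilder result = new StringBuilder();
        // Rules 1 and 2: work in uppercase and always keep the first letter.
        result.append(Character.toUpperCase(name.charAt(0)));

        char previous = 0; // last digit emitted; nothing yet
        for (int i = 1; i < name.length() && result.length() < 4; i++) {
            char c = Character.toUpperCase(name.charAt(i));
            // Rules 3 and 4: look up the digit; non-letters are dropped (assumption).
            char code = (c >= 'A' && c <= 'Z') ? CODES.charAt(c - 'A') : '0';
            // Rule 5: skip a letter whose code repeats the previous digit.
            if (code != '0' && code != previous) {
                result.append(code);
                previous = code;
            }
        }

        // Rule 6: pad with zeros to four characters.
        while (result.length() < 4) {
            result.append('0');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Smith"));  // prints S530
        System.out.println(encode("Smythe")); // prints S530
    }
}
```

Each input character is visited at most once, so the work grows linearly with the length of the name.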

To give you an idea of how the encoding works in practice, take the surnames Smith and Smythe and see how you would encode them using the Soundex algorithm.

Start by initializing a result buffer with space for four characters — the maximum length of a Soundex code is four — as shown in Figure 17-1. You then start processing the input string one character at a time from left to right.

You know from rule 2 that you always copy the first character from the input string into the first character of the result buffer, so you copy across the S as shown in Figure 17-2.

The next character in the input string is m. Rule 4 says this should be encoded as a 5. Figure 17-3 shows the 5 being placed into the second character position of the result.


Input S m i t h

Result

Figure 17-1: Start by initializing a result buffer with space for four characters.

Input S m i t h

Result S

Figure 17-2: The first input string character is always used as the first character in the result.

Input S m i t h

Result S 5

Figure 17-3: An m is encoded as a 5.

The third input string character position contains an i, which according to rule 3 should be ignored (along with any other vowels), and therefore does not contribute to the result (see Figure 17-4).

Input S m i t h

Result S 5

Figure 17-4: All vowels are ignored.

Following the i is the letter t, which according to the algorithm is encoded as a 3. In this example, it goes into the result at position 3, as shown in Figure 17-5.

Input S m i t h

Result S 5 3

Figure 17-5: A t is encoded as a 3.


The last character, h, is a special character that is treated as if it were a vowel and is therefore ignored (see Figure 17-6).

Input S m i t h

Result S 5 3

Figure 17-6: H, W, and Y are all treated as vowels and hence ignored.

You’ve run out of input characters but you haven’t filled the result buffer, so following rule 6, you pad the remainder with zeros. Figure 17-7 shows that the Soundex value for the character string Smith is S530.

Input S m i t h

Result S 5 3 0

Figure 17-7: The result is padded with zeros to achieve the required four characters.

Now take a quick look at encoding Smythe. You start off as you did previously, with a result buffer of length four, as shown in Figure 17-8.

Input

S m y t h e

Result

Figure 17-8: Again, begin by initializing a result buffer with space for four characters.

We’re not going to show you each step in the process this time; you can do this easily enough for yourself. Instead, we’ve summarized the result, shown in Figure 17-9.

Input S m y t h e

Result S 5 3 0

Figure 17-9: The final encoding for “Smythe”.


Figure 17-9 shows that Smythe encodes as S530, as did Smith. If you were creating a database index using the Soundex for surnames, then a search for Smith would also return any records with Smythe and vice versa, exactly what you would hope for in a system designed to catch spelling mistakes and find people with similar names.
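To make the index idea concrete, here is a small sketch that groups surnames by their Soundex code in a map. The class and method names are invented for the example, and the codes are copied from the chapter's examples and tests rather than computed here:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SoundexIndexSketch {
    /** Groups surnames by code; each entry is a {surname, code} pair. */
    static Map<String, List<String>> buildIndex(String[][] surnameAndCode) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] entry : surnameAndCode) {
            // All surnames sharing a code end up in the same bucket.
            index.computeIfAbsent(entry[1], code -> new ArrayList<>()).add(entry[0]);
        }
        return index;
    }

    public static void main(String[] args) {
        // Codes taken from the chapter, not computed by this sketch.
        String[][] records = {
            {"Smith", "S530"}, {"Smythe", "S530"},
            {"Harris", "H620"}, {"Harrys", "H620"},
        };
        Map<String, List<String>> index = buildIndex(records);

        // A search for Smith looks up its code and finds Smythe as well.
        System.out.println(index.get("S530")); // prints [Smith, Smythe]
    }
}
```

In a real system the code would be computed by the encoder when a record is stored and again when a search term arrives, so both sides of the lookup use the same key.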

Although not a huge concern in this particular instance, the algorithm clearly runs in O(N) time, as only one pass over the string is ever made.

Now that you have a feel for how the Soundex algorithm works in theory, in the next Try It Out section you write some tests to ensure you get your actual implementation right.

Try It Out

Testing the Soundex Encoder

Create the test class as follows (there are quite a few rules and you want to cover as many as possible to ensure that you implement the algorithm correctly):

package com.wrox.algorithms.wmatch;

import junit.framework.TestCase;

public class SoundexPhoneticEncoderTest extends TestCase {
    private SoundexPhoneticEncoder _encoder;

    protected void setUp() throws Exception {
        super.setUp();

        _encoder = SoundexPhoneticEncoder.INSTANCE;
    }

    public void testFirstLetterIsAlwaysUsed() {
        for (char c = 'A'; c <= 'Z'; ++c) {
            String result = _encoder.encode(c + "-");

            assertNotNull(result);
            assertEquals(4, result.length());

            assertEquals(c, result.charAt(0));
        }
    }

    public void testVowelsAreIgnored() {
        assertAllEquals('0', new char[] {'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'});
    }

    public void testLettersRepresentedByOne() {
        assertAllEquals('1', new char[] {'B', 'F', 'P', 'V'});
    }

    public void testLettersRepresentedByTwo() {
        assertAllEquals('2', new char[] {'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'});
    }

    public void testLettersRepresentedByThree() {
        assertAllEquals('3', new char[] {'D', 'T'});
    }

    public void testLettersRepresentedByFour() {
        assertAllEquals('4', new char[] {'L'});
    }

    public void testLettersRepresentedByFive() {
        assertAllEquals('5', new char[] {'M', 'N'});
    }

    public void testLettersRepresentedBySix() {
        assertAllEquals('6', new char[] {'R'});
    }

    public void testDuplicateCodesAreDropped() {
        assertEquals("B100", _encoder.encode("BFPV"));
        assertEquals("C200", _encoder.encode("CGJKQSXZ"));
        assertEquals("D300", _encoder.encode("DDT"));
        assertEquals("L400", _encoder.encode("LLL"));
        assertEquals("M500", _encoder.encode("MNMN"));
        assertEquals("R600", _encoder.encode("RRR"));
    }

    public void testSomeRealStrings() {
        assertEquals("S530", _encoder.encode("Smith"));
        assertEquals("S530", _encoder.encode("Smythe"));
        assertEquals("M235", _encoder.encode("McDonald"));
        assertEquals("M235", _encoder.encode("MacDonald"));
        assertEquals("H620", _encoder.encode("Harris"));
        assertEquals("H620", _encoder.encode("Harrys"));
    }

    private void assertAllEquals(char expectedValue, char[] chars) {
        for (int i = 0; i < chars.length; ++i) {
            char c = chars[i];

            String result = _encoder.encode("-" + c);

            assertNotNull(result);
            assertEquals(4, result.length());

            assertEquals("-" + expectedValue + "00", result);
        }
    }
}

How It Works

The SoundexPhoneticEncoderTest class holds an instance of a SoundexPhoneticEncoder that is initialized in setUp() and used by the test cases:

package com.wrox.algorithms.wmatch;

import junit.framework.TestCase;

public class SoundexPhoneticEncoderTest extends TestCase {
    private SoundexPhoneticEncoder _encoder;

    protected void setUp() throws Exception {
        super.setUp();

        _encoder = SoundexPhoneticEncoder.INSTANCE;
    }

    ...
}

Rule 2 says that you must always use the first letter under any circumstances, so you start by testing this assumption. The testFirstLetterIsAlwaysUsed() method cycles through each character from A to Z, encoding it as the first character of a string. Once encoded, you then ensure that the return string is not null and that the length is four — all Soundex values must be four characters in length. You then verify that the first character of the result is the same as the one used in the input string:

public void testFirstLetterIsAlwaysUsed() {
    for (char c = 'A'; c <= 'Z'; ++c) {
        String result = _encoder.encode(c + "-");

        assertNotNull(result);
        assertEquals(4, result.length());

        assertEquals(c, result.charAt(0));
    }
}

The tests for the remaining rules all look pretty much the same, and use a helper method to do most of the work. The method assertAllEquals() accepts an expected value and an array of characters to use. Each character is used as the second letter in a two-letter input string, which is encoded. Again the return value is checked for null and to ensure it has the correct length. The encoded value is then compared with the expected result. In all cases, the first character should have remained unchanged, and because we only encoded a two-character string, the last two digits will always be padded with zeros. This leaves only the second character from the result to be checked, and in this case the expected value is 0, indicating that the input character was ignored:

private void assertAllEquals(char expectedValue, char[] chars) {
    for (int i = 0; i < chars.length; ++i) {
        char c = chars[i];

        String result = _encoder.encode("-" + c);

        assertNotNull(result);
        assertEquals(4, result.length());

        assertEquals("-" + expectedValue + "00", result);
    }
}

Rule 3 says that you must drop all vowels, including some special letters that sound like vowels. The method testVowelsAreIgnored() checks this by constructing a string containing nothing but an arbitrary first character (which is always copied as is) followed by a single vowel. After encoding, you expect the last three characters of the encoded value to be "000", indicating that the vowel has been ignored and the result was therefore padded to fill the remaining character spaces:

public void testVowelsAreIgnored() {
    assertAllEquals('0', new char[] {'A', 'E', 'I', 'O', 'U', 'H', 'W', 'Y'});
}

You also tested each of the six cases for rule 4. In each case, you called assertAllEquals(), passing in the expected value and the set of input characters:

public void testLettersRepresentedByOne() {
    assertAllEquals('1', new char[] {'B', 'F', 'P', 'V'});
}

public void testLettersRepresentedByTwo() {
    assertAllEquals('2', new char[] {'C', 'G', 'J', 'K', 'Q', 'S', 'X', 'Z'});
}

public void testLettersRepresentedByThree() {
    assertAllEquals('3', new char[] {'D', 'T'});
}

public void testLettersRepresentedByFour() {
    assertAllEquals('4', new char[] {'L'});
}

public void testLettersRepresentedByFive() {
    assertAllEquals('5', new char[] {'M', 'N'});
}

public void testLettersRepresentedBySix() {
    assertAllEquals('6', new char[] {'R'});
}

Rule 5 specifies that we should drop consecutive letters having the same code, although how testDuplicateCodesAreDropped() checks this may not be as obvious as with the other tests.

Essentially, you take each group of letters and use them to form a string. You know, of course, that the first letter will be used directly. You also know that the second letter will be encoded — none of the letters in the test are vowels — but because the third and subsequent letters all code the same as the second, you expect them to be ignored, ensuring that the last two digits of the encoded string will be zeros:

public void testDuplicateCodesAreDropped() {
    assertEquals("B100", _encoder.encode("BFPV"));
    assertEquals("C200", _encoder.encode("CGJKQSXZ"));
    assertEquals("D300", _encoder.encode("DDT"));
    assertEquals("L400", _encoder.encode("LLL"));
    assertEquals("M500", _encoder.encode("MNMN"));
    assertEquals("R600", _encoder.encode("RRR"));
}

Finally, testSomeRealStrings() takes three pairs of names, each pair encoding to the same value, and validates the result:


public void testSomeRealStrings() {
    assertEquals("S530", _encoder.encode("Smith"));
    assertEquals("S530", _encoder.encode("Smythe"));
    assertEquals("M235", _encoder.encode("McDonald"));
    assertEquals("M235", _encoder.encode("MacDonald"));
    assertEquals("H620", _encoder.encode("Harris"));
    assertEquals("H620", _encoder.encode("Harrys"));
}

Now that you’re confident you have a test suite sufficient to ensure the correctness of your implementation, in the next Try It Out section you write the actual Soundex encoder.

Try It Out

Implementing the Soundex Encoder

Start by creating an interface definition common to any phonetic encoder:

package com.wrox.algorithms.wmatch;

public interface PhoneticEncoder {
    public String encode(CharSequence string);
}
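One possible benefit of coding against the interface is that comparison logic can be written once and reused with any encoder. The sketch below mirrors the interface so that it compiles on its own; soundsAlike() and the DROP_VOWELS toy encoder are hypothetical names invented for this example, and the toy encoder is deliberately not Soundex:

```java
final class PhoneticEncoderUsageSketch {
    /** Mirrors the PhoneticEncoder interface so this sketch is self-contained. */
    interface PhoneticEncoder {
        String encode(CharSequence string);
    }

    /** Hypothetical toy encoder (not Soundex): uppercases and drops vowels after the first character. */
    static final PhoneticEncoder DROP_VOWELS = s -> {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = Character.toUpperCase(s.charAt(i));
            if (i == 0 || "AEIOU".indexOf(c) < 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    };

    /** True when both names receive the same code from the given encoder. */
    static boolean soundsAlike(PhoneticEncoder encoder, CharSequence a, CharSequence b) {
        return encoder.encode(a).equals(encoder.encode(b));
    }

    public static void main(String[] args) {
        System.out.println(soundsAlike(DROP_VOWELS, "Harris", "Horris")); // prints true
        System.out.println(soundsAlike(DROP_VOWELS, "Harris", "Smith"));  // prints false
    }
}
```

Passing a Soundex encoder instead of the toy one would require no change to soundsAlike().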

Then create the Soundex encoder class as follows:

package com.wrox.algorithms.wmatch;

public final class SoundexPhoneticEncoder implements PhoneticEncoder {
    public static final SoundexPhoneticEncoder INSTANCE =
            new SoundexPhoneticEncoder();

    private static final char[] CHARACTER_MAP =
            "01230120022455012623010202".toCharArray();

    private SoundexPhoneticEncoder() {
    }

    public String encode(CharSequence string) {
        assert string != null : "string can't be null";
        assert string.length() > 0 : "string can't be empty";

        char[] result = {'0', '0', '0', '0'};

        result[0] = Character.toUpperCase(string.charAt(0));

        int stringIndex = 1;
        int resultIndex = 1;

        while (stringIndex < string.length() && resultIndex < result.length) {
            char c = map(string.charAt(stringIndex));

            if (c != '0' && c != result[resultIndex - 1]) {
                result[resultIndex] = c;
                ++resultIndex;
