
Beginning Algorithms (2006)


16

String Searching

The problem of finding one string within another comes up quite often: Searching through files on disk, DNA searches, and even Google rely on strategies for efficiently searching through text. If you’ve ever used a word processor or text editor or even the editor used for writing code, you have at some stage or another performed a string search. You may know it as the Find function.

There are many string searching algorithms — and no doubt many more will be discovered over time — each with its own optimizations for handling specific types of data. Some algorithms work better for plain text, while others work better for text and/or patterns containing a lot of repetition, such as DNA fragments.

This chapter covers two algorithms for plain-text searching. We start with an obvious brute-force algorithm and move on to the more sophisticated Boyer-Moore. Each is described in detail, and then you will see how a relatively simple twist on the brute-force approach enables the Boyer-Moore algorithm to perform significantly faster.

After reading this chapter you should be able to do the following:

Describe and implement a brute-force string searching algorithm

Describe and implement the Boyer-Moore string searching algorithm

Understand the performance characteristics of each algorithm

Describe and implement a generic string match iterator

Describe and implement a simple file searching application

A Generic String Searcher Interface

Because we want to be able to implement various types of string search algorithms and implement our own variations as the need arises, it will be useful to conceive an interface that remains the same no matter what type of underlying mechanism is used. Additionally, because all of the string searches will conform to a single API, we will be able to write a single suite of tests that can be applied to all of them in order to assert their correctness.


Try It Out Creating the Interface

Start by creating this simple interface:

package com.wrox.algorithms.ssearch;

public interface StringSearcher {
    public StringMatch search(CharSequence text, int from);
}

You also need to create the StringMatch class that is used as the return type from search():

package com.wrox.algorithms.ssearch;

public class StringMatch {
    private final CharSequence _pattern;
    private final CharSequence _text;
    private final int _index;

    public StringMatch(CharSequence pattern, CharSequence text, int index) {
        assert text != null : "text can't be null";
        assert pattern != null : "pattern can't be null";
        assert index >= 0 : "index can't be < 0";

        _text = text;
        _pattern = pattern;
        _index = index;
    }

    public CharSequence getPattern() {
        return _pattern;
    }

    public CharSequence getText() {
        return _text;
    }

    public int getIndex() {
        return _index;
    }
}

How It Works

The StringSearcher interface defines a single search() method. This method takes two arguments, the text within which to search and an initial starting position, and returns an object that represents the match (if any), which you will learn more about in just a moment. It is assumed that the pattern to search for will be fixed at construction time for any concrete implementation, and it is therefore not required to be passed as a parameter to search().


Notice that you have used CharSequence instead of String for the text. If you were implementing a word processor, you would most likely use a StringBuffer to hold the text of any edited document. There may be times, however, when you also wish to search through a plain String. Ordinarily, these two classes — String and StringBuffer — have nothing in common, meaning you would need to write two different implementations of each algorithm: one for handling Strings and another version for StringBuffers. Thankfully, the standard Java library provides an interface, CharSequence, that is implemented by both the String and StringBuffer classes, and provides all the methods you need for the two search algorithms.

Each call to search() will return either an instance of StringMatch or null if no match was found. This class encapsulates the concept of a match in a class all of its own, holding not only the position of the match (0, 1, 2, . . .) but also the text and the pattern itself. This way, the result of the search is independent of any other object for its context.
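To make the point about CharSequence concrete, here is a minimal sketch of a method that works unchanged on both a String and a StringBuffer. The class CharSequenceDemo and its count() method are hypothetical names used only for illustration; they are not part of the chapter's code.

```java
// Hypothetical illustration: one method serves String and StringBuffer alike,
// because it relies only on CharSequence.length() and CharSequence.charAt().
final class CharSequenceDemo {
    /** Counts how often the character c occurs in text. */
    static int count(CharSequence text, char c) {
        int n = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == c) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        CharSequence fromString = "String Search";
        CharSequence fromBuffer = new StringBuffer("String Search");
        System.out.println(count(fromString, 'r')); // prints 2
        System.out.println(count(fromBuffer, 'r')); // prints 2
    }
}
```

The search algorithms in this chapter need nothing more than these two methods, which is why a single implementation can serve both kinds of text.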

A Generic Test Suite

Even though string searching is conceptually quite simple, the algorithms contain subtleties that can easily trip things up. As always, the best defense against this is to have tests. These tests will serve as our guarantee of correctness — our safety net to ensure that no matter how sophisticated our algorithms become, the outward behavior is always the same.

You will create several test cases, including tests to do the following: find a pattern at the start of some text; find a pattern at the end of some text; find a pattern in the middle of some text; and find multiple, overlapping occurrences of a pattern. Each one will test some aspect of a string searcher in order to prove its correctness.

Try It Out

Creating the Test Class

All the string searchers in this chapter share common behavior, so you can use our tried and trusted method for creating a generic test suite with hooks for subclassing:

package com.wrox.algorithms.ssearch;

import junit.framework.TestCase;

public abstract class AbstractStringSearcherTestCase extends TestCase {
    protected abstract StringSearcher createSearcher(CharSequence pattern);

    ...
}

The first test case is really the simplest of all possible scenarios: searching within an empty string. Anytime search() is called with a pattern that doesn’t exist within the text, it should return null to indicate that no match has been found. Testing boundary conditions like this is a very important part of writing good-quality code:

public void testNotFoundInAnEmptyText() {
    StringSearcher searcher = createSearcher("NOT FOUND");

    assertNull(searcher.search("", 0));
}


The next scenario searches for a pattern at the very beginning of some text:

public void testFindAtTheStart() {
    String text = "Find me at the start";
    String pattern = "Find";

    StringSearcher searcher = createSearcher(pattern);

    StringMatch match = searcher.search(text, 0);
    assertNotNull(match);
    assertEquals(text, match.getText());
    assertEquals(pattern, match.getPattern());
    assertEquals(0, match.getIndex());

    assertNull(searcher.search(text, match.getIndex() + 1));
}

Having searched for a pattern at the beginning of some text, you next look for one at the end:

public void testFindAtTheEnd() {
    String text = "Find me at the end";
    String pattern = "end";

    StringSearcher searcher = createSearcher(pattern);

    StringMatch match = searcher.search(text, 0);
    assertNotNull(match);
    assertEquals(text, match.getText());
    assertEquals(pattern, match.getPattern());
    assertEquals(15, match.getIndex());

    assertNull(searcher.search(text, match.getIndex() + 1));
}

Next, you test that a pattern in the middle of some text is correctly identified:

public void testFindInTheMiddle() {
    String text = "Find me in the middle of the text";
    String pattern = "middle";

    StringSearcher searcher = createSearcher(pattern);

    StringMatch match = searcher.search(text, 0);
    assertNotNull(match);
    assertEquals(text, match.getText());
    assertEquals(pattern, match.getPattern());
    assertEquals(15, match.getIndex());

    assertNull(searcher.search(text, match.getIndex() + 1));
}

Finally, you want to verify that overlapping matches are found. This doesn’t occur very often in plain text, but you still need to ensure that the algorithm handles it correctly. The test also exercises the searcher’s ability to find multiple matches, something you haven’t done until now:


public void testFindOverlapping() {
    String text = "abcdefffff-fedcba";
    String pattern = "fff";

    StringSearcher searcher = createSearcher(pattern);

    StringMatch match = searcher.search(text, 0);
    assertNotNull(match);
    assertEquals(text, match.getText());
    assertEquals(pattern, match.getPattern());
    assertEquals(5, match.getIndex());

    match = searcher.search(text, match.getIndex() + 1);
    assertNotNull(match);
    assertEquals(text, match.getText());
    assertEquals(pattern, match.getPattern());
    assertEquals(6, match.getIndex());

    match = searcher.search(text, match.getIndex() + 1);
    assertNotNull(match);
    assertEquals(text, match.getText());
    assertEquals(pattern, match.getPattern());
    assertEquals(7, match.getIndex());

    assertNull(searcher.search(text, match.getIndex() + 1));
}

How It Works

All of the string searches you create will encapsulate the pattern for which they are looking — think of them as being a kind of pattern with “smarts,” so createSearcher() declares the pattern as its one and only argument. Then, in each test method, you create a searcher by calling createSearcher() before performing the rest of the test.

The first test searches within an empty text, the result of which should be null to indicate that the pattern wasn’t found.

In the next test, you expect to find a match at the start of the string and therefore ensure that search() returns a non-null value. You then ensure that the details of the match are correct, most importantly the position, which in this case is that of the first character. Looking at the text, you can see that there should be no more matches. This needs to be tested as well, so you initiate a further search, starting one character position to the right of the previous match, and make sure that it returns null.

The third and fourth tests look almost identical to the previous one; only this time the single occurrence of the pattern sits all the way on the right-hand side of the text (the end) and in the middle of the text, respectively, instead of at the left (the beginning).

The last test is somewhat more involved than the previous ones, as this time there are multiple occurrences of the pattern — three, to be precise — all slightly overlapping. The test confirms that the searcher finds all of them and in the correct order.
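The pattern the tests rely on, restarting the search one position to the right of the previous match, can be sketched in isolation. This sketch assumes String.indexOf() as a stand-in for a searcher, since no StringSearcher implementation exists yet at this point; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of collecting overlapping matches by resuming
// each search at (previous match index + 1), as the test case does.
final class OverlappingMatches {
    /** Returns the 0-based indexes of every occurrence of pattern in text. */
    static List<Integer> allMatches(String text, String pattern) {
        List<Integer> found = new ArrayList<>();
        int index = text.indexOf(pattern, 0); // stand-in for searcher.search(text, 0)
        while (index >= 0) {
            found.add(index);
            // Resume just past the start of the last match, not past its end,
            // so that overlapping occurrences are not skipped.
            index = text.indexOf(pattern, index + 1);
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(allMatches("abcdefffff-fedcba", "fff")); // prints [5, 6, 7]
    }
}
```

Resuming at the end of the previous match instead would report only index 5 here, which is why the test advances by exactly one character.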

That’s it for the test cases. You could have written many more tests, but the ones you implement here will give you reasonably good coverage and enable you to turn your attention to the actual business of searching.


A Brute-Force Algorithm

The simplest and most obvious solution is to perform a brute-force scan through the text. This algorithm is quite widely used and actually performs pretty well in most cases. It is also very easy to describe and code.

The brute-force algorithm is very straightforward and can thus be defined in a few simple steps. Imagine overlaying the text with the pattern, starting from the left-hand side and continuing to slide the pattern right one character until a match is found:

1. Start at the first (leftmost) character in the text.

2. Compare, from left to right, each character in the pattern to those in the text. If all of the characters are the same, you have found a match.

3. Otherwise, if you have reached the end of the text, there can be no more matches, and you are done.

4. If neither of the preceding results occurs, move the pattern along one character to the right and repeat from step 2.

The following example shows the brute-force search algorithm in action, looking for the pattern ring in the text String Search. First ring is compared with the substring Stri (clearly not a match), followed by ring with trin, and eventually a match is found on the third attempt. Note the sliding pattern; the brute-force approach must try every position:

    String Search
1   ring
2    ring
3     ring

Now suppose you wanted to continue searching for additional occurrences of ring. You already know the pattern exists at the third character, so there is no point starting from there. Instead, start one character to the right — the fourth character — and follow the same process as before. The following example shows the remaining steps in the search, sliding the pattern across, one position at a time:

    String Search
4      ring
5       ring
6        ring
7         ring
8          ring
9           ring
10           ring

In this example, there are no more occurrences of “ring” within “String Search”, so you eventually run out of text before finding a match. Notice that you didn’t need to move the pattern all the way to the last character; in fact, you can’t move too far or you run out of text. You can see that if you attempt to move beyond the tenth character, you would end up comparing “ring” with “rch”. You know these two strings could never match because they are different sizes (one is four characters long and the other is three); therefore, you only ever need to move the pattern until it lines up with the end of the text.


It’s quite easy to determine how far you need to search before you run out of text characters to compare: for any pattern of length M and text of length N, you never need to move beyond the character at position N – M + 1. In the case of our example, the length of the text is 13 and the length of the pattern is 4, giving 13 – 4 + 1 = 10, just what you saw in the example.
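The bound can be checked with a few lines of code. The sketch below uses hypothetical names and is not part of the chapter's code; note that the prose counts positions from 1, whereas Java indexes from 0, so the last usable index is N – M.

```java
// Hypothetical illustration of the N - M + 1 bound on starting positions.
final class SearchBounds {
    /** Last 1-based position at which the pattern still fits inside the text. */
    static int lastPosition(CharSequence text, CharSequence pattern) {
        return text.length() - pattern.length() + 1;
    }

    /** The same bound as a 0-based index, as a Java loop would use it. */
    static int lastIndex(CharSequence text, CharSequence pattern) {
        return text.length() - pattern.length();
    }

    public static void main(String[] args) {
        System.out.println(lastPosition("String Search", "ring")); // prints 10
        System.out.println(lastIndex("String Search", "ring"));    // prints 9
    }
}
```

A result below 1 (or a negative index) means the pattern is longer than the text and can never match.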

Now that you understand how the algorithm works, you can go ahead and implement it in code. You also want to create some tests to make sure you get your algorithm right.

Try It Out

Creating the Test Class

You’ve already done the hard work of creating the actual test case earlier in the chapter. At that time, we described how you might go about re-using the test cases you created. Now is your chance to try it out:

package com.wrox.algorithms.ssearch;

public class BruteForceStringSearcherTest extends AbstractStringSearcherTestCase {
    protected StringSearcher createSearcher(CharSequence pattern) {
        return new BruteForceStringSearcher(pattern);
    }
}

How It Works

By extending AbstractStringSearcherTestCase, the test class inherits all the predefined test methods, meaning you don’t have to do much at all besides construct an instance of your specific searcher class — in this case, BruteForceStringSearcher — with the specified pattern.

Try It Out

Implementing the Algorithm

Next you create the BruteForceStringSearcher class as shown here:

package com.wrox.algorithms.ssearch;

public class BruteForceStringSearcher implements StringSearcher {
    private final CharSequence _pattern;

    public BruteForceStringSearcher(CharSequence pattern) {
        assert pattern != null : "pattern can't be null";
        assert pattern.length() > 0 : "pattern can't be empty";
        _pattern = pattern;
    }

    public StringMatch search(CharSequence text, int from) {
        assert text != null : "text can't be null";
        assert from >= 0 : "from can't be < 0";

        int s = from;

        // Outer loop: slide the pattern while it still fits inside the text.
        while (s <= text.length() - _pattern.length()) {
            int i = 0;

            // Inner loop: compare left-to-right until a mismatch or the end of the pattern.
            while (i < _pattern.length()
                    && _pattern.charAt(i) == text.charAt(s + i)) {
                ++i;
            }

            if (i == _pattern.length()) {
                return new StringMatch(_pattern, text, s);
            }

            ++s;
        }

        return null;
    }
}

How It Works

The BruteForceStringSearcher class implements the StringSearcher interface you defined earlier. The constructor performs a bit of sanity checking, such as ensuring that a pattern was actually passed, and if so, that it contains at least one character, and then it stores a reference to the pattern for later use.

The search() method contains two nested loops that control the algorithm: The outer while loop controls how far the algorithm proceeds through the text, and the inner while loop performs the actual left-to-right character comparison between the pattern and the text.

When the inner loop terminates, if all the characters in the pattern compared successfully, then a match is returned. Conversely, if a mismatch was encountered, the current position within the text is incremented by one and the outer loop continues. This process repeats until either a match is found or there is no more text to process, in which case null is returned to indicate there are no further matches.

As discussed earlier, this algorithm is called brute-force for a reason: There are no tricks, no shortcuts, and no optimizations that you have made to try to reduce the number of comparisons made. In the worst case, you would compare every character of the pattern with (almost) every character of the text, making the worst-case running time O(NM)! In practice, however, the performance is much better, as demonstrated toward the end of the chapter.
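To get a feel for the gap between the worst case and typical plain text, the following sketch counts individual character comparisons for a brute-force scan over the whole text. It is a hypothetical instrumented variant written for illustration, not the chapter's BruteForceStringSearcher, and it keeps scanning after a match instead of returning.

```java
// Hypothetical illustration: counts character comparisons made by a
// brute-force scan that tries every starting position in the text.
final class BruteForceCost {
    static int countComparisons(CharSequence text, CharSequence pattern) {
        int comparisons = 0;
        for (int s = 0; s <= text.length() - pattern.length(); s++) {
            for (int i = 0; i < pattern.length(); i++) {
                comparisons++;
                if (pattern.charAt(i) != text.charAt(s + i)) {
                    break; // mismatch: give up on this starting position
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        // Plain text: most positions fail on the very first character.
        System.out.println(countComparisons("String Search", "ring")); // prints 13
        // Repetitive input: every position fails only on the last character.
        System.out.println(countComparisons("aaaaaaaaaa", "aaab"));    // prints 28
    }
}
```

In the second call each of the 7 starting positions costs all 4 comparisons, which is the (N – M + 1) × M behavior behind the O(NM) bound; in the first, 10 positions cost only 13 comparisons in total.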

The Boyer-Moore Algorithm

Although the brute-force approach works fairly well, you have seen that it is far from optimal. Even in the average case, there are numerous false starts and partial matches. However, with a few simple enhancements, you can do much better.

Two men — R. S. Boyer and J. S. Moore — came up with an algorithm that has become the basis for some of the fastest string searching algorithms currently available. They observed that many of the moves made in the brute-force algorithm were redundant. In many cases, the characters in the text don’t even exist within the pattern, in which case it should be possible to skip them entirely.

The following example shows the original search, this time using the Boyer-Moore algorithm. Note how large portions of the text have been skipped, reducing the total number of string comparisons to 4. Compare this with the brute-force algorithm, which performed a total of 10!

    String Search
1   ring
3     ring
4      ring
8          ring

The secret is in knowing how many places to shift when you find a mismatch. You can determine this by analyzing the pattern itself. Each time you encounter a failed match, you search the pattern for the last (rightmost) occurrence of the offending character and proceed according to the bad-character heuristic:

The original Boyer-Moore algorithm also makes use of a second heuristic, the good-suffix heuristic. However, much of the literature on the subject suggests that this can safely be ignored, as it only improves performance for very long or repetitive patterns. For the purposes of this discussion, we focus purely on the simplified version.

1. If the character exists within the pattern, you shift right enough places to align the character in the pattern with the one in the text. In the example, after an unsuccessful first comparison of a g with an i, you determine that i exists within the pattern, so you move right two places until they meet.

2. If the character doesn’t exist within the pattern, you shift right enough places to move just beyond it. Position 4 in our example compares a g with a space. The pattern itself contains no spaces at all, so you move right four places to skip past it completely.

3. Whenever the heuristic proposes a negative shift, and in this case only, you resort to the naive approach of moving right one position before returning to the Boyer-Moore algorithm proper.
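The three rules above can be sketched as a single shift calculation. This is a minimal illustration under stated assumptions, not the implementation developed in this chapter: the class name, the use of a HashMap for the last-occurrence table, and the method names are all hypothetical choices.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the simplified bad-character heuristic.
final class BadCharacterSketch {
    /** Maps each pattern character to the index of its last (rightmost) occurrence. */
    static Map<Character, Integer> lastOccurrences(CharSequence pattern) {
        Map<Character, Integer> last = new HashMap<>();
        for (int i = 0; i < pattern.length(); i++) {
            last.put(pattern.charAt(i), i); // later occurrences overwrite earlier ones
        }
        return last;
    }

    /**
     * Shift to apply after the pattern character at index i failed to match
     * the text character c. Absent characters count as index -1 (rule 2);
     * a proposed shift below 1 falls back to a single step (rule 3).
     */
    static int shift(Map<Character, Integer> last, int i, char c) {
        int proposed = i - last.getOrDefault(c, -1);
        return Math.max(proposed, 1);
    }

    public static void main(String[] args) {
        Map<Character, Integer> ring = lastOccurrences("ring");
        System.out.println(shift(ring, 3, 'i')); // rule 1: prints 2
        System.out.println(shift(ring, 3, ' ')); // rule 2: prints 4
        Map<Character, Integer> over = lastOccurrences("over");
        System.out.println(shift(over, 0, 'e')); // rule 3: prints 1
    }
}
```

Because the table depends only on the pattern, it can be built once when the searcher is constructed and reused for every search.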

This last point probably needs a little more explanation. Imagine you were searching for the pattern over in the text everything:

everything
over------

Starting from right to left in the pattern, you first compare r and then e and then v until you eventually encounter a mismatch between o and e. If you were to blindly follow the heuristic, you would discover that in this case the heuristic proposes a move backwards:

--everything
over--------

The pattern does contain an “e” but it is to the right of the mismatch. Clearly, this isn’t what you want. Therefore, given our example, you would shift the pattern one position to the right and continue by comparing, right-to-left, the characters in “over” with “very”, and so on:

everything
-over-----
-----over-

There are actually slightly more efficient ways to handle this case, rather than simply moving one character position to the right, but we have tried to keep the algorithm as simple as possible. Unfortunately, it does mean that in the worst case, our Boyer-Moore implementation performs no better than brute-force, but in practice it performs considerably better.
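The walk-throughs in this section can be reproduced with a short trace that records every position at which the pattern is tried. The sketch below is hypothetical and deliberately compact: it assumes String arguments so that String.lastIndexOf() can stand in for a precomputed last-occurrence table, and it continues one position to the right after a full match.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical trace of the simplified Boyer-Moore search described above.
final class AlignmentTrace {
    /** Returns the 0-based text positions at which the pattern is tried. */
    static List<Integer> alignments(String text, String pattern) {
        List<Integer> tried = new ArrayList<>();
        int s = 0;
        while (s <= text.length() - pattern.length()) {
            tried.add(s);
            int i = pattern.length() - 1;
            // Compare right-to-left until a mismatch or the pattern is exhausted.
            while (i >= 0 && pattern.charAt(i) == text.charAt(s + i)) {
                i--;
            }
            if (i < 0) {
                s += 1; // full match: resume one position to the right
            } else {
                // lastIndexOf() returns -1 when the character is absent,
                // which yields a shift just past the offending character.
                int proposed = i - pattern.lastIndexOf(text.charAt(s + i));
                s += Math.max(proposed, 1); // never move backwards
            }
        }
        return tried;
    }

    public static void main(String[] args) {
        System.out.println(alignments("String Search", "ring")); // prints [0, 2, 3, 7]
        System.out.println(alignments("everything", "over"));    // prints [0, 1, 5]
    }
}
```

The first result corresponds to positions 1, 3, 4, and 8 in the earlier diagram (which counts from 1), and the second to the three alignments of over against everything shown above.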
