next up previous contents
Next: Exercises Up: String Searching Previous: The Knuth-Morris-Pratt Algorithm

The Boyer-Moore Algorithm

 

Once again, imagine that we are performing a string search, and the pattern is ``AAAB'' and the text is ``AAAXAAAB''.

If we skip ahead and look at the fourth character of the text, we can immediately note that this character appears nowhere in the pattern. Therefore, if there is a match anywhere in this text, it cannot possibly overlap with the fourth character in the text. Since the pattern is of length 4, that means that the smallest possible shift that needs to be considered is 4. Therefore, just by checking a single character, we have completely eliminated the need to look at the first three characters in the text!

This kind of skip is the goal of the Boyer-Moore algorithm- unlike the Knuth-Morris-Pratt algorithm, which strives to look at each character in the text only once, the Boyer-Moore algorithm strives to completely ignore as many characters in the text as possible. The algorithm itself is given in algorithm 8.4.

 

Once again, the algorithm is very simple, but the computation of the skips is not. As we did in the Knuth-Morris-Pratt algorithm, we will precompute an array of skips, but unlike the Knuth-Morris-Pratt algorithm, this array will be two-dimensional- it will take into account both the number of characters successfully matched (j) and the identity of the character in the text (c) that caused the mismatch.

It is much easier to understand the computation of the skips from an intuitive perspective and then see how this algorithm obeys our intuition than it is to prove that this algorithm is correct and complete, so that is how I will proceed.

In the calculation of the skips, there are two basic cases that we need to consider in algorithm 8.4:

  1. For , then the new shift must overlap all of the pattern that we have already successfully matched. Therefore, it must match all of the pattern we have already matched (because otherwise, a mismatch is inevitable). In addition, the character that caused the mismatch in the current shift must match its counterpart in the pattern in the new shift (because, once again, otherwise we know that a mismatch is inevitable if this character does not match the new shift of the pattern).
  2. If d > j, then we are shifting ``past'' the part of the pattern that we have already matched- we are shifting far enough so that the new shift moves past some or all of the characters already matched.

    We have determined that the mismatched character cannot appear anywhere among the characters we have already matched, because there is no shift that contains it and would be consistent with the other characters we have already matched- if there was, we would have already found it when we were investigating the first case. Therefore, we must skip past the characters we have already matched in the text, looking for a way to match them up with the pattern.

    The only question is how far to skip- and this is determined, in a similar manner as the Knuth-Morris-Pratt algorithm, by finding the length of the longest suffix of the matched characters that matches a prefix of the pattern.

Note that once again, we always choose the smallest skip that has not already been proven to lead to a mismatch. This prevents the mishap of skipping past a match.

Implementation of Boyer-Moore

The code that implements the heart of the Boyer-Moore algorithm is shown in figure 8.11, and the code that implements the computation of the skip array is shown in figure 8.12.

As in the implementation of the Knuth-Morris-Pratt algorithm, no special effort is taken to make sure that this implementation is efficient- in particular, the algorithm used to compute the skip array is quite inefficient. It is possible to compute the skip array in time O(M) (rather than as this code does), but the algorithm for doing so is quite complex.gif This means that for long patterns or short texts, this implementation is quite a bit slower than any of the other algorithms shown in this chapter. However, once the text length grows large, the true speed of this algorithm becomes apparent- for texts of more than a few hundred kilobytes, this algorithm is roughly twice as fast as any of the others.gif Clearly, when the text is long, and speed is important, an optimized Boyer-Moore function will give superior performance.

 
Figure 8.11:   Code that implements the Boyer-Moore algorithm.

 
Figure 8.12:   Code to compute the skipArray for the Boyer-Moore algorithm.

A Boyer-Moore Example

This section illustrates the Boyer-Moore algorithm with an example.

Let us begin with an example where the pattern is ACAC and the text is ABACABACAC. In figure 8.13, j and i are shown, along with the pattern shifted by i:

 
Figure 8.13:   An illustration of the Boyer-Moore algorithm.


next up previous contents
Next: Exercises Up: String Searching Previous: The Knuth-Morris-Pratt Algorithm

Dan Ellard
Mon Jul 21 22:30:59 EDT 1997