The Knuth-Morris-Pratt algorithm ensures that a string search will never require more than N character comparisons (once some precomputation is performed).
Imagine that we are performing a string search, and the pattern is ``AAAB'' and the text is ``AAAXAAAAA''. If we use the brute-force algorithm, then the first test will fail when the ``B'' in the pattern fails to match the fourth character in the text, which is an ``X''. At this point, the brute-force algorithm will shift the pattern by one position and start over.
However, this seems silly. In our first attempt to match by brute-force, we learned some key information about text- because we managed to successfully match three characters in the string, we know what those three characters are and can reuse this information. More importantly, we know implicitly what the first two characters are of the next substring that we will try- so there is no need to explicitly check them.
The Knuth-Morris-Pratt string searching algorithm uses this information to reduce the number of times it compares each character in the text to a character in the pattern. If properly implemented, the Knuth-Morris-Pratt algorithm only looks at each character in the text once.
The Knuth-Morris-Pratt algorithm, stated in terms of the brute-force algorithm, is given in algorithm 8.3.
The basic form of the matching algorithm is very similar to the brute-force algorithm- except that instead of always incrementing i and setting j back to zero at the end of every failed substring test, we set j back only as far as the beginning of the possible match, and increment i by the corresponding amount.
The difficulty is in figuring out how far to skip when a mismatch occurs. The trick is to precompute a table of skips for each prefix ahead of time, so that in the second step of the algorithm we know immediately how much to increment i and decrement j. Note that this precomputation is based entirely upon P, so the cost of this precomputation is not related to the length of the text.
Before writing the code, however, we must clarify exactly what this table of skips will contain. If we have successfully matched x characters in the text against x characters in the pattern, and a mismatch is detected at the next character, how far should we skip? The goal is to skip as far as possible without missing any potential matches.
This turns out to be the length of the longest prefix of fewer than x characters in the pattern that also appears as a suffix of the first x characters in the pattern. If there is no such prefix, then we know that we can skip over x characters.
It is extremely important to note that the text itself is not considered- if we have already matched x characters in the text, then we know what that text is in terms of the pattern.
If x = 1, then the pattern is immaterial- the skip must always be at least one (or else we won't ever make any progress), so we abandon the matched character, skip over it, and begin again at the mismatch.
A table of the skips for several short strings is shown in figure 8.7.
Figure 8.7: The skipArray for several patterns. The Prefix/Suffix column shows the substring that is both a prefix and suffix of the first j characters of the pattern. A Prefix/Suffix of nil denotes the empty string.
It is sometimes difficult to think concretely about this sort of problem, because humans have an enormous innate pattern-matching ability, and therefore we have the ability to use this kind of algorithm without really understanding it! Algorithms that make a great deal of intuitive sense can sometimes be difficult to implement elegantly (or even correctly). Figure 8.8 shows code that computes the skip array for a pattern.
Figure 8.8: Code to compute the prefix lengths for the Knuth-Morris-Pratt algorithm.
Once the skipArray is computed, however, the code to actually implement the algorithm is very straightforward. A C implementation of the code to implement algorithm 8.3 is shown in figure and 8.9.
Figure 8.9: The Knuth-Morris-Pratt algorithm, implemented in C.
The careful reader will note that this algorithm does not fulfill the promise that we will look at each character in the text only once. Characters in the midst of a match will only be looked at once, and characters that fail immediately to match the first character in the pattern are looked at only once, but we will look at mismatched characters twice. When a match fails somewhere in the midst of a string, the character that caused the failure is tested again in the next attempt. Although we know what this character is and can make use of this information, the only information we are using is what it isn't- it isn't a match.
However, although we may have to look at some characters more than once, the algorithm looks at each character at most twice and so is O(N) (see the exercises for more detail).
We can improve upon the performance of our implementation by extending our table to refine our information about the prefix lengths to include the character that caused the mismatch. This can result in a much larger table, since instead of a one-dimensional table indexed by the number of matching characters, we now need to have a two-dimensional array indexed by both the number of matching characters and the identity of the character that caused the mismatch. Although the table is often sparse, and can be represented in relatively little space, this optimization adds considerably to the memory requirements and/or complexity of the implementation. Whether the added performance is worth the extra memory and complexity is a design decision.
This section illustrates the Knuth-Morris-Pratt algorithm with an example.
Let us begin with an example where the pattern is ACAC and the text is ABACABACAC. In figure 8.10, j and i are shown, along with the pattern shifted by i.
Figure 8.10: An illustration of the Knuth-Morris-Pratt algorithm.