We can use the technique of hashing (explored earlier in the semester) as a way of testing whether two strings are the same: if two strings hash to the same value, then they might be the same string, but if they hash to different values then they are definitely not the same. This property can be used in a string searching algorithm, as shown in algorithm 8.2.
However, at first glance, this seems like a terrible idea: good hash functions need to look at every character in the string. If the length of the pattern is M and the length of the text is N, and the hash function looks at each character in the substring, then computing the hash of a substring of length M takes (at least) O(M) time. If the hashing function looks at every character, then this algorithm may need to perform an O(M) calculation at each of the N - M + 1 possible positions in the text, giving an algorithm that is O(M(N - M)).
If you recall from section 8.1.1, O(M(N - M)) is the worst-case time complexity of the naive algorithm! This is certainly a bad sign: it would seem that by using hashing, we're dooming ourselves to always do the amount of work that the brute-force algorithm does only in its worst case.
As you might guess, however, there is a way around this problem. Although there is no general way to reduce the amount of work required to perform a single hash, there exist hashing functions that can be computed very efficiently for a group of strings that are related to each other. In this algorithm, each substring is closely related to the next substring: the next substring simply has one character chopped off of one end and a new character added to the other end.
Can we devise a good hashing function that takes advantage of this fact? There are several, but one that is commonly used is shown in figure 8.1. This hashing function treats the string as a large number, in which each character in the array represents a digit in base b. That number is then taken modulo p to give the hash value. In order to get good results, p should be chosen so that it and b are relatively prime; this is easily accomplished by making either b or p prime.
Figure 8.1: A hashing function to compute the hash of a string, where the size of the character set is b and p is relatively prime to b.
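A plausible form for the function described above, assuming the string is written as s[0], s[1], ..., s[n-1] (the notation used in figure 8.1 may differ), is

    h(s) = \left( \sum_{i=0}^{n-1} s[i] \cdot b^{\,n-1-i} \right) \bmod p

so that the leftmost character is treated as the most significant digit.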
The hash function given in figure 8.1, as stated above, is clumsy (or impossible) to implement directly in C. The problem is that integers in C have a fixed number of bits (usually 32 or 64, though 16-bit machines are not unusual). Depending on the values of b and n, it may be inevitable that the terms of the sum (in particular, b^(n-1)) will be too large to fit into an int (or any other ordinary C type). For example, if b = 256, then n must be 4 or less in order for the result of the summation to fit into a 32-bit integer.
Luckily, we can exploit an important property of the modulo operation: taking the modulus of the intermediate steps of an addition, subtraction, or multiplication has no effect on the outcome of the calculation. For example, for any x, y, and modulus p:

    (x + y) mod p = ((x mod p) + (y mod p)) mod p
    (x * y) mod p = ((x mod p) * (y mod p)) mod p
This means that by performing the modulo operation early and often, we can compute the modulus of the results of operations on very large numbers without actually ever worrying about the size of the numbers, as long as the modulus we are using is not very large.
For example, imagine that we want to compute (x * y) mod b, where x and y are very large numbers, but b is a small number. The product x * y will be an extremely large number, but we need not ever deal with it, because (x * y) mod b = ((x mod b) * (y mod b)) mod b, and by definition, x mod b and y mod b must each be somewhere in the range 0 to b - 1.
By aggressively taking the modulus during intermediate steps, we can avoid using very large numbers at all.
Imagine that we want to compute 2345 mod 7. It is useful to first compute several powers of 10, mod 7:

    10^0 mod 7 = 1
    10^1 mod 7 = 3
    10^2 mod 7 = 2
    10^3 mod 7 = 6
Given these values, it is easy to compute 2345 mod 7:

    2345 mod 7 = (2 * (10^3 mod 7) + 3 * (10^2 mod 7) + 4 * (10^1 mod 7) + 5 * (10^0 mod 7)) mod 7
               = (2 * 6 + 3 * 2 + 4 * 3 + 5 * 1) mod 7
               = (12 + 6 + 12 + 5) mod 7
               = 35 mod 7
               = 0
The trick that makes the Rabin-Karp string searching algorithm efficient is the fact that we have chosen a hash function which is easy to compute for adjacent substrings- once we know what the hash value of the first substring is, we can compute the hash value of the next in time O(1). This is accomplished by taking advantage, once again, of the properties of the modulo operation.
For example, imagine that the text and pattern string are drawn from the alphabet consisting of only the characters 0 through 9. The text is ``234591'', and we are searching for a pattern of length 4. We have already computed 2345 mod 7 = 0. How can we use this to help us compute 3459 mod 7?
Observe that 2345 can be transformed into 3459 by the following steps: subtract 2 * 1000 (removing the leading digit), multiply by 10 (shifting the remaining digits to the left), and add 9 (the new rightmost digit):

    3459 = (2345 - 2 * 1000) * 10 + 9
Using this method, and taking the modulus early and often, results in the following calculation:

    3459 mod 7 = (((2345 mod 7) - 2 * (1000 mod 7)) * 10 + 9) mod 7
               = ((0 - 2 * 6) * 10 + 9) mod 7
               = -111 mod 7
               = 1
Similarly, once we know 3459 mod 7, it is easy to use this knowledge to compute 4591 mod 7, in precisely the same manner as above:

    4591 mod 7 = (((3459 mod 7) - 3 * (1000 mod 7)) * 10 + 1) mod 7
               = ((1 - 3 * 6) * 10 + 1) mod 7
               = -169 mod 7
               = 6
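A small C program can confirm these hand calculations. The sketch below (all of the names in it are illustrative, not taken from the figures) hashes the first length-4 substring of ``234591'' the slow way, then rolls the hash across the rest of the text in O(1) per step, printing the values 0, 1, and 6 computed above:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *text = "234591";
        int n = (int) strlen(text);
        int M = 4;                   /* length of the substrings being hashed */
        int b = 10, p = 7;           /* base-10 digits, modulus 7 */
        int i, h, bM1, dropped, added;

        /* bM1 = b^(M-1) mod p, the weight of the leading digit. */
        bM1 = 1;
        for (i = 0; i < M - 1; i++) {
            bM1 = (bM1 * b) % p;
        }

        /* Hash the first substring the slow way, taking the modulus
         * early and often (Horner's rule). */
        h = 0;
        for (i = 0; i < M; i++) {
            h = (h * b + (text[i] - '0')) % p;
        }
        printf("hash of %.*s = %d\n", M, text, h);

        /* Roll the hash across the rest of the text. */
        for (i = 1; i + M <= n; i++) {
            dropped = text[i - 1] - '0';        /* digit shifted off the left */
            added   = text[i + M - 1] - '0';    /* digit added on the right   */
            h = ((h + p - (dropped * bM1) % p) * b + added) % p;
            printf("hash of %.*s = %d\n", M, text + i, h);
        }
        return 0;
    }

Note that the update adds p before subtracting, so the intermediate value never goes negative; C's % operator does not behave like the mathematical mod when its operands are negative.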
We will now explore how to implement this algorithm in C, for strings consisting of C chars.
A hashing function based on the function given in figure 8.1 is shown in figure 8.2. Note that the modulus m is not hard-coded into the function; it is a parameter that can be specified at runtime.
Figure 8.2: A hashing function.
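A minimal sketch of a hashing function along these lines, assuming a signature of hash(str, n, b, m) (the names and types used in figure 8.2 may differ), looks something like this:

    /*
     * Hash the first n characters of str, treating each character as a
     * digit in base b and reducing mod m at every step so that the
     * intermediate values never overflow (this assumes b and m are small
     * enough that (m - 1) * b + 255 fits in an unsigned int).
     */
    unsigned int hash(const char *str, int n, unsigned int b, unsigned int m)
    {
        unsigned int h = 0;
        int i;

        for (i = 0; i < n; i++) {
            h = (h * b + (unsigned char) str[i]) % m;
        }
        return h;
    }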
The actual string matching algorithm is given in figure 8.3.
Figure 8.3: The Rabin-Karp String Searching Algorithm
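As a rough sketch of how the pieces described in the rest of this section fit together (the helper functions are only declared here and are sketched after figures 8.2, 8.4, and 8.5; the names, signatures, and the use of -1 to mean ``not found'' are assumptions, not details taken from figure 8.3):

    #include <string.h>

    unsigned int hash(const char *str, int n, unsigned int b, unsigned int m);
    unsigned int modExp(unsigned int b, unsigned int e, unsigned int m);
    unsigned int nextHash(unsigned int h, char dropped, char added,
                          unsigned int bM1, unsigned int b, unsigned int m);

    /* Return the first shift at which pattern occurs in text, or -1 if
     * it does not occur.  b is the base and m the modulus of the hash. */
    int stringSearchRK(const char *text, const char *pattern,
                       unsigned int b, unsigned int m)
    {
        int N = (int) strlen(text);
        int M = (int) strlen(pattern);
        unsigned int bM1, hPat, hTxt;
        int s;

        if (M == 0 || M > N) {
            return -1;               /* treat an empty pattern as "no match" */
        }

        /* b^(M-1) mod m: the weight of the character that gets
         * "shifted off" the left end at each step. */
        bM1 = modExp(b, (unsigned int) (M - 1), m);

        /* Hash the pattern, and the shift-zero substring of the text,
         * the slow way.  These are the only two full hash computations. */
        hPat = hash(pattern, M, b, m);
        hTxt = hash(text, M, b, m);

        for (s = 0; ; s++) {
            /* Equal hashes only mean a possible match; confirm it. */
            if (hTxt == hPat && memcmp(text + s, pattern, (size_t) M) == 0) {
                return s;
            }
            if (s + M >= N) {
                break;               /* no more substrings of length M */
            }
            /* O(1) update: drop text[s], bring in text[s + M]. */
            hTxt = nextHash(hTxt, text[s], text[s + M], bM1, b, m);
        }
        return -1;
    }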
The first action taken is to compute b^(M-1) mod m, which will be used later to figure out what amount to subtract from the hash value as characters are ``shifted off'' to the left.
To gracefully deal with the case where M is large, the divide-and-conquer algorithm for computing exponentials is used. This is the same as the algorithm you explored in the first assignment, except that the modulus is taken early and often to prevent the result from becoming too large. The code for the resulting function is shown in figure 8.4.
Figure 8.4: An exp function that computes a power of b, modulo m.
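A sketch of such a function, here called modExp so that it does not collide with the exp declared in math.h (the name and signature used in figure 8.4 may differ), is:

    /*
     * Compute (b^e) mod m by divide and conquer, taking the modulus at
     * every step so the intermediate values stay small.  Assumes b and m
     * are small enough that the products below fit in an unsigned int.
     */
    unsigned int modExp(unsigned int b, unsigned int e, unsigned int m)
    {
        unsigned int half;

        if (e == 0) {
            return 1 % m;
        }

        /* b^e = (b^(e/2))^2, times an extra factor of b when e is odd. */
        half = modExp(b, e / 2, m);
        half = (half * half) % m;
        if (e % 2 == 1) {
            half = (half * b) % m;
        }
        return half;
    }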
The next action is to laboriously hash the pattern and the substring of length M at shift zero of the text. This is the only time we'll hash a substring of the text using the hashing function; all subsequent hashes will be computed by the method described previously.
Before we begin working our way down the string, however, we must check whether shift zero itself is a match. If it is, then we are finished. Otherwise, we loop through the rest of the string, computing each hash based on the previous (using the nextHash function, shown in figure 8.5).
Figure 8.5: The nextHash function.
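A sketch of what nextHash might look like, given the update rule derived in the digits example above (the parameter names and their order are assumptions; figure 8.5 may differ):

    /*
     * Given the hash h of the length-M substring at some shift, compute
     * the hash of the substring at the next shift.  "dropped" is the
     * character leaving on the left, "added" is the character entering
     * on the right, and bM1 is b^(M-1) mod m.
     */
    unsigned int nextHash(unsigned int h, char dropped, char added,
                          unsigned int bM1, unsigned int b, unsigned int m)
    {
        /* Remove the contribution of the dropped character, adding m so
         * the intermediate value cannot wrap around below zero. */
        h = (h + m - ((unsigned char) dropped * bM1) % m) % m;

        /* Shift everything one digit to the left and append the new
         * character, reducing mod m as we go. */
        h = (h * b + (unsigned char) added) % m;

        return h;
    }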
The implementation of the Rabin-Karp algorithm shown in the previous section is faithful to the original algorithm, but it has a flaw that makes it somewhat suboptimal on many modern computers: it relies heavily on the modulo operation, which is usually relatively slow.
However, the modulo operation can be eliminated from the algorithm by choosing a different hashing function that is still easy to compute from the hash of the previous substring, yet does not require so many modulo operations. One way to get such a function is to keep the same basic function, but always choose a modulus that is a power of 2, so that bit masks can be used instead of the more expensive % operation. Of course, using a modulus that is a power of 2 will have disastrous consequences for the hashing function if the base b used by the function is also a power of 2, which it often will be. However, we have some freedom here as well: all that we need to do is choose a b that is relatively prime to m, and since m is a power of 2, any odd number larger than the number of characters in the alphabet will do. In this case, using a b of 257 is a safe bet.
A function that implements this change is shown in figure 8.6. Note that this function completely ignores its b and m parameters and instead always uses b = 257 and m = 1024. (These parameters are retained only so that this function can be used as a drop-in replacement for stringSearchRK.)
Figure 8.6: A faster implementation of the Rabin-Karp
algorithm, which avoids using the mod operation.
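A sketch in the same spirit (only the choices b = 257 and m = 1024 and the drop-in signature are taken from the text; the name stringSearchRKfast and everything else below are assumptions):

    #include <string.h>

    /*
     * The same search as stringSearchRK, but with the base fixed at 257
     * and the modulus fixed at 1024, so that every "% m" can be replaced
     * by a bitwise AND with m - 1.  The b and m parameters are ignored;
     * they are kept only to preserve the stringSearchRK signature.
     */
    int stringSearchRKfast(const char *text, const char *pattern,
                           unsigned int b, unsigned int m)
    {
        const unsigned int B = 257;          /* odd, so relatively prime to 1024 */
        const unsigned int MASK = 1024 - 1;  /* x & MASK == x % 1024             */
        int N = (int) strlen(text);
        int M = (int) strlen(pattern);
        unsigned int bM1, hPat, hTxt;
        int i, s;

        (void) b;
        (void) m;
        if (M == 0 || M > N) {
            return -1;
        }

        /* B^(M-1) mod 1024, computed with masking instead of %. */
        bM1 = 1;
        for (i = 0; i < M - 1; i++) {
            bM1 = (bM1 * B) & MASK;
        }

        /* Hash the pattern and the shift-zero substring of the text. */
        hPat = hTxt = 0;
        for (i = 0; i < M; i++) {
            hPat = (hPat * B + (unsigned char) pattern[i]) & MASK;
            hTxt = (hTxt * B + (unsigned char) text[i]) & MASK;
        }

        for (s = 0; ; s++) {
            if (hTxt == hPat && memcmp(text + s, pattern, (size_t) M) == 0) {
                return s;
            }
            if (s + M >= N) {
                break;
            }
            /* Rolling update, with & replacing both % operations. */
            hTxt = (hTxt + 1024 - (((unsigned char) text[s] * bM1) & MASK)) & MASK;
            hTxt = (hTxt * B + (unsigned char) text[s + M]) & MASK;
        }
        return -1;
    }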
The Rabin-Karp algorithm, as presented here, can be very slow if the text contains a lot of ``false matches'': substrings that hash to the same number as the pattern and therefore cause an expensive memcmp to be performed. An intelligent (or lucky) adversary who knows b and m can create texts that make this algorithm perform terribly.
One solution is to choose m and b randomly at runtime, and whenever the number of false matches is high, randomly choose a new m and/or b.
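A sketch of one way to pick these parameters (the prime table, the range of bases, and the name chooseRandomParams are arbitrary choices for illustration; it assumes the caller has seeded the generator once, for example with srand(time(NULL))):

    #include <stdlib.h>

    /* A handful of primes, all small enough that (m - 1) * (m - 1)
     * still fits comfortably in a 32-bit unsigned int. */
    static const unsigned int somePrimes[] = { 4093, 8191, 16381, 32749 };

    /* Choose a random modulus and base for the hash, so that an
     * adversary who does not know them cannot easily construct a text
     * full of false matches. */
    void chooseRandomParams(unsigned int *b, unsigned int *m)
    {
        int nPrimes = (int) (sizeof(somePrimes) / sizeof(somePrimes[0]));

        *m = somePrimes[rand() % nPrimes];

        /* Any base larger than the 256-character alphabet is fine here:
         * since *m is prime and larger than every base in this range,
         * b and m are automatically relatively prime. */
        *b = 257 + (unsigned int) (rand() % 1000);
    }

Whenever the number of false matches gets suspiciously high, the caller can simply call this again and restart the search with the new parameters.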