We can use the technique of hashing (explored earlier in the semester)
as a way of testing whether two strings are the same- if two strings
hash to the same value, then they *might* be the same string, but
if they hash to different values then they are definitely *not*
the same. This property can be used in a string searching algorithm
as shown in algorithm 8.2.

However, at first analysis, this seems like a terrible idea: good hash
functions need to look at every character in the string. If the
length of the pattern is *M*, and the hash function looks at each
character in the substring, then computing the hash of a substring of
length *M* takes (at least) *O*(*M*) time. Since the hashing
function looks at every character, this algorithm may need to perform
an *O*(*M*) calculation at each of the *N*-*M*+1 possible
positions in a text of length *N*, giving an algorithm that is
*O*(*M*(*N*-*M*)).

If you recall from section 8.1.1, *O*(*M*(*N*-*M*)) is the time
complexity of the *worst case* of the naive algorithm! This
is certainly a bad sign- it would seem that by using hashing, we're
dooming ourselves to *always* do the amount of work that would
otherwise be done only in the worst case of the brute-force algorithm.

As you might guess, however, there is a way around this problem.
Although there is no general way to reduce the amount of work required
to perform a *single* hash, there exist hashing functions that can
be computed very efficiently for a *group* of strings that are
related to each other. In this algorithm, each substring is
closely related to the next substring: the next substring simply has
one character chopped off of one end and a new character added to the
other end.

Can we devise a good hashing function that takes advantage of this
fact? There are several, but one that is commonly used is shown
in figure 8.1.
This hashing function treats the string as a large number, where each
character in the array represents a digit in a number in base *b*.
This number is then reduced modulo *p* to give the hash value.
In order to get good results, this *p* should be chosen so that it and
*b* are relatively prime; this is easily accomplished by having *b* or
*p* be prime.

**Figure 8.1:** A hashing function to compute
the hash of a string,
where the size of the character set is *b* and *p*
is relatively prime to *b*.
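The figure itself does not survive in this copy. A function of the kind described- treating a string *s* of length *n* as a sequence of base-*b* digits- would presumably be of the following form (a reconstruction, not the original figure):

```latex
h(s) = \Bigl( \sum_{i=0}^{n-1} s[i] \cdot b^{\,n-1-i} \Bigr) \bmod p
```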

The hash function given in figure 8.1, as stated above,
is clumsy (or impossible) to implement in C. The problem is that
integers in C have a fixed number of bits (usually 32 or 64, though
16-bit machines are not unusual). Depending on the values of *b* and
*n*, it may be inevitable that the term *b*^(*n*-1) will
be too large to fit into an `int` (or *any* ordinary C type).
For example, if *b* = 256, then *n* must be 4 or less in order for the
result of the summation to fit into a 32-bit integer.

Luckily, we can exploit an important property of the modulo operation- taking the modulus of intermediate steps of an addition, multiplication, or subtraction calculation has no effect on the outcome of the calculation. For example, for any *x* and *y*, and any modulus *m*:

(*x* + *y*) mod *m* = ((*x* mod *m*) + (*y* mod *m*)) mod *m*

(*x* × *y*) mod *m* = ((*x* mod *m*) × (*y* mod *m*)) mod *m*

This means that by performing the modulo operation early and often, we can compute the modulus of the results of operations on very large numbers without actually ever worrying about the size of the numbers, as long as the modulus we are using is not very large.

For example, imagine that we want to compute (*x* × *y*) mod *b*,
where *x* and *y* are very large numbers, but *b* is a small
number. The product *x* × *y* will be an extremely large
number- but we need not ever deal with it, because
(*x* × *y*) mod *b* = ((*x* mod *b*) × (*y* mod *b*)) mod *b*, and by definition, the
result must always be somewhere in the range 0 to *b* - 1.

By aggressively taking the modulus during intermediate steps, we can avoid using very large numbers at all.

Imagine, for example, that we want to compute 1234 mod 7. It is useful to first compute several powers of 10, modulo 7:

10^0 mod 7 = 1, 10^1 mod 7 = 3, 10^2 mod 7 = 2, 10^3 mod 7 = 6

Given these values, it is easy to compute 1234 mod 7:

1234 mod 7 = (1 × 6 + 2 × 2 + 3 × 3 + 4 × 1) mod 7 = (6 + 4 + 9 + 4) mod 7 = 23 mod 7 = 2

The trick that makes the Rabin-Karp string searching algorithm
efficient is the fact that we have chosen a hash function which is
easy to compute for adjacent substrings- once we know what the hash
value of the first substring is, we can compute the hash value of the
next in time *O*(1). This is accomplished by taking advantage, once
again, of the properties of the modulo operation.

For example, imagine that the text and pattern strings are drawn from the alphabet consisting of only the characters 0 through 9. The text is ``234591'', and we are searching for a pattern of length 4. We have already computed 2345 mod *p*. How can we use this to help us compute 3459 mod *p*?

Observe that 2345 can be transformed into 3459 by the following steps:

- Multiply 2345 by 10, producing 23450.
- Subtract 20000 from 23450, leaving 3450. This erases the effect of the ``2'' digit, which has ``shifted'' off the left end.
- Add 9, giving 3459. This adds in the ``9'' digit.

Using this method results in the following calculation:

3459 mod *p* = ((2345 mod *p*) × 10 - 2 × (10000 mod *p*) + 9) mod *p*

(If the subtraction would go negative, a multiple of *p* can be added first without changing the result.)

Similarly, once we know 3459 mod *p*, it is easy to use this knowledge to compute 4591 mod *p*, in precisely the same manner as above:

4591 mod *p* = ((3459 mod *p*) × 10 - 3 × (10000 mod *p*) + 1) mod *p*

We will now explore how to implement this algorithm in C, for strings consisting of ordinary C `char` characters.

A hashing function based on the function given in figure
8.1 is shown in figure 8.2.
Note that the modulus *m*
is not defined in the function; it is a parameter that can be
specified at runtime.

**Figure 8.2:** A hashing function.

The actual string matching algorithm is given in figure 8.3.

**Figure 8.3:** The Rabin-Karp String Searching Algorithm

The first action taken is to compute *b*^(*M*-1) mod *m*, which will be used later to figure out what amount to subtract from the hash value as characters are ``shifted off'' to the left.

To gracefully deal with the case where *M* is large, the
divide-and-conquer algorithm for computing exponentials is used. This
is the same as the algorithm you explored in the first assignment,
except that the modulus is taken early and often to prevent the result
from becoming too large. The code for the resulting function
is shown in figure 8.4.

**Figure 8.4:** An `exp` function that
computes *b*^*n* mod *m*.

The next action is to laboriously hash the pattern and the substring
of length *M* at shift zero of the text. This is the only time we'll
hash a substring of the text using the hashing function; all
subsequent hashes will be computed by the method described previously.

Before we begin working our way down the string, however,
we must check whether shift zero itself is a match. If it
is, then we are finished. Otherwise, we loop through the
rest of the string, computing each hash based on the previous
(using the `nextHash` function, shown in figure 8.5).

**Figure 8.5:** The `nextHash` function.

The implementation of the Rabin-Karp algorithm shown in the previous section is faithful to the original algorithm, but has a flaw that makes it somewhat suboptimal on many modern computers- it relies heavily on the modulo operation, which is usually relatively slow.

However, the modulus operation can be eliminated from the algorithm by
choosing a different hashing function that has the same property that
it is easy to compute from the previous string, yet does not require
so many modulo operations. One way to generate such a function is to
keep the same basic function, but always choose a modulus that is a
power of 2, so that bit masks can be used instead of the more
expensive **%** operation. Of course, using a modulus that is a power
of 2 will have disastrous consequences for the hashing function if
the base *b* used by the function is also a power of 2- which it
often will be. However, we have some freedom here as well-
all that we need to do is to choose a *b* that is relatively
prime with respect to *m*. Since *m* is a power of 2, any *odd*
number larger than the number of characters in the alphabet
will do; in this case, using a *b* of 257 is a safe bet.

A function that implements this change is shown in figure
8.6. Note that this function completely ignores its *b* and
*m* parameters and instead always uses *b* = 257 and *m* = 1024.
(These parameters are retained only so this function can be used as a
drop-in replacement for `stringSearchRK`.)

**Figure 8.6:** A faster implementation of the Rabin-Karp
algorithm, which avoids using the mod operation.

The Rabin-Karp algorithm, as presented here, can be very slow if the
text contains a lot of ``false matches'': substrings that hash to the
same number as the pattern and therefore cause an expensive
`memcmp` to be performed. An intelligent (or lucky) adversary who
knows *b* and *m* can create texts that make this algorithm perform
terribly.

One solution is to choose *m* and *b* randomly at runtime, and
whenever the number of false matches is high, randomly choose a new
*m* and/or *b*.

Mon Jul 21 22:30:59 EDT 1997