The string searching problem can be defined as the following: given a string P and a string T, determine whether or not P appears as a substring anywhere in T. If so, return the offset of the start of the occurrence of P in T. If not, indicate that no such occurrence exists by returning -1.
Note that string searching algorithms can, in general, be extended to find all of the occurrence of P in T instead of only the first, but for the sake of simplicity we will initially only consider the problem of finding the first occurrence.
For example, let P = ``ring'' and T = ``string searching''. The string ``ring'' does appear in T, starting in position 2 (using C's zero-based indexing).
In the previous example, the text is ``string searching'', the pattern is ``ring'', and the shift of the occurrence is 2.
An algorithm to solve the string searching problem immediately suggests itself- starting with a shift i of 0, simply compare the substring of T starting at i with the pattern P. Continue testing until a match is found, or the end of T is reached. This algorithm is presented in C in algorithm 8.1.
[hbtp]
1 long stringSearchBF (unsigned char *P, long M, unsigned char *T, long N) 2 { 3 long i, j; 4 5 for (i = 0; i <= N - M; i++) { 6 for (j = 0; j < M; j++) { 7 if (P[j] != T[i + j]) { 8 break; 9 } 10 } 11 if (j == M) { 12 return (i); 13 } 14 } 15 return (-1); 16 }
One important optimization that appears in this code is that the attempt to match P and each substring of T is abandoned as soon as any character in P fails to match the substring from T. Depending on the nature of P and T, this simple optimization can have a profound impact on the execution speed of the algorithm!
In the worst case, this algorithm performs M(N-M) character comparisons (comparing each of the M characters in P to each of the characters in each of the N-N possible substrings of T). If mismatches are recognized earlier, however, then substantially fewer comparisons are necessary. For example, if the first character of P does not appear anywhere in T, or only as part of a match, then the attempts to match each substring will all fail immediately, and this algorithm will only require N-M character comparisons.
In the rest of this chapter, we will explore several ways to reduce the amount of work required in the worst case. We will begin with the Rabin-Karp algorithm, and then progress on to the Knuth-Morris-Pratt and Boyer-Moore algorithms.
It must be noted that for many purposes, the brute force string searching algorithm is sufficient. It is small and simple, and can be coded in an efficient manner that takes advantage of the underlying hardware. For short patterns, or even long patterns that begin with ``rare'' prefixes, the brute-force method is as fast as any.