For this assignment we had to implement two string-matching algorithms and experimentally analyze their worst-case time complexity. The algorithms are:
- Knuth-Morris-Pratt
- Boyer-Moore
Before explaining the algorithms, I will describe a simple, basic way of finding a sub-string in a string, so we can compare the algorithms against it for better understanding.
The Basic Way
The basic way of searching for a sub-string within a string is the obvious one: compare every character of the string against the sub-string, at every position, in a very inefficient way. The code that does this is as follows:
The Python script
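The script itself is not reproduced here, so below is a minimal sketch of a naive search. It deliberately has no early exit, so every one of the n positions costs m character checks, matching the n * m = 21 count quoted below; it is demonstrated on what is presumably the same example used later in the report (T = 'ASASDAA', P = 'ASD'):

```python
def naive_search(T, P):
    """Naive search with no early exit: every one of the n positions
    of T costs m character checks, n * m comparisons in total."""
    n, m = len(T), len(P)
    comparisons = 0
    found = -1
    for i in range(n):
        ok = True
        for j in range(m):
            comparisons += 1
            # Positions past the end of T count as automatic mismatches.
            if i + j >= n or T[i + j] != P[j]:
                ok = False
        if ok and found == -1:
            found = i
    return found, comparisons

print(naive_search('ASASDAA', 'ASD'))   # -> (2, 21)
```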
Example Run:
With this search algorithm we did 21 comparisons on a T of length n = 7 using a P of length m = 3. So we compare every character in T m times, for n * m = 21 comparisons in total, which is very inefficient.
Knuth-Morris-Pratt Algorithm
The KMP algorithm is easy to understand and performs acceptably for string searching. The principle is the same as the basic naive way, but instead of restarting the search when we find a mismatch between a character of the string and the pattern, we shift k characters ahead, k being the number of matches we had already found.
Let T = 'ASASDAA', so n = 7, and P = 'ASD', so m = 3. P is found at index 2 of T.
So, in the first comparisons (column 1) we match two letters, but the third one doesn't match. According to the algorithm we can advance 2 positions (we matched 2 letters and then found a mismatch). In the next comparisons we find the pattern, so we can advance m positions. Finally, the number of remaining letters in T is 2, which is less than m, so we can skip the comparisons at the end and just finish the execution.
The Python script:
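The script itself is not shown, so here is a minimal sketch of textbook KMP using a failure (prefix) table. For a pattern like 'ASD', whose proper prefixes never match its suffixes, the table is all zeros and the search reduces exactly to the "shift by the number of matched characters" rule described above:

```python
def kmp_search(T, P):
    """Return the first index of P in T, or -1 if P does not occur."""
    n, m = len(T), len(P)
    # fail[i] = length of the longest proper prefix of P[:i + 1]
    # that is also a suffix of it.
    fail = [0] * m
    k = 0
    for i in range(1, m):
        while k > 0 and P[i] != P[k]:
            k = fail[k - 1]
        if P[i] == P[k]:
            k += 1
        fail[i] = k
    q = 0   # number of characters of P matched so far
    for i in range(n):
        # On a mismatch, fall back within P instead of rescanning T.
        while q > 0 and T[i] != P[q]:
            q = fail[q - 1]
        if T[i] == P[q]:
            q += 1
        if q == m:
            return i - m + 1    # the match ends at position i
    return -1

print(kmp_search('ASASDAA', 'ASD'))   # -> 2
```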
Running the previous example:
Worst Case Complexity
It's easy to think of an input that makes this algorithm behave in the worst way. We know that the shift is only 1 for every character that matches, so if we find m - 1 matches (all except the last character) and then a mismatch, we have already compared m - 1 characters in vain. To create a worst case we can repeat this pattern k times and add the correct match at the end.
An example using k = 100 and 'K' as the filler character would be:
"ASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASD"
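This T can be generated with a one-liner (here the last of the k = 100 blocks carries the real match, which gives the length of 300 used below):

```python
k = 100
T = 'ASK' * (k - 1) + 'ASD'   # 99 filler blocks, then the real match
assert len(T) == 3 * k        # 300 characters for k = 100
```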
Analyzing this worst-case string with the algorithm gives us 399 comparisons with a T of length 300 (consistent with 4 comparisons per 'ASK' block: two matches, one mismatch, and one re-check after falling back, so 99 * 4 + 3 = 399).
To better analyze the worst case, let's plot the growth of the number of comparisons as the length of T increases:
Plots: worst-case T analysis; a regular T analysis.
As we can see, the growth is linear. So even in a worst-case scenario, the complexity is O(m + n).
Time Analysis
The plot shows the growth of the running time as T's length increases. As we can see, the efficiency is clearly good: we get 0.1 seconds with a T of length 10,000.
Boyer-Moore Algorithm
This algorithm was somewhat tricky to understand, and even so I had to guide myself using the C code on this page. The principle is that we have to do some pre-processing before we start searching for the string.
In this pre-processing we define two arrays (in my case, a list and a dictionary). These define "rules" that let us shift further ahead and avoid unnecessary comparisons.
Bad Character Rule
The bad character rule is applied when a character that we are comparing in T is not in P. To fill this array (which is a dictionary in my case), we first check every character in T. If it is not in P, we assign that character the value m (the length of P). Otherwise, we assign it the position found by searching for that character in P from right to left.
So the Bad Character dictionary for the 'ASASDAA' example is:
{'A': 2, 'S': 1, 'D': 0}
- A is at position 2 (counting from 0 to m - 1) from right to left.
- S is at position 1 from right to left.
- D is at position 0 from right to left.
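As a sketch, the dictionary described above could be built like this (the helper name build_bad_char_table is mine, not from the report):

```python
def build_bad_char_table(T, P):
    """Map each character of T to the distance of its rightmost
    occurrence from the right end of P (0 = last character of P);
    characters that never occur in P get m."""
    m = len(P)
    return {c: (m - 1 - P.rindex(c)) if c in P else m for c in set(T)}

print(build_bad_char_table('ASASDAA', 'ASD'))
# -> {'A': 2, 'S': 1, 'D': 0} (key order may vary)
```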
The Python script:
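The original script is not reproduced here; the sketch below applies only the bad character rule built above (the second, list-based table mentioned earlier is omitted). The max(1, ...) guard keeps the window moving even when the table entry is 0, as it is for 'D':

```python
def boyer_moore_search(T, P):
    """Simplified Boyer-Moore using only the bad character rule."""
    n, m = len(T), len(P)
    table = build_bad_char_table(T, P)   # dictionary from the sketch above
    i = 0                                # left end of the current window
    while i <= n - m:
        j = m - 1
        # Compare the window against P from right to left.
        while j >= 0 and T[i + j] == P[j]:
            j -= 1
        if j < 0:
            return i                     # P found at index i of T
        # Shift by the bad character distance of the text character
        # under the window's last position; never shift less than 1.
        i += max(1, table.get(T[i + m - 1], m))
    return -1

print(boyer_moore_search('ASASDAA', 'ASD'))   # -> 2
```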
Running the previous example:
Worst Case Complexity
The worst case would be repeating the pattern one time after another, k times. Something like:
P = 'ASD', T = 'ASDASDASDASDASD...'
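A quick sketch to generate such a T for the experiment:

```python
k = 100
P = 'ASD'
T = P * k   # 'ASDASDASD...' of length 3 * k
```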
Comparisons
Plots: worst-case T analysis; a regular T analysis.
We can see that even in a worst-case scenario the growth is linear, approximately n/2. This shows that the Boyer-Moore algorithm is the most efficient of the ones tested, for every length of T.
Time Analysis
The algorithm is incredibly fast, completing searches in 0.06 seconds with a T of length 10,000.
References: