jueves, 21 de febrero de 2013

2nd Assignment: String Matching

For this assignment we had to program two string matching algorithms, then experiment with them and report their worst-case time complexity. The algorithms are:
  • Knuth-Morris-Pratt
  • Boyer-Moore
Before explaining the algorithms, I will explain a simple and basic way of finding a substring in a string, so we can compare it with the algorithms for a better understanding.

The Basic Way

The basic way of searching for a substring within a string is pretty obvious: search by comparing every character of the string with the substring, in a very inefficient way. The code that does this is as follows:

The Python script
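
A minimal sketch of such a naive search might look like this (the function name and the comparison counter are illustrative; the exact count depends on details such as whether the inner loop stops at the first mismatch):

    def naive_search(T, P):
        # try every alignment of P in T, comparing left to right
        n, m = len(T), len(P)
        comparisons = 0
        for i in range(n - m + 1):
            for j in range(m):
                comparisons += 1
                if T[i + j] != P[j]:
                    break
            else:
                # inner loop finished without a break: full match
                print('P found at index', i, 'of T')
        return comparisons

    print(naive_search('ASASDAA', 'ASD'), 'comparisons')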



Example Run:


With this search algorithm we made 21 comparisons on a T of length n = 7 with a P of length m = 3. So we compare every character in T m times, which is very inefficient.

Knuth-Morris-Pratt Algorithm

The KMP algorithm is easy to understand and performs acceptably for string searching. The principle is the same as in the basic naive way, but instead of restarting the search one position ahead when we find a mismatch between a character in the string and the pattern, we shift k characters ahead, k being the number of matches we had already found.

Let T = 'ASASDAA', so n = 7
and P = 'ASD', so m = 3


P is found at index 2 of T.

So, in the first comparisons (column 1) we match two letters, but the third one doesn't match. According to the algorithm we can advance 2 positions (we matched 2 letters and then found a mismatch). In the next comparisons we find the pattern, so we can advance m positions. Finally, only 2 letters remain in T, which is less than m, so we can skip the comparisons at the end and just finish the execution.

The Python script:
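
A sketch following the shift rule described above (shift ahead by the number of matched characters, or by one if nothing matched; the names and the comparison counter are my own):

    def kmp_search(T, P):
        # simplified KMP: on a mismatch after k matches,
        # shift the pattern k positions ahead (1 if k == 0)
        n, m = len(T), len(P)
        comparisons = 0
        i = 0
        while i <= n - m:
            k = 0
            while k < m:
                comparisons += 1
                if T[i + k] != P[k]:
                    break
                k += 1
            if k == m:
                print('P found at index', i, 'of T')
                i += m          # full match: advance m positions
            else:
                i += max(1, k)  # advance by the matched prefix length
        return comparisons

    print(kmp_search('ASASDAA', 'ASD'), 'comparisons')

(A full KMP implementation precomputes a failure table over P; for a pattern like 'ASD', whose characters are all distinct, the shift-by-k rule behaves exactly the same.)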


Running the previous example:


Worst Case Complexity

It's easy to construct an input that makes this algorithm behave at its worst. We know that when we hit a mismatch, the shift is only as large as the number of characters matched so far (just 1 if nothing matched). So if we find m - 1 matches (all except the last character) and then find a mismatch, we have already compared m - 1 characters in vain. To create a worst case we can repeat this pattern k times and add the correct match at the end.

An example using k = 100 and 'K' as the filler character would be:

"ASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASKASD"
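
This string can be generated with a couple of lines (assuming k counts the three-character blocks, the last one being the real match):

    k = 100
    T = 'ASK' * (k - 1) + 'ASD'   # k blocks of 3 characters, n = 300
    P = 'ASD'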

Analyzing this worst case string with the algorithm gives us:


399 Comparisons with T of length 300

To better analyze the worst case, let's plot how the number of comparisons grows as the length of T increases:

(Plots: worst-case T analysis on the left; a regular T analysis on the right.)


As we can see, the growth is linear. So even in a worst-case scenario, the complexity is O(m + n).


Time


The growth of the time as T's length increases. As we can see, the efficiency is clearly good: we get about 0.1 seconds with a T of length 10,000.
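
A hypothetical timing sweep, reusing the kmp_search sketch above, could look like this:

    import time

    # time the search on growing worst-case texts
    for k in (100, 1000, 3333):
        T = 'ASK' * (k - 1) + 'ASD'
        start = time.perf_counter()
        kmp_search(T, 'ASD')
        print(len(T), round(time.perf_counter() - start, 4), 'seconds')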

Boyer-Moore Algorithm


This algorithm was somewhat tricky to understand, so I guided myself using the C code on this page. The principle is that we have to do some pre-processing before we start searching for the string.

In this pre-processing we define two arrays (in my case, a list and a dictionary). This lets us define "rules" to shift ahead faster and avoid unnecessary comparisons.

Bad Character Rule


The bad character rule is applied when a character that we are comparing in T is not in P. To fill this array (which is a dictionary in my case), we check every character in T: if it is not in P, we assign that character the value m (the length of P); otherwise, we assign it the position of that character in P, searching from right to left. A small sketch of this construction follows the example below.

So the Bad Character dictionary for the 'ASASDAA' example is:

{'A': 2, 'S': 1, 'D': 0}
  • A is at position 2 (counting from 0 to m - 1) from right to left.
  • S is at position 1 from right to left.
  • D is at position 0 from right to left.
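
A small sketch of this table construction, following the description above (built from the characters of T, using rindex for the right-to-left position):

    def bad_char_table(T, P):
        m = len(P)
        table = {}
        for c in set(T):
            if c in P:
                # position of c in P, counted from right to left
                table[c] = (m - 1) - P.rindex(c)
            else:
                # characters absent from P get the full length m
                table[c] = m
        return table

    print(bad_char_table('ASASDAA', 'ASD'))
    # e.g. {'A': 2, 'S': 1, 'D': 0} (key order may vary)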

The Python script:
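
A sketch of the search itself, using only the bad character rule (the second pre-processing array, for the good suffix rule, is omitted here; the names and counters are my own):

    def boyer_moore(T, P):
        # simplified Boyer-Moore with only the bad character rule
        n, m = len(T), len(P)
        last = {c: i for i, c in enumerate(P)}  # rightmost index in P
        matches, comparisons = [], 0
        s = 0
        while s <= n - m:
            j = m - 1
            while j >= 0:                # compare right to left
                comparisons += 1
                if T[s + j] != P[j]:
                    break
                j -= 1
            if j < 0:
                matches.append(s)
                s += m                   # full match: shift past it
            else:
                # align the mismatched character of T with its
                # rightmost occurrence in P, or skip P past it
                s += max(1, j - last.get(T[s + j], -1))
        return matches, comparisons

    print(boyer_moore('ASASDAA', 'ASD'))  # ([2], 4)

(Here last stores rightmost indices directly; it carries the same information as the right-to-left dictionary above.)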



Running the previous example:


Worst Case Complexity

The worst case arises when the pattern is repeated back to back, k times. Something like:

P = ASD          T = ASDASDASDASDASD....
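
Such a text can be generated and fed to the sketch above (a hypothetical run, reusing boyer_moore, which reports every occurrence and therefore scans the whole text):

    k = 100
    T = 'ASD' * k                 # the pattern repeated k times
    matches, comps = boyer_moore(T, 'ASD')
    print(len(matches), 'matches,', comps, 'comparisons')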

Comparisons

(Plots: worst-case T analysis on the left; a regular T analysis on the right.)


We can see that even in a worst-case scenario the growth is linear, approximately n/2. This shows that the Boyer-Moore algorithm is the most efficient one for every length of T.

Time Analysis


The algorithm is incredibly fast, completing searches in 0.06 seconds with a T of length 10,000.


1 comment:

  1. OK; I would have liked to see a side-by-side comparison of the algorithms, and it's also always good to mention the complexity in terms of space. 5+5.
