Thursday, April 11, 2013

Huffman Coding

For this week, we had to implement Huffman coding using binary trees.


The algorithm is simple. First, we have some text that we want to encode, e.g. "synnick"; let's call it L.


As a pre-processing step, we count the frequency of each character in L.

s : 1  y : 1  n : 2  i : 1  c : 1  k : 1

Then we sort the characters together with their frequencies, from smallest to largest.

s : 1  y : 1  i : 1  c : 1  k : 1  n : 2
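These two steps can be sketched in Python (a sketch, not necessarily the original code):

```python
from collections import Counter

def sorted_frequencies(text):
    """Count each character's frequency and sort the pairs
    from smallest to largest frequency."""
    return sorted(Counter(text).items(), key=lambda pair: pair[1])

print(sorted_frequencies("synnick"))
# [('s', 1), ('y', 1), ('i', 1), ('c', 1), ('k', 1), ('n', 2)]
```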

With this we can start building the binary tree: we take the two entries (A and B) with the lowest frequencies and connect them to a new parent node (C), whose frequency is the sum of A's and B's, labeled with some placeholder character that does not appear in L (like '?').

A = s : 1   B = y : 1  C = ? : 2

                                                            ? : 2
                                                          /        \
                                                       s : 1      y : 1

Both A and B are removed from the frequency list and C is appended to it, reordering if necessary:


i : 1  c : 1  k : 1  n : 2  ? : 2


The process repeats, until only one element is in the list:

A = i : 1
B = c : 1
C = ? : 2


                                                            ? : 2                      ? : 2
                                                          /        \                   /        \
                                                       s : 1      y : 1          i : 1        c : 1


k : 1  n : 2  ? : 2  ? : 2

A = k : 1
B = n : 2
C = ? : 3

                                                    ? : 2                      ? : 2                    ? : 3
                                                   /        \                   /        \                /       \
                                               s : 1      y : 1          i : 1     c : 1        k : 1   n : 2


? : 2  ? : 2  ? : 3
A = ? : 2
B = ? : 2
C = ? : 4

                                                               ? : 4    
                                                          /               \
                                                        /                   \
                                                    ? : 2                ? : 2                    ? : 3
                                                   /        \            /        \                /       \
                                               s : 1      y : 1    i : 1     c : 1        k : 1   n : 2



? : 3  ? : 4

A = ? : 3
B = ? : 4
C = ? : 7 
                                 ____ ? : 7 ____
                                /               \
                           _ ? : 4 _             \
                          /         \             \
                      ? : 2        ? : 2         ? : 3
                     /     \      /     \       /     \
                  s : 1  y : 1  i : 1  c : 1  k : 1  n : 2

 ? : 7
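The merge loop traced above can be sketched with a min-heap instead of re-sorting the list each round (a sketch with my own `Node` class; ties may be broken differently than in the drawings, but the result is still a valid Huffman tree):

```python
import heapq
from itertools import count

class Node:
    """Leaves carry a character; internal ('?') nodes carry char=None."""
    def __init__(self, freq, char=None, left=None, right=None):
        self.freq, self.char = freq, char
        self.left, self.right = left, right

def build_tree(freqs):
    order = count()  # tie-breaker so the heap never compares Nodes directly
    heap = [(f, next(order), Node(f, ch)) for ch, f in freqs]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)  # A: lowest frequency
        fb, _, b = heapq.heappop(heap)  # B: second lowest
        heapq.heappush(heap, (fa + fb, next(order), Node(fa + fb, left=a, right=b)))
    return heap[0][2]  # the root, ? : 7 in the example

root = build_tree([('s', 1), ('y', 1), ('n', 2), ('i', 1), ('c', 1), ('k', 1)])
print(root.freq)  # 7
```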

The Huffman tree is the last one, with the root being ? : 7. To generate the code-word for a given character, the tree is traversed from the root to that character's leaf: each step to the left appends a 0 to the code-word, and each step to the right appends a 1.

So for every character the Huffman encoding would be:

s : 1   000
y : 1   001
i : 1   010
c : 1   011
k : 1   10
n : 2   11

The encoded word 'synnick' would then be:

000001111101001110
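With the table above, encoding is just a lookup and a concatenation:

```python
# Code-words from the table above
codes = {'s': '000', 'y': '001', 'i': '010', 'c': '011', 'k': '10', 'n': '11'}

encoded = ''.join(codes[ch] for ch in "synnick")
print(encoded)  # 000001111101001110
```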

To recover the word from the encoding, the bit string is read from left to right while walking the tree from the root: go left on a 0 and right on a 1, until a leaf node (a character) is reached. That character is emitted, and the walk restarts from the root, continuing at the current position in the bit string.
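Because the code is prefix-free, the same tree walk can be sketched as a dictionary lookup on growing bit prefixes (a sketch; traversing the actual tree, as described above, is the same idea):

```python
codes = {'s': '000', 'y': '001', 'i': '010', 'c': '011', 'k': '10', 'n': '11'}
bits_to_char = {bits: ch for ch, bits in codes.items()}

def decode(bitstring):
    """Read bits left to right; since no code-word is a prefix of
    another, the first match always corresponds to a leaf."""
    out, buffer = [], ''
    for bit in bitstring:
        buffer += bit
        if buffer in bits_to_char:   # reached a leaf: emit and restart
            out.append(bits_to_char[buffer])
            buffer = ''
    return ''.join(out)

print(decode("000001111101001110"))  # synnick
```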

The code implementing the algorithm above is:


Code:
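A minimal self-contained sketch of the whole algorithm (my own names and structure, not necessarily the original code) could look like this:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Build the Huffman tree for `text` and read off one code-word per
    character: '0' for each left edge, '1' for each right edge."""
    freqs = Counter(text)
    if len(freqs) == 1:                      # degenerate single-symbol text
        return {next(iter(freqs)): '0'}
    order = count()                          # tie-breaker for the heap
    # A tree is either a character (leaf) or a (left, right) pair.
    heap = [(f, next(order), ch) for ch, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)       # lowest frequency
        fb, _, b = heapq.heappop(heap)       # second lowest
        heapq.heappush(heap, (fa + fb, next(order), (a, b)))
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, str):            # leaf: record the code-word
            codes[tree] = prefix
        else:
            walk(tree[0], prefix + '0')      # left edge appends a 0
            walk(tree[1], prefix + '1')      # right edge appends a 1
    walk(heap[0][2], '')
    return codes

codes = huffman_codes("synnick")
encoded = ''.join(codes[ch] for ch in "synnick")
print(len(encoded))  # 18 bits, the same total length as the hand-built tree
```

The exact code-words can differ from the table above depending on how ties are broken, but any Huffman tree for these frequencies yields the same total encoded length.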



Example:

Typical case analysis:

For the typical case, a word of length N was generated, containing random characters from the alphabet. Measuring the compression time for different lengths, we get the following graphic:
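The test words for these timing runs can be generated along these lines (a sketch; the exact generator used for the plot isn't shown, and `random_word` is my own name):

```python
import random
import string

def random_word(n, alphabet=string.ascii_lowercase):
    """A length-n word of uniformly random characters, used as timing input."""
    return ''.join(random.choice(alphabet) for _ in range(n))

word = random_word(1000)
print(len(word))  # 1000
```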



Worst case:

For the worst case, I found that generating code-words for characters whose frequencies follow a Fibonacci sequence makes the Huffman encoding really slow and inefficient. To create texts with different frequency distributions, I wrote a small function that builds a string in which each character repeats itself a Fibonacci number of times. Then I measured the running time as the Fibonacci sequence grows. The result is:
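The Fibonacci-frequency texts can be built like this (a sketch; `fibonacci_text` is my own name, and the post's exact generator isn't shown). Fibonacci frequencies are a known bad case because each merge combines the newest subtree with the next single character, producing the most unbalanced tree and hence the longest code-words:

```python
import string
from collections import Counter

def fibonacci_text(k, alphabet=string.ascii_lowercase):
    """A text whose k distinct characters appear with Fibonacci
    frequencies 1, 1, 2, 3, 5, ..."""
    a, b = 1, 1
    parts = []
    for ch in alphabet[:k]:
        parts.append(ch * a)   # repeat the character a Fibonacci number of times
        a, b = b, a + b
    return ''.join(parts)

print(sorted(Counter(fibonacci_text(5)).items()))
# [('a', 1), ('b', 1), ('c', 2), ('d', 3), ('e', 5)]
```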


As can be seen, each instance roughly doubles the time of the previous one, taking a minute and a half to compress a text with Fibonacci frequencies.


1 comment:

  1. A uniform distribution is not actually typical – real texts have skewed character frequencies. All quite good; 7+7 pts.
