I just came back from today’s exercise in information theoretic modelling (so far I would call it redundancy and data compression, but that doesn’t sound anywhere as cool) with a question that had formed itself during the exercise.

We had the task to implement a Shannon-Fano encoding (a very simple data compression that uses variable length prefix codes) as homework. Basically the algorithm creates a binary tree so that the more frequent characters are higher up than the less frequent ones. This tree is then used to create binary prefix codes. The tree is constructed from a sorted table with the frequencies by subdividing it into 2 halfs that are as equal as possible.

This is the part where I didn’t pay too much attention and thought the equality was the most important goal when dividing. In fact the dividing step is quite simple as it is supposed to be a simple split at the right point. Instead I tried a brute force approach to find the best 1:1 sized division. Later I came up with an approximation that is quite close to the brute force results and is much faster for bigger frequency tables. O(2^n) -> O(n)

In the exercise i found out how it was supposed to work and I started comparing the two approaches (brute force/approximation vs real shannon)

My understanding of the problem was that a very equal division means that the length of the codewords will be close to the length that is most suitable for the entropy – and thus would minimize average code length.

After trying it out I was proved wrong. I used a few pieces from the English wikipedia to test. And all of them turned out to be slightly longer using my encoding (difference between brute force and the approximation was negligible). On average Shannon was 3% better than my approach.

After thinking about it for a while I think I found the problem. It seems that my approach favours equal code length a little too much.

The following dataset illustrates the difference

char: | C | B | A | D | E | F |
---|---|---|---|---|---|---|

freq: | 25% | 22% | 20% | 15% | 13% | 5% |

The two algorithms then give the following output

As you can see the tree is more balanced for the equal-split but if you compare the differences in code length it is obvious that the second is worse for encoding.

char | shannon | equal-split | len*probability (shannon) | len*probability (equal-split) | contribution to entropy |
---|---|---|---|---|---|

a | 10 | 011 | 0.4 | 0.6 | 0.4644 |

b | 01 | 10 | 0.44 | 0.44 | 0.4806 |

c | 00 | 00 | 0.5 | 0.5 | 0.5000 |

d | 110 | 111 | 0.45 | 0.45 | 0.4105 |

e | 1110 | 110 | 0.52 | 0.39 | 0.3826 |

f | 1111 | 010 | 0.2 | 0.15 | 0.2161 |

∑ | 2.51 | 2.53 | 2.4542 |

The leftmost column represents the ideal ratio of this letter for the total encoding and the 2 columns right of this one show, how well the two algorithms work for that specific char. Most interesting is the last line where you can see that both algorithms are still a bit away from the lower bound imposed by the entropy. So the actual difference between the two algorithms is very slim. The effect of course varies with the dataset but I didn’t find any instances where the equal-split was closer to the entropy than shannon.

So far I wasn’t able to come up with a mathematical proof if shannon is always better, but maybe I’ll have the right idea one of these days…

You can also download the source from my experiments, but don’t expect too much as I have only just started with python

soucecode (.py, 7.5kb)