Skip to content

The story of the student, who created the most famous compression algorithm, because he tried to avoid an exam

This is the third in a series of articles where I go over the basics of common type of compressions. You can find the first article here

When you take pictures, he’s there.
When you listen to music, he’s there.
When you surf the web, he’s there.
This article is dedicated to him.


Drawing lesson

Question: How would a basic drawing software which allows you to draw in just eight colors, encode the information into bits?
A common way to do this is to predefine a specific bit sequence, of a fixed length, per color. For eight colors we will need at least 3 bits.

A dictionary

This simple implementation then allows converting from bits to colors: The software will take three bits from memory, find their translation to a certain color in a dictionary, display that color on the screen, then move on to the next three bits.

Looking at the table one might think that it is possible to reduce the size of the information by translating some of the colors into two bits or even a single bit.

To demonstrate why this is not as simple as it seems, let’s change the black color translation to “00“. Now we will attempt to restore the following color sequence: 001010010100

Oops, we encountered a problem, the sequence can be restored in two different ways:



A new dictionary

To map the black color to “00” and still allow the information to be restored, we will need to assign new codes to some of the other colors. These codes will need to be longer (see the second figure).

Which of the two dictionaries is more efficient? If the user utilizes all colors equally, then the original dictionary will produce fewer bits. On the other hand, if the user often draws in black and barely draws in brown and purple, then the new dictionary is more suitable. There is no need to choose the most appropriate dictionary in advance. First we will see how much each color is used and then, once the user has saved the file, we will choose the dictionary that will serve us best. Selecting the dictionary with the optimal “variable length codes” is an act of data compression.

In 1948, mathematician Claude Shannon published the paper “A Mathematical Theory of Communication”. The article formed the basis of information theory, an important idea that deals with various aspects of information such as quantification and measurement of information. In this paper Shannon also defined the concept of entropy (of information) as the amount of information, on average, that is contained / produced (measured in units of bit).

If the upper paragraph does not sound like gibberish to you, congratulations, you are a hard-core electrical engineer.

Using the entropy from the theory he developed, Shannon found a way to calculate an upper limit to the efficiency of a dictionary of codes. A formula that, after plugging in the relative share of each color in the data (or another symbol), makes it possible to calculate how far, theoretically, the information can be compressed. It is of course possible to create a dictionary that compresses less than the result obtained, but it is not possible to create a dictionary that compresses better. Shannon’s formula does not provide the dictionary itself, only a measure of the quality of the dictionary. Measuring the Everest without climbing it.

Climbing the Everest

Shannon did publish an algorithm for producing code dictionaries and in 1949 the scientist Robert Fano published an improved algorithm. Both Shannon’s algorithm and Fano’s algorithm were very effective, but Shannon’s formula showed that there was still room for improvement. Shannon and Fano climbed high but did not reach the summit.

In 1951 Fano gave his student David Huffman two options: study for a final exam or find the optimal algorithm. Fano “forgot” to mention that both he and Shannon tried to do this without any success. Huffman worked on the problem for months, but as the test date approached he decided to give up and start studying for the exam.

Then he had a Eureka moment!

  • Huffman’s method:
    • Record the number of occurrences (or the probability) of each symbol.
    • Write down the sum of the lowest numbers above and connect a line to them.
    • Repeat the previous step over and over again (even for sums of numbers) until you reach the final sum of numbers.

To know the new code of each value we descend from the final sum along the lines to the number of its occurrences. At each line junction, if we turn left we write 0 and if we turn right we write 1 (the opposite also works).

A 250 pixel drawing in our software consisting of 19 red, 22 orange, 24 yellow, 40 blue and 70 purple pixels on a green background (75 pixels) will look like this:

The “tree” is built opposite to the direction of the arrows, the code reading is along the direction of the arrows

Huffman encoding is one of the most common compression techniques, MP3 uses Huffman, so does JPEG. Huffman is often used as a finishing step, a final compression layer after using other compression techniques.

The codes that the Huffman algorithm produces are not only efficient, they also have another special feature: they can be decoded instantly. The following code set: 11 ; 110 does not have this feature. When we read the output 11 we can not tell if it is the code 11 or the beginning of 110 until another character is read. The reason for this lagging in decoding is that 11 is a prefix of 110. Against all common sense, codes that do not constitute a prefix of other codes in the set, such as Huffman, are called prefix codes.

The opening crawl of Star Wars Episode 4 “IT IS A PERIOD OF CIVIL WAR…”, 46% compression not including the dictionary

Compressing Compression Dictionary

Although Huffman provides amazing results, the dictionary is needed for restoring the information. Often information is divided into blocks where each block contains a separate and distinct dictionary. The size of the dictionary diminishes from the savings obtained and hence we need to create it in the most efficient way possible.

Let’s consider a simple and naive implementation of a dictionary based on the codes obtained from the drawing. In order to log that the purple color is mapped to the code 10, we will first write the code of the purple color as defined before compression (111, see the first figure), followed by the code as it appears after compression.

After doing this for all the colors, we get: 111-10101-11010-000011-001100-010110-011

Oops, we’re missing something, a software which reads the dictionary will not be able to identify where the new codes end up. Is the new code for purple 10? Or 101? Or maybe it’s 1010? We will solve this by encoding into the dictionary the length of the new code in binary. Assuming that in the most extreme case the length of the new code will be seven characters, we need to devote three more bits per symbol. Our dictionary will now look like this: the first 3 bits represent the color before compression, 3 more bits represent the length of the new code and a few more bits of varying length represent the new code.


The 52 characters above comprise a complete dictionary. We will use this dictionary as a basis for a dictionary of reduced size.

The first technique we can use is to set the dictionary in a fixed order. Let’s say that in our software the order is always as shown in the first figure: black> brown> red> orange> yellow> green> blue> purple, which now be called “the color order”.

If the order is fixed, it is not necessary to write down the color codes before compression, but unlike before, we will need to write down the code lengths of colors that are not in use (zero bit) if other colors are used after them.

We will then get a 40-bit dictionary as shown below.


We will try to reduce further with the following technique:
    1. We will arrange the codes by length from short to long. (Black> Brown> Purple> Green> Red> Orange> Yellow> Blue)
    2. In each length group we sort by the color order. (Black> Brown> Green> Purple> Red> Orange> Yellow> Blue)
    3. We will take the first code (11) and replace it with a code consisting of zeros of the same length.
    4. The next binary number will replace the next code in the order.
    5. If the new code is shorter than the old one, we will add zeros until they are the same length.
    6. We repeat steps 4 and 5 until the last of the colors.

This technique creates codes called Canonical Huffman codes. These codes are unique prefix codes and have the same length as the previous codes. So why did we replace the previous codes? Because as long as the order is in the colors order, these codes can be described by their length. In other words, it is not necessary to write the code itself in the dictionary only its length.

The new dictionary contains only 24 bits:


Looking at the dictionary, we can see that the first shortest code (which is not zero bit) is 2 bits long and its position is six. We can conclude from this that it is the color green which is placed sixth in the color order and the code itself, according to the 3rd clause of the method, is “00“. The rest of the codes can be restored in a similar way.

Because we use six of the eight colors, the “penalty” for describing unused colors is small. But if we were to paint in six colors out of a thousand supported colors, it would be more sensible to describe only the colors used. For this purpose, we will write in binary how many codes have a single bit, in our case there are none, now we will write how many codes have two bits, there are two: green and purple. We will write down their original codes, in the colors order, then move on to three-bit codes… When we finish, we obtain the following recoverable canonical dictionary:


Whichever method we choose, we will need to differentiate between the dictionary and the information itself so the last piece in the puzzle is a number of fixed bits at the beginning of the dictionary that define its length.


Since its development, the Huffman encoding has been the best way to generate variable length codes. On a good day, any other technique could, at best equal Huffman’s compression but never surpass it. Since that Eureka moment, Huffman has become the optimal way to compress random symbols. The correct way to compress random symbols. The one and only, the undisputed king!

And then came arithmetic coding…

For more information on Shannon’s entropy of information and formula.

Published inData Compression 101

Be First to Comment

Leave a Reply

Your email address will not be published. Required fields are marked *