Where did the missing 44GB went? – Data compression 101

This is the second in a series of articles where I go over the basics of common type of compressions. You can find the first article here

Wikipedia in its English version has 3.5 billion words. All human knowledge compressed to 3.5 billion words. How do I know this? I read it on Wikipedia.

***

Some of you may be surprised to discover that the recent text version of Wikipedia can be downloaded in its entirety. As I’m writing this lines, the English version of Wiki is 58GB, but if someone will try to download it he will be glad to know that he downloading a compressed file that is “only” 14GB, which is a quarter of the original file size!

Each paragraph, sentence, word and even a letter appearing in the uncompressed file, also exists in the compressed file. And yet the compressed file is 44GB smaller. How is it done? in the article we will look at the methods and algorithms that enable this.

So, replace the batteries in the flashlights and release the dogs. We going to search for the missing 44.

Run-length encoding

Sometimes when we compress information we come across repetitive patterns. DNA For example consists of a sequence of four nucleotides called A, C, G and T. Distribution of the nucleotides in the sequence is not uniform and there are areas where there are high concentrations of CG’s. Recording of an audio file may contain regions of silence that are shown as repeated sampling of 0 volume.

In Run-length encoding (RLE) compression, instead of encoding each repetition, we encode the number of repetitions. To get an idea of the potential savings, we will take a look at the smiley image. This is an image of a magnified icon, which was originally created in a bmp format. The software that created the icon encoded the pixels colors in order. First the bottom left pixel, then the adjacent pixel on the same row and so on. When the row ended, it continued to encode starting from the left pixel in the row above. So, the content of the smiley file will look like this: blue; blue; blue… yellow; yellow; yellow… blue; blue …

In some circumstances, the bmp format supports RLE compression. If those circumstances will be present the smiley file will look like this: blue X69; yellow X6; blue X38 and so on. Guess in which method the file is smaller? 🙂

This technique cannot always be utilized. As suggested earlier, RLE require repetitive templates in the information to be useful, in other words it will work brilliantly on the smiley but not on an photo of me making duck face.

Options reduction

A photo (or a drawing) in a 24-bits type bmp file, can display 16 million colors. To represent one of those millions of colors, it requires, not surprisingly, 24 bits.

Despite the huge potential, in the smiley world only 3 colors exist. In a drawing software like Paint, it is not practical to use more than a few dozen colors. Photos exploit the variety of colors better and yet a 4 mega pixel photo, will use at most a quarter of the colors. Re-mapping used values to a new sequence of bits, while neglecting unused values, reduces the number of bits required to represent a color, 2 bits in the case of the smiley

The 8-bit ASCII standard contain 256 characters: 0-9, a-z, A-Z, punctuation characters, control characters, and a number of other characters. Some English text, say a children’s story, might not contain numbers. Furthermore, usually English text is written in lowercase, so some of the uppercase letters may not be found. True, it is common in English to start a sentence with a capital letter, but some of the letters (VKQZX) Are less likely to appear at the beginning of a sentence than others (TIAHS). This means that it is very possible that the same text could be represented by using 5-bits (32 symbols) rather than 8, savings 37.5% of the block volume.

Let’s summarize by saying that RLE exploits repetition in the information to reduce the file size while options reduction does this by not representing unused symbols.

So far we have dealt with compression in a very general way, now let’s examine more closely how the above mentioned techniques could be applied in practice.

Applying compression

Each compression software has a decompression software that completes it, yin and yang. The first is meaningless without the second, the second cannot exist without the first. For these two to coexist in symbiosis, they must follow strict rules. These rules are what makes up the format.

In the compression format bzip2, one of these rules deals with the execution of RLE. This is done in the following way: When the compression software detects a symbol that is repeated more than 3 times. It will encode it 4 times. And then it will encode the number of additional repetitions.

Example 1 : “It’s over 9000!!!!!!!!!!!!” Will encode to – It’s over 9000!!!!8.

Example 2: “The password is 1111” Will encode to – The password is 11110.

In the first example 8 of the exclamation marks in the uncompressed file were converted to the symbol 8 in the compressed file. Meaning that we saved the encoding of 7 symbols. The second example shows an unusual situation in which we followed the laws of the format and the unfortunate result is an increase in data size by an additional symbol. From the second example, we can learn why a minimum repeating symbol bar was set (4 repetitions) for encoding RLE.

When we use RLE Why do we encode the first 4 repeating symbols? And why do we bother to encode 0 when there are only 4 reps?

A software that would decompress the file will use this information to restore the file to its pre-compressed state. The number of repetitions is encoded into the file in the mix with the rest of the information and therefore there is difficulty separating it from the other symbols. A software that decompress the file and encounters the digit 8 needs to make a decision, whether it is the symbol 8 as in a text containing a telephone number or whether it is an instruction “the character should be repeated 8 times”. The way the software makes the decision is as follows: It will track the repetition of the symbols and when it encounters four, it will examine the next symbol to find the number of additional repetitions. If a number appears before counting four repetitions, this is part of the text and not a RLE instruction.

When we apply RLE according to the laws of bzip2 on the smiley image, we get:

blue; blue; blue; blue; 65; yellow; yellow; yellow; yellow; 2; blue; blue; blue; blue; 34 etc.

Let’s go back to options reduction, and look at another acceptable way to implement compression: division to blocks. In this technique, we take the information and divide it into parts, say fixed blocks of 256 symbols. Before each block we will place meta-data that will provide required information for compression. In the case of options reduction on an image, this information will be re-mapping of colors to the bits sequences that represents them in the file.

Earlier I argued that if we perform options reduction to the smiley image, we could represent its 3 colors with the help of two bits. But if we divide the image into blocks of 256 pixels, in some of the blocks there will be only two colors. Meaning we could represent the smiley colors in those blocks using a single bit.

**The first block of the smiley image. Note that the preceding meta-data is generally consider part of the block.**

Burrows – Wheeler transform

You’ve probably noticed that the examples of RLE on text were quite forced. In written languages it is rare to find repeating characters in sequence. Hence RLE alone is useless on text, to make it useful requires… magic.

So lets do magic, we will take the word “abracadabra”, add to it the character “$” at the end. Now we will imagine that this word appears on a wheel. We will write down all the letters as they appear on the wheel in order, starting from the letter A. That gives us the same word. We’ll do it again, but this time we’ll start with the letter B. We will continue to do this again and again, each time starting from a different point.

after we finish, we will get the following:

abracadabra$
bracadabra$a
racadabra$ab
acadabra$abr
cadabra$abra
adabra$abrac
dabra$abraca
abra$abracad
bra$abracada
ra$abracadab
a$abracadabr
$abracadabra

We will sort the results in lexicographic order ($ precedes a) and we will get:

$abracadabra
a$abracadabr
abra$abracad
abracadabra$
acadabra$abr
adabra$abrac
bra$abracada
bracadabra$a
cadabra$abra
dabra$abraca
ra$abracadab
racadabra$ab

Taking, the last character of each word, in order, would give us “ard$rcaaaabb”.

What we have just done is implementing Burrows – Wheeler transform (BWT). To understand for what good is it, look at the result, specifically on the letters “aaaabb” at the end.

We took the word “abracadabra” which does not contain consecutive repeated letters and transform it to a word with four repeats. This is not a coincidence, but a unique feature of this transformer. You can think of BWT as a machine that receives coherent texts and emits gibberish in which identical letters tend to be in sequence.

I won’t demonstrate it, but amazingly, the single character that we added “$” (a character that we pre-define that it can’t appear in the text) is all we need to restore the original text.

To make the best use of BWT, it should be performed on long texts and not on single words. Using an online tool I did a BWT on the beginning of the song “Billie Jean” (lowercase).

Here is the result:

eddeeeemdii, afs, ellltdtnheeenye, drrreedutsardeeeeeeyeeeieeassnnnmgmnnn

edeet, ooododnndne ess neeedddww hhce nnnssnneiileaenn nnnn

khhhhmirccchhhhhhhihmhnhrjmbsmnubhecchvy ntwtstttttttttsss t o www

aalvlwwwb mew Illlol Iiiifffoaa aa eeoooiiioaa aaauuuieeooorih

hhd tr d lllooomyrrrm oooeoduf eaaae $ aa’ui u u

oqoootabaeo rte

We can see that we created a very convenient text for RLE, with many repeating symbols in a row. A longer text would likely have an even more convenient outcome.

How does BWT do it? How does it take random text and turn it into compressible text? How do the simple actions we did cause similar letters to group together? The scientific answer is black magic.

We can now run RLE on the result, but instead let’s take it one step further…

Move-to-front transform

Move to front transformer (MTF) will often follow BWT and similar to it, it does not compress the information but rather transform it to an output that is more convenient for compression. We shall build a basic version of it by first creating a table, 30 cells wide, and fill it as follows: “$” in cell 0, ” ” (space) in cell 1, “’” In cell 2, “,”(comma) in cell 3 and a-z In cells 4-29.

Here are the rules:

1) Take the first character in the text and record the position of the character in the table.

2) If the character position in the table is not 0, update the table so that the character in the table moves to position 0. If necessary, the rest of the characters will free up space by moving a cell.

3) Now that the table is updated, we will repeat the process with the next character in the text. And so on until we go over all the characters in the text.

The set of rules above is all that is needed to create a simplified version of MTF transformer.

We will execute the MTF on the output we got earlier when we did BWT on “Billie Jean”:

**MTF On “eddeeeem”, the first 8 characters of Billie Jean post BWT**

We will get the following result:

8 8 0 1 0 0 0 16 2 13 0 7 8 11 22 3 7 17 0 0 23 8 1 19 17 5 0 0 2 28 2 7 6 24 0 0 3 0 2 25 8 10 12 6 5 6 0 0 0 0 0 8 1 0 0 13 1 0 5 6 0 10 0 0 14 20 1 2 0 0 5 8 1 0 10 12 23 0 0 4 1 1 5 0 1 1 5 18 1 9 0 2 3 3 0 0 4 0 0 27 0 4 16 0 22 5 3 6 0 0 7 0 1 0 3 14 0 18 2 15 1 4 0 6 1 0 0 0 24 9 0 0 0 15 8 18 11 0 0 4 0 0 0 0 0 0 3 1 4 1 6 1 5 24 4 24 13 2 6 21 4 7 12 10 0 2 27 22 14 19 1 9 21 19 1 13 1 0 0 0 0 0 0 0 0 1 0 0 4 2 1 3 0 0 18 0 19 9 1 3 0 0 13 5 15 14 4 3 18 6 0 0 12 1 3 3 0 0 0 23 0 0 4 10 0 4 1 1 1 1 7 0 3 0 0 5 0 0 1 4 0 4 1 0 0 17 0 0 4 5 0 5 0 0 19 3 17 0 0 21 8 15 5 2 3 1 11 0 0 7 0 0 13 18 6 0 0 2 5 4 0 0 10 1 7 11 13 5 5 13 0 0 1 2 4 1 24 4 0 25 4 15 5 2 9 27 1 0 0 15 7 18 1 10 4 7 14 6 4

Why did we do that and for what is it good for?

The result of the second rule of MTF Is that each repeating sequence of characters will produce some number followed by a sequence of zeros regardless of the type of characters. Indeed, the output consists of 34% zeros. The more the character is common in the text the more the times it will be promoted at the table, and indeed the most common value after zero is one, which appears in 11% of the output. Other numbers that are common in output: two, three, four and five. Meaning, the output consists mainly of low values.

The special feature of MTF, Under the right circumstances, is that it reduces the number of symbols in the section and converts them to low values. Compression algorithms know how to take advantage of the fact that there are fewer symbols in the block. We saw an example of this before. Furthermore, some types of algorithms also know how to exploit a tendency for low values. And so the next natural step would be to run this kind of compression algorithm, such as a special version of RLE or Arithmetic coding. These are for another article.

***

When we compress, we are not limited to one technique. In a good compression format, several algorithms work together to achieve the best result. Wikipedia, in it’s downloadable version is compressed in the bzip2 format mentioned earlier. bzip2 is a free open source compression format, that uses all the methods that been demonstrated in this article.

How do I know that? I read it on Wikipedia.