Replacing a sentence with a single character – Data compression 101

This is the first in a series of articles where I will go over the basics of common type of compressions.

There is a legend about a famous 19th-century writer, Victor Hugo, who was on vacation. That writer wanted to know how the sale of his new book, “Les Misérables”, was progressing.

So, what did he do? He wrote to the publisher.

When the publisher opened the letter, he was surprised to find out that it contain a single character: “?”

***

In data compression we are trying to represent relevant data in a minimal number of symbols. Take a file and shave of him every possible bit. Less is more, that is, a website that takes less memory downloads faster, images that take less space allow to store more of them and video that occupies less space allows it to be streamed at a higher quality.

One character: “?” that replaces one sentence: “How are my book sales going?” – an excellent example of data compression.

Robert Downey Jr.

Do you ever fill an Excel table and one of the values is unusually long and goes beyond the cell borders? If you want to keep it that way, then Excel can hide the exceeding part for you. But then it bugs you and you are forced to widen the whole column. Of course, now the table takes up more space.

Ouch.

There is another solution. You could just represent the value in the rogue cell in a way that won’t expand your table. For example, by splitting the value into two rows.

What that to do with data compression?

As you probably know, computers need to represent everything in bits. Now let’s pretend that we have a series of numbers with values ranging from 0 to 7:

5 ; 5 ; 4 ; 6 ; 2 ; 3 ; 1 ; 7 ; 6 ; 3 ; 0

We can represent the values in the series using three bits per number. Meaning that representing the entire series would require 33 bits.

Now let’s say that instead of the previous series we got a new identical series with the exception of the number three in the middle of the series, which was replaced with the number 12.

This minor change will force us to represent the values using four bits and the whole series using 44 bits. Significantly more than before.

And here something from life. You’re browsing eBay and you buy a chicken shaped yolk extractor. You don’t need it, you don’t even cook, but it cost just 99 cents. Another case of impulsive buying just like your Spock silicon ears ($3), vintage Edison bulb ($5), Hello Kitty iPhone case ($4) and T-800 poster ($7).

So far, all small sum payments. Then you go to the living room and press the TV remote. nothing happens. The tube won’t turn on. You get smokes instead. No need to panic there is a Wi-Fi, a few clicks on Amazon and you are the proud owner of a brand new 8k ($2300). The credit card company, that tracks your purchases and stores them in her database, will encounter the same issue we saw before. The series of small purchases could be represented by a small number of bits if not for the occasionally relatively large porches.

How can we reduce the number of bits needed for representation?

Simply, just like we did with the Excel, we will split the rogue value so we could go back to represent the series using three bits per value.

One way to do so, is to define that from now on the values above ‘6’ are represented by the sum of two consecutive numbers when the first of them is ‘7.’

Indeed we saved five bits, but this is not a perfect solution. First, note that 7 must be split as well, which means that our bit saving gets reduced the more times that 7 appears. Secondly, this method does not address values above 14.

Now that we have an idea how the credit company may compress the data when it receives a series that composed mainly of small numbers and a few large ones. Let’s see how she might deal with a client that makes just large purchases.

Adele who found out that her husband was cheating, took his credit cards and went for a shopping spree. First stop is the Mercedes Benz showroom, second stop is Tiffany, thirds stop is Versace and so on…

From the perspective of the credit cards company this infidelity looks like that:

$341,445 ; $31,900 ; $28,500 ; $29,222 ; $28,119 ; $28,700 ; $30,500 ; $29,000

How shall we compress the above? One possibility is to reduce the numbers in a way that allow their reconstruction. For example, let us choose the lowest value (28,119). We will call it the reconstruction member and we will copy it before the series (outside the block), then we subtract the values by the reconstruction value.

The members of the new series could be represented using a 25 bits, instead of 26 bits (as shown in the left column in the below table). A saving of one bit per number, but of course the new series now contain an additional member so in total we saved nothing. In contrary, we added 14 bits to the series. If the series was longer, the saving per value would outweigh the cost of adding reconstruction value.

Why did we got such a poor results? The main culprit is the first member that is relatively large in compared to other member of the series. This compression method works great when the values of the members are close to each other. If the value of the first member was not 341,445 but 30,000 for example, we would save 3 bits per member. There are different way to handle large members, we have seen one before (which won’t provide good results in this case), another way is using a compression method that capable to represent different size values using variable length bits.

Delta force

Important principle in compression is to know to match the type of compression to characteristics of the expected data. As we have seen before a mismatch may create an opposite effect where the “compression” increase the volume of the data. On the other hand, compression that fits the nature of the data could be very effective.

And here is an example to a successful compression: a running watch of some kind that containing 8MB memory sampling GPS coordinates every second. Even if the manufacturer allows to keep one-week record, it seems that 8 MB is more than enough memory for GPS coordinates. Until you do the math and found out that in seven days its sums up to 600 thousand! Samples.

Example of GPS sample (longitude; latitude; altitude):

483221123 ; 234234324 ; 122111.

For simplicity’s sake, let us neglect the latitude and longitude and examine just the above sea height. What values we are expected to get in one second intervals? Most likely the values that we will get will be identical or at least close to each other. That is unless the user falling from a cliff. That insight opens the door to Delta encoding, in this compression instead of recording the height in each sample, we are recording the height just on the first sample (and once a week). And on the other samples we record the difference. At need, the processor will calculate the require value.

It can be seen that in this case it is a significant saving even when accounting for the extra bit that necessary to distinguish negative numbers from positive (red).

And here is a more challenging example, a medical portable accessory of some kind tracks and records throughout the day the ECG of a patient.

An average proper ECG contains 5 repetitive jump\fall points (P,Q,R,S,T), the device records the time stamp and height of each such point.

The time stamp we can compress easily using Delta encoding, but what about the points height?

We can Delta compress them as well, but there is a nuance. If we calculate the difference between the height value for each previous point we will get unsuccessful results. This is a especially prominent when calculating the difference between the S value and R value.

How we overcome the issue? We calculate the difference between the height of same type points. That is, between S and the preceded S, R and the preceded R, etc.

Various Video formats also use (among others) Delta encoding. Imagine for a moment a scene, from a stationary camera, of a red car crossing empty street. This scene, which last a few seconds contains hundreds of frames, where the background in those frame is the same and the only difference between frames is just the location of the vehicle. An efficient way to represent the scene is to encode just the differences between frames. Meaning, roughly, to inform the video player only about the changes in the pixels color that show the car. That way we can save all the data that responsible for the background pixels. This could even be taken a step further by utilizing the correlation of adjacent pixel colors.

The art of compression

Data compression is everywhere around us, most user aware of it mainly because ZIP and RAR files, but, thanks to the Internet, we are expose to it much more.

Behind the scenes of cloud services like Gmail and Dropbox, working hard compression algorithms, which optimized for specific file types. As mentioned before, successful data compression consists of deep understanding of the information characteristics. And from addition “secret” ingredient. No, not love, but creativity.

In this article we went over several examples, some of which are rarely practical, but all provide insight to alternative ways to display the same data.

See you at the next post.

***

By the way, according to the legend, the publisher replied the writer: “!”

Make sure to check the next article in the series: “Where did the missing 44GB went?“

Replacing a sentence with a single character – Data compression 101

This is the first in a series of articles where I will go over the basics of common type of compressions.

Robert Downey Jr.

Delta force

The art of compression

Be First to Comment

Leave a Reply Cancel reply