
Information entropy explained

How much information is in the result of a coin toss? That’s an odd question to ask. It doesn’t sound right. You cannot think of information as if it were water in a bucket, right? Well, Claude Shannon begs to differ.

In 1948, Shannon published his paper “A Mathematical Theory of Communication,” and in doing so laid the foundation of information theory. This theory became a really big deal, contributing to many fields of science, especially data compression.

Information theory states that we can quantify the information that is contained in a random event such as a coin toss. When we quantify the amount of water in a bucket, we do so in liters or gallons. When we measure information content, we do it in bits.

Back to the question of how much information there is in a coin toss. According to Shannon, the answer is provided by a formula which looks like this:

h(x_i) = \log_2\left(\frac{1}{p(x_i)}\right)

where p(x_i) is the probability of the outcome x_i.
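As a quick sketch of the formula in Python (the function name is my own, just for illustration):

import math

def information_content(p):
    # Information content, in bits, of an outcome with probability p.
    return math.log2(1 / p)

print(information_content(1 / 2))   # heads on a fair coin: 1.0 bit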

So the information gain from a coin toss that results in heads would be:

\log_2\left(\frac{1}{\frac{1}{2}}\right) = \log_2(2) = 1

which, not surprisingly, is the same as the information in a coin toss where the result is tails.

Now let’s roll a fair 4-sided die. What is the information content in getting a 1?

Using the formula, we get that the answer is 2 bits. Wait, what? A die roll contains more information than a coin flip? Well, yes! Furthermore, rolling a fair 8-sided die will generate 3 bits of info.

Note that information content grows with uncertainty, so we can expect that the result of rolling a normal 6-sided die will contain more info than the 4-sided die, and less info than the 8-sided die. Indeed, after plugging the probability into the formula, we get that the information content of a 6-sided die roll is 2.585 bits.
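Plugging the die probabilities into the same formula, a small Python snippet (again just an illustration) reproduces these numbers:

import math

# Information content of one roll of a fair n-sided die, where each face has probability 1/n.
for sides in (4, 6, 8):
    bits = math.log2(sides)   # log2(1 / (1/n)) is just log2(n)
    print(f"{sides}-sided die: {bits:.3f} bits")
# prints 2.000, 2.585 and 3.000 bits respectively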

So far, we’ve dealt with equally probable outcomes; each side of the die had the same chance to be the result of a roll. But let’s say we have a sack with 2 green balls, 1 yellow ball, and 1 red ball. If we randomly pick one ball out of the sack, the information content of pulling a red ball (probability 1/4) is 2 bits, while that of a green ball (probability 1/2) is 1 bit. This is because a green ball is twice as likely to be picked as a red one.

Keeping with the sack example, we pick a random ball and return it to the sack, then we pick another ball and again return it. We keep doing that indefinitely, each time gaining some amount of information (either one or two bits). The average information content is the entropy!

And of course there is a formula for that as well:

H = \sum_{i=1}^{n} p_i \log_2\left(\frac{1}{p_i}\right)

Entropy is also measured in bits. Looking at the entropy equation, we can deduce that when all outcomes are equally likely, the entropy equals the information content of any single outcome. If you keep flipping a coin, the information gain is the same whether the result is heads or tails, so the average information content is the same as the information content of each toss.
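Here is a minimal Python sketch of the entropy formula (the function name is mine, not from Shannon’s paper):

import math

def entropy(probabilities):
    # Average information content, in bits, over all possible outcomes.
    return sum(p * math.log2(1 / p) for p in probabilities if p > 0)

print(entropy([1/2, 1/2]))    # fair coin: 1.0 bit
print(entropy([1/8] * 8))     # fair 8-sided die: 3.0 bits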

Now that we can define entropy, let’s try to understand what it means. Say we keep tossing that coin again and again, and after each toss you record the result in binary form, i.e. using zeros and ones. How many bits would you need to record each toss result? You would need one, the same as the entropy. You already know that the only possible results are ‘heads’ or ‘not heads’, so you can write 1 for ‘heads’ and 0 for ‘not heads’.

One way to think about it is that entropy is the minimum average number of questions we need to ask to know the outcome. Since our unit of measure is bits, there are only two possible answers to each question: 1 or 0, yes or no.

For the coin toss, we need to ask just one question: ‘is it heads?’ Thus the information content is 1 bit. In the case of the 4-sided die, there are four possibilities. Our best strategy here is divide and conquer. So if an out-of-sight friend rolls the die, our first yes-or-no question will be: ‘is the result one or two?’ If the answer is yes, we then ask, ‘is it one?’ If the answer is no, we ask, ‘is it three?’

When dealing with an 8-sided die, the entropy, like the information content, is 3 bits. And indeed we need to ask a minimum of 3 questions to determine the roll result.

Now how the hell do you ask 2.585 questions, which is the entropy of a normal 6-sided die roll? More on that in another post, but for now, let’s review the entropy of the sack example. Doing the math shows that the entropy of picking balls from the sack is:

\frac{2}{4} \log_2\left(\frac{1}{\frac{2}{4}}\right) + \frac{1}{4} \log_2\left(\frac{1}{\frac{1}{4}}\right) + \frac{1}{4} \log_2\left(\frac{1}{\frac{1}{4}}\right) = 0.5 + 0.5 + 0.5 = 1.5 bits.

With an optimal questioning scheme (first ask ‘is it green?’, and if not, ‘is it yellow?’), half the time we need a single question to know the outcome (when the ball is green), and half the time we need two questions. So over many draws, we need an average of 1.5 questions to know the result.
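If you want to convince yourself of that average, here is a small simulation of the questioning scheme in Python (the colours and question order mirror the sack example; everything else is illustrative):

import random

sack = ["green", "green", "yellow", "red"]   # 2 green, 1 yellow, 1 red

def questions_needed(ball):
    if ball == "green":   # Q1: "is it green?" settles it with one question
        return 1
    return 2              # Q2: "is it yellow?" settles it either way

draws = 1_000_000
average = sum(questions_needed(random.choice(sack)) for _ in range(draws)) / draws
print(average)   # close to 1.5, matching the entropy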

The relationship of entropy to digital data compression

Symbols in data, such as letters and pixel colors, are represented by binary codes. A common compression technique is to replace the codes of frequent symbols with shorter codes, often at the expense of giving rare symbols longer codes. In total, this still reduces the number of data bits and thus the file size. Another option is to replace the codes of multiple symbols with a single new code.

It turns out that, in both cases, the new codes’ average number of bits per symbol cannot be lower than the entropy! When we replace each symbol’s code one at a time, the Huffman algorithm brings us closest to the entropy. If we are not limited to compressing a single symbol at a time, arithmetic coding can bring us even closer.
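As a rough sketch (a textbook-style Huffman merge, not any particular library’s implementation), here is Python code that builds Huffman code lengths for the sack’s colour probabilities and compares the average code length to the entropy. Because these probabilities are all powers of 1/2, the average lands exactly on 1.5 bits:

import heapq
import math

def huffman_code_lengths(probs):
    # Classic Huffman merge: repeatedly combine the two least probable subtrees.
    # Each heap entry is (probability, tie-breaker, symbols in that subtree).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, i2, syms2 = heapq.heappop(heap)
        for sym in syms1 + syms2:
            lengths[sym] += 1   # merging pushes these symbols one level deeper
        heapq.heappush(heap, (p1 + p2, i2, syms1 + syms2))
    return lengths

probs = {"green": 2/4, "yellow": 1/4, "red": 1/4}
lengths = huffman_code_lengths(probs)
avg_bits = sum(probs[s] * lengths[s] for s in probs)
entropy = sum(p * math.log2(1 / p) for p in probs.values())
print(lengths)             # {'green': 1, 'yellow': 2, 'red': 2}
print(avg_bits, entropy)   # 1.5 1.5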
