## Intuition

Suppose someone tells you that the sun will come up tomorrow. Unless you’re in the depths of depression, this probably isn’t surprising. You already knew the sun will come up tomorrow, so they didn’t really give you much information. Now suppose someone tells you that aliens have just landed on earth and have picked you to be the spokesperson for the human race. This is probably pretty surprising, and they’ve given you a lot of information. You’d be pretty pissed off if nobody told you. The rarer something is, the more you’ve learned if you discover that it happened.

## An Eight-sided Die

Suppose you have an eight-sided die with the sides labelled A B C D E F G H. Suppose you want to record a series of rolls on a computer. How many bits are required to encode each roll? Well, there are 8 possibilities that are all equally likely, and with bits you can encode possibilities, so this requires bits per roll.

Now, suppose someone tells you that they rolled a vowel. How many bits does it take to encode that roll, now that you already know that the roll was a vowel. There are only two possibilities, A and E, so you can store that roll in bit. They’ve just saved you 2 bits by giving you some information.

This is the central thing to understand about information theory. There’s some randomness whose outcome is encoded in bits. Someone tells you something about an outcome, and as a result you can store it in fewer bits. The difference in the number of bits is the amount of information you learned. In this case, knowing that they rolled a vowel saves you two bits, so that’s how much information they gave you.

Now what happens if someone tells you that they didn’t roll a vowel? Now there are 6 possible outcomes, which isn’t a power of 2, so we can’t trivially map it to some number of bits and we have to start doing math.

## Optimal Codes and Entropy

Suppose we’re encoding rolls of a die with many sides, and each possible roll is no longer equally likely. How many bits should we use to store each roll? Intuitively we’d like to make the common cases shorter and the rarer cases longer, so that on average, the messages are shorter. More formally, let be the length of the string of bits we use to encode outcome . We would like to minimize the expected value of , or .

What are the constraints on how short our encodings can be? We certainly can’t have more than two outcomes encoded in only 1 bit, but how can we generalize this? Think of this in terms of the fraction of the “namespace” that each outcome uses. If an outcome is encoded as 1, then you’ve already used half of the namespace. No other outcome can start its encoding with 1. Suppose you’re encoding an outcome as 11. You’re now using half-of-the half-of-the namespace. No other outcome can start with 11, but they are free to start with 0 or 10. Formally, this means that if we encode an outcome using bits, it uses of the namespace. We now have the nice constraint that .

Minimizing under this constraint gives[1] you the optimal encoding length for each as .

This behaves in the way we would expect. The rarer an outcome, the longer its encoding, and if all of the outcomes are equally likely, we give each one an encoding of length just like we’re used to. Now that we know the optimal length of each encoded outcome, what’s the expected encoding length for an event?

We call this quantity the “entropy” of a distribution . Mathematically it is identical to the quantity of the same name from thermodynamics, and you can think of it as a measure of the “spread” or “disorder” of a probability distribution. It has its peak value when all outcomes are equally likely, and its minimum value when there is only one possible outcome. Careful readers will notice that is undefined when . Thankfully though, the limit at 0 is 0. This leads to the interesting result that if there is one outcome that is absolutely certain to happen, you can encode it in 0 bits. There are some things so obvious (certain) that they aren’t worth saying at all!

Back now to our eight-sided die. The entropy of the original die roll is 3, and the entropy of the die roll if we know that the letter rolled is a vowel is 1. What’s the entropy of the die roll if we know the letter rolled is not a vowel?

Now we have something whose unit is “bits” but whose value includes fractions of a bit. What can we do with this? After all, if we’re only storing one roll, we still need 3 bits to store 6 possibilities. The trick is that we can use fewer bits if we are storing more rolls at once. There are 2 wasted possibilities in those 3 bits we used for the first roll, and if you’re clever, you can use those to encode some information about the next roll. If we’re clever enough, and storing enough, 2.58… is the lower bound on the number of bits required per roll that you’ll converge to with an optimal compression scheme.

So if someone tells us that the roll isn’t a vowel, they’ve given us 3 – 2.58 = 0.415 bits of information. Consider that if they ruled out half of the possibilities, they’d have given us 1 bit. Since they ruled out less than half of the possibilities, it makes sense that they gave us less than one bit.

## Information Gain and Conditional Entropy

Suppose now that someone has agreed to tell us whether or not the roll is a vowel, but we don’t know in advance which it will be? What’s the expected value of the information they will give us? The roll will be a vowel 2/8 of the time, and not-a-vowel 6/8 of the time. So, take the expectation of how much information they give us in each case. On average, they will give us 0.8 bits of information about each roll. This quantity we’ve just computed is called the “information gain” of knowing whether the roll is a vowel.

So more formally, if is the random variable indicating our roll, we’ve computed . It takes 3 bits to store each roll if we know nothing in advance. We’ve also computed and ; the entropy of the roll given that we know the roll was a vowel, and the entropy of the roll given that we know the roll wasn’t a vowel, respectively. What we’ve computed above is:

This formula is the form of information gain for any two random variables, just sum it over all values of the variable you are conditioning on (in this case, V is either true or false.) Now, lets take the information gain, distribute the probabilities, and rearrange a little bit.

We have the original entropy of R minus the expected value of the entropy of R given that we know the value of V. This second term we call the “conditional entropy” and it is denoted . You can think of conditional entropy as the expected number of bits that are “left” in R once you know V.

Information gain is an extremely useful quantity in machine learning. It tells you how much value your classifier could possibly extract out of using a given feature by itself, and is commonly used for feature selection. Anytime you need to sort anything, sorting by information gain/KL-divergence or G-test score will almost certainly give you great results.

## Since I’m sure you want to know more . . .

- Information Theory, Inference, and Learning Algorithms is probably the best textbook on the subject and is available as a free pdf from the author’s website. It’s written well enough to be readable for pleasure.
- A Light Discussion and Derivation of Entropy takes a cute approach to deriving entropy from some very basic and fundamental first principles.

Proof — I don’t have a proof for that formulation, but I think an intuitive idea can be had via the idea of measuring surprise. If f(A) is a function that measures surprise at A you can imagine it must depend on p(A). Define it to be positive and motivate additivity and you’ll end up with log(1/p(A)). Take the expected value of f(a) forall a in A and you have entropy. It’s not rigorous, but it gives some intuition to the log term.

Mr. Ryan Moulton — Welcome back on Knol after a long time.Congratulations for creating a very popular trending Knol. 14923 page views per week. Congratulations

proof of optimal definition of entropy — I don’t know if this qualifies as short, but it should be understandable by anyone who can do algebra.Given the constraint sum 2^-l_i <= 1, a natural choice of l_i is log(1/p_i), because sum p_i = 1 and you can just set 2^-l_i = p_i and solve for l_i = log(1/p_i). Then you just need to show it's optimal. From now on, l_i is some unknown set of numbers.Given that sum 2^-l_i <= 1, is there some set of numbers l_i such that (sum p_i l_i) < (sum p_i log(1/p_i))?sum p_i l_i < sum p_i log(1/p_i)0 < sum p_i (log(1/p_i) – l_i) = XLet q_i = 2^-l_i. Notice that sum q_i <= 1 from our constraint and that l_i = log(1/q_i)X = sum p_i (log(1/p_i) – log(1/q_i)) = sum p_i log(q_i/p_i) <= sum p_i (q_i/p_i – 1) because log x <= x – 1 (not hard to prove, easy to see on a graph) = sum q_i – p_i = (sum q_i) – (sum p_i) = (sum q_i) – 1 <= 1 – 1 (consequence of the constraint) = 0So we have 0 < X <= 0, thus X = 0, thus l_i = log(1/p_i). Sorry for the notation.

Thank you for the clear and concise proof!

Thank you for the clear and concise proof!

As it stands this proof is not valid: half a dozen steps from the end you claim that log x <= x – 1. This is true when the base of the log is e (i.e ln x 0.5 = x – 1.The inequality needed for the proof is: log_2(q_i/p_i) = ln(q_i/p_i) / ln 2 <= (q_i/p_i – 1) / ln 2– i.e. you need to carry the factor of 1/ln 2 through the remainder of the proof, until it drops out at the final step, “… = 0”.Another questionable aspect of the proof as presented is that we end up with 0 < X <= 0 which does not actually imply (as stated) that X=0 — it is simply false, i.e. satisfied by absolutely no value of X whatsoever!For those who care, these infelicities can be sidestepped if the proof is reworked, Dijkstra-style, to manipulate predicates in which the inequalities are explicit, starting with the predicate: E[log_2 1/p_i] <= E[log_2 l_i]and showing that this predicate follows from the given constraint: sum (2 ^ – l_i) <= 1 .

I found your entire article well written, but one part (which I’ll admit is probably obvious) I wasn’t 100% convinced and I just wanted to make sure.This is more of a kind of silly clarification question: So this section:”Think of this in terms of the fraction of the “namespace” that each outcome uses. If an outcome is encoded as 1, then you’ve already used half of the namespace. No other outcome can start its encoding with 1. Suppose you’re encoding an outcome as 11. You’re now using half-of-the half-of-the namespace. No other outcome can start with 11, but they are free to start with 0 or 10. “My question/clarification is: When we are reading this encoded string, it’s variable length and we are reading from left to right.So if we encoded something as 1, we know if we have to code 11010101. As soon as we hit a 1 we know that it represents a state. The only time we hit a 1 and it could potentially be something else is if it’s after a 0. So for example (using 11010101) We hit the first 2 1’s, those have to be one state but if we encoded 01 as a potential state then the 1 after the 0 would be another state. So the code is something like: 1,1,01,01,01.

Yep. You have it exactly right. We want the coding to be “prefix-free” so that when you’ve reached a leaf of the encoding tree, you know you’re done and can start from the root of the tree for the next symbol.

Thank you!

Computer Science and Engineering – Knol Books – Catalogue — http://knol.google.com/k/narayana-rao-k-v-s-s/computer-science-and-engineering-knol/2utb2lsm2k7a/3712#Like to have your suggestions and contributions.Hope more Google engineers write on computer science and engineering topics and make available comprehensive knowledge on computer science through Knol.

Pingback: Artificial Intelligence Blog · “Basic Concepts in Information Theory”