Open Mind


February 10, 2009 · 22 Comments

As King of the mythical land of Gilder, you’re invited to all the best royal weddings. You don’t much care for that sort of thing, but you’ve just received an invitation to celebrate the nuptuals of Prince Humperdink and Princess Buttercup in the neighboring (and bitter rival) country of Florin. Not to go, or even to arrive late, would be taken as an affront, and wars have started on less provocation, so in the interest of peace you decide you must attend.

Problem is: because Florin is your bitter rival, neither you nor any of your staff have ever been, and you don’t know where their castle is. Not to worry! The royal cartographers provide you with a map of Florin, divided into grid squares, and inform you that the castle is in grid square 111001. If you can get to the right grid square, you’ll be close enough to see the castle and find it easily.


But what’s that “grid square” stuff? Each digit (or “binary digit,” or “bit”) eliminates half the possibilities. The first bit, a “1,” places you in the upper-right half (an “x” marks an eliminated grid square, open grid squares are still possibilities):


The second bit, again a “1,” puts you in the upper half of what remains:


The next bit, yet again a 1, puts you in the upper-right half of that:


The next bit, a 0, puts you in the lower half of what remains:


Another 0 puts you in the lower-left half of the remainder:


Finally, the final 1 puts the castle in the upper half of the remainder, and we’ve identified the necessary grid square:


Voila! Castle located, respects paid, blessings bestowed, war averted.

The description of the location of the castle, “111001,” is like a 6-letter “word” in a 2-letter “alphabet” where the letters of the alphabet are 0 and 1. Because each letter can take 2 values, and the word is 6 letters long, the total number of possible words is

N = 2^6 = 64.

That’s convenient, because there are 64 grid squares so we have a unique “word” for each one. A 6-letter word is made of 6 binary digits, or “bits,” so the information content of the word is 6 bits. In fact it happens that the information content required to encode any number N of unique possibilities is always

I = \log_2(N).

That’s the number of “bits” required to specify uniquely a number between 1 and N. It so happens that \log_2(64) = 6, so to locate one of 64 grid squares we need 6 bits of information (a 6-letter word where each letter is a bit). Since we get 6 bits from 6 letters, the information per letter, or I.P.L., is 1 bit per letter for this 2-letter “alphabet.”

That’s all well and good, but you’re a little worried about the 111001 stuff because your coachman isn’t too bright. He’s not too stupid, in fact he’s OK with maps and pretty good with compass directions — but you don’t want to take any chances; the peace and prosperity of Gilder are at stake! What if he confuses upper-right and upper-left? You’d hate to take a wrong turn and end up in the fire swamp, so you’d like a simpler system. You consult your most intelligent scholar, who recommends that instead of giving a 6-letter “word” in a 2-letter alphabet of 1’s and 0’s, that you give a 3-letter word in a 4-letter alphabet where the letters are N,E,S,W for “North, East, South, West.” The castle is at location “NEW.” The first letter, “N,” puts it somewhere in the north quadrant:


The next letter, “E,” places it in the east quadrant of what remains:


The final “W” puts it in the west quadrant of the remainder and the castle is again located:


This is much simpler, with only 3 letters corresponding to already-familiar compass directions, so the coachman should have no trouble. Blessings be!

The directions can still be thought of as a “word,” but now it’s a 4-letter alphabet forming a 3-letter word. That means there are 4^3 = 64 possible words, just as many as before, so they convey the same information.

In fact if we have an n-letter alphabet, the number of possible unique words L letters long is

N = n^L.

The information content is still I = \log_2(N), so it’s

I = \log_2(N) = L \log_2(n),

and with that much information in L letters, the information per letter is

I.P.L. = \log_2(n).

Therefore, for an alphabet with n letters we get information amounting to \log_2(n) bits for each letter in a word. Since \log_2(4) = 2, our 4-letter alphabet provides 2 bits per letter. With 3-letter words we get 6 bits, enough information to locate the one grid we need out of 64 possible.

Alas! Alack!!! You were supposed to get the directions “N-E-W” but the scribe who wrote it out is your idiot drunken brother-in-law who doesn’t letter very well. In consequence, you can’t tell the difference between his “E” or his “W” or his “S.” You can tell the “N,” but beyond that it’s just a guess. So the directions you’re actually able to read are “N-?-?” where “?” could be “E” or “S” or “W.” What the heck you gonna do now?

The first letter is still good; you know it’s “N” so the castle is somewhere in this quadrant:


But for the sub-quadrant within that quadrant, all you can tell is that it’s not “N.” You can at least eliminate that, leaving these possibilities:


Finally, you know it’s not in the north sub-sub-quadrant of any of the remaining sub-quadrants, so you can eliminate these:


But that’s not good enough! There are still nine possibilities, so you decide you have no choice but to search them all and hope you get lucky. Which you do, because thank God, while searching you encounter a man in black, a giant, a Spaniard, and the princess herself, who inform you that the wedding is off so your tardiness will not ignite armed conflict between nations. Whew!

When the letters “E,S,W” got blurred into “?” our 4-letter alphabet was morphed into a 2-letter alphabet. But not all the letters give the same amount of information! The letter “N” still gives two bits, because it restricts the search area to one fourth. But what about the letter “?”? And how long does the word have to be to specify a location uniquely?

That depends on the location. If the castle were at “N-N-N” we’d still find it, but with “N-?-?” we can only narrow the search to 9 possibilities and if it were “?-?-?” all we’d know is that it’s not in the N grid or the N sub-grid of any grid or the N sub-sub-grid of any sub-grid, so there’d be 27 possibilities. Unfortunately, “?” is the most likely letter — and also the one which gives the least information.

We can figure out the average information content for a given word which is L letters long. The probability that a given letter will be “N” is 1/4. The probability it’s “?” is 3/4. To restore the lost information, we have to specify a “correction word” or “fix word” which tells us which of the 3 possibilities each “?” should be replaced by — we can do that with a 3-letter “alphabet” consisting of “E,S,W”. So on average we’ll have to specify \frac{3}{4}L letters of a 3-letter alphabet. The number of possible such words is

N_{fix} = 3^{(\frac{3}{4}L)}.

The information content in such a word is

I_{fix} = \log_2(N_{fix}) = \frac{3}{4} L \log_2(3).

This is the information we lost by blurring “E,S,W.” If we take the total information in the correct word, and subtract the information we have to restore with the fix word, we get the information content of the blurred word:

I_{blur} = I - I_{fix} = L \log_2(4) - \frac{3}{4} L \log_2(3).

We can make this even more general. Suppose we start with an n-letter alphabet and make a word L letters long. Its information content is

I = L \log_2(n).

If we take the n letters of our alphabet and split them into the first j and the remaining k letters, then of course j + k = n. If the first j get blurred together, and the remaining k letters are merged into a different “blur” (so we can tell the difference between blur 1 and blur 2), to restore the lost information we have to give two “fix words.” One of the fix words comes from a j-letter alphabet and specifies which of the j letters should replace each occurrence of blur 1, the other tells which of the k letters should replace each occurrence of blur 2. The probability a blur will be blur 1 is j/n, the probability for blur 2 is k/n, so for a word L letters long we’ll need, on average, for fixword1 to be Lj/n letters long and fixword2 to be Lk/n letters. The information content of fixword1 is therefore

I_1 = (Lj/n) \log_2(j),

while the information content of fixword2 is

I_2 = (Lk/n) \log_2(k).

The information content of the blurred word is the total information content, minus the lost information which is restored by the two fixwords, so

I_{blur} = I - I_1 - I_2.

This is

I_{blur} = L \log_2(n) - (Lj/n) \log_2(j) - (Lk/n) \log_2(k).

Now n = j + k so

L = (Lj/n) + (Lk/n).

We now have

I_{blur} = (Lj/n) [\log_2(n) - \log_2(j)] + (Lk/n) [\log_2(n) - \log_2(k)].

But of course

\log_2(n) - \log_2(j) = \log_2(n/j) = - \log_2(j/n),

and likewise for k. We finally get

I_{blur} = - (Lj/n) \log_2(j/n) - (Lk/n) \log_2(k/n).

Then the information content per letter is just I_{blur}/L, or

I.P.L. = - (j/n) \log_2(j/n) - (k/n) \log_2(k/n).

Now: the blurred word comes from a 2-letter alphabet consisting of “blur 1″ and “blur 2,” in which the letters do not occur with equal probability. The probability of blur 1 is j/n = p_1, the probability of blur 2 is k/n = p_2, so the I.P.L. of this alphabet is

I.P.L. = -p_1 \log_2(p_1) - p_2 \log_2(p_2).

There’s a another name for the “I.P.L.” — it’s called the entropy of the alphabet.

If the alphabet has many letters which occur with probabilities p_1,p_2,..., then the total entropy is

Entropy = S = \sum [-p_j \log_2(p_j)].

That is the definition of the information-theoretic entropy, or the Shannon entropy.

There’s another kind of entropy which is used in thermodynamics. They’re strongly related, but the thermodynamic entropy uses natural logarithms (to the base e rather than base-2 logarithms, so

S = \sum_{states} [ -p \ln(p)].

In case you’re wondering why I’ve bothered with all this, I just thought it was a nice illustration of the derivation (and to some degree, the meaning) of entropy … a particularly tricky concept … so I thought I’d share it.

But now back to climate science.

Categories: mathematics

22 responses so far ↓

  • if0x // February 10, 2009 at 2:15 pm

    Clearly, you have a dizzying intellect…

    (great film, illuminating post!)

  • Mistletoe // February 10, 2009 at 2:37 pm

    “Let me ’splain. No.. .there is too much. Let me sum up.” :P

  • Soil Creep // February 10, 2009 at 3:11 pm

    I’m left wondering how the I.P.L. differs in true alphabets and the different treatments of consonants and vowels (e.g. greek vs. hindi)? So do more vowels increase the entropy of a given language?

    Xs thxs x mxxnxngfxl qxxstxxn? ((Ix xxix a xeaxixxxux xuexxiox? (Is this a meaningful question?) यह एक सार्थक सवाल है) Είναι αυτό ένα σημαντικό ζήτημα)

    [Response: From what I can tell, it's an enlightening approach to real alphabets ... but I don't know beans about the results.]

  • sod // February 10, 2009 at 4:17 pm

    very nice. it is a pleasure to read you blog, Tamino. thanks.

  • Soil Creep // February 10, 2009 at 5:54 pm

    I did a quick google search and apparently the Claude Shannon behind the entropy calculation was interested in its application to languages and English:

    I found this neat little JAVA application from the Mathematics Department at UCSD that simulates “Shannon’s Experiment to Calculate the Entropy of English” (fyi - it took a minute or so to load on my browser)

  • BBP // February 10, 2009 at 7:35 pm

    But is physical entropy really a measure of information? If it is, then as entropy increase, so must information. In a fully deterministic system you cannot increase the amount of information - so that would imply the universe is not deterministic. It could still be mechanistic, but not deterministic.

    [Response: I doubt the analogy goes so far as to make thermodynamic entropy equivalent to information. But I guess that depends on one's definition of information.]

  • Dana // February 10, 2009 at 7:38 pm

    That explanation was inconceivable!

    [Response: You keep using that word. I do not think it means what you think it means.]

  • Soil Creep // February 10, 2009 at 8:37 pm

    BBP: I think wikipedia is helpful in this instance - explaining of entropy in thermodynamics to entropy in information theory:

  • TCOis banned...why? // February 11, 2009 at 1:07 am

    Entropy is disorder. That’s the opposite of info. No?

    [Response: In thermodynamics it's disorder; in information theory it's information per unit.]

  • Ray Ladbury // February 11, 2009 at 2:09 am

    Entropy is a precisely defined term in physics and in information theory–and sometimes the two definitions overlap. There’s a story about Shannon explaining his ideas to von Neumann, after which von Neumann said: “I think you should call your quantity entropy, because it is equivalent to that quantity in physics. But more important, if you call it entorpy no one will understand what you are talking about and they will ask fewer questions.

    Actually, Tamino, this is an interesting coincidence as I just started a book by B. Roy Frieden that is looking at Fisher Information as a way of deriving Lagrangians for any physical system. Pretty interesting. It’s nice to have some time to do some reading.

    [Response: What a great story! I'm gonna use that one.]

  • luminous beauty // February 11, 2009 at 4:18 am


    veny furry!

  • dhogaza // February 11, 2009 at 4:42 am

    Here’s the actual quote:

    My greatest concern was what to call it. I thought of calling it “information”, but the word was overly used, so I decided to call it “uncertainty”. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, “You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage.”

    As related by Shannon himself.

  • Gator // February 11, 2009 at 5:23 am

    Physical entropy is a measure of the number of states available to the system. In this way it is similar to the potential information content of an alphabet or other encoding system where you are looking at possible message states that could be encoded.

  • Ray Ladbury // February 11, 2009 at 10:22 am

    OK, if you’re going to use it, the actual quote is:

    “ My greatest concern was what to call it. I thought of calling it ‘information’, but the word was overly used, so I decided to call it ‘uncertainty’. When I discussed it with John von Neumann, he had a better idea. Von Neumann told me, ‘You should call it entropy, for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. In the second place, and more important, nobody knows what entropy really is, so in a debate you will always have the advantage. ”

    –Conversation between Claude Shannon and John von Neumann regarding what name to give to the “measure of uncertainty” or attenuation in phone-line signals (1949)


    Note typos corrected. It was past my bedtime.

  • Gavin's Pussycat // February 11, 2009 at 10:27 am

    Ray, yes that one attracted my attention too. There is a book review at

  • Ray Ladbury // February 11, 2009 at 4:21 pm

    GP, So far (I’m still at the beginning) it looks like an interesting approach, and it’s pretty readable. The whole question of “Mommy, where do Lagrangians come from?” has always seemed a little vexing to me, and this seems to provide a degree of context (albeit, not a cookbook). My inner nerd has been looking forward to the opportunity to read this one.

  • pough // February 11, 2009 at 7:34 pm

    TCO, another way to look at the entropy/disorder/information thing is to think of file compression. What looks to you and I like “information” will compress down to a smaller size than random bits and bytes. In one definition of information, disorder is king.

    Tamino, this post is pushing you perilously close to the IDC* debate, which hinges on (what seems to me to be) fluctuating definitions of information. With a blog title like “Open Mind” you’d drive the cdesign proponensists into a fury if you start in on that topic.

    * Intelligent Design Creationism

  • TCOis banned...why? // February 11, 2009 at 9:03 pm

    I donno. Think you guys are confusing concepts from stat mech, but I’m not enough of a jock to weed through the combinations of ideas and words being used for different things. I don’t even know my stat mech that well. Boltzmann was the man though…

  • Dano // February 11, 2009 at 10:30 pm

    Quite the service. Keep it up sir.



  • steven mosher // February 14, 2009 at 4:23 pm

    Thanks for doing this post Tamino. For those interested JR Pierce wrote some fascinating books on Information Theory, Entropy (information theory Entropy) and the arts. Here’s a little bio on this artist/scientist

    Start with this book if you can find it, and no I won’t loan you my copy
    An Introduction to Information Theory: Symbols, Signals, and Noise;

  • r // February 19, 2009 at 9:45 am

    Well, the thing i could’t figure out is: why the first digit “1″ excludes that (part) –> corner and not any other ?
    like it coul mean the symetric one , right ?
    I could’t follow your idea, :(
    If you coul explain more. Because i sse like each square is (x,y)


    111001 (111,001) , it depends on the reference you assume, or it’s me : )

  • Fernando // March 2, 2009 at 5:12 pm

    Correct for a closed system.
    Without meaning to an open system .(?????)

Leave a Comment