Open Mind

Reverend Bayes

July 10, 2008 · 73 Comments

Suppose a new commemorative coin is struck, specifically to be used in the coin flip at the Super Bowl. Winning the coin flip is only a slight advantage, but it’s a real one, so the choice should be truly random, i.e., it should be equally likely to turn up heads as tails. You’re statistical assistant to the NFL commissioner, and he suspects that the design of the coin makes it not fair. Since you’re not just a rabid fan of one of the teams, but working for the commissioner, you have access to the coin and are able to perform any tests you can think of (as long as they don’t actually alter the coin itself). How would you test the fairness of the coin?

For many, the answer is straightforward. We form a null hypothesis: the coin is fair. Then we flip it many times, a lot of times. The NFL has plenty of money, so you can easily afford to pay flunkies to do this; eventually they’ve flipped the coin 10,000 times. For each flip, the outcome (either heads or tails) is recorded. Then you study the data to see whether or not you can contradict the null hypothesis.

What would you expect to happen if the coin is fair? We already know how to compute this; if the coin is fair then the probability of getting “heads” is \alpha=0.5, just as the probability of getting tails is 1-\alpha=0.5. So the expected value of the number of heads is the number of flips (which we’ll call N) multiplied by the probability of heads, or

\langle H \rangle = N\alpha = 10000 \times 0.5 = 5000.

In this equation, the angle brackets surrounding H indicate the “expected value” of H. But we know that the result should be random, so we shouldn’t expect to see exactly 5000 heads out of 10000 flips. We might, but it’s not very likely. The observed number H should be near to 5000, but we don’t expect it to be exactly so. The question is, how near?

The probability of observing H heads out of N flips if the probability for a single flip is \alpha, can be computed from the binomial distribution:

P(H) = {N! \over H! (N-H)!} \alpha^H (1-\alpha)^{N-H}.

But we don’t have to get quite so complicated. With a lot of flips, the binomial distribution gets indistinguishably close to the normal distribution. We compute the mean, or expected value,

\mu = N\alpha.

We also compute the standard deviation

\sigma = \sqrt{N\alpha(1-\alpha)}.

Then the probability will be approximately

P = e^{-(H-\mu)^2/2\sigma^2} / \sqrt{2\pi \sigma^2}.

For N=10000 and \alpha=0.5 (our null hypothesis), we have \mu = 5000 and \sigma = 50. The probability distribution of H looks like this:

We can see that the probability of getting exactly 5000 heads, even if the coin is fair, is only 0.008. That’s not very big! So even if the coin is fair, we shouldn’t be surprised if we don’t get exactly 5000 heads and 5000 tails.
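As a rough check on that number, here is a minimal Python sketch that compares the binomial probability with its normal approximation for N=10000 and \alpha=0.5. The function names are illustrative (they don't come from any library), and the binomial is evaluated through log-gamma functions so the huge factorials never have to be formed.

```python
import math

def binomial_pmf(h, n, alpha):
    """Binomial probability of h heads in n flips, evaluated via logs."""
    log_p = (math.lgamma(n + 1) - math.lgamma(h + 1) - math.lgamma(n - h + 1)
             + h * math.log(alpha) + (n - h) * math.log(1 - alpha))
    return math.exp(log_p)

def normal_approx(h, n, alpha):
    """Normal density with the binomial's mean and standard deviation."""
    mu = n * alpha
    var = n * alpha * (1 - alpha)
    return math.exp(-(h - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

N, ALPHA = 10_000, 0.5
for h in (5000, 4950, 4900):
    print(h, binomial_pmf(h, N, ALPHA), normal_approx(h, N, ALPHA))
```

With this many flips the two columns should agree to several decimal places, which is why the simpler normal formula is good enough here.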

Suppose we actually observe 4940 heads. Is this out of the ordinary? The probability of getting 4940 heads is only 0.00388, but we already knew it’d be small for a specific number. If we look again at the probability distribution, we can see that there’s a roughly 95% chance the result will be within two “standard deviations” of the expected value (as indicated by the shaded regions):

There’s also a 2.5% chance of being two standard deviations or more above the mean, and a further 2.5% chance of being that far (or more) below the mean. Hence the chance that the actual result will be more than 2 standard deviations away from the expected value (in either direction), if the null hypothesis is true, is only 0.05, or 5%.

The observed value (4940) is 1.2 standard deviations away from the mean. The chance of being that far, or farther, from the expected value if the null hypothesis is true is 23%. That’s certainly not an unbelievably rare result, so we conclude that the available data provide no significant evidence that the coin is anything but fair. Let the game begin!
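The whole frequentist test fits in a few lines. The sketch below assumes the normal approximation described above; the function name `two_sided_p_value` is just a label chosen for this example.

```python
import math

def two_sided_p_value(heads, flips, alpha=0.5):
    """Chance of landing at least this far from the mean if the null
    hypothesis is true, using the normal approximation to the binomial."""
    mu = flips * alpha
    sigma = math.sqrt(flips * alpha * (1 - alpha))
    z = abs(heads - mu) / sigma
    # For a standard normal Z, P(|Z| >= z) equals erfc(z / sqrt(2)).
    return z, math.erfc(z / math.sqrt(2))

z, p = two_sided_p_value(4940, 10_000)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```

For 4940 heads this gives a distance of 1.2 standard deviations and a two-sided probability of about 23%, matching the figures above.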

The foregoing is a classic example of the application of frequentist statistics. We posit a hypothesis from which we’re able to calculate all the probabilities, then we see how “unlikely” the observed outcome is given that assumption, to determine whether or not it’s unlikely enough to consider that we have evidence to contradict the null hypothesis.

Now let’s take a look at a different approach.

Bayesian statistics

Thomas Bayes was a British mathematician and Presbyterian minister, known for having formulated a specific case of the theorem that bears his name: Bayes’ theorem. To illustrate how it works, let’s suppose…

Suppose you’re a physician administering AIDS tests. Suppose further that the test is 99% accurate, whether you have AIDS or not. In other words, if you have AIDS there’s a 99% chance the test will be positive, if you don’t have AIDS there’s a 99% chance the test will be negative. A given patient’s test comes back positive. Does this patient have AIDS? What’s the probability?

You might be tempted to think it’s 99%. But it isn’t! Suppose that 1% of the population actually does have AIDS (note: all the numbers associated with AIDS testing and its frequency in the population are made up, this is just a hypothetical example). Suppose you tested 10,000 people, and 100 of them (1%) actually do have AIDS. Of the 9,900 people who don’t have AIDS, we “expect” (probabilistically) that 9,801 of them (99%) will have a negative result on the test while 99 of them (1%) will have a positive result. The 99 positive results for people who don’t have the disease are false positives. Of the 100 people who do have AIDS, we expect 99 of them (99%) to get a positive test result while 1 of them (1%) will get a negative result. That 1 negative result is a false negative. So we expect to get 99 false and 99 real positives, for 198 positive test results. But only half of those patients actually have the disease! In fact, the actual probability (in this case) that a given patient with a positive AIDS test result actually does have the disease, is only 0.5 (50%).

This analysis depends on knowing ahead of time the probability of AIDS for a patient who has not been tested. This is called the prior probability. If the prior probability is only 0.01, then a positive result on a 99%-accurate test means the chance the patient has the disease is only one-half.

Suppose the prior probability was 0.1 (10%). Suppose further that of the 10,000 people tested, 1,000 have AIDS and 9,000 don’t. Then for the 9,000 non-infected patients, we expect 8,910 negative test results and 90 (false) positives. Of the 1,000 with the disease, we expect 990 positives and 10 (false) negatives. So, we expect 90 + 990 = 1080 positive test results. Only 990 of those actually do have the disease, so the probability that a patient with a positive AIDS test has the disease is 990/1080 = 0.917 (91.7%). That’s much higher! But it’s still not as high as the accuracy of the test, 99%. We see that the prior probability can have a profound impact on the actual probability that a positive test result is true.
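The head-counting in these two examples can be sketched in Python as follows. The function `expected_counts` is a made-up helper for illustration, and it assumes (as the examples do) that the test is equally accurate for infected and non-infected patients.

```python
def expected_counts(n_people, prior, accuracy):
    """Expected true and false positives when the test is equally accurate
    for infected and non-infected patients."""
    infected = n_people * prior
    healthy = n_people - infected
    true_pos = infected * accuracy
    false_pos = healthy * (1 - accuracy)
    return true_pos, false_pos

for prior in (0.01, 0.10):
    tp, fp = expected_counts(10_000, prior, 0.99)
    print(f"prior {prior:.0%}: {tp:.0f} true + {fp:.0f} false positives, "
          f"P(infected | positive) = {tp / (tp + fp):.3f}")
```

Only the prior changes between the two runs, yet the chance that a positive result is genuine moves from one-half to about 0.917.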

For the AIDS test, let n_+ be the prior probability, i.e., the chance a given patient has AIDS if we have no other information. Then the chance a patient doesn’t have it is n_- = 1 - n_+. Suppose the chance of a positive test result for a truly infected patient is p_+ (so the chance of a negative result is 1-p_+), while the chance of a positive test result for a patient who is not infected (a false positive) is p_-. In the previous example we assumed the test was equally accurate for infected and non-infected patients so that p_- = 1 - p_+, but this is actually rarely the case. What would we expect from a large number, say N, of tests?

We expect that the actual number of infected patients will be

N_+ = N n_+,

while those who are not infected will number

N_- = N n_-.

For the Nn_+ infected patients, we expect positive test results in N n_+ p_+ cases — these are the true positives. For the Nn_- non-infected patients, we expect positive test results in N n_- p_- cases — these are the false positives. So we expect a total number of positive test results

N n_+ p_+ + N n_- p_- = N(n_+p_+ + n_-p_-).

But only N n_+ p_+ of these patients actually have the disease. So the probability of infection given a positive test result, with prior probability n_+, is

P = {N n_+ p_+ \over N(n_+ p_+ + n_- p_-)} = {n_+ p_+ \over n_+ p_+ + n_- p_-}.

This is one way to state Bayes’ Theorem. It’s more usually written

P(A|B) = P(A) P(B|A) / P(B).

In this equation, P(A|B) is the probability that “A” is true given that “B” is true. P(A) is the prior probability that “A” is true. P(B|A) is the probability that “B” is true, given that “A” is true. Finally, P(B) is the probability that “B” is true for the entire population. P(A) corresponds to n_+, P(B|A) corresponds to p_+, and P(B) corresponds to n_+p_+ + n_-p_-. It’s just different notation — they’re the same equation.
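Written as code, the formula no longer needs the number of patients N at all. This is a minimal sketch with hypothetical parameter names; the third call uses a made-up false-positive rate of 5% simply to show the case where p_- is not equal to 1 - p_+.

```python
def posterior(prior, p_pos_given_infected, p_pos_given_healthy):
    """P(A|B) = P(A) P(B|A) / P(B), with A = infected and B = positive test."""
    # P(B): overall chance of a positive result, n_+ p_+ + n_- p_-.
    p_b = prior * p_pos_given_infected + (1 - prior) * p_pos_given_healthy
    return prior * p_pos_given_infected / p_b

print(posterior(0.01, 0.99, 0.01))  # the 1% prior example
print(posterior(0.10, 0.99, 0.01))  # the 10% prior example
print(posterior(0.01, 0.99, 0.05))  # made-up test with more false positives
```

The first two calls reproduce the 50% and 91.7% figures from the counting argument.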

What we really get out of this analysis is P(A|B), i.e., the chance that “A” is true given that we just found out “B” is true. It requires knowledge of the chance that “A” is true without that information — the prior probability. Bayes’ theorem enables us to update that using the new information that “B” has happened. This gives us a posterior probability, so Bayes’ theorem has enabled us to improve our “belief” (probability estimate) with the new information. In general, Bayesian analysis enables one to estimate the impact of new information on the probability that a given hypothesis is true or false. In fact, it’s one of the most powerful features of this analysis.

In this case, the results are discrete: each patient either has AIDS or doesn’t, and each test is either positive or negative. But Bayes’ theorem can be adapted to the case of continuous results as well. In that case, we won’t be computing probabilities, we’ll work with probability distributions. We’ll start with a prior distribution representing what we believe is true before we get the new information, and end up with a posterior distribution representing our estimate of the likelihood of various circumstances when the new information is taken into account.

Measuring probability, a case for both Bayesian and frequentist statistics

Suppose you want to know the probability \alpha that if you receive an email which is not spam, it will nonetheless contain the word “viagra.” It’s unlikely! But it’s possible. So you go into your email (you’ve saved them all since the birth of email) and look at all the non-spam messages. Out of N non-spam emails you’ve received, k of them have the word. What’s the probability?

Suppose further that your two best friends are both statisticians (now you know this is a made-up example, right?). One of them is a rabid frequentist, the other a maniacal Bayesian. You start with your frequentist friend, tell him that out of N non-spam emails you’ve found k with the given word (but you don’t tell him what the actual numbers are), and ask how to estimate the probability. “That’s simple!” he replies. The unbiased estimate is

\alpha = k/N.

This is pretty straightforward, in fact it’s in all the textbooks.

But your frequentist friend is indeed rabid; wanting a second opinion, you seek out your Bayesian friend. He tells you “That’s bullshit! Dr. Frequentist is selling you down the river because he’s stuck in the old way of doing things, before we knew better! The estimate you actually want to use is”

\alpha = (k+1) / (N+2).

Now you’re really confused! As a result, you make a big mistake: you forego your usual custom of keeping them widely separated, and invite them both to lunch so you can all discuss it! You were hoping that they’d explain their reasons for preferring one formulation over another, so maybe you’d have a clue which to prefer in the given circumstances. In actual fact, the discussion turns to insults, name-calling, and ends in a fistfight resulting in Dr. Frequentist and Dr. Bayesian being arrested. The restaurant owner tells you never to return.

Which is correct? Both of them are! Both their formulae are indeed estimates of the probability \alpha which you seek. Both estimates have advantages and disadvantages, and depending on circumstances, you might want to use one or the other.

Isn’t \alpha = k/N as simple as it gets? It’s an unbiased estimate, which means that what you expect to get from the ratio k/N is the true probability \alpha. On the other hand, Dr. Bayesian’s formula gives a biased estimate, because its expected value is not the true value. Or so says your friend Dr. Frequentist in the visiting room at the county jail (he also says that Dr. Bayesian is a lying sack of **** who wouldn’t know an unbiased estimate if it flew up and bit him in his big fat dumb ass). You tell Dr. Frequentist that you can’t represent him at the trial (you’re a lawyer) because your friendship with Dr. Bayesian is a conflict of interest. Thinking about it, you realize that Dr. Frequentist is right, his estimate is unbiased, while Dr. Bayesian’s estimate can be expected to give the wrong answer (although only slightly so). He must be right.
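Dr. Frequentist's claim about bias can be checked by averaging each estimate over every possible outcome k, weighted by its binomial probability. The sketch below does this for small, purely illustrative values of N and \alpha (they are not taken from the example); the function names are hypothetical.

```python
import math

def expected_estimate(estimator, n, alpha):
    """Average of estimator(k, n) over the binomial distribution of k."""
    total = 0.0
    for k in range(n + 1):
        prob_k = math.comb(n, k) * alpha**k * (1 - alpha) ** (n - k)
        total += prob_k * estimator(k, n)
    return total

def frequentist(k, n):
    return k / n

def bayesian(k, n):
    return (k + 1) / (n + 2)

n, alpha = 20, 0.1  # illustrative values only
print(expected_estimate(frequentist, n, alpha))  # close to alpha
print(expected_estimate(bayesian, n, alpha))     # close to (n*alpha + 1)/(n + 2)
```

The first average comes out at \alpha itself, while the second comes out at (N\alpha + 1)/(N + 2), which differs from \alpha unless \alpha = 0.5. The gap shrinks as N grows.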

You visit Dr. Bayesian as well (which is easy because he’s in the same jail, although the sheriff has vowed never again to put two mathematicians in the same cell) to tell him that you can’t represent him, either. He replies “Screw the court case! What are the numbers you observed?” The wild look in his eyes suggests that you should no longer withhold this information, so you let slip that out of 10,000 non-spam emails, none of them had that word. “Aha!” he exclaims. “I thought so! Dr. I’m-nothing-but-an-idiot-stuck-in-bygone-centuries wants you to believe that the probability is \alpha= zero? Do you actually believe that? Is there no chance at all of this happening? Don’t you think there’s at least a one-in-a-billion-trillion chance of it happening?” Thinking about it, you realize that he must be right. It’s just not plausible that there’s no chance at all; it’s unlikely but not impossible.

In general, for this problem I’d say the frequentist approach is preferable; if you observe k out of N, then the best-estimate probability is \alpha=k/N. But if the observed number k is zero then we have some additional information, namely, that if k/N isn’t the exact correct answer, then we must be wrong on the low side. That’s because we already know, ahead of time, that the probability can’t be negative, so the zero cases we observed cannot possibly be too many. In fact we know that the probability (like all probabilities) must be between 0 and 1.

Dr. Bayesian used this information in computing his estimate. He started with no assumptions other than what is known for certain, that \alpha is between 0 and 1. He assumed that all such values are equally likely, so the prior probability is P(\alpha) = 1 for \alpha between 0 and 1, but P(\alpha) is zero for other values. This is one example of a uniform prior. We also know the chance of observing k out of N, if the probability is \alpha, is given by the binomial distribution (one of the few things Drs. Bayesian and Frequentist agree upon). In other words,

P(k|\alpha) = {N! \over k! (N-k)!} \alpha^k (1-\alpha)^{N-k}.

We can compute the probability of getting k out of N for all such cases by summing (actually, integrating) the probability for all possible \alpha values

P(k) = \int_0^1 P(k|\alpha) ~d\alpha = \int_0^1 {N! \over k! (N-k)!} \alpha^k (1-\alpha)^{N-k} ~d\alpha.

We can now apply Bayes’ theorem to get

P(\alpha|k) = [{N! \over k! (N-k)!} \alpha^k (1-\alpha)^{N-k}] / \int_0^1 {N! \over k! (N-k)!} \alpha^k (1-\alpha)^{N-k} ~d\alpha.

This gives the posterior probability distribution for \alpha. Then we can compute the expected value of \alpha from its posterior probability distribution (given of course that we’ve observed k), to get

\langle \alpha \rangle = \int_0^1 \alpha P(\alpha|k) ~d\alpha = (k+1) / (N+2).

The final result (what’s after the last “=” sign) takes a good bit of manipulation to get and is by no means “obvious” from what precedes it, but it’s exact.
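The manipulation rests on the standard identity \int_0^1 \alpha^a (1-\alpha)^b ~d\alpha = a! b! / (a+b+1)!. Applying it to the numerator (a=k+1, b=N-k) and to the denominator (a=k, b=N-k), the binomial coefficients and (N-k)! cancel, leaving (k+1)!/k! \times (N+1)!/(N+2)! = (k+1)/(N+2). For those who would rather see it confirmed than derived, here is a small Python sketch that evaluates both integrals exactly with rational arithmetic. The function names are illustrative and the (k, N) pairs are arbitrary small test cases.

```python
from fractions import Fraction
from math import comb

def integral(a, b):
    """Exact integral over [0, 1] of x**a * (1 - x)**b, obtained by
    expanding (1 - x)**b with the binomial theorem and integrating termwise."""
    return sum(Fraction((-1) ** j * comb(b, j), a + j + 1) for j in range(b + 1))

def posterior_mean(k, n):
    """Mean of alpha under a uniform prior after observing k out of n."""
    # The factor N!/(k!(N-k)!) cancels between numerator and denominator.
    return integral(k + 1, n - k) / integral(k, n - k)

for k, n in [(0, 10), (3, 12), (7, 30)]:
    assert posterior_mean(k, n) == Fraction(k + 1, n + 2)
    print(k, n, posterior_mean(k, n))
```

Because the arithmetic is done with fractions, the agreement is exact rather than approximate.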

We can see two advantages to the Bayesian approach for this problem. One is that it enables us to include already-known information in our prior (namely, that \alpha has to be between 0 and 1). Another is that it doesn’t just give us an estimate of \alpha, it actually gives us a probability distribution for \alpha (the posterior probability). We also see an advantage to the frequentist approach: the bias of the Bayesian estimate is a very real problem. If you observe 0 out of 10,000, and adopt the Bayesian estimate \alpha = (k+1)/(N+2) = 0.00009998, then use that estimate to predict how many such events will occur in a million cases, you’d estimate 99.98 (about 100) cases. But the estimate given by the frequentist approach (zero cases) is likely (all other things being equal) to be closer to the truth. Another disadvantage is that the prior probability (which is often based on expertise and insight, but is still an assumption) can strongly affect the outcome, as for example with the AIDS tests, where changing the prior probability from 1% to 10% changed the posterior probability from 50% to 91.7%. In fact, there’s a great deal of debate in many cases over what’s the appropriate choice of prior distribution, with a particular result sometimes blamed on the choice of prior rather than the evidence at hand.

It’s well to keep in mind that both estimates are only estimates! Neither is gospel, and which estimate we choose to adopt can depend on the circumstances, and on what we intend to do with the estimate.

This example is very relevant for a subject I work on. It requires estimating the probability \alpha for binomial events, but we want conservative estimates which avoid the extremes 0 and 1. One also needs to compute the logarithm of that probability \ln(\alpha), in which case \alpha=0 gives negative infinity — an unacceptable result! By using the Bayesian estimate, we get a conservative (in the sense of not being extreme either high or low) estimate whose logarithm is necessarily finite.
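To make the trade-off concrete, the sketch below evaluates both estimates for the case of 0 out of 10,000, along with the predicted count per million and the logarithm. The helper `safe_log` is an assumption of this example: Python's `math.log(0)` raises an error, so the helper returns negative infinity instead to mirror the mathematical statement.

```python
import math

def frequentist_estimate(k, n):
    return k / n

def bayesian_estimate(k, n):
    return (k + 1) / (n + 2)

def safe_log(p):
    """Natural log, returning -inf for p == 0 (math.log(0) raises ValueError)."""
    return math.log(p) if p > 0 else float("-inf")

k, n = 0, 10_000
for name, est in (("frequentist", frequentist_estimate(k, n)),
                  ("Bayesian", bayesian_estimate(k, n))):
    print(f"{name}: alpha = {est:.8f}, per million = {est * 1e6:.2f}, "
          f"ln(alpha) = {safe_log(est):.3f}")
```

The frequentist line shows a logarithm of negative infinity, while the Bayesian line stays finite at the cost of predicting roughly 100 events per million.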

Some regard the frequentist approach as the right approach, while Bayesian methods only apply in specific cases (like the AIDS testing example). Some regard the Bayesian approach as the right approach, while frequentist methods apply only in specific cases. There has actually been some animosity between opposing viewpoints (but I’ve never witnessed any arrests for assault). My opinion? I regard attachment to either choice as ideological. Both approaches are tools to understanding, and the wise analyst selects the tool which is best for a given job. Discarding either approach would diminish our ability to understand the meaning of data.

But this much is certain: the Reverend Bayes has given us a lot to think about, and a new way to think about it.

Categories: mathematics

73 responses so far ↓

  • David B. Benson // July 10, 2008 at 9:34 pm

    As a good Bayesian, I prefer

    for determining which of two hypotheses has the weight of the evidence. Often this is the hypothesis with more ‘free’ parameters to adjust to best fit the evidence. While not meeting the

    as well as the simpler hypothesis, I then need to argue the physics, which for a particular problem is not known in sufficient detail.

    Anyway, nicely done again!

  • Ray Ladbury // July 11, 2008 at 1:15 am

    As a statistical agnostic, I too would like to thank you. I am currently confronting the pull of Bayesian methods in my day job. Long story short, for a particular threat to a satellite, most of the analyses rely heavily on expert opinion, but either take it as gospel or allow expert opinion to determine the model to be used and then gin up a quasi-frequentist approach. I am starting to think that at least a Bayesian methodology would get all the assumptions out into the open.

    One nit: Wouldn’t Dr. Frequentist simply bound the rate for a given confidence level using binomial statistics? If so, does the Bayesian approach really yield anything not obtained from the frequentist approach?

    [Response: Yes, Dr. Frequentist would have more to say on the subject of confidence intervals (unless he was too overcome with emotion to be thinking clearly). But this is pedagogy rather than rigor.

    The Bayesian approach does give an estimate which avoids the extreme values 0 and 1, and whose logarithm is necessarily finite. Not much of an advantage except in special circumstances.]

  • Ray Ladbury // July 12, 2008 at 2:10 am

    One of the things I am confronting is similar to the situation confronting us with climate sensitivity–the risk is dominated by the tails of the probability distribution, and that is the part of the distribution we least understand. Moreover, since we invariably rely on small sample sizes, we don’t really seek to determine the distribution of failures so much as bound it. So you can see how “expert opinion” could come to play a large and even outsized role.

  • S2 // July 12, 2008 at 7:59 am

    Thanks for the lucid explanation.

    One small typo, I think:

    Suppose further that of the 10,000 people tested, 9,000 have AIDS and 1,000 don’t.

  • NU // July 12, 2008 at 12:22 pm


    As you’ve noted, when the outcome strongly depends on the prior, it usually means that the data are not sufficient for inference, and all you’re left with is prior information.

    One approach under deep prior uncertainty which matters is to work backwards, and map out what prior assumptions are required to justify a particular decision. Then at least you put the argument squarely on what quantitative level of belief you need to have in a proposition if you want to support a given policy.

  • Ray Ladbury // July 12, 2008 at 6:03 pm

    NU, In effect isn’t this just relying on the expert –but telling him how strongly he or she has to believe in order to make tinkerbell come alive again?
    I’ve been sort of looking along those directions actually. Look at available data and then find additional data that favor the proposition I’m trying to demonstrate (in this case regarding reliability of microelectronics). Can’t say I feel comfortable with it just yet.

  • David B. Benson // July 12, 2008 at 7:11 pm

    There are variations of the Delphi technique in which several ‘experts’ are independently consulted and their opinions merged.

  • Ray Ladbury // July 12, 2008 at 8:44 pm

    Hi David,
    Yes, I did something like this to get a warm fuzzy on a detector where we had rather imperfect qualification data and a rather different application. The results came out pretty well, although ultimately I wound up not using the technique when the experts got cold feet about whether there could be other threats that previous testing did not look at. A pity, as the results turned out pretty well.

  • James Annan // July 12, 2008 at 10:32 pm


    One thing you seem to have missed out is that the very word “probability” has a different MEANING to a frequentist and a Bayesian.

    [Response: Indeed! When writing for the lay reader one has to make choices...]

  • Ray Ladbury // July 13, 2008 at 1:56 am

    Of course Dr. Bayesian could just ask Dr. Frequentist to define “random” and watch him sputter…

  • steven mosher // July 13, 2008 at 3:11 am

    Thanks Tamino your exposition was both lucid and lively.

  • James Annan // July 13, 2008 at 6:00 am

    But since the term “probability” has different meanings to the different people, the idea that they can provide different answers to the same problem is nonsensical. They are not talking about the same problem!

  • Gavin's Pussycat // July 13, 2008 at 4:01 pm

    In addition to the typo noted by S2, the following seems wrong too:

    > Dr. Frequentist used this information in computing his estimate.

    Surely you mean Dr. Bayesian here.

    Beautifully written BTW.

    One thing that remains nagging me (and relates to what Dr. Annan has studied) is: the ignorance of the prior. While it is clear that a uniform prior on [0,1] is ignorant of the corpus of email messages that constitutes the observations in this case, it also seems to me to be ignorant to some unrelated considerations of common sense. Like, that the occurrence of a very specific word like ‘viagra’ in a general natural language text would appear to be highly unlikely, unless the text is very large — which emails are not (is this admissible prior knowledge?).

    This begs the question, is the use of ignorant but absurd priors wrong, or just dumb? Note that this prior implies a 50% probability of half or more of all legitimate emails containing the word ‘viagra’…

    [Response: You're right about the typo. Both Drs. F and B consider this a serious insult and are now angry with me ... Can't we all just get along?

    The choice of prior must avoid using information from the experiment, so we can't use the fact that we've observed 0 out of 10000 to conclude that the prior should be heavier on the low side than on the high side. And there are words ("the" "and" "of" "to" etc.) that are frequent enough (perhaps I should avoid using that word ... let's say "common enough") that a probability near 1 of occurrence of a word, if all we know about it a priori is that it is a word, is not an unrealistic model. However, the very common words tend to be articles, prepositions, and pronouns, so we could probably (there's that word, oh my!) use that information to design a *much* better prior. For example, we could use a uniform prior but set its upper limit at the maximum probability observed for nouns/verbs/adj/adv in a "reference corpus." Or we could use the distribution of probabilities for such words, as estimated from a reference corpus, as our prior. In fact we'd be well advised to try many such possibilities and compare the posteriors, to get a handle on how sensitive the result is to a range of realistic (?) choices.

    I suspect that Dr. B had computed the result for a uniform prior in order to model general binomial events with no assumptions about the nature of the process, and was quoting a result off the top of his head. In fact, his Bayesian colleagues are now seriously peeved with him...]

  • Hank Roberts // July 13, 2008 at 7:10 pm

    Tamino writes:

    > How would you test the fairness of the coin?

    James Annan writes:

    > the idea that they can provide different
    > answers to the same problem is nonsensical.
    > They are not talking about the same problem!


    [Response: First, keep in mind that even maniacal Bayesians will admit the validity of frequentist approaches in *some* cases. I suspect the coin-flip test is one such case.

    Some insight may be gained by considering estimates of climate sensitivity S. For a frequentist, a "probability distribution function" (pdf) refers to the physical world; the true pdf for climate sensitivity f(S) is about how likely the physical system is to exhibit sensitivity S. After all, it may not just be a fixed number; if you repeat the same experiment (say, doubling CO2) with multiple earths, you may see a different response each time -- one earth comes in at 2.8, another at 2.77, yet another at 3.14. But the differences are likely to be small, so the "frequentist pdf" for physical sensitivity likely has a very narrow range.

    For a Bayesian, the pdf is about what our *belief* is. So we might end up with a pdf for sensitivity which goes as low as 1.5 and as high as over 8! This does *not* mean that the Bayesian expects one earth to exhibit a temperature increase of only 1.5 while another warms by more than 8 degrees. It means that the Bayesian recognizes that our incomplete knowledge makes it possible that S is as low as 1.5 or that it's as high as 8; even if physical sensitivity is very precisely constrained, our knowledge of it is not. The Bayesian pdf is about estimating the probability for our belief about things.]

  • David B. Benson // July 13, 2008 at 7:45 pm


    and for the old stuff

  • Hank Roberts // July 14, 2008 at 2:10 am

    Okeh. James, on the coin-flip question, is it the same question for both experts consulted? They have a month before the Olympics, if they start investigating the specific coin now, can they say if one specific coin is fair?

    (I assume the answer’s different if you’re asking about all the supposedly identical coins struck on that pattern — any one could have been loaded after it was minted?)

  • James Annan // July 14, 2008 at 6:58 am


    From my perspective, the interesting question might be “what is the probability that the next coin toss is H” or “what is the probability that the next mail message contains the word ‘v–gra’”. Unfortunately, in this context these questions are meaningless to a frequentist. This is why I object to Tamino’s presentation - he is not actually clarifying what the frequentist is trying to do.

    What the frequentist is actually doing for the k/N calculation is presenting a *method* which, if repeated over infinitely many trials (each individual trial consisting of N samples of a coin toss with a fixed but unknown probability p of heads), generates the best estimator of the unknown p.

    That does not mean that the actual observed value k/N is a useful estimator for p, for a specific experiment once N and k are known. k/N is not, in general, reasonable odds for a bet on the next coin toss. It was not designed to be. If you ask a frequentist “what is p” he will simply say “I don’t know”. If you toss the coin 1000 times and get 550H, and ask him if it is fair, he will say “I don’t know, but such an outcome is very unlikely in repeated experiments of 1000 tosses of a fair coin”, and he can quantify “very unlikely”. (In practice, he will usually be Bayesian enough to admit that he thinks the coin is not fair, but strictly speaking, that is a question that he does not have the tools to deal with). In contrast, a Bayesian can provide an estimate of what he thinks are fair odds for the next coin toss, and he can go further and provide a full pdf for the long-run proportion of heads…but this *is* in principle a subjective opinion, even though in practice it is highly constrained by the data (ie all reasonable Bayesians will agree quite closely, even if they do not start from the same prior).

    Someone else mentioned “ignorant” priors. Be warned that there is no such thing! Every prior implies a specific and precise level of belief about the events it describes. (eg U[0,1] for the coin implies that you specifically think there is an 11.245% probability that p is less than 0.11245. And so on.) Sometimes, this is a reasonable prior. Sometimes, it is not - and note that by changing variables, you can describe any prior as being uniform in *something*, so if uniform = ignorant then all priors are simultaneously ignorant and non-ignorant…

  • James Annan // July 14, 2008 at 6:59 am

    well I don’t know if that lengthy comment got through…

    [Response: I regularly check the spam queue, so that legitimate but misclassified messages don't get lost. But I can see that I've chosen the word for my hypothetical case poorly -- you're not the only one whose comment was delayed due to its presence.]

  • Gavin's Pussycat // July 14, 2008 at 9:18 am

    Tamino, thanks! Yes this makes sense. Do I understand you correctly then, that ‘responsible prior building’ involves some sort of understanding of the processes involved in the real world, i.e., you’re not supposed to just make them up mathematically?

  • Gavin's Pussycat // July 14, 2008 at 2:12 pm

    Someone else mentioned “ignorant” priors. Be warned [...]

    Thanks James, you answered my question!

    So, what do you do when you have really, honestly no way of telling if a prior is reasonable? Give up on Bayes? Try a variety of cases to see how the land lies? …?

  • Timothy Chase // July 14, 2008 at 3:30 pm

    James Annan wrote in his last paragraph:

    Someone else mentioned “ignorant” priors. Be warned that there is no such thing! … note that by changing variables, you can describe any prior as being uniform in *something*, so if uniform = ignorant then all priors are simultaneously ignorant and non-ignorant…

    I think this deserves to be underscored.

    Incidentally, it reminds me of how in quantum mechanics one can always diagonalize the Hamiltonian and thereby arrive at a coordinate system (set of variables) in which all interactions between particles cease to exist.

  • David B. Benson // July 14, 2008 at 5:06 pm

    The closest to ignorant priors are obtained via the Maximum Entropy Principle (MEP). So for real-valued ‘random’ variables, the normal distribution is the only one satisfying MEP. This means that Dr. Bayesian has to give his belief of the mean and variance.

    For the coin, MEP gives 50% H; no other belief satisfies MEP.

  • Gavin's Pussycat // July 14, 2008 at 7:37 pm

    David: and for the email/viagra case?

  • David B. Benson // July 14, 2008 at 8:40 pm

    Gavin’s Pussycat // July 14, 2008 at 7:37 pm — That will actually be hard to work out just what MEP requires. Much easier is having a fixed but finite number N of possibilities; recall that only one of the N is ‘true’. Then MEP requires that the prior probability of each is 1/N.
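    The finite case can be checked directly. This is a small sketch that compares the Shannon entropy of the 1/N assignment with two other assignments over N = 6 outcomes. The two comparison distributions are made up for illustration.

```python
from math import log2

def shannon_entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

n = 6
uniform = [1 / n] * n
loaded = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]   # hypothetical alternative belief
certain = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # complete certainty about one outcome

for name, dist in [("uniform", uniform), ("loaded", loaded), ("certain", certain)]:
    print(f"{name:8s} entropy = {shannon_entropy(dist):.3f} bits")
```

    The uniform assignment attains log2(N) bits, the largest value possible with N outcomes, which is the sense in which MEP picks 1/N.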

  • David B. Benson // July 14, 2008 at 9:50 pm

    Here is a link to the abstract of a very recent paper on equilibrium climate sensitivity (ecs) using a novel method based on historical data:

    I was able to find the ‘interim report’, i.e., preprint, and here are some quotations from it:

    “Atmosphere-Ocean General Circulation Models (AOGCMs) show different climate sensitivity ranging from 1.9°C to 4.6°C (IPCC, 2007, pp.798-799)…”

    “[The research] indicates that the climate sensitivity is unlikely to be smaller than 2°C…”

    “Therefore, our analysis suggests that the well-defined peak of the PDF of climate sensitivity in former studies is a consequence of insufficient treatment of the historical development of radiative forcing uncertainty. Including these uncertainties implies that climate sensitivity is much less constrained at the high end than previously thought.”

    “The question still remains as to how to appropriately represent the forcing uncertainty, although it may ultimately depend on the specific research question. Our results support the idea of using the carbon cycle for climate sensitivity estimation. The interplay among the uncertainty estimates in the carbon cycle and climate systems encourages a holistic uncertainty analysis using an Earth system model with more complexity.”

    Now the question for Dr. Bayesian: My prior pdf for ecs is a heavy-tailed distribution, say a Weibull distribution with shape parameter between about 1.5 and 2 and with a mean around 2.8 K. After reading the above paper, I need to update to account for the large ecs’s implied by this paper.

    I don’t want to be overly formal about it, but how much weight do I give to this paper as opposed to earlier ones?

    [Response: You've already got your prior P(S). Now you need an estimate of P(this paper|S). That might be the hard part or the easy part, depending on what's in the paper!

    Can you send me a link to the preprint (I'll keep it confidential if you wish) so I can take a closer look?]
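    The update described in the response can be sketched on a grid. The prior below is a Weibull with shape 2 and mean 2.8 K, as in the comment. The likelihood function is purely a hypothetical stand-in for P(this paper|S), chosen only to show the mechanics; it is not derived from the paper.

```python
from math import exp, gamma

SHAPE = 2.0                            # Weibull shape, within the 1.5 to 2 range quoted above
MEAN = 2.8                             # prior mean in K, as quoted above
SCALE = MEAN / gamma(1 + 1 / SHAPE)    # Weibull mean = scale * Gamma(1 + 1/shape)

def weibull_pdf(s):
    z = s / SCALE
    return (SHAPE / SCALE) * z ** (SHAPE - 1) * exp(-z ** SHAPE)

def hypothetical_likelihood(s):
    # ASSUMPTION: illustrative only. Down-weights S below 2 K by an arbitrary factor.
    return 1.0 if s >= 2.0 else 0.2

step = 0.01
grid = [step * (i + 0.5) for i in range(2000)]          # midpoints covering 0 to 20 K
prior = [weibull_pdf(s) * step for s in grid]           # prior mass per grid cell
unnormalized = [p * hypothetical_likelihood(s) for p, s in zip(prior, grid)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]           # Bayes' rule on the grid

prior_mean = sum(s * p for s, p in zip(grid, prior)) / sum(prior)
posterior_mean = sum(s * p for s, p in zip(grid, posterior))
print(f"prior mean = {prior_mean:.2f} K, posterior mean = {posterior_mean:.2f} K")
```

    How much the paper moves the prior depends entirely on the likelihood chosen, which is the hard part referred to in the response.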

  • David B. Benson // July 14, 2008 at 10:36 pm

    Tamino — I went here

    down the page to

    Tanaka, K., T. Raddatz, B.C. O’Neill and C.H. Reick (2008). “Is the Climate Sensitivity Even More Uncertain?” IIASA Interim Report, IR-08-012, Jul 2008.

    and clicked on the link. (There’s nothing confidential about this preprint.)

  • James Annan // July 15, 2008 at 6:10 am


    MEM still leaves open the question of what one is to be ignorant about (ie what is the parameter that is assigned a Normal distribution).

    Gavin’s pussycat, by all means we should look at the sensitivity of the results to the prior (and to other assumptions - the prior is not the only source of subjectivity!) There can be no general laws about which priors are considered acceptable, but one can (and should IMO) indicate what sort of changes to the prior would overturn the researcher’s headline results.

    In practical examples, we are rarely truly “ignorant” anyway. For example, Stefan-Boltzmann tells us something about the effective radiating temperature of the planet, Clausius-Clapeyron gives us a rough estimate about water vapour…pretending we should be “ignorant” about all this and then using a prior that is uniform in sensitivity (and thus assigns absurdly high prior probability to high S) is a travesty, as I have explained in more detail ad nauseam elsewhere. I don’t mean to hijack this thread in that direction though.

  • Gavin's Pussycat // July 15, 2008 at 6:30 am

    Thanks David.

    Much easier is having a fixed but finite number N of possibilities; recall that only one of the N is ‘true’

    More precisely: N equivalent possibilities, right?

    It’s the insight that p(heads) = p(tails) a priori that gives us this solution. A corresponding insight in the viagra case is much, much harder to come by: p(contains) != p(does not contain) for starters.

    Would you agree?

  • David B. Benson // July 15, 2008 at 9:51 pm

    James Annan // July 15, 2008 at 6:10 am — It’s MEP for Maximum Entropy Principle. I’m not sure how to interpret your comment, so I’ll try to explain MEP in two cases.

    Suppose I have a parameter about which prior information tells Dr. Bayesian that it is some real number. Then MEP requires Dr. Bayesian to give his current belief in the mean and variance of the Gaussian distribution, using whatever knowledge he currently possesses.

    Consider instead a value which is, like climate sensitivity, constrained to be non-negative. Now MEP requires Dr. Bayesian to give his belief about four Lagrange multipliers! In practice this can be too hard and anyway, the resulting pdf can hardly be distinguished from other heavy-tailed distributions. Easier, and only slightly less precise, is to use a Weibull or log-normal distribution; both only require expressing one’s belief about two parameters. In a prior comment I did this for climate sensitivity; it is close to the distributions you & Hargreaves have worked out, which is why I chose it.

  • David B. Benson // July 15, 2008 at 10:12 pm

    Gavin’s Pussycat // July 15, 2008 at 6:30 am — I’m not sure what you mean by ‘equivalent’, but no, if it’s what I think you mean.

    Suppose I tell you I have a six-sided die, but I don’t let you see it. Then MEP gives the probability of a face coming up of 1/6. Now the experiment; I open my hand and you see that some of the faces are much smaller than others; it’s not a regular polyhedron. This will cause you to update your beliefs to some more complex collection of posteriors.

  • James Annan // July 16, 2008 at 1:35 am

    I find the comparison with the discrete case to generally cause more problems than it solves, because in that case there really is a “maximally ignorant” distribution, in a mathematically well-defined sense. Even here it is dubious if it really corresponds to the concept of “ignorance” - if you know nothing about a coin, would you happily bet on heads at odds of marginally better than evens?

    David, I still don’t follow exactly what you mean in the case of climate sensitivity. Let’s forget the nonnegativity so as to revert to a simple Gaussian. Why would you not give a Gaussian in feedback = 1/S? You are surely just as ignorant about that!

  • Gavin's Pussycat // July 16, 2008 at 7:23 am

    David, what you say is precisely what I mean. When you describe a die as having six ‘faces’ — abstractly, describing a system as having six distinct states — then in the absence of specific information, one must assume these states to be “equivalent”, i.e., have the same likeliness of occurring. What else can one assume in that situation?

    It is even reasonable when you know the faces to be different, but you don’t know which are large and which small. Your state of knowledge for all the faces is still “equivalent”.

    In the discrete case it really is this simple :-)

    James Annan:

    If you know nothing about a coin, would you happily bet
    on heads at odds of marginally better than evens?

    If I were confident that the other guy knows nothing about the coin either, and the money at stake were an amount I could afford to lose, and if I were a betting man, which I am not — yes, certainly :-)

  • David B. Benson // July 16, 2008 at 6:43 pm

    James Annan // July 16, 2008 at 1:35 am — Ok, simple Gaussian it is. I think you are now interested in looking at

    F = 1/S

    and clearly if S is Gaussian, F is not! Also vice-versa.

    I don’t know enough about MEP to give a satisfying answer. I would say whichever is more ‘physical’, i.e., going to zero at the infinities.

    Anyway, according to

    the Gaussian is the MEP solution for all distributions which have a given mean and given variance. But as both S and F cannot have a (finite) variance one or the other has to be chosen on some grounds.

    I (tentatively) conclude that Dr. Bayesian cannot be in total ignorance about a real-valued distribution. After all, in the discrete case he at least knows the number of possible outcomes.

    I end by stating that I completely botched the case of a distribution which is zero for the negative reals. The Wikipedia page does the matter properly.
    With regard to the climate sensitivity question, Dr. Bayesian, at an absolute minimum, needs to express his belief in the mean value.

  • David B. Benson // July 16, 2008 at 6:48 pm

    Gavin’s Pussycat // July 16, 2008 at 7:23 am — If one knows that the die has six faces, one could assume any six probabilities. But Dr. Bayesian would follow MEP and assign 1/6 to each.

    But, for example, upon seeing that the die is a truncated square pyramid (tsp) then he’ll probably want to revise this to probabilities which proclaim that the base of the tsp will most likely be on the table when the tsp is thrown.

    I think we are in agreement.

  • David B. Benson // July 16, 2008 at 9:44 pm

    The Wikipedia page on MEP is quite good:

  • David B. Benson // July 17, 2008 at 1:09 am

    This paper:

    Axiomatic Derivation of the Principle of Maximum Entropy …

    was illuminating.

  • David B. Benson // July 17, 2008 at 1:49 am

    Dr. Bayesian needs a least informative (MEP) prior for climate sensitivity. He knows that this is non-negative. He now needs one more piece of information, say obtained by redoing the Arrhenius calculation in a more modern fashion to obtain S = 3 K.

    This value is then the mean (or median if you prefer, I can’t rule that out) for the exponential distribution on the non-negative reals.

    The paper by Shore & Johnson finally helped me see this. The point is that a uniform distribution does not appear to ever satisfy MEP, so should be discarded as the ‘uninformed prior’.

  • James Annan // July 17, 2008 at 1:54 am


    You can see on that wikipedia page that in the case of continuous variables, the can is simply kicked down the road by the demand for an “invariant measure” which serves the purpose of concealing the need to make an arbitrary decision about the basis of the prior. ie, it is just a restatement of the problem, not a solution.

    Gavin’s pussycat;

    “If I were confident that the other guy knows nothing about the coin either, and the money at stake were an amount I could afford to lose, and if I were a betting man, which I am not — yes, certainly :-)”

    It doesn’t matter what the other person knows about the coin - the probability is supposed to be YOUR opinion. But I’ll let that pass.

    Let’s say the coin comes up H first time. Would you then use the Bayesian formula (which gives P(H)=2/3, P(T)=1/3) to give odds for the second toss? I certainly wouldn’t!
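    The 2/3 figure is what a uniform prior on P(H) gives after one observed head (Laplace’s rule of succession). The sketch below reproduces it and, for contrast, shows the same update under a prior sharply peaked at 0.5. The Beta(50, 50) prior is an arbitrary illustrative choice, not something proposed in the thread.

```python
from fractions import Fraction

def predictive_heads(heads, tails, a=1, b=1):
    """Posterior predictive P(next toss is heads) under a Beta(a, b) prior.
    Beta(1, 1) is the uniform prior on the heads probability."""
    return Fraction(heads + a, heads + tails + a + b)

after_one_head_uniform = predictive_heads(1, 0)          # uniform prior
after_one_head_peaked = predictive_heads(1, 0, 50, 50)   # hypothetical peaked prior

print(f"uniform prior: P(H) = {after_one_head_uniform}")
print(f"peaked prior:  P(H) = {after_one_head_peaked}")
```

    With the peaked prior a single head barely moves the odds, which is closer to how most people would actually bet on an ordinary coin.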

  • Gavin's Pussycat // July 17, 2008 at 12:18 pm


    put that way, I agree. I answered to the question “if someone offers to pay $100 on heads, if I pay $99 on tails, for a coin neither of us knows (to exclude foul play), — do I take the offer?” Perhaps not what you intended to ask :-)

  • Gavin's Pussycat // July 17, 2008 at 12:39 pm


    having thought about your coin example, doesn’t it mean a prior (for coins from the big world that the other guy hasn’t had his fingers on) of a narrow peaked distribution centred on 0.5?

    Such a prior would explain your behavior; but it is a culturally defined one based on your experience with, and our expectations about, coins, not general two-state systems.

    (If the other guy provides the coin, then my prior would depend on what bet he proposes ;-) )

  • David B. Benson // July 17, 2008 at 7:13 pm

    James Annan // July 17, 2008 at 1:54 am — Well, I’m still learning about MEP and especially Bayesian == MEP. I’m going to try to avoid ‘kicking the can down the road’ by pointing out that a uniform prior, U[a,b], is actually highly informed; one needs at least all odd moments of the moment generating function. From Boltzmann’s theorem, just selecting the mean will suffice. Dr. Bayesian could use (1/2)(a+b) if nothing more informative occurs to him.

    The trouble, of course, is that just using the variance will do as well, resulting in a rather different prior. So avoiding ‘kicking the can’ appears to be a question of what moments one has some prior knowledge about; I earlier suggested the mean.

  • David B. Benson // July 17, 2008 at 9:06 pm

    Over lunch with Carl Hauser (the local Dr. Bayesian) we discussed the ‘ignorant prior’ problem. We worked our way to agreeing with

    “The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b] (which means that the probability density is 0 outside of the interval).”

    found here just now:

    which agrees with the Boltzmann theorem in the case of n = 0.

    For a more informed prior over [a,b], use a belief in the mean, m, lying between a and b and use the Boltzmann theorem to obtain what amounts to a truncated exponential distribution.
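    A minimal sketch of that construction: the maximum entropy density on [a, b] with a prescribed mean has the form exp(-lam * x) up to normalization, and lam can be found by bisection. The function names, the bracket of plus or minus 50/(b - a), and the example interval [0, 10] with mean 3 are assumptions for illustration.

```python
from math import expm1

def truncated_exp_mean(lam, a, b):
    """Mean of the density proportional to exp(-lam * x) on [a, b]."""
    width = b - a
    if abs(lam * width) < 1e-6:
        return a + width / 2              # lam near 0 recovers the uniform distribution
    return a + 1 / lam - width / expm1(lam * width)

def solve_rate(mean, a, b, tol=1e-12):
    """Bisection for lam so that the truncated exponential on [a, b] has the given mean.
    The bracket covers means from roughly 2% to 98% of the way across the interval."""
    if not a < mean < b:
        raise ValueError("mean must lie strictly between a and b")
    lo, hi = -50 / (b - a), 50 / (b - a)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if truncated_exp_mean(mid, a, b) > mean:
            lo = mid                      # mean too large, so the rate must increase
        else:
            hi = mid
    return (lo + hi) / 2

lam = solve_rate(3.0, 0.0, 10.0)
print(f"lam = {lam:.4f}, check mean = {truncated_exp_mean(lam, 0.0, 10.0):.4f}")
```

    A mean at the midpoint gives lam = 0, i.e. the uniform distribution; a mean below the midpoint gives a decaying density and a mean above it a growing one.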

  • James Annan // July 18, 2008 at 1:45 pm


    That is only true if you arbitrarily define entropy so as to make it true. Uniform on [a,b] and uniform on [1/b,1/a] are different but both have the same support, and both maximise entropy if you define it in the right way! Why should you be “ignorant” about x and not 1/x?


    It is precisely my point that in practice, all real world problems involve cultural and experiential decisions. There is no such thing as a “general two-state system”, at least outside of textbooks and artificial puzzles.

    Claiming it is “correct” to feign ignorance, is generally easily defeasible by the observations that (a) ignorance cannot be adequately represented in Bayesian statistics and (b) we are not really ignorant anyway. Better IMO to try to deal with the problem of what we should claim to reasonably believe in an open and honest manner rather than brushing the problem under the carpet (which is what a whole bunch of climate scientists have basically done over the last few years).

  • David B. Benson // July 18, 2008 at 7:19 pm

    Tamino — I forgot about html, so the previous try is messed up. Kindly delete it. Here is a correct version:

    James Annan // July 18, 2008 at 1:45 pm — MEP distributions are invariant under scaling transformations (Shore & Johnson) so Dr. Bayesian can choose a coordinate system in which a > 1 and b > a. U[a,b] involves expressing belief that

    the probability of x less than a is zero.
    the probability of x greater than b is zero.

    So Dr. Bayesian believes that the probability of x lying between 1/b and 1/a, inclusive, is precisely zero.

    This is not, IMHO, an ‘uninformed’ prior. All the MEP variations I have found require expressing a belief in at least two predicates.

    I entirely agree with “… we are not really ignorant anyway.”

  • David B. Benson // July 18, 2008 at 11:06 pm

    Entropy is defined precisely as in statistical mechanics, hardly arbitrary; while E.T. Jaynes called his principle MaxEnt, it equally well could have been called MinInfo for Minimum Information.

    All MEP distributions are invariant under affine transformations, in the case being considered here those of the form

    rx + t

    for constants r and t. There is nothing which states that these are invariant under other transformations such as 1/x or exp(x), neither of which are even bijections on the reals. So for Dr. Bayesian to express belief in U[a,b] is completely different, and unrelated to expressing belief in U[1/b,1/a].

    If Dr. Bayesian is going to express belief in exactly two predicates, as I currently understand the matter, he can choose one of

    A mean and a variance;
    Negative values have probability zero, together with the mean, the exponential distribution.

    And of course any affine transformation of these; this only makes some sort of difference for the exponential distribution.

    At this point I will advocate using the exponential distribution for an equilibrium climate sensitivity prior with the median being, say, ModelE’s approximately 2.6 K for 2xCO2; this fixes the mean.
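    For reference, fixing the median of an exponential distribution does fix its rate and mean. The sketch below works that out for the 2.6 K median quoted above, and adds a tail probability as an illustration of what such a prior implies. The 10 K threshold is an arbitrary choice.

```python
from math import exp, log

median = 2.6                  # K for 2xCO2, the value quoted above
rate = log(2) / median        # exponential median = ln(2) / rate
mean = 1 / rate               # exponential mean = 1 / rate

def prob_exceeds(s):
    """Prior probability that sensitivity exceeds s under this exponential prior."""
    return exp(-rate * s)

print(f"rate = {rate:.4f} per K, mean = {mean:.2f} K")
print(f"P(S > 10 K) = {prob_exceeds(10.0):.3f}")
```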

  • David B. Benson // July 19, 2008 at 12:17 am

    Oops. Not ’statistical mechanics’.

    Statistical thermodynamics.

  • James Annan // July 19, 2008 at 7:39 am

    “So for Dr. Bayesian to express belief in U[a,b] is completely different, and unrelated to expressing belief in U[1/b,1/a].”

    For Dr Bayesian to express any specific belief in X is logically equivalent to expressing a related specific belief in Y=1/X. If he believes that a and b are bounds on X, he also believes that 1/b and 1/a are bounds on Y. The question is why he should use a uniform distribution U[a,b] for X rather than expressing his belief via U[1/b,1/a] for Y. The ranges of the intervals are equivalent but the distributions are clearly different. Both can be presented as maximum entropy distributions by choosing an appropriate invariant measure.

  • Gavin's Pussycat // July 19, 2008 at 2:22 pm

    OK James, I see your point (which you express here with the greatest clarity that I have seen from you so far :) .

    Having played around a bit with this stuff (preparing to lecture on it, wish me luck) I can only agree.

  • David B. Benson // July 19, 2008 at 10:41 pm

    James Annan // July 19, 2008 at 7:39 am — Yes, Dr. Bayesian can choose either U[a,b] or U[1/b,1/a] as his maximum entropy distribution. The transformation Y = 1/X between these will not preserve the maximum entropy property, not being an affine transformation:

    Or anyway, that’s how I understand it, still trying to learn these concepts.

    With regard to the (simplified) climate sensitivity equation

    S = R/(1-f)

    for constant R and feedback f in the half-open interval [0,1), suppose S-R is exponentially distributed. Then the pdf for

    f = 1 - R/S

    clearly fits into the Boltzmann theorem framework (sending n to infinity). So the moments are known and for those, the pdf of feedback f is maximum entropy.

  • David B. Benson // July 20, 2008 at 6:52 pm

    Of course any closed interval [a,b] can be carried into any other closed interval [c,d] by an ‘invariant measure’, to wit

    f(x) = ((d-c)/(b-a))(x-a) + c

    as the coordinate transformation (affine transformation), which does carry U[a,b] to U[c,d].

    However, g(x) = 1/x does not carry U[a,b] to U[1/b,1/a], but rather some non-uniform pdf. For example, after accumulating half the probability in going from a to (a+b)/2, under the mapping g(x) half the probability lies between 2/(a+b) and 1/a. But in general 2/(a+b) is not the midpoint of the interval [1/b,1/a].
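    The midpoint argument can be checked numerically. This sketch compares the CDF of Y = 1/X for X uniform on [a, b] with the CDF of a genuinely uniform variable on [1/b, 1/a]. The endpoints a = 1, b = 3 are arbitrary illustrative values.

```python
def reciprocal_cdf(y, a, b):
    """CDF of Y = 1/X when X is uniform on [a, b], with 0 < a < b."""
    if y <= 1 / b:
        return 0.0
    if y >= 1 / a:
        return 1.0
    return (b - 1 / y) / (b - a)        # P(Y <= y) = P(X >= 1/y)

def uniform_cdf(y, lo, hi):
    """CDF of a variable uniform on [lo, hi]."""
    return min(1.0, max(0.0, (y - lo) / (hi - lo)))

a, b = 1.0, 3.0                          # illustrative endpoints
image_of_midpoint = 2 / (a + b)          # g((a + b) / 2)
midpoint_of_image = (1 / b + 1 / a) / 2  # midpoint of [1/b, 1/a]

print(f"image of midpoint = {image_of_midpoint:.4f}, midpoint of image = {midpoint_of_image:.4f}")
print(f"CDF of 1/X at image of midpoint:      {reciprocal_cdf(image_of_midpoint, a, b):.2f}")
print(f"CDF of U[1/b,1/a] at image of midpoint: {uniform_cdf(image_of_midpoint, 1 / b, 1 / a):.2f}")
```

    The two CDFs disagree at the same point, so the two beliefs are different distributions over the same interval.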

  • David B. Benson // July 20, 2008 at 9:44 pm

    The general (simple) feedback gain equation is

    S = R/(1-f)

    for constant R and feedback f in the half-open interval [0,1). To simplify, scale so that R = 1 and introduce the new variable v = S -1. With these, the feedback gain equation can be written as

    v = f/(1-f); f = v/(1+v).

    Uniform pdf: Assume v is uniformly distributed as U[0,1] for purposes of illustration. The cdf for v is then just v itself. Then the cdf for f is f/(1-f); this distribution certainly appears to be far from a maximum entropy cdf. Of course, by the near-symmetry one could begin by assuming that f is uniformly distributed on U[0,1) to observe the same lack of maximum entropy in the corresponding cdf for v.

    Exponential: Assume the pdf for v is exponentially distributed; p(v) = r*exp(-rv) for some rate parameter r. Then the pdf for f is p(f) = r*exp(-rf/(1-f)) which by the Boltzmann theorem is maximum entropy as well. Again the near-symmetry enables one to do this in the opposite direction. Assume the pdf for f is p(f) = k*exp(-sf) for constants k and s, recalling the support is only in [0,1). Then the pdf for v is p(v) = k*exp(-sv/(1+v)), again maximum entropy by the Boltzmann theorem, where one has to check that this applies in this case over the interval [0,+infinity).
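    As a numerical check on the change of variables from v to f = v/(1+v), the sketch below applies the standard rule p_f(f) = p_v(v(f)) * |dv/df| with dv/df = 1/(1-f)^2. With that factor the transformed density integrates to one. The bare expression r*exp(-rf/(1-f)) does not, so it is probably best read as an unnormalized shape. The rate r = 1 and the midpoint-rule integrator are illustrative choices.

```python
from math import exp

def integrate(func, lo, hi, steps=200_000):
    """Midpoint rule; midpoints never touch the endpoint f = 1."""
    width = (hi - lo) / steps
    return sum(func(lo + (i + 0.5) * width) for i in range(steps)) * width

r = 1.0                                   # illustrative rate parameter

def pdf_v(v):
    return r * exp(-r * v)                # exponential density for v

def pdf_f(f):
    v = f / (1 - f)
    return pdf_v(v) / (1 - f) ** 2        # includes the Jacobian dv/df

def bare_expression(f):
    return pdf_v(f / (1 - f))             # same expression without the Jacobian

total_with_jacobian = integrate(pdf_f, 0.0, 1.0)
total_without = integrate(bare_expression, 0.0, 1.0)
print(f"with Jacobian:    {total_with_jacobian:.4f}")
print(f"without Jacobian: {total_without:.4f}")
```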

    Either way, both distributions of interest, that for v and that for f, are maximum entropy distributions, but have rather different shapes depending upon whether p(v) or else p(f) is taken as exponential. It seems that Dr. Bayesian cannot simultaneously believe both. Note that

    Tomassini, L., P. Reichert, R. Knutti, T. F. Stocker and M. Borsuk, Robust Bayesian uncertainty analysis of climate system properties using Markov chain Monte Carlo methods, Journal of Climate, 20, 1239-1254

    determines upper and lower bounding pdfs for climate sensitivity, etc. I suppose that Dr. Bayesian might do the same with the two alternatives

    p(v) = r*exp(-rv)
    p(f) = k*exp(-cf).

  • Gavin's Pussycat // July 21, 2008 at 6:33 am

    Either way, both distributions of interest, that for v and that for f, are maximum entropy distributions, but have rather different shapes depending upon whether p(v) or else p(f) is taken as exponential. It seems that Dr. Bayesian cannot simultaneously believe both.

    Hmmm. But do these two alternatives maximise the same entropy (definition)? If not, Dr. Bayesian will have to make up his mind on which one makes most physical sense in this situation. Right?

  • Gavin's Pussycat // July 21, 2008 at 8:06 am

    About the uniform prior, isn’t part of the problem that, to be a valid ME distribution it has to be defined on a finite interval, both ends of which have to be given?

    For the exponential distribution, this is clearer: you specify that only positive values are allowed, and a plausible expected value, e.g., Arrhenius.

    You don’t escape the problem of choosing the variable you’re working in, f or S; but in this case it seems to be a small problem. You get non-absurd results either way.

    In the uniform case, what you often see is that the lower bound is set to 0, and the upper bound to some high value “out there”, without much discussing why. The idea seems to be that the finiteness of the support is an ugliness to be abstracted from, and what we really want is the uniform distribution on the infinite positive axis — which doesn’t exist of course.

    This is wrong! Both bounds are integral parts of the plausibility statement that defining a prior is. E.g., if you choose the high bound 10C, you imply that a sensitivity as high as 9C is perfectly plausible — which the folks that have studied the mechanisms involved would likely dispute.

    As always, my understanding.

  • Gavin's Pussycat // July 21, 2008 at 8:29 am

    … I suppose that Dr. Bayesian might do the same with the two alternatives

    p(v) = r*exp(-rv)
    p(f) = k*exp(-cf).

    …the two edges of the road on which James’s can is still firmly positioned ;-)

    My understanding would be that, if you use physics (modelling) to provide your prior, you should work in f space; if you use observed responses to known forcings, you should work in S space (and avoid double-counting).

  • David B. Benson // July 21, 2008 at 6:36 pm

    Gavin’s Pussycat — The uniform prior is only defined on a finite interval; both end points have to be given.

    I, too, find assuming all values of S between, say, 0.7 K and 10 K as equally likely to be absurd. But in this series of comments I’ve been attempting to work out just what MEP provides. As I’ll show later, actually very little; a weak, but precise, form of Ockham’s Razor.

    I don’t see any work using f space. For example, Schmidt et al. (2007) [available from Gavin Schmidt's publication page] discusses all the physics in GISS’s Model E; the equilibrium climate sensitivities (ecs) to 2xCO2 of the various versions of ModelE are reported, being from a low of 2.4 K to a high of 2.8 K. Anyway, working with the gain (ecs) rather than the feedback seems more natural to me.

    Avoiding double counting is very important to correctly applying Bayes’s Rule. I don’t see (yet) how to properly apply Bayesian reasoning to start from, say Annan & Hargreaves, and update by the paper I referenced earlier. Not to mention the rather disturbing conclusions in

    Interim Report IR-08-012
    Is the Climate Sensitivity Even More Uncertain?
    Katsumasa Tanaka
    Thomas Raddatz
    Brian C. O’Neill
    Christian H. Reick
    International Institute for
    Applied Systems Analysis
    Schlossplatz 1
    A-2361 Laxenburg, Austria

  • David B. Benson // July 21, 2008 at 10:39 pm

    A continuum of maximum entropy distributions —


    p(v; r, n) = k*exp(-rv^n)

    supported on [0,+infinity) with r greater than 0 and n, not necessarily an integer, greater than or equal to 1. Then k = k(r,n) is just scaling so that p(v;r,n) is a pdf. Note that p(v;r,1) is the exponential distribution.

    Each p(v;r,n) is maximum entropy when the expectation of v^n is known; so for each only one belief needs be expressed, being that expectation. With that then the rate r is determined.
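    For this family the constants have closed forms: k(r,n) = n * r^(1/n) / Gamma(1/n), and the expectation of v^n equals 1/(n*r), so the belief in that expectation fixes r. The sketch below verifies both by numerical integration. The helper names, the test value E[v^n] = 4, and the truncation point of the integral are assumptions for illustration.

```python
from math import exp, gamma

def normalizer(r, n):
    """k(r, n) so that k * exp(-r * v**n) integrates to 1 on [0, infinity)."""
    return n * r ** (1 / n) / gamma(1 / n)

def rate_from_moment(moment, n):
    """Rate r for which E[v**n] equals the given moment, using E[v**n] = 1 / (n * r)."""
    return 1 / (n * moment)

def check(n, moment, steps=100_000):
    """Return (total probability, E[v**n]) computed with a midpoint rule."""
    r = rate_from_moment(moment, n)
    k = normalizer(r, n)
    upper = (40 / r) ** (1 / n)           # beyond this exp(-r * v**n) < exp(-40)
    width = upper / steps
    total = 0.0
    nth_moment = 0.0
    for i in range(steps):
        v = (i + 0.5) * width
        density = k * exp(-r * v ** n)
        total += density * width
        nth_moment += v ** n * density * width
    return total, nth_moment

for n in (1, 2.5):
    total, nth_moment = check(n, 4.0)
    print(f"n = {n}: total = {total:.5f}, E[v^n] = {nth_moment:.5f}")
```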

    Now consider the limit as n goes to infinity; this limit is precisely U[0,r] (or else U[0,r)). By previous considerations, this means the pdf for the associated feedback ought to be maximum entropy as well, but I’ve not attempted to work this out.

    In any case, I currently can see no reason why Dr. Bayesian should believe in any one of the p(v;r,n) any more than any of the others, the choice being only the value of n; maximum entropy does not help make this choice.

    With only knowledge of the support, [0,+infinity), and a single value, the expectation, maximum entropy rules out ‘wiggly’ distributions such as Weibull and log-normal which require believing in two values. In this sense it is a formal expression of Ockham’s Razor.

  • James Annan // July 22, 2008 at 9:35 pm


    You still haven’t mentioned the invariant measure you need to define entropy over a continuous distribution. Depending what this is, you can get anything!

  • David B. Benson // July 22, 2008 at 10:41 pm

    James Annan // July 22, 2008 at 9:35 pm — I did, but I’ll do it again:

    Since we are working within the real line and wish to do arithmetic, there is no choice, AFAIK, but to use a Borel sigma-algebra with

    If one is willing to give up arithmetic, then other invariant measures could, of course, be used. But without arithmetic there can be no formalization of feedback.

  • David B. Benson // July 23, 2008 at 12:32 am

    This link may be clearer:

    which is an invariant Haar measure. (There are many more.)

    We are interested in functions f which preserve measure in the sense that for all pairs of open intervals (a,b), (c,d), |b-a|/|d-c| = |f(b)-f(a)|/|f(d)-f(c)|. The only such f are the functions of the form

    f(x) = rx + t

    for constants r and t, the coordinate transformations. These are too precious to give up and the paper by Shore & Johnson develops maximum entropy with invariance under coordinate transformations as one of the axioms.

  • James Annan // July 23, 2008 at 11:56 am


    None of that actually addresses the issue. Why should you “preserve measure” for the distribution of X viewed as a variable in [a,b] and not for the distribution of Y=1/X which is a variable in [1/b,1/a]? You are just begging the question. For that matter, I am not even convinced that the “invariant measure” of Jaynes is the same thing as the “invariant measure” on that other wikipedia page (there is no clear link between the two).

  • Ray Ladbury // July 23, 2008 at 12:49 pm

    David and James,
    It seems to me that there are essentially 3 types of inference in which we might be interested. The first is whether a particular effect (be it greenhouse heating or asymmetry of a coin) is potentially important–call this the exploratory phase. In the second, we are trying to obtain a best estimate of the significance of the effect–call this the modeling phase. In the third, we may wish to remediate the effects (decrease warming or engineer a new, more symmetric coin)–call this the engineering phase. It seems to me that you might want very different priors for these different phases, and that the importance of the prior would be greater for the first and third than for the second (where hopefully you have enough data to diminish the Prior’s role). Any thoughts?

  • Gavin's Pussycat // July 23, 2008 at 2:54 pm


    James says it, but let me try to say it clearer.

    The “invariance” you are referring to, applies only to a limited class of linear transformations. What it states, essentially, is that you will get the same result if you study p(S) with S being expressed in degrees C per CO2 doubling, degs F per CO2 tripling or degrees Monckton per CO2 quintupling :-) The relevant entropy to be maximized, and the ME solution found will be the same.

    Now if, instead, you study p(L) with L=1/S, that is no longer the case. The transformation S to L is no longer in that limited linear/affine class. If you now apply the same ME method in L space, the entropy definition applied, and the optimum found, will be different.

    (Now also within L’s equivalence class there are alternative arguments, like f=(1-L) etc which have the same entropy definition.)

    James’s point is, that without looking at the physics of the situation, purely mathematically, there is no way of deciding whether it is more proper to do RE in S space or in L (= f) space.

    Ultimately it reduces to symmetry considerations: for a fair die, all six sides are equivalent, i.e., permutation symmetry of an ideal cube. For a real line segment, it is translational symmetry, meaning that the piece (x1,x1+dx) is equivalent to (x2,x2+dx) for any x1, x2. Map this on the circumference of a circle and you’re talking roulette wheels and circular symmetry…

    You can also talk about entropy and information content. Consider a line segment of x: (a,b) divided into 1024 equal pieces dx. Then choosing one such piece expresses 10 bits of information. Now, look at the corresponding segment (1/b,1/a) of y=1/x and divide that into 1024 equal pieces dy. Each will again represent 10 bits, but the two sets of pieces don’t map one-to-one to each other. The information content of a piece dx mapped to y space will thus not be precisely 10 bits — sometimes less, sometimes more. Different entropy definition.
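    The bit-counting argument can be made concrete. This sketch cuts (a, b) into 1024 equal pieces in x, maps each piece to y = 1/x, and measures how much of the y interval it covers, which is its probability under a prior uniform in y. The endpoints a = 1, b = 2 are an arbitrary illustrative choice.

```python
from math import log2

a, b = 1.0, 2.0                  # illustrative endpoints
pieces = 1024
dx = (b - a) / pieces
y_width = 1 / a - 1 / b          # length of the y interval (1/b, 1/a)

masses = []
bits = []
for i in range(pieces):
    x_lo = a + i * dx
    x_hi = x_lo + dx
    # Fraction of the y interval covered by the image of this x piece.
    mass = (1 / x_lo - 1 / x_hi) / y_width
    masses.append(mass)
    bits.append(-log2(mass))

print(f"bits per x piece under a prior uniform in x: {log2(pieces):.0f}")
print(f"bits per x piece under a prior uniform in y: {min(bits):.2f} to {max(bits):.2f}")
```

    For these endpoints the same x piece carries anywhere from about 9 to about 11 bits when measured against the prior uniform in y, rather than exactly 10.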

    Hope this helps (and that I got it right).

  • David B. Benson // July 23, 2008 at 7:29 pm

    James Annan // July 23, 2008 at 11:56 am & Gavin’s Pussycat // July 23, 2008 at 2:54 pm — I’ll try again later, but just for now:

    (1) E.T. Jaynes obtained this idea from Harold Jeffreys. While I don’t know the history in any detail, I’m sure they were both using ‘invariant Haar measure’ (iHm). It is the case that every probability distribution supported on the real line defines its own iHm.

    (2) GP — I agree that maximum entropy is not conserved by any mapping other than the coordinate transformations, in general. So MEP cannot be used to decide between expressing a maximum entropy prior on gain or a maximum entropy prior on feedback, under the assumption that the same class of pdf’s is to be used in both situations. I went through the details of this in an earlier comment on this thread.

    It’s not a different entropy definition, it’s a different probability distribution.

  • David B. Benson // July 23, 2008 at 10:22 pm

    In this book preview,

    find page 46 to read at least the first paragraph of Chapter 6 to discover that we need to use a topological symmetry group (Lie group) to express the notion of the ‘least informed prior’.
    Somewhat later in chapter 6 I learned that the ‘invariant measure’ used by Bayesians is that measure known to mathematicians as ‘left invariant Haar measure’ (liHm).

    The group of coordinate transformations obtained by varying

    f(x) = rx + t

    over all real ‘constants’ r and t, is such a Lie group and possesses Lebesgue measure as its liHm. It is this group and its liHm which are used by Shore & Johnson (linked above) as part of their axiomatic definition of maximum entropy. I use their definition (although I am sure it gives the same results as Jaynes’s original definition).

    (1) On the face of it the transformation

    g(x) = 1/x

    is not in the chosen Lie group and so does not preserve maximum entropy.
    (2) In the set of all probability distributions supported on the real line, the normal distributions all have maximum entropy. The coordinate transformations each take every normal distribution into another normal distribution; maximum entropy is preserved. We take the normal distributions as the least informed of distributions with that support.
    (3) In the set of all probability distributions supported by the positive reals the subset of exponential distributions are of maximum entropy. Remarks similar to those in (2) apply.
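
    A small numerical check of (2) and (3), under an assumption the comment leaves implicit: the standard results say the normal maximizes differential entropy among distributions on the real line with a fixed variance, and the exponential does so among distributions on the positive reals with a fixed mean. The comparison distributions (uniform, Laplace, half-normal) are my own choices for illustration; the formulas are the textbook closed forms in nats.

```python
import math

# Real line, all with the same standard deviation sigma.
def h_normal(sigma):
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

def h_uniform_fixed_var(sigma):
    return math.log(sigma * math.sqrt(12))       # uniform of width sigma*sqrt(12)

def h_laplace_fixed_var(sigma):
    return 1 + math.log(2 * sigma / math.sqrt(2))  # Laplace scale b = sigma/sqrt(2)

# Positive reals, all with the same mean.
def h_exponential(mean):
    return 1 + math.log(mean)

def h_uniform_fixed_mean(mean):
    return math.log(2 * mean)                    # uniform on (0, 2*mean)

def h_half_normal_fixed_mean(mean):
    s = mean * math.sqrt(math.pi / 2)            # half-normal scale giving this mean
    return 0.5 * math.log(math.pi * s ** 2 / 2) + 0.5

sigma, mean, r = 1.0, 1.0, 3.0
print(h_normal(sigma), h_uniform_fixed_var(sigma), h_laplace_fixed_var(sigma))
print(h_exponential(mean), h_uniform_fixed_mean(mean), h_half_normal_fixed_mean(mean))
# An affine map x -> r*x + t sends a normal to a normal and shifts entropy by ln|r|.
print(h_normal(r * sigma) - h_normal(sigma), math.log(r))
```

    In each group the first entry comes out largest, and the affine map only shifts the entropy by ln|r|, so it cannot change which distribution is the maximizer.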

    While there are many books introducing Bayesian reasoning or maximum entropy methods, E.T. Jaynes’s “Probability Theory: The Logic of Science” comes well-reviewed; the reviewer states that Jaynes’s book is the most entertaining.

  • Gavin's Pussycat // July 24, 2008 at 8:10 am


    Mostly agree, although you don’t seem to see the full implications of (1). It isn’t just “on the face of it”, it is really so.

    As to maximum entropy on the real line, the measure used is called “line length”, and it is invariant under translation. This measure is both Lebesgue and Haar. Note that the translational invariance disappears if you transform to y=1/x: x-line length is not preserved over y-translation.

    Perhaps we agree and are talking past each other.

    Jaynes is a very good writer and explainer (better than any of us here :-) judging from his famous article. Haven’t read the book.

  • David B. Benson // July 24, 2008 at 6:52 pm

    Gavin’s Pussycat // July 24, 2008 at 8:10 am — We agree; this communication method leads to difficulties. By the way, ‘on the face of it’ is a polite way to say ‘obviously’. Could you kindly reference, or better link to, the paper of his you are referring to?

    In a subsequent comment I’ll work out some consequences of D(x) = 1/x. For now, I need to make some small corrections to previous comments.

    The coordinate transformation group consists of dilations

    d(x) = rx

    for r in the open interval (0,+infinity), the translations

    T(x) = x + c

    for any c, and then compositions of the above. For probability distributions supported on the non-negative reals, such as the exponential distribution, only the group of dilations is possible. Similarly for those distributions, such as the lognormal distribution, for which p(0) = 0 and the support is only on the positive reals.

  • David B. Benson // July 24, 2008 at 8:51 pm

    I’ll be hornswoggled!

    The mapping D(x) = 1/x is its own inverse; D(D(x)) = x. So D(x) generates a two element group of transformations. This suggests looking for a probability distribution which is invariant under the actions of this group.

    Definition (the ab probability distribution). Each probability distribution

    p(x) = k*exp(-rA(x))

    for normalizing constant k, rate parameter r, and kernel

    A(x) = 1/x for x in the open interval (0,1) and
    A(x) = x for x in the half-open interval [1,+infinity)

    is said to be an ab probability distribution.

    (1) Each ab probability distribution is invariant under D(x) in that p(D(x)) = p(x).
    (2) From Boltzmann’s theorem, the ab probability distributions are of maximum entropy among those probability distributions with known expected values of the kernel A(x).
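
    Claim (1) is easy to check numerically. The sketch below uses the unnormalized form exp(-rA(x)), since the constant k cancels on both sides; the function names and the choice r=1 are mine. It checks only the pointwise identity p(D(x)) = p(x) as stated, and says nothing about how a density transforms under a change of variables.

```python
import math

def kernel_A(x):
    """Kernel of the 'ab' distribution: 1/x on (0, 1), x on [1, +infinity)."""
    if x <= 0:
        raise ValueError("kernel is only defined for x > 0")
    return 1.0 / x if x < 1.0 else x

def unnormalized_p(x, r=1.0):
    """exp(-r*A(x)); the normalizing constant k is omitted on purpose."""
    return math.exp(-r * kernel_A(x))

def D(x):
    """The inversion map, which is its own inverse."""
    return 1.0 / x

for x in (0.1, 0.5, 1.0, 2.0, 7.5):
    print(x, unnormalized_p(x), unnormalized_p(D(x)))
```

    Each row prints the same value twice (up to floating-point rounding), because A(1/x) = A(x) on both branches of the kernel.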

    Unusual, to say the least…

  • Gavin's Pussycat // July 25, 2008 at 3:03 pm


    author = {E. T. Jaynes},
    title = {{Information Theory and Statistical Mechanics}},
    journal = {Physical Review},
    volume = 106,
    year = 1957,
    note = {URL: \url{}, accessed July 24, 2008}

  • David B. Benson // July 25, 2008 at 9:50 pm

    Ray Ladbury // July 23, 2008 at 12:49 pm — Ideally, the posterior for each phase becomes the prior to the next.

    As I understand the problem, there is not enough observational data of high enough quality to adequately ‘mask off’ the effect of the second-phase prior in assigning probabilities to high sensitivity. By being conservative, the resulting probabilities for high sensitivity are such that very significant mitigation must be started immediately, no matter that this will cost about a trillion a year.

  • David B. Benson // July 26, 2008 at 2:03 am

    Gavin’s Pussycat // July 25, 2008 at 3:03 pm — Thank you!

    Very good paper; thought-provoking.

    In the most recent AIP Conference Proceedings of the Workshop on Maximum Entropy and Bayesian Methods in Science and Engineering there was a paper about deriving Newton’s laws of motion from just MEP + Bayesian. But Jaynes’s paper makes it much clearer why these sorts of ideas work.

  • Gavin's Pussycat // July 26, 2008 at 4:30 pm


    I could imagine what such a derivation looks like… but did it mention the ‘elephant in the room’, i.e., the translational symmetry of physical spacetime?

  • David B. Benson // July 26, 2008 at 7:27 pm

    Gavin’s Pussycat // July 26, 2008 at 4:30 pm — I don’t recall the details, but I’m sure that is part of the derivation.

  • David B. Benson // July 27, 2008 at 11:23 pm

    Which ‘least informed’ prior for the climate sensitivity problem?

    While I am sure we all agree that the climate sensitivity, S, cannot be zero or less, it seems to take some thought to find a ‘least informed’ prior which then places some form of constraint on high S. We require a ‘subjective’ pdf on the open interval (0,+infinity); this means that the probability has to go to zero as S goes to infinity. Here I will attempt to defend the lognormal distribution as the rational choice for our lack of knowledge of the value of S. (In a previous comment I linked one of the many sites for the lognormal distribution.)

    Begin by excluding a uniform distribution with high upper bound: look at the Antarctic long ice core temperature proxy records to before termination 5 (MIS 11) to see that the orbital forcing temperature variations were rather small in comparison to the last five terminations. At that deep time, S must have been moderately small and not much bigger for the five modern terminations, from MIS 11 to termination 1 (LGM to Holocene). So our prior belief is now informed enough to decide upon some pdf, p, with p(0) = 0, growing to a maximum and then declining to 0 for large S.

    A digression on the normal distribution: The set of normal distributions is closed under all non-degenerate affine transformations

    f(x) = rx + t

    where r is not zero. Because of this flexibility, normal distributions are commonly used for problems with spatial or temporal variation, unless the data or the physics demands some other form of pdf.

    Applying this flexibility, i.e., a form of lack of knowledge of anything more restrictive, to the logarithm of S, X = ln S, and taking the antilogarithm, the transformations are of the form

    f(S) = TS^R

    where T = e^t and R = r, since exp(r ln S + t) = e^t S^r. While scaling S by multiplying by T is just a change of coordinate, cf. Rankine to Kelvin, raising S to some power R seems rather drastic. But various laws depend upon this for temperature; Stefan’s Law for example. While climate sensitivity is a temperature difference, still this provides additional flexibility, i.e., an additional lack of knowledge.
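
    The step from an affine map on X = ln S to a power law on S can be verified directly. In this sketch (function names and test values are my own), the affine map X -> rX + t is applied in log space and the antilogarithm is compared with exp(t)·S^r.

```python
import math

def affine_on_log(S, r, t):
    """Apply X -> r*X + t to X = ln S, then take the antilogarithm."""
    return math.exp(r * math.log(S) + t)

def power_law(S, r, t):
    """The same map written directly in S: scale factor exp(t), exponent r."""
    return math.exp(t) * S ** r

for S in (0.5, 1.0, 3.0, 6.1):
    print(S, affine_on_log(S, r=2.0, t=0.7), power_law(S, r=2.0, t=0.7))
```

    The two columns agree, so the scale factor in S space is exp(t) and the exponent is r itself.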

    On these grounds, choose lognormal as ‘less informed’ than the gamma distributions, closed only under the dilation group. It remains to estimate the two parameters, the location parameter being chosen as zero for this problem.

    Fortunately, we have the climate sensitivities of many GCMs. Using the logarithm of these provides estimates for the two parameters in precise analogy to methods for the more ordinary normal distribution.
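
    A minimal sketch of that estimation step, assuming the usual approach of taking the sample mean and sample standard deviation of ln S. The sensitivity values below are placeholders invented for illustration, not actual GCM results, and `fit_lognormal` is a hypothetical helper name.

```python
import math
import statistics

def fit_lognormal(samples):
    """Estimate (mu, sigma) of a lognormal from the logs of positive samples."""
    logs = [math.log(s) for s in samples]
    return statistics.mean(logs), statistics.stdev(logs)

# Placeholder values in kelvin, NOT real model output.
hypothetical_sensitivities = [2.1, 2.7, 3.0, 3.2, 3.4, 4.1, 4.4]

mu, sigma = fit_lognormal(hypothetical_sensitivities)
median_S = math.exp(mu)          # the median of a lognormal is exp(mu)
print(mu, sigma, median_S)
```

    With real GCM sensitivities substituted for the placeholder list, mu and sigma would play the role that the mean and standard deviation play for an ordinary normal fit.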

    This program now leads to a bit of statistics that I don’t know: the climate model with sensitivity of 1.9 K and the climate model with sensitivity 6.1 K are considered to have failed the Maunder Minimum test with some fairly high confidence; on these grounds I would like to essentially exclude everything outside the support (1.9,6.1). I know of a formal means of restricting the support to just (1.9,6.1), but I doubt that such a blunt technique is going to satisfy either Dr. Bayesian or Dr. Frequentist, not to mention Tamino. :-)
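
    The comment does not say which formal means is intended. One blunt possibility, sketched here purely as an assumption, is plain truncation: set the lognormal density to zero outside (1.9, 6.1) and renormalize by the probability mass inside. The parameters mu = ln 3 and sigma = 0.4 are placeholders, not estimates from any data.

```python
import math

def lognormal_pdf(x, mu, sigma):
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

def truncated_lognormal_pdf(x, mu, sigma, lo=1.9, hi=6.1):
    """Lognormal density restricted to (lo, hi) and renormalized to integrate to 1."""
    if not (lo < x < hi):
        return 0.0
    mass = lognormal_cdf(hi, mu, sigma) - lognormal_cdf(lo, mu, sigma)
    return lognormal_pdf(x, mu, sigma) / mass

mu, sigma = math.log(3.0), 0.4   # placeholder parameters, assumed for illustration

# Midpoint-rule check that the truncated density integrates to one.
n, lo, hi = 20000, 1.9, 6.1
step = (hi - lo) / n
total = sum(truncated_lognormal_pdf(lo + (i + 0.5) * step, mu, sigma) for i in range(n)) * step
print(total)
```

    The hard cutoff makes the density jump to zero at both ends, which is the bluntness the comment worries about.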
