# Open Mind

## Born-Again Bayesian

#### October 29, 2009 · 43 Comments

I’m finally getting to the point where I can do my own typing (this is my hand, not Mrs. Tamino’s), but I’ve fallen far enough behind that I have catching up to do in lots of areas. Still, I thought I should resume blogging with a thank-you to readers.

Not too long ago someone posted about E.T. Jaynes’s Probability Theory: The Logic of Science. Hobbled with injury, I at least had the opportunity to do some reading, so I’ve devoured it. I just might devour it again.

Now, I was brought up as a “frequentist” (or as Jaynes would say, a member of the orthodox school of statistics). I’ve never been anti-Bayesian like so many others (that just doesn’t make sense to me) but for actual analysis I’ve stayed within orthodox practice. Reading Jaynes’s book has been an epiphany. I’m beginning to see, not just the logic of the Bayesian approach to statistics, but its power as well. The number of cases in which a Bayesian analysis gives you not just the estimate of a quantity, but its probability distribution as well (when deriving the sampling distribution of the orthodox estimate is a pain in the arse) is impressive. A Bayesian approach might even handle cases that give the orthodox approach fits (Cauchy distribution, anyone?). And so many examples are in accord with problems that have nagged me for many years.

For instance, it often happens that a traditional significance test shows that a given result is extremely unlikely for a given null hypothesis — in fact that’s the essence of significance testing. But more than once (often, in fact) it has occurred to me that a particular result is extremely unlikely, no matter what hypothesis one adopts. This fact has bugged me. Well, a Bayesian approach doesn’t reject a null hypothesis based solely on the improbability of a given result relative to it, but takes into account the improbability of the result relative to other hypotheses as well. And it can take into account our prior information about the null (and other) hypotheses. It can even take into account our prior opinion — which doesn’t protect us against some bias which may be inherent in our personal opinion, but does at least enable us to quantify the impact of our opinion.
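To make that concrete, here is a minimal sketch with invented numbers (540 heads in 1000 tosses, and a single rival hypothesis of a 60% coin; neither comes from any real data set). The count sits roughly 2.5 standard deviations from the 500 a fair coin would lead us to expect, so it looks damning for the null taken alone, yet it is far less probable under the rival.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_null(k, n, p_null, p_alt, prior_null=0.5):
    """Posterior probability of the null when exactly two point hypotheses compete."""
    weight_null = prior_null * binomial_pmf(k, n, p_null)
    weight_alt = (1 - prior_null) * binomial_pmf(k, n, p_alt)
    return weight_null / (weight_null + weight_alt)


# Hypothetical data: 540 heads in 1000 tosses.
n, k = 1000, 540
like_fair = binomial_pmf(k, n, 0.5)    # small: the result is improbable under the null
like_biased = binomial_pmf(k, n, 0.6)  # smaller still under this particular alternative
bayes_factor = like_fair / like_biased

print(f"P(data | fair)   = {like_fair:.3g}")
print(f"P(data | biased) = {like_biased:.3g}")
print(f"Bayes factor in favour of the null = {bayes_factor:.3g}")
print(f"P(fair | data), equal prior odds   = {posterior_null(k, n, 0.5, 0.6):.3f}")
```

The `prior_null` argument is where prior information (or prior opinion) enters; changing it shows directly how much the conclusion leans on it.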

For genuine anti-Bayesians, the big hoopla seems to be about priors — how there’s too much danger they’ll reflect the investigator’s preference rather than objective reality. Certainly that can happen — but frequentists can find a way to support their agenda just as easily as Bayesians. And it seems clear to me that there are ways to construct non-informative and non-influential priors, in fact there are lots of ways to be truly objective about it. Furthermore, in many (if not most) cases the prior has little impact on the result, and we can always determine how sensitive the outcome is to the choice of prior — if it’s too dependent on the prior, then the result should be viewed with caution.
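A sensitivity check of this kind can be sketched in a few lines. The example below assumes a binomial success rate with a conjugate Beta prior (my choice for illustration, since the posterior mean then has a closed form), and the three priors and the data are made up.

```python
def posterior_mean(successes, trials, a, b):
    """Posterior mean of a success rate under a Beta(a, b) prior and binomial data."""
    return (a + successes) / (a + b + trials)


# Three illustrative priors: flat, Jeffreys, and one that insists on "about 50%".
priors = {
    "uniform Beta(1, 1)": (1.0, 1.0),
    "Jeffreys Beta(0.5, 0.5)": (0.5, 0.5),
    "opinionated Beta(20, 20)": (20.0, 20.0),
}


def prior_spread(successes, trials):
    """Range of posterior means across the priors: a crude sensitivity measure."""
    means = [posterior_mean(successes, trials, a, b) for a, b in priors.values()]
    return max(means) - min(means)


small_sample = prior_spread(7, 10)       # hypothetical: 7 successes in 10 trials
large_sample = prior_spread(700, 1000)   # same proportion, 100 times the data

print(f"spread of posterior means, n=10:   {small_sample:.4f}")
print(f"spread of posterior means, n=1000: {large_sample:.4f}")
```

With ten observations the opinionated prior drags the answer noticeably; with a thousand, the three priors nearly agree. A large spread is the signal to view the result with caution.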

Bayesian analysis can do that. It can also show you what the implications are of a very informative prior! It doesn’t justify the prior, it just shows you the result, and that can be extremely valuable to know. It doesn’t mean you have to buy in to that result, but in my opinion it’s foolish to close your eyes and act as though it doesn’t exist. More information is more information (and that’s a popular Bayesian tune).

So, in contradiction to the adage about old dogs and new tricks, it’s just possible I’m transforming into a born-again Bayesian. In retrospect, it surprises me how so many can go through the educational system with so little exposure to Bayesian ideas. There’s always the token mention of Bayes’ theorem, with the classic example of disease testing (a case for which the prior is known with precision and certainty and its effect cannot be ignored), but there’s little or no (most likely “no”) mention of the philosophy of the Bayesian approach, which is the focal point of Jaynes’s book. In fact only one thing about his book bugged me: he spent a lot of time preaching the gospel of Bayesian analysis, more than necessary for me personally. I suspect it is necessary for a less receptive reader, including all the cautions about slipping into the old way of thinking.
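For anyone who hasn’t met that classic disease-testing example, here it is as a short sketch. The prevalence, sensitivity and specificity are invented round numbers, not figures for any real test.

```python
def prob_disease_given_positive(prevalence, sensitivity, specificity):
    """Bayes' theorem: P(disease | positive test)."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)


# Hypothetical test: 1% prevalence, 99% sensitivity, 95% specificity.
rare = prob_disease_given_positive(0.01, 0.99, 0.95)
# Same test applied where the disease is common (30% prevalence).
common = prob_disease_given_positive(0.30, 0.99, 0.95)

print(f"P(disease | positive), 1% prevalence:  {rare:.3f}")
print(f"P(disease | positive), 30% prevalence: {common:.3f}")
```

The test is identical in both calls; only the prior (the prevalence) changes, and with it the meaning of a positive result.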

Categories: mathematics

### 43 responses so far ↓

• DrC

The Bayesian v frequentist camps are as fraught with emotion as AGW in many ways. But you have taken the right tack here IMHO. Are there cases where a Bayesian approach makes sense? Sure. Are there cases where a traditional significance test works just fine? Yep. Use the tools that work and as long as everything passes a smell test you can turn to inference. Bravo. Good to have you back!

• You should also read Gelman et al., “Bayesian Data Analysis”. While Jaynes is a philosopher, Gelman offers convincing practical reasons to be a Bayesian.

The practical reasons that come to my mind right now are: (1) Models for multi-level data, data with kids within schools within states for example, become much more intuitive; (2) More innovative model families – for the model family is just a prior, and a non-informative prior is much like a fine-tuning of the model family; (3) There are models for which ML just does not work, because the posterior is all over the place, but the model is still useful at least in the predictive sense; (4) Nonparametric models: infinite mixtures, spatial processes, etc.; (5) Machine learning, where the prior is informative, but it is not YOUR prior.

Tamino,
re: Jaynes’s Preaching. Remember that these chapters were written over a period of over 20 years–and at the beginning of this period, opposition to Bayesian ideas was much more entrenched than it is at present.

There are still some folks who HATE Bayesian ideas. I run into them as reviewers with some frequency in my field.

A question: Wouldn’t model averaging as under AIC/BIC diminish the importance of the Prior and further downplay such objections to a Bayesian approach?

BTW, Burnham and Anderson have argued that AIC is just as useful as BIC in Bayesian analyses if you allow use of “savvy Priors” rather than “ignorant Priors”.

• David B. Benson

That someone was first me, then Ray Ladbury.

Happy that you are better now, Tamino. Both physically and in knowledge. :-)

If you want more and yet more, every year AIP publishes in their Conference Proceedings series the workout of the Workshop on Maximum Entropy and Bayesian Methods in Physics and Engineering. Well, that’s how I remember the title, I may not have it quite right. Anyway, there are now over 20 volumes and the last several have some most outstanding papers.

• Tony O'Brien

Are we allowed to put Bayesian techniques in the “can be useful” box?

The little statistics I learnt is over 30 years old, so of course I know nothing about Bayes. But modifying a study as results come in actually sounds like a good idea (sometimes).

• Always keep in mind that models (hypotheses, theories, etc) built on one (or no) prior observations (like SETI) are lousy candidates for a Bayesian approach; there’s simply no reasonable model for the distribution of priors – not even a Gaussian distribution!

• Always wondered about why Bayesians have to be born ignorant. It seems to me that if the data/theory on which the prior is built does not include the additional data that it is tested against, one could learn more.

Welcome back:)

• The recent manuscripts by Martin Tingley and Peter Huybers may be of interest. There already is an article in the most recent Scientific American about their later work. Of course, the Mystery Van is hot on their trail.

Glad to see you posting again, Tamino.

• David B. Benson

Eli Rabett // October 31, 2009 at 1:19 am — James Annan and I, at least, agree that there are no completely ignorant priors. Indeed, it suffices for independence that cloistered experts who set priors and develop models be ignorant of the data used to update to the posteriors.

• Timothy Chase

Eli Rabett // October 31, 2009 at 1:19 am — James Annan and I, at least, agree that there are no completely ignorant priors. Indeed, it suffices for independence that cloistered experts who set priors and develop models be ignorant of the data used to update to the posteriors.

If I remember correctly, one of Annan’s arguments is essentially that if the probability is evenly distributed as would be required by a state of complete ignorance, this would be relative to a given variable (let’s say x), but the choice of x would be to some extent arbitrary, and relative to y=1/x or z=x*x, the probability would no longer be evenly distributed and would not represent a state of complete ignorance.

Likewise, it makes little sense saying that the probability is evenly distributed over an infinite range. He gives the example of climate sensitivity — which those who argue for beginning with a state of complete ignorance will remedy by using a cutoff — that, however, cannot be justified by reference to the premises that they begin with.

I believe he concludes that given these problems it would seem that there must be some sort of reasoning that precedes the attempt to apply a Bayesian approach. Seems reasonable to me.
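The reparameterization point can be checked numerically. This is a minimal sketch, assuming x is uniform on the arbitrary interval [1, 4] (the interval and the function names are illustrative only): split the range of y=1/x, or of z=x*x, into two equal-width halves and ask how much probability each half receives.

```python
def prob_x_between(lo, hi, a=1.0, b=4.0):
    """P(lo <= x <= hi) when x is uniform on [a, b]."""
    lo, hi = max(lo, a), min(hi, b)
    return max(hi - lo, 0.0) / (b - a)


def prob_inverse_between(lo, hi):
    """P(lo <= y <= hi) for y = 1/x, obtained by mapping the interval back to x."""
    return prob_x_between(1.0 / hi, 1.0 / lo)


def prob_square_between(lo, hi):
    """P(lo <= z <= hi) for z = x*x, obtained by mapping the interval back to x."""
    return prob_x_between(lo**0.5, hi**0.5)


# y = 1/x ranges over [0.25, 1]; its midpoint is 0.625.
y_lower_half = prob_inverse_between(0.25, 0.625)
y_upper_half = prob_inverse_between(0.625, 1.0)

# z = x*x ranges over [1, 16]; its midpoint is 8.5.
z_lower_half = prob_square_between(1.0, 8.5)
z_upper_half = prob_square_between(8.5, 16.0)

print(f"y = 1/x: lower half {y_lower_half:.3f}, upper half {y_upper_half:.3f}")
print(f"z = x*x: lower half {z_lower_half:.3f}, upper half {z_upper_half:.3f}")
```

A prior that was flat in y or z would put half the probability in each half; the prior that is flat in x does not, which is the sense in which "evenly distributed" depends on the choice of variable.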

• Timothy Chase

PS

Sorry that wasn’t attributed. I was quoting and responding to David B. Benson…

• What it comes down to, at least from my little experience with priors, is whether the Bayesian analysis is used to create the probability distribution or improve upon other estimates gotten by other means. IEHO, the latter is the best use, because of the knots you get tied up in with the former.

• Don’t know what he thinks of “fuzzy” stats, but this was written with Tamino in mind.

Good to have him typing again. The silence has been deafening.

• David B. Benson

Eli Rabett // November 1, 2009 at 2:11 pm — Annan & Hargreaves did this most elegantly to give a mostly-subjective prior for climate sensitivity. First, from fundamental physics the climate sensitivity must be positive. So they choose the Cauchy distribution as, in some sense, the worst possible (as it has no mean). To set the parameter, it is considered fair game to use Arrhenius’s estimate of a 6 K climate sensitivity; this becomes the most likely value.

I don’t see how those two steps can be considered Bayesian as no Bayesian reasoning has been applied; that comes with using the evidence to set the posterior pdf.

• David, where I learned about this sort of stuff, the difference between the prior and the data was called the surprisal. It is in common use in molecular dynamics. It may not be strict Bayesian but it is a non bat crazy way of finding reasonable priors (in reaction dynamics one uses the maximum entropy distribution to define the prior).

• PI

“In retrospect, it surprises me how so many can go through the educational system with so little exposure to Bayesian ideas.”

I didn’t really become interested in statistics until I learned about Bayesian ideas after college. Part of it was the very dull way intro stats is often taught, as a hodgepodge of seemingly-arbitrary recipes for estimators and tests.

I later realized that what had really rubbed me the wrong way was that hypothesis testing and estimators never seemed to get at what I was really interested in, which was inference about hypotheses and their uncertainty. Confidence intervals always seemed convoluted to me. I discovered that a Bayesian posterior probability distribution was the approach I had been unknowingly seeking all along.

Furthermore, Bayesian methods give a simple, uniform way of doing inference on complex problems which avoids endless debates on how to construct a test or estimator.

Or rather, they move the debate from somewhat esoteric arguments about asymptotics (which may not be relevant) or unbiasedness (which may not exist), to arguments about the prior, which I find much more transparent. (Yes, I think it’s an advantage to put all your “subjectivity” in one place, the prior, rather than in your choice of algorithm for constructing an estimator or confidence interval.)

Write down the likelihood, write down the prior, extract whatever you want from the posterior: the algorithm is the same. Indeed, one can think of Bayesian inference as a way of automatically constructing estimators with good statistical properties, with the prior to regularize the inference (e.g., in cases where the MLE does bizarre things).
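That recipe can be sketched with a brute-force grid. The example below is illustrative only: the data are made up (five clustered values plus one wild outlier), the likelihood is a unit-scale Cauchy with unknown location, and the prior is flat over the grid, all of which are assumptions chosen for the sketch.

```python
import math


def cauchy_log_likelihood(x, location):
    """Log density of a unit-scale Cauchy centred at `location`."""
    return -math.log(math.pi * (1.0 + (x - location) ** 2))


def flat_log_prior(location):
    """Flat prior over the grid (an assumption; swap in anything else here)."""
    return 0.0


def grid_posterior(data, grid, log_prior, log_likelihood):
    """Normalized posterior weights on the grid: prior times likelihood, rescaled."""
    log_post = [log_prior(t) + sum(log_likelihood(x, t) for x in data) for t in grid]
    peak = max(log_post)  # subtract the peak before exponentiating, for stability
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def credible_interval(grid, posterior, mass=0.95):
    """Equal-tailed interval read off the cumulative posterior."""
    tail = (1.0 - mass) / 2.0
    cumulative, lower, upper = 0.0, None, None
    for t, p in zip(grid, posterior):
        cumulative += p
        if lower is None and cumulative >= tail:
            lower = t
        if upper is None and cumulative >= 1.0 - tail:
            upper = t
    return lower, upper


data = [-1.2, 0.3, 0.8, 1.1, 1.9, 25.0]        # invented, with one outlier
grid = [-10.0 + 0.01 * i for i in range(4001)]  # candidate locations from -10 to 30

posterior = grid_posterior(data, grid, flat_log_prior, cauchy_log_likelihood)
post_mean = sum(t * p for t, p in zip(grid, posterior))
lower, upper = credible_interval(grid, posterior)
sample_mean = sum(data) / len(data)

print(f"sample mean      = {sample_mean:.2f}")
print(f"posterior mean   = {post_mean:.2f}")
print(f"95% credible int = ({lower:.2f}, {upper:.2f})")
```

Changing the model means changing only `cauchy_log_likelihood` and `flat_log_prior`; the extraction step stays the same. A grid is only practical for one or two parameters, so this is a teaching device rather than a general method.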

Finally, I simply think it makes more sense to condition inference on the known data, rather than speak about hypothetical data you never observed conditioned on a specified hypothesis. That distinction wasn’t even on my radar when I was taught statistics in college.

• Mark

“but the choice of x would be to some extent arbitrary, and relative to y=1/x or z=x*x, the probability would no longer be evenly distributed and would not represent a state of complete ignorance.”

I had always seen this as the model defines what your euclidean linear population space is and THEN you randomly dot that space.

E.g. you wouldn’t plot a correlation between linear CO2 and temperatures because the problem defines them as a linear/log relationship.

Similarly in a Bayesian test you would use log(x)/y as your axis. For which your probability WOULD be evenly dispersed and your test checks whether there is any real correlation or whether the apparent correlation is a result of happenstance.

• David B. Benson

Eli Rabett // November 3, 2009 at 1:54 am — In recent volumes of the workshop series I mentioned earlier, one learns that

maximum entropy == Bayesian.

So maybe you were doing Bayesian all along…

• Timothy Chase

Mark wrote:

I had always seen this as the model defines what your euclidean linear population space is and THEN you randomly dot that space.

Perhaps. Alternatively, different theories might define different spaces that are not simply linear mappings of one-another. And in that case if Bayesian reasoning were meant to decide between them one would still be left with the same arbitrariness in the choice of coordinates as before.

If one is simply using Bayesian reasoning to decide the value of a constant once linearity has been defined that is one thing. If one is using Bayesian reasoning to decide between theories that define the dimensions in which uniform probability density is to be achieved in mutually exclusive ways that is something quite different.
*
Mark wrote:

Similarly in a Bayesian test you would use log(x)/y as your axis. For which your probability WOULD be evenly dispersed and your test checks whether there is any real correlation or whether the apparent correlation is a result of happenstance.

But as I pointed out, proponents of the view that one should start with a state of complete ignorance in which probability is evenly distributed nevertheless find it necessary to have cut offs — and cut offs imply that their ignorance is necessarily limited rather than the state of complete ignorance which they profess.

In any case, James Annan probably does a better job of explaining his views regarding uniform priors and cut offs than I do. His most recent post that touches on this is here:

Uniform prior: dead at last!
Tuesday, September 01, 2009

… and links to the following:

On the generation and interpretation of
probabilistic estimates of climate sensitivity
J.D. Annan and J.C. Hargreaves
http://www.jamstec.go.jp/frcgc/research/d5/jdannan/probrevised.pdf

… but in all honesty I was remembering it from about two years ago, and as such I would go with the primary source.

• PI

David,

I don’t know what “maximum entropy == Bayesian” means, but as far as I know they’re not equivalent. See, for example, Cheeseman and Stutz.

• David B. Benson

PI // November 4, 2009 at 3:41 pm — In the settings of the 2–4 papers in the 2007 and 2008 workshops, the claim of equivalence was made. Unfortunately, I don’t now recall the details.

While I have considerable respect for Peter Cheeseman, I doubt that a 2004 paper is the final word.

• Maximum entropy is, to a large degree, about conjugate priors in the exponential family. Physicists are often fond of them, maybe because there is the word “entropy”, and maybe because physical systems tend to have symmetries that can be taken into account by the maxent formalism.

But I don’t see much about maxent in the broader literature, at least not among classic statisticians or among machine learning people.

About ignorant priors in general: I agree that they mostly do not exist. But this is of little importance, because usually one specifies a model family anyway, and that tends to be a strong prior. Say, one uses a logistic linear model – that is a prior, p(x|\theta, M) instead of p(x|\theta) or p(x|\theta, M’) over some very flexible model family M’. There is no way around this, least of all for a frequentist. Then using p(\theta)=gaussian or p(\theta)=cauchy is of relatively little importance. One can be empirical about the different priors for \theta and conclude that under typical ignorance, for example, a Cauchy tends to be better than a Gaussian. And so on. The prior is not a problem in practice. Lots of, e.g., medical stuff is published with Bayesian analysis in prestigious journals and nobody complains.

• In science, the use of statistics is to provide, ‘for convenience’, agreed upon ‘measures of confidence’, to reduce ambiguity in discussion of the relevance of data, to the match of models (theories, hypotheses, etc.) to perceived ‘reality’.

So it’s not so much that using a Cauchy prior, vs a flat distribution as a prior, is ‘better’ or ‘worse’, but merely that when confidence intervals are discussed, all participants should be on board about just which probability density function (PDF) is being used! Of course, it helps very much if the participants understand the nuances of the use of each PDF. And in that sense, the contribution of Annan and Hargreaves is to be applauded.

• Eric L

I had no idea Bayesian ideas were controversial within the statistics community. I learned about Bayesian statistics in an AI class in college, and I’ve come across them over and over in machine learning. Before that class I figured AI/learning was all about the glamorous code-imitating-nature stuff like genetic algorithms and neural networks, but it turns out in the real world so many problems are best solved by boring old statistics, usually Bayesian inference in some form. I suppose it may provide ways to lead yourself astray when testing scientific hypotheses, but it can certainly give very good answers, and at least in my world if Bayesian stats can be applied to your problem they’ll usually give you better answers than any other approach.

• jyyh

Was it so that Bayesian methods were more commonly used in complex systems, but the chance to go wrong in these is so large, it isn’t advisable to publish without some additional checks (that would be some frequentist methods) ? Welcome back, hope the healing has progressed well.

OffTopic:
Here we are in the beginning of the swineflu epidemic, so there might be 2 weeks off for me in the near future. Children and elderly are getting their vaccination shots now, after the medical personnel had theirs. The curious thing was that some nurses declined to have that – maybe they thought to have some weeks off during the epidemic?

Greetings from rainy Sri Lanka. scellus, I think the reason the exponential family (and so, Max Ent) is popular is that it is an astoundingly broad and general family of distributions.
As to the “controversy” over Bayesian vs. Frequentist, it is an old question. Fisher himself threw up his hands over the problem of the Prior. I think Jaynes provides a compelling answer to these objections. I also think that model averaging provides a partial answer that is attractive.

Just curious: Has any one else looked at Information Geometry?

• Timothy Chase

Greetings from rainy Sri Lanka.

Sri Lanka? I’m betting that is a little more hospitable than Diego Garcia. I believe we had only one bar on the entire island. But at least it was where I learned about long island ice teas.
*

Just curious: Has any one else looked at Information Geometry?

I am sure they have now. Metric tensors, dual affine connections, … It is beginning to look like my old home town.

• Timothy Chase

Just curious: Has any one else looked at Information Geometry?

Ok. After looking it up, seeing how its methodology is built upon the same mathematical formalism as general relativity (namely Riemannian geometry — which I am actually somewhat familiar with), and how it is related to Bayesian analysis, population coding, artificial intelligence, pattern recognition and the study of the brain, I got curious enough that I have gone ahead and ordered a copy of “Methods of Information Geometry.”

Haven’t the foggiest how this could turn out well, and I will be holding you personally responsible.

• David B. Benson

Timothy Chase // November 13, 2009 at 5:20 am — Do keep us informed about your discoveries in “Methods”. Thanks.

• Timothy Chase

David B. Benson wrote:

Do keep us informed about your discoveries in “Methods”. Thanks.

Well, unfortunately I haven’t yet had the chance to get into “Probability Theory: The Logic of Science” by E. T. Jaynes even though I downloaded a copy of it to my laptop, and “Methods of Information Geometry” by Shun-Ichi Amari and Hiroshi Nagaoka has yet to arrive. Even then, not really sure which I should try reading first — although I presume they aren’t the sort of works one reads from cover-to-cover or just once, so I am not sure that it matters.

At the same time I doubt that I will really be able to get to either before Christmas break. School is keeping me rather busy — and I fell behind over the last week on account of flu.

But if one is interested in information geometry, presumably the latter is what comes closest to being the defining work nowadays — and would serve as a good introduction for just about anyone. I would of course love to get Tamino’s take on any of this.
*
Then again I have been awaiting “AIC part 2” for a while. We were treated to part I on the Kullback-Leibler divergence a few weeks back, but with the injury to his hand part II had to be put off. Incidentally, the Wikipedia entry for “information geometry” states, “The Kullback-Leibler divergence is one of a family of divergences related to dual affine connections.”

In general relativity at least, affine connections are used to uniquely displace vectors along any given curve within a given space — independently of the coordinate system — and in general relativity form part of the mathematical basis for determining whether or not spacetime is curved — independently of the coordinate system. This is related to an observation I made over at Real Climate that in general relativity there is only a weak equivalence between acceleration and gravitational fields as one can uniquely determine whether or not spacetime is curved — independently of the coordinate system — by whether the Riemannian curvature tensor vanishes: one derives the Riemannian curvature tensor from the affine connection, or equivalently, from the Christoffel symbols. (See “Chapter XI: The Riemann-Christoffel Curvature Tensor” of Peter Gabriel Bergmann’s “Introduction to the Theory of Relativity.”)

In any case, it might prove helpful — as a sort of additional handle for grabbing hold of information geometry — were Tamino to bring out part II of AIC — given the relationship between the Kullback-Leibler divergence and affine connections in information geometry. Then again, I would think that seeing part II would be of interest to people independently of its bearing on information geometry. Assuming Tamino is up to it.

Home now from our vacation in Sri Lanka–after 40 hours on the road. It is truly a wonderful island–friendly people, delicious food, an ancient and interesting culture, amazing geology. Lots of interesting climate-related issues, as well, which I’ll touch on when I regain a semblance of consciousness. Suffice to say that with the end of a 30 year civil war, a more than 92% literacy rate and a young population, Sri Lanka is now poised for economic take off. The question is whether they will meet their energy needs with renewables or coal (or nuclear in the short term).

The information geometry I’ve looked at looks quite interesting–I can see some applications to reliability calculations pretty clearly.

As more of a pure math person both undergrad and grad, I tended to think of Bayesian more in terms of inferences as it applied to programming. For instance, when I made a Mastermind program when I first started university, beforehand I did a ton of pre-programming data generation and formulas based not on purely legal guesses, but informative guesses, and not just on random codes but on the strategic fallout from patterns of codes.

When I saw the Bayesian controversies I read up on it a little bit, but it seemed to me at the time to be more or less a fuss over nothing – like when people thought Fuzzy Logic (fuzzy subsets of real sets and the concomitant membership logic) was not a real field because it was a way of rewriting other results, etc. Math is just full of that sort of thing.

So this interests me, and I’ll follow suit.

• I’ve always been baffled by the “debate”. I think communications engineers are taught Bayesian statistics under the name “statistics”.

This is because every symbol (in the simplest case one waveform for “1” and its additive inverse for “0”; in other cases typically 4, 8 or 16 possible waveforms) is a member of an alphabet. No matter how noisy the signal, it corresponds to a symbol. You just have to decide which one.

There is no “null hypothesis”.
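The simplest case described above can be sketched as follows. This is a toy illustration, assuming a single real-valued sample per symbol, “1” sent as +1.0, “0” sent as -1.0, and additive Gaussian noise of known standard deviation; the function names are hypothetical.

```python
import math


def posterior_one(received, noise_sd, prior_one=0.5):
    """P(bit = 1 | received sample) for antipodal signalling in Gaussian noise."""

    def likelihood(sent_level):
        # Gaussian likelihood up to a constant that cancels in the ratio.
        return math.exp(-((received - sent_level) ** 2) / (2.0 * noise_sd**2))

    weight_one = prior_one * likelihood(+1.0)
    weight_zero = (1.0 - prior_one) * likelihood(-1.0)
    return weight_one / (weight_one + weight_zero)


def decide(received, noise_sd, prior_one=0.5):
    """Pick whichever symbol is more probable given the sample."""
    return 1 if posterior_one(received, noise_sd, prior_one) >= 0.5 else 0


# A weak, noisy sample slightly above zero.
print(decide(0.2, noise_sd=1.0))                  # equally likely symbols
print(decide(0.2, noise_sd=1.0, prior_one=0.1))   # "1" known to be rare
print(f"{posterior_one(0.2, noise_sd=1.0):.3f}")  # how sure the receiver is
```

Every sample gets assigned to one of the symbols in the alphabet, and the posterior says how confidently. Nothing is ever "rejected", and the symbol frequencies enter as the prior.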

The associated mathematics is stunningly beautiful and elegant, by the way. I miss it in the messiness of non-artifact systems.

Anyway, the whole frequentist business has been a pox on climate science. Sensitivity is a number, not a binary decision. “Global warming, yes or no” is fundamentally an ignorant approach, and “attribution studies” just pander to it.

http://bayes.wustl.edu/etj/prob/book.pdf

That’s the whole book, or no?

• David B. Benson

Marion Delgado // November 18, 2009 at 5:29 am — I don’t think so.

Marion and David, it is pretty much the whole book. The missing chapters are ones Jaynes never got around to writing–they’re missing from the book as well.

Ray Ladbury, do you really run into them with some frequency, or is that a probabilistic inference?

Marion,
Of course a true Bayesian worth his salt would deny the existence of any real frequency and say that the frequency must be inferred with some degree of belief.

More seriously, my field is rather applied, and there are some who detest the idea that there could be anything subjective about the field. There are others who resent the idea that “engineering judgment” should require supporting statistical analysis. Between the two, there are a lot of anti-Bayesians, even if they wouldn’t know a measure if it bit them on the pecker.

• Timothy Chase

Marion Delgado wrote on November 20, 2009:

Ray Ladbury, do you really run into them with some frequency, or is that a probabilistic inference?

I didn’t know what you were referring to, but after searching for the phrase “run into,” I presume it is the following…?

Ray Ladbury wrote on October 29, 2009:

There are still some folks who HATE Bayesian ideas. I run into them as reviewers with some frequency in my field.

Don’t mean to criticize or anything, but a little bit of context after nearly a month might help. (Part of the reason why I tend to include hyperlinks to what I am responding to.)

• Timothy Chase

Re Marion Delgado, November 20, 2009 at 4:22am

It took me five flights of stairs and a block on the way to the coffee shop before I got it and started laughing — and in my neighborhood people look at you funny if you start laughing for no apparent reason.