Rev. Bayes is no St. George.
Francis Diebold brought up Bayesian versus frequentist interpretations of probability the other day, and I think Noah Smith retweeted it. It's actually a pretty funny framing of the argument. Diebold quotes a Bayesian econometrician who visited him in his office:
There must be something about Bayesian analysis that stifles creativity. It seems that frequentists invent all the great stuff, and Bayesians just trail behind, telling them how to do it right.
I'm going to take issue with the last clause about doing it right. Now I've touched on this "debate" before, and I don't think I could have put it more perfectly than I put it then:
I've never been one to get into the Bayesian-frequentist argument, which, much like the interpretations of quantum mechanics, seems to be a waste of my time (and is basically the same underlying issue). That it's a waste of my time does not imply it is a waste of your time if you happen to be a masochist who's into philosophical arguments that never go anywhere.
The thing is that there really isn't such a person as a "frequentist" who excludes Bayesian methods. At least I've never heard of such a person. I am fine with "frequentist" interpretations (which are just effective theories) as well as "Bayesian" methods (which are just math). The issue as I see it is that there is just a subculture consisting of weird strict Bayesians versus everyone else.
When I read Diebold's post and Chris House's paper that he links called "Understanding Non-Bayesians", I thought I'd indulge my inner masochist and try to figure out exactly what Bayesians are on about.
The first thing to note is that Bayesian math is perfectly valid. The Bayesian approach is incredibly useful if you are updating pre-existing probabilities derived from older data using new data, or where you have a theoretical model of the prior probability. At least, that is how us "non-Bayesians" view it.
I think this probably falls under Chris House's description:
[To non-Bayesians] [t]he distinguishing feature of Bayesian inference then appears to be that Bayesians give explicit attention to non-flat priors that are based on subjective prior beliefs.
As I mentioned, I don't think the priors are always subjective. They can come from data that's been collected before or theoretical arguments. The weirdness (for me) comes in when the priors are subjective. What does it mean for a theory parameter to have a prior probability distribution before any data has been collected?
An excellent illustration of strict Bayesian weirdness is in a post I found in researching this topic called "Are you a Bayesian or a Frequentist? (Or Bayesian Statistics 101)". In the example worked out, a possibly unfair coin is flipped 14 times, coming up heads 10 times. The not "weird strict Bayesian" approach is to say 10/14 is an estimate of the unfair coin's probability p of coming up heads. The weird strict Bayesian approach is to stick in a prior probability distribution of that probability distribution parameter: P(p).
Now P(p) totally makes sense if you've encountered this coin before, or have some theoretical model of the center of mass of the coin (e.g. based on manufacturing processes). Even a so-called "frequentist" would approach the problem in exactly this way because Bayes' theorem is exactly what it says it is: a theorem.
However, imposing a non-uniform P(p) out of mathematical convenience or other subjective criteria effectively just increases the number of parameters in your model to include the parameters of the prior distribution. In the example, a beta distribution is used so now our model has two parameters (α and β giving us the distribution of p) instead of just one (p). If α and β summarize earlier data or our theoretical model gives us e.g. functions α(x) or β(x) of some parameter x (e.g. location of the center of mass), that's a different story. However, if you just posit a beta distribution then you're just adding parameters — which should be judged on information criteria like the AIC. Adding a non-flat Bayesian prior distribution actually makes the AIC worse in the coin example above unless you collect a few dozen more data points. There is also the problem of what P(p) even means if it is not derived from previous data or a theoretical model.
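For concreteness, the coin example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are mine, and the Beta(5, 5) prior is a hypothetical stand-in for a prior chosen out of convenience; it is not taken from the post being discussed. The sketch relies on the standard conjugate result that a Beta(α, β) prior combined with h heads in n flips gives a Beta(α + h, β + n − h) posterior.

```python
from fractions import Fraction

def mle(heads, flips):
    """Plain estimate of p: the observed fraction of heads."""
    return Fraction(heads, flips)

def beta_posterior(alpha, beta, heads, flips):
    """Conjugate update: Beta(alpha, beta) prior -> Beta posterior parameters."""
    return alpha + heads, beta + (flips - heads)

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution (integer parameters assumed here)."""
    return Fraction(alpha, alpha + beta)

heads, flips = 10, 14  # the data from the coin example

# Flat prior, Beta(1, 1): one distribution over p, no extra knobs to tune.
flat = beta_posterior(1, 1, heads, flips)

# Hypothetical non-flat prior, Beta(5, 5): alpha and beta are now two
# additional numbers that were chosen, not measured.
informative = beta_posterior(5, 5, heads, flips)

print("10/14 estimate:         ", mle(heads, flips))
print("posterior mean, flat:   ", beta_mean(*flat))
print("posterior mean, Beta(5,5):", beta_mean(*informative))
```

The estimate is 5/7 (about 0.71), the flat-prior posterior mean is 11/16 (about 0.69), and the Beta(5, 5) posterior mean is 5/8 (about 0.63). The shift in the last number comes entirely from the choice of α and β, which is the sense in which they act like extra model parameters.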
Chris House responds to this argument with tu quoque:
Though frequentist data analysis makes no explicit use of prior information [ed. not true per above], good applied work does use prior beliefs informally even if it is not explicitly Bayesian. Models are experimented with, and versions that allow reasonable interpretations of the estimated parameter values are favored. Lag lengths in dynamic models are experimented with, and shorter lag lengths are favored if longer ones add little explanatory power. These are reasonable ways to behave, but they are not “objective”.
In physics we have a heuristic we call "naturalness" with respect to theory parameters. However it is important to note that naturalness would never be used as a prior probability in estimating a parameter's likelihood, but rather as a rhetorical device hinting at a possible problem with the model. For example, the strong force is many orders of magnitude stronger than gravity. Physicists take that to mean there is probably something going on. We did not, however, incorporate a prior probability that the ratio should be natural in estimating the strengths of those forces.
Long lag lengths in House's hypothetical economic model that performs well against the empirical data (already requiring suspension of disbelief) might be "unnatural" based on the time scales in the theory (another thing economists haven't come to terms with), but that does not mean we should use that information to construct a prior that emphasizes shorter lags [1].
I would like to point out a nice concise point House makes:
A Bayesian perspective makes the entire shape of the likelihood in any sample directly interpretable, whereas a frequentist perspective has to focus on the large-sample behavior of the likelihood near its peak.
This is basically true, but also illustrates the issues with calling this "objective". The shape of that likelihood away from the peak is strongly dependent on the tails of the prior probability distribution, and the tails of probability distributions are notoriously hard to estimate. That is to say the Bayesian perspective makes the entire shape interpretable, but also strongly model-dependent and highly uncertain [2]. The so-called "frequentist" perspective is the practical one you should have: don't trust likelihoods except when you have large samples and then only near the peak. The rest of a probability distribution should come with a sign: Here be dragons.
...
Update 26 January 2017
This post on a specific prior producing a "counterintuitive" result is relevant.
...
Footnotes:
[1] This is not an argument against e.g. using the length of the available data as a prior. There you have a genuine information source (and Nyquist). I use the length of the available data to discount a lot of theories from Steve Keen's nonlinear models to Kondratiev waves and Minsky-like theories. Again, this is a rhetorical argument and I wouldn't compute a prior probability.
[2] This makes me think of the strange things that arise when economists talk about expectations at infinity (here, here) ‒ much like the tails of probability distributions, there be dragons in the infinite future.
1) "The weird strict Bayesian approach is to stick in a prior probability distribution of that probability distribution parameter: P(p)."
I don't know in what sense this is the "strict" approach, but this is clearly a fallacious approach, and any decent book on Bayesian statistics should warn against that. I'm not sure that Panos Ipeirotis' blog is a credible source for proper Bayesian analysis.
If you could just stick the posterior as the prior and redo the analysis then you could repeat that to the point where the coin's bias (or whatever parameter is being investigated) would be known with almost complete certainty. This is clearly nonsense.
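The runaway certainty described here is easy to see numerically. What follows is a minimal sketch, assuming a flat Beta(1, 1) starting prior and the same 10-heads-in-14-flips data fed back in repeatedly; the function names are hypothetical and chosen only for this illustration.

```python
from math import sqrt

def reuse_same_data(rounds, heads=10, flips=14, alpha=1, beta=1):
    """Apply the Beta update `rounds` times with the SAME data,
    treating each posterior as the next prior."""
    for _ in range(rounds):
        alpha, beta = alpha + heads, beta + (flips - heads)
    return alpha, beta

def beta_std(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return sqrt(alpha * beta / (total ** 2 * (total + 1)))

for rounds in (1, 10, 100):
    a, b = reuse_same_data(rounds)
    print(rounds, (a, b), round(beta_std(a, b), 4))
```

The spread of the distribution keeps shrinking as the rounds increase, even though no new flips were ever observed.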
The "Let's use a Beta distribution because that's mathematically convenient" argument is embarrassing. Unfortunately I see this line of thought in a great deal of economic and statistical literature.
2) "That is to say the Bayesian perspective makes the entire shape interpretable, but also strongly model-dependent and highly uncertain."
Yes, but you can decide for yourself if you agree with the model in question, and if you have another model in mind, you can make your own analysis. If you don't have any model in mind then there are methods aspiring to choose the "maximally uninformative prior", e.g. the maximum entropy approach that you use in the IT model.
Statistical inference is inherently subjective. The advantage of the Bayesian approach is that it makes the statistician's subjective beliefs explicit in the model, which makes it harder (though not impossible; we're all human) to fool the reader into thinking that his/her results are unbiased and infallible. It's easier to notice that someone has chosen the Beta distribution arbitrarily than it is to notice that someone had cherry-picked an obscure statistical test to get a p-value < 0.05.
"It's easier to notice that someone has chosen the Beta distribution arbitrarily than it is to notice that someone had cherry-picked an obscure statistical test to get a p-value < 0.05."
I think those are comparably subjective. As a physicist, I've never really given much attention to p-values. Usually we just see if the function goes through the data. Saying a p-value is less than some arbitrary value like 0.05 is not really different from just saying "the function kind of fits the data".
This is not to say I take issue with p-values (they're a metric), but I take issue with any specific p-value as meaningful besides tiny ones (log p << 0) and huge ones (log p ~ 0).
"... this is clearly a fallacious approach, and any decent book on Bayesian statistics should warn against that"
I'm not sure I understand your objection here. Bayesian analysis gives prior probability distributions for parameters. I'll quote Chris House:
Frequentist inference insists on a sharp distinction between unobserved, but nonrandom “parameters” and observable, random, data. It works entirely with probability distributions of data, conditional on unknown parameters. It considers the random behavior of functions of the data —estimators and test statistics, for example — and makes assertions about the distributions of those functions of the data, conditional on parameters.
Bayesian inference treats everything as random before it is observed ...
If you are saying that you don't think this should happen or doesn't make sense, congratulations! You're part of the non-weird strict Bayesian club that gets labeled "frequentist".
The *interpretation* of statistical inference is subjective. Statistical inference itself is a collection of algorithms.
When I write up a paper that says "I saw such and such effect with a p-value of 0.05", I haven't done anything subjective. But when I title that paper "Such and such effect found!", and put sentences to that effect in the abstract and body, I'm being subjective. I could easily have written a paper saying that p < 0.05 indicates such and such effect isn't useful.