## Sunday, February 3, 2013

### Frequentist properties of Bayesian decision rules? A few more words...

In recent days I received e-mails from different readers of DBDA, inquiring about frequentist properties of Bayesian decision rules. One question asked about the relative false alarm rates of the traditional t test, on the one hand, and Bayesian estimation (as in BEST) on the other hand. Another question asked how to set the limits of a ROPE (region of practical equivalence) so that the false alarm rate in sequential updating would be kept to a certain limit, such as 5%. These sorts of questions may feel natural, even crucial, for people trained in frequentist thinking. In this post I try to explain that the frequentist properties of Bayesian decision rules suffer the same issues of ambiguity as the frequentist properties of traditional sample statistics.

For example, suppose you want to know "the" probability of false alarm when using a particular choice of HDI and ROPE in BEST. To compute the answer, you have to specify three things: a particular hypothetical (e.g., null hypothesis) population distribution from which simulated data are generated, a particular sampling process for generating the simulated data, and a set of decisions. Then you can simulate a zillion random samples, do the Bayesian estimation and decisions for each random sample, and tally the decisions; the proportion of decisions-to-reject is an estimate of the false alarm rate. (Or maybe math-savvy folks can figure out some exact analytical answer or some clever analytical approximation that is more accurate than Monte Carlo sampling.)

Traditionally, you would specify normally distributed populations sampled until a threshold sample size N is reached. You could instead specify t-distributed populations (for outliers) sampled until a threshold duration, with sample size determined by, say, a Poisson process (which is more appropriate for many real-world data situations). Or you could compute the overall false alarm rate while including another decision about the same data, such as testing the difference of standard deviations as well as the difference of means (which is even more appropriate for the real world). All of those variations and their permutations and many other reasonable possibilities yield different false alarm rates for the same Bayesian decision rule. Which one is "the" probability of false alarm? They all are.
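To make the simulate-and-tally recipe concrete, here is a minimal sketch in Python. It is not BEST: to stay self-contained it swaps in a deliberately simple stand-in model (one group, normal likelihood with known sigma, flat prior on the mean), for which the 95% HDI has a closed form. The ROPE of ±0.1, the sample sizes, the arrival rate, and every function name are assumptions chosen only for illustration. The point is merely that one and the same decision rule is scored under several hypothetical populations and sampling processes.

```python
import math
import random
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # half-width multiplier for a 95% HDI of a normal posterior


def rejects_null(total, n, rope=0.1, sigma=1.0):
    """True if the 95% HDI on mu lies entirely outside the ROPE [-rope, +rope].

    Assumed toy model (a stand-in, not BEST): normal likelihood with known
    sigma and a flat prior on mu, so the posterior on mu is
    Normal(mean=total/n, sd=sigma/sqrt(n)).
    """
    mean = total / n
    half_width = Z95 * sigma / math.sqrt(n)
    return mean - half_width > rope or mean + half_width < -rope


# --- hypothetical null populations (true mean is 0 in both) ---
def draw_normal(rng):
    return rng.gauss(0.0, 1.0)


def draw_heavy_tailed(rng, df=3):
    # Student-t draw: standard normal divided by sqrt(chi-square / df).
    return rng.gauss(0.0, 1.0) / math.sqrt(rng.gammavariate(df / 2.0, 2.0) / df)


# --- hypothetical sampling processes ---
def fixed_n(draw, rng, n=50):
    """Collect exactly n observations, then decide once."""
    total = sum(draw(rng) for _ in range(n))
    return rejects_null(total, n)


def sequential(draw, rng, n_min=10, n_max=50):
    """Decide after every observation from n_min on; stop at the first rejection."""
    total = 0.0
    for n in range(1, n_max + 1):
        total += draw(rng)
        if n >= n_min and rejects_null(total, n):
            return True
    return False


def fixed_duration(draw, rng, rate=50.0, duration=1.0):
    """Observations arrive as a Poisson process; decide once when time runs out."""
    n, total, t = 0, 0.0, rng.expovariate(rate)
    while t <= duration:
        n += 1
        total += draw(rng)
        t += rng.expovariate(rate)
    return n > 0 and rejects_null(total, n)


def false_alarm_rate(draw, sampling_process, n_sims=2000, seed=1):
    """Proportion of simulated null data sets that end in a decision to reject."""
    rng = random.Random(seed)
    return sum(sampling_process(draw, rng) for _ in range(n_sims)) / n_sims


if __name__ == "__main__":
    scenarios = [
        ("normal population, fixed N", draw_normal, fixed_n),
        ("normal population, sequential updating", draw_normal, sequential),
        ("normal population, fixed duration", draw_normal, fixed_duration),
        ("heavy-tailed population, fixed N", draw_heavy_tailed, fixed_n),
    ]
    for label, draw, process in scenarios:
        print(f"{label}: {false_alarm_rate(draw, process):.3f}")
```

Every printed number is a legitimate false alarm rate for the same HDI-and-ROPE rule. Which one you get depends entirely on the scenario you chose to simulate.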

This ambiguity in "the" false alarm rate does not prevent you from comparing the false alarm rates of, say, traditional t tests and BEST. The ambiguity means merely that the quantitative (and maybe qualitative) details of the comparison will depend on your choice of hypothetical population, sampling process, and set of decisions. (For example, in the BEST article and video, there is a comparison of false alarm rates in sequential updating for t tests and BEST, under one conventional set of assumptions. There is no claim that the quantitative details are the same for other assumptions about the shape of the population and sampling process and set of decisions.)

But the real problem comes if you try to set, say, the limits of the ROPE according to "the" false alarm rate it yields. There is no single false alarm rate yielded by a ROPE. The false alarm rate depends on the hypothetical data-generating population and the sampling process and the set of decisions. You might be able to find the limits of a ROPE that yields 5% false alarms under specific conventional assumptions, but that same ROPE will yield quite different false alarm rates under different assumptions about the data-generating population, the sampling process, or the set of decisions.
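A small worked example may help show how fragile such a calibration is. The sketch below again uses an assumed toy model rather than BEST: one group, normal likelihood with known sigma, flat prior on the mean, a normal population, and a single decision at fixed N. Under those assumptions the false alarm rate has a closed form, so a ROPE can be "calibrated" exactly and then re-scored under other assumptions. Because a 95% HDI with a zero-width ROPE already gives 5% in this fixed-N setup, the illustration targets 1% instead. The function names and all numerical settings are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def false_alarm_rate(rope, n, sigma_true=1.0, sigma_model=1.0, hdi_mass=0.95):
    """P(HDI entirely outside [-rope, +rope]) when the true mean is 0.

    Assumes a normal population with SD sigma_true, a single decision at
    fixed n, and a posterior Normal(ybar, sigma_model / sqrt(n)).
    The rule rejects when |ybar| > rope + z * sigma_model / sqrt(n).
    """
    z = STD_NORMAL.inv_cdf(0.5 + hdi_mass / 2.0)
    critical_mean = rope + z * sigma_model / math.sqrt(n)
    se_true = sigma_true / math.sqrt(n)
    return 2.0 * (1.0 - STD_NORMAL.cdf(critical_mean / se_true))


def rope_for_target(target, n, sigma=1.0, hdi_mass=0.95):
    """ROPE half-width that yields `target` false alarms under the calibration assumptions."""
    z_hdi = STD_NORMAL.inv_cdf(0.5 + hdi_mass / 2.0)
    z_target = STD_NORMAL.inv_cdf(1.0 - target / 2.0)
    return (z_target - z_hdi) * sigma / math.sqrt(n)


if __name__ == "__main__":
    rope = rope_for_target(target=0.01, n=50)  # calibrated for N = 50, SD = 1
    print(f"calibrated ROPE half-width: {rope:.4f}")
    print(f"as calibrated (N=50, SD=1):   {false_alarm_rate(rope, n=50):.4f}")
    print(f"population SD is really 1.5:  {false_alarm_rate(rope, n=50, sigma_true=1.5):.4f}")
    print(f"sample size is really N=200:  {false_alarm_rate(rope, n=200):.4f}")
```

The ROPE is identical on the last three lines; only the hypothetical population or sample size changes, and the rate moves away from the calibrated 1% in both directions.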

On the other hand, simulating possible outcomes from hypothetical populations is fine for computing power, because that's what power is all about. Power is specifically asking about the probability of particular (Bayesian) decisions if the data come from a hypothetical population and are sampled a certain way. Importantly, power is not being used to make decisions about data that are actually obtained, nor is power being used to set decision criteria for data that are actually obtained.
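As a rough illustration of power in this sense, the sketch below computes the probability of each decision when the data come from a hypothetical population with a specified true mean. It relies on the same assumed toy model as a stand-in for full Bayesian estimation (normal likelihood with known sigma, flat prior on the mean, fixed N), so the probabilities are available in closed form instead of by simulation. The "accept" rule used here, HDI entirely inside the ROPE, and all names and settings are assumptions of the example.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def decision_probabilities(true_mean, n, rope=0.1, sigma=1.0, hdi_mass=0.95):
    """Probability of each decision if data come from Normal(true_mean, sigma).

    Assumed decisions: "reject" when the HDI lies entirely outside
    [-rope, +rope], "accept" when it lies entirely inside, else "undecided".
    """
    se = sigma / math.sqrt(n)
    half_width = STD_NORMAL.inv_cdf(0.5 + hdi_mass / 2.0) * se
    sample_mean = NormalDist(true_mean, se)  # sampling distribution of ybar

    # Reject: ybar > rope + half_width  or  ybar < -(rope + half_width).
    p_reject = sample_mean.cdf(-(rope + half_width)) + 1.0 - sample_mean.cdf(rope + half_width)

    # Accept: the HDI can fit inside the ROPE only if it is narrower than the ROPE.
    if half_width < rope:
        p_accept = sample_mean.cdf(rope - half_width) - sample_mean.cdf(half_width - rope)
    else:
        p_accept = 0.0

    return {"reject": p_reject, "accept": p_accept, "undecided": 1.0 - p_reject - p_accept}


if __name__ == "__main__":
    # Hypothetical effect of 0.5 SD with N = 50: how often would we reject?
    print(decision_probabilities(true_mean=0.5, n=50))
    # Hypothetical true null with N = 2000: how often would we accept?
    print(decision_probabilities(true_mean=0.0, n=2000))
```

These probabilities describe what would happen under the hypothesized population and sampling plan. They are used for planning, not for interpreting data already in hand.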

And, you might ask, where does that leave "Bayesian p values" from a posterior predictive check? I don't think those are necessarily very useful either. Here* are my ideas about that.

\* Your click on this link constitutes your request to me for a personal copy of the article, and my delivery of a personal copy. Any other use is prohibited.