Tuesday, August 12, 2014

Stopping and testing intentions in p values

An emailer asks:

I am really interested in Bayesian analysis, but I don't get the issue of sampling intention being so important in frequentist t-tests; if you have 60 values you have 60 values surely - why does your intention matter? The computer does not know what I intended to do; or have I missed the point entirely!?

That is exactly the point -- intuitively (for natural Bayesians) the stopping and testing intentions should not matter (for some things), but for p values the stopping and testing intention is of the essence, sine qua non. In lieu of reading the BEST article, see this video, starting at 7:15 minutes:
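To make the point concrete before (or instead of) the video, here is a sketch in Python using scipy of the classic fixed-N versus fixed-z coin-flipping example (z = 7 heads in N = 24 flips), showing that the very same data yield different p values under different stopping intentions:

```python
from scipy import stats

# Same observed data: z = 7 heads in N = 24 flips.
# Null hypothesis: theta = 0.5; one-tailed test for theta < 0.5.
z, N = 7, 24

# Intention 1: the experimenter fixed N at 24, so the sampling
# distribution of z is binomial.  p = Pr(z' <= 7 | N = 24).
p_fixed_N = stats.binom.cdf(z, N, 0.5)

# Intention 2: the experimenter flipped until the 7th head, so the
# sampling distribution of N is negative binomial.
# p = Pr(N' >= 24 | z = 7) = Pr(at least 17 tails before the 7th head).
p_fixed_z = stats.nbinom.sf(N - z - 1, z, 0.5)

print(round(p_fixed_N, 3), round(p_fixed_z, 3))  # 0.032 0.017
```

Same 24 flips, different sampling distribution, different p value. A Bayesian posterior for theta, by contrast, is identical under both intentions, because the likelihood of the observed data is the same.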

One other quick question if I may; how does the BEST package differ from the BayesFactor package?

An appendix of the BEST article explains, but you can instead see this video, starting at 2:35 minutes:

Those video excerpts might make the most sense if you just start with the first and watch them through, all the way...


  1. It's rather disturbing, particularly in this day and age, that the Bayesian is prepared to countenance stopping until the data support a hypothesized real effect, ignoring the increased probability of erroneously doing so. In standard normal testing, for instance, one is guaranteed to erroneously reject (or equivalently erroneously exclude the true value from a HPD region) with maximal probability. Cherry-picking, multiple testing, post-data choices of subgroups---all these gambits are touted as the "simplicity and freedom" (Savage) offered by the Bayesian Way. Their reference priors cannot save the day. True, one may find, at times, a skeptical prior to restore what the error statistician achieves directly (and with a clear rationale). But if these gambits are not to be frowned upon for destroying error control, on what basis can one demand a sufficiently skeptical prior?

  2. Dear Dr. Mayo: Thank you for your comment. The videos linked in the blog post are all about NOT ignoring the increased probability of erroneously rejecting or accepting the null with sequential testing. The videos show that Bayesian decision rules -- including the highest density interval with ROPE -- produce error rates that increase but asymptote below 100%, unlike conventional NHST, whose false alarm rate rises to 100% with sequential testing. The videos show that any stopping decision based on accept/reject produces bias in the estimate (conditional on the decision). The videos argue that a better stopping rule than accept/reject is achieving a desired precision. (This is what is done, for example, in political polling.) Importantly, precision is better measured by a Bayesian posterior distribution than by a frequentist confidence interval. --John
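The NHST half of that claim is easy to check by simulation. Here is a minimal sketch (illustrative Python, not taken from the videos; the sample sizes and replication count are arbitrary choices) that runs a t test after every new observation while the null is true:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sequential_false_alarm_rate(max_n=300, start_n=10, alpha=0.05, reps=400):
    """Fraction of null experiments that reject at least once when a
    one-sample t test is run after every new observation up to max_n."""
    false_alarms = 0
    for _ in range(reps):
        data = rng.normal(0.0, 1.0, size=max_n)  # null is true: mu = 0
        for n in range(start_n, max_n + 1):
            if stats.ttest_1samp(data[:n], 0.0).pvalue < alpha:
                false_alarms += 1
                break
    return false_alarms / reps

rate = sequential_false_alarm_rate()
print(rate)  # far above the nominal 0.05
```

The rate keeps climbing as max_n grows, heading toward 100% as the videos describe.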

  3. Very nice work. I'm glad to see people re-examining the fundamentals, and the t-test always seemed suspect to me.

    Question: could we do equivalent estimation with a bootstrapped approach, to generate a sampling distribution of difference of means? It seems to me this distribution has the same information as the distribution of the difference of mean parameters that you derive from your model. But bootstrapping makes fewer assumptions and is both conceptually and practically simpler.

  4. Bootstrapping (a.k.a. resampling) is all about generating a sampling distribution, just like traditional t tests. The only difference is that we don't sample from a hypothetical normal population; we resample from the data (which stand as the best available representation of the population). But the bootstrapped sampling distribution still depends on the stopping and testing intentions, like any other p value or confidence interval.
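For concreteness, a minimal sketch of that resampling procedure (illustrative Python; the two groups are made-up data, and note that resampling exactly n values each time quietly bakes in a fixed-N stopping intention):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data for two hypothetical groups.
group1 = rng.normal(100.0, 15.0, size=40)
group2 = rng.normal(108.0, 15.0, size=40)

def bootstrap_diff_of_means(x, y, n_boot=10_000):
    """Approximate the sampling distribution of the difference of means
    by resampling each group with replacement.  The baked-in stopping
    intention: every bootstrap sample has the same fixed N as the data."""
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = (rng.choice(x, size=len(x)).mean()
                    - rng.choice(y, size=len(y)).mean())
    return diffs

diffs = bootstrap_diff_of_means(group1, group2)
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap interval for the difference: [{lo:.1f}, {hi:.1f}]")
```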

  5. But why do p-values and stopping intentions come into this at all? You can bootstrap to find the sampling distribution of either mean, or the difference in means. Are these not the same as the distributions estimated by your model?

  6. Whenever you generate a sampling distribution, you must specify what constitutes a sample. Therefore you must specify whether the sample has fixed N, fixed duration (with random N), or some other stopping rule. And you must specify which sample statistics you are computing, which amounts to specifying which tests you are conducting. The sampling distribution depends on the intended stopping criterion and testing intentions. In the Bayesian approach there are NO sampling distributions. The distributions are of credible parameter values given the single fixed set of observed data.
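A small simulation can illustrate why the definition of "a sample" matters (illustrative Python; the fixed-duration rule is modeled, as an assumption, by a Poisson-distributed N with the same mean as the fixed-N rule):

```python
import numpy as np

rng = np.random.default_rng(2)
reps = 20_000

# Stopping rule A: fixed N = 20 observations per sample.
means_fixed_n = rng.normal(0.0, 1.0, size=(reps, 20)).mean(axis=1)

# Stopping rule B: fixed duration, so N is random (modeled here as
# Poisson with mean 20, floored at 2 so a mean is always defined).
means_random_n = np.array([
    rng.normal(0.0, 1.0, size=max(n, 2)).mean()
    for n in rng.poisson(20, size=reps)
])

# Same data-generating process, same average N, but the two stopping
# rules produce sampling distributions with different spreads.
print(means_fixed_n.std(), means_random_n.std())
```

The Bayesian analysis of any one observed sample sidesteps this entirely: the posterior conditions on the data actually in hand, whatever N turned out to be.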