## Wednesday, May 13, 2015

### The Bayesian New Statistics: Two Historical Trends Converge

If not null hypothesis significance testing, then what? If not p values, then confidence intervals? If not NHST, then Bayes factors? Both? Neither? These issues are addressed in a new manuscript titled The Bayesian New Statistics: Two Historical Trends Converge.

Abstract: There have been two historical shifts in the practice of data analysis. One shift is from hypothesis testing to estimation with uncertainty and meta-analysis, which among frequentists in psychology has recently been dubbed “the New Statistics” (Cumming, 2014). A second shift is from frequentist methods to Bayesian methods. We explain and applaud both of these shifts. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The two historical trends converge in Bayesian methods for estimation with uncertainty and meta-analysis.

Excerpt: Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. We will recapitulate the goals of the New Statistics and the frequentist methods for addressing them, and we will describe Bayesian methods for achieving those goals. We will cover hypothesis testing, estimation of magnitude (e.g., of effect size), assessment of uncertainty (with confidence intervals or posterior distributions), meta-analysis, and power analysis. We hope to convince you that Bayesian approaches to all these goals are more direct, more intuitive, and more informative than frequentist approaches. We believe that the goals of the New Statistics, including meta-analytic thinking engendered by an estimation approach, are better realized by Bayesian methods.

The manuscript is available at this link (via SSRN).

1. John, this is brilliant. This is a much-needed paper! The whole Bayesian momentum in psychology is being wasted by the focus on Bayes factors. Hopefully, your paper will show people that there is more to Bayesian statistics than Bayesian hypothesis testing. In fact, historically, estimation has been dominant in Bayesian statistics. Just look at the books by Andrew Gelman, Edwin Jaynes, or George Box: Bayes factors are barely mentioned. My impression is that this whole Bayes factor thing is just an aimless solo ride by the Roudermakers crew. If you look up the research by Wagenmakers, Lee, and others, they rarely use Bayes factors in their own work. When they do, it is only as a supplementary analysis. Their dominant approach is estimation. All this BF propagation just hands an easy argument to frequentists and the defenders of the status quo for discarding Bayesian stats.

2. Hi Matus. Maybe you need to read my papers more carefully ; -) And study the work by Jeffreys, Raftery, O'Hagan, Forster, and Berger. Testing comes logically prior to estimation (Jeffreys, 1961; Simonsohn, in press). You may not like it but tis the way nature works.
Cheers, EJ

4. Sorry for the earlier double post. I guess I wanted to emphasize the point :-) Matus, you could check out jasp-stats.org. JASP is meant to be statistically inclusive: it does hypothesis testing *and* estimation, both in the classical *and* the Bayesian way. So you can pick and choose!
Cheers,
E.J.

5. HPD intervals can be wrong with high or even maximal probability by ignoring things like optional stopping. Insofar as New Reforms fails to pick up on error probabilities, they will never replace error statistical methods.

6. Optional stopping is explored at length in Chapter 13 of DBDA2E, where it is shown that highest density intervals with a ROPE for making decisions have excellent error behavior (i.e., false alarm and correct detection rates).
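As a rough illustration of the decision rule mentioned here, the sketch below computes a highest density interval from posterior samples and compares it with a ROPE. It shows only the mechanics of the rule, not the optional-stopping error rates discussed in Chapter 13. The helper names `hdi_of_samples` and `rope_decision`, the ROPE limits, and the three sets of normal draws standing in for MCMC output are all assumptions made for this example.

```r
# Shortest interval containing `cred_mass` of the sampled values.
# Hypothetical helper written for this sketch, not the DBDA2E function.
hdi_of_samples <- function(samples, cred_mass = 0.95) {
  sorted <- sort(samples)
  n <- length(sorted)
  n_inside <- floor(cred_mass * n)      # index span each candidate interval covers
  starts <- seq_len(n - n_inside)       # starting index of each candidate interval
  widths <- sorted[starts + n_inside] - sorted[starts]
  best <- which.min(widths)             # narrowest candidate is the HDI
  c(lower = sorted[best], upper = sorted[best + n_inside])
}

# Decide by comparing the HDI with the region of practical equivalence (ROPE).
rope_decision <- function(hdi, rope) {
  if (hdi[["lower"]] > rope[2] || hdi[["upper"]] < rope[1]) {
    "reject null value"                 # HDI entirely outside the ROPE
  } else if (hdi[["lower"]] >= rope[1] && hdi[["upper"]] <= rope[2]) {
    "accept null value"                 # HDI entirely inside the ROPE
  } else {
    "undecided"                         # HDI and ROPE partly overlap
  }
}

set.seed(1)
rope <- c(-0.1, 0.1)                                     # assumed ROPE limits
posterior_far  <- rnorm(20000, mean = 1.0, sd = 0.10)    # stand-in for MCMC samples
posterior_near <- rnorm(20000, mean = 0.0, sd = 0.02)
posterior_wide <- rnorm(20000, mean = 0.1, sd = 0.10)

decisions <- c(
  far  = rope_decision(hdi_of_samples(posterior_far),  rope),
  near = rope_decision(hdi_of_samples(posterior_near), rope),
  wide = rope_decision(hdi_of_samples(posterior_wide), rope)
)
print(decisions)
```

The third case is the one that matters for sample-size planning: a posterior that straddles the ROPE boundary yields no decision until more data narrow the interval.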

7. Mayo, please back up your claims with a minimum working example (code and data, simulated will do).

8. John, interesting paper. I look forward to reading it carefully. I just wanted to say that one thing that should be stressed is that when one has a lot of data (this is the case in most of the work I do and review), the frequentist CI and the Bayesian credible interval will essentially overlap. It's only when the sample size is small that Bayesian methods are clearly going to give different results. People are consistently glossing over this detail.

Another thing is that the major problem (at least in linguistics and psycholinguistics) is not whether we use frequentist or Bayesian methods. The major problem is that people are clicking buttons blindly to get a decision from the data. This leads to all kinds of BS, of the type you discuss, but also much simpler things. E.g., I know several veteran experimenters in linguistics (>15 years' experience running studies) who can't do a paired t-test correctly. If they shift to Bayes, if anything, the problems will get worse. An amazing fact is that these people don't even care that they don't understand the t-test. What's missing is basic education in statistical theory, and on top of that there is a basic lack of regard for the method. Linguists tend to be contemptuous of statistical theory, considering it something ancillary to the science. This reminds me of professional writers who used to be contemptuous of people who typed their own stuff up on a typewriter instead of sending handwritten drafts to a typist. For many people, the typist is the software, which returns a black-box answer. The real problem is lack of education.

9. Shravan:

A few quick thoughts regarding what happens when sample size is large:
There is still a need for estimation instead of hypothesis testing, because trivial effects become "significant" when N is large (an example was given in Figure 2 of the manuscript). Estimation reveals the trivial effect size.
There is still a difference between frequentist confidence intervals and Bayesian credible intervals when there are multiple intended tests, because the frequentist confidence intervals expand with every intended test. And changes in stopping intention can also change the confidence interval even when N is large.
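The first point above (trivial effects becoming "significant") can be sketched in a few lines of R. This is not the example from Figure 2 of the manuscript; the group size and the effect of 0.01 standard deviations are hypothetical values chosen for illustration, and the samples are standardized so that the sample means and SDs are exact.

```r
n_per_group <- 1e6     # hypothetical, very large sample size per group
true_diff   <- 0.01    # hypothetical trivial effect, in SD units

set.seed(2)
# Force each sample to have mean exactly 0 and SD exactly 1.
standardize <- function(v) (v - mean(v)) / sd(v)
control   <- standardize(rnorm(n_per_group))
treatment <- standardize(rnorm(n_per_group)) + true_diff

result <- t.test(treatment, control)

# Both sample SDs are 1, so the mean difference is the standardized effect size.
effect_size <- mean(treatment) - mean(control)

cat("p value:", format.pval(result$p.value), "\n")
cat("95% CI for the difference:", round(result$conf.int, 4), "\n")
cat("standardized effect size:", round(effect_size, 4), "\n")
```

The test rejects the null hypothesis, while the estimate and its interval make plain that the effect is about one hundredth of a standard deviation.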

Regardless of theoretical statements about asymptotically large N, in practice a lot of researchers do not have the luxury of large N. Nevertheless, emphasis on precision, inherent in the estimation approach, often reveals to researchers that N should usually be larger than they might have hoped they could get away with.

Perhaps the main point is this: When the emphasis is on estimation of parameter values, the analysis method should tell you about the credibility of parameter values. Bayesian methods do that directly, frequentist methods don't, regardless of sample size.

Does "blind clicking" get worse with Bayesian than with frequentist? I don't think it necessarily must. Blind clicking is a function of (i) ease of clicking and (ii) difficulty of learning about what's beneath the clicks. Ease of clicking is about software, and right now Bayesian software is the most difficult part of Bayesian analysis. Easier software is coming (e.g., JASP as mentioned in EJ's comment) and that's good as long as it makes the right things easy and the wrong things hard. Difficulty of learning is about access and familiarity. Frequentist methods might feel easier, but that's only because everyone has had hundreds of hours of training in it and people mistake familiarity for ease. I strongly suspect that equal hours of exposure to Bayesian ideas will be much more fruitful. Bayesian ideas are easier to understand than frequentist p values. Education is the key to giving Bayesian methods a chance.

10. Dear EJ,

Now if, as you say in chap 7.7, researchers should take utmost care when applying Bayesian hypothesis testing, what is the first thing we should not do? The first thing we SHOULD NOT do is hand people software with which they can accomplish Bayesian comparison seamlessly in two clicks. It's completely beyond me how JASP fits with your recommendations. All it accomplishes is to make mindless abuse of Bayesian statistics easy and accessible.

The work of O'Hagan, Forster, and Berger targets mathematical statisticians, not applied researchers. The issue here is not whether one can derive ultra-consistent and mega-calibrated Bayesian default methods, but rather whether such methods are a useful tool for data analysis. Your discussion in chapter 7.7, as well as the discussion of hypothesis testing in John's paper, does not indicate so.

11. Hi John, I agree with everything you write in response to my comment. But my point does not relate to these issues. I was trying to say that the actual CI we compute with large n is going to look very similar to the credible interval. This should be clearly discussed in Bayesian discussions, but it is not.

I agree in particular that providing education is the key to everything. I will try to attend one of your courses in Europe to see how you do it.

12. Shravan:

Along with my general point about Bayesian giving us what we want regardless of N, I was also trying to say that even with large N the confidence interval is not necessarily going to look very similar to the credible interval. The confidence interval can still change a lot depending on the intended tests (and even on the stopping intention if the intention could involve smaller N).

As far as education goes, the audience of my workshops is a self-selected highly interested group of researchers (from precocious undergrads through grad students to established career professionals from many fields) and I teach with that audience in mind. But we also need good courses for students at earlier introductory levels, who are sitting in a class only because it's required. Implementation of those courses also has to battle the institutionalized entrenchment of NHST-based courses.

By the way, in my courses, I can't help but inject feeble attempts at humor throughout -- it's just who I am. If that sort of thing makes your eyes roll, you might want to think twice ;-)

13. Is JASP going to incorporate DBDA2E code? Because that would be awesome. Especially for intro level teaching. We need a good way for researchers to express and encode their priors for both estimation and (critically) model comparison. Prior predictive distributions, maybe?

14. Hi Mike. I want JASP to be inclusive, so I'd like it to feature as many methods as possible, although some perhaps as add-on modules. It's really a question of priorities, time, and money. But I'm open to it in principle.
E.J.

16. John,

You wrote:
" The confidence interval can still change a lot depending on the intended tests (and even on the stopping intention if the intention could involve smaller N). "

Maybe, but this doesn't really affect the life of the researcher who fixes his/her sample size before starting the experiment, does it? I know people run till they hit significance, but they shouldn't do that anyway.

If I manage to get to one of your courses, I will try to tune out your feeble attempts ;)

17. The stopping intention and the testing intention have distinct influences on the confidence interval. Even with fixed-N stopping intention, the confidence interval expands greatly when the intended tests increase. See Section 11.3 of DBDA2E.
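To give a feel for how the interval widens with the number of intended tests at fixed N, here is a minimal sketch that uses a Bonferroni correction as a simple stand-in. It is not necessarily the correction used in Section 11.3, and the sample size, pooled SD, and numbers of tests are hypothetical.

```r
n_per_group <- 50      # hypothetical fixed sample size per group
pooled_sd   <- 1.0     # hypothetical pooled standard deviation
alpha       <- 0.05

df      <- 2 * (n_per_group - 1)
se_diff <- pooled_sd * sqrt(2 / n_per_group)   # SE of a two-group mean difference

# Number of intended tests; 55 is all pairwise comparisons among 11 groups.
n_tests <- c(1, 2, 5, 10, 55)

# Bonferroni: each two-sided interval uses alpha / n_tests.
critical_t <- qt(1 - alpha / (2 * n_tests), df)
half_width <- critical_t * se_diff

print(data.frame(n_tests,
                 critical_t = round(critical_t, 3),
                 half_width = round(half_width, 3)))
```

The data and N are identical in every row; only the stated testing intention changes, and the interval grows with it.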

18. Hi John,

I just read section 11.3. This is not a relevant issue regarding what I am discussing. I am saying the following: suppose you have a *given* data-set. Using a Bayesian LMM, I can compute the 95% credible intervals for each "fixed effects" parameter. I can also compute the corresponding 95% confidence intervals using the lme4 package. For large data-sets, these will be identical.

I understand that the properties of CIs are not comparable to HPDs or credible intervals, and I understand that credible intervals are far superior because they answer the question, whereas CIs answer (to the extent that they answer anything) the wrong or irrelevant question.

What I'm saying is that for a *given* data-set, with n large enough, in the standard factorial designs we use in psych and ling, the credible interval will be very similar to the CI. Why is this so hard to accept and to discuss in Bayesian attacks on frequentist methods? In a conjugate normal-normal setting, with n large enough, the posterior will be identical to the MLE. It's great that you and others are pointing out all the flaws of the classical method. But it's not so great that people are not upfront about this. One consequence of not being upfront about this point is that beginners will learn to use Bayes, and then get surprised that the intervals are the same (in the standard factorial repeated measures designs I am talking about).

If I'm wrong about this, I would appreciate seeing an example where we have large n, and a 2x2 repeated measures design, the model is an LMM, and our fixed effects parameter intervals are radically different. Just one counterexample will help me understand what it is that I am not getting.
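For the simplest version of the conjugate normal-normal case mentioned above, the numerical agreement is easy to check. The sketch below covers a single mean with one test and fixed N, not the 2x2 repeated measures LMM. The simulated data, the plug-in treatment of sigma as known, and the vague N(0, 10^2) prior are assumptions made for this example.

```r
set.seed(3)
n <- 10000
y <- rnorm(n, mean = 0.3, sd = 1)      # simulated data, hypothetical values

# Frequentist 95% confidence interval for the mean (single test, fixed N).
ci <- as.numeric(t.test(y)$conf.int)

# Conjugate normal-normal posterior for the mean, treating sigma as known
# (plug-in sample SD) and using an assumed vague prior N(0, 10^2).
prior_mean <- 0
prior_sd   <- 10
sigma      <- sd(y)
post_prec  <- 1 / prior_sd^2 + n / sigma^2
post_mean  <- (prior_mean / prior_sd^2 + sum(y) / sigma^2) / post_prec
post_sd    <- sqrt(1 / post_prec)

# Central 95% credible interval from the normal posterior.
cred <- qnorm(c(0.025, 0.975), mean = post_mean, sd = post_sd)

print(rbind(confidence = ci, credible = cred))
```

With these settings the two intervals agree to several decimal places, although they still answer different questions, and the agreement says nothing about what happens when the testing or stopping intention changes.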

19. twillick said "In a conjugate normal-normal setting, with n large enough, the posterior will be identical to the MLE. It's ... not so great that [proponents of Bayesian methods] are not upfront about this. One consequence of not being upfront about this point is that beginners will learn to use Bayes, and then get surprised that the intervals are the same..."

I thought I was being up front. Let me try again.
1. With asymptotically large N, the shape of the posterior distribution will converge to the shape of the likelihood function. The posterior distribution expresses p(parameter|data) while the likelihood function expresses p(data|parameter), so they are not identical in meaning.
2. With asymptotically large N, the shape of the sampling distribution of the MLE of the parameter, using hypothetical parameter values set at the MLE of the actual data and stopping intention of fixed N and testing intention of a single test, will converge to the shape of the likelihood function. (At least, that's my impression from frequentist mathematical statistics.)
3. Therefore, with asymptotically large N, the shape of the posterior distribution will converge to the shape of the sampling distribution of the MLE of the parameter, using hypothetical parameter values set at the MLE of the actual data and stopping intention of fixed N and testing intention of a single test. The distributions do not have the same meaning, however, as the posterior distribution expresses p( parameter | data ) while the sampling distribution expresses p( MLE_simulated | parameter=MLE_data, intention=fixed N and 1 test ).
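Step 1 of the argument above can be checked numerically with a grid approximation. This sketch uses a binomial likelihood with a deliberately informative Beta(4, 2) prior and an observed proportion of 0.3; the model, the prior, and the helper name `shape_distance` are assumptions chosen for illustration. It measures how far the normalized likelihood is from the posterior as N grows.

```r
# Grid over the parameter, excluding the endpoints 0 and 1.
theta <- seq(0.0001, 0.9999, by = 0.0001)

# Total variation distance between the normalized likelihood and the
# posterior on the grid (0 = identical shape, 1 = no overlap).
shape_distance <- function(n, prop = 0.3, prior_a = 4, prior_b = 2) {
  z <- round(prop * n)                                   # observed successes
  log_lik  <- dbinom(z, n, theta, log = TRUE)
  log_post <- log_lik + dbeta(theta, prior_a, prior_b, log = TRUE)
  # Subtract the maximum before exponentiating to avoid underflow.
  lik  <- exp(log_lik  - max(log_lik));  lik  <- lik  / sum(lik)
  post <- exp(log_post - max(log_post)); post <- post / sum(post)
  0.5 * sum(abs(lik - post))
}

n_values  <- c(10, 100, 1000, 10000)
distances <- sapply(n_values, shape_distance)
print(data.frame(n = n_values, distance = signif(distances, 3)))
```

The distance shrinks steadily as N increases, which is the convergence in shape that step 1 describes; the two curves still mean different things.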

People might be surprised by this correspondence, if people are under the impression that Bayesian intervals must be radically different than frequentist intervals. But that's a problem with the premise that Bayesian intervals must always be radically different than frequentist, which is false. Would beginners confabulate this false premise? Maybe, as an inappropriate generalization from some of the examples presented by Bayesians to illustrate the conceptual differences between frequentist and Bayesian. Certainly that's what the examples I present are intended to do: Illustrate the concepts and the differences between Bayesian and frequentist ideas.

Although some people might be surprised by the correspondence, I don't think people should feel indignant or feel that they've been conned by false advertising or false promises. Bayesian methods provide the information that's desired, namely p(parameter|data), and for any size N. Frequentist methods can approximate that information in some circumstances, and only through the chain of correspondences in the 1-2-3 argument explained above.

CONTINUED in next comment...

20. Continuation of previous comment:

Finally, an example of how the frequentist confidence interval can change noticeably for large N when there are multiple tests. This isn't about N=infinity, it's about moderately large N. Consider a single-factor between-subjects ANOVA. There are 11 groups, with true sample means at 1.0, 1.1, 1.2, ..., 2.0, and true sample SD's in each group of 1.0. The sample size is N in each group. Consider a t-test of group 1 vs group 3 and a Tukey HSD that corrects the p value and confidence interval for multiple tests (all pairwise comparisons). Here are the results for different sample sizes:

N: 10
t test: -0.7903841 1.1903841
Tukey HSD: -1.353395 1.753395

N: 100
t test: -0.08029036 0.48029036
Tukey HSD: -0.258461 0.658461

N: 1000
t test: 0.1122507 0.2877493
Tukey HSD: 0.05595481 0.34404519

N: 10000
t test: 0.1722789 0.2277211
Tukey HSD: 0.1544791 0.2455209

Thus, even when there are 10000 per group (!), there is still a noticeable difference between the t-test confidence interval and the Tukey-corrected confidence interval.
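One way to see why the gap does not close is to compare the critical values behind the two intervals. The sketch below assumes the setup described above (11 groups, equal N and equal SD in every group), so both intervals share the same standard error and their width ratio is just the ratio of critical values.

```r
n_groups    <- 11
n_per_group <- c(10, 100, 1000, 10000)

# Welch t test on two groups with equal N and equal variance: df = 2 * (N - 1).
t_crit <- qt(0.975, df = 2 * (n_per_group - 1))

# Tukey HSD uses the studentized range quantile divided by sqrt(2),
# with the ANOVA error df = n_groups * (N - 1).
tukey_crit <- sapply(n_per_group, function(n) {
  qtukey(0.95, nmeans = n_groups, df = n_groups * (n - 1)) / sqrt(2)
})

width_ratio <- tukey_crit / t_crit
print(data.frame(n_per_group,
                 t_crit      = round(t_crit, 3),
                 tukey_crit  = round(tukey_crit, 3),
                 width_ratio = round(width_ratio, 3)))
```

Both intervals shrink as N grows, but their ratio settles at about 1.64 instead of approaching 1, so the corrected interval stays roughly 64% wider no matter how large N becomes.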

Here is the R code that generated the above results:

#------------------------------------------------------------------------------

# Specify generating values:
NperGroup = 10000
cat("N: ")
show( NperGroup )
groupMean = c( 1.0 , 1.1 , 1.2 , 1.3 , 1.4 , 1.5 , 1.6 , 1.7 , 1.8 , 1.9 , 2.0 )
groupSD = 1.0

# Utility function, standard deviation using N not N-1:
sdN = function(y) { sqrt(mean((y-mean(y))^2)) }

# Generate data:
y = NULL
x = NULL
set.seed(47405)
for ( groupIdx in 1:length(groupMean) ) {
  groupData = rnorm( NperGroup )
  groupData = ( (groupData-mean(groupData))/sdN(groupData) * groupSD
                + groupMean[groupIdx] )
  y = c( y , groupData )
  x = c( x , rep( groupIdx , NperGroup ) )
}
x = factor( x )

plot( y ~ x )
title( main=bquote("N per Group"==.(NperGroup)) )

tTestInfo = t.test( y[x==3] , y[x==1] )
cat( "t test: ")
show( tTestInfo$conf.int )

aovInfo = aov( y ~ x )
TukeyHSDinfo = TukeyHSD( aovInfo )
cat( "Tukey HSD: ")
show( TukeyHSDinfo$x["3-1",c("lwr","upr")] )

plot( TukeyHSDinfo )

#------------------------------------------------------------------------------

21. Hi John,

Shravan here (my son messed around with my account, and changed my name to twillick).

My point really is that for such designs as I'm talking about it's not going to matter much whether one fits a Bayesian model or frequentist ones---the decision will be the same.

22. Hi John and everybody else,
First of all: Thank you for your blog and your book (which I am currently reading and working through).
I have made a Youtube video criticizing the P-Value and NHSTs and would be very happy if you'd check it out and even more happy if you (or someone) left a comment or criticism. Here is the link: https://youtu.be/HRLxtJeopDs
Sorry for the advertisement, but I would honestly be interested in anyone’s opinions and I'm not making anything of it.
Thank you!

23. Amoral: Nice video. I left a comment at YouTube.

24. Hello,

Thanks, John, for this interesting article. I was wondering if you or any others had thoughts on using Bayes factors together with Bayesian parameter estimation. For example, I am interested in pitting several multiple regression models against a base model to determine which variables best predict some outcome (i.e., using Bayes factors). Then, I would like to estimate the credible values for those parameters. The first step can be accomplished using Morey's 'BayesFactor' R-package and the second step using the scripts that you have provided.

I have not been able to find any published research using this approach, so I am curious to hear what others think. Is this a valid way to proceed or is there a different approach that can accomplish the same thing?

Thank you!

25. Kurtis:
Variable selection for multiple regression is extensively addressed in Ch. 18 of DBDA2E. Section 18.3 covers hierarchical models with shrinkage, and section 18.4 discusses variable selection (in combination with hierarchical shrinkage). Variable selection with inclusion parameters is tantamount to using Bayes factors for inclusion. See section 18.4 for all the usual cautions about using Bayes factors / discrete inclusion parameters.
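The link between inclusion parameters and Bayes factors is just the ratio of posterior odds to prior odds. A minimal sketch of the arithmetic follows; the function name `inclusion_bf` and the probabilities are hypothetical, and in practice the posterior inclusion probability would be the proportion of MCMC steps in which the variable's inclusion indicator equals 1.

```r
# Bayes factor for including a predictor, from its prior and posterior
# inclusion probabilities: posterior odds divided by prior odds.
inclusion_bf <- function(prior_prob, posterior_prob) {
  posterior_odds <- posterior_prob / (1 - posterior_prob)
  prior_odds     <- prior_prob / (1 - prior_prob)
  posterior_odds / prior_odds
}

# Hypothetical values: prior inclusion probability 0.5, posterior 0.8.
bf <- inclusion_bf(prior_prob = 0.5, posterior_prob = 0.8)
print(bf)
```

Because the result depends on the prior inclusion probability and on the prior placed on the regression coefficient, the usual cautions about Bayes factors carry over to inclusion parameters.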