Wednesday, February 25, 2015
"Is the call to abandon p-values the red herring of the replicability crisis?"
In an opinion article [here] titled "Is the call to abandon p-values the red herring of the replicability crisis?", Victoria Savalei and Elizabeth Dunn concluded, "at present we lack empirical evidence that encouraging researchers to abandon p-values will fundamentally change the credibility and replicability of psychological research in practice. In the face of crisis, researchers should return to their core, shared value by demanding rigorous empirical evidence before instituting major changes."
I posted a comment which said in part, "people have been promoting a transition away from null hypothesis significance testing to Bayesian methods for decades, long before the recent replicability crisis made headlines. The main reasons to switch to Bayesian have little directly to do with the replicability crisis." Moreover, "It is important for readers not to think that Bayesian analysis merely amounts to using Bayes factors for hypothesis testing instead of using p values for hypothesis testing. In fact, the larger part of Bayesian analysis is a rich framework for estimating the magnitudes of parameters (such as effect size) and their uncertainties. Bayesian methods are also rich tools for meta-analysis and cumulative analysis. Therefore, Bayesian methods achieve all the goals of the New Statistics (Cumming, 2014) but without using p values and confidence intervals."
See the full article and comment at the link above.

Just for redundancy, here's the full text of my comment on the opinion article:
Bayes factors for hypothesis testing are only a small part of Bayesian data analysis. It is important for readers not to think that Bayesian analysis merely amounts to using Bayes factors for hypothesis testing instead of using p values for hypothesis testing. In fact, the larger part of Bayesian analysis is a rich framework for estimating the magnitudes of parameters (such as effect size) and their uncertainties. Bayesian methods are also rich tools for meta-analysis and cumulative analysis. Therefore, Bayesian methods achieve all the goals of the New Statistics (Cumming, 2014) but without using p values and confidence intervals.
Readers should not overgeneralize the problem of prior distributions in Bayes factors to all applications of Bayesian analysis. While Bayes factors can be very sensitive to the choice of prior, that sensitivity does not invalidate Bayes factors; instead it emphasizes the fact that the prior must express a meaningful theory or meaningful previous data. Most importantly from my perspective, unlike Bayes factors, Bayesian parameter estimates are usually NOT very sensitive to the choice of prior (see Kruschke 2011 cited in the article). Moreover, when there is strong prior information, then not using it to inform a prior distribution can be a serious blunder, as has been emphasized in the literature repeatedly when incorporating base rates of diseases (i.e., the priors) into disease diagnosis. To dismiss priors as merely another source of researcher degrees of freedom and another questionable research practice (QRP) is wrong.
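To make the base-rate point concrete, here is a minimal sketch of Bayes' rule applied to a diagnostic test. The function name and all of the numbers (prevalence, sensitivity, false positive rate) are hypothetical values chosen for illustration; none of them come from the article or the comment.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """Return P(disease | positive test) using Bayes' rule.

    prior               -- base rate of the disease, P(disease)
    sensitivity         -- P(positive | disease)
    false_positive_rate -- P(positive | no disease)
    """
    # Total probability of a positive test across both disease states.
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


# Hypothetical test characteristics, identical in both scenarios.
SENSITIVITY = 0.99
FALSE_POSITIVE_RATE = 0.05

# Same test, two different base rates (both values are assumptions).
rare = posterior_probability(0.001, SENSITIVITY, FALSE_POSITIVE_RATE)
common = posterior_probability(0.30, SENSITIVITY, FALSE_POSITIVE_RATE)

print(f"rare disease:   P(disease | positive) = {rare:.3f}")
print(f"common disease: P(disease | positive) = {common:.3f}")
```

With these assumed numbers, the same positive result implies roughly a 2% chance of disease when the base rate is 0.1%, and roughly an 89% chance when the base rate is 30%. Ignoring the prior would treat the two cases as identical.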
Bayesian analysis is primarily a means for analyzing data already in hand, while the replication crisis is largely a consequence of biased selectivity in which data get analyzed and which data get published. Bayesian analysis cannot prevent people from biasing the data that get analyzed. But people have been promoting a transition away from null hypothesis significance testing to Bayesian methods for decades, long before the recent replicability crisis made headlines. The main reasons to switch to Bayesian have little directly to do with the replicability crisis. Nevertheless, there are aspects of Bayesian methods that can help scrupulous analysts attenuate some problems. As just one example, complex hierarchical models can be seamlessly analyzed in Bayesian software, and hierarchical models can produce rational shrinkage in parameter estimates, which in turn reduces false alarms.
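As a rough illustration of shrinkage, the sketch below uses a deliberately simplified normal-normal calculation, not a full hierarchical analysis. It assumes every group mean has the same known standard error `se`, that the between-group standard deviation `tau` is known, and that the grand mean is simply the average of the group means. The function name and the numbers are hypothetical. A real analysis would estimate all of these jointly in Bayesian software.

```python
def shrunken_means(group_means, se, tau):
    """Pull each observed group mean toward the grand mean.

    Simplifying assumptions (for illustration only):
      - each group mean has the same known standard error `se`
      - the between-group standard deviation `tau` is known
      - the grand mean is taken to be the plain average of the group means
    """
    grand_mean = sum(group_means) / len(group_means)
    # Fraction of each group's deviation that is retained: close to 1 when
    # groups truly differ a lot (large tau), close to 0 when the data are noisy.
    weight = tau ** 2 / (tau ** 2 + se ** 2)
    return [grand_mean + weight * (y - grand_mean) for y in group_means]


observed = [0.0, 1.0, 5.0]           # hypothetical group means
shrunk = shrunken_means(observed, se=1.0, tau=1.0)
print(shrunk)
```

The outlying group (5.0) is pulled furthest toward the grand mean, which is the sense in which shrinkage tempers extreme estimates that might otherwise be reported as false alarms.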
For readers interested in learning more about Bayesian methods, may I suggest the easy introductory chapter available at https://sites.google.com/site/doingbayesiandataanalysis/sample-chapter
Next, try the article about Bayesian estimation of effect sizes for comparing two groups, available at http://www.indiana.edu/~kruschke/BEST/
"at present we lack empirical evidence that encouraging researchers to abandon p-values will fundamentally change the credibility and replicability of psychological research in practice."
I have no issue whatsoever with examining the effects of abandoning p-values empirically. What rankles, though, is that p-values were adopted with no theoretical or empirical justification/evidence whatsoever.
This contrasts with Bayesian methods, which at the time they were rejected by Fisher et al. had been used by Laplace to obtain masses of highly reproducible research, and were derived from the sum and product rules of probability theory (making them as theoretically sound as anything can be in statistics). Fisher didn't point out any examples where Bayesians were getting non-reproducible research using their methods. He merely blathered on about their 'philosophical' unsoundness in such a way that it's clear Fisher didn't understand what Bayesians like Jeffreys were really doing.
The only positive evidence for p-values was that they were vaguely intuitive and were tried on a few very simple problems in which they give results operationally identical to Bayes. Which is to say there was no evidence at all. Even after the mass of theoretical and practical problems with p-values became well known, people still cling to p-values!
It would be nice if Frequentists had the intellectual honesty to admit that p-values had and have none of the empirical evidence they insist is now required and admit their rank hypocrisy in this matter.
‘rank hypocrisy’... this is perhaps an extreme opinion. As a statistician with mostly Frequentist training who is looking into the Bayesian paradigm, my opinion is that we continue using p-values and confidence intervals, not out of stubbornness, but because they have mathematical backing, in the form of theoretical sampling distributions, or alternatively, empirical distribution functions computed with bootstrap methods. So there are both theoretical and empirical justifications behind the use of these Frequentist tools.
A p-value allows us to make a statement on whether the value of an observed statistic (or more extreme values), computed from data at hand, and assuming a certain setting (i.e., the ‘null’ hypothesis), is likely to result from just sampling variability. This single p-value applies to a single ‘null’ hypothesis, and it doesn’t say anything else, especially, it doesn’t necessarily say anything about the magnitude or relevance of the effect of interest. Ideally, in a Frequentist analysis, a null hypothesis test would be the first step, with the simple purpose of determining if sampling variability can be ruled out. Referring to a null hypothesis test as a ‘significance’ test is an unfortunate choice of terms, and I admit that it is probably misleading.
In my view, the problem is that oftentimes, the interpretation of a p-value is overstretched to claim that it is the probability of a hypothesized ‘null’ value, or that it implies practical relevance. This constitutes misuse and misunderstanding of the methodology, and not a flaw of it. Further, misinterpreting the results from ‘significance’ tests is often combined with multiple testing, while ignoring the fact that the more hypotheses you test (via Frequentist or Bayesian procedures), the higher the chances of encountering false associations that are the product of sampling variability, which, in my view, greatly contributes to the issues of replicability in several disciplines. This used to be a big problem in the ‘omics’ fields, over a decade ago, but more rigor with the strength of evidence, still using Frequentist methods, has decreased the number of false positives reported in recent years. I am not saying that Bayesian methods are not used in ‘omics’, but that rigor is also possible within the Frequentist paradigm.
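The multiple-testing point can be made concrete with a small calculation. The sketch below assumes the tests are independent and that every null hypothesis is true, which is a simplification. Under those assumptions the chance of at least one false positive among m tests at level alpha is 1 - (1 - alpha)^m. The function name is hypothetical.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false positive among m tests.

    Assumes the m tests are independent and all null hypotheses are true.
    """
    return 1.0 - (1.0 - alpha) ** m


ALPHA = 0.05
for m in (1, 10, 20):
    rate = familywise_error_rate(ALPHA, m)
    print(f"{m:2d} tests at alpha={ALPHA}: P(at least one false positive) = {rate:.3f}")
```

At the conventional 0.05 level, the chance of at least one spurious finding grows from 5% for a single test to about 40% for ten tests and about 64% for twenty.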
Comment continued from above due to character limitations per post.
For another common example of the multiple testing issue, we don’t have to go too far. Consider one of the most common methodologies, linear modeling, still the analytical workhorse in many disciplines. Inference from linear models is typically conducted within the Frequentist paradigm. More often than not, a linear model is fitted, and all sorts of conclusions are made based on the p-values for the model coefficients (and hopefully on the magnitudes of these coefficients, and the model fit). Yet, what everyone seems to ignore, is that these p-values result from marginal tests of ‘added last’ coefficients, resulting in the possibility of multiple testing issues within a single model. On top of that, cross-validation procedures often show shrinkage on the goodness-of-fit from these models, indicating that conclusions from them, such as goodness of fit, or statements on the ‘significance’ of some model coefficients, don’t replicate with the same strength even in the same sample. Again, the problem with linearized models is not necessarily the models themselves, but ignoring their limitations, and misinterpreting the ‘significance’ tests of the model coefficients.
Within the Frequentist paradigm, there are a number of ways to be more rigorous when dealing with multiple tests, and/or fitting and conducting inference from models (e.g. cross-validation, a whole slew of model selection procedures, adjustments for multiple testing, or simply acknowledging the possible threats to the validity of an analysis). However, and I admit it, we don’t necessarily do so as much as we should. In applied disciplines where practitioners have skills sufficient to conduct some analyses of their own data, I would be surprised if there was sufficient awareness of the aforementioned issues.
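As one example of the adjustments for multiple testing mentioned above, here is a minimal sketch of two standard corrections, Bonferroni and Holm, written in plain Python. The comment does not say which adjustment the author has in mind, so these two are chosen only for illustration, and the p-values are made up.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[i])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]      # hypothetical p-values
bonf = bonferroni(raw)
holm_adj = holm(raw)
print("Bonferroni:", [round(p, 3) for p in bonf])
print("Holm:      ", [round(p, 3) for p in holm_adj])
```

With these made-up values, all four raw p-values fall below 0.05. Two remain below 0.05 after either adjustment, and the Holm values are never larger than the Bonferroni ones.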