Wednesday, October 19, 2011

False conclusions in "False positive psychology..."

A new article in Psychological Science correctly points out flaws of p values and of procedures that produce bias in data selection and analysis. The article draws several reasonable conclusions. Unfortunately, it also draws two wrong conclusions about issues of fundamental importance with major ramifications.

One conclusion is that p values are okay but need to be corrected for the researcher's stopping intention. I refute that claim by reductio ad absurdum. A second conclusion is that Bayesian analysis is a "nonsolution" to the problem of researchers having too much wiggle room. I dispute that claim by clarifying what problems any analysis method can or cannot address, by denying flexibility attributed to Bayesian analysis that isn't really available, and by claiming the actual flexibility in Bayesian analysis as an important advantage for scientific research.

The first issue stems from the fact, correctly pointed out in the article, that p values depend on the stopping intention of the data collector. Here's the idea. Suppose a researcher collects some data and computes a summary statistic such as t or F or χ². The p value is the probability of the observed statistic, or a value more extreme, in the space of possible data that might have been obtained if the null hypothesis were true and the intended experiment were repeated ad infinitum. The space of possible data that might have been obtained depends on the stopping intention. So, if the data collector intended to stop when the sample size N reached a certain number, such as 23, then the space of possible data includes all data sets for which N=23. But if the data collector intended to stop at the end of the week (and just happened to get N=23), then the space of possible data includes all data sets that could have been collected by the end of the week, some of which have N=23, and some of which have smaller N or larger N. Because the two spaces of possible data sets are not the same, the p values are not the same. The p value can depend quite dramatically on the stopping intention. For example, if the researcher intended to stop when N=100 but was unexpectedly interrupted when N=23, the p value is much smaller than if the intention was to stop when N=23. Or, if the researcher intended to stop when N=23 but got an unexpected windfall of data so that N=100, perhaps because a new volunteer assistant showed up, then the p value is much larger than if the researcher intended to stop at N=100. Therefore, to correctly determine the p value for a set of data, we must know the reason that data collection ceased.
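To make the dependence explicit, here is one way to write the definition for the simple case of binary data (z "heads" out of N). This is a sketch under assumptions that are not in the article: the test statistic is taken to be the distance of the sample proportion from the null value, and the symbols \theta_0 (null value), I (stopping intention), and \Pr(n \mid \text{duration}) (distribution of sample sizes that the duration could have produced) are introduced only for illustration.

```latex
% General form: the sampling distribution is conditional on the intention I.
p \;=\; \Pr\bigl( T(D') \ge T(D_{\mathrm{obs}}) \,\bigm|\, H_0,\; I \bigr)

% Intention "stop at fixed N": only data sets of size N are possible.
p_{\text{fixed } N} \;=\; \sum_{z'=0}^{N} \binom{N}{z'}\,
  \theta_0^{\,z'} (1-\theta_0)^{N-z'}\;
  \mathbf{1}\!\left[\, \left|\tfrac{z'}{N}-\theta_0\right| \ge
                       \left|\tfrac{z}{N}-\theta_0\right| \,\right]

% Intention "stop after a fixed duration": sample sizes n other than N
% are also possible, each weighted by its probability.
p_{\text{duration}} \;=\; \sum_{n \ge 1} \Pr(n \mid \text{duration})
  \sum_{z'=0}^{n} \binom{n}{z'}\,
  \theta_0^{\,z'} (1-\theta_0)^{n-z'}\;
  \mathbf{1}\!\left[\, \left|\tfrac{z'}{n}-\theta_0\right| \ge
                       \left|\tfrac{z}{N}-\theta_0\right| \,\right]
```

The observed data (z, N) enter both expressions in exactly the same way. Only the outer sum over hypothetical sample sizes differs, and that sum is determined by the intention, not by the data.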

Here is an example of the correct use of p values. A lab director has some research in mind and decides that N=30 is adequate. The director tells the lab administrator to collect data from 30 subjects. The administrator knows that the lab typically recruits about 30 subjects in a week, and therefore tells the data-collecting assistant to run subjects for a week. The assistant dutifully collects data, and at the end of the week the lab director happens to be present as the last datum is collected, which happens to be N=30. As far as the lab director can tell, data collection ceased intentionally when N=30. When the lab director analyzes the data, under the intention of stopping when N=30, a particular p value is computed. But when the assistant analyzes the data, under the intention of stopping at the end of the week, a different p value is computed. In fact, for the lab director p<.05, but for the assistant p>.05. Which p value is correct? Are the results "significant"?

Here is another example of the correct use of p values. Two competing labs are pursuing the same type of research in a well established experimental paradigm. The two labs independently have the same idea for an identical experiment design, and the researchers go about collecting data. In one lab, they intend to collect data for a week, and they happen to get N=30. In the other lab, they intend to stop data collection when N=30. Moreover, by chance, the data in the two labs happen to be identical. (That's not so far fetched; e.g., perhaps the data are binary choices for each subject, and so the data can be summarized as the number of "heads" out of N subjects.) The two labs compute p values for their identical data sets. The two p values are different because the data were collected with different stopping intentions; in fact p<.05 for one lab but p>.05 for the other lab. Which lab got "significant" data?
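For readers who want to see the two labs' numbers come apart, here is a minimal sketch in Python. It rests on assumptions that are mine, not the article's: the data are binary with null value 0.5, the test statistic is the distance of the sample proportion from 0.5, and the "collect for a week" lab has sample sizes that follow a Poisson distribution with a mean of 30 (conditioned on getting at least one subject). The function names are hypothetical.

```python
import math

def binom_pmf(k, n, theta):
    """Probability of k heads in n trials when P(head) = theta."""
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

def tail_prob(dev, n, theta):
    """P(|Z/n - theta| >= dev) for Z ~ Binomial(n, theta)."""
    eps = 1e-12  # tolerance so floating-point ties count as "at least as extreme"
    return sum(binom_pmf(k, n, theta)
               for k in range(n + 1)
               if abs(k / n - theta) >= dev - eps)

def p_fixed_n(z, n, theta=0.5):
    """Two-tailed p value if the intention was to stop at exactly n subjects."""
    return tail_prob(abs(z / n - theta), n, theta)

def poisson_pmf(n, lam):
    """Poisson probability of n arrivals when the mean is lam."""
    return math.exp(n * math.log(lam) - lam - math.lgamma(n + 1))

def p_fixed_duration(z, n_obs, lam, theta=0.5, n_max=200):
    """Two-tailed p value if the intention was to stop after a fixed duration.

    Assumption (not from the article): the sample size is Poisson(lam),
    conditioned on n >= 1 and truncated at n_max.
    """
    dev = abs(z / n_obs - theta)
    ns = range(1, n_max + 1)
    weights = [poisson_pmf(n, lam) for n in ns]
    total = sum(weights)
    return sum(w / total * tail_prob(dev, n, theta)
               for n, w in zip(ns, weights))

if __name__ == "__main__":
    z, n = 21, 30  # identical data in both labs
    print("stop at N=30:     p =", p_fixed_n(z, n))
    print("stop after a week: p =", p_fixed_duration(z, n, lam=30.0))
```

With 21 heads out of 30, the fixed-N p value is about 0.043. The fixed-duration value comes out different for the very same data, because it averages over sample sizes that never occurred.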

The problem is that the data bear no signature of the stopping intention of the researcher. Indeed, researchers go out of their way to make sure that the data are not influenced by their stopping intention. Each datum collected is supposed to be completely insulated from any data collected before or after. The last datum collected has no trace that it was the last or the first or any position in between.

Not only is the intention opaque to the data, it's often opaque to the researcher. Collaborators on a project might have differing sampling intentions (as in the example of the director and assistant, above). Or the sampling intention might change midway through data collection. ("Let's collect N=100." Then on Friday afternoon, "Well, we've got N=94, that's good enough.") Or, as often happens, some subjects have to be deleted from the data set because of procedural errors or failure to respond, despite the data-collector's intention about sample size or stopping time. These are all very realistic sampling procedures, and we tolerate them because we know that the data are completely unaffected by the intention of the researcher.

Therefore it's strange that the interpretation of the data in terms of p values should depend crucially on something that has no impact on the data, namely, the stopping intention. But, the correct calculation of p values does depend on the stopping intention.

So, what should be done about this problem? Given the examples above, it seems clear that we should treat p values as inherently ill-defined because they depend on intentions that are irrelevant to the data.

But here is the #1 recommendation from the new article in Psychological Science:
Requirements for authors: ...
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.
Presumably this requirement is declared so that researchers can use the stopping rule to calculate the true and correct p value, determined appropriately for the idiosyncratic intentions of the researchers. Here's a plausible example. "We decided to collect 100 observations, but by Friday we had 94 and figured that was close enough, so we stopped, and then had to delete 5 subjects because of later-discovered transcription errors. A post-experiment inquisition of the lab team revealed that one of our assistants was covertly intending to quit the job on Monday, which would have limited our data collection if we had not decided to stop on Friday. Therefore, running a large Monte Carlo simulation that incorporated the estimated recruitment rate of subjects during the week, and the estimated probability of deciding to stop on Friday for different values of N achieved on Friday, and the estimated probability of transcription errors, and the probability of an assistant quitting on Monday, we determined that p=..."

To ease the reporting of true and correct p values, it would be extremely helpful to have critical values for commonly-used statistics (t, F, etc.) under various typical stopping intentions that researchers adopt. All the critical values in contemporary books and computer programs assume the unrealistic convention that N was fixed in advance. Instead, we should have tables of critical values for stopping after a certain duration, with varieties of sampling rates during that interval. (In fact, I've already seeded this process with an example in this article.) Then researchers could obtain the true p values for their data if they intended to stop after a particular duration.

It would also be helpful to have correction factors for unexpected interruptions of data collection, or unexpected windfalls of data collection. The researcher would merely have to enter the intended sample size and the probability of being interrupted or enhanced (and the probability of each magnitude of enhancement) at any particular time during data collection, and the correction factor would produce the true and correct p value for the data.

Federal funding agencies have been especially keen lately to support collaborative efforts of teams of researchers. It is likely that different team members may harbor different intentions about the data collection (perhaps covertly or subconsciously, but harbored nonetheless). Therefore it would be extremely useful to construct tables of true and correct critical values for cases of parallel mixed intentions, when one collaborator intends to stop at a fixed sample size, and the other collaborator intends to stop at the end of the week. Clearly, the construction of these tables should be a major funding priority for the granting agencies.

I look forward to a new industry of publications that reveal appropriate corrections for different sampling intentions. Fortunately, we already have a model of this industry in the extensive literature about correcting for false alarms in multiple comparisons. Depending on the types of intended comparisons, and whether the intentions are planned or post-hoc, we have a cornucopia of corrections for every possible set of intended comparisons. Unfortunately, all of those corrections for multiple comparisons have been based on the sampling assumption that data collection stopped at fixed sample size! Therefore, every one of the corrections for multiple comparisons will have to be reworked for different stopping intentions. It will be a great day for science when we have a complete set of corrections for all the various intentions regarding multiple comparisons and intentions regarding stopping collection of data, because then we will know true and correct p values for our data, which were completely insulated from those intentions.

Oops! Sorry, I slipped into sarcasm. But hey, it's my blog. I should reiterate that I agree with many of the points made by the authors of the Psychological Science article. Just not the point about p values and stopping intentions. And one more point, regarding a different analysis method that the authors dismissed as a "nonsolution." Here is the relevant excerpt:
Nonsolutions: ...
Using Bayesian statistics. We have a similar reaction to calls for using Bayesian rather than frequentist approaches to analyzing experimental data (see, e.g., Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). Although the Bayesian approach has many virtues, it actually increases researcher degrees of freedom. First, it offers a new set of analyses (in addition to all frequentist ones) that authors could flexibly try out on their data. Second, Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by case basis, providing yet more researcher degrees of freedom.
It's important to be clear that statistical analysis of any kind can only deal with the data it's given. If the design and procedure garner biased data, no analysis can fully undo that bias. Garbage in, garbage out. If the problem of "too many researcher degrees of freedom" stems from design-and-procedure problems, then it needs design-and-procedure solutions. To say that Bayesian analysis is a nonsolution to a design-and-procedure problem is like saying that a meat grinder is a nonsolution to rancid meat [struck out: (and that therefore the meat grinder is useless)] (and leaving the reader to make the leap that therefore the meat grinder is useless).1

The authors argue that Bayesian analysis "increases researcher degrees of freedom" (and is therefore bad) in two ways. First, "it offers a new set of analyses (in addition to all frequentist ones)". The tacit assumption of this statement seems to be that researchers would try frequentist and Bayesian approaches and just report the one that gave the most flattering conclusion. No, this wouldn't fly. Bayesian analyses provide the most complete inferential information given the data (in the normative mathematical sense), and analysts can't just slip into frequentist mode because it's flattering. In fact, reporting a p value is embarrassing, not flattering, because p values are ill defined.

Second, say the authors, "Bayesian statistics require making additional judgments (e.g., the prior distribution)...". Ah, the bogey-man of priors is trotted out to scare the children, as if priors can be capriciously set to anything the analyst wants (insert sounds of mad, wicked laughter here) and thereby predetermine the conclusion. In actuality, priors are overt and explicitly agreeable to a skeptical scientific audience. Typically they are set to be noncommittal so that they have minimal influence on the posterior distribution. When there is considerable previous research to inform a prior, then a strong prior can give great inferential leverage to small samples. And not using strong prior information when it is available can be a serious blunder; consider random drug testing and disease diagnosis, which must take into account the base rates, i.e., the priors.
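Both claims about priors can be checked with a few lines of arithmetic. The sketch below is illustrative only: the data (21 heads in 30), the three mildly different Beta priors, and the diagnostic-test numbers (base rate 0.001, hit rate 0.99, false-alarm rate 0.05) are all hypothetical values I chose, and the function names are mine.

```python
def posterior_mean(z, n, a, b):
    """Posterior mean of a proportion after z successes in n trials,
    starting from a Beta(a, b) prior (conjugate Beta-binomial update)."""
    return (a + z) / (a + b + n)

def prob_disease_given_positive(base_rate, hit_rate, false_alarm_rate):
    """Bayes' rule for a diagnostic test: P(disease | positive result)."""
    true_pos = base_rate * hit_rate
    false_pos = (1 - base_rate) * false_alarm_rate
    return true_pos / (true_pos + false_pos)

if __name__ == "__main__":
    # Noncommittal priors: three different vague choices, same data.
    for a, b in [(1, 1), (0.5, 0.5), (2, 2)]:
        print(f"Beta({a},{b}) prior -> posterior mean",
              round(posterior_mean(21, 30, a, b), 4))

    # Strong prior information: ignoring the base rate would be a blunder.
    print("P(disease | positive) =",
          round(prob_disease_given_positive(0.001, 0.99, 0.05), 4))
```

The three vague priors give posterior means of roughly 0.69, 0.69, and 0.68, so the choice among them hardly matters. In the diagnosis case, by contrast, a positive result from a seemingly accurate test still leaves the probability of disease near 2%, which is exactly the information the base rate supplies.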

Bayesian analysis does, in fact, give analysts more flexibility than traditional frequentist analysis. It gives the analyst the flexibility to use a model that actually describes the trends and distributions represented in the data, instead of being shoe-horned into linear models and normal distributions that may have little resemblance to the data. (Of course, if the analyst wants linear models with normal distributions, Bayesian analysis provides the richest possible inference about their parameters without ever computing a p value.) With Bayesian analysis, researchers can actually get useful parametric descriptions of complex data, involving multi-level non-linear models with non-normal distributions at various levels throughout the model. This is the flexibility that scientific theorizing needs.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, first published online Oct. 17, 2011. DOI: 10.1177/0956797611417632

1 Revision to parenthetical remark made on Oct. 23, 2011, in response to personal communication from Uri Simonsohn. Struck-out version was in the original post, which could have inadvertently been interpreted as saying that the authors themselves explicitly made the meat-grinder-is-useless argument. They did not.


  1. Thanks as always for your insight and wit! Always a pleasure reading your posts.

  2. The p values are always computed under the "as if" kind of assumption. When doing bayesian analysis, the posterior is computed under the "as if" kind of assumption too. In the first (frequentist p values) case we act as if the N obtained was intended, the model was true, etc. In the second case we act as if the model was true, the prior reflected our a priori belief, etc. These "as if" assumptions are never true in any realistic setting, because the model is never true (it is only an approximation, and in fact no one has come up with a coherent idea of how to express the error of this approximation formally as far as I know) and we can never be sure that the prior distribution reflects our prior belief without any distortions. But this means that the intention to collect N data points or a week's worth of data points does not play such an important role as you suggest. What really matters is stopping after seeing a portion of the data, but this does not seem to be your point. One could argue that the indifference of bayesian analysis to this kind of stopping rule is an obvious flaw, since one can just go on and on until the p value is close to what one wishes it to be.

  3. I'm sympathetic to the basic points you're making, but I'm not convinced that the stopping intention is as important as this post makes it out to be, at least in cases where the researcher stops after a fixed number of points.

    If I understand your argument correctly, it's that the distribution of the t (or whatever) statistic depends on the distribution of n, which in turn depends on the researcher's data-collection policies/intentions. If one's goal from an NHST perspective is to keep false positive results to a certain low rate, then isn't that achieved by saying: "I'll pick n from an arbitrary distribution, draw n samples, and conduct a test that has a false-positive rate of alpha conditional on n"? After all, if the conditional-on-n false positive rate is always low enough, wouldn't the marginal rate be as well? The case that Simmons et al. point out is rather different (and much more problematic), with multiple tests being conducted.

  4. Dear Anonymous October 21, 2011, 6:56 PM:

    The problem of ill-defined p values, illustrated in the blog post, is just an elaboration of points made previously by statisticians, for example in the often-cited article: Berger, J. O. & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76(2), 159-165.

    I am keen to know of publications that take up the condition-on-N approach that you suggest. Do you know of any? Thanks in advance!

  5. In what place exactly does the notion of the researcher's intention appear in the frequentist statistical theory defining the p value?

  6. Dear Autor (Borys):

    The p value is the probability of obtaining the observed sample statistic (e.g., the observed t value), or a value more extreme, if the null hypothesis were true and the intended sampling process were repeated ad infinitum. Many "book" definitions leave out the tacit assumption that the intended sampling process is to stop at a fixed N. For a very accessible example, see Berger, J. O. & Berry, D. A. (1988). Statistical analysis and the illusion of objectivity. American Scientist, 76(2), 159-165. Here's the PDF:

  7. I know what a p value is, or so I think. This definition you provide would be quite fine if not for the psychological term, i.e., "intended". This is the only term that stands out between "p value", "probability", "statistic" and "null hypothesis", doesn't it? In principle the psychological subtleties of the researcher's decision process can be taken into account, but this would make both the frequentist and the bayesian approach lose their appeal quickly.

    Perhaps the main reason people like p value is that 1) p value is simple to compute and, more importantly 2) p value is a 0-1 decision criterion (to reject, or not to reject).

    Imagine you want to create a bayesian equivalent of this device. Inevitably the issue of generally agreed upon threshold appears and there seems to be no other way than to use posterior quantiles, credible/HPD intervals, etc. But if you follow this path it turns out that most of the time bayesian and frequentist intervals (confidence interval and p value give equivalent information) are very similar when uninformative priors are used.

    I don't see how one can escape the difficulties associated with p values ***when trying to achieve the aim described in 1) and 2)*** within the bayesian framework.

  8. Dear Autor Oct. 27, 2011, 3:09 AM:

    There are lots of different issues raised in your comment, but I must be brief in my reply.

    The definition of p is based on a sampling distribution, and a sampling distribution requires repeated (imaginary/simulated) sampling, which means repeating the sampling process. To repeat the sampling process, we must specify what it means to generate a sample. Hence the involvement of the stopping intention.

    Regarding your item (1), p values are not, in general, simple to compute. For linear combinations of normally distributed data, then yes, there are analytical solutions for sampling distributions and p values. But for hierarchical, non-linear models involving non-normal distributions, p values are very difficult to compute. But Bayesian methods apply seamlessly even in such complex models.

    Regarding your item (2), the decision rule converts a continuous value into a discrete decision. In NHST, we convert a continuous p value to reject or fail to reject (undecided). In Bayesian, we convert a continuous posterior to reject, accept, or undecided, using HDI and ROPE (or thresholds on Bayes' factors in some cases). Often the discrete decisions of NHST and Bayes will agree, but not always. And in every case, the information provided by the Bayesian analysis is richer and complete, while the information in the NHST is impoverished and fickle. I don't have time and space to explain all that here -- please see the NHST chapter of the book, and summary articles I've written, and the references cited in the book and articles.

    The difficulties of p values are escaped in Bayesian analysis because sampling distributions are never used for making decisions about the actually obtained data.

  9. This comment has been removed by the author.

  10. Dear John

    True, we must specify what it means to generate a sample, but what forces us to specify this completely in advance? Not specifying it in advance certainly does not make the 'If this more or less accidentally chosen sample size was intended, the p value would be...' kind of argument invalid. But the same kind of arguments (let's call them 'arguments from hypothetical researcher' or 'hypothetical decision maker') has to be used by bayesian as well.

    True, p values are not easy to compute in every sense of the term (they are just easily provided by the software in some common cases) and in fact it is far from obvious how to compute them even for gaussian linear mixed models (last time I checked no one knew how many degrees of freedom these models have). But in reality when more complex models are considered posterior quantities are 1) nonreplicable because they are based on simulation 2) often prohibitively difficult to compute accurately (e.g., very slow convergence or poor mixing of the chains) or worse 3) it is in fact impossible to estimate the precision of posterior estimates based on simulations. It is hard to say how often the third possibility arises because no test exists that can tell us if the chain has converged! All we have are tests that can tell us if the chain isn't doing something extremely ugly. Anyway, it is not true that bayesian methods apply seamlessly in complex models. They do in special cases, sometimes in interesting special cases, if one is lucky enough.

    So you admit that p values and their bayesian analogues agree in some common situations. Sure, bayesian analysis gives much more information, but it does not even come close to the simplicity of the p value solution to the problem of general purpose discrete decision criterion - when using the bayesian solution one has to provide much more information, for example convergence diagnostics, simulation parameters, etc. All this makes the obtained 'bayesian p value' more prone to scepticism. When general purpose decision criterion is sought this kind of simplicity is among the most important selling points. The price is impoverishment of information, but the research community is apparently willing to pay this price.

    Finally, if one is willing to accept 'bayesian p values' some difficulties are inevitable: the uncertainty associated with estimates obtained by choosing N ***after*** peeking at the data is different than when N is fixed in advance for obvious reasons. The fact that sampling distribution is not used in bayesian analysis does not in itself constitute a solution to this kind of problems.

    On a side note - in theory modern statistical methods based on algorithmic complexity (MML, MDL, etc.) can deal with complex models as seamlessly as bayesian analysis, these methods provide the frequentist justification for various uninformative 'priors' and provide model selection criteria that take model complexity into account and at the same time are not associated with the problem of severe prior sensitivity arising when bayes factors are used. The problem of prior sensitivity disappears exactly because the priors are chosen and justified on frequentist grounds in these modern methods. Also, the utterly unrealistic assumption of the model being true, common to both the frequentist and the bayesian approaches, is absent here. I'd say these are among the reasons why Vapnik made that joke:

    I usually prefer bayesian approach, by the way ;-)

  11. Hi Borys.

    So we agree that p values are hard to compute (even assuming a fixed-N intention) even for modestly complex models.

    My previous comment was not meant to imply that Bayesian analysis is seamless for all, arbitrarily complex models. I meant for a lot of typically complex models. Let me attempt to find a statement we could agree on: Bayesian MCMC methods work well for a lot of usefully complex models, for which p values cannot be definitely determined.

    You say that Bayesian is not as "simple" as p values. But that's a matter of software, not inherent conceptual difficulty. Presumably there will be a day not too long from now when MCMC software will have internal checks and fixes for lack of convergence; e.g., it will automatically thin, automatically try standard parameter expansions, etc., for a library of standard models. From the user's perspective, it will be just as "simple" as a p value. But I agree that right now, with current software, p values are more simple. But if people transition to Bayesian methods for the not-so-simple models, it will be easier and easier to apply them to the simple models too.

    It would be great if someone wrote an accessible(!) book on MDL analogues of the generalized linear model (GLM).

  12. Hi John

    So we agree on some things practical. We agree that p values are often easier to obtain than 'bayesian p values' but bayesian approach will probably win this battle in the long run. I still don't see how can the bayesian approach escape the difficulties associated with p value as a discrete decision criterion, e.g., the problem of stopping after seeing the data and the problem of multiple comparisons. They can be addressed from the bayesian perspective, but certainly not escaped from.


    is a striking example of how seamless bayesian analysis can be for really interesting models, but you probably know this already.

    Perhaps the reason why no one wrote an accessible book on MDL for GLM is that most of the time when doing GLM analysis you end up with nested models and MDL (or bayes factor with flat priors) doesn't give any substantial improvements over likelihood ratio test.

  13. Here:

    is Cox and Mayo's explanation of why frequentist inference is not threatened by the random sample size experiment scenario you describe. I think I came up with an even simpler explanation which has additional interesting properties. Here is how it goes:

    When sample size is randomly chosen before the experiment, the whole situation can be represented by the causal diagram:

    E -> T and R -> T

    where E is the true effect size, T is the test statistic, R is the random mechanism that generates the sample size. Clearly R does not depend causally on E, R does not affect E, E and R do not have common causes - R being random. Now using elementary causal analysis it is trivial to show that conditioning on R gives the correct frequentist inference, simply because conditioning on R does not create a confound with respect to E -> T effect.

    It shows something very interesting, by the way. Apparently causal analysis can provide a notion of "relevance" that seems like a good explication of what conditionality principle, likelihood principle, etc are really about.

    I don't claim to be the first one to discover this, actually it looks like something someone smarter than me already thought about. I just thought it might be of interest to you.

    Borysław Paulewicz