A new article in Psychological Science† correctly points out flaws of p values and of procedures that produce bias in data selection and analysis. The article makes several reasonable conclusions. Unfortunately, it also makes two wrong conclusions about issues of fundamental importance with major ramifications. One conclusion is that p values are okay but need to be corrected for the researcher's stopping intention. I refute that claim by reductio ad absurdum. A second conclusion is that Bayesian analysis is a "nonsolution" to the problem of researchers having too much wiggle room. I dispute that claim by clarifying what problems any analysis method can or cannot address, by denying flexibility attributed to Bayesian analysis that isn't really available, and by claiming the actual flexibility in Bayesian analysis as an important advantage for scientific research.
The first issue stems from the fact, correctly pointed out in the article, that p values depend on the stopping intention of the data collector. Here's the idea. Suppose a researcher collects some data and computes a summary statistic such as t or F or χ². The p value is the probability of the observed statistic, or a value more extreme, in the space of possible data that might have been obtained if the null hypothesis were true and the intended experiment were repeated ad infinitum. The space of possible data that might have been obtained depends on the stopping intention. So, if the data collector intended to stop when the sample size N reached a certain number, such as 23, then the space of possible data includes all data sets for which N=23. But if the data collector intended to stop at the end of the week (and just happened to get N=23), then the space of possible data includes all data sets that could have been collected by the end of the week, some of which have N=23, and some of which have smaller N or larger N. Because the two spaces of possible data sets are not the same, the p values are not the same. The p value can depend quite dramatically on the stopping intention. For example, if the researcher intended to stop when N=100 but was unexpectedly interrupted when N=23, the p value is much smaller than if the intention was to stop when N=23. Or, if the researcher intended to stop when N=23 but got an unexpected windfall of data so that N=100, perhaps because a new volunteer assistant showed up, then the p value is much larger than if the researcher intended to stop at N=100. Therefore, to correctly determine the p value for a set of data, we must know the reason that data collection ceased.
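In symbols (my notation, not the article's), for a statistic whose large values count as extreme, the dependence can be written as

$$ p \;=\; \Pr\!\big(\, T(D_{\mathrm{sim}}) \ge T(D_{\mathrm{obs}}) \;\big|\; H_0,\; I \,\big), $$

where T is the summary statistic, D_obs is the observed data, and D_sim ranges over the space of data sets that could have been generated under the null hypothesis H0 and the stopping intention I. Change I and you change the space of D_sim, hence the p value, even though D_obs is untouched.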
Here is an example of the correct use of p values. A lab director has some research in mind and decides that N=30 is adequate. The director tells the lab administrator to collect data from 30 subjects. The administrator knows that the lab typically recruits about 30 subjects in a week, and therefore tells the data-collecting assistant to run subjects for a week. The assistant dutifully collects data, and at the end of the week the lab director happens to be present as the last datum is collected, which happens to bring N to 30. As far as the lab director can tell, data collection ceased intentionally when N=30. When the lab director analyzes the data, under the intention of stopping when N=30, a particular p value is computed. But when the assistant analyzes the data, under the intention of stopping at the end of the week, a different p value is computed. In fact, for the lab director p<.05, but for the assistant p>.05. Which p value is correct? Are the results "significant"?
Here is another example of the correct use of p values. Two competing labs are pursuing the same type of research in a well-established experimental paradigm. The two labs independently have the same idea for an identical experimental design, and the researchers go about collecting data. In one lab, they intend to collect data for a week, and they happen to get N=30. In the other lab, they intend to stop data collection when N=30. Moreover, by chance, the data in the two labs happen to be identical. (That's not so far-fetched; e.g., perhaps the data are binary choices for each subject, and so the data can be summarized as the number of "heads" out of N subjects.) The two labs compute p values for their identical data sets. The two p values are different because the data were collected with different stopping intentions; in fact, p<.05 for one lab but p>.05 for the other lab. Which lab got "significant" data?
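To make the two-labs scenario concrete, here is a minimal sketch (my illustration, not a computation from the article) of how identical data can yield different p values under the two stopping intentions. The particular numbers (21 "heads" out of N=30 subjects), the one-tailed test, and the Poisson model of weekly recruitment are hypothetical assumptions for the sake of the example.

```python
# Hypothetical data: z_obs "heads" out of N_obs subjects; null hypothesis theta = 0.5.
# The test statistic is the sample proportion of "heads."
from scipy.stats import binom, poisson

z_obs, N_obs = 21, 30      # assumed data, identical in both labs
theta_null = 0.5           # null-hypothesis probability of "heads"

# Fixed-N lab: intended to stop when N = 30.
# One-tailed p value: probability of z_obs or more "heads" out of N_obs under the null.
p_fixed_N = binom.sf(z_obs - 1, N_obs, theta_null)

# Fixed-duration lab: intended to stop at the end of the week, so N is random.
# Assume (hypothetically) N ~ Poisson(lam); the p value averages the tail probability
# of the observed proportion over the sample sizes that a week could have produced.
lam = 30                   # assumed mean number of subjects recruited per week
p_fixed_duration = 0.0
for n in range(1, 200):                              # truncate the Poisson sum
    z_crit = -((-z_obs * n) // N_obs)                # smallest z with z/n >= z_obs/N_obs
    p_fixed_duration += poisson.pmf(n, lam) * binom.sf(z_crit - 1, n, theta_null)

print(f"fixed-N intention:        p = {p_fixed_N:.4f}")
print(f"fixed-duration intention: p = {p_fixed_duration:.4f}")
```

The two printed values differ even though the data fed to both computations are identical; with numbers near the conventional threshold they can land on opposite sides of .05, which is the situation described above.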
The problem is that the data bear no signature of the stopping intention of the researcher. Indeed, researchers go out of their way to make sure that the data are not influenced by their stopping intention. Each datum collected is supposed to be completely insulated from any data collected before or after. The last datum collected has no trace that it was the last or the first or any position in between.
Not only is the intention opaque to the data, it's often opaque to the researcher. Collaborators on a project might have differing sampling intentions (as in the example of the director and assistant, above). Or the sampling intention might change midway through data collection. ("Let's collect N=100." Then on Friday afternoon, "Well, we've got N=94, that's good enough.") Or, as often happens, some subjects have to be deleted from the data set because of procedural errors or failure to respond, despite the data-collector's intention about sample size or stopping time. These are all very realistic sampling procedures, and we tolerate them because we know that the data are completely unaffected by the intention of the researcher.
Therefore it's strange that the interpretation of the data in terms of p values should depend crucially on something that has no impact on the data, namely, the stopping intention. But the correct calculation of p values does depend on the stopping intention.
So, what should be done about this problem? Given the examples above, it seems clear that we should treat p values as inherently ill-defined, because they depend on intentions that are irrelevant to the data.
But here is the #1 recommendation from the new article in Psychological Science:
Requirements for authors: ...
1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.
Presumably this requirement is declared so that researchers can use the stopping rule to calculate the true and correct p value, determined appropriately for the idiosyncratic intentions of the researchers. Here's a plausible example: "We decided to collect 100 observations, but by Friday we had 94 and figured that was close enough, so we stopped, and then had to delete 5 subjects because of later-discovered transcription errors. A post-experiment inquisition of the lab team revealed that one of our assistants was covertly intending to quit the job on Monday, which would have limited our data collection if we had not decided to stop on Friday. Therefore, running a large Monte Carlo simulation that incorporated the estimated recruitment rate of subjects during the week, and the estimated probability of deciding to stop on Friday for different values of N achieved on Friday, and the estimated probability of transcription errors, and the probability of an assistant quitting on Monday, we determined that p=..."
To ease the reporting of true and correct p values, it would be extremely helpful to have critical values for commonly used statistics (t, F, etc.) under the various typical stopping intentions that researchers adopt. All the critical values in contemporary books and computer programs assume the unrealistic convention that N was fixed in advance. Instead, we should have tables of critical values for stopping after a certain duration, with varieties of sampling rates during that interval. (In fact, I've already seeded this process with an example in this article.) Therefore researchers could obtain the true p values for their data if they intended to stop after a particular duration.
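As a concrete illustration of what one entry in such a table might involve, here is a minimal Monte Carlo sketch (my own, not the example in the article linked above). It tabulates a one-tailed critical proportion for a stop-at-the-end-of-the-week intention, assuming, hypothetically, that weekly recruitment is Poisson-distributed with a mean of 30 subjects and that the data are binary with null probability 0.5.

```python
# Monte Carlo sketch of a critical value under a stop-after-a-week intention.
# Assumptions (hypothetical): N ~ Poisson(30) per week, binary data with null
# probability 0.5, and the sample proportion as the test statistic.
import numpy as np

rng = np.random.default_rng(seed=1)
lam, theta_null, n_sims = 30, 0.5, 100_000

props = np.empty(n_sims)
for i in range(n_sims):
    n = max(rng.poisson(lam), 1)        # sample size produced by this simulated week
    z = rng.binomial(n, theta_null)     # number of "heads" under the null hypothesis
    props[i] = z / n                    # test statistic for this simulated data set

critical_proportion = np.quantile(props, 0.95)   # exceeded in only 5% of null weeks
print(f"one-tailed 5% critical proportion: {critical_proportion:.3f}")
```

A real table would repeat this for every recruitment rate, duration, and statistic, which is exactly the industry the surrounding paragraphs are lampooning.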
It would also be helpful to have correction factors for unexpected interruptions of data collection, or unexpected windfalls of data collection. The researcher would merely have to enter the intended sample size and the probability of being interrupted or enhanced (and the probability of each magnitude of enhancement) at any particular time during data collection, and the correction factor would produce the true and correct p value for the data.
Federal funding agencies have been especially keen lately to support collaborative efforts of teams of researchers. It is likely that different team members may harbor different intentions about the data collection (perhaps covertly or subconsciously, but harbored nonetheless). Therefore it would be extremely useful to construct tables of true and correct critical values for cases of parallel mixed intentions, when one collaborator intends to stop at a fixed sample size and the other collaborator intends to stop at the end of the week. Clearly, the construction of these tables should be a major funding priority for the granting agencies.
I look forward to a new industry of publications that reveal appropriate corrections for different sampling intentions. Fortunately, we already have a model of this industry in the extensive literature about correcting for false alarms in multiple comparisons. Depending on the types of intended comparisons, and whether the intentions are planned or post-hoc, we have a cornucopia of corrections for every possible set of intended comparisons. Unfortunately, all of those corrections for multiple comparisons have been based on the sampling assumption that data collection stopped at a fixed sample size! Therefore, every one of the corrections for multiple comparisons will have to be reworked for different stopping intentions. It will be a great day for science when we have a complete set of corrections for all the various intentions regarding multiple comparisons and intentions regarding stopping collection of data, because then we will know the true and correct p values for our data, which were completely insulated from those intentions.
Oops! Sorry, I slipped into sarcasm. But hey, it's my blog. I should reiterate that I agree with many of the points made by the authors of the Psychological Science article. Just not the point about p values and stopping intentions.
And one more point, regarding a different analysis method that the authors dismissed as a "nonsolution." Here is the relevant excerpt:
Nonsolutions: ...
Using Bayesian statistics. We have a similar reaction to calls for using Bayesian rather than frequentist approaches to analyzing experimental data (see, e.g., Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). Although the Bayesian approach has many virtues, it actually increases researcher degrees of freedom. First, it offers a new set of analyses (in addition to all frequentist ones) that authors could flexibly try out on their data. Second, Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by-case basis, providing yet more researcher degrees of freedom.
It's important to be clear that statistical analysis of any kind can only deal with the data it's given. If the design and procedure garner biased data, no analysis can fully undo that bias. Garbage in, garbage out. If the problem of "too many researcher degrees of freedom" stems from design-and-procedure problems, then it needs design-and-procedure solutions. To say that Bayesian analysis is a nonsolution to a design-and-procedure problem is like saying that a meat grinder is a nonsolution to rancid meat (and leaving the reader to make the leap that therefore the meat grinder is useless).¹
The authors argue that Bayesian analysis "increases researcher degrees of freedom" (and is therefore bad) in two ways. First, "it offers a new set of analyses (in addition to all the frequentist ones)". The tacit assumption of this statement seems to be that researchers would try frequentist and Bayesian approaches and just report the one that gave the most flattering conclusion. No, this wouldn't fly. Bayesian analyses provide the most complete inferential information given the data (in the normative mathematical sense), and analysts can't just slip into frequentist mode because it's flattering. In fact, reporting a p value is embarrassing, not flattering, because p values are ill-defined.
Second, say the authors, "Bayesian statistics require making additional judgments (e.g., the prior distribution)...". Ah, the bogey-man of priors is trotted out to scare the children, as if priors can be capriciously set to anything the analyst wants (insert sounds of mad, wicked laughter here) and thereby predetermine the conclusion. In actuality, priors are overt and explicitly agreeable to a skeptical scientific audience. Typically they are set to be noncommittal so that they have minimal influence on the posterior distribution. When there is considerable previous research to inform a prior, then a strong prior can give great inferential leverage to small samples. And not using strong prior information when it is available can be a serious blunder; consider random drug testing and disease diagnosis, which must take into account the base rates, i.e., the priors.
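Here is a minimal numerical sketch of the base-rate point; the prevalence, sensitivity, and specificity figures below are hypothetical, not taken from the article.

```python
# P(disease | positive test) via Bayes' rule: the prior (base rate) matters enormously.
def posterior_prob_disease(prevalence, sensitivity, specificity):
    p_pos_given_disease = sensitivity               # P(+ | disease)
    p_pos_given_healthy = 1.0 - specificity         # P(+ | no disease)
    p_pos = prevalence * p_pos_given_disease + (1.0 - prevalence) * p_pos_given_healthy
    return prevalence * p_pos_given_disease / p_pos

# Same test accuracy, two different base rates (priors):
print(posterior_prob_disease(prevalence=0.001, sensitivity=0.99, specificity=0.95))  # about 0.02
print(posterior_prob_disease(prevalence=0.10,  sensitivity=0.99, specificity=0.95))  # about 0.69
```

Ignoring the prior and reporting only the test's accuracy would badly mislead in the low-prevalence case.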
Bayesian analysis does, in fact, give analysts more flexibility than traditional frequentist analysis. It gives the analyst the flexibility to use a model that actually describes the trends and distributions represented in the data, instead of being shoe-horned into linear models and normal distributions that may have little resemblance to the data. (Of course, if the analyst wants linear models with normal distributions, Bayesian analysis provides the richest possible inference about their parameters without ever computing a p value.) With Bayesian analysis, researchers can actually get useful parametric descriptions of complex data, involving multi-level non-linear models with non-normal distributions at various levels throughout the model. This is the flexibility that scientific theorizing needs.
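To make the contrast concrete, here is a minimal sketch (mine, reusing the same hypothetical 21-of-30 data from the earlier sketch) of Bayesian inference for a binomial proportion with a noncommittal Beta(1,1) prior. The posterior is the same whether data collection stopped at a fixed N or at the end of the week, because it depends only on the data actually obtained.

```python
# Posterior for a binomial proportion theta, given hypothetical data z = 21 "heads"
# out of N = 30 subjects, with a noncommittal uniform Beta(1, 1) prior.
from scipy.stats import beta

z, N = 21, 30
posterior = beta(1 + z, 1 + (N - z))        # conjugate Beta-binomial update

lo, hi = posterior.ppf([0.025, 0.975])      # central 95% credible interval
print(f"posterior mean = {posterior.mean():.3f}")
print(f"95% credible interval = [{lo:.3f}, {hi:.3f}]")
```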
† Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, first published online Oct. 17, 2011. DOI: 10.1177/0956797611417632
¹ Revision to the parenthetical remark, made on Oct. 23, 2011, in response to personal communication from Uri Simonsohn. The original post's version of the remark could have inadvertently been interpreted as saying that the authors themselves explicitly made the meat-grinder-is-useless argument. They did not.