This post argues that data collection should stop when a desired degree of precision is achieved (as measured by a Bayesian credible interval), not when a critical p value is achieved, not when a critical Bayes factor is achieved, and not even when a Bayesian highest density interval (HDI) excludes a region of practical equivalence (ROPE) to the null value.
Update: For an expanded follow-up talk (March 14, 2014, at U.C. Irvine), see:
It is a well-known fact of null-hypothesis significance testing (NHST) that when there is "optional stopping" of data collection with testing at every new datum (a procedure also called "sequential testing" or "data peeking"), then the null hypothesis will eventually be rejected even when it is true. With enough random sampling from the null hypothesis, eventually there will be some accidental coincidence of outlying values so that p < .05 (conditionalizing on the current sample size). Anscombe (1954) called this phenomenon, "sampling to reach a foregone conclusion."
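Here is a minimal R sketch of that phenomenon. It is not the code behind the figures below; the number of simulated sequences and the cap on N are arbitrary illustrative choices.

```r
# Minimal sketch: estimate the false-alarm rate when a fair coin (theta = 0.50)
# is tested after every flip and sampling stops at the first p < .05.
set.seed(1)
nSeq <- 2000     # simulated experimenters
maxN <- 1500     # give up after this many flips
falseAlarm <- logical(nSeq)
for (i in 1:nSeq) {
  z <- 0  # running count of heads
  for (N in 1:maxN) {
    z <- z + rbinom(1, 1, 0.50)
    # Exact two-sided binomial p value; the factor of 2 is valid because the
    # null value of 0.50 makes the binomial distribution symmetric.
    p <- min(1, 2 * pbinom(min(z, N - z), N, 0.50))
    if (p < .05) { falseAlarm[i] <- TRUE; break }
  }
}
mean(falseAlarm)   # well above .05, and it keeps climbing as maxN is increased
```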
Bayesian methods do not suffer from this problem, at least not to the same extent. Using either Bayesian HDI with ROPE, or a Bayes factor, the false alarm rate asymptotes at a level far less than 100% (e.g., 20-25%). In other words, using Bayesian methods, the null hypothesis is accepted when it is true, even with sequential testing of every datum, perhaps 75-80% of the time.
But not all is well with the Bayesian methods. Of the two Bayesian methods, the Bayes-factor method is far too eager to accept the null when it is not true. And both Bayesian methods, like the p-value method, give biased estimates of the parameter value when the null is not true, because they stop when extreme values are (randomly) obtained.
The proposed solution to the problem of biased estimates is to stop when a desired degree of precision is achieved, regardless of what it implies about the null hypothesis. This is standard procedure in political polling, in which sampling is designed to achieve a desired confidence bound on the estimate (e.g., "plus or minus 3 percentage points"), not to argue for one extreme or the other. I previously mentioned this proposal in a video (at 6:45 minutes) and alluded to it in an article (pp. 579-580) and in the book (Doing Bayesian Data Analysis; goals for power analysis, e.g., pp. 320-321), but surely I am not the first to suggest this; please let me know of precursors.
What follows is a series of examples of sequential testing of coin flips, using four different stopping criteria. The underlying bias of the coin is denoted θ (theta). Here are the four stopping rules:
- NHST: For every new flip of the coin, stop and reject the null hypothesis, that θ=0.50, if p < .05 (two-tailed, conditionalizing on the current N), otherwise flip again.
- Bayes factor (BF): For every flip of the coin, conduct a Bayesian model comparison of the null hypothesis that θ=0.50 against the alternative hypothesis that there is a uniform prior on θ. If BF > 3, accept the null and stop. If BF < 1/3, reject the null and stop. Otherwise flip again.
- Bayesian HDI with ROPE: For every flip of the coin, compute the 95% HDI on θ. If the HDI is completely contained in a ROPE from 0.45 to 0.55, stop and accept the null. If the HDI falls completely outside the ROPE, stop and reject the null. Otherwise flip again.
- Precision: For every flip of the coin, compute the 95% HDI on θ. If its width is less than 0.08 (0.8 times the width of the ROPE), stop; otherwise flip again. Once stopped, check whether the null can be accepted or rejected according to the HDI-with-ROPE criteria. (A sketch of all four decision rules appears just below this list.)
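To make the decision logic concrete, here is a minimal R sketch of the four rules applied to one simulated sequence of flips. It is not the code used to generate the figures below; it assumes a uniform Beta(1,1) prior on θ, the ROPE of 0.45 to 0.55, and a numerically located HDI.

```r
# 95% HDI of a Beta(a, b) posterior, found by minimizing the interval width.
# Assumes a unimodal posterior.
betaHDI <- function(a, b, mass = 0.95) {
  width <- function(lowTail) qbeta(lowTail + mass, a, b) - qbeta(lowTail, a, b)
  lo <- optimize(width, interval = c(0, 1 - mass))$minimum
  c(qbeta(lo, a, b), qbeta(lo + mass, a, b))
}

# Bayes factor in favor of the null (theta = 0.50) against the uniform-prior
# alternative, for z heads in N flips: p(D|H1) = B(z+1, N-z+1).
bf01 <- function(z, N) exp(N * log(0.5) - lbeta(z + 1, N - z + 1))

ropeLo <- 0.45; ropeHi <- 0.55

oneSequence <- function(theta, rule, maxN = 5000) {
  z <- 0
  for (N in 1:maxN) {
    z <- z + rbinom(1, 1, theta)   # flip the coin
    decision <- switch(rule,
      nhst = if (min(1, 2 * pbinom(min(z, N - z), N, 0.5)) < .05) "reject",
      bf   = { b <- bf01(z, N)
               if (b > 3) "accept" else if (b < 1/3) "reject" },
      rope = { hdi <- betaHDI(z + 1, N - z + 1)
               if (hdi[1] > ropeLo && hdi[2] < ropeHi) "accept"
               else if (hdi[2] < ropeLo || hdi[1] > ropeHi) "reject" },
      prec = { hdi <- betaHDI(z + 1, N - z + 1)
               if (diff(hdi) < 0.08) {
                 if (hdi[1] > ropeLo && hdi[2] < ropeHi) "accept"
                 else if (hdi[2] < ropeLo || hdi[1] > ropeHi) "reject"
                 else "undecided"
               } })
    if (!is.null(decision)) return(list(decision = decision, N = N, propHeads = z / N))
  }
  list(decision = "undecided", N = maxN, propHeads = z / N)
}

oneSequence(theta = 0.65, rule = "prec")
```

Looping oneSequence() over many simulated sequences, and tabulating the decisions and the proportions of heads at stopping, yields the kinds of curves and histograms shown in the figures.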
The lower panels in each figure show a histogram of the 1,000 sample proportions of heads at the moment the sequence was stopped. The solid triangle marks the true underlying value of θ, and the outline triangle marks the mean of the 1,000 sample proportions at stopping. If the sample proportion at stopping were unbiased, the outline triangle would sit on top of the solid triangle.
Figure 1.
The second column of Figure 1 shows the results from using the Bayes-factor (BF) stopping rule. When the null hypothesis is true, it shows similar behavior to the HDI-with-ROPE stopping rule, but with smaller sample sizes. In fact, it can make decisions with very small sample sizes that yield very uncertain estimates of theta.
The third column of Figure 1 shows the results from using the HDI-with-ROPE stopping rule. You can see that there is some false rejection of the null for small sample sizes, but the false alarms soon asymptote. When the sample size gets big enough so the HDI is narrow enough to fit in the ROPE, the remaining sequences all eventually accept the null.
The fourth column of Figure 1 shows the results of using the precision stopping rule. Because the desired precision is a width of 0.08, a fairly large sample is needed. Once the desired HDI width is attained, it is compared with the ROPE to make a decision. The curves extend (misleadingly) farther right over larger N even though data collection has stopped.
Figure 2.
Figure 3.
The same remarks apply when θ=0.65, as shown in Figure 3. Notice that the BF still accepts the null almost 40% of the time! Only the stop-at-critical-precision method does not overestimate θ.
Figure 4.
Discussion:
"Gosh, I see that the stop-at-critical-precision method does not bias the estimate of the parameter, which is nice, but it can take a ton of data to achieve the desired precision. Yuck!" Oh well, that is an inconvenient truth about noisy data -- it can take a lot of data to cancel out the noise. If you want to collect less data, then reduce the noise in your measurement process.
"Can't the method be gamed? Suppose the data collector actually collects until the HDI excludes the ROPE, notes the precision at that point, and then claims that the data were collected until the precision reached that level. (Or, the data collector rounds down a little to a plausibly pre-planned HDI width and collects a few more data values until reaching that width, repeating until the HDI still excludes the ROPE.)" Yup, that's true. But the slipperiness of the sampling intention is the fundamental complaint against all frequentist p values, which change when the sampling or testing intentions change. The proportions being plotted by the curves in these figures are p values -- just p values for different stopping intentions.
"Can't I measure precision with a frequentist confidence interval instead of a Bayesian posterior HDI?" No, because a frequentist confidence interval depends on the stopping and testing intention. Change, say, the testing intention --e.g., there's a second coin you're testing-- and the confidence interval changes. But not the Bayesian HDI.
The key point: If the sampling procedure, such as the stopping rule, biases the data in the sample, then the estimation can be biased, whether it's Bayesian estimation or not. A stopping rule based on getting extreme values will automatically bias the sample toward extreme estimates, because once some extreme values show up by chance, sampling stops. A stopping rule based on precision will not bias the sample unless the measure of precision depends on the value of the parameter (which actually is the case here, just not very noticeably for parameter values that aren't very extreme).
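To see that bias mechanism in miniature, the oneSequence() sketch from earlier in the post can be reused to compare the average sample proportion at stopping under the HDI-with-ROPE rule and under the precision rule:

```r
# Reusing the oneSequence() sketch from above: average sample proportion at
# stopping under the HDI-with-ROPE rule vs. the precision rule, when the true
# bias is theta = 0.65. (Small number of replications, just for illustration.)
set.seed(2)
ropeStops <- replicate(200, oneSequence(0.65, "rope")$propHeads)
precStops <- replicate(200, oneSequence(0.65, "prec")$propHeads)
c(rope = mean(ropeStops), precision = mean(precStops))
# The rope-rule mean tends to overshoot 0.65; the precision-rule mean does not.
```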
Nice graphs and idea!
Isn't it a problem that generally you actually would want to stop as early as possible when the evidence favours one of your hypotheses, not as soon as you have a certain precision? I'm thinking of a drug-trial type of situation where, if the drug is not effective, it would be unethical to continue the trials. By the way, isn't the precision stopping rule very similar to a rule where you stop after a certain sample size?
What I've been thinking about is: Wouldn't you want to incorporate the stopping process in the model? After each sample, instead of calculating the odds for H0/H1 you would calculate the odds that accepting H0/H1 would be the correct decision conditional on the stopping rule you are using. Now, I must admit that I'm not sure if this idea is actually total nonsense...
Straw man. If you're only going to stop (NHST) when you reject the null, of course you're only going to stop when you reject the null.
A fairer comparison would be a proper sequential test, in the style of Wald's book from a long time ago. (This involves taking a sample and deciding whether to (a) stop and reject, (b) stop and not reject, or (c) continue, set up in such a way that the probabilities of type I and type II errors are approximately some chosen values.)
Thank you for your comments! Here are some replies:
> Isn't it a problem that generally you actually would want to stop as early as possible when the evidence favours one of your hypotheses, not as soon as you have a certain precision? I'm thinking of a drug-trial type of situation where, if the drug is not effective, it would be unethical to continue the trials.
Right, I'm on board with adaptive design, but the question is whether it leads to biased estimates.
Consider the case of political polling. Should we sample until achieving precision, or should we sample until showing one candidate is ahead of the other? I think people would get upset if pollsters did the latter, because it intuitively (and actually) leads to bias.
> By the way, isn't the precision stopping rule very similar to a rule where you stop after a certain sample size?
Similar but not the same. Stopping at a particular size does not necessarily achieve any particular degree of precision, because it depends on the noise in the data. Setting a precision criterion achieves that degree of precision regardless of the noise.
> Straw man. If you're only going to stop (NHST) when you reject the null, of course you're only going to stop when you reject the null.
Yes, it might seem obvious, but a lot of researchers actually proceed this way. They collect some data, test, collect some more data, test, collect some more data, test, and stop when they can say p<.05 (conditionalizing on N, not on the entire sequence of tests or the possibility of collecting more data).
But the main point was not to point out the problems of sequential testing in NHST (as that's been done long ago). The main point was about bias in estimation.
> A fairer comparison would be a proper sequential test, in the style of Wald's book from a long time ago. (This involves taking a sample and deciding whether to (a) stop and reject, (b) stop and not reject, or (c) continue, set up in such a way that the probabilities of type I and type II errors are approximately some chosen values.)
The "proper" frequentist sequential tests provide various schedules and critical levels for achieving an overall false alarm rate of .05 (and with type II rates as you say). But do they produce unbiased estimates? I doubt it, but would be happy to be informed otherwise.
Very thought-provoking post. The idea of using precision to inform stopping rules is appealing, but it seems to me that bias is but one factor influencing how we collect data and when we stop doing so.
Depending on your field, the cost of collecting another datum can be high. In some cases, at least, it can be higher than the value of the incrementally increased precision, I would think.
And as much as we might like to decrease the noise in our measurement process, that may or may not be possible, or the cost of doing so may outweigh the benefits.
I'd be interested to hear your thoughts on the role of bias and precision in data collection in designs that call for multilevel models, i.e., models in which shrinkage induces bias while decreasing the variance of parameter estimates. While setting a precision-based criterion in such cases may well be justified, the justification will have to rely on something other than minimal bias, I would think.
Since you asked about precursors, I should mention Ken Kelley, who argued that when we conduct power analysis we should not be concerned with the power to reject the null, but with the precision of the confidence interval. Much of this is implemented in his MBESS R package.
I know... power analysis is not the same as a stopping rule, but one could also apply Kelley's proposal to say that we can stop collecting data when our CI has reached a certain pre-specified precision (narrowness).
It appears that this would be a frequentist equivalent to your proposed stopping rule based on the Bayesian credible interval. Would love to see how they compare to each other. I would think (?) that with a flat prior these two methods should be quite similar, in the sense that they would recommend stopping data collection at similar n's.
Oh - and also related, Felix Schoenbrodt had a small paper on when the correlation coefficient reaches a certain precision - a fun, quick read:
http://www.nicebread.de/at-what-sample-size-do-correlations-stabilize/
John, very interesting. It would be great to hear what some of the stats community thinks of the pros and cons.
ReplyDeleteThanks for the additional comments. Some replies:
> Very thought-provoking post. The idea of using precision to inform stopping rules is appealing, but it seems to me that bias is but one factor influencing how we collect data and when we stop doing so. Depending on your field, the cost of collecting another datum can be high. In some cases, at least, it can be higher than the value of the incrementally increased precision, I would think. And as much as we might like to decrease the noise in our measurement process, that may or may not be possible, or the cost of doing so may outweigh the benefits.
Right, a full treatment could involve all the machinery of Bayesian decision theory. We could assign utilities (i.e., costs and benefits) to correct and wrong decisions, to magnitudes of mis-estimation, and to sampling of more data. We could then adjust the policies within each procedure to maximize the utility for whatever world we hypothesize, and then choose the procedure and policy that maximizes hypothetical utility. Full-fledged Bayesian decision theory is especially relevant to high-stakes situations (e.g., medical).
> I'd be interested to hear your thoughts on the role of bias and precision in data collection in designs that call for multilevel models, i.e., models in which shrinkage induces bias while decreasing the variance of parameter estimates. While setting a precision-based criterion in such cases may well be justified, the justification will have to rely on something other than minimal bias, I would think.
It depends on which parameters you are focusing on for the precision criterion, and on whether you really consider shrinkage to be "bias" in the sense of being wrong. Is bias from shrinkage wrong? It's no more wrong than any bias induced by an informed prior. Shrinkage of a low-level parameter is caused by its higher-level distribution being informed by the other data at that level, so the higher level acts as an informed prior. (Imagine that the other data had literally been collected and included in the analysis earlier; then it would literally be an informed prior.) Is that "wrong"? It's merely the rational and mathematically correct consequence of the assumed model for describing the data. The higher-level parameter is estimated more precisely when there is shrinkage of lower-level parameters, which is good if that's the focus of the research (as it often is).
[continued in next comment...]
> Since you asked about precursors, I should mention Ken Kelley, who argued that when we conduct power analysis we should not be concerned with the power to reject the null, but with the precision of the confidence interval. Much of this is implemented in his MBESS R package.
Thanks for the pointer to Ken Kelley's work. He and colleagues (including Scott Maxwell and others) promote the goal of accuracy in parameter estimation (AIPE) when planning sample size for statistical power. I completely agree with the flavor of the approach. My only concern is that it should be done with Bayesian highest-density intervals (HDIs), not with frequentist confidence intervals (CIs). The problem with CIs is that they change when the stopping and testing intention changes. For example, if I intend to do multiple comparisons, then the CIs change and the measure of accuracy changes. The intended stopping and testing does not change the Bayesian HDI.
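For the simple coin-flip example of this post, the flavor of precision-based planning with an HDI can be sketched as follows, using the betaHDI function sketched earlier in the post; the target width of 0.08 and the worst-case proportion of 0.5 are just illustrative choices:

```r
# Rough fixed-N planning analogue with an HDI instead of a CI (betaHDI is the
# sketch from the post above): the smallest N for which the 95% HDI width stays
# below the target even in the worst case of an observed proportion near 0.5,
# assuming a uniform Beta(1,1) prior.
minNforWidth <- function(targetWidth = 0.08) {
  N <- 10
  repeat {
    z <- round(N / 2)   # worst case: proportion near 0.5 gives the widest posterior
    if (diff(betaHDI(z + 1, N - z + 1)) < targetWidth) return(N)
    N <- N + 1
  }
}
minNforWidth()   # on the order of several hundred flips
```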
> I know... power analysis is not the same as stopping rule, but one could also apply Kelley's proposal to say that we can stop collecting data when our CI has reached a certain pre-specified precision (narrowness).
Actually I agree that power and stopping rule are closely related. In fact I made the blog post specifically because I had been working on it while revising the chapter on power for the book. Sequential testing and power are related by both being about potential outcomes from hypothetical worlds. They are both frequentist issues, but the hypothetical data are analyzed by Bayesian methods.
> It appears that this would be a frequentist equivalent to your proposed stopping rule based on the Bayesian credible interval. Would love to see how they compare to each other. I would think (?) with a flat prior, these two methods should be quite similar..in a sense that they would recommend stopping data collection at similar n's.
Right, except that frequentist CIs change when the stopping and testing intentions change, unlike Bayesian HDIs (see reply above). And Bayesian HDIs are easy to compute in complex hierarchical models, unlike frequentist CIs.
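As a quick numerical check of that intuition, here is one fixed-N comparison (the counts are arbitrary, and betaHDI is the sketch from the post above):

```r
# One fixed-N binomial result, compared two ways.
z <- 30; N <- 100
betaHDI(z + 1, N - z + 1)        # flat-prior 95% HDI (sketch from the post)
binom.test(z, N)$conf.int        # exact (Clopper-Pearson) 95% CI
# Numerically close here, but the CI changes if the stopping or testing
# intention changes, whereas the HDI does not.
```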
P.S. With respect to Kelley et al.'s work on accuracy in parameter estimation (AIPE), I think they only discuss sample size and power for fixed-N designs. I don't think they discuss sequential testing and the resulting bias in estimation...
ReplyDeleteI'm trying to grapple this recommendation from a practical point of view... I am a little challenged by the idea of a method that (given infinitely many data points at its disposal) does not have the grey "no decision yet line" eventually approach zero. Practically speaking, when you've reached decision time and you've neither accepted nor rejected, what do you do next, curl up in a ball of existential doubt? Surely in practice if you care about the true answer enough, you will want to... collect more data! But the method has declared that "data collection should stop". Do you start a new run with the same stopping criteria and throw out the old data?
ReplyDeleteFabio, thanks for the comment.
Well, we are ALWAYS in a state of existential doubt. (Curling into a ball is optional.) Any decision procedure sets a threshold, and is not certain. The threshold can be justified in various ways, but it's still just a threshold, and the decision is always uncertain, whether the decision is "reject" or "undecided".
Traditional decisions using p values give us only two decision categories: Reject the null or be undecided. Traditional decisions using p values cannot make us decide to accept the null. (So-called "equivalence testing" has logical problems because confidence intervals are ill defined.) Bayesian methods, on the other hand, have decision rules with three (or more!) categories: Reject the null, accept the null, or be undecided.
With Bayesian methods you can always collect more data and the posterior distribution on the parameters doesn't care what your stopping intention was. BUT, if you are trying to compute the probabilities of various types of decisions (e.g., false alarms, correct acceptances, etc.) that would be generated from a particular hypothetical population ---as in the graphs of this blog post--- then you need to take into account the sequence of stopping intentions. If you had one stopping rule, and then kept collecting with another stopping rule, you'd have to take into account the full sequence of stopping rules to properly compute the probabilities of the decisions.
Thanks for the quick reply! I suppose then some diligence would be required on the part of the reader (me) to finish developing our own rules for what to do after applying the first recommended stopping rule. Would you be willing to share the code used to generate these simulations/graphs to assist with that?
ReplyDeleteI am quite late in the discussion here... but anyway, here goes:
Of relevance is a recent in-press paper by Rouder, "Optional Stopping: No Problem For Bayesians," in which he argues that the BF does not need to concern itself with optional stopping rules and is still valid. In fact, Rouder writes that Wagenmakers recommends optional stopping. Rouder's argument that the BF still represents a proper ratio of posterior odds of null vs. alternative is sound. However, the simulations presented here by you, John, convince me that even the Bayes factor will over-estimate an effect under optional stopping - and I find this an important point to make. It's not just the proper interpretation of posterior odds that is important, but also whether the size of the effect that you estimate is biased.
Overall, a convincing argument that you presented here John.
Very interesting and illuminating simulations! E.J. Wagenmakers made this point more analytically in a supplementary appendix: https://docs.google.com/viewer?url=http%3A%2F%2Fwww.ejwagenmakers.com%2F2007%2FStoppingRuleAppendix.pdf
Together with this post, the point seems to be quite forceful.
I personally find the precision criterion very appealing. I am really just interested in where the parameter values are, even if they overlap zero, say at the 30% quantile of the HPD interval. It is still just as informative about our world as any other HPD.
Just noticed this nice blog post from Felix Schonbrodt, titled "A comment on 'We cannot afford to study effect size in the lab'" from the DataColada blog.
Very interesting. If I have understood the approaches correctly, the following facts leave me slightly confused:
- Bayesian HDI with ROPE stopping rule: When the posterior mean is near the ROPE boundary, a very small HDI seems to be required in order to either accept or reject the null hypothesis. This means that as the true value under the alternative hypothesis approaches the ROPE boundary, massive sample sizes will be required. The risk is to have an algorithm with very unpredictable stopping times. It seems that if the value under the alternative hypothesis is exactly on the ROPE boundary, then infinite time and sample size would be needed in order to shrink the HDI to zero?
- Precision-based stopping rule: it seems to suffer from the same problem outlined above. The precision required to reject/accept hypotheses near the ROPE boundary seems to approach infinity in the limit.
Is it just a misunderstanding on my side? If the above problems really exist, any suggestion on how to circumvent them?
Thanks,
Anton
PS: "the precision required to reject/accept hypotheses near the ROPE boundary seems to approach infinite time and sample size in the limit"
Antonio
Anton / Antonio:
The decision rule based on achieving desired precision is not concerned with accepting/rejecting a null value. The sample size (or sampling duration) is only as big as needed to achieve the desired precision with the desired power.
The issue of what happens when a true effect is near a ROPE boundary, and therefore possibly needs a large sample, is not a consequence of using a ROPE, nor is it a consequence of using Bayesian instead of frequentist decision criteria. It's a consequence of wanting to detect a minuscule effect. If you have a tiny effect size and you want high power of detecting it, you'll need a large sample size, period. All the ROPE does in this case is recalibrate what counts as small.
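As a rough illustration, using the betaHDI sketch from the post above and idealizing the data as exactly θN heads, the (expected) N needed before the 95% HDI can exclude the ROPE grows rapidly as θ approaches the ROPE limit:

```r
# Idealized data: exactly theta * N heads. Find the smallest N (on a coarse
# grid) at which the 95% HDI falls entirely above the ROPE limit of 0.55.
expectedNtoExcludeROPE <- function(theta, ropeHi = 0.55, maxN = 2e5) {
  for (N in seq(50, maxN, by = 50)) {
    z <- round(theta * N)
    if (betaHDI(z + 1, N - z + 1)[1] > ropeHi) return(N)
  }
  NA
}
sapply(c(0.65, 0.60, 0.57, 0.56), expectedNtoExcludeROPE)  # grows rapidly near 0.55
```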
Thanks for clarifying:
- Bayesian HDI with ROPE:
We may not be interested in detecting tiny effects, but small effect sizes can still happen. We don't know the true magnitude of the effect before starting the test, and in sequential testing the sample size is determined by the stopping rule. So the problem seems to exist for this stopping rule?
- Precision-based stopping rule, regarding:
"Once stopped, check whether null can be accepted or rejected according to HDI with ROPE criteria."
I suspect that if the HDI and ROPE overlap after the desired level of precision has been reached then we can declare the test as inconclusive?
Antonio
Further, I presume that a way of getting around the "problem" would be to conduct a power analysis and determine a maximum required sample size based on power requirements, as indicated in your book. Then data could be acquired sequentially until either the chosen stopping rule accepts/rejects the null or the maximum sample size is reached.
This way the stopping rules act like early stopping rules with a bounded maximum sample size.
Antonio
Antonio:
This is such a big topic (and an old thread) that I'm going to defer discussion with a pointer to the book: It's all in Chapter 13 of DBDA2E! :-)
I have a question about a detail of your simulations. You have the simulated experimenter applying the stopping rule after every single observation. Some of the tests sometimes give a decision after only a few observations, fewer than would make a publishable paper.
Are any of the curves substantially altered if optional stopping only begins after some initial number of observations have been made, e.g. 20?
Richard: The quantitative details change, but the qualitative effect remains.
Thanks for the post, John. I recently implemented the third and fourth stopping rules in the context of deciding when to stop an A/B test (see the A/B testing tool and discussion of the implementation). If you have the time, it'd be great to get your feedback on it.
ReplyDeleteDear John,
So the efficiency problem is the trade-off between precision and running as few participants (in my case) as possible. In classical power analysis we anticipate a certain effect size, and plan our sample based on that. In adaptive designs (or at least the precision version) we pre-define a precision and run until we get there, regardless of the effect size, which seems a bit of a waste, because I would say that if you have a massive effect, a little extra precision won't make that much of a difference in answering a question (for example, on whether or not a manipulation worked).
The problem with the first approach is that we have to guess the magnitude of our effect in advance; the downside of the latter is that we don't take the magnitude of the effect into consideration at all.
So wouldn't it be possible to do both in a combined (recursive) stopping time rule based on the relationship between the change in precision from the initial sample to the extended sample relative to the effect size of the extended sample? This way, you could stop when the extra precision gained from running additional subjects is negligible in the light of the magnitude of the effect.
Obviously, the estimation would not be optimal, but it would be efficient when dealing with subjects that are difficult or costly to obtain.
Would a Bayesian adaptive design with such a stopping time rule make any sense?
Best
Wouter
A full-blown adaptive stopping rule would involve specifying the utilities of correct and wrong dichotomous decisions, the cost of collecting data, and the utility of accuracy (i.e., lack of bias) and precision (i.e., lack of uncertainty) in measurement. And the publication and career utilities that may result for the researcher. There are trade-offs in any specific procedure. This blog post -- insofar as I remember way back then -- was intended to point out possible estimation bias when the stopping criterion is rejecting or accepting the null hypothesis, and the risk of stopping with a small N that doesn't achieve sufficient precision of estimation. It's important to note that the goal of adequate precision does not imply extremely high precision -- adequate precision is judged relative to current knowledge, cost of data collection, and the possibility of meta-analysis with previous data.
Dear John,
Thank you very much for the demonstrated approach, but I struggle to understand one thing. In Figure 1, the top figure from the left, the rejection rate increases over iterations and reaches the level of 50%. It basically means that by obtaining more data we increase the Type I error, since the null hypothesis is actually true. From my understanding the trend should be the other way around: the rejection rate is higher than 5% at the beginning and tends to decrease to the theoretical level of 5% over iterations. Can you explain this discrepancy?
Best regards,
Alexander
The top-left panel of Figure 1 is supposed to show a linear increase of the false-alarm rate on the log(N) scale. It merely happens to stop at about 50%; if the log(N) scale continued, the line would continue to rise. Remember this is optional stopping, so if we haven't yet rejected the null, we keep collecting more data. Even when the null is true, eventually we'll happen to sample enough rogue data that we'll reject the null (when using a p value that inappropriately assumes fixed N, not optional stopping).
One thing I'm unclear on is how the dependence of p-values on testing intentions plays out in the use of statistical software. What testing intention does, say, the two-sample t-test function in R assume?
If software assumes testing intentions that may differ from a researcher's actual testing intentions, does that not render any p-value reported in any scientific paper pretty much completely unintelligible?
The conventional t test assumes a fixed-N stopping intention and a single-test testing intention. That is, the assumed procedure for generating simulated samples of test statistics from imaginary data uses
(i) a constant N that matches the data N, because it is assumed that data sampling stopped when N was reached, and
(ii) a single test statistic, that is, a single t value per simulated sample, because it is assumed that you are doing a single test.
If the observed data were not actually collected and tested with these assumptions, then the p value delivered by the software is not appropriate.
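As a small illustration, consider a hypothetical two-stage procedure: test at n = 20 per group, and if the result is not significant, add 20 more per group and test again, each time using the ordinary fixed-N two-sample t test. Even with only one extra look, the realized false-alarm rate exceeds the nominal .05:

```r
# Small (hypothetical) illustration: a two-stage rule (test at n = 20 per group;
# if p >= .05, add 20 more per group and test again), where each test uses the
# ordinary fixed-N two-sample t test. The null is true throughout.
set.seed(4)
falseAlarm <- replicate(10000, {
  x <- rnorm(20); y <- rnorm(20)
  if (t.test(x, y)$p.value < .05) return(TRUE)
  x <- c(x, rnorm(20)); y <- c(y, rnorm(20))
  t.test(x, y)$p.value < .05
})
mean(falseAlarm)   # noticeably above the nominal .05
```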
I would love to use this stopping procedure in my experiment! Is there a way you would prefer people cite the precision stopping procedure? Kruschke 2013?
This is all in, and updated in, Chapter 13 of DBDA2E.
Fantastic, thank you!