Comments on Doing Bayesian Data Analysis: Optional stopping in data collection: p values, Bayes factors, credible intervals, precision

Fantastic, thank you!

2018-02-07T15:13:43.284-05:00

Fantastic, thank you!

This is all in, and updated in, Chapter 13 of DBDA...

2018-02-07T15:10:31.904-05:00

This is all in, and updated in, Chapter 13 of DBDA2E.

I would love to use this stopping procedure in my ...

2018-02-07T14:51:12.107-05:00

I would love to use this stopping procedure in my experiment! Is there a way you would prefer people cite the precision stopping procedure? Kruschke 2013?

The conventional t test assumes a fixed-N stopping...

2017-11-23T09:01:38.156-05:00

The conventional t test assumes a fixed-N stopping intention and a single-test testing intention. That is, the assumed procedure for generating simulated samples of test statistics from imaginary data uses

(i) a constant N that matches the data N, because it is assumed that data sampling stopped when N was reached, and

(ii) a single test statistic, that is, a single t value per simulated sample, because it is assumed that you are doing a single test.

If the observed data were not actually collected and tested with these assumptions, then the p value delivered by the software is not appropriate.
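
To make the two assumptions concrete, here is a minimal sketch (not the code behind the post) of how a p value arises from imaginary data under a fixed-N, single-test intention. It is an illustration under stated assumptions: it uses a z statistic with a known standard deviation of 1 instead of a t statistic so that it needs only the Python standard library, and the function name `simulated_p_value` is made up for this example.

```python
import math
import random

def simulated_p_value(observed_z, n, n_sims=20000, seed=1):
    """Monte Carlo two-sided p value under a fixed-N, single-test intention.

    Assumption for this sketch: known sd = 1, so the statistic is z, not t.
    """
    rng = random.Random(seed)
    as_extreme = 0
    for _ in range(n_sims):
        # (i) every imaginary sample has exactly n observations from the null
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        # (ii) exactly one test statistic is computed per imaginary sample
        z = sample_mean * math.sqrt(n)
        if abs(z) >= abs(observed_z):
            as_extreme += 1
    return as_extreme / n_sims

print(simulated_p_value(observed_z=1.96, n=10))
```

If the real data were collected with a different stopping intention, the loop that generates imaginary samples would have to change accordingly, and so would the resulting p value.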

One thing I'm unclear on is how the dependence...

2017-11-23T02:17:45.632-05:00

One thing I'm unclear on is how the dependence of p-values on testing intentions plays out in the use of statistical software. What testing intention does, say, the two-sample t-test function in R assume?

If software assumes testing intentions that may differ from a researcher's actual testing intentions, does that not render any p-value reported in any scientific paper pretty much completely unintelligible?

The top-left panel of Figure 1 is supposed to show...

2016-12-07T15:49:50.997-05:00

The top-left panel of Figure 1 is supposed to show a linear increase in false alarms on the log(N) scale. The curve merely happens to stop at about 50%; if the log(N) axis continued, the line would continue to rise. Remember this is optional stopping, so if we haven't yet rejected the null, we keep collecting more data. Even when the null is true, eventually we'll happen to sample enough rogue data that we'll reject the null (when using a p value that inappropriately assumes fixed N, not optional stopping).
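
The effect can be reproduced with a short simulation. The sketch below is not the code that generated Figure 1; it is a simplified stand-in that assumes a z test with known standard deviation 1, tests after every observation, and uses an invented `first_look` parameter to delay the first test (setting `first_look` equal to `max_n` gives the ordinary fixed-N case).

```python
import math
import random

def false_alarm_rate(max_n, first_look=1, n_sims=1000, alpha=0.05, seed=1):
    """Proportion of null-true experiments that reject the null by max_n."""
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(0.0, 1.0)   # the null is true: mean 0, sd 1
            if n < first_look:
                continue                    # no testing before the first look
            z = total / math.sqrt(n)
            # Two-sided p value that (inappropriately) assumes fixed N
            p = math.erfc(abs(z) / math.sqrt(2.0))
            if p < alpha:
                false_alarms += 1
                break                       # optional stopping: stop on rejection
    return false_alarms / n_sims

for max_n in (10, 100, 1000):
    print(max_n, false_alarm_rate(max_n))
```

Under these assumptions the false-alarm rate keeps climbing as the maximum N grows, while a single test at a fixed N stays near the nominal 5%.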

A full blown adaptive stopping rule would involve ...

2016-12-07T15:45:23.812-05:00

A full-blown adaptive stopping rule would involve specifying the utilities of correct and wrong dichotomous decisions, the cost of collecting data, and the utility of accuracy (i.e., not bias) and precision (i.e., not uncertainty) in measurement, along with the publication and career utilities that may result for the researcher. There are trade-offs in any specific procedure. This blog post --insofar as I remember way back then-- was intended to point out possible estimation bias when the stopping criterion is rejecting or accepting the null hypothesis, and the risk of stopping with a small N that doesn't achieve sufficient precision of estimation. It's important to note that the goal of adequate precision does not imply extremely high precision -- adequate precision is judged relative to current knowledge, cost of data collection, and possibility of meta-analysis with previous data.

Dear John, thank you very much for the demonstra...

2016-12-07T07:32:39.558-05:00

Dear John,

thank you very much for the demonstrated approach, but I struggle to understand one thing. In the top-left panel of Figure 1, the rejection rate increases over iterations and reaches about 50%. This basically means that by obtaining more data we increase the Type I error rate, since the null hypothesis is actually true. From my understanding the trend should be the other way around: the rejection rate is higher than 5% at the beginning and tends to decrease to the theoretical level of 5% over the iterations. Can you explain this discrepancy?

Best regards,
Alexander

Dear John, so the efficiency problem is the trade...

2016-11-30T22:22:53.475-05:00

Dear John,

so the efficiency problem is the trade-off between precision and running as few (in my case) participants as possible. In classical power analysis we anticipate a certain effect size, and plan our sample based on that. In adaptive designs (or at least the precision version) we pre-define a precision and run until we get there, regardless of the effect size, which seems a bit of a waste because I would say that if you have a massive effect, a little extra precision won't make that much of a difference in answering a question (for example on whether or not a manipulation worked).

The problem with the first approach is that we have to guess the magnitude of our effect in advance; the downside of the latter is that we don't take the magnitude of the effect into consideration at all.

So wouldn't it be possible to do both in a combined (recursive) stopping time rule based on the relationship between the change in precision from the initial sample to the extended sample relative to the effect size of the extended sample? This way, you could stop when the extra precision gained from running additional subjects is negligible in the light of the magnitude of the effect.

Obviously, the estimation would not be optimal, but it would be efficient when dealing with subjects that are difficult or costly to obtain.

Would a Bayesian adaptive design with such a stopping time rule make any sense?

Best
Wouter

Thanks for the post, John. I recently implemented ...

2016-06-23T17:24:05.712-04:00

Thanks for the post, John. I recently implemented the third and fourth stopping rules in the context of deciding when to stop an A/B test (see the A/B testing tool and discussion of the implementation). If you have the time, it'd be great to get your feedback on it.

Richard: The quantitative details change, but the ...

2015-08-05T15:51:35.135-04:00

Richard: The quantitative details change, but the qualitative effect remains.

I have a question about a detail of your simulatio...

2015-07-29T17:10:12.197-04:00

I have a question about a detail of your simulations. You have the simulated experimenter applying the stopping rule after every single observation. Some of the tests sometimes give a decision after only a few observations, fewer than would make a publishable paper.

Are any of the curves substantially altered if optional stopping only begins after some initial number of observations have been made, e.g. 20?

Antonio: This is such a big topic (and an old thre...

2015-07-29T11:44:31.476-04:00

Antonio:
This is such a big topic (and an old thread) that I'm going to defer discussion with a pointer to the book: It's all in Chapter 13 of DBDA2E! :-)

Further, I presume that a way of getting around th...

2015-07-29T08:35:20.615-04:00

Further, I presume that a way of getting around the "problem" would be to conduct a power analysis and determine a maximum required sample size based on power requirements, as indicated in your book. Then data could be acquired sequentially until either the chosen stopping rule accepts/rejects the null or the maximum sample size is reached.
This way the stopping rules act like early stopping rules with a bounded maximum sample size.

Antonio

Thanks for clarifying: -Bayesian HDI with ROPE: W...

2015-07-29T07:10:51.250-04:00

Thanks for clarifying:

-Bayesian HDI with ROPE:
We may not be interested in detecting tiny effects but small effect sizes can still happen. We don't know the true magnitude of the effect size before starting the test and in sequential testing the sample size is determined by the stopping rule. So the problem seems to exist for this stopping rule?

-Precision based stopping rule:

on:
"Once stopped, check whether null can be accepted or rejected according to HDI with ROPE criteria."

I suspect that if the HDI and ROPE overlap after the desired level of precision has been reached then we can declare the test as inconclusive?

Antonio

Anton / Antonio: The decision rule based on achie...

2015-07-29T01:03:36.836-04:00

Anton / Antonio:

The decision rule based on achieving desired precision is not concerned with accepting/rejecting a null value. The sample size (or sampling duration) is only as big as needed to achieve the desired precision with the desired power.

The issue of what happens when a true effect is near a ROPE boundary, and therefore possibly needs a large sample, is not a consequence of using a ROPE, nor is it a consequence of using Bayesian instead of frequentist decision criteria. It's a consequence of wanting to detect a minuscule effect. If you have a tiny effect size and you want high power of detecting it, you'll need a large sample size, period. All the ROPE does in this case is recalibrate what counts as small.
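
A rough sketch of the precision-based rule may help show why its stopping time does not hinge on where the effect sits relative to the ROPE. Everything specific here is an assumption made for illustration: the data are normal with sd 1, the interval is a normal-approximation 95% interval standing in for the HDI, the ROPE is (-0.1, 0.1), the target width is 0.16, and the function names are invented.

```python
import math
import random

def classify(lo, hi, rope):
    """Label an interval relative to the ROPE once sampling has stopped."""
    rope_lo, rope_hi = rope
    if hi < rope_lo or lo > rope_hi:
        return "reject null"      # interval entirely outside the ROPE
    if rope_lo <= lo and hi <= rope_hi:
        return "accept null"      # interval entirely inside the ROPE
    return "undecided"            # interval straddles a ROPE boundary

def run_until_precise(true_mean, max_width, rope, min_n=10, seed=1, z_crit=1.96):
    """Sample until the approximate 95% interval is no wider than max_width."""
    rng = random.Random(seed)
    n, total, total_sq = 0, 0.0, 0.0
    while True:
        x = rng.gauss(true_mean, 1.0)
        n, total, total_sq = n + 1, total + x, total_sq + x * x
        if n < min_n:             # min_n must be at least 2 for a variance
            continue
        mean = total / n
        var = max((total_sq - n * mean * mean) / (n - 1), 0.0)
        half = z_crit * math.sqrt(var / n)
        if 2.0 * half <= max_width:   # stop on precision alone, not on a decision
            lo, hi = mean - half, mean + half
            return n, (lo, hi), classify(lo, hi, rope)

for mu in (0.0, 0.5):
    print(mu, run_until_precise(mu, max_width=0.16, rope=(-0.1, 0.1)))
```

In this sketch the sample size at stopping is driven by the variability of the data, not by the size of the effect. An interval that still straddles a ROPE boundary at that point is simply labelled undecided.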

PS: "the precision required to rejected/accep...

2015-07-28T12:20:15.623-04:00

PS: "the precision required to reject/accept hypotheses near the ROPE boundary seems to approach infinite time and sample size in the limit"

Antonio

Very interesting. If I have understood the approac...

2015-07-28T12:18:16.038-04:00

Very interesting. If I have understood the approaches correctly the following facts leave me slightly confused:

- Bayesian HDI with ROPE stopping rule: When the posterior mean is near the ROPE boundary, a very small HDI seems to be required in order to either accept or reject the null hypothesis. This means that as the true value of the alternative hypothesis approaches the ROPE boundary, massive sample sizes will be required. The risk is to have an algorithm with very unpredictable stopping times. It seems that if the value of the alternative hypothesis is exactly on the ROPE boundary then infinite time and sample size are needed in order to shrink the HDI to zero?

- Precision based stopping rule: it seems to suffer from the same problem outlined above.
The precision required to reject/accept hypotheses near the ROPE boundary seems to approach infinity in the limit.

Is it just a misunderstanding on my side? If the above problems really exist, any suggestion on how to circumvent them?

Thanks,
Anton

Just noticed this nice blog post from Felix Schonb...

2014-05-09T22:24:34.684-04:00

Just noticed this nice blog post from Felix Schonbrodt titled A comment on “We cannot afford to study effect size in the lab” from the DataColada blog

Very interesting and illuminating simulations! E.J...

2014-03-28T08:21:47.214-04:00

Very interesting and illuminating simulations! E.J. Wagenmakers made this point more analytically in a supplementary appendix: https://docs.google.com/viewer?url=http%3A%2F%2Fwww.ejwagenmakers.com%2F2007%2FStoppingRuleAppendix.pdf

Together with this post, the point seems to be quite forceful.

I personally find the precision criterion very appealing. I am really just interested in where parameter values are, even if they overlap zero, say at the 30% quantile of the HPD interval. It is still just as informative about our world as any other HPD interval.

I am quite late in the discussion here... but anyw...

2014-03-01T22:21:34.053-05:00

I am quite late in the discussion here... but anyway, here goes:

Of relevance is a recent in-press paper by Rouder, "Optional Stopping: No Problem For Bayesians", in which he argues that the BF does not need to concern itself with optional stopping rules and is still valid. In fact, Rouder writes that Wagenmakers recommends optional stopping. Rouder's argument that the BF still represents a proper ratio of posterior odds of null vs alternative is sound. However, the simulations presented here by you, John, convince me that even the Bayes factor will over-estimate an effect under optional stopping, and I find this an important point to make. It's not just the proper interpretation of posterior odds that is important, but also whether the size of the effect that you estimate is biased.
Overall, a convincing argument that you presented here, John.
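
The over-estimation point can be illustrated with a small simulation. This is a sketch under assumptions that are not taken from the post: normal data with known sd 1, a true mean of 0.2, a Normal(0, tau) prior on the mean under the alternative with tau = 1, a decision threshold of 3, and a cap of 500 observations. The function names are invented for the example.

```python
import math
import random

def bf10(z, n, tau=1.0):
    """Bayes factor for H1: mu ~ Normal(0, tau) versus H0: mu = 0 (known sd = 1)."""
    shrink = n * tau * tau / (1.0 + n * tau * tau)
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z * z * shrink)

def estimates_at_rejection(true_mean, max_n=500, n_sims=1000, threshold=3.0, seed=1):
    """Sample means recorded at the moment the BF first favors H1."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_sims):
        total = 0.0
        for n in range(1, max_n + 1):
            total += rng.gauss(true_mean, 1.0)
            bf = bf10(total / math.sqrt(n), n)
            if bf > threshold:            # stop and "reject the null"
                estimates.append(total / n)
                break
            if bf < 1.0 / threshold:      # stop and "accept the null"
                break
    return estimates

est = estimates_at_rejection(0.2)
print(len(est), sum(est) / len(est))
```

Under these assumptions, the average estimate among the runs that stopped by rejecting the null comes out larger than the true mean of 0.2, because early rejections happen only when the sample mean is unusually extreme.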

Thanks for the quick reply! I suppose then some di...

2014-01-23T11:14:23.311-05:00

Thanks for the quick reply! I suppose then some diligence would be required on the part of the reader (me) to finish developing our own rules for what to do after applying the first recommended stopping rule. Would you be willing to share the code used to generate these simulations/graphs to assist with that?

Fabio, thanks for the comment. Well, we are ALWAY...

2014-01-22T12:04:26.006-05:00

Fabio, thanks for the comment.

Well, we are ALWAYS in a state of existential doubt. (Curling into a ball is optional.) Any decision procedure sets a threshold, and is not certain. The threshold can be justified in various ways, but it's still just a threshold, and the decision is always uncertain, whether the decision is "reject" or "undecided".

Traditional decisions using p values give us only two decision categories: Reject the null or be undecided. Traditional decisions using p values cannot make us decide to accept the null. (So-called "equivalence testing" has logical problems because confidence intervals are ill defined.) Bayesian methods, on the other hand, have decision rules with three (or more!) categories: Reject the null, accept the null, or be undecided.

With Bayesian methods you can always collect more data and the posterior distribution on the parameters doesn't care what your stopping intention was. BUT, if you are trying to compute the probabilities of various types of decisions (e.g., false alarms, correct acceptances, etc.) that would be generated from a particular hypothetical population ---as in the graphs of this blog post--- then you need to take into account the sequence of stopping intentions. If you had one stopping rule, and then kept collecting with another stopping rule, you'd have to take into account the full sequence of stopping rules to properly compute the probabilities of the decisions.

I'm trying to grapple this recommendation from...

2014-01-22T11:15:17.273-05:00

I'm trying to grapple with this recommendation from a practical point of view... I am a little challenged by the idea of a method that (given infinitely many data points at its disposal) does not have the grey "no decision yet" line eventually approach zero. Practically speaking, when you've reached decision time and you've neither accepted nor rejected, what do you do next, curl up in a ball of existential doubt? Surely in practice if you care about the true answer enough, you will want to... collect more data! But the method has declared that "data collection should stop". Do you start a new run with the same stopping criteria and throw out the old data?