## Tuesday, October 25, 2016

### Should researchers be correcting for multiple tests, even when they themselves did not run the tests, but all of the tests were run on the same data?

A graduate student in my frequentist statistics class, Caitlin Ducate, asks:

> In Criminal Justice, it's common to use large data sets like the Uniform Crime Report (UCR) or versions of the National Longitudinal Survey (NLS) because the nature of certain questions doesn't lend itself well to experimentation or independent data gathering. As such, many researchers have conducted many analyses using the UCR and NLS. My question is whether or not p-values would need to be fixed across the entire data set? In other words, should researchers be correcting for multiple tests even when they themselves did not run the tests because all of the tests were run on the same data?

This question gets at the core conundrum of correcting for multiple tests. Certainly if two researchers were collaborating on an analysis that they considered to be one study, and each researcher had a different (perhaps overlapping) batch of tests to run, then the overall combined set of tests should have their p values corrected for all the tests. On the other hand, if the two researchers declared that they were not collaborating and they considered their batches of tests to be "separate" then the p values of the two batches of tests would not be corrected properly for the overall combined set of tests. Such separation of batches of analyses invites an inflated false alarm (Type I error) rate for tests of the data set. Thus, the appropriate practice should be that every new analysis of the data should correct for all the previous tests of the data by all previous researchers, and all previously published analyses should have updated p values to take into account subsequent tests. Right?
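
To get a feel for how quickly the false alarm rate grows as uncorrected tests pile up on one data set, here is a minimal Python sketch. It assumes the tests are independent and that every null hypothesis is true, which is a simplification (tests run on shared data are usually correlated to some degree). The function name is made up for illustration.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false alarm among `num_tests`
    independent tests, each run at level `alpha`, when all nulls are true."""
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    # Each uncorrected test keeps its own 5% rate, but the rate for the
    # whole collection of tests on the data set climbs quickly.
    for m in (1, 5, 20, 100):
        print(f"{m:4d} tests -> familywise rate {familywise_error_rate(m):.3f}")
```

Under these assumptions, twenty uncorrected tests already push the chance of at least one false alarm well past 50%.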

The puzzler above is based on the premise that corrections for multiple tests should control the error rate for any one set of data. That raises the question of how to define "one set of data." A few years ago I was reviewing a manuscript that was submitted to a major scientific journal. The researchers had conducted an experiment with several conditions; the theoretical motivation and procedure made it obvious that the conditions were part of one conceptualization. Moreover, participants volunteered and were assigned at random across all the various conditions; in other words the conditions were obviously intended to be part of the same study. But the manuscript reported one subset of conditions as being in "Experiment 1" and the complementary subset of conditions as being in "Experiment 2." Why would the authors do such a strange and confusing thing when reporting the research? Because that way the corrections for multiple comparisons would only have to take into account the set of tests for the data of "Experiment 1" separately from the set of tests for the data of "Experiment 2." If the data from all the conditions were considered to be one set of data, then the correction for multiple comparisons would have to take into account all the tests, and various tests would no longer have p<.05. Ugh.

There's an analogous old puzzler based on the premise that corrections for multiple tests should control the error rate for any one researcher (not just for one set of data). Especially if studies conducted by the researcher are follow-ups of previous studies, are the data from the follow-ups really separate sets of data? Aren't they all really just one extended set of data from that researcher? Therefore, each researcher is allowed a lifetime false-alarm rate of, say, 5%, and the critical p value for any single test by that researcher should take into account the fact that she will be conducting hundreds, probably thousands, of tests during her research lifetime. Moreover, if you are collaborating with other researchers, be sure that they only rarely run significance tests because they won't inflate your collaborative error rate as much as frequent-testers.
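
As a rough illustration of what a "lifetime" error rate would demand, the sketch below computes the per-test critical p value needed to hold the overall rate at 5% across a given number of tests. It shows two standard corrections, Bonferroni and Šidák. The test counts are hypothetical, and the Šidák version assumes independent tests.

```python
def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test critical p value under the Bonferroni correction."""
    return alpha / num_tests


def sidak_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test critical p value under the Sidak correction
    (exact for independent tests)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / num_tests)


if __name__ == "__main__":
    # Hypothetical numbers of tests run over a research career.
    for m in (100, 1000, 10000):
        print(f"{m:6d} tests: Bonferroni {bonferroni_threshold(m):.2e}, "
              f"Sidak {sidak_threshold(m):.2e}")
```

With a thousand lifetime tests, each individual test would need a p value on the order of .00005 to count as significant, which is the absurdity the puzzler is pointing at.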

The general issue, of deciding what constitutes the appropriate "family" of tests to be corrected for, is a sticky problem. To define an error rate, there must be a presumed family of tests for which the error rate is being defined. There are various arguments for defining the family this way or that in any particular application. For example, when running experiments with multi-factor designs, typically each main effect and interaction is considered to be a separate family and corrections for multiple tests only need to be made within each family, not across. The usual argument is something like this: in an experimental design, for which the independent variables are manipulated and randomly assigned, each factor could have been left out. But that argument breaks down if the factors can be redefined to be multiple levels of one factor, etc.

Those are just some rambling thoughts. What do you think is the answer to Caitlin's question, "[For shared data sets], should researchers be correcting for multiple tests even when they themselves did not run the tests because all of the tests were run on the same data?"

1. As a lay person I always felt they should, even though it may be impractical. For example, it's frustrating to see the same economic data parsed and re-parsed with different demographic or treatment factors viewed as salient.

If prosperity in an area increases over a 30-year period, to what extent is it due to changes in preschool education, changes in environmental lead, demographic changes, Flynn effects, taxation changes in the area or in competing areas, average height, technological progress, benefits from trade, etc., etc., etc.?

This vast literature keeps building and building without any rationalization effort to make it self-consistent. Taking a step back, it becomes meaningless and uninterpretable.

2. How many times has a 'star dataset' been analysed by independent groups? A hundred? A million? Even thinking about correcting for others' potential (future) analyses is conceptually wrong, yet I do think it is a problem.
Even if someone is asking literally the same question that I asked previously of the same numbers, let's say for the sake of reproducibility, they would need to be more conservative. The 100th reproducibility attempt would find no results, as the need for multiple-test correction across studies inflates the p value.
I find this a very nice thought-experiment!

3. In personal email, I asked Daniel Lakens about this. He pointed me to a recent blog post of his in which he suggests that what defines a family of related tests is not the set of data, and not the researcher, but the theoretical hypothesis being examined. Thus, all tests that examine the same theoretical hypothesis should have their p values mutually corrected, but tests that examine a different theoretical hypothesis are in a different family and don't count against the first family.

I think this is an interesting perspective, and it certainly adds another dimension by which to define a "family" of tests. But I also think it doesn't resolve our issue in this case. If a bunch of different researchers have what amounts to the same theoretical hypothesis, and they run different batches of tests on the same public data set, should individual researchers correct their p values for all the tests run by the other researchers who are testing the same theoretical hypothesis? Hmmm...

4. This is fascinating and takes on importance in this time of open science, open data sharing, and huge public databases. A couple more points related to this question:

-- Is it even possible in the first place to know how many times a (public) data set has been analyzed at all? I am hard pressed to find any such database publishing such numbers. I know you need to register to access most of those data, but I haven't seen such numbers being published.

-- In the same vein, is it possible to know how many times the said data-set was analyzed for different hypotheses?

I appreciate any thoughts on this. Thanks

5. Interesting discussion!

I seem to agree with Daniel Lakens: A family of hypothesis tests is defined by the conceptual / theoretical similarity of the tests. This makes correcting familywise error comparable to setting hyperpriors for batches of effects in a Bayesian framework.

For example, in multiple regression we might predict well-being from 20 predictors that have something to do with economics, and another 20 predictors that have something to do with demographics. We might conjecture that the economic factors have effects similar to one another, and that the demographic factors are also similar to one another. (We might also conjecture something about their magnitudes, but that is a separate issue.) Assuming a multilevel model with two underlying prior distributions, one for economics, one for demographics, would seem to prevent rogue noise in one of the predictors (per family!) from reaching levels that one might interpret as signal.

In fact, multilevel modeling does this, in general, across units (participants) of a study.

Thanks

6. @ Szabolcs David: Thank you for commenting! Just to clarify: If someone else runs an identical test (on the identical data), there is no need to "correct" the p values because the two tests are perfectly correlated; they have no independent possibility of false alarm. It's only tests that have some degree of independence that require correction because they contribute additional possibility of false alarms.

@ Anonymous October 28, 2016 at 11:54 AM: Interesting questions you've posed. Presumably we could only know about published analyses, not about analyses relegated to the file drawer. And presumably it would take a thorough literature search to collate all the analyses of a public data set, and then a thorough reading to find out which of those analyses involved tests that are relevant to the new ones.

@ Matti Vuorre: It is interesting to consider how families of tests relate to hierarchical structure in a hierarchical model. I agree there is some conceptual overlap, but they are two distinct things. In particular, a frequentist analyst can use a hierarchical model that has separate higher-level distributions over financial and demographic variables, but the frequentist analyst will still have to apply corrections for multiple comparisons. That is, hierarchical model structure and corrections for multiple comparisons are separate issues. The hierarchical model structure will affect the magnitude of correction (because it affects the correlations of the test statistics in the joint sampling distribution) but the correction will still have to be applied. Moreover, even if the hierarchical structure expresses one conceptual division of "families", the correction for multiple comparisons could use a different division of families.
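
The two points made in this reply, that an identical test adds no new chance of a false alarm and that the correlation among test statistics affects how much correction is needed, can be checked with a small simulation. This is only a sketch under simple assumptions: two two-sided z tests, both nulls true, with a correlation `rho` between the two test statistics. The function name and parameters are made up for illustration.

```python
import math
import random
from statistics import NormalDist


def simulated_fwer(rho: float, num_sims: int = 50_000,
                   alpha: float = 0.05, seed: int = 1) -> float:
    """Estimate the chance that at least one of two uncorrected z tests
    rejects when both nulls are true and the statistics correlate at rho."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided cutoff
    false_alarms = 0
    for _ in range(num_sims):
        z1 = rng.gauss(0.0, 1.0)
        # Build z2 so that it is standard normal with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        if abs(z1) > z_crit or abs(z2) > z_crit:
            false_alarms += 1
    return false_alarms / num_sims


if __name__ == "__main__":
    for rho in (0.0, 0.5, 1.0):
        print(f"rho = {rho:.1f}: estimated familywise rate "
              f"{simulated_fwer(rho):.3f}")
```

With `rho = 1.0` the second test is a literal repeat of the first, so the estimated rate stays near 5%. With `rho = 0.0` it approaches the independent-tests value of about 9.75%, and intermediate correlations fall in between.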

7. @John K. Kruschke
Thanks for the reply. That makes perfect (and common) sense. As I understand it, the 100% pure, literal reproduction does not need any correction, but its scientific value is also pretty low; I prefer conceptual reproduction a lot more. If there is a small modification in the model, then the correction is indeed necessary, isn't it? Some degree of independence has been introduced.

I'm working in brain imaging: millions of voxels, lots of t-tests, and sometimes the effect sizes are pretty small, even for very high profile projects like the HCP (brain MRI scans from 1000+ subjects, \$40 million budget, insanely good image quality), and that data is open to the public to play with. If different groups are asking slightly different questions, should the critical p value shrink toward 0 eventually due to this correction? So even if someone is asking a very good question (i.e., his model, based on theoretical/biological/etc. assumptions, makes a lot of sense), his results cannot be significant after 100s of published tests?

I do like this idea very much, but it would be pretty hard to 'sell' to the greater community.

8. @ Szabolcs David: Yeah, brain-scan data in the public domain is another great example of lots of researchers examining the same data and doing lots of related tests. Ultimately, I am not trying to sell the idea of everybody correcting their p values in light of everybody else's tests. I'm trying to point out that (i) that's logically what should be done if your concern is error control over a relevant family and (ii) it's probably impossible in practice. Therefore...? Well, one option is not focusing on error control and instead doing Bayesian data analysis.

9. This is a topic I have been interested in for a long time now, trying to understand when such corrections are actually needed. And I must say, it is still puzzling to me at times, mainly because p-values are very difficult to interpret correctly.

I think the issue is whether you are focusing on the family-wise error rate or the test-wise error rate. Following a pragmatic approach, I believe a researcher should always explicitly state which error rate he/she is talking about. And it is up to the reader to evaluate the actual evidence in favor of or against a hypothesis.

Imagine a researcher who does a batch of tests (e.g. all n(n-1)/2 pairwise comparisons in a simple one-way ANOVA with an n-level factor), and does not correct for multiple testing. If he doesn't mention that he is only interested in the test-wise error, he is lazy or ignorant. But if he does, and he comes up with an argument for why he is not interested in the family-wise error, I believe this is also fine as long as he doesn't make claims which would require a family-wise error rate.

This has always made me question the following. If a researcher performs a traditional OLS regression with 10 predictor variables, and uses the p-values to see which regressor terms are significant, shouldn't he/she then also apply some kind of multiple testing correction?
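
To make the regression question concrete, here is a minimal sketch of what one such correction would do to ten coefficient p-values, if the ten predictors were treated as one family. It uses the Holm step-down procedure, which is only one of several possible corrections. The p-values are invented for illustration and do not come from any real regression.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values for 10 regression coefficients.
    raw = [0.001, 0.004, 0.012, 0.03, 0.04, 0.20, 0.35, 0.50, 0.70, 0.90]
    adj = holm_adjust(raw)
    print("significant at .05, test-wise:  ", sum(p < 0.05 for p in raw))
    print("significant at .05, family-wise:", sum(p < 0.05 for p in adj))
```

Whether the ten coefficients should be treated as one family is exactly the judgment call discussed in the post; the sketch only shows what the correction does once that choice is made.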