Bayesian methods can be used in general data-analytic models, in psychometric models, and in models of mind. What is the difference? In all three applications, there is Bayesian estimation of parameter values in a model. What differs between models is the source of the data and the meaning (semantic referent) of the parameters, as described in the examples below.
As an example of a generic data-analytic model, consider data about ice cream sales and sleeve lengths, measured at different times of year. A linear regression model might show a negative slope for the line that describes a trend in the scatter of points. But the slope does not necessarily describe anything in the processes that generated the ice cream sales and sleeve lengths.
As an example of a psychometric model, consider multidimensional scaling (MDS). The data are similarity ratings (or confusion matrices) from a human observer, and the parameters are coordinates of items in a geometric representation of mental constructs that produced the ratings. Note that if the MDS model is applied to non-behavioral data, such as inter-city road distances, then it is not a psychometric model.
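To make the role of the parameters concrete, here is a minimal non-Bayesian sketch in R, using an invented four-item similarity matrix and base R's cmdscale; the point is only that the fitted coordinates play the role of psychological constructs. A Bayesian treatment would instead put priors on the coordinates and estimate their posterior distribution.

```r
# Hypothetical similarity ratings (0 = not at all similar, 10 = identical)
# among four items, from a single observer. All numbers are invented.
sim <- matrix(c(10, 8, 2, 1,
                 8, 10, 3, 2,
                 2, 3, 10, 7,
                 1, 2, 7, 10),
              nrow = 4,
              dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))

# Convert similarities to dissimilarities and fit coordinates in 2 dimensions
# with classical metric MDS. Each row of the result is an item's inferred
# position in "psychological space".
d <- as.dist(max(sim) - sim)
coords <- cmdscale(d, k = 2)
coords
```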
As an example of a Bayesian model of mind, consider models of visual perception of concavity. The data are the light beams reflecting off the curved object in the world and falling on the retina. The model in the mind has parameters that represent the shape of the object in the world being viewed and the angle of the light falling on it. The prior has strong bias for light falling from above (e.g., from the sun and sky). The posterior estimate of shape and lighting then tends to produce interpretations of shapes consistent with overhead lighting, unless there is strong evidence to the contrary. Notice that the parameter values are inside the head, so there must be additional assumptions regarding how to measure those parameter values.
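To make the idea concrete, here is a toy calculation in R. The hypotheses, prior, and likelihood values are all invented for illustration and are not taken from any published model of perception: two candidate shapes crossed with two lighting directions, a prior biased toward light from above, and a likelihood for an ambiguous shading pattern (top edge dark, bottom edge bright).

```r
# Joint hypotheses: shape of the object crossed with lighting direction.
hyp <- expand.grid(shape = c("concave", "convex"),
                   light = c("above", "below"))

# Prior: strong bias for light from above; the two shapes are equally likely.
prior <- ifelse(hyp$light == "above", 0.45, 0.05)

# Likelihood of the observed shading pattern under each hypothesis (invented
# values): the pattern is equally consistent with a concave shape lit from
# above or a convex shape lit from below.
like <- c(0.80, 0.10,   # concave & above, convex & above
          0.10, 0.80)   # concave & below, convex & below

# Posterior over the joint hypotheses, by Bayes' rule.
post <- prior * like / sum(prior * like)
cbind(hyp, prior, like, post)

# Marginal posterior on shape: "concave" wins because of the lighting prior.
tapply(post, hyp$shape, sum)
```

Despite the shading being ambiguous, the posterior favors the concave interpretation, which is the light-from-above effect in miniature.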
Wednesday, October 26, 2011
Thursday, October 20, 2011
Automatic conversion to JAGS from BRugs
[The programs are now all available in JAGS, so you don't need to use this translator any more. See the more recent blog post.]
Fantastic contribution from Thomas G. Smith:
FakeBRugs - Pretend that rjags is BRugs. See complete info at https://github.com/tgs/FakeBRugs.
This tiny library of functions is enough to get you through many of the examples in Dr. John Kruschke's book, Doing Bayesian Data Analysis. The functions translate from the BRugs calls used in the book, to rjags calls that work on Unix-y environments. This is not a complete translation layer!!! It's just enough to do most of the things the book asks you to do!
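To give a flavor of what a translation layer like this involves, here is a rough sketch of the idea, NOT the actual FakeBRugs code (see the GitHub page for that): a few BRugs-style function names that keep state in an environment and forward to rjags. The function bodies and signatures are simplified for illustration and cover only a subset of calls.

```r
library(rjags)

# Illustrative shim only -- not FakeBRugs itself, and only a subset of calls.
.shim <- new.env()

modelCheck <- function(fileName) {          # remember the model file
  .shim$file <- fileName                    # (BRugs would also syntax-check it)
}
modelData <- function(dataList) {           # store the data for the model
  .shim$data <- dataList
}
modelCompile <- function(numChains = 1) {   # compile via rjags
  .shim$model <- jags.model(.shim$file, data = .shim$data, n.chains = numChains)
}
samplesSet <- function(nodes) {             # register nodes to monitor
  .shim$monitored <- nodes
}
modelUpdate <- function(numIter) {          # burn-in, or record samples once
  if (is.null(.shim$monitored)) {           # monitors have been set
    update(.shim$model, n.iter = numIter)
  } else {
    .shim$samples <- coda.samples(.shim$model, .shim$monitored, n.iter = numIter)
  }
}
samplesSample <- function(node) {           # retrieve the chain for one node
  as.matrix(.shim$samples)[, node]
}
```

With definitions along these lines loaded first, a BRugs-style calling sequence like the one used in the book's programs can run largely unchanged on top of rjags.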
Many thanks to Thomas!
Wednesday, October 19, 2011
False conclusions in "False positive psychology..."
A new article in Psychological Science† correctly points out flaws of p values and of procedures that produce bias in data selection and analysis. The article makes several reasonable conclusions. Unfortunately it also makes two wrong conclusions about issues of fundamental importance with major ramifications.
One conclusion is that p values are okay but need to be corrected for the researcher's stopping intention. I refute that claim by reductio ad absurdum. A second conclusion is that Bayesian analysis is a "nonsolution" to the problem of researchers having too much wiggle room. I dispute that claim by clarifying what problems any analysis method can or cannot address, by denying flexibility attributed to Bayesian analysis that isn't really available, and by claiming the actual flexibility in Bayesian analysis as an important advantage for scientific research.
The first issue stems from the fact, correctly pointed out in the article, that p values depend on the stopping intention of the data collector. Here's the idea. Suppose a researcher collects some data and computes a summary statistic such as t or F or χ2. The p value is the probability of the observed statistic, or a value more extreme, in the space of possible data that might have been obtained if the null hypothesis were true and the intended experiment were repeated ad infinitum. The space of possible data that might have been obtained depends on the stopping intention. So, if the data collector intended to stop when the sample size N reached a certain number, such as 23, then the space of possible data includes all data sets for which N=23. But if the data collector intended to stop at the end of the week (and just happened to get N=23), then the space of possible data includes all data sets that could have been collected by the end of the week, some of which have N=23, and some of which have smaller N or larger N. Because the two spaces of possible data sets are not the same, the p values are not the same. The p value can depend quite dramatically on the stopping intention. For example, if the researcher intended to stop when N=100 but was unexpectedly interrupted when N=23, the p value is much smaller than if the intention was to stop when N=23. Or, if the researcher intended to stop when N=23 but got an unexpected windfall of data so that N=100, perhaps because a new volunteer assistant showed up, then the p value is much larger than if the researcher intended to stop at N=100. Therefore, to correctly determine the p value for a set of data, we must know the reason that data collection ceased.
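To see how big the difference can be for identical data, consider a hypothetical example of the simplest kind: z = 8 "heads" in N = 24 flips, with the null hypothesis that the underlying probability is 0.5. Under a fixed-N intention the one-tailed p value comes from the binomial sampling distribution; under a stop-at-the-8th-head intention the sample size is the random quantity and the p value comes from the negative binomial. A few lines of R make the point:

```r
# Hypothetical data and null hypothesis.
z      <- 8     # observed number of "heads"
N      <- 24    # observed number of flips
theta0 <- 0.5   # null-hypothesis probability of a head

# Intention 1: stop at N = 24. One-tailed p value = probability of z or fewer
# heads in N flips under the null.
p_fixedN <- pbinom(z, size = N, prob = theta0)            # approx. 0.076

# Intention 2: stop at the 8th head. Observing N >= 24 flips means at most
# z - 1 heads occurred in the first N - 1 flips.
p_fixedZ <- pbinom(z - 1, size = N - 1, prob = theta0)    # approx. 0.047

c(p_fixedN = p_fixedN, p_fixedZ = p_fixedZ)
```

Identical data, but under one stopping intention p > .05 and under the other p < .05, which is exactly the kind of predicament described in the examples that follow.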
Here is an example of the correct use of p values. A lab director has some research in mind and decides that N=30 is adequate. The director tells the lab administrator to collect data from 30 subjects. The administrator knows that the lab typically recruits about 30 subjects in a week, and therefore tells the data-collecting assistant to run subjects for a week. The assistant dutifully collects data, and at the end of the week the lab director happens to be present as the last datum is collected, which happens to be N=30. As far as the lab director can tell, data collection ceased intentionally when N=30. When the lab director analyzes the data, under the intention of stopping when N=30, a particular p value is computed. But when the assistant analyzes the data, under the intention of stopping at the end of the week, a different p value is computed. In fact, for the lab director p<.05, but for the assistant p>.05. Which p value is correct? Are the results "significant"?
Here is another example of the correct use of p values. Two competing labs are pursuing the same type of research in a well established experimental paradigm. The two labs independently have the same idea for an identical experiment design, and the researchers go about collecting data. In one lab, they intend to collect data for a week, and they happen to get N=30. In the other lab, they intend to stop data collection when N=30. Moreover, by chance, the data in the two labs happen to be identical. (That's not so far fetched; e.g., perhaps the data are binary choices for each subject, and so the data can be summarized as the number of "heads" out of N subjects.) The two labs compute p values for their identical data sets. The two p values are different because the data were collected with different stopping intentions; in fact p<.05 for one lab but p>.05 for the other lab. Which lab got "significant" data?
The problem is that the data bear no signature of the stopping intention of the researcher. Indeed, researchers go out of their way to make sure that the data are not influenced by their stopping intention. Each datum collected is supposed to be completely insulated from any data collected before or after. The last datum collected has no trace that it was the last or the first or any position in between.
Not only is the intention opaque to the data, it's often opaque to the researcher. Collaborators on a project might have differing sampling intentions (as in the example of the director and assistant, above). Or the sampling intention might change midway through data collection. ("Let's collect N=100." Then on Friday afternoon, "Well, we've got N=94, that's good enough.") Or, as often happens, some subjects have to be deleted from the data set because of procedural errors or failure to respond, despite the data-collector's intention about sample size or stopping time. These are all very realistic sampling procedures, and we tolerate them because we know that the data are completely unaffected by the intention of the researcher.
Therefore it's strange that the interpretation of the data in terms of p values should depend crucially on something that has no impact on the data, namely, the stopping intention. But, the correct calculation of p values does depend on the stopping intention.
So, what should be done about this problem? Given the examples above, it seems clear that we should treat p values as inherently ill-defined because they depend on intentions that are irrelevant to the data.
But here is the #1 recommendation from the new article in Psychological Science:
Requirements for authors: ... 1. Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article. Following this requirement may mean reporting the outcome of power calculations or disclosing arbitrary rules, such as “we decided to collect 100 observations” or “we decided to collect as many observations as we could before the end of the semester.” The rule itself is secondary, but it must be determined ex ante and be reported.

Presumably this requirement is declared so that researchers can use the stopping rule to calculate the true and correct p value, determined appropriately for the idiosyncratic intentions of the researchers. Here's a plausible example. "We decided to collect 100 observations, but by Friday we had 94 and figured that was close enough, so we stopped, and then had to delete 5 subjects because of later-discovered transcription errors. A post-experiment inquisition of the lab team revealed that one of our assistants was covertly intending to quit the job on Monday, which would have limited our data collection if we had not decided to stop on Friday. Therefore, running a large Monte Carlo simulation that incorporated the estimated recruitment rate of subjects during the week, and the estimated probability of deciding to stop on Friday for different values of N achieved on Friday, and the estimated probability of transcription errors, and the probability of an assistant quitting on Monday, we determined that p=..."
To ease the reporting of true and correct p values, it would be extremely helpful to have critical values for commonly-used statistics (t, F, etc.) under various typical stopping intentions that researchers adopt. All the critical values in contemporary books and computer programs assume the unrealistic convention that N was fixed in advance. Instead, we should have tables of critical values for stopping after a certain duration, with varieties of sampling rates during that interval. (In fact, I've already seeded this process with an example in this article.) Therefore researchers could obtain the true p values for their data if they intended to stop after a particular duration.
It would also be helpful to have correction factors for unexpected interruptions of data collection, or unexpected windfalls of data collection. The researcher would merely have to enter the intended sample size and the probability of being interrupted or enhanced (and the probability of each magnitude of enhancement) at any particular time during data collection, and the correction factor would produce the true and correct p value for the data.
Federal funding agencies have been especially keen lately to support collaborative efforts of teams of researchers. It is likely that different team members may harbor different intentions about the data collection (perhaps covertly or subconsciously, but harbored nonetheless). Therefore it would be extremely useful to construct tables of true and correct critical values for cases of parallel mixed intentions, when one collaborator intends to stop at a fixed sample size, and the other collaborator intends to stop at the end of the week. Clearly, the construction of these tables should be a major funding priority for the granting agencies.
I look forward to a new industry of publications that reveal appropriate corrections for different sampling intentions. Fortunately, we already have a model of this industry in the extensive literature about correcting for false alarms in multiple comparisons. Depending on the types of intended comparisons, and whether the intentions are planned or post-hoc, we have a cornucopia of corrections for every possible set of intended comparisons. Unfortunately, all of those corrections for multiple comparisons have been based on the sampling assumption that data collection stopped at a fixed sample size! Therefore, every one of the corrections for multiple comparisons will have to be reworked for different stopping intentions. It will be a great day for science when we have a complete set of corrections for all the various intentions regarding multiple comparisons and intentions regarding stopping collection of data, because then we will know true and correct p values for our data, which were completely insulated from those intentions.
Oops! Sorry, I slipped into sarcasm. But hey, it's my blog. I should reiterate that I agree with many of the points made by the authors of the Psychological Science article. Just not the point about p values and stopping intentions. And one more point, regarding a different analysis method that the authors dismissed as a "nonsolution." Here is the relevant excerpt:
Nonsolutions: ... Using Bayesian statistics. We have a similar reaction to calls for using Bayesian rather than frequentist approaches to analyzing experimental data (see, e.g., Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011). Although the Bayesian approach has many virtues, it actually increases researcher degrees of freedom. First, it offers a new set of analyses (in addition to all frequentist ones) that authors could flexibly try out on their data. Second, Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by-case basis, providing yet more researcher degrees of freedom.

It's important to be clear that statistical analysis of any kind can only deal with the data it's given. If the design and procedure garner biased data, no analysis can fully undo that bias. Garbage in, garbage out. If the problem of "too many researcher degrees of freedom" stems from design-and-procedure problems, then it needs design-and-procedure solutions. To say that Bayesian analysis is a nonsolution to a design-and-procedure problem is like saying that a meat grinder is a nonsolution to rancid meat (and that therefore the meat grinder is useless) (and leaving the reader to make the leap that therefore the meat grinder is useless).1
The authors argue that Bayesian analysis "increases researcher degrees of freedom" (and is therefore bad) in two ways. First, "it offers a new set of analyses (in addition to all the frequentist ones)". The tacit assumption of this statement seems to be that researchers would try frequentist and Bayesian approaches and just report the one that gave the most flattering conclusion. No, this wouldn't fly. Bayesian analyses provide the most complete inferential information given the data (in the normative mathematical sense), and analysts can't just slip into frequentist mode because it's flattering. In fact, reporting a p value is embarrassing, not flattering, because p values are ill defined.
Second, say the authors, "Bayesian statistics require making additional judgments (e.g., the prior distribution)...". Ah, the bogey-man of priors is trotted out to scare the children, as if priors can be capriciously set to anything the analyst wants (insert sounds of mad, wicked laughter here) and thereby predetermine the conclusion. In actuality, priors are overt and explicitly agreeable to a skeptical scientific audience. Typically they are set to be noncommittal so that they have minimal influence on the posterior distribution. When there is considerable previous research to inform a prior, then a strong prior can give great inferential leverage to small samples. And not using strong prior information when it is available can be a serious blunder; consider random drug testing and disease diagnosis, which must take into account the base rates, i.e., the priors.
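For instance, here is the familiar base-rate arithmetic with invented but typical-looking numbers for a screening test; the point is that the prior (the base rate) completely changes what a positive result means.

```r
# Hypothetical screening test; all numbers invented for illustration.
base_rate   <- 0.001   # prior probability that a randomly tested person is affected
sensitivity <- 0.99    # P(positive test | affected)
false_pos   <- 0.05    # P(positive test | not affected)

# Bayes' rule: posterior probability of being affected, given a positive test.
p_positive  <- sensitivity * base_rate + false_pos * (1 - base_rate)
p_affected  <- sensitivity * base_rate / p_positive
p_affected   # about 0.02 -- a positive result alone is far from conclusive
```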
Bayesian analysis does, in fact, give analysts more flexibility than traditional frequentist analysis. It gives the analyst the flexibility to use a model that actually describes the trends and distributions represented in the data, instead of being shoe-horned into linear models and normal distributions that may have little resemblance to the data. (Of course, if the analyst wants linear models with normal distributions, Bayesian analysis provides the richest possible inference about their parameters without ever computing a p value.) With Bayesian analysis, researchers can actually get useful parametric descriptions of complex data, involving multi-level non-linear models with non-normal distributions at various levels throughout the model. This is the flexibility that scientific theorizing needs.
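As one small illustration of that flexibility, swapping the conventional normal likelihood for a heavier-tailed t distribution takes only a few lines in JAGS. The sketch below is not a program from the book: the priors are generic vague choices, the data are fake, and the whole thing is only meant to show how little ceremony the change requires.

```r
library(rjags)

# Robust simple linear regression: t-distributed noise so that outliers do not
# drag the slope around as they would under a normal likelihood.
modelString <- "
model {
  for (i in 1:Ndata) {
    y[i] ~ dt(b0 + b1 * x[i], 1/(sigma*sigma), nu)
  }
  b0 ~ dnorm(0, 1.0E-6)        # vague priors, chosen only for illustration
  b1 ~ dnorm(0, 1.0E-6)
  sigma ~ dunif(0, 100)
  nu <- nuMinusOne + 1         # keep the degrees of freedom at least 1
  nuMinusOne ~ dexp(1/29)
}
"
writeLines(modelString, con = "robustRegModel.txt")

# Fake data with two gross outliers, just to exercise the model.
set.seed(47)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
y[1:2] <- y[1:2] + 15

jagsModel <- jags.model("robustRegModel.txt",
                        data = list(x = x, y = y, Ndata = length(y)),
                        n.chains = 3)
update(jagsModel, 1000)   # burn-in
codaSamples <- coda.samples(jagsModel,
                            variable.names = c("b0", "b1", "sigma", "nu"),
                            n.iter = 5000)
summary(codaSamples)
```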
† Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, first published online Oct. 17, 2011. DOI: 10.1177/0956797611417632
1 Revision to parenthetical remark made on Oct. 23, 2011, in response to personal communication from Uri Simonsohn. Struck-out version was in the original post, which could have inadvertently been interpreted as saying that the authors themselves explicitly made the meat-grinder-is-useless argument. They did not.
Tuesday, October 11, 2011
Bayesian Economist wins Nobel
[Image from Sims' web page]
(And an I.U. professor, Elinor Ostrom, won in 2009.)
Wednesday, October 5, 2011
If chains are converged, is autocorrelation okay?
See the follow-up post!
From time to time I've been asked whether autocorrelation in MCMC chains is okay if the chains are converged, as indicated by the BGR statistic being close to 1.0. The answer is: No. Autocorrelation in the chains implies that the MCMC sample is clumpy. A clumpy sample is not representative of a smooth distribution.
Here is an example of a case in which the BGR statistic is nicely behaved near 1.0, but there is still notable autocorrelation. It arises from doing multiple linear regression (see Fig. 17.4, p. 458 of the book) on two predictors that are highly correlated. The regression coefficients are denoted b[1] and b[2]. Here are chains that have no thinning:
Notice that the BGR is at 1.0 throughout, but the ACF has notable autocorrelation.
There is a separate question of how much autocorrelation can be tolerated. This depends on the particular parameterization and the summary measure of the MCMC sample that is being considered. But all of that worrying can be avoided if it is easy to thin the chain and get rid of autocorrelation, as it is in the example above.
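For completeness, here is how those checks and the thinning look with rjags and the coda package. This is a sketch, not the book's code: it assumes an already-compiled rjags model object (called jagsModel here) with several chains, monitoring the coefficient vector b as in the example above.

```r
library(rjags)   # also loads coda

# Assumes jagsModel is a compiled, adapted rjags model with >= 2 chains and a
# monitored coefficient vector b, as in the correlated-predictors example.
codaSamples <- coda.samples(jagsModel, variable.names = "b",
                            n.iter = 20000, thin = 1)

gelman.diag(codaSamples)       # BGR / potential scale reduction; near 1.0 is good
autocorr.diag(codaSamples)     # lag-by-lag autocorrelation within the chains
effectiveSize(codaSamples)     # much smaller than the nominal sample size
                               # when the chains are clumpy

# Thin the chains (keep every 20th step), either by re-running coda.samples
# with thin = 20 or by windowing the samples already in hand:
thinnedSamples <- window(codaSamples, thin = 20)
autocorr.diag(thinnedSamples)  # autocorrelation should now be near zero
effectiveSize(thinnedSamples)
```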
If mere thinning doesn't do the trick (because the MCMC sampling takes a long time), then sometimes transforming the data can help (see the book, e.g., p. 459). Otherwise, reparameterizing is the usual way to go. You can actually change the model, or sometimes you can transform the parameters after they've been sampled and the transformed versions aren't autocorrelated (e.g., this blog post regarding Bayesian ANOVA).
Monday, October 3, 2011
Another reader's rave review
All of a sudden it just makes sense! Everyone knows that "lightbulb moment", when previously accumulated knowledge or facts become condensed into a lucid concept, where something previously opaque becomes crystal clear. This book is laden with such moments. This is the most accessible statistics text for a generation and I predict (based on prior knowledge) that it will be a major factor in moving scientists of every shape and size towards the Bayesian paradigm. Even if you're sceptical, you're likely to learn more about frequentist statistics by reading this book, than by reading any of the tomes offered by so called popularisers. If you are a social scientist, laboratory scientist, clinical researcher or triallist, this book represents the single best investment of your time. Bayesian statistics offer a single, unified and coherent approach to data analysis. If you're intimidated by the use of a scripting language like "R" or "BUGS", then don't be. The book repays your close attention and has very clear instructions on code, which elucidate the concepts and the actual mechanics of the analysis like nothing I've seen before. All in all, a great investment. The only serious question that can be raised about the design and implementation of a book such as this is: why wasn't it done before?
Click here to see the review at Amazon.com. My great appreciation goes to R. Dunne for taking the effort to post the review.
Court outlaws Bayes' rule!
[Image from the article in The Guardian]