Wednesday, September 19, 2018

Which movie (treatment, group) is better? Opposite conclusions from different models.

Which movie is better? One way to answer is by considering the star ratings given to those movies. Just treat those 1-to-5 star ratings as numbers, throw them into a t test, and out pops your answer. Right? Not necessarily...

The analogous structure arises in many situations. Suppose, for example, we ask which group is happier, a group of poor people or a group of rich people? One way to answer is by considering subjective happiness ratings from an ordinal scale: 1 = very unhappy, 2 = mildly unhappy, 3 = neither unhappy nor happy, 4 = mildly happy, 5 = very happy. Just treat those 1-to-5 ratings as numbers, throw them into a t test, and out pops your answer. Right? Not necessarily...

Or, consider ratings of symptom intensity in different treatment groups. How bad is your headache? How depressed do you feel? Just treat the ratings as numbers and throw them into a t test, and out pops your answer. Right? Not necessarily...

Treating ordinal values as if they were numeric can lead to misinterpretations. Ordinal values do not indicate equal distances between their levels, nor equal coverage of each level. The conventional t test (and ANOVA and least-squares regression, etc.) assumes the data are metric values normally distributed around the model's predicted values. But obviously ordinal data are not normally distributed metric values.

A much better model of ordinal data is the ordered-probit model, which assumes a continuous latent dimension that is mapped to ordinal levels by slicing the latent dimension at thresholds. (The ordered-probit model is not the only good model of ordinal data, of course, but it's nicely analogous to the t test etc. because it assumes normally distributed noise on the latent dimension.)
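To make the thresholding idea concrete, here is a minimal Python sketch (not the R/JAGS code linked below). It assumes a latent normal distribution with mean `mu` and standard deviation `sigma`, and hypothetical thresholds at 1.5, 2.5, 3.5, and 4.5; the function names and all numeric values are illustrative assumptions.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Cumulative probability of a normal(mu, sigma) distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ordinal_probs(mu, sigma, thresholds):
    """Probability of each ordinal level under an ordered-probit model.

    The thresholds must be increasing; K thresholds define K+1 levels.
    Each level's probability is the latent normal mass between the
    two thresholds that bound it.
    """
    cumulative = [normal_cdf(t, mu, sigma) for t in thresholds]
    edges = [0.0] + cumulative + [1.0]
    return [edges[k + 1] - edges[k] for k in range(len(edges) - 1)]

# Hypothetical thresholds for a 1-to-5 rating scale (illustrative only).
thresholds = [1.5, 2.5, 3.5, 4.5]
probs = ordinal_probs(mu=3.2, sigma=1.0, thresholds=thresholds)
for level, p in enumerate(probs, start=1):
    print(f"P(rating = {level}) = {p:.3f}")
```

In a real analysis the thresholds are estimated along with `mu` and `sigma`, so the levels need not be equally spaced on the latent dimension.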

The t test and the ordered probit model can produce opposite conclusions about the means of the groups. Here's an example involving star ratings from two movies:


The figure above shows data from two movies, labelled as Cases 5 and 6 in the first two columns. The pink histograms show the frequency distributions of the star ratings; they are the same in the upper and lower rows. The upper row shows the results from the ordered-probit model. The lower row shows the results from the metric model, that is, the t test. In particular, the right column shows the posterior difference of mu's for the two movies. The differences are strongly in opposite directions for the two analyses. Each posterior distribution is marked with a dotted line at a difference of zero, and the line is annotated with the percentage of the distribution below zero and above zero. Notice the ordered-probit model fits the data much better than the metric model, as shown by the posterior predictions superimposed on the data: blue dots for the ordered probit model, and blue normal distributions for the metric model. (This is Figure 8 of the article linked below.)

Read all about it here:

Published article:
https://www.sciencedirect.com/science/article/pii/S0022103117307746

Preprint manuscript: https://osf.io/9h3et/

R code: https://osf.io/53ce9/files/

Friday, September 7, 2018

Analyzing ordinal data with metric models: What could possibly go wrong? (just published)

Just published:

Analyzing ordinal data with metric models: What could possibly go wrong?
We surveyed all articles in the Journal of Personality and Social Psychology (JPSP), Psychological Science (PS), and the Journal of Experimental Psychology: General (JEP:G) that mentioned the term “Likert,” and found that 100% of the articles that analyzed ordinal data did so using a metric model. We present novel evidence that analyzing ordinal data as if they were metric can systematically lead to errors. We demonstrate false alarms (i.e., detecting an effect where none exists, Type I errors) and failures to detect effects (i.e., loss of power, Type II errors). We demonstrate systematic inversions of effects, for which treating ordinal data as metric indicates the opposite ordering of means than the true ordering of means. We show the same problems (false alarms, misses, and inversions) for interactions in factorial designs and for trend analyses in regression. We demonstrate that averaging across multiple ordinal measurements does not solve or even ameliorate these problems. A central contribution is a graphical explanation of how and when the misrepresentations occur. Moreover, we point out that there is no sure-fire way to detect these problems by treating the ordinal values as metric, and instead we advocate use of ordered-probit models (or similar) because they will better describe the data. Finally, although frequentist approaches to some ordered-probit models are available, we use Bayesian methods because of their flexibility in specifying models and their richness and accuracy in providing parameter estimates. An R script is provided for running an analysis that compares ordered-probit and metric models.

Fig. 4. Ordinal mean as a function of latent mean (mu) and SD (sigma). Groups marked Ⓐ and Ⓑ illustrate a false alarm (Type I error) for which the underlying means are exactly equal but the ordinal means are very different. Groups marked Ⓑ and Ⓓ illustrate a miss (Type II error) for which the underlying means are quite different but the ordinal means are exactly equal. Groups marked Ⓒ and Ⓓ illustrate an inversion for which the underlying means have μD > μC but the ordinal means incorrectly have μC > μD.
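The dependence of the ordinal mean on both the latent mean and the latent SD can be checked numerically. The Python sketch below is only an illustration: the thresholds (1.5, 2.5, 3.5, 4.5) and the group parameters are assumptions chosen to show a floor effect, and they are not the values plotted in the figure.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative probability."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordinal_mean(mu, sigma, thresholds=(1.5, 2.5, 3.5, 4.5)):
    """Mean of the ratings 1..K+1 when the ordinal levels are treated as numbers."""
    cumulative = [0.0] + [normal_cdf((t - mu) / sigma) for t in thresholds] + [1.0]
    return sum((k + 1) * (cumulative[k + 1] - cumulative[k])
               for k in range(len(cumulative) - 1))

# False alarm: equal latent means, different latent SDs (hypothetical values).
mean_small_sd = ordinal_mean(mu=1.0, sigma=1.0)
mean_large_sd = ordinal_mean(mu=1.0, sigma=3.0)
print(f"same mu, sigma 1 vs 3: {mean_small_sd:.2f} vs {mean_large_sd:.2f}")

# Inversion: the group with the higher latent mean gets the lower ordinal mean.
mean_lower_mu = ordinal_mean(mu=1.0, sigma=3.0)
mean_higher_mu = ordinal_mean(mu=1.3, sigma=1.0)
print(f"mu 1.0 vs 1.3: {mean_lower_mu:.2f} vs {mean_higher_mu:.2f}")
```

Because the latent means sit near the bottom of the scale, a larger SD pushes more probability into the higher ratings, which raises the ordinal mean even when the latent mean is the same or lower.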

Published article:
https://www.sciencedirect.com/science/article/pii/S0022103117307746

Preprint manuscript: https://osf.io/9h3et/

R code: https://osf.io/53ce9/files/

Tuesday, May 8, 2018

Just published: "Rejecting or Accepting Parameter Values in Bayesian Estimation"

Just published: "Rejecting or Accepting Parameter Values in Bayesian Estimation" in the journal, Advances in Methods and Practices in Psychological Science.

Abstract: This article explains a decision rule that uses Bayesian posterior distributions as the basis for accepting or rejecting null values of parameters. This decision rule focuses on the range of plausible values indicated by the highest density interval of the posterior distribution and the relation between this range and a region of practical equivalence (ROPE) around the null value. The article also discusses considerations for setting the limits of a ROPE and emphasizes that analogous considerations apply to setting the decision thresholds for p values and Bayes factors.

Figure 1 of the article.


From the introduction:
In everyday life and in science, people often gather data to estimate a value precisely enough to take action. We use sensory data to decide that a fruit is ripe enough to be tasty but not overripe—that the ripeness is “just right” (e.g., Kappel, Fisher-Fleming, & Hogue, 1995, 1996). Scientists measured the position of the planet Mercury (among other things) until the estimate of the parameter γ in competing theories of gravity was sufficiently close to 1.0 to accept general relativity for applied purposes (e.g., Will, 2014).

These examples illustrate a method for decision making that I formalize in this article. This method, which is based on Bayesian estimation of parameters, uses two key ingredients. The first ingredient is a summary of certainty about the measurement. Because data are noisy, a larger set of data provides greater certainty about the estimated value of measurement. Certainty is expressed by a confidence interval in frequentist statistics and by a highest density interval (HDI) in Bayesian statistics. The HDI summarizes the range of most credible values of a measurement. The second key ingredient in the decision method is a range of parameter values that is good enough for practical purposes. This range is called the region of practical equivalence (ROPE). The decision rule, which I refer to as the HDI+ROPE decision rule, is intuitively straightforward: If the entire HDI—that is, all the most credible values—falls within the ROPE, then accept the target value for practical purposes. If the entire HDI falls outside the ROPE, then reject the target value. Otherwise, withhold a decision.
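The decision rule just described is simple enough to write as a short function. The following Python sketch is one possible reading of it; the function name, the tuple arguments, and the treatment of an HDI limit that exactly touches a ROPE limit (counted here as inside the ROPE) are assumptions made for illustration.

```python
def hdi_rope_decision(hdi, rope):
    """Apply the HDI+ROPE rule to an interval pair.

    hdi  : (low, high) limits of the highest density interval
    rope : (low, high) limits of the region of practical equivalence
    Returns "accept", "reject", or "withhold".
    """
    hdi_low, hdi_high = hdi
    rope_low, rope_high = rope
    if rope_low <= hdi_low and hdi_high <= rope_high:
        return "accept"    # entire HDI falls within the ROPE
    if hdi_high < rope_low or hdi_low > rope_high:
        return "reject"    # entire HDI falls outside the ROPE
    return "withhold"      # HDI and ROPE partially overlap

# Hypothetical intervals around a null value of 0.
rope = (-0.1, 0.1)
print(hdi_rope_decision((-0.05, 0.08), rope))  # HDI inside the ROPE
print(hdi_rope_decision((0.20, 0.60), rope))   # HDI outside the ROPE
print(hdi_rope_decision((-0.05, 0.30), rope))  # HDI straddles a ROPE limit
```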

In this article, I explain the HDI+ROPE decision rule and provide examples. I then discuss considerations for setting the limits of a ROPE and explain that similar considerations apply to setting the decision thresholds for p values and Bayes factors.
The published article is available here (http://journals.sagepub.com/doi/full/10.1177/2515245918771304) and a pre-print version, with some differences in details, is available here (https://osf.io/s5vdy).

Sunday, February 25, 2018

Make model diagrams for human comprehension and ease of programming

There's a great new book by Farrell and Lewandowsky, Computational Modeling of Cognition and Behavior (at the publisher, at Amazon.com), that includes some chapters on Bayesian methods. Each chapter includes a little "in vivo" commentary by an outside contributor. My commentary accompanies their chapter regarding JAGS. The commentary is posted here in a succession of three blog posts; this is 3 of 3. (Part 1 is here, and part 2 is here.) Do check out their book!

Make model diagrams for human comprehension and ease of programming

While a JAGS model specification captures the full structure of the model, it can help human beings to have a diagrammatic representation of the model. A diagram can help the viewer achieve a comprehensive overview of the relations between parameters and their meanings with respect to each other and to the data. A good conceptual diagram of a model can also guide writing the JAGS model specification.

For example, Figure 8.13 (below) shows a representation of the normal model used in the previous section. Because of graphical conventions for probability distributions, the data must be shown at the bottom of the diagram. Starting with yi, the diagram shows that the data come from a normal distribution that has parameters μ and σ. Then the top of the diagram illustrates the prior distributions on the parameters.

Figure 8.13. Diagram of the normal model, in the style of the book, Doing Bayesian Data Analysis (Kruschke, 2015). Scan the diagram from the bottom up, that is, beginning with the data yi at the bottom. Notice that every arrow has a corresponding line of code in the JAGS model specification.
The type of diagram in Figure 8.13 has several helpful attributes. It spatially organizes related parameters in the same distribution. For example, we can see that parameters μ and σ are both participating in the same distribution, and the icon also suggests that μ is for the central tendency and σ is for the scale (standard deviation). Moreover, the diagram completely captures all the structure of the model, showing the form of the prior distribution along with the likelihood function. Indeed, every arrow in the diagram has a corresponding line of code in the JAGS model specification, as shown in the previous post's Listing 8.11 and repeated here for convenience:
model {
  for ( i in 1:N ) { y[i] ~ dnorm( mu , 1/sigma^2 ) }
  mu ~ dunif( -100 , 100 )
  sigma ~ dunif( 0 , 100 )
}
   (Listing 8.11. Describe data with a normal distribution in JAGS.)
Often when I’m creating a new model, I first sketch out a diagram in the style of Figure 8.13, and after I’m sure I have a coherent structure, then I type the model into JAGS, scanning the diagram from the bottom up.

There is another convention that is sometimes used to illustrate Bayesian models. This convention has historical roots in general treatments of statistical models that specify probabilistic dependencies between parameters such that no dependencies cycle back on themselves. Such structures are called directed acyclic graphs (DAGs). In particular, the DAG diagrammatic convention was used by the software DoodleBUGS, which was a component of WinBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2003), the precursor to JAGS.

Figure 8.14. Diagram of the normal model, in the style of conventional graphical models. Shaded node indicates observed (not estimated) values. Plate indicates repetition. Notice that the arrows have no relation to lines of code in the JAGS model specification.
Figure 8.14 shows a DAG diagram for the normal model. The arrows between variables indicate that the data, yi, are dependent on parameter μ and on parameter σ. But the diagram does not indicate whether or not the two parameters participate in the same distribution or come from different distributions. The diagram does not show the prior distributions at all. Importantly, the diagram provides no clue how to express the model in JAGS because there is no relation between the arrows in the diagram and the lines of code in JAGS. Often when DAGs are used for illustration, the diagram will be accompanied by a list of all the equations that specify the model. While the equations provide complete information, the reader must scan back and forth between equations and diagram to make sense of the diagram.

For more discussion, see p. 197 of Kruschke (2015). It’s repeatedly emphasized for many different models in that book that every arrow in a model diagram (usually) has a corresponding line of code in JAGS. See another comparison of diagrams at this blog post. See tools for creating diagrams at this blog post.

(Ironically, the next chapter of Farrell and Lewandowsky's book violates this advice. You can judge for yourself whether you think the DAGs have any useful correspondence to the JAGS code. Despite the use of DAGs, their book is a great resource!)

Saturday, February 24, 2018

Compose JAGS model statements for human readability

There's a great new book by Farrell and Lewandowsky, Computational Modeling of Cognition and Behavior (at the publisher, at Amazon.com), that includes some chapters on Bayesian methods. Each chapter includes a little "in vivo" commentary by an outside contributor. My commentary accompanies their chapter regarding JAGS. The commentary is posted here in a succession of three blog posts; this is 2 of 3. (Part 1 is here.) Do check out their book!

Compose JAGS model statements for human readability

All mathematical models are designed to describe structure in data. Logically, to comprehend a model, we must first know what the data are that the model is supposed to describe. We begin with describing how the data are probabilistically distributed according to some likelihood function. The likelihood function has parameters, which typically describe some trend or relation in the data. The parameters might be expressed in terms of higher-level parameters. Finally, the parameters have uncertainty, expressed as prior distributions on the parameters. The JAGS model-specification language lets us write models in this logical and comprehensible way: Start with the data, write the likelihood function, then write any dependencies among parameters, and finish with the prior distribution on the parameters. This makes it easy to write the model, and, importantly, easy for readers of the model specification to make sense of the model.

For example, consider a JAGS model specification for describing a set of data with a normal distribution (cf. Listing 8.3 [in Farrell and Lewandowsky's book]):
model {
  for ( i in 1:N ) { y[i] ~ dnorm( mu , 1/sigma^2 ) }
  mu ~ dunif( -100 , 100 )
  sigma ~ dunif( 0 , 100 )
}
   (Listing 8.11. Describe data with a normal distribution in JAGS.)
The model specification (above) is easy to comprehend sequentially in reading order.

JAGS does not execute the lines of the model specification as if they were procedural R commands, but instead JAGS examines the overall model statement for structural consistency. The three lines in the model specification (above) could be put in any order and JAGS would not care. For example, JAGS would also allow the following:
model {
  sigma ~ dunif( 0 , 100 )
  mu ~ dunif( -100 , 100 )
  for ( i in 1:N ) { y[i] ~ dnorm( mu , 1/sigma^2 ) }
}
   (Listing 8.12. Alternative JAGS description of a normal distribution.)
In terms of information content, it does not matter if you say “the knee bone’s connected to the thigh bone, and the thigh bone’s connected to the hip bone,” or instead say “the thigh bone’s connected to the hip bone, and the knee bone’s connected to the thigh bone.”
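One way to see why the order is irrelevant: the model statements jointly define a single log-density that is a sum of terms, one per statement, and a sum does not depend on the order of its terms. The Python sketch below illustrates this for the normal model. The helper names are hypothetical, and note that the sketch parameterizes the normal by its standard deviation, whereas the second argument of `dnorm` in JAGS is a precision (hence `1/sigma^2` in the listings).

```python
import math

def log_dnorm(y, mu, sigma):
    """Log density of normal(mu, sigma) at y, with sigma a standard deviation."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2.0 * sigma ** 2)

def log_dunif(x, low, high):
    """Log density of uniform(low, high) at x."""
    return -math.log(high - low) if low <= x <= high else -math.inf

def log_joint_data_first(y, mu, sigma):
    """Terms in the order of Listing 8.11: likelihood, then priors."""
    if sigma <= 0:
        return -math.inf
    return (sum(log_dnorm(yi, mu, sigma) for yi in y)
            + log_dunif(mu, -100, 100)
            + log_dunif(sigma, 0, 100))

def log_joint_priors_first(y, mu, sigma):
    """Terms in the order of Listing 8.12: priors, then likelihood."""
    if sigma <= 0:
        return -math.inf
    return (log_dunif(sigma, 0, 100)
            + log_dunif(mu, -100, 100)
            + sum(log_dnorm(yi, mu, sigma) for yi in y))

y = [4.1, 5.3, 6.0, 4.8]  # made-up data for illustration
a = log_joint_data_first(y, mu=5.0, sigma=2.0)
b = log_joint_priors_first(y, mu=5.0, sigma=2.0)
print(a, b, math.isclose(a, b))
```

The two functions agree up to floating-point rounding, which is all the software needs; the difference between the two orderings matters only to the human reader.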

But for human readers trying to comprehend the statements, order does matter. Especially for complicated models with unfamiliar or arbitrary parameter names, it can be very difficult to understand model specifications that begin by specifying priors on parameters before specifying what distributions those parameters play a role in, and what the relation of the data to the parameters is. Therefore, be kind to your readers, and to your future self who will look back on your code months later. Specify JAGS models starting with the data likelihood then working through the parameters and their priors. These ideas are expressed with more examples on p. 199 and p. 414 of Kruschke (2015).

Friday, February 23, 2018

Run MCMC to achieve effective sample size (ESS) of 10,000

There's a great new book by Farrell and Lewandowsky, Computational Modeling of Cognition and Behavior (at the publisher, at Amazon.com), that includes some chapters on Bayesian methods. Each chapter includes a little "in vivo" commentary by an outside contributor. My commentary accompanies their chapter regarding JAGS. The commentary is posted here in a succession of three blog posts; this is 1 of 3. Do check out their book!

Run MCMC to achieve effective sample size (ESS) of 10,000

Bayesian analysis of complex models is possible only by virtue of modern software that takes an abstract model specification and returns a representation of the posterior distribution. In software that uses Markov-chain Monte Carlo (MCMC) methods, such as JAGS, the representation is inherently noisy. The random noise from MCMC tends to cancel out as the chain gets longer and longer. But different aspects of the posterior distribution are differently affected by noise. A relatively stable aspect is the median value of the chain. The median tends to stabilize relatively quickly, that is, with relatively shorter chains, because the median is usually in a high-density region of the posterior and the value of the median does not depend on the distance to outliers (unlike the mean). But other crucial aspects of the posterior distribution tend to need longer chains to achieve stable values.

In particular, a crucial aspect of a parameter distribution is its width. Narrower distributions connote more certainty in the estimate of the parameter. A very useful indicator of the width of a distribution is its 95% highest density interval (HDI). Parameter values within the 95% HDI have higher probability density than parameter values outside the HDI, and the parameter values inside the 95% HDI have a total probability of 95%. An example of an HDI is illustrated in Figure 1.

Figure 1. Example of a 95% highest density interval (HDI). On the axes of the graph, θ denotes a parameter in the model, and p(θ|D) denotes the posterior distribution of that parameter. The limits of the HDI are marked by the ends of the double-headed arrow. Any value of θ within the HDI has higher probability density than any value outside the HDI. The mass within the 95% HDI, shaded by gray in the figure, is 95%.

Because the limits of an HDI are usually in the low-density tails of the distribution, there are relatively few steps in the MCMC chain near the limits. Therefore it takes a long chain to generate sufficiently many representative values of the parameter to stabilize the estimate of the HDI limits.
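For a unimodal distribution, the HDI can be estimated from a sample as the narrowest interval that contains the desired proportion of the sampled values. The Python sketch below is a minimal illustration of that idea, not the function used in the book; independent draws from a gamma distribution stand in for an MCMC sample, and all names and settings are assumptions.

```python
import math
import random

def hdi_of_samples(samples, cred_mass=0.95):
    """Narrowest interval containing cred_mass of the sampled values."""
    ordered = sorted(samples)
    n = len(ordered)
    n_inside = math.ceil(cred_mass * n)  # number of points the interval must hold
    # Width of every candidate interval that holds n_inside consecutive points.
    widths = [ordered[i + n_inside - 1] - ordered[i]
              for i in range(n - n_inside + 1)]
    start = widths.index(min(widths))
    return ordered[start], ordered[start + n_inside - 1]

random.seed(1)
draws = [random.gammavariate(2.0, 1.0) for _ in range(10000)]
hdi_low, hdi_high = hdi_of_samples(draws)
print(f"95% HDI: {hdi_low:.2f} to {hdi_high:.2f}")

# How much the upper HDI limit wobbles across repeated short and long runs.
for n_steps in (500, 10000):
    uppers = [hdi_of_samples([random.gammavariate(2.0, 1.0)
                              for _ in range(n_steps)])[1]
              for _ in range(10)]
    print(n_steps, "steps: upper limit ranges over",
          round(max(uppers) - min(uppers), 2))
```

The second loop gives a feel for the point made above: with few steps, the estimated limit tends to vary noticeably from run to run.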

How long a chain is needed to produce stable estimates of the 95% HDI? One useful heuristic answer is 10,000 independent steps. The rationale for the heuristic is explained in Section 7.5.2 of Kruschke (2015). Note that the requirement is 10,000 independent steps. Unfortunately, most MCMC chains are strongly autocorrelated, meaning that successive steps are near each other, and are not independent. Therefore we need a measure of chain length that takes into account the autocorrelation of the chain. Such a measure is called the effective sample size (ESS), for which a formal definition is provided in Section 7.5.2 of Kruschke (2015).

ESS is computed in R by the effectiveSize function (which is in the coda package, which in turn is loaded along with the rjags package for JAGS). For example, suppose we have generated an MCMC chain using the rjags function, coda.samples, and the resulting object is called mcmcfin. Then we can find the ESS of the parameters by typing effectiveSize(mcmcfin).

It is crucial to realize that (i) the ESS will usually be much less than the number of steps in the MCMC chain, and (ii) every parameter in a multi-parameter model has a different ESS. Some parameters might have large ESS while others have small ESS. Moreover, combinations of parameters, such as a difference of two means, can have quite different ESS than the separate parameters. Therefore it is important to check the ESS of every parameter of interest, and the ESS of any interesting parameter combinations.
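To see how autocorrelation shrinks the ESS, here is a rough Python sketch that divides the chain length by one plus twice the sum of the autocorrelations. This is a deliberately simple estimator, with the sum truncated at the first non-positive autocorrelation; coda's effectiveSize uses a different (spectral) estimate, so the numbers will not match it exactly. The simulated chains and all names are assumptions for illustration.

```python
import random

def autocorr(chain, lag):
    """Sample autocorrelation of the chain at the given lag."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    cov = sum((chain[i] - mean) * (chain[i + lag] - mean)
              for i in range(n - lag)) / n
    return cov / var

def effective_sample_size(chain, max_lag=200):
    """ESS = N / (1 + 2 * sum of autocorrelations), with the sum
    truncated at the first lag whose autocorrelation is not positive."""
    total = 0.0
    for lag in range(1, min(max_lag, len(chain) - 1) + 1):
        rho = autocorr(chain, lag)
        if rho <= 0.0:
            break
        total += rho
    return len(chain) / (1.0 + 2.0 * total)

random.seed(2)
n_steps = 5000
# A chain of independent steps, and a strongly autocorrelated AR(1) chain.
independent = [random.gauss(0.0, 1.0) for _ in range(n_steps)]
autocorrelated = [0.0]
for _ in range(n_steps - 1):
    autocorrelated.append(0.9 * autocorrelated[-1] + random.gauss(0.0, 1.0))

ess_independent = effective_sample_size(independent)
ess_autocorrelated = effective_sample_size(autocorrelated)
print(f"independent chain ESS:    {ess_independent:.0f} of {n_steps} steps")
print(f"autocorrelated chain ESS: {ess_autocorrelated:.0f} of {n_steps} steps")
```

The same function can be applied to a derived chain, such as the step-by-step difference of two parameters, to check the ESS of a parameter combination.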

Stay tuned for Parts 2 and 3...

Thursday, February 8, 2018

All articles from Bayesian special issue free-of-charge til April 6


Special Virtual Issue: Bayesian Inference for Psychology

Psychonomic Bulletin & Review

Scroll down on the page linked here for the articles in the special issue on Bayesian data analysis. The list of links includes these:

Articles

Bayesian data analysis for newcomers
John K. Kruschke and Torrin M. Liddell
The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective
John K. Kruschke and Torrin M. Liddell

Starting February 19, there will be a series of commentaries on the articles. I will update the blog when more information is available.