Wednesday, September 19, 2018

Which movie (treatment, group) is better? Opposite conclusions from different models.

Which movie is better? One way to answer is by considering the star ratings given to those movies. Just treat those 1-to-5 star ratings as numbers, throw them into a t test, and out pops your answer. Right? Not necessarily...

The analogous structure arises in many situations. Suppose, for example, we ask which group is happier, a group of poor people or a group of rich people? One way to answer is by considering subjective happiness ratings from an ordinal scale: 1 = very unhappy, 2 = mildly unhappy, 3 = neither unhappy nor happy, 4 = mildly happy, 5 = very happy. Just treat those 1-to-5 ratings as numbers, throw them into a t test, and out pops your answer. Right? Not necessarily...

Or, consider ratings of symptom intensity in different treatment groups. How bad is your headache? How depressed do you feel? Just treat the ratings as numbers and throw them into a t test, and out pops your answer. Right? Not necessarily...

Treating ordinal values as if they were numeric can lead to misinterpretations. Ordinal values do not indicate equal distances between their levels, nor equal coverage of each level. The conventional t test (and ANOVA and least-squares regression, etc.) assumes the data are metric values normally distributed around the model's predicted values. But obviously ordinal data are not normally distributed metric values.

A much better model of ordinal data is the ordered-probit model, which assumes a continuous latent dimension that is mapped to ordinal levels by slicing the latent dimension at thresholds. (The ordered-probit model is not the only good model of ordinal data, of course, but it's nicely analogous to the t test etc. because it assumes normally distributed noise on the latent dimension.)
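As a rough sketch of the thresholding idea (this is not the article's code), the probability of each ordinal level is the area under the latent normal curve between adjacent thresholds. The function names and the threshold values 1.5, 2.5, 3.5, 4.5 below are assumptions chosen for illustration only.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(mu, sigma, thresholds):
    """Probability of each ordinal level under an ordered-probit model.

    mu, sigma  : mean and SD of the latent normal distribution
    thresholds : increasing cut points; K-1 thresholds give K levels
    """
    # Cumulative probability at each cut point; the outer edges are 0 and 1.
    cdf = [0.0] + [normal_cdf((t - mu) / sigma) for t in thresholds] + [1.0]
    # Each level's probability is the area between adjacent cut points.
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]

# Hypothetical thresholds for a 1-to-5 scale (assumed, not estimated from data).
thresholds = [1.5, 2.5, 3.5, 4.5]
probs = ordered_probit_probs(mu=3.0, sigma=1.0, thresholds=thresholds)
for level, p in enumerate(probs, start=1):
    print(f"P(rating = {level}) = {p:.3f}")
```

With the latent mean centered between the middle thresholds, the level probabilities come out symmetric around rating 3. Shifting mu or widening sigma moves probability toward the end levels, which is something a metric model of the raw ratings cannot represent.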

The t test and the ordered probit model can produce opposite conclusions about the means of the groups. Here's an example involving star ratings from two movies:

The figure above shows data from two movies, labelled as Cases 5 and 6 in the first two columns. The pink histograms show the frequency distributions of the star ratings; they are the same in the upper and lower rows. The upper row shows the results from the ordered-probit model. The lower row shows the results from the metric model, that is, the t test. In particular, the right column shows the posterior distribution of the difference of mu's for the two movies. The differences are strongly in opposite directions for the two analyses. Each posterior distribution is marked with a dotted line at a difference of zero, and the line is annotated with the percentage of the distribution below zero and above zero. Notice that the ordered-probit model fits the data much better than the metric model, as shown by the posterior predictions superimposed on the data: blue dots for the ordered-probit model, and blue normal distributions for the metric model. (This is Figure 8 of the article linked below.)
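To see how such a reversal can arise at all, here is a small hypothetical illustration. The latent means, standard deviations, and thresholds are made up for the demonstration; they are not the estimates for Cases 5 and 6. The idea is that when two groups differ in latent standard deviation, the group with the higher latent mean can still have the lower arithmetic mean of its ratings, because a wide latent distribution pushes a lot of probability into the bottom level.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordered_probit_probs(mu, sigma, thresholds):
    """Probability of each ordinal level given a latent normal(mu, sigma)."""
    cdf = [0.0] + [normal_cdf((t - mu) / sigma) for t in thresholds] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(len(cdf) - 1)]

def mean_rating(probs):
    """Arithmetic mean of the ratings, treating levels 1..K as numbers."""
    return sum(level * p for level, p in enumerate(probs, start=1))

# Assumed thresholds for a 1-to-5 scale (illustrative, not fitted).
thresholds = [1.5, 2.5, 3.5, 4.5]

# Hypothetical group A: higher latent mean, but very wide latent spread.
mu_a, sigma_a = 5.0, 4.0
# Hypothetical group B: lower latent mean, narrow latent spread.
mu_b, sigma_b = 4.3, 1.0

mean_a = mean_rating(ordered_probit_probs(mu_a, sigma_a, thresholds))
mean_b = mean_rating(ordered_probit_probs(mu_b, sigma_b, thresholds))

print(f"latent mean A = {mu_a}, metric mean of ratings A = {mean_a:.2f}")
print(f"latent mean B = {mu_b}, metric mean of ratings B = {mean_b:.2f}")
```

In this made-up case the latent mean of A is above that of B, yet the metric mean of A's ratings is below that of B's. A t test on the ratings and an ordered-probit analysis would therefore point in opposite directions.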

Read all about it here:

Published article:

Preprint manuscript:

R code:


  1. One must note that there is a 10-fold difference in the number of observations for the two movies. Therefore, the one with the smaller viewership is pulled more strongly toward the prior (the strength of the prior defines the relative weight of the data), so the posterior of μ for the first movie depends heavily on the exact shape of the prior. Judging from the posterior of μ, the prior is quite informative, because on Amazon I would buy the second product, but this model predicts P(μ5>μ6) < 1%.

    Given the decision theoretic context, we must not forget about the utility of different ratings: we could prefer the product/movie that’s slightly better with large probability, or significantly better with moderate probability.

    Nonetheless, it's an interesting model, thanks for the post.

    1. To be clear: I agree that a t-test for ordinal data is rarely appropriate, so any improvement is welcome.

      I just noticed that you are the author of the paper, so perhaps you could clarify a question about the JAGS code. (Thanks for making it open source.) I was wondering whether the sampling statement for mu should come before that of pr; in the code, pr comes first. I don't know JAGS well enough to tell if it matters.

      Laszlo Treszkai
      firstname.lastname at gmail

    2. Thanks for the thoughtful comments.

      Regarding N for each movie: In the full analysis, the N is automatically taken into account by the model. Movies with larger N carry more weight than movies with smaller N.

      But what about movies? Aside from what you mentioned, there are lots of reasons not to treat the latent mean as the only guide to movie quality. The standard deviation also is informative, indicating the (in-)consistency of reviewer reaction. And even more crucially, all of the ratings are self-selected for each movie; this is *not* a random sample of reviewers. The model differences are nicely illustrated by movie data (which is why I used them), but it's really hard to interpret the results!

      Regarding JAGS code: JAGS is a declarative language, not a procedural language. Statements should be made in the order that makes sense to human readers. See this blog post:

    3. Thanks for the reply.

      Earlier, one was advised to read _Stan_ models from the bottom up, but now it seems to work with an arbitrary order of sampling statements, just like JAGS.

      I will need to analyze the source of the discrepancy between this model's strong prediction for μ5>μ6 and my strong intuitive sense for μ5<μ6: when considering the difference between movie 5 and 6, we're essentially moving rating percentage from rating "1" to "4" (and adding more ratings) and we end up with a lower predicted mean. I’ll post an update here when I do so.