## Friday, November 21, 2014

### Ordinal probit regression: Transforming polr() parameter values to make them more intuitive

In R, the polr function in the MASS package does ordinal probit regression (and ordinal logistic regression, but I focus here on probit). The polr function yields parameter estimates that are difficult to interpret intuitively because they assume a baseline intercept of 0 and a noise standard deviation of 1, which produces slopes and thresholds that are relative to an underlying scale with little direct relation to the ordinal values in the data. In can be more intuitive to fix the low and high thresholds at values that make sense on the ordinal data scale, and estimate the other parameters relative to those low and high thresholds. This intuitive parameterization is used in Chapter 23 of DBDA2E. This blog post presents a function that transforms the polr parameter values to the intuitive parameterization.

## Background: Ordinal probit regression.

See Chapter 23 of DBDA2E for a complete introduction to and explanation of ordinal probit regression. The figure below provides a quick reminder of the model.
The data to be predicted are ordinal, shown on the left side of the figure. The predictor in this case is metric, shown at the lower-right of the figure. The model assumes there exists a underlying metric scale for the ordinal outcome, and this underlying scale is chopped into subintervals that have thresholds denoted theta (see the diagram). The model assumes there is some noisy mapping from the predictor to the underlying metric response scale, as assumed by ordinary linear regression, and the probability of each discrete ordinal response is the area under the normal curve between the thresholds for that response category.

From the data we estimate the intercept, the slope, the standard deviation, and the thresholds. But the underlying scale has arbitrary location and magnification, so two of the parameters can be arbitrarily fixed. In polr, the intercept is fixed at 0 and the standard deviation is fixed at 1. But that yields thresholds and slopes on a scale with no intuitive relation to the response values, which go from 1 to K, where K is the maximum ordinal response level.

A more intuitive parameterization fixes the low threshold at 1.5 (half way between 1 and 2), and the high threshold at K-.5 (half way between K-1 and K). Then the thresholds are directly interpretable with respect to those anchors, and the standard deviation can also be interpreted relative to the distance between those anchors.

## Example: Happiness and financial assets.

Consider subjective ratings of happiness on a 1 to 5 ordinal scale, where 1 is bummed out and 5 is ecstatic (I forget the exact wording used to label the response items). For each respondent we also get their total assets. Some real data are shown in the top panel of the figure below:
See the figure caption for details. The lower panels of the figure show aspects of the Bayesian posterior distribution. Notice that the lowest threshold is fixed at 1.5, and the highest threshold is fixed 4.5. This means that the underlying metric output scale is aligned with the ordinal responses, and the other parameters are relative to the low and high anchors. In particular, the intercept is at about 3.16, which suggests a happiness rating of about 3 even with zero assets. The slope is about 4.14e-6 happiness units per yuan. The standard deviation of the noise is about 0.854 happiness units, suggesting large variability on a 5-point scale.

The output of polr, however, looks like this (where Happiness is Yord and Assets is X):

> polrInfo = polr( Yord ~ X , method="probit" )
> print( polrInfo )

Coefficients:
X
4.025474e-06

Intercepts:
1|2       2|3       3|4       4|5
-2.048053 -1.180089 -0.127664  1.544575


What do those parameter estimates mean? The coefficient on X is the slope, and the intercepts are the thresholds. While we can use those estimates to derive predictions of outcome probabilities, they aren't very directly interpretable, at least not for me!

The polr estimates can be transformed into the parameterization I explained above, and then we get the following:

> polrInfoOrd = polrToOrdScale( polrInfo )
> print( polrInfoOrd )

$sigma  0.8350432$b0
 3.210213

$coefficients X 3.361444e-06$zeta
1|2      2|3      3|4      4|5
1.500000 2.224788 3.103608 4.500000


The thresholds are the zeta values; notice that the lowest threshold is 1.5 and the highest threshold is 4.5. The sigma, intercept (i.e., b0), and slope (i.e., coefficient) are very close to the modal posterior values of the Bayesian estimation.

## The function:

Here is the specification of the function in R:

polrToOrdScale = function( polrObject ) {
polrThresh = polrObject$zeta polrSlopes = polrObject$coefficients
polrInter = 0.0
polrSigma = 1.0
K = length(polrThresh) + 1  # K is number of ordinal levels
sigmaMult = unname( (K-2)/(polrThresh[K-1]-polrThresh) )
inter = unname( 1.5 - ( sigmaMult*polrThresh ) )
respThresh = sigmaMult*polrThresh + inter
respSigma = sigmaMult*polrSigma
respB0 = sigmaMult*polrInter + inter
respSlopes = sigmaMult*polrSlopes
return( list( sigma=respSigma ,
b0=respB0 ,
coefficients=respSlopes ,
zeta=respThresh ) )
}

I have only tested the function on a few examples; there are no guarantees! Feel free to reply with refinements.

1. Thanks for the post John! So presumably you could set up the Bayesian model with an intercept and sigma fixed at 0 and 1 and then apply your transformation at each step of the chain?

I'm trying to run a similar model in STAN, but it seems to sample inefficiently when fixing the two cut off points.

All the best

2. I just tried it and it seems to do the job very well.

Thanks again for the post!

3. Opps, seems I cant get the correct results for the intercept...

4. Yeah, I used the un-intuitive parameterization in the 1st edition of the book, so you could look at that for how to specify it in BUGS.

5. Fantastic, thanks John. Got it working now and recovers the generating parameters. I've coded up both versions in Stan (fixed thresholds and fixed sigma/intercept). Both get the same results but the later is much more efficient with lower autocorrelation. Let me know if you would like the code.

Our lab (Leeds Psyc) works with Geoff Bingham on various projects. If I ever make it over to Indiana I'll have to get you a coffee as thanks!

All the best,

6. Turns out this function is what I really needed today, thanks!

7. If anyone has done the equivalent when replacing the normal distribution with a t-distribution i'd love to see the code...

8. I think the ordinal prediction makes more sense to predict the Likert scale.
But I wonder how to do this with nominal predictors?
Is there any toy example code available?

9. Dear Fenn Lien:

Sure, see Section 23.3 of DBDA2E.

Then generalize from there, e.g., put in AVOVA-like structure.

10. Thanks for the quick response, I did not notice that.

I have a quick look at 23.3, it's about two group, for nominal variables, they should be multiple groups.
I just came across BDA recently, so I do not have much background about that.
Would you please give me more hints about that? I have some trouble to extend this example to multiple groups.

Many thanks.

Fenn

11. Fenn Lien:
I don't have a specific script for that scenario, but it's straight forward to create one. (Admittedly, you have to get used to making scripts in R with JAGS and runjags or rjags, but it's worth the effort!) Essentially, you want to combine the top part of the model structure in Figure 19.2 (p. 558) with the threshold-normal likelihood function of Figure 23.6 (p. 687). That is, the mu in Fig 23.6 does not come from beta0+beta1*x, but instead comes from the baseline plus deflections of the groups in Fig.19.2. Ultimately, I recommend you use the structure in Figure 19.6, p. 574.

12. Thanks very much. That helps a lot. I will have a read of these two models, and try to implement the new model. Of course, I need to get familiar with R and JAGS first!

Thanks again!

Fenn

13. Hi Prof. Kruschke,

I have some rating data. Mostly, the ratings are 1s (over 80%). Small portion of the data are 2s, 3s, 4s, and 5s.
After I ran the program (single group of ordinal predicted variable), why the posterior distribution on mean gave negative values, say mode=-2.25, 95% HDI is from -5.36 to -0.441?
Could you help me understand that?

1. Remember, the first threshold is fixed at 1.5, and the highest threshold is fixed at K-0.5. When the mean of the underlying trend is negative, it simply means that the mean is far below the first threshold, which implies that most of the data will be 1's, with only a few 2's, 3's, etc. Please see Chapter 23 of DBDA2E for more info.

2. 14. 1. 