Doing Bayesian Data Analysis: Ordinal probit regression: Transforming polr() parameter values to make them more intuitive

Friday, November 21, 2014


In R, the polr function in the MASS package does ordinal probit regression (and ordinal logistic regression, but I focus here on probit). The polr function yields parameter estimates that are difficult to interpret intuitively because they assume a baseline intercept of 0 and a noise standard deviation of 1, which produces slopes and thresholds that are relative to an underlying scale with little direct relation to the ordinal values in the data. It can be more intuitive to fix the low and high thresholds at values that make sense on the ordinal data scale, and to estimate the other parameters relative to those low and high thresholds. This intuitive parameterization is used in Chapter 23 of DBDA2E. This blog post presents a function that transforms the polr parameter values to the intuitive parameterization.

Background: Ordinal probit regression.

See Chapter 23 of DBDA2E for a complete introduction to and explanation of ordinal probit regression. The figure below provides a quick reminder of the model.

The data to be predicted are ordinal, shown on the left side of the figure. The predictor in this case is metric, shown at the lower-right of the figure. The model assumes there exists an underlying metric scale for the ordinal outcome, and this underlying scale is chopped into subintervals that have thresholds denoted theta (see the diagram). The model assumes there is some noisy mapping from the predictor to the underlying metric response scale, as assumed by ordinary linear regression, and the probability of each discrete ordinal response is the area under the normal curve between the thresholds for that response category.
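As a minimal sketch of that last sentence, the category probabilities can be computed as differences of the cumulative normal at adjacent thresholds. The function name ordProbs and the numeric values are hypothetical, chosen only for illustration; they are not taken from the book or from the data.

```r
# Probability of each ordinal category under the thresholded-normal model.
# mu     : mean of the underlying metric scale (e.g., intercept + slope*x)
# sigma  : standard deviation of the noise on the underlying scale
# thresh : increasing vector of K-1 thresholds (theta)
# (ordProbs is a hypothetical helper name, not part of MASS or DBDA2E.)
ordProbs = function( mu , sigma , thresh ) {
  # Cumulative probability at each threshold, padded with 0 and 1
  cumProb = c( 0 , pnorm( thresh , mean=mu , sd=sigma ) , 1 )
  # Area under the normal curve between adjacent thresholds
  diff( cumProb )
}

# Hypothetical values for a 5-level response
p = ordProbs( mu=3.0 , sigma=1.0 , thresh=c(1.5,2.5,3.5,4.5) )
print( round( p , 3 ) )
```

The K probabilities sum to 1 because the padded cumulative vector runs from 0 to 1.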

From the data we estimate the intercept, the slope, the standard deviation, and the thresholds. But the underlying scale has arbitrary location and magnification, so two of the parameters can be arbitrarily fixed. In polr, the intercept is fixed at 0 and the standard deviation is fixed at 1. But that yields thresholds and slopes on a scale with no intuitive relation to the response values, which go from 1 to K, where K is the maximum ordinal response level.
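The arbitrariness of location and magnification can be checked numerically. The sketch below, which uses hypothetical parameter values and a hypothetical helper ordProbs, applies the same shift and multiplier to the intercept, slope, standard deviation, and thresholds, and confirms that the predicted category probabilities do not change.

```r
# Category probabilities as areas under the normal curve between thresholds
# (hypothetical helper, for illustration only)
ordProbs = function( mu , sigma , thresh ) {
  diff( c( 0 , pnorm( thresh , mean=mu , sd=sigma ) , 1 ) )
}

# Hypothetical values on polr's scale: intercept 0 and sigma 1
b0 = 0 ; b1 = 0.5 ; sigma = 1
thresh = c( -1.0 , 0.2 , 1.3 )
x = 1.7
pOrig = ordProbs( b0 + b1*x , sigma , thresh )

# Arbitrary shift a and positive multiplier m applied to the underlying scale
a = 2.0 ; m = 3.0
pNew = ordProbs( (a + m*b0) + (m*b1)*x , m*sigma , a + m*thresh )

# Same probabilities, up to floating-point error
print( all.equal( pOrig , pNew ) )
```

Because any such pair (a, m) leaves the predictions unchanged, two constraints are needed to pin the scale down, whether that is intercept 0 and sigma 1 or two fixed thresholds.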

A more intuitive parameterization fixes the low threshold at 1.5 (halfway between 1 and 2), and the high threshold at K-0.5 (halfway between K-1 and K). Then the thresholds are directly interpretable with respect to those anchors, and the standard deviation can also be interpreted relative to the distance between those anchors.

Example: Happiness and financial assets.

Consider subjective ratings of happiness on a 1 to 5 ordinal scale, where 1 is bummed out and 5 is ecstatic (I forget the exact wording used to label the response items). For each respondent we also get their total assets. Some real data are shown in the top panel of the figure below:

See the figure caption for details. The lower panels of the figure show aspects of the Bayesian posterior distribution. Notice that the lowest threshold is fixed at 1.5, and the highest threshold is fixed at 4.5. This means that the underlying metric output scale is aligned with the ordinal responses, and the other parameters are relative to the low and high anchors. In particular, the intercept is at about 3.16, which suggests a happiness rating of about 3 even with zero assets. The slope is about 4.14e-6 happiness units per yuan. The standard deviation of the noise is about 0.854 happiness units, suggesting large variability on a 5-point scale.

The output of polr, however, looks like this (where Happiness is Yord and Assets is X):

> polrInfo = polr( Yord ~ X , method="probit" )
> print( polrInfo )


Coefficients:
           X 
4.025474e-06 

Intercepts:
      1|2       2|3       3|4       4|5 
-2.048053 -1.180089 -0.127664  1.544575

What do those parameter estimates mean? The coefficient on X is the slope, and the intercepts are the thresholds. While we can use those estimates to derive predictions of outcome probabilities, they aren't very directly interpretable, at least not for me!

The polr estimates can be transformed into the parameterization I explained above, and then we get the following:

> polrInfoOrd = polrToOrdScale( polrInfo )
> print( polrInfoOrd )

$sigma
[1] 0.8350432

$b0
[1] 3.210213

$coefficients
           X 
3.361444e-06 

$zeta
     1|2      2|3      3|4      4|5 
1.500000 2.224788 3.103608 4.500000

The thresholds are the zeta values; notice that the lowest threshold is 1.5 and the highest threshold is 4.5. The sigma, intercept (i.e., b0), and slope (i.e., coefficient) are very close to the modal posterior values of the Bayesian estimation.
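As a rough check that the two parameterizations describe the same model, the sketch below computes the predicted response probabilities from both sets of printed estimates. The asset value x and the helper name ordProbs are hypothetical; the estimates are copied from the output above, so agreement is limited by the rounding of the printed digits.

```r
# Category probabilities as areas under the normal curve between thresholds
# (hypothetical helper, for illustration only)
ordProbs = function( mu , sigma , thresh ) {
  diff( c( 0 , pnorm( thresh , mean=mu , sd=sigma ) , 1 ) )
}

# polr estimates, copied from the printed output (intercept 0, sigma 1)
polrSlope  = 4.025474e-06
polrThresh = c( -2.048053 , -1.180089 , -0.127664 , 1.544575 )

# Transformed estimates, copied from the polrToOrdScale output
ordSigma  = 0.8350432
ordB0     = 3.210213
ordSlope  = 3.361444e-06
ordThresh = c( 1.5 , 2.224788 , 3.103608 , 4.5 )

x = 100000  # hypothetical asset value, in yuan

pPolr = ordProbs( 0 + polrSlope*x , 1 , polrThresh )
pOrd  = ordProbs( ordB0 + ordSlope*x , ordSigma , ordThresh )

# The two rows should agree up to rounding of the printed estimates
print( round( rbind( pPolr , pOrd ) , 4 ) )
```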

The function:

Here is the specification of the function in R:

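polrToOrdScale = function( polrObject ) {
  # Estimates on polr's scale, where the intercept is 0 and sigma is 1:
  polrThresh = polrObject$zeta
  polrSlopes = polrObject$coefficients
  polrInter = 0.0
  polrSigma = 1.0
  K = length(polrThresh) + 1 # K is number of ordinal levels
  # Multiplier that stretches the distance between the lowest and highest
  # thresholds to K-2, the distance between the anchors 1.5 and K-0.5.
  # (Needs K >= 3; with K = 2 the multiplier is 0/0.)
  sigmaMult = unname( (K-2)/(polrThresh[K-1]-polrThresh[1]) )
  # Shift that puts the lowest threshold at 1.5:
  inter = unname( 1.5 - ( sigmaMult*polrThresh[1] ) )
  # Apply the same linear transformation to every parameter:
  respThresh = sigmaMult*polrThresh + inter
  respSigma = sigmaMult*polrSigma
  respB0 = sigmaMult*polrInter + inter
  respSlopes = sigmaMult*polrSlopes
  return( list( sigma=respSigma ,
                b0=respB0 ,
                coefficients=respSlopes ,
                zeta=respThresh ) )
}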

I have only tested the function on a few examples; there are no guarantees! Feel free to reply with refinements.
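One way to try the function is on simulated data whose generating parameters are already on the ordinal scale. The sketch below is only an illustration under stated assumptions: the seed, sample size, and generating values are hypothetical, and the function body is repeated so that the block runs on its own. It simulates a 5-level response, fits polr, and transforms the estimates.

```r
library(MASS)  # provides polr()

# The function from this post, repeated so the example is self-contained
polrToOrdScale = function( polrObject ) {
  polrThresh = polrObject$zeta
  polrSlopes = polrObject$coefficients
  polrInter = 0.0
  polrSigma = 1.0
  K = length(polrThresh) + 1 # K is number of ordinal levels
  sigmaMult = unname( (K-2)/(polrThresh[K-1]-polrThresh[1]) )
  inter = unname( 1.5 - ( sigmaMult*polrThresh[1] ) )
  respThresh = sigmaMult*polrThresh + inter
  respSigma = sigmaMult*polrSigma
  respB0 = sigmaMult*polrInter + inter
  respSlopes = sigmaMult*polrSlopes
  return( list( sigma=respSigma ,
                b0=respB0 ,
                coefficients=respSlopes ,
                zeta=respThresh ) )
}

# Hypothetical generating values, already on the 1-to-5 ordinal scale
set.seed(47405)
N = 2000
X = runif( N , 0 , 10 )
trueThresh = c( 1.5 , 2.3 , 3.4 , 4.5 )
latent = rnorm( N , mean = 2.0 + 0.2*X , sd = 0.9 )

# Chop the underlying metric value into ordinal categories 1..5
Yord = cut( latent , breaks=c(-Inf,trueThresh,Inf) ,
            labels=1:5 , ordered_result=TRUE )

polrInfo = polr( Yord ~ X , method="probit" )
polrInfoOrd = polrToOrdScale( polrInfo )
print( polrInfoOrd )
```

By construction the lowest and highest transformed thresholds land on 1.5 and 4.5; the remaining values can be compared informally with the generating values.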

17 comments:

Oscar Giles, June 2, 2015 at 6:21 AM
Thanks for the post John! So presumably you could set up the Bayesian model with an intercept and sigma fixed at 0 and 1 and then apply your transformation at each step of the chain?

I'm trying to run a similar model in STAN, but it seems to sample inefficiently when fixing the two cut off points.

All the best

Oscar Giles, June 2, 2015 at 7:40 AM
I just tried it and it seems to do the job very well.

Thanks again for the post!

Oscar Giles, June 2, 2015 at 11:45 AM
Oops, it seems I can't get the correct results for the intercept...

John K. Kruschke, June 2, 2015 at 6:45 PM
Yeah, I used the un-intuitive parameterization in the 1st edition of the book, so you could look at that for how to specify it in BUGS.

Oscar Giles, June 3, 2015 at 6:15 AM
Fantastic, thanks John. Got it working now, and it recovers the generating parameters. I've coded up both versions in Stan (fixed thresholds and fixed sigma/intercept). Both get the same results, but the latter is much more efficient, with lower autocorrelation. Let me know if you would like the code.

Our lab (Leeds Psyc) works with Geoff Bingham on various projects. If I ever make it over to Indiana I'll have to get you a coffee as thanks!

All the best,

Rasmus Bååth, July 28, 2015 at 8:32 AM
Turns out this function is what I really needed today, thanks!

Unknown, July 31, 2015 at 9:13 AM
If anyone has done the equivalent when replacing the normal distribution with a t-distribution, I'd love to see the code...

Fenn Lien, November 11, 2016 at 1:23 PM
I think the ordinal prediction makes more sense to predict the Likert scale.
But I wonder how to do this with nominal predictors?
Is there any toy example code available?

John K. Kruschke, November 11, 2016 at 2:12 PM
Dear Fenn Lien:

Sure, see Section 23.3 of DBDA2E.

Then generalize from there, e.g., put in ANOVA-like structure.

Fenn Lien, November 12, 2016 at 9:41 AM
Thanks for the quick response, I did not notice that.

I had a quick look at 23.3. It's about two groups, but nominal variables should allow multiple groups.
I just came across BDA recently, so I do not have much background about that.
Would you please give me more hints about that? I have some trouble extending this example to multiple groups.

Many thanks.

Fenn

John K. Kruschke, November 14, 2016 at 8:20 AM
Fenn Lien:
I don't have a specific script for that scenario, but it's straightforward to create one. (Admittedly, you have to get used to making scripts in R with JAGS and runjags or rjags, but it's worth the effort!) Essentially, you want to combine the top part of the model structure in Figure 19.2 (p. 558) with the threshold-normal likelihood function of Figure 23.6 (p. 687). That is, the mu in Fig. 23.6 does not come from beta0+beta1*x, but instead comes from the baseline plus deflections of the groups in Fig. 19.2. Ultimately, I recommend you use the structure in Figure 19.6, p. 574.

Fenn Lien, November 14, 2016 at 11:53 AM
Thanks very much. That helps a lot. I will have a read of these two models, and try to implement the new model. Of course, I need to get familiar with R and JAGS first!

Thanks again!

Fenn

Fenn Lien, December 30, 2016 at 3:31 PM
Hi Prof. Kruschke,

I have some rating data. Mostly, the ratings are 1s (over 80%). A small portion of the data are 2s, 3s, 4s, and 5s.
After I ran the program (single group of ordinal predicted variable), why did the posterior distribution on the mean give negative values, say a mode of -2.25 with a 95% HDI from -5.36 to -0.441?
Could you help me understand that?

Thanks in advance!

Unknown, August 13, 2018 at 3:37 AM
Hi Prof. Kruschke,
Thank you for sharing. I am doing Bayesian ordinal regression. In my case, I need to know the likelihood and prior for ordinal regression. Would you help me find the references? You mentioned Chapter 23 of DBDA2E; where can I find it? If possible, would you show me how to plot the chart above?

Thank you in advance.