Thursday, July 25, 2019

Shrinkage in hierarchical models: random effects in lmer() with and without correlation

The goal of this post is to illustrate shrinkage of parameter estimates in hierarchical (aka multi-level) models, specifically when using lmer() with and without estimated correlation of parameters. The examples will show how estimates can differ when including correlation of parameters because of shrinkage toward the estimated correlation.

Background

Data structure for these examples

“It all begins with the data… ” I will create multiple panels of \(\langle x , y \rangle\) data values, with \(x\) and \(y\) being continuous metric variables.
  • For instance, each panel could be data from a student in a classroom, with each datum being performance on a standardized math exam, with \(x\) being time and \(y\) being performance. In this scenario, each student takes a novel variant of the test repeatedly across time. The times do not need to be the same for every student, and the number of tests do not need to be the same for every student. We are interested characterizing the performance trend of each panel (i.e., each student) and the overall trend across panels (i.e., for the class as a whole).
  • As another example, each panel could be data from a distinct class within a school, with each datum being a particular student's exam performance (on the \(y\) axis) and family income (on the \(x\) axis). Again we are interested characterizing the trend of each panel (i.e., the relation of exam performance to family income within each classroom) and the overall trend across panels (i.e., the typical relationship of the variables across classrooms).
To illustrate robust shrinkage of panel estimates, each panel will have relatively few data points, and there will be relatively lots of panels. Graphs of the data will appear in analysis results, later.

Here (below) is the structure of the data. Notice there is an X variable, a Y variable, and a Panel variable. The panel variable is actually a nominal (categorical) value, even though it appears as a numerical index.
str( myData )
## 'data.frame':    208 obs. of  3 variables:
##  $ X    : num  0.4158 0.3795 0.0746 0.0588 0.4503 ...
##  $ Y    : num  -0.864 -0.579 0.227 -1.604 -0.895 ...
##  $ Panel: Factor w/ 35 levels "1","2","3","4",..: 1 1 1 1 1 1 2 2 2 3 ...

Analysis models

For simplicity, each panel will be fit with a linear trend. The hierarchical (a.k.a. multi-level) models will also estimate the typical linear trend across panels.

Parameters for panels are subject to shrinkage in hierarchical models because the panel's linear trend is trying to conform simultaneously to (a) the data in its panel and (b) the typical trend across all panels. When there are lots of panels informing the typical trend, and only a small amount of data within a panel, then the panel estimates are strongly influenced by the typical trend across panels. This makes good sense: If you don't know much about a particular panel, your best estimate should take into account what's typical across many other similar panels.

For more background about shrinkage in hierarchical models, there are lots on online sources you can search, and you can see some of my previous writings on the topic:
I will first fit a line independently to each panel, without hierarchical structure. This analysis will show the estimated intercept and slope in each panel when there is no shrinkage.

I will then fit a hierarchical model that estimates a typical intercept and typical slope across panels, but does not estimate the correlation of the intercepts and slopes across panels. This model produces some shrinkage across panel estimates, but does not shrink the estimates toward a shared correlation across panels.

Finally, I will fit a hiearchical model that also estimates the correlation of intercepts and slopes across panels. This model shrinks the panel estimates so they also conform more strongly with the estimated correlation across panels.

For the non-hierarchical analysis, I will use lm() from the base stats package of R. For the hierarchical analyses, I will use lmer() from the lme4 package in R.

Independent line for every panel

For this analysis, each individual panel is fit with its own line, separately from all other panels, using lm() on each panel. There is no hierarchical structure and no overall line estimated.

To make this analysis most analogous to subsequent analyses with lmer() the analyses should require all panels to have the same noise variance. But this is not done here, and actually the MLE coefficients are not affected in this case.

In principle, the analysis in this section would be like using lmer() with the formula y ~ 0 + (1+X||Panel), which specifies fitting lines within panels with no estimation of correlation across panels and no global parameters. But lmer() throws an error if that specification is attempted.
Here (below) are scatter plots of the data with the lm() fitted regression lines:
plot of chunk unnamed-chunk-6
Notice above:
  • Two-point panels such as Panel 4 and Panel 11 have lines going exactly through the two points. This will not be the case in hierarchical models.
  • The one-point Panel 35 has no regression line because it's undefined. This will not be the case in hierarchical models.
  • Panels 4 and 19 are color-highlighted for easy comparison with subsequent analyses.
plot of chunk unnamed-chunk-7
Notice above:
  • There is correlation of intercepts and slopes across panels (r=0.65), reflecting only how the data were generated, not any estimation of correlation in the model.
  • There is a lot of variation in intercepts and slopes across panels relative to hierarchical (multi-level) models below. There will be less variation in hierarchical models, hence the term shrinkage.
  • Panels 4 and 19 are color-highlighted for easy comparison across analyses.

Random intercepts and slopes, but no estimated correlation

I'll use lmer() with the formula, Y ~ 1 + X + ( 1 + X || Panel ), which is equivalent to Y ~ 1 + X + ( (1|Panel) + (0+X|Panel) ). lmer() assumes we want to estimate correlations of parameters across panels unless we tell it not to by using a double vertical bar or by explicitly coding the separate effects.
plot of chunk unnamed-chunk-8
Notice above:
  • Two-point panels such as Panel 4 and Panel 11 have lines not going exactly through the two points. This is because the line is trying to conform simultaneously to the data in the panel and what's typical across panels, as estimated by this particular hierarchical model.
  • The one-point Panel 35 has a regression line despite having only a single point. This is because the line is generated by what's typical across panels, influenced a bit by the single data point in the panel.
  • Panels 4 and 19 are color-highlighted for easy comparison with across analyses.
plot of chunk unnamed-chunk-9
Notice above:
  • There is correlation of intercepts and slopes across panels (r=0.814), but this reflects only how the data were generated and the separate shrinkage of intercepts and slopes, without any shrinkage from estimation of correlation.
  • There is less variation in intercepts and slopes across panels relative to the previous, non-hierarchical analysis, hence the term shrinkage. Specifically, the range of slopes across panels in the non-hierarchical model was -3.48, 2.59 but the range of slopes in this hierarchical model is -2.74, 2.03.
  • Panels 4 and 19 are color-highlighted for easy comparison across analyses.

Random intercepts and slopes, with estimated correlation

Here I use lmer() with forumla Y ~ 1 + X + ( 1 + X | Panel ). Notice the single vertical bar before Panel, so lmer() estimates the correlation of parameters across panels by default.
plot of chunk unnamed-chunk-10
Notice above:
There is even more shrinkage than in the previous model, because now the lines in each panel are also “trying” to conform to the typical correlation of intercept and slope across panels. Notice in particular the color-coded lines in panels 4 and 19.
plot of chunk unnamed-chunk-11
Notice above:
There is a strong correlation between the estimated slopes and intercepts (r=0.998). Here the correlation is estimated and there is shrinkage of estimates toward that correlation, and the correlation is stronger than the previous model because the estimates are shrunken toward that correlation.

What about the higher-level, "fixed" effects?

The higher-level intercept and slope are at the means of the panel intercepts and panel slopes, and those overall means are essentially the same across these analyses.

Conclusion

Hopefully these examples have helped to illustrate shrinkage in hierarchical models, and specifically the extra shrinkage introduced by estimating correlation across parameters.

Sunday, June 30, 2019

Bayesian estimation of severity in police use of force

In research reported in the journal Law and Human Behavior, Brad Celestin and I used Bayesian methods to measure perceived severities of police actions. For each of about two dozen actions, we had lay people rate the action's moral acceptability, appropriateness, punishability, and physical forcefulness. We regressed the ratings on the actions, simultaneously estimating latent scale values of the action severities. 

Below is a stylized graph to show the idea. The vertical axis shows the ratings, and the horizontal axis shows the underlying (latent) severity of the actions. In this graph, six actions are placed at arbitrary positions on the horizontal axis.


Below I've superimposed the regression equation. It's just linear regression, but the values of the predictors are estimated, not given.


Below is a stylized representation of the latent scale values that best fit the ratings:


Bayesian methods were especially useful for this because we obtained a complete posterior distribution on all the scale values. Bayesian methods were also very useful because the ratings were effectively censored by many respondents who pushed the response slider all the way to the top or bottom, so all we could discern from the response was that it was at least that high or low; censored dependent-variable data are handled very nicely in Bayesian analyses.

Here's the abstract from the article:
In modern societies, citizens cede the legitimate use of violence to law enforcement agents who act on their behalf. However, little is known about the extent to which lay evaluations of forceful actions align with or diverge from official use-of-force policies and heuristics that officers use to choose appropriate levels of responsive force. Moreover, it is impossible to accurately compare official policies and lay intuitions without first measuring the perceived severity of a set of representative actions. To map these psychometric scale values precisely, we presented participants with minimal vignettes describing officer and civilian actions that span the entire range of force options (from polite dialogue to lethal force), and asked them to rate physical magnitude and moral appropriateness. We used Bayesian methods to model the ratings as functions of simultaneously estimated scale values of the actions. Results indicated that the perceived severity of actions across all physical but non-lethal categories clustered tightly together, while actions at the extreme levels were relatively spread out. Moreover, less normative officer actions were perceived as especially morally severe. Broadly, our findings reveal divergence between lay perceptions of force severity and official law enforcement policies, and they imply that the groundwork for disagreement about the legitimacy of police and civilian actions may be partially rooted in the differential way that action severity is perceived by law enforcement relative to civilian observers.
A preprint of the article is here, and the published article is here. Full citation:
Celestin, B. D., & Kruschke, J. K. (2019). Lay evaluations of police and civilian use of force: Action severity scales. Law and Human Behavior, 43(3), 290-305.

Sunday, May 19, 2019

The Statistician's Error?

I just attended (and gave a talk at) the United States Conference on Teaching Statistics (USCOTS). Big thanks to Allan Rossman, who brilliantly MC-ed the conference.

• One keynote was about "moving beyond p < .05" in a talk by Ron Wasserstein and Allen Schirm, In their recent editorial in The American Statistician (with Nicole Lazar), a primary recommendation was Don't Say "Statistically Significant". Decisions with p values are about controlling error rates, but dichotomous decisions let people slip into "bright line" thinking wherein p < .05 means real and important and p > .05 means absent and unimportant.

• Another keynote, in a talk by Kari Lock Morgan, was about three possible explanations of an apparent effect of a manipulation, namely (i) genuine cause, (ii) random difference at baseline before manipulation, and (iii) random difference after manipulation.

I returned home from the conference this morning. To relax, after the intensive pre-conference preparation and during-conference insomnia, I opened a book of poetry and came across a poem by Aaron Fogel that (inadvertently) reflects upon both talks. It's a poem about how editors of printing make decisions regarding errors, and about three sources or errors, and distinguishing the sources of error. And about the role of editors (and perhaps of statisticians?).


The Printer's Error
Fellow compositors
and pressworkers!
I, Chief Printer
Frank Steinman,
having worked fifty-
seven years at my trade,
and served five years
as president
of the Holliston
Printer's Council,
being of sound mind
though near death,
leave this testimonial
concerning the nature
of printers' errors.
First: I hold that all books
and all printed
matter have
errors, obvious or no,
and that these are their
most significant moments,
not to be tampered with
by the vanity and folly
of ignorant, academic
textual editors.
Second: I hold that there are
three types of errors, in ascending
order of importance:
One: chance errors
of the printer's trembling hand
not to be corrected incautiously
by foolish professors
and other such rabble
because trembling is part
of divine creation itself.
Two: silent, cool sabotage
by the printer,
the manual laborer
whose protests
have at times taken this
historical form,
covert interferences
not to be corrected
censoriously by the hand
of the second and far
more ignorant saboteur,
the textual editor.
Three: errors
from the touch of God,
divine and often
obscure corrections
of whole books by
nearly unnoticed changes
of single letters
sometimes meaningful but
about which the less said
by preemptive commentary
the better.
Third: I hold that all three
sorts of error,
errors by chance,
errors by workers' protest,
and errors by
God's touch,
are in practice the
same and indistinguishable.
Therefore I,
Frank Steinman,
typographer
for thirty-seven years,
and cooperative Master
of the Holliston Guild
eight years,
being of sound mind and body
though near death
urge the abolition
of all editorial work
whatsoever
and manumission
from all textual editing
to leave what was
as it was, and
as it became,
except insofar as editing
is itself an error, and
therefore also divine.

Wednesday, March 13, 2019

Bayesian statistics at Princeton University, with a visit to the grave of John Von Neumann

I was very pleased to have had the opportunity to present a talk regarding Bayesian models of ordinal data at Princeton University, on Tuesday, March 5, 2019. An abstract of the talk is here, and the published article on which the talk was based is here.

A big thanks to Dr. Ting Qian who orchestrated the visit wonderfully. The lecture hall was packed, with people standing outside the door in the hall, in no small part due to Ting's organization of a popular statistics series. Big thanks also to the staff who made arrangements. And, of course, thanks to the many people who took time to meet with me while I was there.

I took a few hours the next morning to visit some special places in Princeton. In particular, I visited the graves of John Von Neumann and Kurt Godel. Von Neumann made contributions to Monte Carlo methods at the foundation of MCMC methods in Bayesian analysis.

Here's a snapshot of Von Neumann's grave stone:

And only a few feet away, Godel's grave stone:

After visiting the cemetery, I went to the Princeton University Art Museum and came across this painting by Georgia O'Keeffe. It's titled, "From a New Jersey Weekend II," painted in 1941.
It was an interesting coincidence that both O'Keeffe's and my visits featured two notable grave stones.

Here's a map of the Princeton Cemetery, with a marker at the approximate location of the graves of Von Neumann and Godel.


For posts about the book, DBDA2E, visiting other famous grave sites, see this post and its links to previous posts.

Thursday, February 21, 2019

A Stendhal moment on the way to Bayesian stats class

On the way to my Bayesian stats class this morning I had a few moments of Stendhal syndrome and thought I'd share it with y'all. (Aside from being on the way to Bayesian stats class, it has nothing directly relevant to Bayesian statistics.) The sun was shining through some construction fences lining a sidewalk and were unintentionally beautifying an otherwise routine Thursday morning. Here are a couple of snapshots from my office window:


It reminds me of Christo and Jeanne-Claude's Running Fence:

Sunday, October 21, 2018

Sinusoidal trend and global warming UPDATED

In a previous post from six years ago, I fit a sinusoidal trend, with auto-regressive component, to daily temperature data. (Spoiler alert: It's still getting warmer.) Recently I've received inquiries about the script for that analysis. I disinterred the ancient script, updated it, and grabbed more recent temperature data. The script and data file are linked below.

The result of the new analysis:
As you can see from the plot (above), the slope of the linear spine of the sinusoidal variation is 0.068 degrees Fahrenheit per year. The 95% HDI on the estimate spans zero, just as it did with the smaller data set in the previous post from six years ago. But I'm pretty sure that if this city were put into a big hierarchical model with lots of other cities across the globe, the high-level estimate of slope on the linear spine would be clearly greater than zero.

But evidence for global warming is not the point of this post. The point is to link the full script and data file. Here they are: R script; data. Hope this is helpful.

Wednesday, September 19, 2018

Which movie (treatment, group) is better? Opposite conclusions from different models.

Which movie is better? One way to answer is by considering the star ratings given to those movies. Just treat those 1-to-5 star ratings as numbers, throw them into a t test, and out pops your answer. Right? Not necessarily...

The analogous structure arises in many situations. Suppose, for example, we ask which group is happier, a group of poor people or a group of rich people? One way to answer is by considering subjective happiness ratings from an ordinal scale: 1 = very unhappy, 2 = mildly unhappy, 3 = neither unhappy nor happy, 4 = mildly happy, 5 = very happy. Just treat those 1-to-5 ratings as numbers, throw them into a t test, and out pops your answer. Right? Not necessarily...

Or, consider ratings of symptom intensity in different treatment groups. How bad is your headache? How depressed do you feel? Just treat the ratings as numbers and throw them into a t test, and out pops your answer. Right? Not necessarily...

Treating ordinal values as if they were numeric can lead to misinterpretations. Ordinal values do not indicate equal distances between their levels, nor equal coverage of each level. The conventional t test (and ANOVA and least-squares regression, etc.) assumes the data are metric values normally distributed around the model's predicted values. But obviously ordinal data are not normally distributed metric values.

A much better model of ordinal data is the ordered-probit model, which assumes a continuous latent dimension that is mapped to ordinal levels by slicing the latent dimension at thresholds. (The ordered-probit model is not the only good model of ordinal data, of course, but it's nicely analogous to the t test etc. because it assumes normally distributed noise on the latent dimension.)

The t test and the ordered probit model can produce opposite conclusions about the means of the groups. Here's an example involving star ratings from two movies:


The figure above shows data from two movies, labelled as Cases 5 and 6 in the first two columns. The pink histograms show the frequency distributions of the star ratings; they are the same in the upper and lower rows. The upper row shows the results from the ordered-probit model. The lower row shows the results from the metric model, that is, the t test. In particular, the right column shows the posterior difference of mu's for the two movies The differences are strongly in opposite directions for the two analyses. Each posterior distribution is marked with a dotted line at a difference of zero, and the line is annotated with the percentage of the distribution below zero and above zero. Notice the ordered-probit model fits the data much better than the metric model, as shown by the posterior predictions superimposed on the data: blue dots for the ordered probit model, and blue normal distributions for the metric model. (This is Figure 8 of the article linked below.)

Read all about it here:

Published article:
https://www.sciencedirect.com/science/article/pii/S0022103117307746

Preprint manuscript: https://osf.io/9h3et/

R code: https://osf.io/53ce9/files/