Sunday, July 10, 2016

Bayesian variable selection in multiple linear regression: Model with highest R^2 is not necessarily highest posterior probability

Chapter 18 of DBDA2E includes sections on Bayesian variable selection in multiple linear regression. The idea is that each predictor (a.k.a., "variable") has an inclusion coefficient $\delta_j$ that can be 0 or 1 (along with its regression coefficient, $\beta_j$). Each combination of predictors is a different "model" of the predicted variable. The plots in DBDA2E show the posterior probabilities of combinations of included variables, but the plots do not show the posterior distribution of $R^2$ for each combination. This blog post shows plots with $R^2$ and, in turn, emphasizes that the model with the highest $R^2$ is not necessarily the model with the highest posterior probability. That preference for a parsimonious model --- not only the best fitting model ---  is a signature of the automatic penalty for complexity provided by Bayesian model comparison.
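To make the structure concrete, here is a minimal JAGS sketch of regression with inclusion indicators. It is simplified relative to the actual code in DBDA2E (which standardizes the data and uses different priors), and the variable names (Ntotal, Nx, tau, and so on) are merely illustrative:

model {
  for ( i in 1:Ntotal ) {
    y[i] ~ dnorm( mu[i] , tau )
    for ( j in 1:Nx ) {
      # a predictor contributes only when its inclusion indicator delta[j] is 1
      term[i,j] <- delta[j] * beta[j] * x[i,j]
    }
    mu[i] <- beta0 + sum( term[i,1:Nx] )
  }
  beta0 ~ dnorm( 0 , 1.0E-6 )    # vague priors, for illustration only
  for ( j in 1:Nx ) {
    beta[j] ~ dnorm( 0 , 1.0E-6 )
    delta[j] ~ dbern( 0.5 )      # 50/50 prior inclusion probability
  }
  tau <- 1 / pow( sigma , 2 )    # JAGS dnorm takes a precision
  sigma ~ dunif( 0 , 1.0E+3 )
}

Each distinct pattern of delta values visited by the MCMC chain corresponds to one "model", and the relative frequency of a pattern in the chain estimates that model's posterior probability.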

In this example, the mean Scholastic Aptitude Test (SAT) score for each state of the U.S. is the predicted variable, and the four candidate predictors are spending per student (by the state), percentage of students taking the exam, student/teacher ratio, and teacher salary. Full details are provided in Chapter 18 of DBDA2E.

Below are plots of the posterior distributions for the three most probable models, plus the model with all four predictors included. The four rows correspond to four different combinations of included variables. The posterior probability of each combination is shown at the top left of each panel, as "Model Prob".

You can see in the second row that the model that includes only the second predictor, PrcntTake, has a posterior probability of 0.235 and an $R^2$ of 0.784. The model in the fourth row includes all four predictors and has a higher $R^2$ of 0.826 but a much smaller posterior probability of 0.021. (The posterior distributions of low-probability models are choppy because the MCMC chain visits them rarely.)
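How might the plotted $R^2$ distributions be obtained? One simple way to get an $R^2$ value at every MCMC step is to add a derived quantity to the model block sketched earlier. (The actual DBDA2E code computes $R^2$ differently, from the standardized regression coefficients, so treat this only as an illustration of the idea.)

  # proportion of variance accounted for at this step of the chain:
  Rsq <- pow( sd( mu[1:Ntotal] ) , 2 ) /
         ( pow( sd( mu[1:Ntotal] ) , 2 ) + pow( sigma , 2 ) )

Monitoring Rsq together with delta[1:Nx] lets one split the $R^2$ samples by inclusion pattern, yielding a posterior distribution of $R^2$ for each model, as in the plots above.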

Thanks to Dr. Renny Maduro for suggesting the inclusion of $R^2$ in the plots. Dr. Maduro attended my recent workshop at ICPSR in Ann Arbor, Michigan.

Modified code for the graphs of this example is posted at the book's web site.

1. Could this be modified, in the case of Bayesian confirmatory factor analysis, to reduce the cross-loadings to a parsimonious set?

What I am thinking of is starting with a full ESEM model (i.e., all cross-loadings estimated) and placing inclusion coefficients just on the cross-loadings. So, for example, suppose we have a two-factor latent model with all cross-loadings; an indicator item y1 might have a JAGS specification like this (where t = 1 represents item y1):

y[i,t] ~ dnorm(condmn[i,t], invsig2[t])  # likelihood: item t for respondent i, with precision invsig2[t]

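One way that idea might be fleshed out (a sketch, not tested code; the names nu, lam, eta, and delta are illustrative, with eta[i,1:2] the latent factor scores): for an item assigned primarily to factor 1, put the inclusion indicator only on its factor-2 cross-loading, inside the loop over respondents and items:

condmn[i,t] <- nu[t] + lam[t,1]*eta[i,1] + delta[t]*lam[t,2]*eta[i,2]
delta[t] ~ dbern( 0.5 )  # 0/1 indicator: is the cross-loading of item t included?

The posterior probability that delta[t] equals 1 would then indicate whether the data demand that cross-loading, in the same spirit as the variable-selection example above.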