**The model with the highest posterior probability is *not* necessarily the model with the highest \(R^2\).** That preference for a parsimonious model, not merely the best-fitting model, is a signature of the automatic penalty for complexity provided by Bayesian model comparison.

In this example, the mean Scholastic Aptitude Test (SAT) score for each state of the U.S. is the predicted variable, and the four candidate predictors are spending per student (by the state), percentage of students taking the exam, student/teacher ratio, and teacher salary. Full details are provided in Chapter 18 of DBDA2E.

Below are plots of the posterior distributions for the three most probable models, and for the model with all four predictors included. There are four rows, corresponding to four different combinations of included predictors. The posterior probability of each combination is shown at the top left of each panel, as "Model Prob".

You can see in the second row that the model that includes only the second predictor, PrcntTake, has a posterior probability of 0.235 and an \(R^2\) of 0.784. The model in the fourth row includes all four predictors and has a higher \(R^2\) of 0.826 but a much smaller posterior probability of 0.021. (The posterior distributions of low probability models are choppy because the MCMC chain visits them rarely.)
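As a sketch of how "Model Prob" values like these arise: assuming the sampler records a 0/1 inclusion indicator for each of the four predictors at every MCMC step, the posterior probability of a model is simply the fraction of steps the chain spends in that combination of indicators. The draws below are simulated for illustration, not taken from the actual SAT analysis:

```python
import numpy as np
from collections import Counter

# Hypothetical MCMC draws of the inclusion-indicator vector (one 0/1 value
# per candidate predictor, one row per step of the chain). In a real run
# these would come from the sampler described in Ch. 18 of DBDA2E; the
# inclusion probabilities here are made up.
rng = np.random.default_rng(0)
draws = rng.binomial(1, [0.3, 0.9, 0.2, 0.4], size=(10_000, 4))

# "Model Prob" for each combination = fraction of MCMC steps visiting it.
counts = Counter(tuple(row) for row in draws)
model_probs = {combo: n / len(draws) for combo, n in counts.items()}

# The most probable models, analogous to the rows of the figure:
for combo, prob in sorted(model_probs.items(), key=lambda kv: -kv[1])[:3]:
    print(combo, round(prob, 3))
```

This counting view also explains the choppiness noted above: a low-probability model is visited on only a small fraction of steps, so its conditional posterior is estimated from few draws.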

Thanks to Dr. Renny Maduro for suggesting the inclusion of \(R^2\) in the plots. Dr. Maduro attended my recent workshop at ICPSR in Ann Arbor, Michigan.

Modified code for the graphs of this example is posted at the book's web site.

Could this be modified, in the case of Bayesian confirmatory factor analysis, to reduce cross-loadings to a parsimonious set?

What I am thinking is starting with a full ESEM model (i.e., all cross-loadings) and placing inclusion coefficients just on the cross-loadings. So, for example, let's say we have a two-factor latent model with all cross-loadings; an indicator item y1 might have a JAGS specification like (where t = 1 in this case, representing item y1):

y[i,t] ~ dnorm(condmn[i,t], invsig2[t])

condmn[i,t] <- mu[t] + fload1[t]*fscore[i,1] + fload2[t]*fscore[i,2]

where fload1 represents the loading on the latent factor the indicator item is supposed to load on, and fload2 is the cross-loading. Could we modify this, as per your post, so that the code becomes:

y[i,t] ~ dnorm(condmn[i,t], invsig2[t])

condmn[i,t] <- mu[t] + fload1[t]*fscore[i,1] + delta[t]*fload2[t]*fscore[i,2]
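A quick numeric check of what the indicator does (a Python translation of the JAGS line above, with made-up parameter values, not code from the post): when delta[t] = 0 the conditional mean reduces to the target-loading-only model, and when delta[t] = 1 the cross-loading is back in. A complete model would also need a prior on the indicator, e.g. a Bernoulli prior as in the variable-selection setup of DBDA2E Ch. 18.

```python
# Names mirror the JAGS snippet; values are invented for illustration.
mu_t, fload1_t, fload2_t = 50.0, 0.8, 0.3   # intercept, target loading, cross-loading
f1, f2 = 1.2, -0.5                          # factor scores for one respondent

def condmn(delta_t):
    # Conditional mean of y[i,t] given the inclusion indicator delta[t].
    return mu_t + fload1_t * f1 + delta_t * fload2_t * f2

# delta = 0 drops the cross-loading term entirely:
assert condmn(0) == mu_t + fload1_t * f1
# delta = 1 adds the cross-loading contribution back:
assert abs(condmn(1) - (condmn(0) + fload2_t * f2)) < 1e-12
```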

@Philip:

Absolutely, in principle you can put inclusion coefficients on latent factors just like on manifest variables.

But you've got to be very careful to do it all sensibly, and then you've got to battle all the particulars of implementing it. These considerations are discussed in Sections 18.4.4 and 18.4.5 of DBDA2E.


I suppose one could overcome the limitations you mention by not doing variable selection at all!

Let's say I only put the inclusion coefficients on the cross-loadings and leave the target loadings (i.e., loadings of indicator items on the factors they are designed to measure) as is. Now let's say the parameters I am interested in are those at the latent level, i.e., the latent correlations. It strikes me that the overall summary result for a latent correlation will be a weighted average over all candidate cross-loading models, with the weights determined by the model probabilities. To be honest, this might be preferable to model selection itself!
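That weighted-average intuition is standard Bayesian model averaging (BMA) and can be sketched numerically. All numbers below are invented; the point is only that the marginal posterior of a latent correlation is a mixture over the cross-loading models, weighted by their posterior probabilities:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior probabilities of three cross-loading models:
model_probs = np.array([0.6, 0.3, 0.1])

# Fabricated draws of a latent correlation, conditional on each model:
draws_per_model = [rng.normal(m, 0.05, size=5_000) for m in (0.40, 0.55, 0.70)]

# Model-averaged posterior mean = probability-weighted average of the
# conditional posterior means (what the MCMC chain computes automatically
# when it hops among models in proportion to their probabilities).
bma_mean = sum(p * d.mean() for p, d in zip(model_probs, draws_per_model))
print(round(bma_mean, 3))
```

Note that the mixture can be multimodal when the conditional posteriors disagree strongly, so the model-averaged mean should be read alongside the full mixed distribution, not in place of it.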