Sunday, June 11, 2017

Posterior distribution of predictions in multiple linear regression

In a previous post I showed how to compute posterior predictions from multiple linear regression inside JAGS. In this post I show how to do it after JAGS, in R.

There are trade-offs doing it in JAGS or afterwards in R. If you do it in JAGS, then you know that the model you're using to generate predictions is exactly the model you're using to describe the data, because you only specify the model once. If you do it after JAGS in R, then you have to code the model all over again in R, and face the peril of making mistakes because JAGS and R parameterize distributions differently (e.g., in dnorm and dt are parameterized differently in JAGS than in R). On the other hand, if you generate the predictions in JAGS, then you have to specify all the to-be-probed x values in advance, along with the data to be fit. If instead you generate predictions in R (after JAGS), then you can re-use the already-generated MCMC chain for to-be-probed x values as much as you like, without having to run MCMC all over again. In this post, we face the peril of specifying the model in R.

Please see the previous post for background information about this specific application to predicting SAT scores from spending per student and percentage of students taking the exam. To run the R code below, please open the high-level script Jags-Ymet-XmetMulti-Mrobust-Example.R. Run the script "out of the box" so it uses the SAT data (Guber1999data.csv) and the two predictors mentioned in the previous sentence. Then, at the end of the script, append the following R code:

# Remind self of the predictors:
show( xName )
# Specify probe values of the predictors. Columns must match predictors actually
# used, in order! Specify as many probe rows as you like:
xProbe = matrix( c( 4.0 , 10 ,   # Spend , PrcntTake
                    9.0 , 10 ,   # Spend , PrcntTake
                    4.0 , 90 ,   # Spend , PrcntTake
                    9.0 , 10 ) , # Spend , PrcntTake
                 ncol=length(xName) , byrow=TRUE )
# Convert mcmc coda object to matrix:
mcmcMat = as.matrix( mcmcCoda )
# Show column names of mcmcMat. Notice they include beta0, beta[1], beta[2],
# ..., which can be extracted using grep("^beta",colnames(mcmcMat)):
colnames( mcmcMat )
# Now, go through rows of xProbe matrix and display predicted mu and y:
for ( xProbeIdx in 1:nrow(xProbe) ) {
  # Predicted mu is b0+b1*x1+b2*x2+... at every step in chain:
  muPred = ( mcmcMat[,grep("^beta",colnames(mcmcMat))]
             %*% matrix(c(1,xProbe[xProbeIdx,]),ncol=1) )
  # Predicted y is is pred mu plus t-distrib noise at every step in chain:
  yPred = muPred + mcmcMat[,"sigma"] * rt(nrow(mcmcMat),df=mcmcMat[,"nu"])
  # Make plot of posterior distribution of prediction. Notice that plotPost()
  # also returns numerical values to console.
  par( mar=c(4,1,1,1) , mgp=c(2.0,0.7,0) , xpd=NA )
  text(0,0, adj=c(0.5,0.7),
       paste0(xName,"=",xProbe[xProbeIdx,],collapse="\n"), cex=1.25 )
  xLim = quantile( yPred , probs=c(0.005,0.995) )
  muPredInfo = plotPost( muPred ,xlab="Predicted mu" , xlim=xLim ,
                 main="" )
  show( muPredInfo )
  yPredInfo = plotPost( yPred , xlab="Predicted y" , xlim=xLim ,
                main="" )
  show( yPredInfo )

The code produces output plots such as the following (along with numerical values displayed on the command line):

N.B.: The plots only display the first three significant digits; see the numerical output at the command line for more digits.


Reminders of some recent posts: