Thursday, July 21, 2016

Bayesian predicted slopes with interaction in multiple linear regression

Suppose we have a multiple linear regression with interaction: \[\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{1\times 2} x_1 x_2 \] Notice that the slope on \(x_1\) is not just \(\beta_1\); it is \(\beta_1 + \beta_{1\times 2} x_2\): \[\hat{y} = \beta_0 + ( \beta_1 + \beta_{1\times 2} x_2) x_1 + \beta_2 x_2  \] In other words, the slope on \(x_1\) depends on the value of \(x_2\). That is the essence of interaction. Of course, the same applies analogously to the slope on \(x_2\).
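For example, with made-up coefficient values (purely for illustration, not estimates from any data), if \(\beta_1 = 10\) and \(\beta_{1\times 2} = -0.5\), then the slope on \(x_1\) is \[\beta_1 + \beta_{1\times 2} x_2 = 10 - 0.5\, x_2 ,\] which equals \(10\) when \(x_2 = 0\) but only \(5\) when \(x_2 = 10\).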

This is explained in Ch. 18 of DBDA2E, and an example is given in Figure 18.9 on p. 529 that plots the predicted slopes and their 95% HDIs as a function of the value of the other predictor. The variable "Spend" is \(x_1\) and the variable "PrcntTake" is \(x_2\). The top panel (below) plots the posterior predicted slope on \(x_1\) for values of \(x_2\):
After Figure 18.9, p. 529, of DBDA2E


A reader asked me how to compute those predicted slopes in R and make the graph. I thought I already had the R code in the package of programs, but upon searching discovered that it was not there. (Sorry; with over 700 pages of text and many dozens of programs, some things get overlooked.) This blog post supplies the R code!  :-)

First, from the folder of programs, open the file Jags-Ymet-XmetMulti-Mrobust-Example.R. Then, at the top of the file, comment out all the data-loading examples and put in the following code, which reads in the relevant data and creates an interaction variable. This is explained on p. 528 of DBDA2E:

# Two predictors with interaction:
# Read the SAT data file:
myData = read.csv( file="Guber1999data.csv" )
# Append named column with interaction product:
myData = cbind( myData , SpendXPrcnt=myData[,"Spend"]*myData[,"PrcntTake"] )
# Specify the column names:
yName = "SATT" ; xName = c("Spend","PrcntTake","SpendXPrcnt")
# Put the saved files into a sub-directory:
dirName = "Guber1999data-Jags-Inter"
if( !dir.exists(dirName) ) { dir.create(dirName) }
fileNameRoot = paste0(dirName,"/Guber1999data-Jags-Inter-")
# Specify the number of saved steps and the thinning (merely to save memory):
numSavedSteps=15000 ; thinSteps=20 # smaller thinSteps will run faster
# Specify the preferred graph file format:
graphFileType = "png"
Then run the MCMC, as already specified in the script. I'm repeating it here for completeness:

# Load the relevant model into R's working memory:
source("Jags-Ymet-XmetMulti-Mrobust.R")
# Generate the MCMC chain:
mcmcCoda = genMCMC( data=myData , xName=xName , yName=yName ,
                    numSavedSteps=numSavedSteps , thinSteps=thinSteps ,
                    saveName=fileNameRoot )


The script also has commands for displaying MCMC diagnostics, a numerical summary, and basic plots of the posterior distribution.
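For completeness, those commands look roughly like the following. This is only a sketch that assumes the usual DBDA2E utility functions (diagMCMC, smryMCMC, plotMCMC) with the arguments used in the example script; check your copy of Jags-Ymet-XmetMulti-Mrobust-Example.R for the exact calls.

# MCMC diagnostics for each parameter:
parameterNames = varnames(mcmcCoda)
for ( parName in parameterNames ) {
  diagMCMC( codaObject=mcmcCoda , parName=parName ,
            saveName=fileNameRoot , saveType=graphFileType )
}
# Numerical summary of the posterior:
summaryInfo = smryMCMC( mcmcCoda , saveName=fileNameRoot )
show( summaryInfo )
# Basic plots of the posterior distribution:
plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName ,
          pairsPlot=TRUE , showCurve=FALSE ,
          saveName=fileNameRoot , saveType=graphFileType )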

Now the novel R code for computing the slopes and making the graph! At the end of the script, run the following code:

# Convert mcmc coda object to a matrix: 
mcmcMat=as.matrix(mcmcCoda,chains=TRUE)

# Make combs of X1 and X2 values for displaying slopes and their HDIs:
X1Comb = seq(min(myData[,xName[1]]),max(myData[,xName[1]]),length=20)
X2Comb = seq(min(myData[,xName[2]]),max(myData[,xName[2]]),length=20)

# Create place holders for X1 median slopes and HDIs:
X1SlopeMed = rep(0,length(X2Comb))
X1SlopeHDI = matrix(0,nrow=2,ncol=length(X2Comb))

# Compute the slope on X1 for each value of X2,
# as explained at top of p. 530 of DBDA2E:
for ( combIdx in 1:length(X2Comb) ) {
  credSlope = mcmcMat[,"beta[1]"] + mcmcMat[,"beta[3]"]*X2Comb[combIdx]
  X1SlopeHDI[,combIdx] = HDIofMCMC( credSlope )
  X1SlopeMed[combIdx] = median( credSlope )
}

# Same as above but for X2:
X2SlopeHDI = matrix(0,nrow=2,ncol=length(X1Comb))
X2SlopeMed = rep(0,length(X1Comb))
for ( combIdx in 1:length(X1Comb) ) {
  credSlope = mcmcMat[,"beta[2]"] + mcmcMat[,"beta[3]"]*X1Comb[combIdx]
  X2SlopeHDI[,combIdx] = HDIofMCMC( credSlope )
  X2SlopeMed[combIdx] = median( credSlope )
}

# Make the graph as in lower part of Figure 18.9, p. 529 of DBDA2E:
openGraph()
layout(matrix(1:2,nrow=2))
par( mar=c(3.5,3.5,2,1) , mgp=c(2,0.7,0) )
# Panel to display slope on X1:
plot( X2Comb , X1SlopeMed , ylim=range(X1SlopeHDI) ,
      pch=19 , cex=1.5 , col="skyblue" ,
      ylab=paste("Slope on",xName[1]) , xlab=paste("Value of",xName[2]) ,
      cex.lab=1.5 )
segments( x0=X2Comb , y0=X1SlopeHDI[1,] , x1=X2Comb , y1=X1SlopeHDI[2,] ,
          col="skyblue" , lwd=3 )
abline(h=0,lty="dotted")
# Panel to display slope on X2:
plot( X1Comb , X2SlopeMed , ylim=range(X2SlopeHDI) , pch=19 , cex=1.5 ,
      col="skyblue" ,
      ylab=paste("Slope on",xName[2]) , xlab=paste("Value of",xName[1]) ,
      cex.lab=1.5 )
segments( x0=X1Comb , y0=X2SlopeHDI[1,] , x1=X1Comb , y1=X2SlopeHDI[2,] ,
          col="skyblue" , lwd=3 )
abline(h=0,lty="dotted")
# Save it:
saveGraph( file=paste0(fileNameRoot,"Slopes") , type=graphFileType )

Hopefully the comments in the script (above) are sufficient for you to decipher each step.

Monday, July 11, 2016

MCMC effective sample size for difference of parameters (in Bayesian posterior distribution)

We'd like the MCMC representation of a posterior distribution to have large effective sample size (ESS) for the relevant parameters. (I recommend ESS > 10,000 for reasonably stable estimates of the limits of the 95% highest density interval.) In many applications, we are interested in combinations of parameters, not just individual parameters. For example, in multi-group comparisons, we are interested in differences of the mean parameters. We might also be interested in the effect size (Cohen's d), which is the difference of mean parameters divided by the (pooled) standard deviation parameter. Unfortunately, the ESS of the combination of parameters might not be as high as the ESS's of the contributing parameters (or it might be higher!). This change in ESS is seen routinely in many results throughout DBDA2E.
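Computing the ESS of a combination of parameters is straightforward once the chain is in hand: just apply coda's effectiveSize() to the derived quantity. Here is a minimal sketch, assuming an mcmc.list called mcmcCoda with mean parameters named "mu[1]" and "mu[2]" (hypothetical names for this illustration):

library(coda)
# Collapse all chains of the coda object into one matrix of MCMC steps:
mcmcMat = as.matrix( mcmcCoda , chains=TRUE )
# ESS of each parameter separately:
effectiveSize( mcmcMat[,"mu[1]"] )
effectiveSize( mcmcMat[,"mu[2]"] )
# ESS of the derived quantity of interest, here the difference of the means:
effectiveSize( mcmcMat[,"mu[1]"] - mcmcMat[,"mu[2]"] )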

This blog post shows a simple contrived example to demonstrate that the correlation of the contributing parameters is not necessarily linked to how much the ESS changes when the parameters are added to or subtracted from each other. Consider estimating the means of two groups, with mean parameters \(\mu_1\) and \(\mu_2\). We are interested in their difference, \(\mu_1 - \mu_2\). See the plots below for two cases in which the correlation of the mean parameters is the same but the ESS's of their combinations are quite different.

In the left panel, the ESS of the difference is much less than the ESS of the contributing parameters.
In the right panel, the ESS of the difference is the same as the ESS of the contributing parameters.


P.S. This example was motivated by an exercise for one of the workshops I give.

Appendix: The R code.

# Contrived example to illustrate that the ESS of a combination of parameters
# can be different than the ESS of the individual parameters.
library(coda)
# First example.
set.seed(47405)
nstep = 1000
x = rnorm(nstep)
effectiveSize(x)
# Make an autocorrelated sequence: each value is the sum of two successive values of x:
xmy = x
xmy[2:nstep] = xmy[2:nstep] + 1.0 * xmy[1:(nstep-1)]
effectiveSize(xmy)
# Make an independent white-noise sequence:
xpy = rnorm(length(x))
# Define the two "parameters" so that their difference is -2*xmy (autocorrelated, low ESS)
# and their sum is 2*xpy (white noise, high ESS):
muOne = xpy - xmy
muTwo = xpy + xmy
effectiveSize(muOne)
effectiveSize(muTwo)
effectiveSize(muOne-muTwo)
# Plot result:
layout(matrix(1:2,nrow=1))
par(pty="s")
plot( muOne , muTwo , type="o" , asp=1 ,
      xlab=bquote(mu[1]) , ylab=bquote(mu[2]) ,
      main=bquote(atop(
        list(
        ESS[mu[1]]==.(round(effectiveSize(muOne),0)) ,
        ESS[mu[2]]==.(round(effectiveSize(muTwo),0)) ) ,
        list(
        ESS[mu[1]-mu[2]]==.(round(effectiveSize(muOne-muTwo),0)) ,
        ESS[mu[1]+mu[2]]==.(round(effectiveSize(muOne+muTwo),0)) )
      )) ,
      cex.lab=1.5 , cex.main=1.0 )
text( min(muOne) , min(muTwo) , adj=c(0,0) ,
      labels=bquote(r[list(mu[1],mu[2])]==.(round(cor(muOne,muTwo),2))) )
# Second example:
set.seed(47405)
nstep=1000
# Both sequences are white noise (high ESS) but correlated with each other,
# matching the correlation of the first example:
muOne = rnorm(nstep)
muTwo = -0.35*muOne + rnorm(nstep)
effectiveSize(muOne)
effectiveSize(muTwo)
effectiveSize(muOne-muTwo)
# Plot result:
plot( muOne , muTwo , type="o" , asp=1 ,
      xlab=bquote(mu[1]) , ylab=bquote(mu[2]) ,
      main=bquote(atop(
        list(
          ESS[mu[1]]==.(round(effectiveSize(muOne),0)) ,
          ESS[mu[2]]==.(round(effectiveSize(muTwo),0)) ) ,
        list(
          ESS[mu[1]-mu[2]]==.(round(effectiveSize(muOne-muTwo),0)) ,
          ESS[mu[1]+mu[2]]==.(round(effectiveSize(muOne+muTwo),0)) )
      )) ,
      cex.lab=1.5 , cex.main=1.0 )
text( min(muOne) , min(muTwo) , adj=c(0,0) ,
      labels=bquote(r[list(mu[1],mu[2])]==.(round(cor(muOne,muTwo),2))) )



Sunday, July 10, 2016

Bayesian variable selection in multiple linear regression: Model with highest R^2 is not necessarily highest posterior probability

Chapter 18 of DBDA2E includes sections on Bayesian variable selection in multiple linear regression. The idea is that each predictor (a.k.a. "variable") has an inclusion coefficient \(\delta_j\) that can be 0 or 1, along with its regression coefficient \(\beta_j\). Each combination of included predictors is a different "model" of the predicted variable. The plots in DBDA2E show the posterior probabilities of combinations of included variables, but they do not show the posterior distribution of \(R^2\) for each combination. This blog post shows plots with \(R^2\) and, in turn, emphasizes that the model with the highest \(R^2\) is not necessarily the model with the highest posterior probability. That preference for a parsimonious model, not merely the best-fitting model, is a signature of the automatic penalty for complexity provided by Bayesian model comparison.
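In equation form, the idea is that each regression coefficient is multiplied by its inclusion indicator in the linear predictor: \[\hat{y} = \beta_0 + \delta_1 \beta_1 x_1 + \delta_2 \beta_2 x_2 + \cdots + \delta_J \beta_J x_J , \qquad \delta_j \in \{0,1\} .\] Each of the \(2^J\) patterns of \(\delta_j\) values therefore corresponds to a distinct subset of included predictors, that is, a distinct model.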

In this example, the mean Scholastic Aptitude Test (SAT) score for each state of the U.S. is the predicted variable, and the four candidate predictors are spending per student (by the state), percentage of students taking the exam, student/teacher ratio, and teacher salary. Full details are provided in Chapter 18 of DBDA2E.

Here (below) are plots of the posterior distributions for the three most probable models, and the model with all predictors included. There are four rows, corresponding to four different combinations of included variables. The posterior probability of each combination is shown at the top-left of each panel, as "Model Prob".

You can see in the second row that the model that includes only the second predictor, PrcntTake, has a posterior probability of 0.235 and an \(R^2\) of 0.784. The model in the fourth row includes all four predictors and has a higher \(R^2\) of 0.826 but a much smaller posterior probability of 0.021. (The posterior distributions of low probability models are choppy because the MCMC chain visits them rarely.)
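If you want to see where the "Model Prob" values come from, they can be recovered by tabulating the patterns of inclusion indicators across the MCMC steps. Here is a minimal sketch; it assumes the chain from the variable-selection run has been converted to a matrix and that the inclusion indicators are stored in columns named delta[1] through delta[4] (check the column names in your own output):

mcmcMat = as.matrix( mcmcCoda )             # assumes mcmcCoda from the variable-selection run
deltaCols = paste0( "delta[" , 1:4 , "]" )  # assumed column names of the inclusion indicators
# Label each MCMC step by its pattern of included predictors, e.g., "0100":
modelLabel = apply( mcmcMat[ , deltaCols ] , 1 , paste , collapse="" )
# Posterior probability of each model is its relative frequency in the chain:
sort( table( modelLabel ) / nrow( mcmcMat ) , decreasing=TRUE )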

Thanks to Dr. Renny Maduro for suggesting the inclusion of \(R^2\) in the plots. Dr. Maduro attended my recent workshop at ICPSR in Ann Arbor, Michigan.

Modified code for the graphs of this example is posted at the book's web site.