Wednesday, January 8, 2014

Bayesian variable selection in multiple regression: Sensitivity to prior

[Notice updates appended January 10, 2014.]

In multiple regression, analysts are often concerned with variable selection: Of the many predictor variables, which ones should be selected for inclusion in the model? Bayesian model comparison seems to be a great way to formulate the question: For p predictors, each of which can be included or not included, what are the posterior probabilities of the 2^p candidate models? (With p = 2 predictors, for example, there are 2^2 = 4 models: neither predictor, either one alone, or both.) In practice, however, the posterior probability of including a variable can be very sensitive to the vagueness (i.e., the variance or precision) of the prior distribution on that variable's regression coefficient: When the priors on the regression coefficients are very vague, the posterior probability of including a variable tends to be smaller. Although I have seen this routinely acknowledged in the theoretical literature, it seems to be underemphasized as a practical issue, and I am wondering how practitioners deal with it in applied work. In this post I present a simple example of the issue, and I invite practitioners to reply with their experiences and pointers to literature.

Consider the example presented in DBDA (Figure 17.1, p. 455), in which average SAT scores in each state of the United States are predicted from average spending per pupil in each state and percentage of students taking the exam. Here is a depiction of the data:
[Figure 17.1, p. 455, of Doing Bayesian Data Analysis.]
The four panels show different perspectives on the same data. In this case, there are only two predictor variables. In most applications of variable selection there are many predictors, not merely two, but this case provides a simple demonstration of the issue. The issue is, which of the two predictors should be included in a good model of SAT scores? (There are many possible criteria for inclusion. One pre-analysis consideration is the purpose of the model: Is the purpose explanation or prediction? But here I'll simply leap-frog those considerations and pretend that we want to do Bayesian variable selection.)

The model is linear regression on the standardized data, with an inclusion indicator denoted delta for each predictor. The inclusion indicator can be 0 or 1. Here is an excerpt from the JAGS model specification:
  # Data block, not shown here, creates standardized data
  # denoted zy[i] and zx[i,1:p]. Below, Nx denotes the
  # number of predictors p.
  model {
    for ( i in 1:Ntotal ) {
      zy[i] ~ dt( zbeta0                  # t dist. for robustness
                  + sum( delta[1:Nx]      # delta is inclusion indicator
                         * zbeta[1:Nx]    # regression coefficient
                         * zx[i,1:Nx] ) , # standardized predictor
                  1/zsigma^2 , nu )
    }
    # Priors vague on standardized scale:
    zbeta0 ~ dnorm( 0 , 1/2^2 )           # prior SD=2  <== THE ISSUE!
    for ( j in 1:Nx ) {
      zbeta[j] ~ dnorm( 0 , 1/2^2 )       # prior SD=2  <== THE ISSUE!
      delta[j] ~ dbern( 0.5 )             # prior on inclusion indicator
    }
    zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
    nu <- nuMinusOne + 1                  # nu >= 1 for the t distribution
    nuMinusOne ~ dexp(1/29.0)
  }

I have marked with "<== THE ISSUE!" the lines that specify the prior on the regression coefficients. It is a generic normal centered at zero with a standard deviation (SD) that is intended to be reasonably large relative to standardized data. In the example above, the SD is set to 2 because regression coefficients for standardized data are restricted to the range +/-1 in least-squares regression, hence won't be much bigger than that, and an SD of 2 is reasonably flat over the +/-1 range.
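Concretely, here is a minimal sketch of how the model can be run from R with rjags and how the posterior model probabilities reported below can be tabulated from the sampled delta indicators. (This is only a sketch, not the script from the book; it assumes the standardized data zy and zx already exist in R and that the model specification above is saved in model.txt.)

  library(rjags)

  # Assumed standardized data: zy is a vector, zx is a matrix of predictors.
  dataList = list( zy = zy , zx = zx , Ntotal = length(zy) , Nx = ncol(zx) )

  jagsModel = jags.model( file = "model.txt" , data = dataList ,
                          n.chains = 3 , n.adapt = 1000 )
  update( jagsModel , n.iter = 1000 )                    # burn-in
  codaSamples = coda.samples( jagsModel ,
                              variable.names = c("delta","zbeta") ,
                              n.iter = 10000 )
  mcmcMat = as.matrix( codaSamples )

  # Each row's pattern of delta values identifies one of the 2^Nx models;
  # the relative frequency of each pattern estimates that model's
  # posterior probability.
  deltaCols   = grep( "^delta" , colnames(mcmcMat) , value = TRUE )
  modelOfStep = apply( mcmcMat[ , deltaCols , drop = FALSE ] , 1 ,
                       paste , collapse = "," )
  sort( table( modelOfStep ) / nrow( mcmcMat ) , decreasing = TRUE )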

Here are the results for two runs with different choices of the prior on the regression coefficients (and the intercept). When I set SD=10, the results look like this:

[Figure: posterior probabilities of the models when the prior SD is 10.]
The posterior probability of the single-predictor model that includes only percentage-taking-the-exam is about 64%. The only other model with non-zero probability includes both predictors, as shown in the second row above.

Here are the results when I set SD=1:

[Figure: posterior probabilities of the models when the prior SD is 1.]
Now the posterior probability of the single-predictor model is only about 17%, far lower than the posterior probability of the two-predictor model.

(Note that I did not change the prior on the inclusion indicators, delta. Because each delta[j] ~ dbern(0.5) independently, the prior probability of each of the four models remains fixed at 0.5 × 0.5 = 25%.)
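A simple way to probe the sensitivity is to rerun the whole analysis over a grid of prior SDs. Here is a sketch along the lines of the code above (again assuming the dataList defined earlier): the prior SD is pasted into the model string, and the marginal inclusion probability of each predictor is reported.

  library(rjags)

  inclusionProb = function( priorSD , dataList ) {
    modelString = sprintf( "
    model {
      for ( i in 1:Ntotal ) {
        zy[i] ~ dt( zbeta0 + sum( delta[1:Nx]*zbeta[1:Nx]*zx[i,1:Nx] ) ,
                    1/zsigma^2 , nu )
      }
      zbeta0 ~ dnorm( 0 , 1/%f^2 )
      for ( j in 1:Nx ) {
        zbeta[j] ~ dnorm( 0 , 1/%f^2 )
        delta[j] ~ dbern( 0.5 )
      }
      zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
      nu <- nuMinusOne + 1
      nuMinusOne ~ dexp(1/29.0)
    }" , priorSD , priorSD )
    m = jags.model( textConnection(modelString) , data = dataList ,
                    n.chains = 3 )
    update( m , n.iter = 1000 )
    deltaMat = as.matrix( coda.samples( m , "delta" , n.iter = 10000 ) )
    colMeans( deltaMat )  # marginal inclusion probability of each predictor
  }

  # Columns: results for prior SD = 1, 2, and 10.
  sapply( c(1, 2, 10) , inclusionProb , dataList = dataList )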

Notice that, within the models, the explicit estimates of the regression coefficients are only slightly affected by the change in their priors. In both cases, the marginal posterior distribution of beta[1] is pretty clearly above zero (even taking into account a tacit ROPE and MCMC slop), which suggests that beta[1] is useful to include in the model, regardless of what its inclusion parameter might indicate.
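One detail worth making explicit: in this parameterization, when delta[j]=0 the coefficient zbeta[j] is sampled from its prior alone (the likelihood does not involve it), so the within-model estimate is the posterior of zbeta[j] conditional on delta[j]=1. Continuing the sketch above:

  # Steps of the chain in which predictor 1 is included:
  included = ( mcmcMat[ , "delta[1]" ] == 1 )
  # Within-model posterior of the standardized coefficient:
  quantile( mcmcMat[ included , "zbeta[1]" ] , probs = c(0.025, 0.500, 0.975) )
  # Marginal posterior inclusion probability of predictor 1:
  mean( mcmcMat[ , "delta[1]" ] )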

How do practitioners "in the wild" deal with this issue? Is it all about making an argument for a particular prior? In the wild, do different reasonable priors yield noticeably different inclusion probabilities, and when they do, what is concluded in practice? Comment below and let us know! (I realize there are all sorts of other considerations in multiple regression -- explanation vs prediction, correlated predictors, etc. And I am not assuming that a single model must be selected; the problem persists for Bayesian model averaging. I am trying here to focus on the issue that posterior inclusion probability is sensitive to the prior on the regression coefficients.)

APPENDED 9:00am January 10, 2014:

A little (more!) web searching and reading shows that some people combine variable selection with hierarchical shrinkage of regression coefficients. I was trying to be conceptually "pure" and use only variable selection (i.e., inclusion indicators) without hierarchical shrinkage (i.e., without estimating the variance of the prior on the regression coefficients), because the literature sometimes pits the two approaches (discrete variable selection versus continuous coefficient shrinkage) against each other as competitors. But of course there is nothing prohibiting their combination, and a good Bayesian could argue that it is appropriate: Whenever there is uncertainty, it should be expressed in the prior. Because we have uncertainty in the setting of the prior variance (i.e., we have uncertainty in the value of SD in zbeta[j] ~ dnorm( 0 , 1/SD^2 )), we should put a prior on it. To express lack of knowledge, we could put a broad uniform on SD, or a generic gamma on 1/SD^2, as follows.

  model {
    for ( i in 1:Ntotal ) {
      zy[i] ~ dt( zbeta0 + sum( delta[1:Nx] * zbeta[1:Nx] * zx[i,1:Nx] ) ,
                  1/zsigma^2 , nu )
    }
    # Priors vague on standardized scale:
    zbeta0 ~ dnorm( 0 , 1/2^2 )
    for ( j in 1:Nx ) {
      # zbeta[j] ~ dnorm( 0 , 1/2^2 )          # SD of prior on beta is fixed.
      zbeta[j] ~ dnorm( 0 , 1/sigmaOfBeta^2 )  # SD of prior on beta is estim.
      delta[j] ~ dbern( 0.5 )
    }
    zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
    sigmaOfBeta ~ dunif( 1.0E-5 , 1.0E+2 )  # broad uniform prior on SD of beta
    nu <- nuMinusOne + 1
    nuMinusOne ~ dexp(1/29.0)
  }


yields this:

[Figure: posterior model probabilities under the broad uniform prior on sigmaOfBeta.]
And

  model {
    for ( i in 1:Ntotal ) {
      zy[i] ~ dt( zbeta0 + sum( delta[1:Nx] * zbeta[1:Nx] * zx[i,1:Nx] ) ,
                  1/zsigma^2 , nu )
    }
    # Priors vague on standardized scale:
    zbeta0 ~ dnorm( 0 , 1/2^2 )
    for ( j in 1:Nx ) {
      # zbeta[j] ~ dnorm( 0 , 1/2^2 )          # SD of prior on beta is fixed.
      zbeta[j] ~ dnorm( 0 , 1/sigmaOfBeta^2 )  # SD of prior on beta is estim.
      delta[j] ~ dbern( 0.5 )
    }
    zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
    sigmaOfBeta <- 1/sqrt(tauOfBeta)
    tauOfBeta ~ dgamma(0.001,0.001)  # generic gamma prior on the precision
    nu <- nuMinusOne + 1
    nuMinusOne ~ dexp(1/29.0)
  }


yields this:

[Figure: posterior model probabilities under the generic gamma prior on tauOfBeta.]
So there is still a notable difference in posterior inclusion probabilities even when using hierarchical shrinkage priors, and it still comes down to an argument about the appropriate prior. I realize that the difference in outcomes would be reduced in applications that have more predictors, because more regression coefficients would inform the estimate of the prior's variance. That is, the prior has its strongest influence when the data are few, as in this example with only two predictors.

I am led to this tentative conclusion: Bayesian variable selection (i.e., using inclusion indicators) is best done with informed priors on the regression coefficients. The information can come from previous data and be explicitly expressed in constants in the prior on the regression coefficients. Or, the information can come from hierarchical sharing of information from other predictors in the current data set, via estimating the variance of the prior on the regression coefficients.
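For concreteness, the first option might look like this in the JAGS model, where priorMean[j] and priorSD[j] are hypothetical constants (the names are illustrative, not from the book) computed from previous data and passed in with the data list:

    for ( j in 1:Nx ) {
      # priorMean[j] and priorSD[j] are constants derived from previous
      # data and supplied in the data list:
      zbeta[j] ~ dnorm( priorMean[j] , 1/priorSD[j]^2 )
      delta[j] ~ dbern( 0.5 )
    }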

Appended more, 11:00am January 10, 2014:

The conclusion expressed above is just a special case of a general principle I have advocated for any Bayesian model comparison: To put the models on a level playing field, their priors must be equivalently informed, even if only mildly informed. Generic diffuse priors can inadvertently yield very different marginal likelihoods than the models would have attained with reasonably informed priors, even very mildly informed priors. So, what was a "tentative conclusion" above is rapidly turning into a prescriptive admonition!
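A toy computation illustrates the principle. Suppose a single datum y from a normal(mu,1) likelihood with prior mu ~ normal(0, priorSD^2); integrating out mu gives a marginal likelihood of normal(y | 0, 1 + priorSD^2), which shrinks as the prior gets more diffuse, no matter how well the model can fit:

  # Marginal likelihood of y = 1.0 under increasingly diffuse priors on mu:
  y = 1.0
  for ( priorSD in c(1, 10, 100) ) {
    cat( "priorSD =" , priorSD ,
         " marginal likelihood =" , dnorm( y , 0 , sqrt(1 + priorSD^2) ) ,
         "\n" )
  }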

7 comments:

  1. Try posting it on http://stats.stackexchange.com. I'm sure you'll get a lot more attention.

  2. Thanks, I will keep stackexchange in mind for future questions!

  3. Hi John,

    As you point out in your edits, the other approach would be to consider continuous shrinkage; for example using something like Lasso regression (regression penalised with the L1 norm of the weight vector). Park and Casella (2008) show that in a Bayesian regression this is equivalent to having independent double-exponential priors over the coefficients (which I understand assumes standardised inputs). See also Neal's Automatic Relevance Determination.

  4. Thanks Tom.

    A nice tutorial on Bayesian lasso implemented in BUGS is provided by Lykou, A. and Ntzoufras, I. (2011), WinBUGS: a tutorial. WIREs Comp Stat, 3: 385–396. doi: 10.1002/wics.176

    In the blog post I was trying to focus on methods for discrete inclusion/exclusion, but found (in the appended remarks) that it led to using continuous shrinkage anyway. I have found that inclusion probabilities closely echo the magnitudes of the regression coefficients (which makes intuitive sense, of course).

    Automatic relevance determination is something I thought about years ago when I was working on attentional learning in models like ALCOVE :-)
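    (For the record, here is a sketch of what the double-exponential prior Tom mentions looks like in a JAGS model, using the built-in ddexp distribution; the gamma prior on the shrinkage rate lambda is an illustrative choice, not code from the tutorial.)

      for ( j in 1:Nx ) {
        zbeta[j] ~ ddexp( 0 , lambda )  # double-exponential (Laplace) prior
        delta[j] ~ dbern( 0.5 )
      }
      lambda ~ dgamma( 1 , 1 )          # illustrative prior on shrinkage rate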

  5. Cool, thanks for the reference to the WIREs article John; I hadn't seen it.

    Tom

  6. John, I really enjoyed reading your post as, similar to you, I feel that when it comes to real data there is a lot of fine tuning that is usually inadequately addressed in papers/books. I have a problem where the posterior probability of delta=1 is always above 0.5 (even when the respective variable should clearly not be selected). I was wondering whether you have made progress with your issues and perhaps can shed some light on mine.

    Thanks,
    rubi


  7. Thanks for the comment.

    I’m not sure how you know that the “respective variable should clearly not be selected.” If you already know that, does that mean that you have prior knowledge that is not being expressed in the prior distribution of the model?

    My impression of variable selection is that it’s a bit of an art, in the sense that the prior matters! But that’s true of Bayesian model selection in general. Real prior knowledge should be implemented if possible, and, in the case of variable selection, shrinkage from hierarchical structure should be implemented if appropriate.
