Sunday, February 8, 2015

I've got variable Y that I want to predict from variables X1, X2, etc. What should I do?

For questions like yours -- I've got variable Y that I want to predict from variables X1, X2, etc.; what should I do? -- the best answer is usually informed by background knowledge of the domain. Generic models, like multiple linear regression, don't always give the most meaningful answer.

For example, suppose you're trying to predict the amount of fencing (Y) you'll need for rectangular lots of length X1 and width X2. Then a linear regression would serve you well. Why? Because we know (from background knowledge) that perimeter is a linear function of length and width.

But suppose you're trying to predict how much grass seed you'll need for the same lot. Then you'd want a model that includes the product of X1 and X2, because that product is the area of the lot.
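The fencing and grass-seed examples can be checked numerically. The sketch below is only an illustration: the lot dimensions are made up, the data are noise-free, and the helper names (solve, least_squares, max_abs_residual) are hypothetical. It fits an additive model (intercept, X1, X2) to both the perimeter and the area, then refits the area with an added X1*X2 column.

```python
# Sketch with made-up, noise-free lots: an additive linear model versus
# one that also includes the X1*X2 product term.

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def least_squares(rows, y):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

def max_abs_residual(rows, y, beta):
    """Largest absolute gap between observed y and the fitted values."""
    return max(abs(yi - sum(b * v for b, v in zip(beta, r)))
               for r, yi in zip(rows, y))

# Hypothetical lots as (length X1, width X2), in arbitrary units.
lots = [(1, 2), (2, 1), (3, 4), (4, 2), (2, 5), (5, 3)]
fencing = [2 * x1 + 2 * x2 for x1, x2 in lots]  # perimeter
seed_area = [x1 * x2 for x1, x2 in lots]        # area

additive = [[1.0, x1, x2] for x1, x2 in lots]
with_product = [[1.0, x1, x2, x1 * x2] for x1, x2 in lots]

beta_fence = least_squares(additive, fencing)
beta_area_additive = least_squares(additive, seed_area)
beta_area_product = least_squares(with_product, seed_area)

print("fencing, additive model:   ",
      max_abs_residual(additive, fencing, beta_fence))
print("seed area, additive model: ",
      max_abs_residual(additive, seed_area, beta_area_additive))
print("seed area, with X1*X2 term:",
      max_abs_residual(with_product, seed_area, beta_area_product))
```

On these data the additive model reproduces the perimeter essentially exactly, but it leaves clearly nonzero residuals for the area. Adding the product column removes them. With real, noisy measurements the contrast would be less stark, but the structural point is the same.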

As another example, suppose you're trying to predict the installed length of a piece of pipe (Y) as a function of the date (X). You know that pipe expands and contracts as some function of temperature. And you also know that temperature cycles sinusoidally (across the seasons of a year) as a function of date. So, to predict pipe length as a function of date, you'd use some trend that applies the expansion function to a sinusoidal function of date.
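The pipe example can be sketched as a composition of two functions: a sinusoid that maps date to temperature, and an expansion function that maps temperature to length. Everything specific below is an assumption made for illustration. The expansion is taken to be linear in temperature, the constants are invented (the coefficient is roughly that of steel), and the function names are hypothetical. In a real analysis the mean, amplitude, phase, and expansion coefficient would be parameters to estimate from data.

```python
import math

# All constants are illustrative assumptions, not measured values.
MEAN_TEMP_C = 12.0          # annual mean temperature
TEMP_AMPLITUDE_C = 14.0     # half the summer-to-winter swing
COLDEST_DAY = 15            # day of year at the seasonal minimum
DAYS_PER_YEAR = 365.25
REFERENCE_TEMP_C = 20.0     # temperature at which the pipe was measured
REFERENCE_LENGTH_M = 100.0  # pipe length at the reference temperature
EXPANSION_PER_C = 1.2e-5    # assumed linear expansion coefficient

def temperature_on_day(day_of_year):
    """Sinusoidal seasonal temperature, lowest on COLDEST_DAY."""
    phase = 2 * math.pi * (day_of_year - COLDEST_DAY) / DAYS_PER_YEAR
    return MEAN_TEMP_C - TEMP_AMPLITUDE_C * math.cos(phase)

def pipe_length_on_day(day_of_year):
    """Assumed linear expansion applied on top of the seasonal temperature."""
    temp = temperature_on_day(day_of_year)
    return REFERENCE_LENGTH_M * (1 + EXPANSION_PER_C * (temp - REFERENCE_TEMP_C))

for day in (15, 106, 198, 289):
    print(day, round(temperature_on_day(day), 2), pipe_length_on_day(day))
```

The predicted length is not a straight-line function of date, so a plain linear regression of length on date would miss the seasonal cycle that the domain knowledge supplies.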

Whatever model you end up wanting, it can probably be implemented in JAGS (or BUGS or Stan). That's one of the beauties of the Bayesian approach with its general purpose MCMC software.


  1. Hello Professor Kruschke,

    I was wondering why we don't have general-purpose packages like BUGS and JAGS that are written in pure R, rather than having to interface with them through ugly packages.

    There are standalone packages that do specific Bayesian tasks, like Bayesian GLM, but what about the kind of power that BUGS and JAGS give us? Can't we have that in R itself?

  2. The interfaces to BUGS and JAGS and Stan are really not that ugly. They merely involve some function names, and there would be corresponding function names even if the functions were written directly in R instead of in another language.

    One good thing about having the MCMC samplers written in a separate language is that they can be accessed from various languages other than R; for example, Matlab.

    Another reason is that R is slow because it is interpreted instead of compiled. The MCMC samplers can be programmed in other languages (such as C++) and pre-compiled so they run fast. (I'm no expert on this aspect of computing, so I might have some details wrong, but I think the gist is right.)

  3. Hello Professor Kruschke,

    Thank you very much for your response.

    So would that mean that R packages such as MCMCpack, found in the Bayesian CRAN task view, are slower than using rjags?


  4. I do not know, offhand, how MCMCpack is programmed. It might be a pre-compiled program that runs fast. It might also use special-purpose fast algorithms for particular models, as opposed to the general-purpose model parser in JAGS.