Doing Bayesian Data Analysis: 2016

Wednesday, December 21, 2016

Bayesian assessment of null values

A blog post by Christian Robert considered an ancient (2011!) article titled "Bayesian assessment of null values via parameter estimation and model comparison." Here I'll try to clarify the ideas from way back then through the lens of more recent diagrams from my workshops and a new article.

Terminology: "Bayesian assessment of null values" is supposed to be neutral wording to refer to any Bayesian method for assessing null values. Bayesian "hypothesis testing" is reserved for Bayes factors. Making decisions by posterior interval is not referred to as hypothesis testing and is not equivalent to Bayes factors.

Bayesian hypothesis testing: Suppose we are modeling some data with a model that has parameter δ in which we are currently interested, along with some other parameters. A null hypothesis model can be formulated as a prior on the parameters that puts a "spike" at the null value of δ but is spread out over the other parameters. A non-null alternative model puts a prior on δ that allows non-null values. The two models are indexed by a higher-level discrete parameter M. The entire hierarchy (a mixture model) has all its parameters updated by the data. The following slide from my workshops illustrates:

The Bayes factor (BF) is the shift in model-index probabilities:

Digression: I throw in two usual caveats about using Bayes factors. First, Bayesian model comparison --for null hypothesis testing or more generally for any (non-nested) models-- must use meaningful priors on the parameters in both models for the Bayes factor to be meaningful. Default priors for either model are typically not very meaningful and quite possibly misleading.

And, the Bayes factor is not the posterior probability of the models. Typically we ultimately want to know the posterior probabilities of the models, and the BF is just a step in that direction.

Link in slide above: http://doingbayesiandataanalysis.blogspot.com/2015/12/lessons-from-bayesian-disease-diagnosis_27.html

Assessing null value through parameter estimation: There's another way to assess null values. This other way focuses on the (marginal) posterior distribution of the parameter in which we're interested. (As mentioned at the outset, this approach is not called "hypothesis testing.") This approach is analogous to frequentist equivalence testing, which sets up a region of practical equivalence (ROPE) around the null value of the parameter:

The logic of this approach stems from a direct reading of the meaning of the intervals. We decide to reject the null value when the 95% highest density parameter values are all not practically equivalent to the null value. We decide to accept the null value when the 95% highest density parameter values are all practically equivalent to the null value. Furthermore, we can make direct probability statements about the probability mass inside the ROPE such as, "the probability that the parameter is practically equivalent to the null is 0.017" or "the probability that the parameter is practically equivalent to the null is 0.984."

The ROPE is part of the decision rule, not part of the null hypothesis. The ROPE does not constitute an interval null hypothesis; the null hypothesis here is a point value. The ROPE is part of the decision rule for two main purposes: First, it allows decisions to accept the null (again, analogous to frequentist equivalence testing). Second, it makes the decision rule asymptotically correct: As data sample size increases, the rule will come to the correct decision, either practically equivalent to the null value (within the ROPE) or not (outside the ROPE).

Juxtaposing the two approaches: Notice that the two approaches to assessing null values are not equivalent and have different emphases. The BF focuses on the model index, whereas the HDI and ROPE focus on the parameter estimate:

Therefore the two approaches will not always come to the same decision, though often they will. Neither approach is uniquely "correct;" the two approaches frame the question differently and provide different information.

Below is an example of the different information provided by hypothesis testing and estimation (for both frequentist and Bayesian analyses). The data are dichotomous, with z=14 successes out N=18 attempts (e.g., 14 heads out of 18 flips). The data are modeled by a Bernoulli distribution with parameter θ. The null value is taken to be θ=0.50. For the Bayesian analysis, the alternative-hypothesis prior is uniform merely for purposes of illustration; uniform is equivalent to dbeta(1,1).

From: Kruschke, J. K. and Liddell, T. (in press 2016), The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review.

You can see above that Bayesian hypothesis testing and Bayesian parameter estimation provide very different information. Which approach to use for assessing the null value then comes down to careful interpretation and practicalities. For more discussion please see this article: Kruschke, J. K. and Liddell, T. (2017), The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review.

Friday, December 16, 2016

The Bayesian New Statistics: Hypothesis Testing, Estimation, Meta-Analysis, and Power Analysis from a Bayesian Perspective

UPDATE: Now published, see this post.

Two conceptual distinctions in the practice of data analysis. Rows show point-value hypothesis testing versus estimating magnitude with uncertainty. Columns show frequentist versus Bayesian methods. Cells indicate the typical information provided by each approach. [Figure 1 of Kruschke & Liddell (in press), The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review.]

Many people have found the table above to be useful for understanding two conceptual distinctions in the practice of data analysis. The article that discusses the table, and many other issues, is now in press. (It was submitted in mid May, 2015, and was just accepted; a blog post announcing its original version is here, along with many comments.) The in-press version can be found at OSF and at SSRN.

Abstract: In the practice of data analysis, there is a conceptual distinction between hypothesis testing, on the one hand, and estimation with quantified uncertainty, on the other hand. Among frequentists in psychology a shift of emphasis from hypothesis testing to estimation has been dubbed "the New Statistics" (Cumming, 2014). A second conceptual distinction is between frequentist methods and Bayesian methods. Our main goal in this article is to explain how Bayesian methods achieve the goals of the New Statistics better than frequentist methods. The article reviews frequentist and Bayesian approaches to hypothesis testing and to estimation with confidence or credible intervals. The article also describes Bayesian approaches to meta-analysis, randomized controlled trials, and power analysis.

Wednesday, November 16, 2016

Is it legitimate to view the data and then decide on a distribution for the dependent variable?

An emailer asks,

In Bayesian parameter estimation, is it legitimate to view the data and then decide on a distribution for the dependent variable? I have heard that this is not “fully Bayesian”.

The shortest questions often probe some of the most difficult issues; this is one of those questions.

Let me try to fill in some details of what this questioner may have in mind. First, some examples:

Suppose we have some response-time data. Is it okay to look at the response-time data, notice they are very skewed, and therefore model them with, say, a Weibull distribution? Or must we stick with a normal distribution because that was the mindless default distribution we might have used before looking at the data? Or, having noticed the skew in the data and decided to use a skewed model, must we now obtain a completely new data set for the analysis?
As another example, is it okay to look at a scatter plot of population-by-time data, notice they have a non-linear trend, and therefore model them with, say, an exponential growth trend? Or must we stick with a linear trend because that was the mindless default trend we might have used before looking at the data? Or, having noticed the non-linear trend in the data and decided to use a non-linear model, must we now obtain a completely new data set for the analysis?
As a third example, suppose I’m looking at data about the positions of planets and asteroids against the background of stars. I’m trying to fit a Ptolemaic model with lots of epicycles. After considering the data for a long time, I realize that a completely different model, involving elliptical orbits with the sun at a focus, would describe the data nicely. Must I stick with the Ptolemaic model because it’s what I had in mind at first? Or, having noticed the Keplerian trend in the data, must I now obtain a completely new data set for the analysis?

What is the worry of the unnamed person who said it would not be “fully Bayesian” to look at the data and then decide on the model? I can think of a few possible worries:

One worry might be that selecting a model after considering the data is HARKing (hypothesizing after the results are known; Kerr 1998, http://psr.sagepub.com/content/2/3/196.short). Kerr even discusses Bayesian treatments of HARKing (pp. 206-207), but this is not a uniquely Bayesian problem. In that famous article, Kerr discusses why HARKing may be inadvisable. In particular, HARKing can transform Type I errors (false alarms) into confirmed hypothesis ergo fact. With respect to the three examples above, the skew in the RT data might be a random fluke, the non-linear trend in the population data might be a random fluke, and the better fit by the solar-centric ellipse might be a random fluke. Well, the only cure to Type I errors (false alarms) is replications (as Kerr mentions). Pre-registered replications. There are lots of disincentives to replication attempts, but these disincentives are gradually being mitigated by recent innovations like registered replication reports (https://www.psychologicalscience.org/publications/replication). In the three examples above, there are lots of known replications of skewed RT distributions, exponential growth curves, and elliptical orbits.

Another worry might be that the analyst had a tacit but strong theoretical commitment to a particular model before collecting the data, and then reneged on that commitment by sneaking a peak at the data. With respect to the three examples above, it may have been the case that the researcher had a strong theoretical commitment to normally distributed data, but, having noticed the skew, failed to mention that theoretical commitment and used a skewed distribution instead. And analogously for the other examples. But I think this mischaracterizes the usual situation of generic data analysis. The usual situation is that the analyst has no strong commitment to a particular model and is trying to get a reasonable summary description of the data. To be “fully Bayesian” in this situation, the analyst should set up a vast model space that includes all sorts of plausible descriptive models, including many different noise distributions and many different trends, because this would more accurately capture the prior uncertainty of the analyst than a prior with only a single default model. But doing that would be very difficult in practice, as the space of possible models is infinite. Instead, we start with some small model space, and refine the model as needed.

Bayesian analysis is always conditional on the assumed model space. Often the assumed model space is merely a convenient default. The default is convenient because it is familiar to both the analyst and the audience of the analysis, but the default need not be a theoretical commitment. There are also different goals for data analysis: Describing the one set of data in hand, and generalizing to the population from which the data were sampled. Various methods for penalizing overfitting of noise are aimed at finding a statistical compromise between describing the data in hand and generalizing to other data. Ultimately, I think it only gets resolved by (pre-registered) replication.

This is a big issue over which much ink has been spilled, and the above remarks are only a few off-the-cuff thoughts. What do you think is a good answer to the emailer's question?

Thursday, November 3, 2016

Bayesian meta-analysis of two proportions in random control trials

For an article that's accepted pending final revision (available here at OSF), I developed a Bayesian meta-analysis of two proportions in random control trials. This blog post summarizes and links to the complete R scripts.

We consider scenarios in which the data consist of the number of occurrences and the number of opportunities in a control group and in a treatment group. The number of occurrences in the treatment group is denoted $z_T$ and the number of opportunities in the treatment group is denoted $n_T$, and analogously $z_C$ and $n_C$ in control group. The proportion of occurrences in the treatment group is $z_T/n_T$, and the proportion of occurrences in the control group is $z_C/n_C$.

For example, perhaps we are interested in the occurrence of mortality after myocardial infarction (death after heart attack), in a control group and in a group treated with beta-blockers (heart muscle relaxant). In this case, if beta-blockers have a beneficial effect, $z_T/n_T$ should be less than $z_C/n_C$.

As another example, perhaps we are interested in re-use of towels by patrons of hotels (instead of having towels changed every day for the same patron, which wastes laundering electricity and detergent). We consider towel re-use in a control group and in a group treated with a sign that indicates it's normal for people to re-use their towels. In this case, if the treatment has a beneficial effect, $z_T/n_T$ should be greater than $z_C/n_C$.

In meta-analysis, there are several studies that each test the treatment. The data from study $s$ are denoted $z_{T[s]}, n_{T[s]}, z_{C[s]}, n_{C[s]}$. Each study is conducted at its own site (e.g., hotel, hospital). Because each site has its own specific attributes, we do not expect the underlying proportions of occurrence to be identical across sites. But we do expect them to be similar and mutually informative, so we treat the data from different sites as representative of higher-level parameters in the model that describe what's typical across sites and how much variability there is across sites. This approach is the usual random-effects model for meta-analysis.

Here are the parameters I'll use to describe the data:

$\theta_{C[s]}$ is the probability of occurrence in the control group for study $s$.
$\theta_{T[s]}$ is the probability of occurrence in the treatment group for study $s$.
$\rho_{[s]}$ is the difference of log-odds between groups:
$\rho_{[s]} = logit(\theta_{T[s]}) - logit(\theta_{C[s]})$

Re-arranged, the equation for $\rho_{[s]}$ expresses the relation of $\theta_{T[s]}$ to $\theta_{C[s]}$: \[ \theta_{T[s]} = logistic( \rho_{[s]} + logit( \theta_{C[s]} ) )\] This relation is a natural way to represent the dependency of the probabilities between groups because the relation is (i) symmetric with respect to what outcome is defined as success or failure because $logit(\theta) = -logit(1-\theta)$, and (ii) symmetric with respect to which group is defined as the treatment by reversing the sign of $\rho$. Note that $\rho_{[s]}$ is the so-called log odds ratio across groups: $\rho_{[s]} = log( [\theta_{T[s]}/(1-\theta_{T[s]})] / [\theta_{C[s]}/(1-\theta_{C[s]})] )$. I hope this little explanation and motivation of $\rho_{[s]}$ is helpful.

I'll describe the distribution of $\theta_{C[s]}$ across studies as a beta distribution, parameterized by its mode and concentration:

$\omega_{\theta C}$ is the modal value (of the beta description) of $\theta_{C[s]}$
$\kappa_{\theta C}$ is the concentration (of the beta description) of $\theta_{C[s]}$

For beta distributions parameterized by the usual $a,b$ shape parameters, we convert mode and concentration to $a,b$, and the above specification becomes $\theta_{C[s]} \sim beta( \omega_{\theta C}(\kappa_{\theta C}-2)+1 , (1-\omega_{\theta C})(\kappa_{\theta C}-2)+1 )$.

I'll describe the distribution of $\rho_{[s]}$ across studies as a normal distribution, parameterized by its mean and standard deviation:

$\mu_{\rho}$ is the mean (of the normal description) of $\rho_{[s]}$
$\sigma_{\rho}$ is the standard deviation (of the normal description) of $\rho_{[s]}$

In other words, for a normal distribution parameterized by mean and precision as in JAGS, $ \rho_{[s]} \sim normal( \mu_{\rho} , 1/\sigma_{\rho}^2 ) $. Usually, the primary focus of research is the value of $\mu_{\rho}$, that is, the estimate of the treatment effect across studies.

This type of hierarchical model is a typical random-effects model for meta-analysis, because the model gives each study its own individual parameter values which are assumed to be ("exchangably") representative of a common underlying tendency.

I'll set vague priors on the top-level parameters. An implementation of the model in R, JAGS, and runjags is provided below.

Notice that an alternative model that estimated $\theta_{C[s]}$ independently from $\theta_{T[s]}$, and then computed $\rho_{[s]}$ afterwards, would not produce the same results. Nor would it be an appropriate model! It would not be appropriate because it would treat the control group and treatment group in the same study as being completely unrelated --- if we independently permuted the study indices in the treatment and control groups (i.e., re-arranged which control groups go with which treatment groups) the results of this alternative model would be unchanged. Instead, the model I am using assumes that the treatment probability is linked to the control probability. For example, if one hospital has a low rate of heart attacks in the control condition but another hospital has a high rate of heart attacks in the control condition, the treatment should reduce heart attacks relative to the particular hospital's base rate, not to some absolute rate independent of hospital.

I'll apply the model to two sets of data. First, some data on death after heart attack, summarized on pp. 124-128 of Gelman et al., 2014, Bayesian Data Analysis, Third Edition. There were 22 studies, involving as few as 77 patients and as many as 3,887 patients. The treatment group received beta-blockers. If the treatment is effective, the log-odds-ratio will be less than 0. Below is a forest plot of the results:

In the plot above, each of the 22 lower rows shows an individual study's observed log-odds-ratio with a gray triangle. Notice that the gray triangle is greater than 0 in 6 of the 22 studies. The size of the triangle indicates the sample size in the study. The blue distribution is the posterior distribution for $\rho_{[s]}$. The distribution is marked with its modal value and its 95% highest density interval (HDI). The numerical values of the mode and HDI are indicated at the right margin. At the top of the plot is shown the posterior distribution of $\mu_{\rho}$. In words, this means that across studies the typical effect of treatment has a log-odds-ratio of about $-0.25$, with a range of uncertainty from $-0.39$ to $-0.12$, well below 0. (These values are very similar to those reported by Gelman et al., BDA3, Table 5.3, p. 127.)

Notice also in the forest plot above that there is strong shrinkage of the individual study estimates toward the modal value across studies. For example, Study 22 has a greatly reduced rate of heart attack in its treatment group (gray triangle is at a low value), but the posterior estimate of its treatment effect is not so extreme, and its posterior distribution shows a skew to accommodate the "pull" of the extreme data from the shrunken distribution. Complementary skew and shrinkage can be seen, for example, in Study 14. The posterior distributions of the individual studies also show different uncertainties depending on the sample size in the study. For example, Study 10, with a large sample size, has a much narrower HDI than Study 19 with its small sample size.

Here is another application. In this case the data come from studies of towel re-use (Scheibehenne, Jamil, and Wagenmakers (2016). Bayesian evidence synthesis can reconcile seemingly inconsistent results: The case of hotel towel reuse. Psychological Science, 27, 1043-1046). At 7 different hotels, patrons in the treatment group were told it is normal to re-use towels (see article for details). If the treatment is effective, the log-odds-ratio should be greater than 0. Here is a forest plot of the results from the analysis:

(In Study 1 above, the gray triangle is so small that it falls outside of the plot range. It had N=162.) In this case, because there were only 7 studies and wide variation in the results across studies, the overall estimate of the log-odds-ratio is fairly uncertain: Its 95% HDI goes from $-0.12$ to $+0.47$. While the mode of the overall estimate is positive (at $0.21$), the uncertainty is great enough that we would want to do more studies to nail down the magnitude of the effect of treatment. Notice also the posterior distributions of the individual studies: There is evident shrinkage, but also lots of uncertainty, again with smaller studies showing more uncertainty than larger studies.

(In Scheibehenne et al.'s published analysis, a fixed-effects model was used, which is tantamount to using a single $\rho$ and single $\theta_C$ for all studies. This can be approximated in the model used here by specifying a prior that forces there to be tiny variance across $\rho_{[s]}$ and tiny variance across $\theta_{C[s]}$.)

As mentioned at the beginning, this model was developed for an article that is accepted pending final revision, available here. I also talked about Bayesian meta-analysis, and these applications in particular, in a presentation about Bayesian methods for replication analysis, that you can watch here.

The complete R script is available here.

Tuesday, October 25, 2016

Should researchers be correcting for multiple tests, even when they themselves did not run the tests, but all of the tests were run on the same data?

A graduate student, named Caitlin Ducate, in my frequentist statistics class asks:

In Criminal Justice, it's common to use large data sets like the Uniform Crime Report (UCR) or versions of the National Longitudinal Survey (NLS) because the nature of certain questions doesn't lend itself well to experimentation or independent data gathering. As such, many researchers have conducted many analyses using the UCR and NLS. My question is whether or not p-values would need to be fixed across the entire data set? In other words, should researchers be correcting for multiple tests even when they themselves did not run the tests because all of the tests were run on the same data?

This question gets at the core conundrum of correcting for multiple tests. Certainly if two researchers were collaborating on an analysis that they considered to be one study, and each researcher had a different (perhaps overlapping) batch of tests to run, then the overall combined set of tests should have their p values corrected for all the tests. On the other hand, if the two researchers declared that they were not collaborating and they considered their batches of tests to be "separate" then the p values of the two batches of tests would not be corrected properly for the overall combined set of tests. Such separation of batches of analyses invites an inflated false alarm (Type I error) rate for tests of the data set. Thus, the appropriate practice should be that every new analysis of the data should correct for all the previous tests of the data by all previous researchers, and all previously published analyses should have updated p values to take into account subsequent tests. Right?

The puzzler above is based on the premise that corrections for multiple tests should control the error rate for any one set of data. Which begs the question of how to define "one set of data." A few years ago I was reviewing a manuscript that was submitted to a major scientific journal. The researchers had conducted an experiment with several conditions; the theoretical motivation and procedure made it obvious that the conditions were part of one conceptualization. Moreover, participants volunteered and were assigned at random across all the various conditions; in other words the conditions were obviously intended to be part of the same study. But the manuscript reported one subset of conditions as being in "Experiment 1" and the complementary subset of conditions as being in "Experiment 2." Why would the authors do such a strange and confusing thing when reporting the research? Because that way the corrections for multiple comparisons would only have to take into account the set of tests for the data of "Experiment 1" separately from the set of tests for the data of "Experiment 2." If the data from all the conditions were considered to be one set of data, then the correction for multiple comparisons would have to take into account all the tests, and various tests would no longer have p<.05. Ugh.

There's an analogous old puzzler based on the premise that corrections for multiple tests should control the error rate for any one researcher (not just for one set of data). Especially if studies conducted by the researcher are follow-ups of previous studies, are the data from the follow-ups really separate sets of data? Aren't they all really just one extended set of data from that researcher? Therefore, each researcher is allowed a lifetime false-alarm rate of, say, 5%, and the critical p value for any single test by that researcher should take into account that fact that she will be conducting hundreds, probably thousands, of tests during her research lifetime. Moreover, if you are collaborating with other researchers, be sure that they only rarely run significance tests because they won't inflate your collaborative error rate as much as frequent-testers.

The general issue, of deciding what constitutes the appropriate "family" of tests to be corrected for, is a sticky problem. To define an error rate, there must be a presumed family of tests for which the error rate is being defined. There are various arguments for defining the family this way or that in any particular application. For example, when running experiments with multi-factor designs, typically each main effect and interaction is considered to be a separate family and corrections for multiple tests only need to be made within each family, not across. The usual argument is something like this: in an experimental design, for which the independent variables are manipulated and randomly assigned, each factor could have been left out. But that argument breaks down if the factors can be redefined to be multiple levels of one factor, etc.

Those are just some rambling thoughts. What do you think is the answer to Caitlin's question, "[For shared data sets], should researchers be correcting for multiple tests even when they themselves did not run the tests because all of the tests were run on the same data?"

Saturday, October 22, 2016

Posterior predictive distribution for multiple linear regression

Suppose you've done a (robust) Bayesian multiple linear regression, and now you want the posterior distribution on the predicted value of $y$ for some probe value of $ \langle x_1,x_2,x_3, ... \rangle $. That is, not the posterior distribution on the mean of the predicted value, but the posterior distribution on the predicted value itself. I showed how to do this for simple linear regression in a previous post; in this post I show how to do it for multiple linear regression. (A lot of commenters and emailers have asked me to do this.)

The basic idea is simple: At each step in the MCMC chain, use the parameter values to randomly generate a simulated datum $y$ at the probed value of $x$. Then examine the resulting distribution of simulated $y$ values; that is the posterior distribution of the predicted $y$ values.

To implement the idea, the first programming choice is whether to simulate the $y$ value with JAGS (or Stan or whatever) while it is generating the MCMC chain, or to simulate the $y$ value after the MCMC chain has previously been generated. There are pros and cons of each option. Generating the value by JAGS has the benefit of keeping the code that generates the $y$ value close to the code that expresses the model, so there is less chance of mistakenly simulating data by a different model than is being fit to the data. On the other hand, this method requires us to pre-specify all the $x$ values we want to probe. If you want to choose the probed $x$ values after JAGS has already generated the MCMC chain, then you'll need to re-express the model outside of JAGS, in R, and run the risk of mistakenly expressing it differently (e.g., using precision instead of standard deviation, or thinking that y=rt(...) in R will use the same syntax as y~dt(...) in JAGS). I will show an implementation in which JAGS simulated the $y$ values while generating the MCMC chain.

(A SUBSEQUENT POST SHOWS HOW TO DO THIS IN R AFTER JAGS.)

To illustrate, I'll use the example of scholastic aptitude test (SAT) scores from Chapter 18 of DBDA2E, illustrated below:

Running robust multiple linear regression yields a posterior distribution as shown below:

where $\nu$ is the normality (a.k.a. df) parameter of the $t$ distribution.

Now for the issue of interest here: What is the predicted SAT score for a hypothetical state that spends, say, 9 thousand dollars per student and has 10% of the students take the exam? The answer, using the method described above, is shown below:

(Please note that the numerical annotation in these figures only shows the first three significant digits, so you'll need to examine the actual MCMC chain for more digits!) As another example, what is the predicted SAT score for a hypothetical state that spends, say, 9 thousand dollars per student and has 80% of the students take the exam? Answer:

A couple more examples of predictions:

Now for the R code I used to generate those results. I modified the scripts supplied with DBDA2E, named Jags-Ymet-XmetMulti-Mrobust.R and Jags-Ymet-XmetMulti-Mrobust-Example.R. Some of the key changes are highlighted below.

The to-be-probed $x$ values are put in a matrix called xProbe., which has columns that correspond to the $x$ predictors, and one row for each probe. The number of probed points (i.e., the number of rows of xProbe), is denoted Nprobe. The number of predictors (i.e., the number of columns of xProbe), is denoted Nx. Then, in the new low-level script, called Jags-Ymet-XmetMulti-MrobustPredict.R, the Jags model specification looks like this:

# Standardize the data:
data {
    ym <- mean(y)
    ysd <- sd(y)
    for ( i in 1:Ntotal ) {
      zy[i] <- ( y[i] - ym ) / ysd
    }
    for ( j in 1:Nx ) {
      xm[j] <- mean(x[,j])
      xsd[j] <-   sd(x[,j])
      for ( i in 1:Ntotal ) {
        zx[i,j] <- ( x[i,j] - xm[j] ) / xsd[j]
      }
      # standardize the probe values:
      for ( i in 1:Nprobe ) {
        zxProbe[i,j] <- ( xProbe[i,j] - xm[j] ) / xsd[j]
      }

    }
}
# Specify the model for standardized data:
model {
    for ( i in 1:Ntotal ) {
      zy[i] ~ dt( zbeta0 + sum( zbeta[1:Nx] * zx[i,1:Nx] ) , 1/zsigma^2 , nu )
    }
    # Priors vague on standardized scale:
    zbeta0 ~ dnorm( 0 , 1/2^2 )
    for ( j in 1:Nx ) {
      zbeta[j] ~ dnorm( 0 , 1/2^2 )
    }
    zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
    nu ~ dexp(1/30.0)
    # Transform to original scale:
    beta[1:Nx] <- ( zbeta[1:Nx] / xsd[1:Nx] )*ysd
    beta0 <- zbeta0*ysd + ym - sum( zbeta[1:Nx] * xm[1:Nx] / xsd[1:Nx] )*ysd
    sigma <- zsigma*ysd
    # Predicted y values at xProbe:
    for ( i in 1:Nprobe ) {
      zyP[i] ~ dt( zbeta0 + sum( zbeta[1:Nx] * zxProbe[i,1:Nx] ) ,
                   1/zsigma^2 , nu )
      yP[i] <- zyP[i] * ysd + ym
    }

}

The changes noted above are analogous to those I used for simple linear regression in the previous post. The MCMC chain monitors the values of yP[i], and subsequently we can examine the posterior distribution.

I hope this helps. Here is complete R code for the high-level and low-level scripts:

High-level script:

#===== Begin high-level script ================
# Example for Jags-Ymet-XmetMulti-MrobustPredict.R
#-------------------------------------------------------------------------------
# Optional generic preliminaries:
graphics.off() # This closes all of R's graphics windows.
rm(list=ls()) # Careful! This clears all of R's memory!
#.............................................................................
# # Two predictors:
myData = read.csv( file="Guber1999data.csv" )
yName = "SATT"
xName = c("Spend","PrcntTake")
xProbe = matrix( c( 4 , 10 , # Spend, PrcntTake
                    9 , 10 ,
                    4 , 80 ,
                    9 , 80 ) , nrow=4 , byrow=TRUE )

fileNameRoot = "Guber1999data-Predict-"
numSavedSteps=15000 ; thinSteps=5
graphFileType = "png"
#-------------------------------------------------------------------------------
# Load the relevant model into R's working memory:
source("Jags-Ymet-XmetMulti-MrobustPredict.R")
#-------------------------------------------------------------------------------
# Generate the MCMC chain:
mcmcCoda = genMCMC( data=myData , xName=xName , yName=yName , xProbe=xProbe ,
                    numSavedSteps=numSavedSteps , thinSteps=thinSteps ,
                    saveName=fileNameRoot )
#-------------------------------------------------------------------------------
# Display diagnostics of chain, for specified parameters:
parameterNames = varnames(mcmcCoda) # get all parameter names
for ( parName in parameterNames ) {
diagMCMC( codaObject=mcmcCoda , parName=parName ,
            saveName=fileNameRoot , saveType=graphFileType )
}
#-------------------------------------------------------------------------------
# Get summary statistics of chain:
summaryInfo = smryMCMC( mcmcCoda , saveName=fileNameRoot )
show(summaryInfo)
# Display posterior information:
plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName ,
          pairsPlot=TRUE , showCurve=FALSE ,
          saveName=fileNameRoot , saveType=graphFileType )
#-------------------------------------------------------------------------------
# Plot posterior predicted y at xProbe:
mcmcMat = as.matrix(mcmcCoda)
xPcols = grep( "xProbe" , colnames(mcmcMat) , value=FALSE )
yPcols = grep( "yP" , colnames(mcmcMat) , value=FALSE )
xLim = quantile( mcmcMat[,yPcols] , probs=c(0.005,0.995) )
for ( i in 1:length(yPcols) ) {
openGraph(width=4,height=3)
xNameText = paste( "@" , paste( xName , collapse=", " ) , "=" )
xProbeValText = paste(mcmcMat[1,xPcols[seq(i,
                                          by=length(yPcols),
                                          length=length(xName))]],
                     collapse=", ")
plotPost( mcmcMat[,yPcols[i]] , xlab="Post. Pred. y" , xlim=xLim ,
            cenTend="mean" ,
            main= bquote(atop(.(xNameText),.(xProbeValText))) )
}

#-------------------------------------------------------------------------------
#===== End high-level script ================

Low-level script, named Jags-Ymet-XmetMulti-MrobustPredict.R
and called by high-level script:
#===== Begin low-level script ================
# Jags-Ymet-XmetMulti-MrobustPredict.R
# Accompanies the book:
# Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:
# A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.

source("DBDA2E-utilities.R")

#===============================================================================

genMCMC = function( data , xName="x" , yName="y" , xProbe=NULL ,
                    numSavedSteps=10000 , thinSteps=1 , saveName=NULL ,
                    runjagsMethod=runjagsMethodDefault ,
                    nChains=nChainsDefault ) {
require(runjags)
#-----------------------------------------------------------------------------
# THE DATA.
y = data[,yName]
x = as.matrix(data[,xName],ncol=length(xName))
# Do some checking that data make sense:
if ( any( !is.finite(y) ) ) { stop("All y values must be finite.") }
if ( any( !is.finite(x) ) ) { stop("All x values must be finite.") }
cat("\nCORRELATION MATRIX OF PREDICTORS:\n ")
show( round(cor(x),3) )
cat("\n")
flush.console()
Nx = ncol(x) # number of x predictors
Ntotal = nrow(x) # number of data points
# Check the probe values:
if ( !is.null(xProbe) ) {
    if ( any( !is.finite(xProbe) ) ) {
      stop("All xProbe values must be finite.") }
    if ( ncol(xProbe) != Nx ) {
      stop("xProbe must have same number of columns as x.") }
} else { # fill in placeholder so JAGS doesn't balk
    xProbe = matrix( 0 , ncol=Nx , nrow=3 )
    for ( xIdx in 1:Nx ) {
      xProbe[,xIdx] = quantile(x[,xIdx],probs=c(0.0,0.5,1.0))
    }
}
# Specify the data in a list, for later shipment to JAGS:
dataList = list(
    x = x ,
    y = y ,
    Nx = Nx ,
    Ntotal = Ntotal ,
    xProbe = xProbe ,
    Nprobe = nrow(xProbe)

)
#-----------------------------------------------------------------------------
# THE MODEL.
modelString = "
# Standardize the data:
data {
    ym <- mean(y)
    ysd <- sd(y)
    for ( i in 1:Ntotal ) {
      zy[i] <- ( y[i] - ym ) / ysd
    }
    for ( j in 1:Nx ) {
      xm[j] <- mean(x[,j])
      xsd[j] <-   sd(x[,j])
      for ( i in 1:Ntotal ) {
        zx[i,j] <- ( x[i,j] - xm[j] ) / xsd[j]
      }
      # standardize the probe values:
      for ( i in 1:Nprobe ) {
        zxProbe[i,j] <- ( xProbe[i,j] - xm[j] ) / xsd[j]
      }

    }
}
# Specify the model for standardized data:
model {
    for ( i in 1:Ntotal ) {
      zy[i] ~ dt( zbeta0 + sum( zbeta[1:Nx] * zx[i,1:Nx] ) , 1/zsigma^2 , nu )
    }
    # Priors vague on standardized scale:
    zbeta0 ~ dnorm( 0 , 1/2^2 )
    for ( j in 1:Nx ) {
      zbeta[j] ~ dnorm( 0 , 1/2^2 )
    }
    zsigma ~ dunif( 1.0E-5 , 1.0E+1 )
    nu ~ dexp(1/30.0)
    # Transform to original scale:
    beta[1:Nx] <- ( zbeta[1:Nx] / xsd[1:Nx] )*ysd
    beta0 <- zbeta0*ysd + ym - sum( zbeta[1:Nx] * xm[1:Nx] / xsd[1:Nx] )*ysd
    sigma <- zsigma*ysd
    # Predicted y values at xProbe:
    for ( i in 1:Nprobe ) {
      zyP[i] ~ dt( zbeta0 + sum( zbeta[1:Nx] * zxProbe[i,1:Nx] ) ,
                   1/zsigma^2 , nu )
      yP[i] <- zyP[i] * ysd + ym
    }

}
" # close quote for modelString
# Write out modelString to a text file
writeLines( modelString , con="TEMPmodel.txt" )
#-----------------------------------------------------------------------------
# INTIALIZE THE CHAINS.
# Let JAGS do it...
#-----------------------------------------------------------------------------
# RUN THE CHAINS
parameters = c( "beta0" , "beta" , "sigma",
                  "zbeta0" , "zbeta" , "zsigma", "nu" , "xProbe" , "yP" )
adaptSteps = 500 # Number of steps to "tune" the samplers
burnInSteps = 1000
runJagsOut <- run.jags( method="parallel" ,
                          model="TEMPmodel.txt" ,
                          monitor=parameters ,
                          data=dataList ,
                          #inits=initsList ,
                          n.chains=nChains ,
                          adapt=adaptSteps ,
                          burnin=burnInSteps ,
                          sample=ceiling(numSavedSteps/nChains) ,
                          thin=thinSteps ,
                          summarise=FALSE ,
                          plots=FALSE )
codaSamples = as.mcmc.list( runJagsOut )
# resulting codaSamples object has these indices:
#   codaSamples[[ chainIdx ]][ stepIdx , paramIdx ]
if ( !is.null(saveName) ) {
    save( codaSamples , file=paste(saveName,"Mcmc.Rdata",sep="") )
}
return( codaSamples )
} # end function

#===============================================================================

smryMCMC = function( codaSamples ,
                      saveName=NULL ) {
summaryInfo = NULL
mcmcMat = as.matrix(codaSamples,chains=TRUE)
paramName = colnames(mcmcMat)
for ( pName in paramName ) {
    summaryInfo = rbind( summaryInfo , summarizePost( mcmcMat[,pName] ) )
}
rownames(summaryInfo) = paramName
summaryInfo = rbind( summaryInfo ,
                       "log10(nu)" = summarizePost( log10(mcmcMat[,"nu"]) ) )
if ( !is.null(saveName) ) {
    write.csv( summaryInfo , file=paste(saveName,"SummaryInfo.csv",sep="") )
}
return( summaryInfo )
}

#===============================================================================

plotMCMC = function( codaSamples , data , xName="x" , yName="y" ,
                     showCurve=FALSE , pairsPlot=FALSE ,
                     saveName=NULL , saveType="jpg" ) {
# showCurve is TRUE or FALSE and indicates whether the posterior should
#   be displayed as a histogram (by default) or by an approximate curve.
# pairsPlot is TRUE or FALSE and indicates whether scatterplots of pairs
#   of parameters should be displayed.
#-----------------------------------------------------------------------------
y = data[,yName]
x = as.matrix(data[,xName])
mcmcMat = as.matrix(codaSamples,chains=TRUE)
chainLength = NROW( mcmcMat )
zbeta0 = mcmcMat[,"zbeta0"]
zbeta = mcmcMat[,grep("^zbeta$|^zbeta\\[",colnames(mcmcMat))]
if ( ncol(x)==1 ) { zbeta = matrix( zbeta , ncol=1 ) }
zsigma = mcmcMat[,"zsigma"]
beta0 = mcmcMat[,"beta0"]
beta = mcmcMat[,grep("^beta$|^beta\\[",colnames(mcmcMat))]
if ( ncol(x)==1 ) { beta = matrix( beta , ncol=1 ) }
sigma = mcmcMat[,"sigma"]
nu = mcmcMat[,"nu"]
log10nu = log10(nu)
#-----------------------------------------------------------------------------
# Compute R^2 for credible parameters:
YcorX = cor( y , x ) # correlation of y with each x predictor
Rsq = zbeta %*% matrix( YcorX , ncol=1 )
#-----------------------------------------------------------------------------
if ( pairsPlot ) {
    # Plot the parameters pairwise, to see correlations:
    openGraph()
    nPtToPlot = 1000
    plotIdx = floor(seq(1,chainLength,by=chainLength/nPtToPlot))
    panel.cor = function(x, y, digits=2, prefix="", cex.cor, ...) {
      usr = par("usr"); on.exit(par(usr))
      par(usr = c(0, 1, 0, 1))
      r = (cor(x, y))
      txt = format(c(r, 0.123456789), digits=digits)[1]
      txt = paste(prefix, txt, sep="")
      if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
      text(0.5, 0.5, txt, cex=1.25 ) # was cex=cex.cor*r
    }
    pairs( cbind( beta0 , beta , sigma , log10nu )[plotIdx,] ,
           labels=c( "beta[0]" ,
                     paste0("beta[",1:ncol(beta),"]\n",xName) ,
                     expression(sigma) , expression(log10(nu)) ) ,
           lower.panel=panel.cor , col="skyblue" )
    if ( !is.null(saveName) ) {
      saveGraph( file=paste(saveName,"PostPairs",sep=""), type=saveType)
    }
}
#-----------------------------------------------------------------------------
# Marginal histograms:

decideOpenGraph = function( panelCount , saveName , finished=FALSE ,
                              nRow=2 , nCol=3 ) {
    # If finishing a set:
    if ( finished==TRUE ) {
      if ( !is.null(saveName) ) {
        saveGraph( file=paste0(saveName,ceiling((panelCount-1)/(nRow*nCol))),
                   type=saveType)
      }
      panelCount = 1 # re-set panelCount
      return(panelCount)
    } else {
    # If this is first panel of a graph:
    if ( ( panelCount %% (nRow*nCol) ) == 1 ) {
      # If previous graph was open, save previous one:
      if ( panelCount>1 & !is.null(saveName) ) {
        saveGraph( file=paste0(saveName,(panelCount%/%(nRow*nCol))),
                   type=saveType)
      }
      # Open new graph
      openGraph(width=nCol*7.0/3,height=nRow*2.0)
      layout( matrix( 1:(nRow*nCol) , nrow=nRow, byrow=TRUE ) )
      par( mar=c(4,4,2.5,0.5) , mgp=c(2.5,0.7,0) )
    }
    # Increment and return panel count:
    panelCount = panelCount+1
    return(panelCount)
    }
}

# Original scale:
panelCount = 1
panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMarg") )
histInfo = plotPost( beta0 , cex.lab = 1.75 , showCurve=showCurve ,
                       xlab=bquote(beta[0]) , main="Intercept" )
for ( bIdx in 1:ncol(beta) ) {
    panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMarg") )
    histInfo = plotPost( beta[,bIdx] , cex.lab = 1.75 , showCurve=showCurve ,
                         xlab=bquote(beta[.(bIdx)]) , main=xName[bIdx] )
}
panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMarg") )
histInfo = plotPost( sigma , cex.lab = 1.75 , showCurve=showCurve ,
                       xlab=bquote(sigma) , main=paste("Scale") )
panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMarg") )
histInfo = plotPost( log10nu , cex.lab = 1.75 , showCurve=showCurve ,
                       xlab=bquote(log10(nu)) , main=paste("Normality") )
panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMarg") )
histInfo = plotPost( Rsq , cex.lab = 1.75 , showCurve=showCurve ,
                       xlab=bquote(R^2) , main=paste("Prop Var Accntd") )
panelCount = decideOpenGraph( panelCount , finished=TRUE , saveName=paste0(saveName,"PostMarg") )

# Standardized scale:
panelCount = 1
panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMargZ") )
histInfo = plotPost( zbeta0 , cex.lab = 1.75 , showCurve=showCurve ,
                       xlab=bquote(z*beta[0]) , main="Intercept" )
for ( bIdx in 1:ncol(beta) ) {
    panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMargZ") )
    histInfo = plotPost( zbeta[,bIdx] , cex.lab = 1.75 , showCurve=showCurve ,
                         xlab=bquote(z*beta[.(bIdx)]) , main=xName[bIdx] )
}
panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMargZ") )
histInfo = plotPost( zsigma , cex.lab = 1.75 , showCurve=showCurve ,
                       xlab=bquote(z*sigma) , main=paste("Scale") )
panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMargZ") )
histInfo = plotPost( log10nu , cex.lab = 1.75 , showCurve=showCurve ,
                       xlab=bquote(log10(nu)) , main=paste("Normality") )
panelCount = decideOpenGraph( panelCount , saveName=paste0(saveName,"PostMargZ") )
histInfo = plotPost( Rsq , cex.lab = 1.75 , showCurve=showCurve ,
                       xlab=bquote(R^2) , main=paste("Prop Var Accntd") )
panelCount = decideOpenGraph( panelCount , finished=TRUE , saveName=paste0(saveName,"PostMargZ") )

#-----------------------------------------------------------------------------
}
#===============================================================================
#===== End low-level script ================

Doing Bayesian Data Analysis