Sunday, November 30, 2014

An example of hierarchical conditional-logistic Bayesian estimation, applied to punishment choice in a public goods game.

Conditional-logistic regression and softmax regression (a.k.a. multinomial logit regression) are covered in Chapter 22 of DBDA2E, but no examples of hierarchical versions are given in that chapter. An example of hierarchical conditional-logistic Bayesian estimation, applied to punishment choice in a public goods game, is provided in a new article. The appendix of the article (linked below) provides complete details of the model.

Abstract: Punishment is an important method for discouraging uncooperative behavior. We use a novel design for a public goods game in which players have explicit intended contributions with accidentally changed actual contributions, and in which players can apply costly fines or ostracism. Moreover, all players except the subject are automated, whereby we control the intended contributions, actual contributions, costly fines, and ostracisms experienced by the subject. We assess subject’s utilization of other players’ intended and actual contributions when making decisions to fine or ostracize. Hierarchical Bayesian logistic regression provides robust estimates. We find that subjects emphasize actual contribution more than intended contribution when deciding to fine, but emphasize intended contribution more than actual contribution when deciding to ostracize. We also find that the efficacy of past punishment, in terms of changing the contributions of the punished player, influences the type of punishment selected. Finally, we find that the punishment norms of the automated players affect the punishments performed by the subject. These novel paradigms and analyses indicate that punishment is flexible and adaptive, contrary to some evolutionary theories that predict inflexible punishments that emphasize outcomes. Keywords: punishment, public goods game, ostracism, trembling-hand, intention, outcome bias.

Liddell, T. M., and Kruschke, J. K. (2014). Ostracism and fines in a public goods game with accidental contributions: The importance of punishment type. Judgment and Decision Making, 9(6), 523-547. The article is here (

Friday, November 28, 2014

How can I learn Bayesian modeling?

An email inquires:

Dear Sir, Greetings. I am a PhD ... student based in [country]. I work in the area of [...] and much of the new models in this area are based on Bayesian approach. I too feel this approach needs to [be] adopted ... . I Sir would too like to learn this approach. Unfortunately here in [country] we in Social Science are [not] trained too well in methodology. This mail is to request you to guide me as to how I can learning Bayesian Modeling relevant to [...] by self-study. There are currently no courses in [country] (to the best of my knowledge) that offer Bayesian Statistics/Bayesian Modeling. Please suggest resources (books, articles, videos) and methodology. I sincerely thank you for going through my mail and taking out you valuable time to help me. Thanks and Regards,

My reply:

Dear [...]:

Thank you for your interest.

Well, there's a vast literature out there, but since you're asking me, I'll recommend the following sequence from my own work.

For an introduction to Bayesian data analysis:
1. Read Chapter 2 of the 2nd Edition of my book. You can preview the book at various sites listed here.
2. Read the article, "Bayesian estimation supersedes the t test," linked here.
3. Read the article, "The time has come: Bayesian methods for data analysis in the organizational sciences," linked here.

For an introduction to Bayesian models of perception and cognition:
1. Read the blog post, "Bayesian models of mind, psychometric models, and data analytic models," linked here.
2. Read the article, "Bayesian learning theory applied to human cognition," linked here.
3. Read the article, "Bayesian approaches to associative learning: From passive to active learning," linked here.

Good luck with your work, and thanks again for your interest.


Tuesday, November 25, 2014

Warning message in runjags turned off in new version of DBDA2E utilities

A reader pointed out that programs that accompany DBDA2E and use runjags produce a disconcerting warning message about "starting parallel chains without setting different PRNG for each chain. Different values have been added to each set of initial values." This is merely runjags saying that it is doing the right thing, and not to worry. But it seems to cause worry. Therefore, in a new version of the DBDA2E utilities program I have set the options in runjags not to produce the warning message. To get the new version, go to the software web page and scroll down to the bottom where you will find the programs for the book. Click the download arrow on the right-hand side. Be sure to unzip/extract the file.

Monday, November 24, 2014

eBook and Kindle of DBDA2E now available

The eBook (PDF) and Kindle editions of Doing Bayesian Data Analysis, 2nd Edition, are now available. The book web page has relevant links.

Friday, November 21, 2014

Ordinal probit regression: Transforming polr() parameter values to make them more intuitive

In R, the polr function in the MASS package does ordinal probit regression (and ordinal logistic regression, but I focus here on probit). The polr function yields parameter estimates that are difficult to interpret intuitively because they assume a baseline intercept of 0 and a noise standard deviation of 1, which produces slopes and thresholds that are relative to an underlying scale with little direct relation to the ordinal values in the data. In can be more intuitive to fix the low and high thresholds at values that make sense on the ordinal data scale, and estimate the other parameters relative to those low and high thresholds. This intuitive parameterization is used in Chapter 23 of DBDA2E. This blog post presents a function that transforms the polr parameter values to the intuitive parameterization.

Background: Ordinal probit regression.

See Chapter 23 of DBDA2E for a complete introduction to and explanation of ordinal probit regression. The figure below provides a quick reminder of the model.
The data to be predicted are ordinal, shown on the left side of the figure. The predictor in this case is metric, shown at the lower-right of the figure. The model assumes there exists a underlying metric scale for the ordinal outcome, and this underlying scale is chopped into subintervals that have thresholds denoted theta (see the diagram). The model assumes there is some noisy mapping from the predictor to the underlying metric response scale, as assumed by ordinary linear regression, and the probability of each discrete ordinal response is the area under the normal curve between the thresholds for that response category.

From the data we estimate the intercept, the slope, the standard deviation, and the thresholds. But the underlying scale has arbitrary location and magnification, so two of the parameters can be arbitrarily fixed. In polr, the intercept is fixed at 0 and the standard deviation is fixed at 1. But that yields thresholds and slopes on a scale with no intuitive relation to the response values, which go from 1 to K, where K is the maximum ordinal response level.

A more intuitive parameterization fixes the low threshold at 1.5 (half way between 1 and 2), and the high threshold at K-.5 (half way between K-1 and K). Then the thresholds are directly interpretable with respect to those anchors, and the standard deviation can also be interpreted relative to the distance between those anchors.

Example: Happiness and financial assets.

Consider subjective ratings of happiness on a 1 to 5 ordinal scale, where 1 is bummed out and 5 is ecstatic (I forget the exact wording used to label the response items). For each respondent we also get their total assets. Some real data are shown in the top panel of the figure below:
See the figure caption for details. The lower panels of the figure show aspects of the Bayesian posterior distribution. Notice that the lowest threshold is fixed at 1.5, and the highest threshold is fixed 4.5. This means that the underlying metric output scale is aligned with the ordinal responses, and the other parameters are relative to the low and high anchors. In particular, the intercept is at about 3.16, which suggests a happiness rating of about 3 even with zero assets. The slope is about 4.14e-6 happiness units per yuan. The standard deviation of the noise is about 0.854 happiness units, suggesting large variability on a 5-point scale.

The output of polr, however, looks like this (where Happiness is Yord and Assets is X):

> polrInfo = polr( Yord ~ X , method="probit" )
> print( polrInfo )
Coefficients: X 4.025474e-06 Intercepts: 1|2 2|3 3|4 4|5 -2.048053 -1.180089 -0.127664 1.544575

What do those parameter estimates mean? The coefficient on X is the slope, and the intercepts are the thresholds. While we can use those estimates to derive predictions of outcome probabilities, they aren't very directly interpretable, at least not for me!

The polr estimates can be transformed into the parameterization I explained above, and then we get the following:

> polrInfoOrd = polrToOrdScale( polrInfo )
> print( polrInfoOrd )
[1] 0.8350432

[1] 3.210213


     1|2      2|3      3|4      4|5 
1.500000 2.224788 3.103608 4.500000 

The thresholds are the zeta values; notice that the lowest threshold is 1.5 and the highest threshold is 4.5. The sigma, intercept (i.e., b0), and slope (i.e., coefficient) are very close to the modal posterior values of the Bayesian estimation.

The function:

Here is the specification of the function in R:

polrToOrdScale = function( polrObject ) {
  polrThresh = polrObject$zeta
  polrSlopes = polrObject$coefficients
  polrInter = 0.0
  polrSigma = 1.0
  K = length(polrThresh) + 1  # K is number of ordinal levels
  sigmaMult = unname( (K-2)/(polrThresh[K-1]-polrThresh[1]) )
  inter = unname( 1.5 - ( sigmaMult*polrThresh[1] ) )
  respThresh = sigmaMult*polrThresh + inter
  respSigma = sigmaMult*polrSigma
  respB0 = sigmaMult*polrInter + inter
  respSlopes = sigmaMult*polrSlopes
  return( list( sigma=respSigma ,
                b0=respB0 ,
                coefficients=respSlopes ,
                zeta=respThresh ) )

I have only tested the function on a few examples; there are no guarantees! Feel free to reply with refinements.

Monday, November 10, 2014

2nd Edition has shipped (Doing Bayesian Data Analysis)

I am told by some readers that they have received a physical copy of the 2nd Edition of Doing Bayesian Data Analysis, but I have yet to see it myself. I hope the paper is heavy and opaque, but the book lifts the spirits and the concepts are transparent! Info about the 2nd Edition is here.

Wednesday, October 8, 2014

2nd Edition of Doing Bayesian Data Analysis available November 2014

The 2nd Edition of Doing Bayesian Data Analysis will be available in early or mid November, 2014. The book is completely re-written from cover to cover, with loads of new material, and all new programs. For complete information, please see the book's new web site.

Here's a partial list of what's new in the Second Edition:
  • There are all new programs in JAGS and Stan. The new programs are designed to be much easier to use than the scripts in the first edition. In particular, there are now compact high-level scripts that make it easy to run the programs on your own data sets. This new programming was a major undertaking by itself.
  • The introductory Chapter 2, regarding the basic ideas of how Bayesian inference re-allocates credibility across possibilities, is completely rewritten and greatly expanded.
  • There are completely new chapters on the programming languages R (Ch. 3), JAGS (Ch. 8), and Stan (Ch. 14). The lengthy new chapter on R includes explanations of data files and structures such as lists and data frames, along with several utility functions. (It also has a new poem that I am particularly pleased with.) The new chapter on JAGS includes explanation of the RunJAGS package which executes JAGS on parallel computer cores. The new chapter on Stan provides a novel explanation of the concepts of Hamiltonian Monte Carlo. The chapter on Stan also explains conceptual differences in program flow between it and JAGS.
  • Chapter 5 on Bayes’ rule is greatly revised, with a new emphasis on how Bayes’ rule re-allocates credibility across parameter values from prior to posterior. The material on model comparison has been removed from all the early chapters and integrated into a compact presentation in Chapter 10.
  • What were two separate chapters on the Metropolis algorithm and Gibbs sampling have been consolidated into a single chapter on MCMC methods (as Chapter 7).
  • There is extensive new material on MCMC convergence diagnostics in Chapters 7 and 8. There are explanations of autocorrelation and effective sample size. There is also exploration of the stability of the estimates of the HDI limits. New computer programs display the diagnostics, as well.
  • Chapter 9 on hierarchical models includes extensive new and unique material on the crucial concept of shrinkage, along with new examples.
  • All the material on model comparison, which was spread across various chapters in the first edition, in now consolidated into a single focused chapter (Ch. 10) that emphasizes its conceptualization as a case of hierarchical modeling.
  • Chapter 11 on null hypothesis significance testing is extensively revised. It has new material for introducing the concept of sampling distribution. It has new illustrations of sampling distributions for various stopping rules, and for multiple tests.
  • Chapter 12, regarding Bayesian approaches to null value assessment, has new material about the region of practical equivalence (ROPE), new examples of accepting the null value by Bayes factors, and new explanation of the Bayes factor in terms of the Savage-Dickey method.
  • Chapter 13, regarding statistical power and sample size, has an extensive new section on sequential testing, and making the research goal be precision of estimation instead of rejecting or accepting a particular value.
  • Chapter 15, which introduces the generalized linear model, is fully revised, with more complete tables showing combinations of predicted and predictor variable types.
  • Chapter 16, regarding estimation of means, now includes extensive discussion of comparing two groups, along with explicit estimates of effect size.
  • Chapter 17, regarding regression on a single metric predictor, now includes extensive examples of robust regression in JAGS and Stan. New examples of hierarchical regression, including quadratic trend, graphically illustrate shrinkage in estimates of individual slopes and curvatures. The use of weighted data is also illustrated.
  • Chapter 18, on multiple linear regression, includes a new section on Bayesian variable selection, in which various candidate predictors are probabilistically included in the regression model.
  • Chapter 19, on one-factor ANOVA-like analysis, has all new examples, including a completely worked out example analogous to analysis of covariance (ANCOVA), and a new example involving heterogeneous variances.
  • Chapter 20, on multi-factor ANOVA-like analysis, has all new examples, including a completely worked out example of a split-plot design that involves a combination of a within-subjects factor and a between-subjects factor.
  • Chapter 21, on logistic regression, is expanded to include examples of robust logistic regression, and examples with nominal predictors.
  • There is a completely new chapter (Ch. 22) on multinomial logistic regression. This chapter fills in a case of the generalized linear model (namely, a nominal predicted variable) that was missing from the first edition.
  • Chapter 23, regarding ordinal data, is greatly expanded. New examples illustrate single-group and two-group analyses, and demonstrate how interpretations differ from treating ordinal data as if they were metric.
  • There is a new section (25.4) that explains how to model censored data in JAGS.
  • Many exercises are new or revised.
And the cover is different! The correspondence of the doggies to Bayes’ rule is now made explicit: The half-up ears of the posterior doggie are a compromise between the perky ears and floppy ears of the likelihood and prior doggies. The marginal likelihood is not usually computed in MCMC methods, so the doggie in the denominator gets sleepy with nothing much to do. Let's hope that what’s between the covers of the book is as friendly and engaging as the doggies on the cover.

Again, for complete information about the second edition, please see the book's new web site.

Monday, September 29, 2014

Doing Bayesian Data Analysis in Python

Do you prefer Python? Some readers have undertaken to translate the computer programs from Doing Bayesian Data Analysis into Python, including Osvaldo Martin, who has this GitHub site for his ongoing project. If you are interested in what he has done, or if you are interested in contributing, please contact him. If you have your own Python-translation project going, please add a comment to this post. Thank you all for your contributions to make Bayesian methods available in all flavors.

Thursday, September 18, 2014

p value from likelihood ratio test is STILL not the same as p value from maximum likelihood estimate

In yesterday's post, I described two ways for finding a p value for a parameter, and pointed out that the two ways lead to different p values. As an example, I considered the slope parameter in simple linear regression. One way to get a p value for the slope parameter is with a standard likelihood ratio test, for linear model with free slope versus intercept-only model. The other way to get a p value is to generate a sampling distribution of the MLE slope parameter itself, from a null hypothesis that consists of the MLE intercept-only model. Through remarks from commenters, especially Arash Khodadadi and Noah Silbert, I've been prodded to clarify the differences between these approaches. This post points out
  • The likelihood ratio test compares model fits relative to the variance, while the sampling distribution of the MLE slope parameter is on the absolute scale of the data.
  • An example in which the slope parameter has p < .05 two-tailed in the sampling distribution of the MLE slope parameter but has p > .05 (one-tailed) in the sampling distribution of the likelihood ratio.
First, the data (a little different from the previous posts):
The header of the plot, above, indicates the MLE of the linear model, and -2log(LR), a.k.a. G2, rounded to three significant digits.

The null hypothesis is an intercept-only model (beta1=0) with beta0 and sigma set to their MLE values when beta1 is fixed at zero. (Which means, in this case, that beta0 is the mean of y and sigma is the sd of y using N not N-1 in the denominator.) I generated sample data from the null hypothesis using the x values in the actual data. For each sample I computed the MLE of the full model and -2log(LR). The resulting marginal sampling distributions are shown here:
In the left panel, above, the one-tailed p value of MLE beta1 is displayed; multiply it by 2 to get the two-tailed p value. Notice it is different than the p value of the likelihood ratio.

Below is the joint sampling distribution, where each point is a sample from the null hypothesis. There is a new twist to this figure: Each point is color coded for the magnitude of MLE sigma, where blue is the largest MLE sigma in the distribution and red is the smallest MLE sigma in the distribution.
The joint sampling distribution also shows the thresholds for p=.05 (one-tailed or two-tailed), and the actual data statistics are plotted as a "+".

You can see from the joint sampling distribution that MLE beta1 can be large-ish even when -2log(LR) is small-ish when the sample MLE sigma is large-ish (blue points). But the opposite can happen when the sample MLE sigma is small-ish (red points). Thus, a key difference between the measures of the slope parameter is how they deal with the variance. The likelihood ratio compares the free-slope against intercept-only models relative to the variance, while the MLE beta1 considers the slope on the absolute scale of the data, not relative to the variance.

As discussed in yesterday's post, I don't think either test is inherently better than the other. They just ask the question about the slope parameter in different ways. As mentioned yesterday, posing the question in terms of absolute MLE beta1 has direct intuitive interpretation. It's also much easier to use when defining confidence intervals as the range of parameter values not rejected by p<alpha (which is, for me, the most coherent way to define confidence intervals). But that's a topic for another day!

Wednesday, September 17, 2014

p value from likelihood ratio test is not the same as p value from maximum likelihood estimate

In a post of a few hours ago, I pointed out that I was having trouble getting p values to agree for two different methods. Thanks to a suggestion from a reader, there is a resolution: The p values should not agree. This was actually my hunch and hope all along, because it adds another reason never to talk about "the" p value for a set of data, because any data set has many different p values.

First, a recap:
I'm doing Monte Carlo simulation of sampling distributions to compute p values. For example, consider the slope parameter, β1, in simple linear regression. I want to find out if p < .05 for a null hypothesis that β1=0. I'm working with two different ways to compute a p value, and am finding that the results do not agree with each other. Here are the two ways:
  1. Consider the maximum likelihood estimate (MLE) of β1 from the actual data, denoted β1MLEactual, and see where it falls in a sampling distribution of β1MLEnull for simulated data from the null hypothesis.
  2. Consider the likelihood ratio statistic, G2 = -2log(LR) where LR is the ratio of the maximum likelihood of the restricted model with β1=0 over the maximum likelihood of the full model with β1 free. We see where G2actual falls in the sampling distribution of G2null for simulated data from the null hypothesis.
See the previous post for details about how the data from the null hypothesis were generated. But here is a repeat of the old picture of the two sampling distributions:
Notice that the p values are not the same in the two distributions. (See the end of the previous post for reasons that the difference cannot be explained away as some simple artifact.)

Now the news: Arash Khodadadi, an advanced graduate student in Psychological & Brain Sciences at Indiana University, read the post and pointed out to me that not all samples that have G2null > G2actual are the same samples that have β1MLEnull > β1MLEactual. Essentially he was saying that I really should be looking at the joint sampling distribution. So I made a plot, and here it is:
Each point corresponds to a sample from the null hypothesis. The marginals of the joint distribution were plotted in the previous figure. I put lines at the values of β1MLEactual and G2actual, and I color coded the points that exceed those values. Obviously the points contributing to the p value for β1MLE
are quite different than the points contribution to the p value for G2. It is conceivable that the red and blue points would exactly equal each other mathematically (perhaps in a two-tailed version), but I doubt it.

Conclusion: This exercise leads me to conclude that the p values are different because they are referring to different questions about the the null hypothesis. Which one is more meaningful? For me, the sampling distribution of β1MLE makes more direct intuitive contact with what people want to know (in a frequentist framework), namely, Is the observed magnitude of the slope very probable under the null hypothesis? The sampling distribution of G2 is less intuitive, as it is asking, Is the observed ratio, of the probability of the data under the zero-slope model over the probability of the data under the free-slope model, very probable under the null hypothesis? Both questions are meaningful, but the first one asks directly about the magnitude of the slope, whereas the second one asks about the relative probabilities of data under the model structures.

Now the difficult semantic question: When the p values conflict, as they do in this example (i.e., p < .05 for β1MLE, while p > .05 for G2 [and if you prefer two-tailed tests, then we could contrive a case in which the p values conflict there, too]), is the slope "significantly" non-zero? The answer: It depends! It depends on the question you are asking. I think it makes intuitive sense to ask the question about the magnitude of the slope and, therefore, to say in this case that the slope is significantly non-zero. But if you specifically want to ask the model comparison question, then you need the model-comparison p value, and you would conclude that the free-slope model is not significantly better than the zero-slope model.


Sampling distribution of maximum lilkelihood estimate - help?

I'm doing Monte Carlo simulation of sampling distributions to compute p values. For example, consider the slope parameter, β1, in simple linear regression. I want to find out if p < .05 for a null hypothesis that β1=0. I'm working with two different ways to compute a p value, and am finding that the results do not agree with each other. Here are the two ways:
  1. Consider the maximum likelihood estimate (MLE) of β1 from the actual data, denoted β1MLEactual, and see where it falls in a sampling distribution of β1MLEnull for simulated data from the null hypothesis.
  2. Consider the likelihood ratio statistic, G2 = -2log(LR) where LR is the ratio of the maximum likelihood of the restricted model with β1=0 over the maximum likelihood of the full model with β1 free. We see where G2actual falls in the sampling distribution of G2null for simulated data from the null hypothesis.
I'm finding that the p values don't agree. What am I doing wrong?

Here's an example with details. Consider the data in this scatter plot:
The maximum-likelihood regression line is shown above in blue, and the maximum-likelihood zero-slope line is shown in red. The MLE parameter values are shown in the header of the plot; in particular β1MLEactual = 5.86. The header also shows that G2actual = 4.41.

For reference, in R a call to lm yields p = 0.068 for the slope coefficient. Let's see if either of the Monte Carlo schemes will agree with that result.

We need to generate Monte Carlo data samples. For this purpose, I use the x values in the actual data, and for each x value I generate a random y value according to the MLE values of the restricted model. The restricted model has parameter values shown in the header of the figure below (rounded to three digits):
Sampling distributions from the null hypothesis. The null-hypothesis parameter values are shown in the header of the
Thus, for each x value in the data, a random value for y is created in R as rnorm(1,mean=158,sd=36.6)using the full-precision MLE values, not just the rounded values shown here. For each simulated set of data, I (well, the computer) computed β1MLEnull and G2null. There were 50,000 simulated data sets to create fairly smooth and stable sampling distributions, shown above.

The sampling distributions above also show the locations of β1MLEactual and G2actual along with their p values. You can see that the p value for β1MLEactual does not match the p value for G2actual, but the latter does match the result of R's lm function.

Maybe the problem is merely Monte Carlo sampling noise. But, no, even much bigger simulations don't help (and the p values stay essentially the same).

Maybe the problem is one-tailed vs two-tailed p values. But, no, two times the MLE p value does not equal the G2 p value.

Maybe the problem is that the null hypothesis σ should be the unbiased estimate of SD(y), using N-1 in the denominator for computing SD, instead of the MLE for SD(y), which effectively uses N in the denominator. But, no, that still doesn't make the p value match. The p value for β1MLE does increase (i.e., the sampling distribution of β1MLEnull widens while the sampling distribution of G2null remains unchanged), but still does not match, either one-tailed or two-tailed.

What am I doing wrong with the sampling distribution of  β1MLEnull?


Wednesday, August 20, 2014

Converting combination of random variables to hierarchical form for JAGS (BUGS, Stan, etc.)

An emailer asks:

Hi, John. Long-time listener, first-time caller... I have a model that says X is a function of three (independent) random variables:
X ~ normal(mu, sigma) / uniform(a,a+b) - beta(v,w)
and I also have N random samples of X. Can I use JAGS to estimate mu, sigma, a, b, v and w? Or is this problem outside its scope? Not sure if it's even possible.

My reply:

In other words, you want

x <- r / u - b
r ~ normal( rMean , rSD )
u ~ uniform( uLow , uHigh )
b ~ beta( bA , bB )

but the problem is that JAGS won't accept that sort of model specification.

One solution is to change to an equivalent hierarchical formalization:

x ~ normal( rMean/u-b , rSD/u )
u ~ uniform( uLow , uHigh )
b ~ beta( bA , bB )

This hierarchical form is directly implementable in JAGS. But be careful that in JAGS the normal is parameterized by precision, not SD, so you'll need to write x ~ dnorm( rMean/u-b , 1/(rSD/u)^2 ).

To demonstrate that the two forms are equivalent, here is a Monte Carlo sample generated in R from each formalization:

# Specify parameter values and sample size:
rMean=7 ; rSD=2
uLow=3 ; uHigh=5
bA=4 ; bB=3
N = 50000

# Sample x values your original way:
b = rbeta(N, bA,bB)
u = runif(N, uLow,uHigh)
r = rnorm(N, rMean,rSD)
x1 = r/u-b

# Sample x values a new way:
x2 = rep(0,N)
for ( i in 1:N ) {
  b = rbeta(1, bA,bB)
  u = runif(1, uLow,uHigh)
  x2[i] = rnorm(1, rMean/u-b , rSD/u )

# Plot results:
text( mean(x1) , 0 , "+" , cex=2 )
text( mean(x2) , 0 , "+" , cex=2 )

The result is shown below, where you can see that the original and new distributions are identical:

Friday, August 15, 2014

How to use MCMC posterior as prior for future data

An emailer writes:

Dear Prof. Kruschke,
Hello. My name is ... and I am ... . I'm trying to apply Bayesian theorem in developing a model of ... . I used your code to estimate posterior distribution without any trouble. Here is my question. Would you kindly let me know how to use the estimated posterior distribution as the prior of another Bayesian update? ...  Any comment would be greatly appreciated. Thank you very much for your time. Regards,
 My reply:

Thanks very much for your interest in the book.

Because MCMC methods require a mathematically specified prior, but generate a Monte Carlo sample of the posterior, you need to either (a) find a reasonable mathematical summary of the MCMC posterior to use as the mathematical prior for the next batch of data or (b) concatenate the previous data with the next batch of data and analyze them together.

Option (b) gives you the mathematically correct answer, but at the cost of a large data set which might slow down the MCMC process, and it demands that the previous and next batch of data are structured the same way and analyzed with the same model.

For option (a), there is no uniquely correct approach. For each parameter, examine its marginal MCMC posterior distribution and summarize it with a reasonable mathematical function. For example, if the parameter is metric on positive values, maybe a gamma distribution would mimic the MCMC distribution well. Use a mathematical distribution with the same central tendency and standard deviation as the MCMC posterior. A peril of mimicking each marginal distribution separately is that you lose all information about the correlation of parameters in the posterior. If the correlation is important for capturing the prior information you want, then you can use a multivariate prior for two or more parameters simultaneously (e.g., dmnorm in JAGS).

Hope this helps.

P.S. (Added an hour or so after the text above.) A similar thing was done in the book with pseudo-priors. See the middle paragraph of p. 251: "The shape and rate constants for the gamma pseudopriors were set so that the gamma distributions had the same mean and standard deviation as the posterior sample of kappa values. For example, consider the parameter kappa0. In the initial run without pseudopriors, a reasonable posterior sample was obtained for model index 2 (i.e., when kappa0 is actually constrained by the data, as shown in the bottom panel of Figure 10.3). That posterior sample had a mean m and a standard deviation s. The conversion from mean and standard deviation to shape and rate parameters was explained in the caption of Figure 9.8. The conversion yields shape and rate values specified in lines 55 to 56 of the code."

Tuesday, August 12, 2014

Stopping and testing intentions in p values

An emailer asks:

I am really interested in Bayesian analysis, but I don't get the issue of sampling intention being so important in frequentist t-tests; if you have 60 values you have 60 values surely - why does your intention matter? The computer does not know what I intended to do; or have I missed the point entirely!?

That is exactly the point -- intuitively (for natural Bayesians) the stopping and testing intentions should not matter (for some things), but for p values the stopping and testing intention is of the essence, sine qua non. In lieu of reading the BEST article, see this video, starting at 7:15 minutes:

One other quick question if I may; how does the BEST package differ from the BayesFactor package?

An appendix of the BEST article explains, but you can instead see this video, starting at 2:35 minutes:

Those video excerpts might make the most sense if you just start with the first and watch them through, all the way...

Friday, August 8, 2014

Prior for normality (df) parameter in t distribution

A routine way to describe outliers in metric data is with a heavy-tailed t distribution instead of with a normal distribution. The heaviness of the tails is governed by a normality parameter, ν, also called the df parameter. What is a reasonable prior for ν? One answer: A shifted exponential distribution, which is what I used in the "BEST" article and software. But the book (on pp. 432-433) seems to use a different and obscure prior distribution. The purpose of this post is to show that the book's prior on the normality parameter is, in fact, a shifted exponential.

The book's prior on ν, there called tdf, is
  tdf <- 1 - tdfGain*log(1-udf)
  udf ~ dunif(0,1)
where tdfGain is a constant. In other words, tdf is a strange logarithmic transformation of udf, and udf has a uniform prior. Well, that's the same thing as a shifted exponential on tdf, with tdfGain being its mean.

To prove that claim, we could do some math, as suggested by the method in Section 23.4 (p. 630) regarding reparameterization. But instead I'll show a large Monte Carlo sample of tdf from the book, and superimpose a shifted exponential density, and you can see that they are identical:

In each of the above graphs, the title indicates the value of tdfGain, labelled as nuGain. A large number of values, udf, were randomly generated from a uniform distribution, and then those values were transformed to tdf (=ν) values using the formula tdf <- 1 - tdfGain*log(1-udf). The histogram plots the resulting tdf (=ν) values. The title of each graph also indicates the mean of Monte Carlo sample, which should be nuGain+1 except for randomly sampling noise. Superimposed on the histogram is a exponential density with mean set to nuGain, and shifted 1 unit to the right (because ν must be > 1). Thus, you can see that the book's prior for tdf really is a shifted exponential, as used in the BEST article.

Appendix: Here is the code for generating the graphs.
for ( nuGain in c(3,30,100) ) {
  # Random sample of nu from book prior:
  nu = 1 - nuGain*log( 1 - runif(50000) )
  # Plot histogram:
  xMax = 5*nuGain
  histInfo = hist( nu , probability=TRUE ,
                   col="skyblue" , border="white" ,
                   xlab=bquote(nu) , xlim=c(0,xMax) ,
                   cex.lab=1.5 , cex.main = 1.5 ,
                   breaks=c(0,seq(1,xMax,length=30),1E10) ,
                   main=bquote( list( nuGain==.(nuGain) ,
                                      mean(nu)==.(round(mean(nu),2)) ) ) )
  # Superimpose shifted exponential curve with same mean-1:
  x = seq( 1 , xMax , length=201 )   # Make x-axis comb.
  lines( x , dexp( x-1 , 1/(nuGain) ) , lwd=2 )
  text( mean(nu) , dexp( mean(nu)-1 , 1/(nuGain ) ) ,
        bquote("curve: dexp( "*nu*"-1 , mean="*.(nuGain)*" )") ,
        adj=c(-0.1,-0.1) , cex=1.5 )

Tuesday, April 1, 2014

Potpourri 2: more misc emails regarding doing Bayesian data analysis

More miscellaneous communications with readers. (I have omitted all the salutations and pleasantries to save space here, but most correspondents do begin and end their messages with introductions and salutations that I truly appreciate.) Again, apologies to those I don't reply to -- too many!

  • Adapt ANOVA-style models for mixed between- within- designs? 
  • glm() in R, with a poem 
  • Metropolis algorithm
  • I have some data, what should I do?
  • Could you look over my conference paper?
  • Time series models?
  • How to write up a Bayesian analysis
  • Trying to adapt programs, getting errors
  • Modifying FilconJags.R and FilconJagsPower.R for two groups instead of four groups
  • Sequential updating (including more data) when using MCMC
  • How to take the difference between two parameter distributions
  • What book should be studied after Doing Bayesian Data Analysis?
  • Exercise 13.4

March 9: Adapt ANOVA-style models for mixed between- within- designs?
My data set comprises two populations: patients and control subjects.
All subjects were repeatedly assessed - albeit in different conditions, because the treatment manipulation of the patients is not applicable for controls.
While our main interest pertains the patients' treatment conditions (4 levels), I'd like to compare them with the "natural variation" observed in the control group, who just repeated our experiments twice.
Thus abstractly, the patient group is split by a within-factor of 4 levels P1-P4, and the controls are split by a 2-level within factor, C1 and C2. What's the best method to compare these, assuming normal distributed response variables?

Currently I use one-way BANOVAs, extended by a subject factor to account for repeated measures. But with these models (4 and 2 levels, respectively), I think I can only model the two population groups separately in different models, which hinders direct comparisons between parameter estimates.

My questions in particular:
1) Can/should I extend the BANOVA model to include both populations?
2) Or is it possible to compare the parameters of different BANOVA models? If so, how?
3) Does this kind of design have a specific name? This might be handy to know to search for modeling examples.
Thank you very much in advance!
I talked to a colleague, who was of the opinion that I can directly compare posteriors, provided that they are in the same units and play the same "role" in the model.
In my example, it would be admissible to compare the b0 parameters, i.e. the BANOVA baselines from different models. Using that comparison, I would at least be able to state whether the groups are different from each other on average.

Specifically, the comparison has to be directly computed from the MCMC chains. For each pair of values, randomly sampled from the chains, I would compute the difference, and thus build up a distribution of differences. This then may be inspected with the usual methods, e.g. using its 95% HDI.

While this sounds correct to me, I'd like to know some more details. How many samples should/can I draw from the chains? Should I inspect the autocorrelation? Any help and/or pointers to literature are greatly appreciated!

Regarding the model structure, yes, you should be able modify the ANOVA-style models to suit the structure of your data. This is not to say it is necessarily trivial or obvious how to do that, especially when trying to impose the sum-to-zero constraints (if you choose to do so). For example, see this blog post about a split-plot design, which is related to the sort of design you have:

Regarding computing differences: At every step in the MCMC chain, you can compute differences or ratios or other transformations of parameters in whatever way is meaningful. At each step in the chain, the parameters are *jointly* representative of the posterior distribution.

If you have split your data into two parts that are modeled separately, then the parameter values are, by definition, independent. You can then set the two chains side-by-side and take differences between them as if they were in a joint space.

March 11: glm() in R, with a poem
n R question that I cannot get to the bottom of. I was hoping that you might be able to shed some light on it, as everyone I ask is drawing a blank. 

It is rather simple actually, I have two factors, and I want to do a logistic regression against a (0,1) target using glm in R. 

I have a 0,1 target and then use some explanatory variables which are factors with levels e.g. ... When I try and perform the logistic regression it refuses to use the first level of each factor, as you can see from the screen cap jk1.png attached to this e-mail. ...  Is this something you have experience of? Am I overlooking something trivial?! Apologies if so! 
As your question is about glm() in R, and not about Bayesian analysis per se, I will defer to the R experts. You might try posting your question on or related groups.
Best wishes.
In my desperation
To elucidate the facts
I disturbed you unnecessarily
When I should have consulted stacks.

You're welcome, it's a superb book, and has helped me a great deal. Kind regards

March 1: Metropolis algorithm
I am currently doing my statistics thesis on ... which requires quite a great knowledge of Bayesian Theory especially MCMC methods. However I have some minor problems regarding the Metropolis algorithm:

Any distribution can be attributed to the proposal density function?

Also, the initial values given to the parameters (i.e. the first step of the Metropolis algorithm) are random or there must be some thinking behind the choice of values?

Thank you very much
If you are programming your own Metropolis sampler, you should study the general Metropolis-Hastings algorithm and learn about methods for "tuning" the proposal distribution during an adaptation phase. A great thing about JAGS/BUGS is that they do it for you.

The initial values can be anywhere that has non-zero prior and non-zero likelihood (actually, even that is okay - the real constraint is that the prior times likelihood needs to be defined at the initial values), but initial values near the mode of the posterior make burn-in a lot faster. In either case, you need to monitor convergence to be sure that the chains have lost any bias from their starting positions.

Good luck with your work.

March 14: I have some data, what should I do?
I'm contacting you to ask for your advise on the following problem.

I have 500 samples, each sample being ... [extended description omitted].
Using Bayesian analysis I want to determine the error rate of ... .
Below is an exemplary of the data set. [data omitted] I have never used Bayesian Analysis before so I dont really know how to proceed. I'm thinking that Bayesian analysis is suitable to answer the research question.
-The idea is to use a non informative prior, so the sample can be positive or negative.
-Not sure what distribution to use (binomial, bernoulli, multinomial,). not sure.
-The rest, I do not know.

I look forward to your reply.
Thanks very much.
Thanks for you interest in Bayesian analysis. I hope you do choose to pursue it.

The first thing you need to analyze your data is a descriptive model. The goal of the analysis is to estimate parameter values that are meaningful descriptions of the data. Once you have a model, Bayesian analysis provides a distribution of credible values of the parameters, given the data.

As your field of research is not my specialization, I'm not in a position to prescribe the model you should use. But, whatever it is, I am confident that Bayesian analysis is a richly informative method for analyzing your data.

Best wishes,

 March 16: Could you look over my conference paper?
... I will be presenting a paper at the annual meeting of the ... Association. My paper is based on your BEST software, and I was wondering if you would be willing to take a look at it to point out any major errors I may have committed and save me from potential embarrassment.
Thanks for thinking of me and showing me a draft of your paper. [And I meant that sincerely!] Unfortunately I don't have time to study its 20 pages. If you have a specific question about some topic, however, I might be able to provide a quick answer. Best of luck with your presentation!

March 16: Time series models?
Do you think it is possible to apply Bayesian data methods to time series? What I would hope to avoid is trying to splice GARCH methodology with the Bayesian approach. GARCH is lumbering and clumsy enough in itself. For high frequency financial data, it must surely be possible to assume distribution stability, until evidence begins to emerge that the distribution has broken down, and needs to be reformulated? I would greatly value your advice.
I have not ventured deeply into time-series models, so I have no personal endorsements of particular resources. But Bayesian methods can (in principle) be applied to any model that has parameters, even so-called "non-parametric" models (which have up to an infinite number of parameters). For example, there is a 2011 book titled "Bayesian time series models" that collects recent advanced applications. Although it's not a beginner's tutorial, it might provide useful pointers.

Feb. 25: How to write up a Bayesian analysis
Sir, thanks for Bayesian Estimation Supersedes the t-Test, I can make the plots, sorry but how to analyse the plots and write a paper. Will you please put up a model paper with such analyses. [Here] we have been brought up on a strict diet of ANOVA-RBD/CRD, and t tests with CV values (Analysis of Variance - Randomized Block Design / Completely Randomized design). How can I do these analyses / similar in Bayesian methods,
Reply from Mike Meredith:

I can't point you to papers describing a Bayesian analysis of ANOVA-type models, but I will indicate sources with advice and pass this on to John Kruschke who may have more references to hand.

John has a blog post on the topic at This has a reproduction of the first part of section (23.1) on "Reporting a Bayesian analysis" from Kruschke, J.K. (2011) Doing Bayesian data analysis: a tutorial with R and BUGS Elsevier, Amsterdam etc.

The main points are summarised in Kruschke, J.K., Aguinis, H., & Joo, H. (2012) The time has come: Bayesian methods for data analysis in the organizational sciences. Organizational Research Methods, 15, 722-752 in the section "Recommendations and Illustration of How to Report Bayesian Analysis Results", where they also give an example of writing up a regression analysis.

Also Kruschke, J.K. (2013) Bayesian estimation supersedes the t test. *Journal of Experimental Psychology: General*, 142, 573-603 has a section on "Reporting the results of a Bayesian analysis". You would need to cite that paper as the description of the methods for robust Bayesian analysis.

There are links to the two Kruschke papers at

In my own field (wildlife ecology) we can rarely do experiments, so hypothesis testing is scarcely relevant anyway. The focus is on estimation (of density, abundance, occupancy, survival, etc) or on modelling. In the latter context, information theoretic approaches (using Akaike's Information Criterion, AIC) have been the norm, and moving from there to a Bayesian approach is relatively painless. So I can't point you to papers reporting a Bayesian analysis of ANOVA-type data; Marc Kery's book, An Introduction to WinBUGS for Ecologists (Academic Press, 2010), shows how to do this, but not how to write it up.

April 1: Trying to adapt programs, getting errors
I have been looking through your textbook Doing Bayesian Data Analysis and specifically have been looking at how to do the Bayesian ANOVA in Chapter 18. I have been running into many problems, and I was wondering if you would have time to take a quick look over my code to see what's going wrong. I have no familiarity with R prior to this book and no experience programming, so I am not sure where my errors are. Generally speaking, the problem is that objects and functions are not being found. I have attached a the .csv of my imaginary data set and a .txt of the code I am running in RStudio in case you are able to take a look (I don't mean to presume you will, it's just quicker to already have the materials attached). I do load the .csv into RStudio before running the code. I suppose my core problem here is that I don't know how to take the code from the book using the examples there and apply the code to my own data sets for personal use. Thank you for your time,
Thanks very much for your interest in the book. To have any chance to help you, I need to know what error messages are being shown. Could you please run your script and then copy and paste the R console, with error messages, into an email? Especially important is to identify the very first error that occurs.
Then I run the code, which I have attempted (unsuccessfully) to adapt from what is in the textbook for a Bayesian ANOVA.

After running the code, this is what I get. It seems like it's failing to create several objects that it needs to perform the analysis, or is not finding the functions it needs to do it. I'm not really sure where the problem begins. I know in programming code a single error can lead to other errors, but this is my first exposure to R programming and the code in the book is for specific practice examples and I'm just not sure if I am adapting it to my pretend data set properly (which I'm most likely not given the errors).

 I also get several windows where graphs are supposed to be, but nothing is in them. I colored the errors red to make them easier to find.


> modelCheck( "model.txt" )
Error: could not find function "modelCheck"
+     fnroot = paste( fnroot , dataSource , sep="" )
+     datarecord = read.table( "model.txt", header=T ,
+                              colClasses=c("factor","numeric") )
Error: unexpected symbol in:
"    NxLvl = length(unique(datarecord$Group))
    contrastList = list(1v2"
A few quick thoughts:

* If you're just interested in comparing two groups, definitely look here:
It provides you with complete programs for that case.

* I now strongly recommend NOT using BUGS, and instead using JAGS. JAGS is much like BUGS but works better. See these posts:

* Your first error,
> modelCheck( "model.txt" )
Error: could not find function "modelCheck"
is that R cannot find BRugs. I think you need to first say
or whatever (I've forgotten since it's been so long that I've used BRugs).

* Your second error is that datarecord() is reading in model.txt, not data.
Hope that helps.

March 25: Modifying FilconJags.R and FilconJagsPower.R for two groups instead of four groups
The R code assumes the existence of four conditions (consistent with the ongoing example), but the data file has data for only two, so the code fails when it attempts to access the non-existent data. I can remove the code for handling the two conditions not in the data, but I assume that the intent was to have all four in the Rdata file. Is there a version of FilconJagsMuKappa.Rdata that has the expected four conditions?
If I understand your question correctly, what you are trying to do is apply the FilconJags.R and FilconJagsPower.R programs to data with only 2 groups instead of 4 groups. I think the programs should run without complaint for just 2 groups, although there won't be much benefit (i.e., shrinkage) from the hierarchical structure when there are only 2 groups.

Try this:
In FilconJags.R, include only the first 2 groups. Find the data section and insert the lines shown here:

# For each subject, specify the condition s/he was in,
# the number of trials s/he experienced, and the number correct.
# (These lines intentionally exceed the margins so that they don't take up
# excessive space on the printed page.)
cond = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4)
N = c(64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64,64)
z = c(45,63,58,64,58,63,51,60,59,47,63,61,60,51,59,45,61,59,60,58,63,56,63,64,64,60,64,62,49,64,64,58,64,52,64,64,64,62,64,61,59,59,55,62,51,58,55,54,59,57,58,60,54,42,59,57,59,53,53,42,59,57,29,36,51,64,60,54,54,38,61,60,61,60,62,55,38,43,58,60,44,44,32,56,43,36,38,48,32,40,40,34,45,42,41,32,48,36,29,37,53,55,50,47,46,44,50,56,58,42,58,54,57,54,51,49,52,51,49,51,46,46,42,49,46,56,42,53,55,51,55,49,53,55,40,46,56,47,54,54,42,34,35,41,48,46,39,55,30,49,27,51,41,36,45,41,53,32,43,33)

# Include only the first two groups:
includeIndex = ( cond <= 2 )
cond = cond[includeIndex]
N = N[includeIndex]
z = z[includeIndex]

nSubj = length(cond)
nCond = length(unique(cond))
Then source FilconJags.R. This will save the MCMC chain for the 2-group model. It will also try to produce graphs, but because the graphs assume four groups, the program will balk. But the posterior MCMC is saved before that happens.

FilconJagsPower.R will now run as is, except for two things. First, its graphics assume 4 groups. Therefore you can make this change:

    # Make plots for first 10 simulated experiments:
    #if ( expIdx <= 10 ) { plotRes = T } else { plotRes = F }
And, its "goals" are defined in terms of 4 groups, so you can make this change:

# Goal is filtration vs condensation 95% HDI exceeds zero:
#goal1Ach = ( HDIofMCMC( (mu[1,]+mu[2,])/2 - (mu[3,]+mu[4,])/2 )[1] > 0.0 )
# Goal is filtration1 vs filtration2 95% HDI exceeds zero:
goal2Ach = ( HDIofMCMC( mu[1,]-mu[2,] )[1] > 0.0 )
# Goal is condensation1 vs condensation2 95% HDI has width less than 0.2:
#HDIlim = HDIofMCMC( mu[3,]-mu[4,] )
#HDIwidth = HDIlim[2] - HDIlim[1]
#goal3Ach = ( HDIwidth < 0.2 )
#goalAchieved = c( goal1Ach , goal2Ach , goal3Ach )
goalAchieved = c( goal2Ach )

I hope that steers you toward achieving your goals!

April 1: Sequential updating (including more data) when using MCMC
I'm sorry to bother you, but I have a question for which I did a lot of researches, but I couldn't get any result. Basically, I implemented a Hierarchal Bayesian model in R and it works fine. However, what I will like to do now is to apply the model on a trial by trial basis (instead of applying it on the whole set of trials) and get the parameters estimations after each trial to use them both as "prior" parameters on the next trial and to track the evolution of these parameters. I do not want to use sequential methods, I just want to apply the model on a trial-by-trial methods. Do you think you can give some idea?
If you are using MCMC methods, and you really want to use the posterior from one run as the prior for the next, it is not directly possible. This is because the posterior distribution from the first run will usually be complex (e.g., curved and correlated in parameter space, with possibly skewed tails and so on) and not possible to be exactly expressed mathematically in the prior specification. (The beauty of using conjugate priors is that the posterior has the same mathematical form as the prior, so incremental updating is elegantly easy. For complex models, there are no [known] conjugate priors.)
Some people attempt to get around that problem by approximating the posterior of the first run with some simpler mathematical form, and using that mathematical re-description as the prior for the next run. But this method is only as good (or bad) as the approximation, and I wouldn't recommend it unless you have really cogent reasons to think that the approximation is excellent.
If you are merely illustrating how the posterior changes with incremental updating, then you can simply do the data analysis using incremental data sets. Do an analysis with {x1,...xN}, then an analysis using the original prior with {x1,...,xN,xN+1}, then an analysis using the original prior with {x1,...xN,xN+1,xN+2}, and so on.

March 28: How to take the difference between two parameter distributions
one thing that has me perplexed right now is how you take the difference between two posterior distributions to see if there is a change. One very common application to us is to have similar data on a specific metric before and after a ... change. Naturally, the intent of our analysts is to understand if there has been a difference after that ... change. Our first step is to calculate the posteriors of the two datasets (before and after). Then I assumed that the posteriors are a random variable so they are compared by forming the convolution of the two distributions. However, if you just visualize two normal distributions (the posteriors) where one has a mean near zero and the other is shifted (say, to 5), the convolution will still cover zero. So although these two datasets are clearly different and the software change has had an impact, the HDI will still straddle zero and the conclusion will be that there was no change. Is this method of using convolution how you did your poster comparisons in the textbook? If not, I would very much like to know the methods for taking this difference like you did with respect to recall and classical music in your text (chapter 12). Thanks very much for you time,
At every step in the MCMC chain, the parameter values at that step are jointly representative of the multidimensional posterior distribution. Therefore the difference of relevant parameter values at each step is a representative difference.

I think the first example of this in the book occurs at the top of p. 176.

Because posterior distributions can have strange shapes, it is important to compute the difference at each step, maintaining the joint parameter values. See Fig. 12.3, p. 300.

March 27: What book should be studied after Doing Bayesian Data Analysis?
What do you recommend as a good second book [after Doing Bayesian Data Analysis]? I’m a long time R user and my problem space is in engineering.
For a follow-up book, I'd recommend
Ntzoufras (2009) Bayesian modeling using WinBUGS
Gelman et al. (2014) Bayesian data analysis 3rd ed.
You might also try
Lee & Wagenmakers (2014) Bayesian cognitive modeling.

None of those is specifically engineering, but Ntzoufras and Lee & Wagenmakers give lots of BUGS/JAGS examples and Gelman et al. is a great authoritative resource.

March 27: Exercise 13.4
Exercise 13.4 refers to Kruscke1996CSbugs.R; I don’t see it (or the Jags version) in the collection of R programs and data. The associated data file is there (Kruschke1996CSdatsum.Rdata), however. Is it not there on purpose (i.e., the student is expected to write the code from scratch)?
The program for that exercise was deliberately excluded from the posted programs because it was supposed to be a programming exercise, not merely "click the 'source' button and watch what happens." I should not have mentioned the program name in the exercise description, but I did because that was how I got my word processing software (LaTeX) to include code snippets. The program I have for that exercise is old and not prepared for release, so I would prefer not to circulate it. I apologize that this is not the answer you were hoping for!

Saturday, March 22, 2014

Potpourri of recent inquires about doing Bayesian data analysis

I get a lot of email asking about the book and the blog. Too often, I say to myself, "That's interesting. I'll have to think about it and reply later." And then later never comes. My apologies to everyone to whom I have not yet replied. Below is a sample of a few of the messages I've received in the last few days, actually with replies.

  • Initializing chains for multi-factor ANOVA-style models.
  • Errors in BUGS (another reason to use JAGS).
  • Getting the decimal places of the HDI limits that don't appear in the graphs.
  • Effect size measure in ANOVA-style log-linear models.
  • Variable selection in multiple regression.

March 22, 2014:
Dear Prof. Kruschke,

by now I've opened and closed your contact form many a times, and have been successfully able to tackle each problem I had at the time using your book. Now, though, I am stumped. I am calculating a Poisson Exponential "ANOVA" for 4 nominal predictors and the code for initializing the chains troubles me. For that matter, my question is regarding any and all situations in which there are more than two nominal predictors present. The code to initialize the linear predictor sum is this:

linpred = as.vector( outer( a1 , a2 , "+" ) + a0 )

I understand the concept of the outer() function for two variables, but could you please explain how you would deal with adding a3 and a4 into this mix, assuming the rest of the code has been updated accordingly?

Kind regards,


Thanks very much for your interest in the book.

One approach is to try not to initialize manually, letting JAGS initialize for you. You have to check that the chains have burned in well and converged. But if it works, problem solved.

But if you really want to manually initialize, you have a few choices. One option is not to bother with the interaction terms. That is, initialize the main effects with lines like
a0 = mean( theData$y )
a1 = aggregate( theData$y , list( theData$x1 ) , mean )[,2] - a0
a2 = aggregate( theData$y , list( theData$x2 ) , mean )[,2] - a0

but leave the interaction terms all initialized at zero.

If you really, really want to initialize the interactions, you can use the outer() function to build higher dimensional arrays. To get help, type
at R's command line. As an example, consider this:

> a=1:2
> b=3:5
> c=6:9
> outer(a,b,"+")
     [,1] [,2] [,3]
[1,]    4    5    6
[2,]    5    6    7
> outer(c,outer(a,b,"+"),"+")
, , 1

     [,1] [,2]
[1,]   10   11
[2,]   11   12
[3,]   12   13
[4,]   13   14

, , 2

     [,1] [,2]
[1,]   11   12
[2,]   12   13
[3,]   13   14
[4,]   14   15

, , 3

     [,1] [,2]
[1,]   12   13
[2,]   13   14
[3,]   14   15
[4,]   15   16

Hope that helps!

March 20, 2014:
Hi John,

First of all I would like to mention that I am really enjoying reading the book.

I have studied a bit of Bayesian statistics in a university course and I find really interesting to see real world applications of MCMC.

I have been doing the exercises okay but I have come across a stumbling block in chapter Nine.

On exercise 9.2 when I changed the data parameters as per exercise solutions I do get the following error message. These errors appear when I run the FilconBrugs.R script in its original form as well

'Initializing chain 1:
initial values loaded and chain initialized but another chain contain uninitialized variables
Initializing chain 2:
initial values loaded and chain initialized but another chain contain uninitialized variables
Initializing chain 3:
model is initialized,

After this error then I encounter the following one:

"update error for node <theta[4]> algorithm conjugate beta updater error argument two  has too small a value"

Basically the model does not produce any posterior. If you could let me know what the error could be due to it would be much appreciated. I'm using R studio and I haven't come across any errors in the rest of exercises and code that I have ran so far.

Kind Regards,


Thanks very much for your interest in the book.

First off, I strongly recommend using JAGS instead of BUGS. I have repeatedly found that BUGS yields troubles when JAGS works fine. Here is the link to a blog post about using JAGS instead of BUGS:

Both of the errors you encounter are from BUGS. The statements about intializing the chains are not errors, but merely status reports. It's simply going through the chains and initializing them, but letting you know that after chain 1 there are still 2 more chains to go, and after chain 1 there is still another chain to go.

The bit about "update error for node..." is a genuine error. As I recall, I do not get that sort of error in JAGS. There are work-arounds in BUGS, by truncating the distribution or the values of higher-level parameters. But give JAGS a try and see if that solves the problem.

Thanks again.

March 20, 2014:
Good afternoon Mr. Kruschke,

I have a  questions about two-way ANOVA from your book. On page 522, you run an example using salary data and you mention that the means and HDI limits are displayed with only three significant digits, but more precise values can be obtained directly from the program.  How do I get those more precise values?

Thank you for help,

Thanks very much for your interest in the book.

The 22-Jun-2012 version of plotPost.R (see programs at ) produces a summary of the posterior as its returned value. For example, if you use the R statement

postInfo = plotPost( ... )

then postInfo has the mean, median, mode, HDI limits, etc.

Don't forget that even though multiple decimal points are there, they are not necessarily meaningful. The values are estimates from a random MCMC sample from the posterior, so they are noisy!

Thanks again.

March 19, 2014:
Dear Prof. Krushke, I'm working with ch 22 of DBDA to analyse contingency tables. I'm curious to know where to look for a measure of effect size, both for the whole table and for individual interactions. Any tips you have would be much appreciated!


Thanks for your interest in the book.

My short answer is that I don’t have an adequate answer. There are many relevant measures of effect size in the frequentist literature, of course, but we seek something for hierarchical Bayesian log-linear models. A measure of effect size is supposed to be an aggregated difference (i.e., the “effect”) standardized relative to some indicator of variance or noise in the data. Moreover, in a Bayesian setting, the effect size has a posterior distribution; it’s not just a point estimate. An example of posterior effect size for a difference between groups is given here:
One possibility for describing effect size in hierarchical Bayesian ANOVA-style models (including log-linear models) is a ratio of estimated variance parameters. For example, the main effect of factor A might have an “effect size” given by sigma_A divided by sigma_y, where sigma_A is the scale parameter of the distribution of the factor-A deflections, and sigma_y is the scale parameter of the within-cell noise distribution. But that’s just off the top of my head.
Thanks again,

March 18, 2014:
John, I really enjoyed reading your post [] as, similar to you, I feel that when it comes to real data there is a lot of fine tuning that is usually inadequately addressed in papers/books.

I have a problem where the posterior probability of delta=1 is always above 0.5 (even when the respective variable should clearly not be selected). I was wondering whether you made progress with your issues and perhaps can share some light on mine.



Thanks for the comment.

I’m not sure how you know that the “respective variable should clearly not be selected.” If you already know that, does that mean that you have prior knowledge that is not being expressed in the prior distribution of the model?

My impression of variable selection is that it’s a bit of an art, in the sense that the prior matters! But that’s true of Bayesian model selection in general. Real prior knowledge should be implemented if possible, and, in the case of variable selection, shrinkage from hierarchical structure should be implemented if appropriate.