Friday, December 21, 2012

Another reason to use JAGS instead of BUGS

BUGS is the pioneering software that made MCMC available to so many of us, but it has some problems with robustness that are not suffered by the subsequent software JAGS. Readers of DBDA tell me of some new problems running models in BUGS, which I have not been able to solve in BUGS, but the same model runs fine in JAGS. The purpose of this post is to point out the particular problem and the complete solution in JAGS. (If you have a solution for BUGS, please comment.)

I converted all the BUGS programs to JAGS versions back on 01 January 2012. Please see the previous post for information about JAGS.

A reader reports this problem:
I [the reader] have received this error several times:
"update error for node <theta[66]> algorithm conjugate beta updater error argument two  has too small a value "
...  In particular, I [the reader] noticed for exercise 12.3, if I remove the 0's from the data, I do not get the error. 
When I try running the program from Exercise 12.3 with BUGS, I also get the error, although I did not get any error a few years ago when I created the program and reported its results in the solutions manual. I have tried to solve the problem by arbitrarily censoring/truncating the distributions in BUGS (as I did in some other programs in the book), but without success in this application.

I quickly re-wrote the BUGS+BRugs program for JAGS+rjags, going step-by-step through the guidelines from this previous post. The resulting program, using the identical model statement, runs fine in JAGS and produces results identical to those in the solutions manual. Here is the complete program and results:


graphics.off()
rm(list=ls(all=TRUE))
fileNameRoot="Exercise12.3.A.JAGS" # for constructing output filenames

require(rjags)

if ( .Platform$OS.type != "windows" ) {
  windows <- function( ... ) X11( ... )
}


#------------------------------------------------------------------------------
# THE MODEL.

modelstring = "
model {
   for ( subjIdx in 1:nSubj ) {
      # Likelihood:
      z[subjIdx] ~ dbin( theta[subjIdx] , N[subjIdx] )
      # Prior on theta: Notice nested indexing.
      theta[subjIdx] ~ dbeta( a[cond[subjIdx]] , b[cond[subjIdx]] )
   }
   for ( condIdx in 1:nCond ) {
      a[condIdx] <- mu[condIdx] * kappa[condIdx]
      b[condIdx] <- (1-mu[condIdx]) * kappa[condIdx]
      # Hyperprior on mu and kappa:
      mu[condIdx] ~ dbeta( Amu , Bmu )
      kappa[condIdx] ~ dgamma( Skappa , Rkappa )
   }
   # Constants for hyperprior:
   Amu <- 0.35 * 2
   Bmu <- 0.65 * 2
   Skappa <- pow(meanGamma,2)/pow(sdGamma,2)
   Rkappa <- meanGamma/pow(sdGamma,2)
   meanGamma <- 10
   sdGamma <- 10
}
" # close quote for modelstring
# Write model to a file:
writeLines(modelstring,con="model.txt")

#------------------------------------------------------------------------------
# THE DATA.

nCorrOfSubj = c( 8, 4, 6, 3, 1, 4, 4, 6, 4, 2, 2, 1, 1, 4, 3 ,3 ,2 ,6, 3, 4,
                 2, 1, 1, 3, 2, 7, 2, 1, 3, 1, 0, 2, 4, 2, 3, 3, 0, 1, 2, 2 )
CondOfSubj = c( rep(1,20) , rep(2,20) )
nTrlOfSubj = rep(10,40)
nSubj = length(CondOfSubj)
nCond = length(unique(CondOfSubj))

# Specify the data in a form that is compatible with BRugs model, as a list:
dataList = list(
 nCond = nCond ,
 nSubj = nSubj ,
 cond = CondOfSubj ,
 N = nTrlOfSubj ,
 z = nCorrOfSubj
)

#------------------------------------------------------------------------------
# INTIALIZE THE CHAINS.


#------------------------------------------------------------------------------
# RUN THE CHAINS.

parameters = c("mu","kappa","theta")     # The parameter(s) to be monitored.
adaptSteps = 500              # Number of steps to "tune" the samplers.
burnInSteps = 1000            # Number of steps to "burn-in" the samplers.
nChains = 3                   # Number of chains to run.
numSavedSteps=50000           # Total number of steps in chains to save.
thinSteps=1                   # Number of steps to "thin" (1=keep every step).
nIter = ceiling( ( numSavedSteps * thinSteps ) / nChains ) # Steps per chain.
# Create, initialize, and adapt the model:
jagsModel = jags.model( "model.txt" , data=dataList , # inits=initsList ,
                        n.chains=nChains , n.adapt=adaptSteps )
# Burn-in:
cat( "Burning in the MCMC chain...\n" )
update( jagsModel , n.iter=burnInSteps )
# The saved MCMC chain:
cat( "Sampling final MCMC chain...\n" )
codaSamples = coda.samples( jagsModel , variable.names=parameters ,
                            n.iter=nIter , thin=thinSteps )
# resulting codaSamples object has these indices:
#   codaSamples[[ chainIdx ]][ stepIdx , paramIdx ]


#------------------------------------------------------------------------------
# EXAMINE THE RESULTS.

source("plotPost.R")

mcmcChain = as.matrix( codaSamples )

# Extract parameter values and save them.
mu = NULL
kappa = NULL
for ( condIdx in 1:nCond ) {
   mu = rbind( mu , mcmcChain[, paste("mu[",condIdx,"]",sep="") ] )
   kappa = rbind( kappa , mcmcChain[, paste("kappa[",condIdx,"]",sep="") ] )
}
save( mu , kappa , file=paste(fileNameRoot,"MuKappa.Rdata",sep="") )
chainLength = NCOL(mu)

# Histograms of mu differences:
windows(width=7,height=7)
plotPost( mu[1,]-mu[2,] , xlab=expression(mu[1]-mu[2]) , main="" ,
          breaks=20 , compVal=0 , ROPE=c(-.1,.1) )
savePlot( file=paste(fileNameRoot,"MuDiffs.jpg",sep="") , type="jpg" )

# Scatterplot of mu,kappa in each condition:
windows(width=7,height=7)
muLim = c(min(mu),max(mu))
kappaLim = c(min(kappa),max(kappa))
mainLab="Posterior"
thindex = round( seq( 1 , chainLength , len=300 ) )
plot( mu[1,thindex] , kappa[1,thindex] , main=mainLab ,
      xlab=expression(mu[c]) , ylab=expression(kappa[c]) , cex.lab=1.75 ,
      xlim=muLim , ylim=kappaLim , log="y" , col="red" , pch="1" )
points( mu[2,thindex] , kappa[2,thindex] , col="blue" , pch="2" )
savePlot( file=paste(fileNameRoot,"Scatter.jpg",sep="") , type="jpg" )





Tuesday, November 27, 2012

Shrinkage in multi-level hierarchical models


In a previous post and in a video, I used baseball data to illustrate some ideas about shrinkage and multiple comparisons in hierarchical models. Here I take it a step further, to multi-level hierarchical models, to illustrate shrinkage of estimates within different levels of the model.

Background: Consider the sport of baseball, in which players have opportunities at bat during a season of games, and occasionally hit the ball. In Major League Baseball, players typically hit the ball on about 24% of their opportunities at bat. That ratio, of hits (H) divided by opportunities at bat (AB), is called the "batting average" of each player. Players also must play a position in the field when the other team is at bat. Some fielding positions are specialized and those players are expected to focus on those skills, not necessarily on hitting. In particular, pitchers are not expected to be strong hitters, and catchers might also not be expected to focus so much on hitting. (But I'm no baseball aficionado, so please forgive any inadvertent insults or errors! My thanks to colleagues Ed Hirt and Jim Sherman for tutoring me in some of the basics of baseball rules.)

Our goal is to estimate the underlying probability of getting a hit by each player, based on their hits H, at bats AB, and primary position. The model will put each player's estimated hitting ability (i.e., probability of getting a hit) under a distribution for the primary position, and put those position distributions under an overarching distribution across all positions. Here is a diagram of the hierarchical model. It should be scanned from the bottom up, as described next.
The data are indicated at the bottom of the diagram, with Hi being the number of hits for player i, and ABi being the number of at-bats for player i. The parameter θi in the binomial distribution is the underlying probability of getting a hit for player i. The θi are assumed to come from a beta distribution, but there is a distinct beta distribution for each type of field position: pitcher, catcher, 1st base, 2nd base, 3rd base, shortstop, left field, center field, and right field. The subscript pp(i) indicates the primary position of player i. Thus, each of the nine position-specific beta distributions has a mean μpp and a consistency (like precision) κpp. (Now I notice a small error in the diagram: The ellipses labeled with pp(i) should instead be labeled with only pp.) Moving up the diagram, the nine position-specific means μpp come from a beta distribution, with a a mean μμpp that expresses the central tendency across all positions.At the top level, the priors are vague and noncommittal.

The data are from 948 players in the 2012 regular season of Major League Baseball. Thanks to Matthew Maenner for posting a comment explaining how to download the data. The model simultaneously estimates 948 + 9*2 + 4 = 970 parameters.

Results:


Let's first consider the estimates of hitting ability for different positions. At right are the marginal posterior distributions for the μpp parameters for the positions of catcher and pitcher. The blue distributions show the credible values of the parameters, and the red crosses on the x-axis indicate the ratio in the data of total hits divided by total at bats for all players in that position. Notice that the modes of the posterior are not centered on the red crosses. Instead, the modal estimates are shrunken toward the middle between the pitchers (who tend to have low batting averages) and the catchers (who tend to have higher batting averages). The lowest panel shows the distribution of differences between catchers and pitchers. The distribution falls far above zero, so we can say with high credibility that catchers hit better than pitchers. (Click on the picture to see it enlarged.)


At right is a comparison of two individual players with same record, 1 hit in 3 at bats, but playing different positions, namely catcher and pitcher. Notice that the red crosses are at the same place on the x-axes for the two players, but notice the radically different estimates of their underlying probability of getting a hit, because of the different positions they play. The data from all the other catchers tells the model that catchers tend to have theta values around 0.24. Because this particular catcher has so few data to inform his estimate, the estimate from the higher-level distribution dominates. The same is true for the pitcher, but the higher-level distribution says that pitchers tend to have theta values around 0.13. The resulting distribution of differences, in the lowest panel, suggests that these two players may have credibly different hitting abilities, even though their hits and at-bats are identical! In other words, because we know the players play these particular different positions, we can infer that they probably have different hitting abilities. (Click on the picture to see it enlarged.)


Here is a comparison of two indivdiual players, both of whom are pitchers, with seemingly quite different batting averages of 18/61 and 4/61, as marked by the red crosses on the x-axis. But the posterior estimates of their hitting probabilities are not so different. Notice the dramatic shrinkage of the estimates toward the average of players who are pitchers. In the lower panel, we see that a difference of zero is credible. (Click on the picture to see it enlarged.)


Here is a comparison of two indivdiual players with different records but both playing Center Field. The larger amount of individual data (593 and 120 at bats) allows somewhat less shrinkage than the previous example, so that the estimated difference excludes zero (but still overlaps the particular ROPE used here). Notice also the difference in widths of HDIs for the two players: There is a narrower HDI for the player with more data. (Click on the picture to see it enlarged.)


Here are two players from Right Field with huge numbers of at-bats and nearly the same (impressive!) batting average. The HDI of the difference falls almost entirely within the ROPE, so we accept that the players have identical probability of getting a hit (for practical purposes). (Click on the picture to see it enlarged.)







Summary: With multi-level hierarchical models, there is shrinkage within each level. In this model, there was shrinkage of the position parameters toward each other, as illustrated by the pitcher and catcher distributions. And there was shrinkage of the player parameters within each position, as shown by various examples above. The model also provided some strong inferences about player abilities based on position alone, as illustrated by the estimates for individual players with few at bats.

Monday, November 12, 2012

Video: Bayesian Methods Interpret Data Better

Another 15 minute talk, (to be) presented at the Psychonomic Society Special Session on Improving the Quality of Psychological Science. This talk briefly discusses sequential testing, the goal of achieving precision (as opposed to rejecting a null value), and multiple comparisons. The video can be found here.

Sunday, November 11, 2012

Video: Bayesian estimation supersedes the t test

Bayesian estimation supersedes the t test in 15 minutes of video.  
For the complete info, see the article and software at this site.

Friday, October 12, 2012

Abelson's paradox, baseball, null differences, the ROPE, and hierarchical Bayesian analysis

In a thread elsewhere, a reader of Bayesian estimation supersedes the t test commented: "Doesn't Abelson's paradox preclude the establishment of guidelines for a sufficiently small effect size?"  In this post I briefly review Abelson's paradox and demonstrate how hierarchical Bayesian data analysis is actually applied to real baseball data of the type that Abelson used to illustrate his paradox. Bayesian estimation provides rich estimates of the abilities of players and their differences. There is no paradox.

First, a cursory review of Abelson's paradox. It was described in a brief article: Abelson, R. P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97(1), 129-133. Abelson considered the batting averages of major league baseball players. A batting average is simply the ratio of number of hits to number of at-bats. In the 2012 season, players each had about 550 opportunities at bat, and typically got about 150 hits, for a typical batting average of about 0.27. Across 143 different players, the batting averages ranged from about 0.19 to 0.34. (Data from ESPN.) Abelson was concerned with whether that variation across players was large compared to the variation within players; in other words, he wanted to know the magnitude of the proportion of variance accounted for by player ability. He came up with a mathematical approximation and argued that for typical data the magnitude of the proportion of variance accounted for is about 0.003. The "paradox" is that such a small number conflicts with sports fans' intuition that player ability should account for a lot more of the variance of hits. Abelson discussed the meaning of his calculation and how it differs from the everyday intuition of the sports fan, saying that his statistic refers to the predictability of a single at-bat. But I can't say I fully understand what Abelson was going for, and the Bayesian analysis of real data (shown below) seems clear and unparadoxical to me.

What does any of that have to do with Bayesian data analysis? Presumably the reader who commented about Abelson's paradox was thinking about the approach to null values that I and others have advocated, whereby a null value for a parameter is deemed accepted for practical purposes if its 95% highest density interval (HDI) falls entirely within a region of practical equivalence (ROPE) around the null value. Perhaps the reader was concerned that no ROPE can be established because even tiny values, like Abelson's 0.003, can correspond to intuitively large effects.

Perhaps the best rebuttal is simply to demonstrate how to do a Bayesian analysis of batting averages, including a ROPE. The analysis uses a hierarchical model that simultaneously estimates individual player abilites and the overall average ability of major league players.

The model comes directly from Chapter 9 of Doing Bayesian Data Analysis. (For an example of applying the same model to meta-analysis of ESP data, see this post. That post also includes a hierarchical diagram of the model.) In the model, the i-th player has an underlying probability of getting a hit, denoted theta[i]. The value of theta[i] is estimated from observing the number of hits y[i] and at-bats N[i] for each player. Importantly, the model assumes that the theta[i] come from a higher-level distribution that describes the individual differences among major league players. There is an overall central tendency, mu, for the batting average of major league players, and a "tightness" of clustering around that central tendency, denoted kappa. (For details, please see Ch. 9 of DBDA.)

The hierarchical structure of the model provides shrinkage in the estimates of the individual abilities. Essentially, each player's data inform the estimate of theta[i] for that player, but also inform the group-level parameters mu and kappa, which in turn influence the estimated values of theta[.] for other players. Thus, if many players all have batting averages right around .27, the group-level distribution is estimated to be very "tight," which pulls in (shrinks) outlying individuals toward the group average. This is intuitively reasonable: If you have information that player after player has an average around .27, you should use that information to inform your estimate of the next player.

When making decisions about differences in estimated abilities, what sort of ROPE is meaningful? There is no single correct answer because it depends on practical consequences. But suppose we say that a difference of 10 hits, in a season of 550 at bats, has practical significance. The ratio 10/550 is just under 0.02, so let's establish a ROPE as -0.02 to +0.02. Our decision rule for assessing differences between players is now this: The difference in abilities between players is deemed to be credibly and practically different from zero if the 95% HDI on the estimated difference falls completely outside a ROPE from -0.20 to +0.02. The difference in abilities is deemed to be practically equivalent to zero if the 95% HDI on the estimated difference falls completely inside the ROPE.

Here are some results from the analysis. First, consider the players with the highest and lowest batting average during the 2012 regular season:
The right panel shows that their abilities (in terms of estimated probability of getting a hit at bat) are credibly different, and the posterior distribution reveals in detail the relative credibility of the whole range of candidate differences. The graphs also plot the observed batting average (y[i]/N[i]) as small red +'s on the abscissa. Notice that the estimated theta values show clear shrinkage toward the group average. Thus, although Buster Posey had a batting average of 0.336, the estimate of the underlying probability of getting a hit is shrunken toward the major league average, with a mean estimate of 0.313. Similarly, although Carlos Pena had a batting average of 0.197, the estimate of the underlying probability of getting a hit is shrunken toward the central tendency of the group, with a mean estimate of 0.225. Despite the shrinkage, the difference (right panel) is still credibly non-zero.

Here are the results for the two players in the middle of the pack:
The right panel shows that the estimated difference in their underlying probabilities of getting a hit is nearly zero. 52% of the posterior distribution falls within the ROPE. Thus, we do not have enough precision in the estimate of the differences to declare that their abilities are equal for practical purposes, where "practical" is defined in terms of this choice of ROPE.

Thus, Bayesian analysis provides rich and meaningful inferences about the sort of data that Abelson was interested in. I don't see any "paradox" that needs to be overcome. The Bayesian analysis never even brought up the issue of "proportion of variance accounted for" as Abelson did. Because the Bayesian analysis directly estimates all the parameters of interest, and provides a complete posterior distribution for their credibilities, Abelson's paradoxical statistic never even arose.


Appendix: The complete program. Data are from ESPN, linked in text above.

rm(list = ls())
graphics.off()
fileNameRoot="MajorLeagueBaseballBattingJAGS"
if ( .Platform$OS.type != "windows" ) {
  windows <- function( ... ) X11( ... )
}
                       # In the style of:
require(rjags)         # Kruschke, J. K. (2011). Doing Bayesian Data Analysis:
                       # A Tutorial with R and BUGS. Academic Press / Elsevier.
#------------------------------------------------------------------------------
# THE MODEL.

# Specify the model in JAGS language, but save it as a string in R:
modelString = "
model {
    # Likelihood:
    for ( i in 1:nPlayers ) {
        y[i] ~ dbin( theta[i] , N[i] )
    }
    # Prior:
    for ( i in 1:nPlayers ) {
        theta[i] ~ dbeta( a , b )
    }
    a <- mu * kappa
    b <- ( 1.0 - mu ) * kappa
    mu ~ dbeta( 1,1 )
    kappa ~ dgamma( 1.393 , 0.0393 ) # mode=10, sd=30
}
# ... JAGS model specification ends.
" # close quote to end modelString
# Write the modelString to a file, using R commands:
writeLines(modelString,con="model.txt")

#------------------------------------------------------------------------------
# THE DATA.

dataFrame = read.csv( file="MajorLeagueBaseballBattingStats2012.csv" )

y = dataFrame$H  # hits for each player
N = dataFrame$AB  # at bats for each player
nPlayers = length(y)

dataList = list(
    y = y ,
    N = N ,
    nPlayers = nPlayers
)

#------------------------------------------------------------------------------
# INTIALIZE THE CHAIN.

# Let JAGS do it randomly...

#------------------------------------------------------------------------------
# RUN THE CHAINS.

parameters = c( "mu" , "kappa" , "theta" )   # The parameter(s) to be monitored.
adaptSteps = 1000              # Number of steps to "tune" the samplers.
burnInSteps = 1000            # Number of steps to "burn-in" the samplers.
nChains = 3                   # Number of chains to run.
numSavedSteps=100000           # Total number of steps in chains to save.
thinSteps=1                   # Number of steps to "thin" (1=keep every step).
nIter = ceiling( ( numSavedSteps * thinSteps ) / nChains ) # Steps per chain.
# Create, initialize, and adapt the model:
jagsModel = jags.model( "model.txt" , data=dataList , # inits=initsList ,
                        n.chains=nChains , n.adapt=adaptSteps )
# Burn-in:
cat( "Burning in the MCMC chain...\n" )
update( jagsModel , n.iter=burnInSteps )
# The saved MCMC chain:
cat( "Sampling final MCMC chain...\n" )
codaSamples = coda.samples( jagsModel , variable.names=parameters ,
                            n.iter=nIter , thin=thinSteps )
# resulting codaSamples object has these indices:
#   codaSamples[[ chainIdx ]][ stepIdx , paramIdx ]

#------------------------------------------------------------------------------
# EXAMINE THE RESULTS.

checkConvergence = FALSE
if ( checkConvergence ) {
  autocorr.plot( codaSamples , ask=T )
}

# Convert coda-object codaSamples to matrix object for easier handling.
# But note that this concatenates the different chains into one long chain.
# Result is mcmcChain[ stepIdx , paramIdx ]
mcmcChain = as.matrix( codaSamples )

# Extract the posterior sample from JAGS for easier reference:
mu = mcmcChain[,"mu"]
kappa = mcmcChain[,"kappa"] # BRugs gets sample from JAGS
theta = matrix( 0 , nrow=nPlayers , ncol=nChains*nIter )
for ( i in 1:nPlayers ) {
    nodeName = paste( "theta[" , i , "]" , sep="" )
    theta[i,] = mcmcChain[,nodeName]
}

# Make a graph using R commands:
source("plotPost.R")

windows(width=7,height=2.5)
layout( matrix( 1:2 , nrow=1 , byrow=TRUE ) )
#par(mar=c(2.95,2.95,1.0,0),mgp=c(1.35,0.35,0),oma=c( 0.1, 0.1, 0.1, 0.1) )
plotPost( mu , xlab="mu" , main="Group Mean" )
plotPost( kappa , xlab="kappa" , main="Group Certainty" )
savePlot( file=paste(fileNameRoot,"MuKappa",sep="") , type="jpg" )

plotPlayerDiff = function( idx1 , idx2 , diffRope=c(-0.02,0.02) , savePlotFile=FALSE ) {
  windows(width=7,height=2.5)
  layout( matrix( 1:3 , nrow=1 , byrow=TRUE ) )
  #par(mar=c(2.95,2.95,1.0,0),mgp=c(1.35,0.35,0),oma=c( 0.1, 0.1, 0.1, 0.1) )
  plotPost( theta[idx1,] , xlab=paste("theta",idx1) , main=dataFrame$PLAYER[idx1] )
  points( dataFrame$AVG[idx1] , 0 , pch="+" , col="red" , cex=1.5 )
  plotPost( theta[idx2,] , xlab=paste("theta",idx2) , main=dataFrame$PLAYER[idx2] )
  points( dataFrame$AVG[idx2] , 0 , pch="+" , col="red" , cex=1.5)
  plotPost( theta[idx1,] - theta[idx2,] ,
            xlab=paste("theta",idx1,"-","theta",idx2) , main="Difference" ,
            compVal=0.0 , ROPE=diffRope )
  points( dataFrame$AVG[idx1]-dataFrame$AVG[idx2] , 0 , pch="+" , col="red" , cex=1.5)
  if ( savePlotFile ) {
    savePlot( file=paste(fileNameRoot,"Theta",idx1,"Theta",idx2,sep="") , type="jpg" )
  }
}
plotPlayerDiff(1,nPlayers,savePlotFile=TRUE)
plotPlayerDiff( round(nPlayers/2)-1 , round(nPlayers/2) ,savePlotFile=TRUE)








Tuesday, October 2, 2012

Bayesian estimation of trend with auto-regressive AR(1) deviation

This post shows how to estimate trend coefficients when there is an auto-regressive AR(1) process on the deviation from the trend. The specific example uses a sinusoidal trend to describe daily temperatures across many years, but the programming method in JAGS/BUGS can be easily adapted to other trends.

The example extends a previous post about average daily temperatures modeled as sinusoidal variation around a linear trend. The substantive goal was to estimate the slope of the linear component, to determine whether there is a credible non-zero increase in temperatures over the years. In that post, the discussion mentioned lack of independence across days in the deviation from the trend, and with this post the dependence is described by a simple auto-regressive AR(1) model.  Here is the model specification with the essential conceptual components highlighted in yellow:

model {
    trend[1] <- beta0 + beta1 * x[1] + amp * cos( ( x[1] - thresh ) / wl )
    for( i in 2 : Ndata ) {
      y[i] ~ dt( mu[i] , tau , nu )
      mu[i] <- trend[i] + ar1 * ( y[i-1] - trend[i-1] )
      trend[i] <- beta0 + beta1 * x[i] + amp * cos( ( x[i] - thresh ) / wl )
    }
    ar1 ~ dunif(-1.1,1.1) # or dunif(-0.01,0.01)
    beta0 ~ dnorm( 0 , 1.0E-12 )
    beta1 ~ dnorm( 0 , 1.0E-12 )
    tau ~ dgamma( 0.001 , 0.001 )
    amp ~ dunif(0,50)
    thresh ~ dunif(-183,183)
    nu <- nuMinusOne + 1
    nuMinusOne ~ dexp(1/29)
}


The trend is modeled as a linear component plus a sinusoidal component:
trend[i] <- beta0 + beta1 * x[i] + amp * cos( ( x[i] - thresh ) / wl )
The slope on the linear component is beta1.

The predicted value of y at time i, denoted mu[i], is the trend at time i plus a proportion of the deviation from the trend on the previous time step:
mu[i] <- trend[i] + ar1 * ( y[i-1] - trend[i-1] )
Notice that if ar1 is zero, then the model reduces to simply mu[i] = trend[i]. Here is the posterior when ar1 is restricted to being essentially zero, by setting its prior to ar1 ~ dunif(-0.01,0.01):
The parameter estimates are basically identical to those in the previous post (as they should be!). In particular, the linear trend component is credibly greater than zero.

When ar1 is freely estimated, by setting its prior to ar1 ~ dunif(-1.1,1.1), then the posterior looks like this:
Notice that the AR(1) coefficient is quite large positive, which makes sense for consecutive daily temperatures (if it's hotter than the sinusoid would predict on one day, it'll probably be hotter than the sinusoid would predict on the next day too). Notice that the estimate of the standard deviation of the noise is now smaller than before, which again makes sense because the AR(1) process is accounting for deviation from the trend which used to be accounted for only by the noise. Importantly, notice that estimates of the other trend parameters are now less certain. In particular, the linear trend component, while having the nearly the same mean in the posterior, has a much wider 95% HDI, which now includes zero.


Thursday, September 20, 2012

Workshop on doing Bayesian data analysis at Indiana University

Workshop on doing Bayesian data analysis at Indiana University's Workshop in Methods, Friday Oct. 5, 2:00pm. See details here.

A list of some previous workshops is here.

Tuesday, September 18, 2012

From BUGS with BRugs to JAGS with rjags



As mentioned in several previous posts, I strongly recommend using JAGS instead of BUGS, and I have converted all the BUGS programs to JAGS versions. Here I provide guidelines for how to make the conversion in case you want to convert your own programs.

For a concrete example, I will use the programs BernTwoBugs.R and BernTwoJags.R. I’ll proceed section by section through the programs.

The header:

BUGS + BRugs version:

library(BRugs)

JAGS + rjags version:

require(rjags)

Instead of “require” it could say “library”. Also in the JAGS + rjags version I added a way for the graphics to work on non-Windows machines. This is just R, so it can work with BUGS too:

if ( .Platform$OS.type != "windows" ) {
  windows <- function( ... ) X11( ... )
}


The model specification:

BUGS + BRugs version:

modelstring = "
# BUGS model specification begins here...
model {
    # Likelihood. Each flip is Bernoulli.
    for ( i in 1 : N1 ) { y1[i] ~ dbern( theta1 ) }
    for ( i in 1 : N2 ) { y2[i] ~ dbern( theta2 ) }
    # Prior. Independent beta distributions.
    theta1 ~ dbeta( 3 , 3 )
    theta2 ~ dbeta( 3 , 3 )
}
# ... end BUGS model specification
" # close quote for modelstring
# Write model to a file:
.temp = file("model.txt","w") ; writeLines(modelstring,con=.temp) ; close(.temp)
# Load model file into BRugs and check its syntax:
modelCheck( "model.txt" )

JAGS + rjags version:

modelString = "
# JAGS model specification begins here...
model {
    # Likelihood. Each flip is Bernoulli.
    for ( i in 1 : N1 ) { y1[i] ~ dbern( theta1 ) }
    for ( i in 1 : N2 ) { y2[i] ~ dbern( theta2 ) }
    # Prior. Independent beta distributions.
    theta1 ~ dbeta( 3 , 3 )
    theta2 ~ dbeta( 3 , 3 )
}
# ... end JAGS model specification
" # close quote for modelstring
# Write the modelString to a file, using R commands:
writeLines(modelString,con="model.txt")

Notice that the model specification is the same in JAGS as in BUGS. Also, in both cases the modelString gets written to a file called “model.txt”. The JAGS + rjags version uses a streamlined version of writeLines that would also work in the BUGS program, as it is just an R command. The only difference is in how the specification gets communicated to BUGS or JAGS: BRugs uses the modelCheck command, but there is no analogous command in rjags.

The data:

BUGS + BRugs version:

# Specify the data in a form that is compatible with BRugs model, as a list:
datalist = list(
    N1 = 7 ,
    y1 = c( 1,1,1,1,1,0,0 ) ,
    N2 = 7 ,
    y2 = c( 1,1,0,0,0,0,0 )
)
# Get the data into BRugs:
modelData( bugsData( datalist ) )

JAGS + rjags version:

# Specify the data in a form that is compatible with JAGS model, as a list:
dataList = list(
    N1 = 7 ,
    y1 = c( 1,1,1,1,1,0,0 ) ,
    N2 = 7 ,
    y2 = c( 1,1,0,0,0,0,0 )
)

The specification of the data is the same in JAGS as in BUGS. The only difference is in how the specification gets communicated to BUGS or JAGS.: BRugs uses the modelData command, but there is no analogous command in rjags.

Initialize the chains:

BUGS + BRugs version:

modelCompile()
modelGenInits()

JAGS + rjags version:

# Can be done automatically in jags.model() by commenting out inits argument.
# Otherwise could be established as:
# initsList = list( theta1 = sum(dataList$y1)/length(dataList$y1) ,
#                   theta2 = sum(dataList$y2)/length(dataList$y2) )

The BUGS version has to compile the model (using the BRugs modelCompile command) and then generate initial values (using the BRugs modelGenInits command). The JAGS version does not need any explicit initialization at this point, as the commented code explains.


Run the chains:

BUGS + BRugs version:

samplesSet( c( "theta1" , "theta2" ) ) # Keep a record of sampled "theta" values
chainlength = 10000                     # Arbitrary length of chain to generate.
modelUpdate( chainlength )             # Actually generate the chain.

JAGS + rjags version:

parameters = c( "theta1" , "theta2" )     # The parameter(s) to be monitored.
adaptSteps = 500              # Number of steps to "tune" the samplers.
burnInSteps = 1000            # Number of steps to "burn-in" the samplers.
nChains = 3                   # Number of chains to run.
numSavedSteps=50000           # Total number of steps in chains to save.
thinSteps=1                   # Number of steps to "thin" (1=keep every step).
nIter = ceiling( ( numSavedSteps * thinSteps ) / nChains ) # Steps per chain.
# Create, initialize, and adapt the model:
jagsModel = jags.model( "model.txt" , data=dataList , # inits=initsList ,
                        n.chains=nChains , n.adapt=adaptSteps )
# Burn-in:
cat( "Burning in the MCMC chain...\n" )
update( jagsModel , n.iter=burnInSteps )
# The saved MCMC chain:
cat( "Sampling final MCMC chain...\n" )
codaSamples = coda.samples( jagsModel , variable.names=parameters ,
                            n.iter=nIter , thin=thinSteps )
# resulting codaSamples object has these indices:
#   codaSamples[[ chainIdx ]][ stepIdx , paramIdx ]

Roughly the equivalent of BRugs modelCompile is rjags jags.model. For burning in, the rough equivalent of BRugs modelUpdate before samplesSet  is rjags update. Notice that the BUGS version here did no burning in. For the final chain, the rough equivalent of BRugs modelUpdate is rjags coda.samples. Notice that rjags specifies the parameters to be stored with the variable.names argument in the coda.samples command, whereas BRugs specifies the parameters in its samplesSet command.

Examine the results:

BUGS + BRugs version:

theta1Sample = samplesSample( "theta1" ) # Put sampled values in a vector.
theta2Sample = samplesSample( "theta2" ) # Put sampled values in a vector.

# Plot the trajectory of the last 500 sampled values.

JAGS + rjags version:

# Convert coda-object codaSamples to matrix object for easier handling.
# But note that this concatenates the different chains into one long chain.
# Result is mcmcChain[ stepIdx , paramIdx ]
mcmcChain = as.matrix( codaSamples )

theta1Sample = mcmcChain[,"theta1"] # Put sampled values in a vector.
theta2Sample = mcmcChain[,"theta2"] # Put sampled values in a vector.

# Plot the trajectory of the last 500 sampled values.

Once the chain is put into variables in R, they can be plotted the same way.

One big difference not shown above is how the chains can be examined for autocorrelation and convergence. My homegrown plotChains function uses BRugs commands and will not work with the JAGS output. Instead, JAGS+rjags returns the chain as a coda-package object, named codaSamples above. There are many useful functions in the coda package for displaying the chains, not shown above. For example, summary( codaSamples ), plot( codaSamples ), and autocorr.plot( codaSamples ).

Thursday, September 6, 2012

Posterior predictive check can and should be Bayesian

Perception apparently employs prior knowledge that illumination comes from above, and that the surface itself has constant color.
Here's a new article, in press. I'm a bit reluctant about it because it got zero feedback and is unchanged from the initial submission, but, damn the torpedos ..., fools rush in ..., and all that sort of thing.
Abstract: Bayesian inference is conditional on the space of models assumed by the analyst. The posterior distribution indicates only which of the available parameter values are less bad than the others, without indicating whether the best available parameter values really fit the data well. A posterior predictive check is important to assess whether the posterior predictions of the least bad parameters are discrepant from the actual data in systematic ways. Gelman and Shalizi (2012a) assert that the posterior predictive check, whether done qualitatively or quantitatively, is non-Bayesian. I suggest that the qualitative posterior predictive check might be Bayesian, and the quantitative posterior predictive check should be Bayesian. In particular, I show that the “Bayesian p value,” from which an analyst attempts to reject a model without recourse to an alternative model, is ambiguous and inconclusive. Instead, the posterior predictive check, whether qualitative or quantitative, should be consummated with Bayesian estimation of an expanded model. The conclusion agrees with Gelman and Shalizi (2012a) regarding the importance of the posterior predictive check for breaking out of an initially assumed space of models. Philosophically, the conclusion allows the liberation to be completely Bayesian instead of relying on a non-Bayesian deus ex machina. Practically, the conclusion cautions against use of the Bayesian p value in favor of direct model expansion and Bayesian evaluation.
Kruschke, J. K. (in press). Posterior predictive check can and should be Bayesian: Comment on Gelman and Shalizi (2012a). British Journal of Mathematical and Statistical Psychology.
Get the full manuscript here.