## Friday, October 12, 2012

### Abelson's paradox, baseball, null differences, the ROPE, and hierarchical Bayesian analysis

In a thread elsewhere, a reader of Bayesian estimation supersedes the t test commented: "Doesn't Abelson's paradox preclude the establishment of guidelines for a sufficiently small effect size?"  In this post I briefly review Abelson's paradox and demonstrate how hierarchical Bayesian data analysis is actually applied to real baseball data of the type that Abelson used to illustrate his paradox. Bayesian estimation provides rich estimates of the abilities of players and their differences. There is no paradox.

First, a cursory review of Abelson's paradox. It was described in a brief article: Abelson, R. P. (1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97(1), 129-133. Abelson considered the batting averages of major league baseball players. A batting average is simply the ratio of number of hits to number of at-bats. In the 2012 season, players each had about 550 opportunities at bat, and typically got about 150 hits, for a typical batting average of about 0.27. Across 143 different players, the batting averages ranged from about 0.19 to 0.34. (Data from ESPN.) Abelson was concerned with whether that variation across players was large compared to the variation within players; in other words, he wanted to know the magnitude of the proportion of variance accounted for by player ability. He came up with a mathematical approximation and argued that for typical data the magnitude of the proportion of variance accounted for is about 0.003. The "paradox" is that such a small number conflicts with sports fans' intuition that player ability should account for a lot more of the variance of hits. Abelson discussed the meaning of his calculation and how it differs from the everyday intuition of the sports fan, saying that his statistic refers to the predictability of a single at-bat. But I can't say I fully understand what Abelson was going for, and the Bayesian analysis of real data (shown below) seems clear and unparadoxical to me.

What does any of that have to do with Bayesian data analysis? Presumably the reader who commented about Abelson's paradox was thinking about the approach to null values that I and others have advocated, whereby a null value for a parameter is deemed accepted for practical purposes if its 95% highest density interval (HDI) falls entirely within a region of practical equivalence (ROPE) around the null value. Perhaps the reader was concerned that no ROPE can be established because even tiny values, like Abelson's 0.003, can correspond to intuitively large effects.

Perhaps the best rebuttal is simply to demonstrate how to do a Bayesian analysis of batting averages, including a ROPE. The analysis uses a hierarchical model that simultaneously estimates individual player abilites and the overall average ability of major league players.

The model comes directly from Chapter 9 of Doing Bayesian Data Analysis. (For an example of applying the same model to meta-analysis of ESP data, see this post. That post also includes a hierarchical diagram of the model.) In the model, the i-th player has an underlying probability of getting a hit, denoted theta[i]. The value of theta[i] is estimated from observing the number of hits y[i] and at-bats N[i] for each player. Importantly, the model assumes that the theta[i] come from a higher-level distribution that describes the individual differences among major league players. There is an overall central tendency, mu, for the batting average of major league players, and a "tightness" of clustering around that central tendency, denoted kappa. (For details, please see Ch. 9 of DBDA.)

The hierarchical structure of the model provides shrinkage in the estimates of the individual abilities. Essentially, each player's data inform the estimate of theta[i] for that player, but also inform the group-level parameters mu and kappa, which in turn influence the estimated values of theta[.] for other players. Thus, if many players all have batting averages right around .27, the group-level distribution is estimated to be very "tight," which pulls in (shrinks) outlying individuals toward the group average. This is intuitively reasonable: If you have information that player after player has an average around .27, you should use that information to inform your estimate of the next player.

When making decisions about differences in estimated abilities, what sort of ROPE is meaningful? There is no single correct answer because it depends on practical consequences. But suppose we say that a difference of 10 hits, in a season of 550 at bats, has practical significance. The ratio 10/550 is just under 0.02, so let's establish a ROPE as -0.02 to +0.02. Our decision rule for assessing differences between players is now this: The difference in abilities between players is deemed to be credibly and practically different from zero if the 95% HDI on the estimated difference falls completely outside a ROPE from -0.20 to +0.02. The difference in abilities is deemed to be practically equivalent to zero if the 95% HDI on the estimated difference falls completely inside the ROPE.

Here are some results from the analysis. First, consider the players with the highest and lowest batting average during the 2012 regular season:
The right panel shows that their abilities (in terms of estimated probability of getting a hit at bat) are credibly different, and the posterior distribution reveals in detail the relative credibility of the whole range of candidate differences. The graphs also plot the observed batting average (y[i]/N[i]) as small red +'s on the abscissa. Notice that the estimated theta values show clear shrinkage toward the group average. Thus, although Buster Posey had a batting average of 0.336, the estimate of the underlying probability of getting a hit is shrunken toward the major league average, with a mean estimate of 0.313. Similarly, although Carlos Pena had a batting average of 0.197, the estimate of the underlying probability of getting a hit is shrunken toward the central tendency of the group, with a mean estimate of 0.225. Despite the shrinkage, the difference (right panel) is still credibly non-zero.

Here are the results for the two players in the middle of the pack:
The right panel shows that the estimated difference in their underlying probabilities of getting a hit is nearly zero. 52% of the posterior distribution falls within the ROPE. Thus, we do not have enough precision in the estimate of the differences to declare that their abilities are equal for practical purposes, where "practical" is defined in terms of this choice of ROPE.

Thus, Bayesian analysis provides rich and meaningful inferences about the sort of data that Abelson was interested in. I don't see any "paradox" that needs to be overcome. The Bayesian analysis never even brought up the issue of "proportion of variance accounted for" as Abelson did. Because the Bayesian analysis directly estimates all the parameters of interest, and provides a complete posterior distribution for their credibilities, Abelson's paradoxical statistic never even arose.

Appendix: The complete program. Data are from ESPN, linked in text above.

rm(list = ls())
graphics.off()
fileNameRoot="MajorLeagueBaseballBattingJAGS"
if ( .Platform$OS.type != "windows" ) { windows <- function( ... ) X11( ... ) } # In the style of: require(rjags) # Kruschke, J. K. (2011). Doing Bayesian Data Analysis: # A Tutorial with R and BUGS. Academic Press / Elsevier. #------------------------------------------------------------------------------ # THE MODEL. # Specify the model in JAGS language, but save it as a string in R: modelString = " model { # Likelihood: for ( i in 1:nPlayers ) { y[i] ~ dbin( theta[i] , N[i] ) } # Prior: for ( i in 1:nPlayers ) { theta[i] ~ dbeta( a , b ) } a <- mu * kappa b <- ( 1.0 - mu ) * kappa mu ~ dbeta( 1,1 ) kappa ~ dgamma( 1.393 , 0.0393 ) # mode=10, sd=30 } # ... JAGS model specification ends. " # close quote to end modelString # Write the modelString to a file, using R commands: writeLines(modelString,con="model.txt") #------------------------------------------------------------------------------ # THE DATA. dataFrame = read.csv( file="MajorLeagueBaseballBattingStats2012.csv" ) y = dataFrame$H  # hits for each player
N = dataFrame$AB # at bats for each player nPlayers = length(y) dataList = list( y = y , N = N , nPlayers = nPlayers ) #------------------------------------------------------------------------------ # INTIALIZE THE CHAIN. # Let JAGS do it randomly... #------------------------------------------------------------------------------ # RUN THE CHAINS. parameters = c( "mu" , "kappa" , "theta" ) # The parameter(s) to be monitored. adaptSteps = 1000 # Number of steps to "tune" the samplers. burnInSteps = 1000 # Number of steps to "burn-in" the samplers. nChains = 3 # Number of chains to run. numSavedSteps=100000 # Total number of steps in chains to save. thinSteps=1 # Number of steps to "thin" (1=keep every step). nIter = ceiling( ( numSavedSteps * thinSteps ) / nChains ) # Steps per chain. # Create, initialize, and adapt the model: jagsModel = jags.model( "model.txt" , data=dataList , # inits=initsList , n.chains=nChains , n.adapt=adaptSteps ) # Burn-in: cat( "Burning in the MCMC chain...\n" ) update( jagsModel , n.iter=burnInSteps ) # The saved MCMC chain: cat( "Sampling final MCMC chain...\n" ) codaSamples = coda.samples( jagsModel , variable.names=parameters , n.iter=nIter , thin=thinSteps ) # resulting codaSamples object has these indices: # codaSamples[[ chainIdx ]][ stepIdx , paramIdx ] #------------------------------------------------------------------------------ # EXAMINE THE RESULTS. checkConvergence = FALSE if ( checkConvergence ) { autocorr.plot( codaSamples , ask=T ) } # Convert coda-object codaSamples to matrix object for easier handling. # But note that this concatenates the different chains into one long chain. # Result is mcmcChain[ stepIdx , paramIdx ] mcmcChain = as.matrix( codaSamples ) # Extract the posterior sample from JAGS for easier reference: mu = mcmcChain[,"mu"] kappa = mcmcChain[,"kappa"] # BRugs gets sample from JAGS theta = matrix( 0 , nrow=nPlayers , ncol=nChains*nIter ) for ( i in 1:nPlayers ) { nodeName = paste( "theta[" , i , "]" , sep="" ) theta[i,] = mcmcChain[,nodeName] } # Make a graph using R commands: source("plotPost.R") windows(width=7,height=2.5) layout( matrix( 1:2 , nrow=1 , byrow=TRUE ) ) #par(mar=c(2.95,2.95,1.0,0),mgp=c(1.35,0.35,0),oma=c( 0.1, 0.1, 0.1, 0.1) ) plotPost( mu , xlab="mu" , main="Group Mean" ) plotPost( kappa , xlab="kappa" , main="Group Certainty" ) savePlot( file=paste(fileNameRoot,"MuKappa",sep="") , type="jpg" ) plotPlayerDiff = function( idx1 , idx2 , diffRope=c(-0.02,0.02) , savePlotFile=FALSE ) { windows(width=7,height=2.5) layout( matrix( 1:3 , nrow=1 , byrow=TRUE ) ) #par(mar=c(2.95,2.95,1.0,0),mgp=c(1.35,0.35,0),oma=c( 0.1, 0.1, 0.1, 0.1) ) plotPost( theta[idx1,] , xlab=paste("theta",idx1) , main=dataFrame$PLAYER[idx1] )
points( dataFrame$AVG[idx1] , 0 , pch="+" , col="red" , cex=1.5 ) plotPost( theta[idx2,] , xlab=paste("theta",idx2) , main=dataFrame$PLAYER[idx2] )
points( dataFrame$AVG[idx2] , 0 , pch="+" , col="red" , cex=1.5) plotPost( theta[idx1,] - theta[idx2,] , xlab=paste("theta",idx1,"-","theta",idx2) , main="Difference" , compVal=0.0 , ROPE=diffRope ) points( dataFrame$AVG[idx1]-dataFrame\$AVG[idx2] , 0 , pch="+" , col="red" , cex=1.5)
if ( savePlotFile ) {
savePlot( file=paste(fileNameRoot,"Theta",idx1,"Theta",idx2,sep="") , type="jpg" )
}
}
plotPlayerDiff(1,nPlayers,savePlotFile=TRUE)
plotPlayerDiff( round(nPlayers/2)-1 , round(nPlayers/2) ,savePlotFile=TRUE)

1. I love this example. I like how your book and Peter Hoff's book both start here -- Gelman doesn't get there until chapter 5.

It might be helpful to readers to plot the posterior for the hierarchical priors (a,b) as they indicate the model fit for the population of batters.

I did a similar analysis using (R2Win)BUGS, comparing MLE vs. empirical Bayes vs. full Bayes, ending with the following post discussing hierarchical multiple comparisons for the batting crown (stressing the automatic adjustment for differing number of at bats):

http://lingpipe-blog.com/2009/11/04/hierarchicalbayesian-batting-ability-with-multiple-comparisons/

Now I need to re-do it all with Stan :-)

2. lingpipe: Thanks for your comment and for the link to your blog post. Very interesting! In the future, I'll have to incorporate your point about determining the best (or worst) batter. Thanks again!

3. "Abelson's paradoxical statistic never even arose." Yes, but how did the Bayesian hierarchical model even address the original research question, i.e. "whether [the] variation across players was large compared to the variation within players"?

4. Reply to Alex: I think it comes down to definitions. We need to define parameters that describe variation across players and variation within players. In the present context, I'm not sure how to define variation within players, because the model assumes a constant underlying ability. I think the real source of the "paradox" is not having a clear descriptive model of player variability. Once such a model is defined, with meaningful parameters, then Bayesian estimation shows us the credible values of those parameters.

5. John, wouldn't variation within a player be the standard deviation of the posterior distribution of the (Bayesian) estimated parameter (here batting average)? And variation across players would be the standard deviation of the set of point-estimates of the players (the peak of each of their posterior's)? I don't think you need another model, but can compute these from what you've done...

6. For a particular player, the SD of the posterior distribution for that player's ability is our uncertainty in the estimate of the assumed constant player ability. There is no within-player variability in this model.

I could imagine a different model that assumes that player ability changes week to week, or game to game, or whatever. Then there would be a parameter that describes the degree of change from time to time. (And there would be uncertainty in our estimate of that parameter.)

7. John, thank you so much for that reply. That was a crucial distinction and cleared it up nicely. Thanks again! -- Alex

8. I'm new to this, so please excuse the stupid question, but what is the conceptual difference between variance and uncertainty? (The variance just is, because we are not certain about the exact value, because it could be measured precisely?)

9. Is there a way to solve hierarchical models analytically (i.e., with conjugate priors)? For example, with a simple updating rule? In trying to think through how to do it, I'm not sure if see how you would calculate the dispersion of the batting averages around the group mean.

10. Will leave it to people who like to algebraically manipulate lots of multiplied beta and gamma distributions.

11. Hello, I am taking my first course in bayesian statistics. I am quite new in R, but i have some questions about the code, in the model part of your jags.

mu ~ dbeta( 1,1 )
kappa ~ dgamma( 1.393 , 0.0393 ) # mode=10, sd=30

What is this mu and kappa, and where did you get these numbers from?

Do I understand it correct if i say that mu is the group mean for all players and kappa is the precision for the group?

What does these ones stand for, how did you come up with these? Same question for the 1.393 , 0.0393 in the kappa-gamma distribution.

When i try to use dbeta(1,1) alone i get an error:
Error in dbeta(1, 1) : argument "shape2" is missing, with no default

Thanks for a super post! :)

12. Dear Anonymous:

Ultimately you'll have to read more in DBDA2E. Chapter 9 gives extensive explanation and extends the baseball example. For a diagram of the model used in this old post, see another example that uses the very same model --as mentioned in the post-- here: http://doingbayesiandataanalysis.blogspot.com/2011/08/extrasensory-perception-esp-bayesian.html