In a previous post and in a video, I used baseball data to illustrate some ideas about shrinkage and multiple comparisons in hierarchical models. Here I take it a step further, to

*hierarchical models, to illustrate*

**multi-level***shrinkage of estimates within different levels of the model*.

Background: Consider the sport of baseball, in which players have opportunities at bat during a season of games, and occasionally hit the ball. In Major League Baseball, players typically hit the ball on about 24% of their opportunities at bat. That ratio, of hits (H) divided by opportunities at bat (AB), is called the "batting average" of each player. Players also must play a position in the field when the other team is at bat. Some fielding positions are specialized and those players are expected to focus on those skills, not necessarily on hitting. In particular, pitchers are not expected to be strong hitters, and catchers might also not be expected to focus so much on hitting. (But I'm no baseball aficionado, so please forgive any inadvertent insults or errors! My thanks to colleagues Ed Hirt and Jim Sherman for tutoring me in some of the basics of baseball rules.)

Our goal is to estimate the underlying probability of getting a hit by each player, based on their hits H, at bats AB, and primary position. The model will put each player's estimated hitting ability (i.e., probability of getting a hit) under a distribution for the primary position, and put those position distributions under an overarching distribution across all positions. Here is a diagram of the hierarchical model. It should be scanned from the bottom up, as described next.

The data are indicated at the bottom of the diagram, with H

_{i}being the number of hits for player i, and AB

_{i}being the number of at-bats for player i. The parameter θ

_{i}in the binomial distribution is the underlying probability of getting a hit for player i. The θ

_{i}are assumed to come from a beta distribution, but there is a distinct beta distribution for each type of field position: pitcher, catcher, 1st base, 2nd base, 3rd base, shortstop, left field, center field, and right field. The subscript

*pp(i)*indicates the primary position of player i. Thus, each of the nine position-specific beta distributions has a mean μ

_{pp}and a consistency (like precision) κ

_{pp}. (Now I notice a small error in the diagram: The ellipses labeled with

*pp(i)*should instead be labeled with only

*pp*.) Moving up the diagram, the nine position-specific means μ

_{pp}come from a beta distribution, with a a mean μ

_{μpp}that expresses the central tendency across all positions.At the top level, the priors are vague and noncommittal.

The data are from 948 players in the 2012 regular season of Major League Baseball. Thanks to Matthew Maenner for posting a comment explaining how to download the data. The model simultaneously estimates 948 + 9*2 + 4 = 970 parameters.

**Results:**

Let's first consider the estimates of hitting ability for different positions. At right are the marginal posterior distributions for the μ

_{pp}parameters for the positions of catcher and pitcher. The blue distributions show the credible values of the parameters, and the red crosses on the x-axis indicate the ratio in the data of total hits divided by total at bats for all players in that position. Notice that the modes of the posterior are not centered on the red crosses. Instead, the modal estimates are shrunken toward the middle between the pitchers (who tend to have low batting averages) and the catchers (who tend to have higher batting averages). The lowest panel shows the distribution of differences between catchers and pitchers. The distribution falls far above zero, so we can say with high credibility that catchers hit better than pitchers. (Click on the picture to see it enlarged.)

At right is a comparison of two individual players with

**same record, 1 hit in 3 at bats,**but playing

**different positions, namely catcher and pitcher**. Notice that the red crosses are at the same place on the x-axes for the two players, but notice the radically different estimates of their underlying probability of getting a hit, because of the different positions they play. The data from all the other catchers tells the model that catchers tend to have theta values around 0.24. Because this particular catcher has so few data to inform his estimate, the estimate from the higher-level distribution dominates. The same is true for the pitcher, but the higher-level distribution says that pitchers tend to have theta values around 0.13. The resulting distribution of differences, in the lowest panel, suggests that these two players may have credibly different hitting abilities, even though their hits and at-bats are identical! In other words, because we know the players play these particular different positions, we can infer that they probably have different hitting abilities. (Click on the picture to see it enlarged.)

Here is a comparison of two indivdiual players, both of whom are pitchers, with

**seemingly quite different batting averages of 18/61 and 4/61**, as marked by the red crosses on the x-axis. But the posterior estimates of their hitting probabilities are not so different. Notice the dramatic shrinkage of the estimates toward the average of players who are pitchers. In the lower panel, we see that a difference of zero is credible. (Click on the picture to see it enlarged.)

Here is a comparison of two indivdiual players with different records but both playing Center Field. The larger amount of individual data (593 and 120 at bats) allows somewhat less shrinkage than the previous example, so that the estimated difference excludes zero (but still overlaps the particular ROPE used here). Notice also the difference in widths of HDIs for the two players: There is a narrower HDI for the player with more data. (Click on the picture to see it enlarged.)

Here are two players from Right Field with huge numbers of at-bats and nearly the same (impressive!) batting average. The HDI of the difference falls almost entirely within the ROPE, so we accept that the players have identical probability of getting a hit (for practical purposes). (Click on the picture to see it enlarged.)

**Summary:**With multi-level hierarchical models, there is shrinkage within each level. In this model, there was shrinkage of the position parameters toward each other, as illustrated by the pitcher and catcher distributions. And there was shrinkage of the player parameters within each position, as shown by various examples above. The model also provided some strong inferences about player abilities based on position alone, as illustrated by the estimates for individual players with few at bats.

Dear John:

ReplyDeleteThis is fantastic! Thank you for continuing with the baseball example.

Matt

This comment has been removed by the author.

ReplyDeleteI'm really enjoying your book, Dr. Kruschke.

ReplyDeleteDo you have any information on parameter estimation for a non-nested hierarchical model?

Continuing with the baseball theme, let's say that in addition to categorizing players by Primary Position, I also categorize them by whether they came through to the Major League from the Minor Leagues or from College Baseball. To keep it simple we could imagine this is just a dichotomous variable, Minor Leagues or College, and avoid estimating separate values for each league/college.

Would I need to set either Minor League or College Baseball as my baseline and just estimate a difference score, perhaps added to the omega and kappa parameters at some level? Or is there some way to estimate each of the two groups from a distribution as you've done above with Primary Position?

Sounds like what you have in mind is crossed nominal factors for prediction. You might consider an "ANOVA"-like design; see Ch 20.

DeleteGreat, thank you for the pointer! I'll check it out.

DeleteThank you for wonderful work Dr.Kruschke.

ReplyDeleteI have a question for this.

How can I make a R code for multi-level hierarchical models?

I revised your "Jags-Ybinom-XnomSsubjCcat-MbinomBetaOmegaKappa.R" like this:

model {

for ( sIdx in 1:Nsubj ) {

z[sIdx] ~ dbin( theta[sIdx] , N[sIdx] )

theta[sIdx] ~ dbeta( omega[c[sIdx]]*(kappa[c[sIdx]]-2)+1 ,

(1-omega[c[sIdx]])*(kappa[c[sIdx]]-2)+1 )

}

for ( cIdx in 1:Ncat ) {

omega[cIdx] ~ dbeta( omegaO*(kappaO-2)+1 ,

(1-omegaO)*(kappaO-2)+1 )

kappa[cIdx] <- kappaMinusTwo[cIdx] + 2

kappaMinusTwo[cIdx] ~ dgamma( SO , RO ) # mean=1 , sd=10 (generic vague)

}

omegaO ~ dbeta( 1.0 , 1.0 )

kappaO <- kappaMinusTwoO + 2

kappaMinusTwoO ~ dgamma( 0.01 , 0.01 ) # mean=1 , sd=10 (generic vague)

SO ~ dgamma( 0.01 , 0.01 ) # mean=1 , sd=10 (generic vague)

RO ~ dgamma( 0.01 , 0.01 ) # mean=1 , sd=10 (generic vague)

}

and I wonder it is right or not.

This model is fully implemented and described in Chapter 9 of DBDA2E.

Delete