Saturday, December 28, 2013

Icons for the essence of Bayesian and frequentist data analysis

Goal: Simple graphical icons that capture the essence of Bayesian data analysis and frequentist data analysis. Why? Visual icons serve as mnemonic cognitive packaging that facilitate initial understanding and subsequent remembering. In this post, I propose simple icons that attempt to portray the essence of Bayesian data analysis and frequentist data analysis.

[Feb. 14, 2014: See follow-up post, with new icons, HERE.]

(The prod that got me thinking about this was a light-hearted blog post by Rasmus Baath regarding a "mascot" for Bayesian data analysis. My comment on that post is the beginning of this post. Please note that the icons presented here are not intended as advocacy or cheer leading or as mascots. Instead, I would like the icons to capture succinctly key ideas.)

What is "the essence" of Bayesian data analysis? And what is "the essence" of frequentist data analysis? Any answer will surely provoke disagreement, but that is not my goal. The questions are earnest and often asked by beginners and by experienced practitioners alike. As an educator, I think that the questions deserve earnest answers, with the explicit caveat that they will be incomplete and subject to discussion and improvement. So, here goes.

The essence of Bayesian data analysis is inferring the uncertainty (i.e., relative credibility) of parameters in a model space, given the data. Therefore, an icon should represent the data, the form of the model space, and the credible parameters.

The simplest example I can think of is linear regression: Many people are familiar with x-y scatter plots as diagrams of data, and many people are familiar with lines as a model form. The credible parameters can then be suggested by a smattering of lines sampled from the posterior distribution, like this (Fig. 1):
Figure 1. An icon for Bayesian data analysis.
The icon in Figure 1 represents the data as black dots. The icon represents the model form by the obvious linearity of every trend line. The icon represents uncertainty, or relative credibility, by the obvious range of slopes and intercepts, with greater density in the middle of the quiver of lines. I like Figure 1 as a succinct representation of Bayesian analysis because the figure makes it visually obvious that there is a particular model space being considered, and that there is a range of credible possibilities in that space, given the data.

Are there infelicities in Figure 1? Of course. For example, the shape and scale of the noise distribution are not represented. But many ancillary details must be suppressed in any icon.

Perhaps a more important infelicity in Figure 1 is that the prior distribution is not represented, other than the form of the model space. That is, the lines indicate that the prior distribution puts zero probability on quadratic or sinusoidal or other curved trends, but the lines do not indicate the form of the prior distribution on the allowed parameters.

Some people may feel that this lack of representing the prior is a failure to capture an essential part of Bayesian data analysis. Perhaps, therefore, a better icon would include a representation of the prior -- maybe as a smattering of grey lines set behind the data and the blue posterior lines. For the vague prior used in this example, the prior would be a background of randomly criss-crossing grey lines, which might be confusing (and ugly).

Now, an analogous icon for frequentist data analysis.

The essence of frequentist data analysis is inferring the extremity of a data property in the space of possibilities sampled a specified way from a given hypothesis. (That is, inferring a p value.) Note that the data property is often defined with respect to a model family, such as the best fitting slope and intercept in linear regression. Therefore, an icon should represent the data, the data property, and space of possibilities sampled from the hypothesis (with the extremity of the data property revealed by its visual relation to the space of possibilities).

Keeping with the linear regression scenario, an icon for frequentist analysis might look like this (Fig. 2):
Figure 2. An icon for frequentist data analysis.
As before, the data are represented by black dots. The data property is represented by the single blue line, which shows the least-squares fit to the data. The space of possibilities is represented by the smattering of red lines, which were created as least-squares fits to randomly resampled x and y values (with replacement, using fixed sample size equal to the data sample size). In other words, the hypothesis here is a "null" hypothesis that there is no systematic covariation between the x and y values. I like Figure 2 as a succinct representation of frequentist data analysis, especially when juxtaposed with Figure 1, because Figure 2 shows that there is a point estimate to describe the data (i.e., the single blue line) and a sample of hypothetical descriptions unrelated to the data.

Are there infelicities in Figure 2? Of course. Perhaps the most obvious is that there is no representation of a confidence interval. In my opinion, to properly represent a confidence interval in the format of Figure 2, there would need to be two additional figures, one for each limit of the confidence interval. One additional figure would show a quiver of red lines generated from the lower limit of the confidence interval, which would show the single blue line among the 2.5% steepest red lines. The second additional figure would show a quiver of red lines generated from the upper limit of confidence interval, which would show the single blue line among the 2.5% shallowest red lines. The key is that there would be a single unchanged blue line in all three figures; what changes across figures is the quiver of red lines sampled from the changing hypothesis.

Well, there you have it. I should have been spending this Saturday morning on a thousand other pressing obligations (my apologies to colleagues who know what I'm referring to!). Hopefully it will have taken you less time to have read this far than it took me to have written this far.

Appended 12:30pm, 28 Dec 2013: My wife suggested that the red lines of the frequentist sampling distribution ought to be more distinct from the data, and from the best fitting line. So here are modified versions that might be better:

Description of data is the single blue line. Red lines show the sampling distribution from the null hypothesis.

Blue lines show the distribution of credible descriptions from the posterior.

Monday, November 4, 2013

Optional stopping in data collection: p values, Bayes factors, credible intervals, precision

This post argues that data collection should stop when a desired degree of precision is achieved (as measured by a Bayesian credible interval), not when a critical p value is achieved, not when a critical Bayes factor is achieved, and not even when a Bayesian highest density interval (HDI) excludes a region of practical equivalence (ROPE) to the null value.

Update: For expanded follow-up talk, from March 14, 2014 at U.C. Irvine, see:

It is a well-known fact of null-hypothesis significance testing (NHST) that when there is "optional stopping" of data collection with testing at every new datum (a procedure also called "sequential testing" or "data peeking"), then the null hypothesis will eventually be rejected even when it is true. With enough random sampling from the null hypothesis, eventually there will be some accidental coincidence of outlying values so that p < .05 (conditionalizing on the current sample size). Anscombe (1954) called this phenomenon, "sampling to reach a foregone conclusion."

Bayesian methods do not suffer from this problem, at least not to the same extent. Using either Bayesian HDI with ROPE, or a Bayes factor, the false alarm rate asymptotes at a level far less than 100% (e.g., 20-25%). In other words, using Bayesian methods, the null hypothesis is accepted when it is true, even with sequential testing of every datum, perhaps 75-80% of the time.

But not all is well with the Bayesian methods.Within the two Bayesian methods, the Bayes-factor method is far too eager to accept the null when it is not true. And both Bayesian methods, like the p-value method, give biased estimates of the parameter value when the null is not true, because they stop when extreme values are (randomly) obtained.

The proposed solution to the problem of biased estimates is to stop when a desired degree of precision is achieved, regardless of what it implies about the null hypothesis. This is standard procedure in political polling, in which sampling is designed to achieve a desired confidence bound on the estimate (e.g., "plus or minus 3 percentage points"), not to argue for one extreme or the other. I previously mentioned this proposal in a video (at 6:45 minutes) and alluded to it in an article (pp. 579-580) and in the book (Doing Bayesian Data Analysis; goals for power analysis, e.g., pp. 320-321), but surely I am not the first to suggest this; please let me know of precursors.

What follows is a series of examples of sequential testing of coin flips, using four different stopping criteria. The underlying bias of the coin is denoted θ (theta). Here are the four stopping rules:
  • NHST: For every new flip of the coin, stop and reject the null hypothesis, that θ=0.50, if p < .05 (two-tailed, conditionalizing on the current N), otherwise flip again.
  • Bayes factor (BF): For every flip of the coin, conduct a Bayesian model comparison of the null hypothesis that θ=0.50 against the alternative hypothesis that there is a uniform prior on θ. If BF > 3, accept null and stop. If BF < 1/3 reject null and stop. Otherwise flip again.
  • Bayesian HDI with ROPE: For every flip of the coin, compute the 95% HDI on θ. If the HDI is completely contained in a ROPE from 0.45 to 0.55, stop and accept the null. If the HDI falls completely outside the ROPE stop and reject the null. Otherwise flip again.
  • Precision: For every flip of the coin, compute the 95% HDI on θ. If its width is less than 0.08 (.8*width of ROPE) then stop, otherwise flip again. Once stopped, check whether null can be  accepted or rejected according to HDI with ROPE criteria.
The diagrams below show the results for when the null hypothesis is true, and for three cases in which the null hypothesis is not true. The panels show the results for Monte Carlo simulation of flipping (or spinning) a coin that has bias θ. In each case, there were 1,000 simulated experiments, and for each experiment the sample size was allowed to grow up to 1,500 flips. The upper panels in each figure show the proportion of the 1,000 experiments that decided to accept the null, reject the null, or remain undecided, as a function of the sample size. The light blue curve shows the proportion of decisions to accept the null, the dark blue curve shows the proportion of decisions to reject the null, and the grey curve shows the remaining undecided proportion.
The lower panels in each figure show a histogram of the 1,000 sample proportions of heads when the sequence was stopped. The solid triangle marks the true underlying theta value, and the outline triangle marks the mean of the 1,000 sample proportions when the sequence was stopped. If the sample proportion at stopping is unbiased, then the outline triangle would be superimposed on the solid triangle.

Figure 1

Figure 1, above, shows the case of θ=0.50, that is, when the null hypothesis is true. The left column shows the behavior of the p-value stopping rule. As the sample size increases (notice that N is plotted on a logarithmic scale), the proportion of sequences that have rejected the null continues to rise -- in fact linearly with the logarithm of N. The lower left panel shows the distribution of sample proportions when the sequence stopped, that is, when the sequence had successfully rejected the null.

The second column of Figure 1 shows the results from using the Bayes-factor (BF) stopping rule. When the null hypothesis is true, it shows similar behavior to the HDI-with-ROPE stopping rule, but with smaller sample sizes. In fact, it can make decisions with very small sample sizes that yield very uncertain estimates of theta.

The third column of Figure 1 shows the results from using the HDI-with-ROPE stopping rule. You can see that there is some false rejection of the null for small sample sizes, but the false alarms soon asymptote. When the sample size gets big enough so the HDI is narrow enough to fit in the ROPE, the remaining sequences all eventually accept the null.

The fourth column of Figure 1 shows the results of using the precision stopping rule. Because the desired precision is a width of 0.08, a fairly large sample is needed. Once the desired HDI width is attained, it is compared with the ROPE to make a decision. The curves extend (misleadingly) farther right over larger N even though data collection has stopped.

Figure 2

Figure 2 shows the results when the true bias is θ=0.60 (not 0.50). The left column shows that decision by p-value always rejects the null, but sooner that when θ=0.50. The second column shows that the BF still accepts the null hypothesis more than 50% of the time! The third column shows that the HDI-with-ROPE method always gets the right answer (in terms of rejecting the null) but can take a lot of data for the HDI to exclude the ROPE. The real news is in the lower row of Figure 2, which shows that the sample proportion, at the point of decision, over estimates the true value of θ, for all methods except stopping at desired precision. (Actually, there is a slight bias even in the latter case because of ceiling and floor effects for this type of parameter; but it's very slight.)

Figure 3

The same remarks apply when θ=0.65 as in Figure 3. Notice that the BF still accepts the null almost 40% of the time! Only the stop-at-critical-precision method does not overestimate θ.

Figure 4
The same remarks apply when θ=0.70 as in Figure 4. Notice that the BF still accepts the null about 20% of the time! Only the stop-at-critical-precision method does not overestimate θ.


"Gosh, I see that the stop-at-critical-precision method does not bias the estimate of the parameter, which is nice, but it can take a ton of data to achieve the desired precision. Yuck!" Oh well, that is an inconvenient truth about noisy data -- it can take a lot of data to cancel out the noise. If you want to collect less data, then reduce the noise in your measurement process.

"Can't the method be gamed? Suppose the data collector actually collects until the HDI excludes the ROPE, notes the precision at that point, and then claims that the data were collected until the precision reached that level. (Or, the data collector rounds down a little to a plausibly pre-planned HDI width and collects a few more data values until reaching that width, repeating until the HDI still excludes the ROPE.)" Yup, that's true. But the slipperiness of the sampling intention is the fundamental complaint against all frequentist p values, which change when the sampling or testing intentions change. The proportions being plotted by the curves in these figures are p values -- just p values for different stopping intentions.

"Can't I measure precision with a frequentist confidence interval instead of a Bayesian posterior HDI?" No, because a frequentist confidence interval depends on the stopping and testing intention. Change, say, the testing intention --e.g., there's a second coin you're testing-- and the confidence interval changes. But not the Bayesian HDI.

The key point: If the sampling procedure, such as the stopping rule, biases the data in the sample, then the estimation can be biased, whether it's Bayesian estimation or not. A stopping rule based on getting extreme values will automatically bias the sample toward extreme estimates, because once some extreme values show up by chance, sampling stops. A stopping rule based on precision will not bias the sample unless the measure of precision depends on the value of the parameter (which actually is the case here, just not very noticeably for parameter values that aren't very extreme).

Thursday, October 31, 2013

Diagrams for hierarchical models: New drawing tools

Two new drawing tools for making hierarchical diagrams have been recently developed. One tool is a set of distribution and connector templates in LibreOffice Draw and R, created by Rasmus Bååth. Another tool is scripts for making the drawings in LaTeX via TikZ, created by Tinu Schneider. Here is an example of a diagram made by Tinu Schneider, using TikZ/LaTeX with Rasmus Bååth's distribution icons:
The drawing language TikZ is very powerful and can create very elaborate illustrations. Here is a variation in which the distribution nodes have boxes around them:

The distribution icons above are images created in R with a program by Rasmus Bååth. The program generates a catalog of icons with the same visual formatting. The icons can then be used in TikZ/LaTeX as shown in Tinu Schneider's drawings above ...

... or they can be used with freehand drawing in LibreOffice Draw. Rasmus has created a screencast that shows how do use the icons in LibreOffice Draw:

If you use LaTeX and you want precise positioning of nodes, ability to show wavy arrows and double-line arrows, and math fonts in the diagram that exactly match the math fonts in the text, then the TikZ/LaTeX tools are for you. If you want WYSIWYG editing, then LibreOffice is for you.

LINKS to the details:

Rasmus Bååth's blog post about his system is here. In particular, the LibreOffice template is here, and the R code for generating the distributions is at his Github folder. You can download LibreOffice here. (You might recall that Rasmus also created the in-browser app for "Bayesian estimation supersedes the t test" (BEST) available here.)

Tinu Schneider's description of his TikZ/LaTeX examples are in this README document at his Github folder. Info about the TikZ package is here, and it can be downloaded here. (You might recall that Tinu also created the dynamic visualization of the Metropolis algorithm with rejected proposals at this blog post.)

Big THANKS to Rasmus and Tinu for creating these terrific tools!

Wednesday, October 9, 2013

Diagrams for hierarchical models - we need your opinion

When trying to understand a hierarchical model, I find it helpful to make a diagram of the dependencies between variables. But I have found the traditional directed acyclic graphs (DAGs) to be incomplete at best and downright confusing at worst. Therefore I created differently styled diagrams for Doing Bayesian Data Analysis (DBDA). I have found them to be very useful for explaining models, inventing models, and programming models. But my idiosyncratic impression might be only that, and I would like your insights about the pros and cons of the two styles of diagrams.

To make the contrast concrete, let's consider the classic "rats" example from BUGS. I will start with a textual explanation of the model (using modified phrasing and variable names), and then provide two diagrams of the model, one in DAG style and one in DBDA style. Here is the text:
The data come from 30 young rats whose weights were measured weekly starting at birth for five weeks, with the goal being to assess how their weights changed as a function of age. The variable yj|i denotes the weight of the ith rat measured at days since birth xj|i. The weights are assumed to be distributed normally around the predicted value ωj|i (omega):
yj|i ~ normal( ωj|i , λ )
The parameter λ (lambda) represents the precision (i.e., 1/variance) of the normal distribution. The predicted weight of the ith rat is modeled as a linear function of days since birth:
ωj|i = φi + ξi xj|i
The individual intercepts φi (phi) and slopes ξi (xi) are assumed to come from group-level normal distributions,
φi ~ normal( κ , δ )    and    ξi ~ normal( ζ , γ )
where κ (kappa) and ζ (zeta) are the means of the group-level distributions. The priors on the group-level means are set as vague normal distributions,
κ ~ normal( M , H )    and    ζ ~ normal( M , H )
where the mean M is approximately central on the scale of the data and the precision H is very small. The precision parameters, λ (lambda), δ (delta), and γ (gamma), are given vague gamma priors,
λ ~ gamma( K , I )    and    δ ~ gamma( K , I )    and    γ ~ gamma( K , I )
where the shape K and rate I parameters are set to very small values.
Below are two different diagrams of the model. The first is a DAG. Because there are a variety of style conventions in the literature for DAGs, I used my own hybrid explained in the caption. But I think it is a good and useful hybrid, one that I would want to use if I used DAGs. Take a look at the DAG:
Squares denote constants, circles denote variables. Solid arrows denote stochastic dependency, heavy dotted arrows denote deterministic dependency. Rounded-corner rectangles denote "plates" for indices.
For DAG users, does the style above capture what you think is needed in a DAG? Does the DAG above help you understand the model? How? Does the DAG confuse you? Why? Would the DAG help you program the model in BUGS / JAGS / Stan? Or not?

Below is a DBDA-style diagram for the model:
Arrows marked by "~" denote stochastic dependency, arrows marked by "=" denote deterministic dependency. Ellipses on arrows denote indices over which the dependency applies.
Does the DBDA-style diagarm above help you understand the model? How? Does the DBDA-style confuse you? Why? Would the DBDA-style help you program the model in BUGS / JAGS / Stan? Or not?

Please provide your answers as comments on this post. Thank you in advance!

Monday, September 16, 2013

Visualizing a normal distribution as outcomes from a spinner

Mike Meredith has a new blog post showing an animation that converts a spinner to a normal distribution. It shows hows segments of the spinner map to regions under the normal distribution. Give it a look!

(Mike Meredith is also the fellow who created the BEST package for R. See this previous post about that!)

Sunday, September 15, 2013

JAGS users: runjags has been updated

Image credit:
The package runjags by Matt Denwood has been recently updated. The runjags package has many nice facilities for running JAGS, especially a seamless interface for running parallel chains on multi-core computers and thereby greatly reducing the time needed to achieve a desired MCMC sample size from JAGS. Give it a look!

Thursday, September 12, 2013

Visualization of Metropolis with rejected proposals

In response to a previous post, Tinu Schneider has embellished the Metropolis visualization of Maxwell Joseph so it includes explicit plotting of rejected proposals (shown as grey segments). Here are the animated gif and R code. Many thanks to Tinu for sending this! (The gif, below, may take a few moments to load.)

# Animating the Metropolis Algorithm
# -------------------------------------------
# source:
# Prepare
require(sm) # needs 'rpanel'
# Setup
# --------------------------------------
# population level parameters
mu    <- 6
sigma <- 3
# data-values
n <- 50 
# metropolis
iter   <- 10000
chains <-     3
sigmaJump <- 0.2
# plotting
seq1 <- seq(  1, 200,  by=  5)
seq2 <- seq(200, 450,  by= 10)
seq3 <- seq(450, 750,  by= 50)
seq4 <- seq(750, iter, by=150)
xlims <- c(4, 8)
ylims <- c(1, 5)
zlims <- c(0, 0.7)
widthPNG  <- 1200
heightPNG <-  750
# Helpers
# --------------------------------------
# collect some data (e.g. a sample of heights)
gendata <- function(n, mu, sigma) {
    rnorm(n, mu, sigma)
# log-likelihood function
ll <- function(x, muhat, sigmahat){
    sum(dnorm(x, muhat, sigmahat, log=TRUE))
# prior densities
pmu <- function(mu){
    dnorm(mu, 0, 100, log=TRUE)
psigma <- function(sigma){
    dunif(sigma, 0, 10, log=TRUE)
# posterior density function (log scale)
post <- function(x, mu, sigma){
    ll(x, mu, sigma) + pmu(mu) + psigma(sigma)
geninits <- function(){
    list(mu    = runif(1, 4, 10),
         sigma = runif(1, 2,  6))
jump <- function(x, dist = 0.1){ # must be symmetric
    x + rnorm(1, 0, dist)
# compute
# ----------------------------------------
# init 
proposed  <- array(dim = c(chains, 2, iter))
posterior <- array(dim = c(chains, 2, iter))
accepted  <- array(dim = c(chains, iter-1))
# the data
x <- gendata(n, mu, sigma)
for (c in 1:chains){
    props      <- array(dim=c(2, iter)) <- array(dim=c(2, iter))
    inits <- geninits()[1, 1] <- inits$mu[2, 1] <- inits$sigma
    for (t in 2:iter){
        # theta_star = proposed next values for parameters
        jumpMu    <- jump([1, t-1], sigmaJump)
        jumpSigma <- jump([2, t-1], sigmaJump)
        theta_star <- c(jumpMu, jumpSigma)
        pstar <- post(x, mu = theta_star[1],      sigma = theta_star[2])
        pprev <- post(x, mu =[1, t-1], sigma =[2, t-1])
        lr <- pstar - pprev
        r  <- exp(lr)
        # theta_star is accepted if posterior density is higher w/ theta_star
        # if posterior density is not higher, it is accepted with probability r
        # else theta does not change from time t-1 to t
        accept <- rbinom(1, 1, prob = min(r, 1))
        accepted[c, t - 1] <- accept
        if (accept == 1){
  [, t] <- theta_star
        } else {
  [, t] <-[, t-1]
        props[, t] <- theta_star
    proposed[c, , ]  <- props
    posterior[c, , ] <-
# Plot
# ----------------------------------------------
oldPath <- getwd()
sequence <- c(seq1, seq2, seq3, seq4)
cols <- c("blue", "purple", "red")
png(file = "metrop%03d.png", width=widthPNG, height=heightPNG)
    for (i in sequence){
        par(font.main=1, cex.main=1.2)
        layout(matrix(1:2, nrow=1), widths=c(1.0, 1.5))
  # prepare empty plot
        plot(0, type="n", 
             xlim=xlims, ylim=ylims,
             xlab="mu",  ylab="sigma", main="Markov chains")
        for (j in 1:chains) { # first all the gray lines
            lines(proposed[j, 1, 1:i],  proposed[j, 2, 1:i],  col="gray")
        for (j in 1:chains) { # now the accepteds
            lines(posterior[j, 1, 1:i], posterior[j, 2, 1:i], col=cols[j])
        text(x=6, y=1.0, paste("Iteration ", i), cex=1.2)
        sm.density(x=cbind(c(posterior[, 1, 1:i]), c(posterior[, 2, 1:i])),
                   xlab="mu",  ylab="sigma", zlab="", 
                   xlim=xlims, ylim=ylims,   zlim=zlims, 
        title(main="Posterior density")
# combine .png's into one .gif
# system('convert2gif.bat') # had to write a .bat-file
# system('convert -delay 20 *.png "metropolis.gif"') # doesent work
Created by Pretty R at

Tuesday, September 10, 2013

Cool visualization of Metropolis sampling

A cool visualization of Metropolis sampling has been posted by Maxwell Joseph here. I've pasted in his animated gif below, as a teaser for you to check his post (and I figured it would be okay because his work focuses on "parasitic biodiversity" ;-). Thanks to Joseph Young for pointing me to this.
From Maxwell Joseph's post.
What would also be cool is an animation of Metropolis that shows the proposals at every step, even if rejected.

Friday, September 6, 2013

Bayesian methods article wins best paper award in Organizational Research Methods

The article, "The Time Has Come: Bayesian Methods for Data Analysis in the Organizational Sciences," won the 2012 Best Paper Award for the journal, Organizational Research Methods. The publisher has made the article freely available. You can see the announcement and find a link to the article here. The accompanying programs can be found here . If you find that article interesting, you would probably also like the article, "Bayesian estimation supersedes the t test," described here.

Thursday, August 8, 2013

How much of a Bayesian posterior distribution falls inside a region of practical equivalence (ROPE)

The posterior distribution of a parameter shows explicitly the relative credibility of the parameter values, given the data. But sometimes people want to make a yes/no decision about whether a particular parameter value is credible, such as the "null" values of 0.0 difference or 0.50 chance probability. For making decisions about null values, I advocate using a region of practical equivalence (ROPE) along with a posterior highest density interval (HDI). The null value is declared to be rejected if the (95%, say) HDI falls completely outside the ROPE, and the null value is declared to be accepted (for practical purposes) if the 95% HDI falls completely inside the ROPE. This decision rule accepts the null value only when the posterior estimate is precise enough to fall within a ROPE. The decision rule rejects the null only when the posterior exceeds the buffer provided by the ROPE, which protects against hyper-inflated false alarms in sequential testing. And the decision rule is intuitive: You can graph the posterior, its HDI, its ROPE, and literally see what it all means in terms of the meaningful parameter being estimated. (For more details about the decision rule, its predecessors in the literature, and alternative approaches, see the article on this linked web page.)

But how do we set the limits of the ROPE? How big is "practically equivalent to the null value"? There is typically no uniquely correct answer to this question, because it depends on the particular domain of application and the current practice in that domain. But the rule does not need a uniquely correct ROPE, it merely needs a reasonable ROPE that is justifiable to the audience of the analysis. As a field matures, the limits of practical equivalence may change, and therefore what may be most useful to an audience is a description of how much of the posterior falls inside or outside the ROPE as a function of the width of the ROPE. The purpose of this blog post is to provide some examples and an R program for doing that.

Consider a situation in which we want to estimate an effect size parameter, which I'll call δ. Suppose we collect a lot of data so that we have a very precise estimate, and the posterior distribution shows that the effect size is very nearly zero, as plotted in the left panel below:
We see that zero is among the 95% most credible values of the effect size, but we can ask whether we should decide to "accept" the value zero (for practical purposes). If we establish a ROPE around zero, from -0.1 to +0.1 as shown above, then we see that the 95% HDI falls entirely within the  ROPE and we accept zero. What if we think a different ROPE is appropriate, either now or in the future? Simply display how much of the posterior distribution falls inside the ROPE, as a function of the ROPE size, as shown in the right panel above. The curve plots how much  of the  posterior distribution falls inside a ROPE centered at the null value, as a function of the radius (i.e., half width) of the ROPE. As a landmark, the plot shows dashed lines at the 95% HDI limit farthest from the null value. From the plot, readers can decide for themselves.

Here is another example. Suppose we are spinning a coin (or doing something more interesting that has dichotomous outcomes) and we want to estimate its underlying probability of heads, denoted θ. Suppose we collect a lot of data so that we have a precise estimate, as shown in the left panel below.
We see that the chance value of 0.50 is among the 95% most credible values of the underlying probability of heads, but we can ask whether we should decide to "accept" the value 0.50. If we establish a ROPE around zero, from 0.49 to 0.51 as shown above, we can see that the 95% HDI falls entirely within the ROPE and we would decide to accept 0.50. But we can provide more information by displaying the amount of the posterior distribution inside a ROPE centered on the null value as a function of the width of the ROPE. The right panel, above, allows the readers to decide for themselves.

The plots can be used for cases of rejecting the null, too. Suppose we spin a coin and the estimate shows a value apparently not at chance, as shown in the left panel below.
Should we reject the null value? The 95% HDI falls entirely outside the ROPE from 0.49 to 0.51, which means that the 95% most credible values are all not practically equivalent to 0.5. But what if we used a different ROPE? The posterior distribution, along with the explicit marking of the 95% HDI limits, is all we really need to see what the biggest ROPE could be and still let us reject the null. If we want more detailed information, the panel on the right, above, reveals the answer --- although we need to mentally subtract from 1.0 to get the posterior area not in the ROPE (on either side).

Here is the R script I used for creating the graphs above. Notice that it defines a function for plotting the area in the ROPE.


# From get
# openGraphSaveGraph.R, plotPost.R, and HDIofMCMC.R. Put them in the working
# directory of this script, or specify the path in the next lines' source()
# commands:

# Define function for computing area in ROPE and plotting it:
plotAreaInROPE = function( mcmcChain , maxROPEradius , compVal=0.0 ,
                           HDImass=0.95 , ... ) {
  ropeRadVec = seq( 0 , maxROPEradius , length=201 ) # arbitrary comb
  areaInRope = rep( NA , length(ropeRadVec) )
  for ( rIdx in 1:length(ropeRadVec) ) {
    areaInRope[rIdx] = ( sum( mcmcChain > (compVal-ropeRadVec[rIdx])
                            & mcmcChain < (compVal+ropeRadVec[rIdx]) )
                         / length(mcmcChain) )
  plot( ropeRadVec , areaInRope ,
        xlab=bquote("Radius of ROPE around "*.(compVal)) ,
        ylab="Posterior in ROPE" ,
        type="l" , lwd=4 , col="darkred" , cex.lab=1.5 , ... )
  # From
  # get HDIofMCMC.R. Put it in the working directory of this script, or specify
  # the path in the next line's source() command:
  # source("./HDIofMCMC.R")
  HDIlim = HDIofMCMC( mcmcChain , credMass=HDImass )
  farHDIlim = HDIlim[which.max(abs(HDIlim-compVal))]
  ropeRadHDI = abs(compVal-farHDIlim)
  areaInFarHDIlim = ( sum( mcmcChain > (compVal-ropeRadHDI)
                         & mcmcChain < (compVal+ropeRadHDI) )
                      / length(mcmcChain) )
  lines( c(ropeRadHDI,ropeRadHDI) , c(-0.5,areaInFarHDIlim) ,
         lty="dashed" , col="darkred" )
  text( ropeRadHDI , 0 ,
        bquote( atop( .(100*HDImass)*"% HDI limit" ,
                     "farthest from "*.(compVal) ) ) , adj=c(0.5,0) )
  lines( c(-0.5,ropeRadHDI) ,c(areaInFarHDIlim,areaInFarHDIlim) ,
         lty="dashed" , col="darkred" )
  text( 0 , areaInFarHDIlim , bquote(.(signif(areaInFarHDIlim,3))) ,
        adj=c(0,1.1) )
  return( list( ROPEradius=ropeRadVec , areaInROPE=areaInRope ) )


# Generate a fictitious MCMC posterior:
m = 0.03
s = (0.1-m)/3
paramMCMCsample = rnorm(50000,m,s)

# Specify the desired comparison value, ROPE radius, and HDI mass:
compVal = 0.0
ropeRad = 0.1
HDImass = .95

# Open graphics window and specify layout:
layout( matrix( c(1,2) , nrow=1 , byrow=FALSE ) )

# Plot the posterior with HDI and ROPE:
postInfo = plotPost( paramMCMCsample , compVal=compVal ,
                     ROPE=compVal+c(-ropeRad,ropeRad) , showMode=TRUE ,
                     credMass=HDImass , xlab=expression(delta*" (parameter)") )

# Plot the area in ROPE:
ropeInfo = plotAreaInROPE( mcmcChain=paramMCMCsample , maxROPEradius=0.15 ,
                           compVal=compVal , HDImass=HDImass )


Friday, August 2, 2013

Near Paris? Snap a shot of the book with Laplace. Or pose it with other Bayesians.

The book previously has been posed at Bayes' tomb, Fisher's remains, and Jacob Bernoulli's grave (in an attempt to be amusing and informative; hopefully not inadvertently being offensive). Conspicuously absent from that list is Pierre-Simon Lapace, who was the foundational developer of Bayesian methods. If you are in or near Paris, and would have fun posing the book at Laplace's grave, please do so and send me a photo! If you're among the first responders, I'll post the photo, with attribution to you, on the blog. Information about the location of Laplace's grave can be found here, with a map here.

If you pose the book with other famous Bayesians, or pre-Bayesians, or anti-Bayesians, either dead or not-quite-yet dead, please send me those photos too! (Again, the goal is to be amusing and informative, not offensive.) Thanks, and have fun!

Saturday, July 20, 2013

Decisions from posterior distributions: Tail probability or highest density interval?

How should we decide whether a parameter's posterior distribution "rejects" a particular value such as zero? Should we consider the percentage of the distribution above/below the value? Should we consider the relation of the highest density interval (HDI) to the value? Here are some examples to explain why I think it makes more sense to use the HDI.

Here are the two decision rules being compared. First, the tail-probability decision rule: If there is less than 2.5% of the distribution on either side of the value, then reject the value. This is tantamount to using a 95% equal-tailed credibility interval: Values outside the 95% equal-tailed credibility interval are "rejected." Second, the HDI decision rule: Values outside the 95% HDI are "rejected." (Of course, I like to enhance the decision rule with a ROPE, to allow acceptance decisions and to provide a buffer against false alarms -- but that's a separate discusssion.)

The two histograms below represent the MCMC results for two hypothetical parameters.
The upper panel shows a posterior distribution for which the parameter value of zero falls will within the 95% HDI, but 2.0% (<2.5%) of the distribution falls below the value zero. Do we "reject" zero or not? I think it would be wrong to reject zero, because the probability density (i.e., credibility) of zero is high. Zero is clearly among the most credible values of the parameter, even though less than 2.5% of the distribution falls below zero.

The lower panel shows a posterior distribution for which the parameter value of zero falls well outside the 95% HDI (thus, even with of a modest ROPE, e.g. from -1 to +1, zero would still fall outside the 95% HDI), but a full 3.0% (>2.5%) of the distribution falls below zero. Do we "reject" zero or not? If we use a tail-probability decision rule, we do not reject zero. But clearly zero is not among the most credible values of the parameter, in that zero has low probability density.

Proponents of using equal-tailed credibility intervals will argue that percentiles of distributions are invariant under transformations of the parameter, but HDIs are not. True enough, but I think that most parameters are specifically scaled to be meaningful, and we want to know about credibility (probability density) on the meaningful scale, not on a meaningless transformed scale. But I am not saying that one decision rule is "correct" and the other is "wrong." The decision rules are merely rules with differing interpretations and characteristics; I am showing examples that convince me that a more useful, intuitive rule is using the HDI not the equal-tailed interval.

Then why do the plots in DBDA (like those above) bother to display the percentage of the distribution below/above the comparison value? Primarily merely as an additional descriptive statistic, but also to inform those who wish to think about tail probabilities for a decision rule that I eschew.

For another example, see this previous post.

Wednesday, June 26, 2013

Doing Bayesian Data Analysis at Jacob Bernoulli's grave

The book visited the grave of Jacob Bernoulli (1655-1705) in Basel, Switzerland, as shown in these photos taken by Dr. Benjamin Scheibehenne. Jacob Bernoulli pre-dated Bayes (1701-1761), of course, but Bernoulli established foundational concepts and theorems of probability.

The cutie admiring the book in the first photo is Benjamin's daughter. Many thanks to Benjamin and family for venturing out to take these photos. Thanks also to Benjamin and colleagues for hosting my visit to Basel for a day back in June, during which we unfortunately did not have time to visit Jacob Bernoulli's grave.

If you find this amusing, you might also like the book at Bayes' tomb and the book at R. A. Fisher's remains.

Sunday, June 16, 2013

Bayesian robust regression for Anscombe quartet

In 1973, Anscombe presented four data sets that have become a classic illustration for the importance of graphing the data, not merely relying on summary statistics. The four data sets are now known as "Anscombe's quartet." Here I present a Bayesian approach to modeling the data. The Anscombe quartet is used here as merely another illustration of two frequently made points in Bayesian analysis: JAGS/BUGS is a very flexible language for expressing a variety of models (e.g., robust non-linear regression), and it is important to conduct a posterior predictive check of the descriptive adequacy of the model.

(I first messed around with analyzing Anscombe's quartet back in October 2010, for personal interest. A conversation I had last week in Basel, Switzerland, suggested that this analysis might interest a larger audience. Hence this blog post.)

Below are the four data sets, shown as black circles. They are modeled with typical linear regression that assumes normality:
y ~ dnorm( b0+b1*x , sigma )
with vague priors on the intercept b0, the slope b1, and the standard deviation sigma.

Actually, for easy consistency with the subsequent analysis, I used a t distribution with the normality (df) parameter given a prior that insists it is very large, near 500, which is essentially normal. Thus, the model is
y ~ dt( b0+b1*x , sigma , nu )
with vague priors on b0, b1, and sigma, and the prior on normality being
nu ~ dnorm( 500 , sd=0.1 )

The figures below show the data in the left panels, with a smattering of credible regression lines superimposed, and veritical bars that indicate the 95% HDI of posterior predicted y values at selected x values:

Notice that the marginal parameter estimates are the same for all four data sets (up to MCMC sampling error), which is exactly the point that Anscombe wanted to emphasize for least-squares regression. Here we see that it is also true for Bayesian estimation, and this is no surprise when we remember from formal Bayesian analysis of linear regression (with normally distributed noise) that the "sufficient statistics" of the  data determine the posterior distribution, and the sufficient statistics are identical for the four data  sets.

One of the beauties of the flexibility of JAGS/BUGS is that it is easy to implement other models. In particular, it is easy to implement regression that is robust to outliers, using the t distribution. I used the same model as above,

y ~ dt( b0+b1*x , sigma , nu )
with vague priors on b0, b1, and sigma, but I put a vague prior on the normality of the t distribution:
nu <- nuMinusOne + 1
nuMinusOne ~ dexp( 1/29 )
This is the same prior on nu as is used in the "BEST" model for robust comparison of two groups. The resulting posteriors look like this:

The normality parameter (nu) is not much different from its prior for Types 1, 2, and 4, resulting in slightly smaller estimates of sigma than when forcing nu to be 500.  Most notably, however, for Type 3, the model detects that the data fall on a line and have a single outlier, indicated by the SD (sigma) estimated as an extremely small value, and the normality (nu) hugging 1.0, which is its smallest possible.

Analogously, we could easily do a robust quadratic trend analysis, and thereby detect the non-linear trend in Type 2. See this previous blog post for how to extend a JAGS/BUGS linear regression to include a quadratic trend.

To reiterate and conclude, the Anscombe quartet not only makes its original point, that we should look at the posterior predictions to make sure that the model is a reasonable description of the data, but it also highlights that for Bayesian analysis using JAGS/BUGS, it is easy to try other models that may be better descriptions of the data.
ADDED 18 JUNE 2013: Rasmus Baath, after reading the above post, has actually implemented the extensions that I merely suggested above. Please continue reading his very interesting follow up.