Friday, April 29, 2011

Simple Example of How Bayesian Analysis Is Better Than MLE/NHST

Here's a simple example to illustrate some of the advantages of Bayesian data analysis over maximum likelihood estimation (MLE) with null hypothesis significance testing (NHST). I haven't seen this example anywhere else, but please let me know if similar things have previously appeared "out there".

Figure 1.
Suppose we have data consisting of <x,y> pairs for every individual; e.g., X is time and Y is amount of carbon dioxide in a sample. An example of this type of data is shown in Figure 1. We might try to describe this sort of data with simple linear regression. Instead, we'll describe the data with an exponential growth curve:
Y = exp( b1 ( X - b0 ) ) + N(0,s)
where b1 is the growth rate, b0 is the X value at which the predicted Y value is one, and s is the standard deviation of normally distributed noise. Notice that s is the same at all values of X. (The data in the figure were generated from the model using particular values for b0, b1, and s. But you don't need to know the generating values.)

Goals of the analysis:
  • Estimate the values of b0, b1, and s that are most credible given the sample of data. 
  • Decide whether or not Y is growing (exponentially); i.e., decide whether b1 is non-zero. 
  • Predict values of Y for future X; i.e., values of X larger than any yet observed.
We can attempt to address those goals by Bayesian analysis or by MLE+NHST. Let's see what happens.

Goal: Estimate the values of b0, b1, and s that are most credible given the sample of data.

The MLE is the specific combination of <b0,b1,s> values that maximizes the probability of the data:
Πi p( yi | xi, b0, b1, s ) =Πi N( yi - exp( b1 ( xi - b0 ) ) , s )
The MLE is graphed as the red line in Figure 1. It can be seen that the MLE provides a reasonable description of the trend in the data. But it provides no sense of what other parameter values are reasonable, given the data.

By contrast, a Bayesian analysis inherently reveals the entire 3-dimensional conjoint distribution of credible parameter values. Figure 2 (below) shows aspects of the posterior distribution, using a very diffuse prior.

Figure 2.
Notice in the left panel of Figure 2 that the credible values of b0 and b1 trade off with each other, in a curious curved fashion. A random subset of these credible values are plotted as the blue curves in Figure 1.

The MLE point estimate is plotted as the red circle in the left panel of Figure 2. Notice that the MLE reveals nothing about what other values of b0 and b1 are reasonable, given the data. Sometimes people approximate the likelihood function at the MLE with a multivariate Gaussian, with covariance matrix set to the second derivatives of the likelihood function (either analytically computed or numerically approximated). But the posterior is obviously not Gaussian, so the asymptotic (large N) approximation is not good in this case, even when N=100.

Goal: Decide whether or not Y is growing (exponentially); i.e., decide whether b1 is non-zero.

The right panel of Figure 2 shows clearly that the credible values of b1 are larger than zero. 100% of the posterior falls above zero, and the 95%HDI falls way above zero; even a wide ROPE would be excluded by the HDI. The posterior also reveals the gradation of credibility across the range of b1.

How would NHST decide whether b1 is non-zero? We need to determine whether p<.05. And to determine p, we need to get a sampling distribution of b1. To get a sampling distribution, we need to determine what the intended sampling process was. In this case, there happened to be N=100 data points, but we don't know whether the experimenter intended to stop when N=100, or intended to stop at the end of the week and just happened to get 100 observations, or intended to collect 150 but was unexpectedly interrupted, or whatever. For each sampling process, there is a different sampling distribution and a different p value! But suppose we select some sampling process. Next, we need to figure out a null hypothesis from which to generate simulated samples. The null hypothesis will be the model with b1=0, because the "null" we're asking about is b1=0. But what values should be use for b0 and for s? We could use the MLE estimates of b0 and s that we happened to observe in our real data, but those values are just from a random sample and plausible null hypotheses could use different values of b0 and s. For each value of b0 and s, there is a different sampling distribution and a different p value (I think; correct me if I'm wrong). But suppose we select some values for b0 and s. Then we need to generate the sampling distribution. Maybe someone could derive it mathematically, but in lieu of that derivation, it can be approximated by simulating many random samples from the null hypothesis and for every random sample computing the MLE. Finally, we compute the percentile of the actual MLE b1 in the sampling distribution of MLE b1, which yields (1 minus) the p value. All that work to get something that is ill defined, and tells us only whether we'd reject b1=0.

It does not tell us what other values of b1 we would or would not reject. To determine the NHST confidence interval, we need to repeat the above process for every candidate value of b1! The confidence interval is the range of b1 values that we would not reject, so we have to go through the sampling distribution computations, and all their assumptions, for every candidate value of b1. And even when we determine the limits of the confidence interval, we still don't have any gradation of confidence over that range.

Goal: Predict values of Y for future X; i.e., values of X larger than any yet observed.

Figure 3.
Figure 3 shows the data replotted with an expanded Y scale. The MLE is plotted by the red bars, which show the 95% HDI of its predicted values Y. In other words, the MLE value of s specifies a spread of variability in the data, and that spread is plotted as vertical red bars, spanning the MLE curve that was shown in Figure 1.

Figure 3 also shows the Bayesian posterior predictions, as the vertical blue bars. Notice that the Bayesian predictions reveal the increasing uncertainty in prediction as the probed X values extrapolate farther beyond the observed values. Notice also that the mean of the Bayesian predictions (marked by the blue horizontal tic marks) curves away from the MLE prediction in this case! This is caused by the asymmetric spread of credible curves around the MLE curve.

3 comments:

  1. Thank you for this eloquent example! Is the source code available? Presumably this example is done with R and OpenBUGS? It would be nice to work through the example to understand how to set up a similar problem.

    ReplyDelete
  2. The code is a very simple modification of SimpleLinearRegressionJags.R. The original linear regression model is this:

    model {
    for( i in 1 : Ndata ) {
    y[i] ~ dnorm( mu[i] , tau )
    mu[i] <- beta0 + beta1 * x[i]
    }
    beta0 ~ dnorm( 0 , 1.0E-12 )
    beta1 ~ dnorm( 0 , 1.0E-12 )
    tau ~ dgamma( 0.001 , 0.001 )
    }

    The exponential model changes a single line of code:

    model {
    for( i in 1 : Ndata ) {
    y[i] ~ dnorm( mu[i] , tau )
    mu[i] <- exp( beta1 * ( x[i] - beta0 ) )
    }
    beta0 ~ dnorm( 0 , 1.0E-12 )
    beta1 ~ dnorm( 0 , 1.0E-12 )
    tau ~ dgamma( 0.001 , 0.001 )
    }

    You can generate some random data like this:

    Ndata = 100
    b0True = 1.0
    b1True = 1.0
    sdTrue = 0.5
    set.seed(4740)
    xMin=(-3.0) ; xMax=xMin+4
    x = rnorm( Ndata , mean=mean(c(xMin,xMax)) , sd=(xMax-xMin)/4 )
    y = exp( b1True * ( x - b0True ) ) + rnorm( Ndata , 0 , sdTrue )

    One more thing: Do not standardize the data and transform back, because it just gets too confusing in this non-linear situation. Instead, run the chains a long time to smooth out autocorrelation.

    ReplyDelete