Wednesday, September 17, 2014

p value from likelihood ratio test is not the same as p value from maximum likelihood estimate

In a post of a few hours ago, I pointed out that I was having trouble getting p values to agree for two different methods. Thanks to a suggestion from a reader, there is a resolution: The p values should not agree. This was actually my hunch and hope all along, because it adds another reason never to talk about "the" p value for a set of data, because any data set has many different p values.

First, a recap:
I'm doing Monte Carlo simulation of sampling distributions to compute p values. For example, consider the slope parameter, β1, in simple linear regression. I want to find out if p < .05 for a null hypothesis that β1=0. I'm working with two different ways to compute a p value, and am finding that the results do not agree with each other. Here are the two ways:
  1. Consider the maximum likelihood estimate (MLE) of β1 from the actual data, denoted β1MLEactual, and see where it falls in a sampling distribution of β1MLEnull for simulated data from the null hypothesis.
  2. Consider the likelihood ratio statistic, G2 = -2log(LR) where LR is the ratio of the maximum likelihood of the restricted model with β1=0 over the maximum likelihood of the full model with β1 free. We see where G2actual falls in the sampling distribution of G2null for simulated data from the null hypothesis.
See the previous post for details about how the data from the null hypothesis were generated. But here is a repeat of the old picture of the two sampling distributions:
Notice that the p values are not the same in the two distributions. (See the end of the previous post for reasons that the difference cannot be explained away as some simple artifact.)

Now the news: Arash Khodadadi, an advanced graduate student in Psychological & Brain Sciences at Indiana University, read the post and pointed out to me that not all samples that have G2null > G2actual are the same samples that have β1MLEnull > β1MLEactual. Essentially he was saying that I really should be looking at the joint sampling distribution. So I made a plot, and here it is:
Each point corresponds to a sample from the null hypothesis. The marginals of the joint distribution were plotted in the previous figure. I put lines at the values of β1MLEactual and G2actual, and I color coded the points that exceed those values. Obviously the points contributing to the p value for β1MLE
are quite different than the points contribution to the p value for G2. It is conceivable that the red and blue points would exactly equal each other mathematically (perhaps in a two-tailed version), but I doubt it.

Conclusion: This exercise leads me to conclude that the p values are different because they are referring to different questions about the the null hypothesis. Which one is more meaningful? For me, the sampling distribution of β1MLE makes more direct intuitive contact with what people want to know (in a frequentist framework), namely, Is the observed magnitude of the slope very probable under the null hypothesis? The sampling distribution of G2 is less intuitive, as it is asking, Is the observed ratio, of the probability of the data under the zero-slope model over the probability of the data under the free-slope model, very probable under the null hypothesis? Both questions are meaningful, but the first one asks directly about the magnitude of the slope, whereas the second one asks about the relative probabilities of data under the model structures.

Now the difficult semantic question: When the p values conflict, as they do in this example (i.e., p < .05 for β1MLE, while p > .05 for G2 [and if you prefer two-tailed tests, then we could contrive a case in which the p values conflict there, too]), is the slope "significantly" non-zero? The answer: It depends! It depends on the question you are asking. I think it makes intuitive sense to ask the question about the magnitude of the slope and, therefore, to say in this case that the slope is significantly non-zero. But if you specifically want to ask the model comparison question, then you need the model-comparison p value, and you would conclude that the free-slope model is not significantly better than the zero-slope model.



  1. This recent paper (and its citations) could be useful:

  2. The test of the slope parameter ignores the intercept parameter and the fact that the estimate will vary depending on whether or not the slope is present in the model.

    The likelihood ratio test compares the full models, one of which has a slope and an intercept, one of which just has an intercept.

    I wrote some code to run similar simulations, but first I standardized x and the error before generating y. In this case, the full and restricted models both have estimated intercepts of zero (or close enough, in the ball park of 1e-16), as expected. Both tests also give the same p value (I used the anova() function, so it's a p value for an F statistic, but that should behave pretty much like a G2 statistic).

    If I plot the joint distribution of b1 and F from the null model, it looks like f(x) = x^2, pretty much - a nice, clean U shape with little to no detectable noise.

  3. Noah: When I rerun the simulation with mean-centered actual data (both x and y), and mean-centered simulated sample data --so that the estimated intercept in actual and sampled data is always zero-- I get the same results as in the post.

  4. This comment has been removed by the author.

  5. The joint distribution plot seems to make it pretty clear that the majority of the discrepancy is due to the one-tailed vs. two-tailed issue that you mentioned in the last post. As for the non-trivial discrepancy that reminds after that, it seems like Noah's suggestion should do it. But you say that when you do this, contrary to Noah, you "get the same results as in the post." But does this include also correcting the tail issue? Between the two measures it seems like we should be able to get them to match pretty much exactly. Having the code could be helpful here.

  6. I meant to mention this in my first comment, but when you write that " adds another reason never to talk about "the" p value for a set of data, because any data set has many different p values", I feel like it's worth pointing out that anyone who claims that there's a single p value for a data set is forgetting (or doesn't know) that a p value is associated with a test carried out on a data set. A data set without a test doesn't have any p value at all (nor does a test without a data set).

    Anyway, when I use standardized data (x and e), I get exactly the same p values for the beta coefficient test and the model comparison test. Each time I run 5000 simulations, I get simulated p values that are very close to the values returned by lm() and anova(). I've tried it four or five times with different values for b1 this morning, and the results are very consistent (with respect to the p values matching across tests).

    Here's my code:

    N = 100
    x = runif(N)*5 + 3
    x = (x-mean(x))/sqrt(((N-1)/N)*var(x))
    b0 = 0
    b1 = .23
    e = rnorm(N,mean=0,sd=7)
    e = (e-mean(e))/sqrt(((N-1)/N)*var(e))
    y = b0 + b1*x + e

    data = data.frame(y=y,x=x)

    BF = matrix(nrow=nsmp,ncol=2)
    colnames(BF) = c("beta","Fs")
    fit1 = lm(y~x,data=data)
    fit2 = lm(y~1,data=data)
    test = anova(fit2,fit1)

    b1.f = coef(fit1)[2]
    Fs.f = test$F[2]

    nsmp = 5000
    b1.H0 = 0

    for(smpi in 1:nsmp){
    e = rnorm(N,mean=0,sd=7)
    e = (e-mean(e))/sqrt(((N-1)/N)*var(e))
    y = b0 + b1.H0*x + e
    d = data.frame(x=x,y=y)
    fit.a = lm(y~x,data=d)
    fit.b = lm(y~1,data=d)
    test.ab = anova(fit.b,fit.a)
    BF[smpi,"beta"] = coef(fit.a)[2]
    BF[smpi,"Fs"] = test.ab$F[2]

    blims = c(min(-1.25*b1.f,min(BF[,'beta'])),max(-1.25*b1.f,max(BF[,'beta'])))
    temp = hist(BF[,1],breaks=seq(blims[1],blims[2],length=50))
    Flim = max(1.25*Fs.f,max(BF[,"Fs"]))
    temp = hist(BF[,2],breaks=seq(0,Flim,length=50))

    ff = summary(fit1)$fstatistic
    c(pf(ff[1],ff[2],ff[3],lower.tail=F),2*sum(BF[,1] > b1.f)/nsmp)

    c(test[["Pr(>F)"]][2],sum(BF[,2] > Fs.f)/nsmp)

  7. The key conceptual difference between the sampling distributions (of MLE beta1 vs G2) is not how they treat the intercept, it's how they treat the variance. In some random samples, MLE beta1 can be large but with large MLE sigma, in which case G2 would be relatively small -- these are the blue dots. In other random samples, MLE beta1 can be small but with small MLE sigma, in which case G2 could be relatively large -- these are the red dots. G2 is (in this normal+linear case like the F ratio) a measure of slope vs intercept-only relative to variance in the sample, while MLE beta1 is a 'direct' measure of slope on the absolute scale of the null hypothesis. I'll post some pictures soon...

  8. The fact that you had mentioned centering (but not scaling) x and y, while in my code I'm centering and scaling x and e (but not y) got me wondering about the role of variance. I hadn't thought through it yet, but it sounds like you have. I look forward to seeing the new pictures.

  9. Ah, I see. A solution is emerging! Very interesting.