Thursday, September 18, 2014

p value from likelihood ratio test is STILL not the same as p value from maximum likelihood estimate

In yesterday's post, I described two ways of finding a p value for a parameter, and pointed out that the two ways lead to different p values. As an example, I considered the slope parameter in simple linear regression. One way to get a p value for the slope parameter is with a standard likelihood ratio test, comparing the linear model with free slope against the intercept-only model. The other way to get a p value is to generate a sampling distribution of the MLE slope parameter itself, from a null hypothesis that consists of the MLE intercept-only model. Through remarks from commenters, especially Arash Khodadadi and Noah Silbert, I've been prodded to clarify the differences between these approaches. This post points out
  • The likelihood ratio test compares model fits relative to the variance, while the sampling distribution of the MLE slope parameter is on the absolute scale of the data.
  • An example in which the slope parameter has p < .05 (two-tailed) in the sampling distribution of the MLE slope parameter but has p > .05 (one-tailed) in the sampling distribution of the likelihood ratio.
First, the data (a little different from the previous posts):
The header of the plot, above, indicates the MLE of the linear model, and -2log(LR), a.k.a. G2, rounded to three significant digits.
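For concreteness, here is a minimal sketch (in Python with NumPy; the function name is my own, not from the post) of how the MLE of the linear model and -2log(LR) can be computed. For normal errors, maximizing the likelihood over sigma in both models reduces the statistic to n·log(RSS_null / RSS_full):

```python
import numpy as np

def mle_and_g2(x, y):
    """MLE of the linear model and G2 = -2 log(LR) versus the intercept-only model.

    For normal errors, profiling out sigma gives G2 = n * log(RSS_null / RSS_full).
    """
    n = len(y)
    # Full model: least-squares estimates are the MLEs under normal noise.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    rss_full = np.sum((y - (b0 + b1 * x)) ** 2)
    # Null (intercept-only) model: the MLE intercept is the mean of y.
    rss_null = np.sum((y - y.mean()) ** 2)
    g2 = n * np.log(rss_null / rss_full)
    sigma_mle = np.sqrt(rss_full / n)   # note N, not N-1, in the denominator
    return b0, b1, sigma_mle, g2
```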

The null hypothesis is an intercept-only model (beta1=0) with beta0 and sigma set to their MLE values when beta1 is fixed at zero. (In this case, that means beta0 is the mean of y and sigma is the SD of y computed with N, not N-1, in the denominator.) I generated sample data from the null hypothesis using the x values in the actual data. For each sample I computed the MLE of the full model and -2log(LR). The resulting marginal sampling distributions are shown here:
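The simulation procedure just described could be sketched as follows (a hypothetical implementation in Python with NumPy; names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)

def null_sampling_distributions(x, y, n_sims=10_000):
    """Sampling distributions of MLE beta1 and G2 = -2 log(LR) under the
    intercept-only null, following the procedure described in the post."""
    n = len(y)
    beta0_null = y.mean()          # MLE intercept with beta1 fixed at zero
    sigma_null = y.std(ddof=0)     # MLE sigma: N, not N-1, in the denominator
    sxx = np.sum((x - x.mean()) ** 2)
    b1_samples = np.empty(n_sims)
    g2_samples = np.empty(n_sims)
    for i in range(n_sims):
        # Simulate data from the null, using the actual x values.
        y_sim = beta0_null + sigma_null * rng.standard_normal(n)
        # MLE of the full model for this sample.
        b1 = np.sum((x - x.mean()) * (y_sim - y_sim.mean())) / sxx
        rss_full = np.sum((y_sim - y_sim.mean() - b1 * (x - x.mean())) ** 2)
        rss_null = np.sum((y_sim - y_sim.mean()) ** 2)
        b1_samples[i] = b1
        g2_samples[i] = n * np.log(rss_null / rss_full)
    return b1_samples, g2_samples
```

Given the observed statistics, the one-tailed p value of MLE beta1 is the proportion of `b1_samples` at or beyond the observed slope, and the p value of the likelihood ratio is the proportion of `g2_samples` at or beyond the observed G2.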
In the left panel, above, the one-tailed p value of MLE beta1 is displayed; multiply it by 2 to get the two-tailed p value. Notice it is different from the p value of the likelihood ratio.

Below is the joint sampling distribution, where each point is a sample from the null hypothesis. There is a new twist to this figure: Each point is color coded for the magnitude of MLE sigma, where blue is the largest MLE sigma in the distribution and red is the smallest MLE sigma in the distribution.
The joint sampling distribution also shows the thresholds for p=.05 (one-tailed or two-tailed), and the actual data statistics are plotted as a "+".

You can see from the joint sampling distribution that MLE beta1 can be large-ish even when -2log(LR) is small-ish when the sample MLE sigma is large-ish (blue points). But the opposite can happen when the sample MLE sigma is small-ish (red points). Thus, a key difference between the measures of the slope parameter is how they deal with the variance. The likelihood ratio compares the free-slope against intercept-only models relative to the variance, while the MLE beta1 considers the slope on the absolute scale of the data, not relative to the variance.
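The role of the variance can be seen in a tiny numeric example (a hypothetical sketch of my own, not from the post): two data sets constructed to have exactly the same MLE beta1 but different residual scale yield very different G2 values.

```python
import numpy as np

def g2_stat(x, y):
    """-2 log(LR) for free-slope versus intercept-only, normal errors."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    rss_full = np.sum((y - y.mean() - b1 * (x - x.mean())) ** 2)
    rss_null = np.sum((y - y.mean()) ** 2)
    return len(y) * np.log(rss_null / rss_full)

x = np.array([0., 1., 2., 3.])
r = np.array([1., -1., -1., 1.])    # residual pattern orthogonal to x
y_small_sigma = x + 0.5 * r         # MLE beta1 = 1, small residual sigma
y_large_sigma = x + 2.0 * r         # MLE beta1 = 1, large residual sigma
# Same absolute slope, but G2 shrinks as the residual variance grows:
print(g2_stat(x, y_small_sigma), g2_stat(x, y_large_sigma))
```

Because the residual pattern `r` is orthogonal to `x`, rescaling it changes the MLE sigma without changing the fitted slope, which isolates exactly the difference between the two measures.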

As discussed in yesterday's post, I don't think either test is inherently better than the other. They just ask the question about the slope parameter in different ways. As mentioned yesterday, posing the question in terms of absolute MLE beta1 has direct intuitive interpretation. It's also much easier to use when defining confidence intervals as the range of parameter values not rejected by p<alpha (which is, for me, the most coherent way to define confidence intervals). But that's a topic for another day!


  1. So, given that the p values match up when the data are standardized ahead of time, and given that these two tests seem like they should be leading to the same inference, this seems like a reasonable argument for standardizing data.

  2. You have quite a small sample, meaning the chi-square distribution for your test statistic might not hold exactly.

    With a larger sample, you can run a permutation test, which may give you better control of the variability of the estimate. And usually we run permutation tests on test statistics rather than on effect estimates.

    Still, 0.048 is similar to 0.054. I hope you do not seriously believe they oppose each other.

  3. Still, 0.048 is similar to 0.054. I hope you do not seriously believe they oppose each other.

    If your cut-off is determined by alpha = 0.05, then the conclusion based on 0.048 will be different from the conclusion based on 0.054.

    As Andrew Gelman is fond of saying, the difference between not statistically significant and statistically significant need not itself be statistically significant. So, yes, you're right that 0.048 and 0.054 are awfully similar, but they're on either side of a line that is commonly believed to be very, very important.

  4. My concerns are (1) understanding the exact concept (2) applied as generally as possible. Noah, while standardizing might make things "nice" for this case (normal noise, linear model), I'm not sure it would help more generally (e.g., non-normal heteroscedastic noise, non-linear model). Wei, while asymptotic approximations may be fine for large N, that's not my goal here. The chi-square approximation to the exact distribution of -2log(LR) is bad for the small N here. I'm pointing out that the exact distribution of MLE beta1 is yet again different.

  5. It was pointed out to me by Michael Trosset (IU Dept of Statistics) that what I am doing appears to be related to the difference between a Wald test and a LR test. Indeed, my sampling distribution of the MLE is like an "unstandardized" Wald statistic, and what I am pointing out is analogous to pointing out that p values from the Wald test will not match p values from the LR test for small N (though they both converge to chi-squared for large N).

  6. I think your colleague is right - see here for the asymptotic equivalence of both tests: