Sunday, February 19, 2017

Interpreting Bayesian posterior distribution of a parameter: Is density meaningful?


Background: Suppose a researcher is interested in the Bayesian posterior distribution of a parameter, because the parameter is directly meaningful in the research domain. This occurs, for example, in psychometrics. Specifically, in item response theory (IRT; for details and an example of Bayesian IRT see this blog post), the data from many test questions (i.e., the items) and many people yield estimates of the difficulties \(\delta_i\) and discriminations \(\gamma_i\) of the items along with the abilities \(\alpha_p\) of the people. That is, the item difficulty is a parameter \(\delta_i\), and the analyst is specifically interested in the magnitude and uncertainty of each item's difficulty. The same is true for the other parameters: the analyst is specifically interested in the magnitude and uncertainty of the discrimination \(\gamma_i\) of every item and of the ability \(\alpha_p\) of every person.
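For concreteness, one common IRT formulation is the two-parameter logistic model (the model in the linked post may differ in detail), in which the probability that person \(p\) answers item \(i\) correctly is
\[ \Pr(y_{p,i}=1) \;=\; \frac{1}{1+\exp\!\big(-\gamma_i\,(\alpha_p-\delta_i)\big)} , \]
so each \(\delta_i\), \(\gamma_i\), and \(\alpha_p\) has a direct interpretation on the model's scale.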

The question: How should the posterior distribution of a meaningful parameter be summarized? We want a number that represents the central tendency of the (posterior) distribution, and numbers that indicate the uncertainty of the distribution. There are two options I'm considering: one based on densities, the other based on percentiles.

Densities. One way of conveying a summary of the posterior distribution is in terms of densities. This seems to be the most intuitive summary, as it directly answers the natural questions from the researcher:
  • Question: Based on the data, what is the most credible parameter value? Answer: The modal (highest density) value. For example, we ask: Based on the data, what is the most credible value for this item's difficulty \(\delta_i\)? Answer: The mode of the posterior is 64.5.
  • Question: Based on the data, what is the range of the 95% (say) most credible values? Answer: The 95% highest density interval (HDI). For example, we ask: Based on the data, what is the range of the 95% most credible values of \(\delta_i\)? Answer: 51.5 to 75.6.
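As a concrete illustration (my own sketch on hypothetical samples, not the routines used in DBDA2E), the mode and HDI can be computed from MCMC samples roughly like this:

```python
import numpy as np
from scipy.stats import gaussian_kde

def posterior_mode(samples, grid_size=512):
    """Approximate the mode as the grid point with the highest kernel density estimate."""
    kde = gaussian_kde(samples)
    grid = np.linspace(samples.min(), samples.max(), grid_size)
    return grid[np.argmax(kde(grid))]

def hdi(samples, cred_mass=0.95):
    """Shortest interval containing cred_mass of the sorted samples."""
    s = np.sort(samples)
    n = len(s)
    k = int(np.ceil(cred_mass * n))
    widths = s[k - 1:] - s[:n - k + 1]
    i = np.argmin(widths)
    return s[i], s[i + k - 1]

# Hypothetical MCMC samples of one item's difficulty delta_i (skewed on purpose):
rng = np.random.default_rng(1)
delta_samples = rng.gamma(shape=3.0, scale=8.0, size=20000) + 40.0

print("posterior mode:", posterior_mode(delta_samples))
print("95% HDI:      ", hdi(delta_samples))
```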
Percentiles. A different way of conveying a summary of the posterior distribution is in terms of percentiles. The central tendency is reported as the 50th percentile (i.e., the median), and the range of uncertainty (that covers 95% of the distribution) is the equal-tailed credible interval from the 2.5 %ile to the 97.5 %ile. When using percentiles, densities are irrelevant, and the shape of the distribution is irrelevant.
An illustration from DBDA2E showing how highest-density intervals and equal-tailed intervals (based on percentiles) are not necessarily equivalent.
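For comparison, here is a minimal sketch of the percentile-based summary on the same kind of hypothetical skewed samples; for a skewed posterior the two interval types disagree, as the figure above illustrates.

```python
import numpy as np

# The same hypothetical skewed posterior samples as in the sketch above.
rng = np.random.default_rng(1)
delta_samples = rng.gamma(shape=3.0, scale=8.0, size=20000) + 40.0

median = np.median(delta_samples)
eti = np.percentile(delta_samples, [2.5, 97.5])
print("posterior median:          ", median)
print("95% equal-tailed interval: ", eti)
# Because these samples are right-skewed, this equal-tailed interval is shifted
# toward the long tail relative to the 95% HDI from the previous sketch.
```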


Some pros and cons:

Density answers what the researcher wants to know: What is the most credible value of the parameter, and what is the range of the credible (i.e., high density) values? Those questions simply are not answered by percentiles. On the other hand, density is not invariant under non-linear (but monotonic) transformations of the parameters. By squeezing or stretching different regions of the parameter scale, the densities can change dramatically, while the percentiles stay the same (on the transformed scale). This lack of transformation invariance is the key reason that analysts avoid using densities in abstract, generic models and derivations.
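To make the transformation point concrete, here is a small sketch (my own example, not from the post): percentiles commute with a monotonic transformation, but the HDI does not, so the back-transformed HDI of \(\log\theta\) generally differs from the HDI of \(\theta\).

```python
import numpy as np

def hdi(samples, cred_mass=0.95):
    """Shortest interval containing cred_mass of the sorted samples."""
    s = np.sort(samples)
    n = len(s)
    k = int(np.ceil(cred_mass * n))
    widths = s[k - 1:] - s[:n - k + 1]
    i = np.argmin(widths)
    return s[i], s[i + k - 1]

rng = np.random.default_rng(2)
theta = rng.lognormal(mean=0.0, sigma=0.75, size=50000)  # hypothetical skewed posterior

# Percentiles commute with the monotonic transform: these two lines agree (up to interpolation).
print(np.percentile(theta, [2.5, 50.0, 97.5]))
print(np.exp(np.percentile(np.log(theta), [2.5, 50.0, 97.5])))

# The HDI does not: the back-transformed HDI of log(theta) differs from the HDI of theta.
print(hdi(theta))
print(tuple(np.exp(hdi(np.log(theta)))))
```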

But in applications where the parameters have meaningful interpretations, I don't think researchers are satisfied with percentiles. If you told a researcher, "Well, we cannot tell you what the most probable parameter value is; all we can tell you is the median (50 %ile)," I don't think the researcher would be satisfied. If you told the researcher, "We can tell you that 30% of the posterior falls below this 30th %ile, but we cannot tell you whether values below the 30th %ile have lower or higher probability density than values above the 30th %ile," I don't think the researcher would be satisfied. Lots of parameters in traditional psychometric models have meaningful scales (and aren't arbitrarily non-linearly transformed). Lots of parameters in conventional models have scales that directly map onto the data scales, for example the mean and standard deviation of a normal model (and the data scales are usually conventional and aren't arbitrarily non-linearly transformed). And in spatial or temporal models, many parameters directly correspond to space and time, which (in most terrestrial applications) are not non-linearly transformed.

Decision theory to the rescue? I know there is not a uniquely "correct" answer to this question. I suspect that the pros and cons could be formalized as cost functions in formal decision theory, and then an answer would emerge depending on the utilities assigned to density and transformation invariance. If the cost function depends on densities, then the mode and HDI would emerge as the better basis for decisions. If the cost function depends on transformation invariance, then the median and equal-tailed interval would emerge as the better basis for decisions.
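As a hedged sketch of how such a cost-function analysis could look (these are standard decision-theory results, not anything prescribed in this post): minimizing posterior expected loss over candidate point estimates yields the mean under squared-error loss, the median under absolute (linear) loss, and approximately the mode under a narrow all-or-nothing loss. A numerical check on hypothetical posterior samples:

```python
import numpy as np

rng = np.random.default_rng(3)
samples = rng.gamma(shape=2.0, scale=5.0, size=20000)  # hypothetical skewed posterior

candidates = np.linspace(samples.min(), samples.max(), 400)

def best_estimate(loss):
    """Candidate point estimate minimizing the posterior expected loss (sample average)."""
    expected = [np.mean(loss(samples - a)) for a in candidates]
    return candidates[int(np.argmin(expected))]

print("squared loss   ->", best_estimate(lambda e: e ** 2), " vs mean:  ", samples.mean())
print("absolute loss  ->", best_estimate(np.abs), " vs median:", np.median(samples))
print("all-or-nothing ->", best_estimate(lambda e: np.abs(e) > 0.5))  # approaches the mode as the window shrinks
```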

What do you think?

6 comments:

  1. Well, from a decision-theory perspective, just looking at the point estimate, the question is how bad it is to be off by a certain amount, right? If you consider it twice as bad to be off by 2.0 as to be off by 1.0, then you have a linear cost function, and the summary of the posterior that you should use is the posterior median.

    From a decision-theoretic perspective the mode is a little bit tricky, as it's the solution to the case when you have a 0-1 loss: if you're correct then there is no loss, but if you're wrong (even just a little) then you get all the loss. This is usually not the situation you are in, as it's usually worse to be off by a lot than to be off by a little.

    So no good answer :) But one thing I've noticed is that if you transform the scale of your parameter so that it spans (-∞, ∞), for example logit-transforming a rate parameter, then the HDI and quantile intervals are often pretty similar...

  2. I think Rasmus touched on what I consider a correct answer. The model parameters should be on the real line, such that the estimate's distribution is approximately normal. This makes sense from the perspective of MCMC samplers, which perform best with normal-linear parameter distributions. From the point of view of probability theory, the idea could be backed up with Jaynes' maximum-entropy principle: a model with normal parameter estimates would be most informative with respect to a fixed prior (this is just my intuition; it would be nice to see a proof). Thus, if I find a skewed, non-normal posterior, the first thing for me is to try to improve the model (even if the convergence indicators are in the green) such that the parameter estimate's distribution is approximately normal. Then it should not matter whether one reports the HDI or quantiles; the two should coincide, and you get the inferential benefits of both.

    I think decision analysis is a great tool and can be useful for answering the research question that motivated the study. However, I think the researcher should always report the posterior parameter estimates. This is because the results may be of interest to a researcher with a different research question, which may require a decision analysis with a different cost function. In Bayesian decision analysis, the re-analysis requires knowledge of the entire posterior distribution, which is difficult to report. However, if we assume that the posterior is approximately normal (see my first point), it's straightforward to recreate the posterior from the HDI or quantiles. IMO, the summaries are just proxies for the posterior model, which can then be used for prediction and inference in general.
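    If the posterior really is approximately normal, a reported 95% interval does essentially encode the whole distribution. A minimal sketch of that reconstruction (hypothetical numbers, and the normality assumption is doing all the work here):

```python
from scipy import stats

# Hypothetical reported 95% interval for a parameter, assumed to come from an
# approximately normal posterior (for a normal, HDI and equal-tailed interval coincide).
low, high = 51.5, 75.6
mean = (low + high) / 2.0
sd = (high - low) / (2.0 * stats.norm.ppf(0.975))  # half-width divided by ~1.96

reconstructed = stats.norm(loc=mean, scale=sd)
print("reconstructed mean and sd:", mean, sd)
print("check 2.5% / 97.5% quantiles:", reconstructed.ppf([0.025, 0.975]))
```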

  3. Thank you for this post. It provides a venue for a tortured question I've had for a while.

    "Based on the data, what is the range of the 95% most credible values of \(\delta_i\)?"

    This doesn't mean anything to me and I don't know how to parse it. There is no "95% of values" of a continuous variable and so they can't have a range.

    Replies
    1. To say, "the 95% most credible values" is synonymous with saying "values in the 95% HDI". It's just a simple re-phrasing, sort of like saying "sampling distribution" instead of "the distribution of sample statistics".

    2. So when you write: "Density answers what the researcher wants to know: (...) what is the range of the credible (i.e., high density) values?" do you similarly mean that as a shorthand for: "Density answers what the researcher wants to know: (...) what is the HDI?" Because that feels circular to me: the claim that researchers are interested in densities seems to rest on itself.

    3. Yeah, the claim that researchers intuitively want to know posterior probability density is an intuitive claim. Much like saying that researchers intuitively want to know posterior probability of parameters, not p values.

      The intuitive claim does not mean that researchers cannot or should not want to know other information, such as quantile (percentile) intervals and p values. The intuitive claim is merely that it's natural for researchers to want to know posterior probability density on parameters with meaningful scales.
