Sunday, December 27, 2015

Lessons from Bayesian disease diagnosis: Don't over-interpret the Bayes factor, VERSION 2


[This is a revised version of a previous post; that earlier post is superseded by this one. 
This revision has corrected derivations, new R/JAGS code, and new diagrams.] 

Overview
"Captain, the prior probability of this
character dying and leaving the show
is infinitesimal."
A primary example of Bayes' rule is for disease diagnosis (or illicit drug screening). The example is invoked routinely to explain the importance of prior probabilities. Here's one version of it: Suppose a diagnostic test has a 97% detection rate and a 5% false alarm rate. Suppose a person selected at random tests positive. What is the probability that the person has the disease? It might seem that the odds of having the disease are 0.97/0.05 (i.e., the detection rate over the false alarm rate), which corresponds to a probability of about 95%. But Bayesians know that such an answer is not appropriate because it ignores the prior probability of the disease, which presumably is very rare. Suppose the prior probability of having the disease is 1%. Then Bayes' rule implies that the posterior probability of having the disease is only about 16%, even after testing positive! This type of example is presented over and over again in introductory expositions (e.g., pp. 103-105 of DBDA2E), emphatically saying not to use only the detection and false alarm rates, and always to incorporate the prior probabilities of the conditions.


The ratio of the detection and false-alarm rates is the Bayes factor, hence the lesson of the primary example is: Do not use the Bayes factor alone to make decisions; instead always incorporate the prior probabilities of the options. Unfortunately, this lesson is often forgotten or at least de-emphasized in general model-selection settings. In model-selection settings, some people compute the Bayes factor and use it alone as the basis for model selection. The tacit assumption is that the models all have equal prior probabilities. But this assumption of equal prior probabilities seems implausible in most situations, and should be explicitly justified. If we should not rely on the Bayes factor alone to make decisions in disease diagnosis or drug testing, why should we rely on the Bayes factor alone to make decisions in other model-selection applications?

In this post
  • I explain the role of the Bayes factor in the simple disease-diagnosis situation. (This also applies to random drug screening.)
  • Then I extend to the situation in which the prior probabilities are not certain. I show that in the limit, with utter uncertainty in the prior probabilities, the posterior odds fade to match the Bayes factor. But even with some modest knowledge of the base rates, the posterior odds can be far from the Bayes factor. 
  • Hierarchical diagrams are provided to illustrate the model comparisons.
  • JAGS code is provided for hands-on experience.
Conclusion: When deciding between models, the decision should not be based on the Bayes factor alone.

Bayes' rule applied to disease diagnosis

A person is randomly selected from a population in which the prior probability of having a particular disease is 1%. Denote the condition of the person by the variable \(M\), which can have two values, either \(M\!\!=\!\!1\) for having the disease or \(M\!\!=\!\!2\) for not having the disease. The prior probability can be written as \(p(M\!\!=\!\!1) = 0.01\), and hence \(p(M\!\!=\!\!2) = 0.99\).

A diagnostic test for the disease can show a "positive" result suggesting that the disease is present, or a "negative" result suggesting that the disease is absent. The result of the test is the datum, denoted \(D\), which can have two values: \(D\!\!=\!\!1\) for "positive" or \(D\!\!=\!\!0\) for "negative". The correct-detection rate (a.k.a. sensitivity) of the test is 97%, meaning \(p( D\!\!=\!\!1 | M\!\!=\!\!1 ) = 0.97\). The false-alarm rate of the test is 5%, meaning \(p( D\!\!=\!\!1 | M\!\!=\!\!2 ) = 0.05\). The complement of the false alarm rate is called the specificity of the test, which in this example is 95%.

Applying Bayes' rule yields: \[ p( M\!\!=\!\!1| D\!\!=\!\!1 ) = p( D\!\!=\!\!1 | M\!\!=\!\!1 ) p( M\!\!=\!\!1 ) / p( D\!\!=\!\!1 ) \] and \[ p( M\!\!=\!\!2| D\!\!=\!\!1 ) = p( D\!\!=\!\!1 | M\!\!=\!\!2 ) p( M\!\!=\!\!2 ) / p( D\!\!=\!\!1 ) \] In the above two equations, the denominator \( p( D\!\!=\!\!1 ) \) is the same, computed as \( p( D\!\!=\!\!1 ) = \sum_m p( D\!\!=\!\!1 | M\!\!=\!\!m ) p( M\!\!=\!\!m ) \). The ratio of the two equations is \begin{align} \underbrace{\frac{p( M\!\!=\!\!1 | D\!\!=\!\!1 )}{p( M\!\!=\!\!2 | D\!\!=\!\!1 ) } }_{\mbox{posterior odds}} &= \underbrace{\frac{p( D\!\!=\!\!1 | M\!\!=\!\!1 )}{p( D\!\!=\!\!1 | M\!\!=\!\!2 )} }_{\mbox{Bayes factor}} \times \underbrace{\frac{p( M\!\!=\!\!1 ) }{ p( M\!\!=\!\!2 )}}_{\mbox{prior odds}} \label{eq:BF}\end{align} Notice from Equation \ref{eq:BF} that the Bayes factor is the ratio of the detection rate to the false alarm rate (when the datum is \(D\!\!=\!\!1\)). The Bayes factor converts the prior odds (i.e., the ratio of prior probabilities) to the posterior odds (i.e., the ratio of posterior probabilities).

When we plug in the numerical values of the example, we get \[ \underbrace{\frac{0.164}{0.836}}_{\mbox{posterior odds}} = \underbrace{\frac{0.97}{0.05}}_{\mbox{Bayes factor}} \times \underbrace{\frac{0.01}{0.99}}_{\mbox{prior odds}} \] or \[  \underbrace{0.196}_{\mbox{posterior odds}} = \underbrace{19.4}_{\mbox{Bayes factor}} \times \underbrace{0.0101}_{\mbox{prior odds}}\] (the displayed values are rounded, whereas the underlying calculations retained many decimal places). Thus, even though the Bayes factor is 19.4 in favor of having the disease, the posterior probability of having the disease is only 0.164 because the prior probability was 0.01.
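For readers who want to verify the arithmetic, here is a minimal R sketch (the variable names are mine, chosen for this post):

# Check the disease-diagnosis arithmetic directly in R:
pDetect = 0.97  # p(D=1|M=1), correct-detection rate
pFalarm = 0.05  # p(D=1|M=2), false-alarm rate
priorM1 = 0.01  # p(M=1), prior probability of disease
pD1 = pDetect*priorM1 + pFalarm*(1-priorM1)   # p(D=1) = 0.0592
pDetect*priorM1 / pD1                         # p(M=1|D=1) = 0.164
( pDetect/pFalarm ) * ( priorM1/(1-priorM1) ) # posterior odds = 0.196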

The real-life lesson of these sorts of examples is that when doing random disease screening or random drug testing, it is not appropriate to use the Bayes factor alone to interpret the result of the diagnostic test. The prior probabilities of the conditions (disease or healthy, using drug or not) must be factored in, and the decision should be based on the posterior probability. 


Disease diagnosis is model comparison

The case of disease diagnosis is just a case of model comparison. In model comparison, we have some data and we infer the posterior probabilities of the models given the data. For disease diagnosis, the data consist of a dichotomous value, namely \(D\!\!=\!\!1\) or \(D\!\!=\!\!0\), and the two models are merely two possible underlying states of the world, namely \(M\!\!=\!\!1\) or \(M\!\!=\!\!2\). Disease diagnosis is simply asking which model, disease or healthy, is a better model of the outcome of the diagnostic test.

The moral of the disease-diagnosis example ---that the Bayes factor alone is not appropriate for selecting the model--- applies to model comparison generally. Model comparison should be based on the posterior probabilities of the models.

Figure 1: Disease diagnosis as model comparison. At bottom, D is the datum from the diagnostic test, with D=1 denoting a "positive" result and D=0 denoting a "negative" result. At left, model M=1 represents disease present. The right bar of its Bernoulli distribution over D=1 has height labeled by p(D=1|M=1), which is the correct-detection rate. At right, model M=2 represents disease absent. The right bar of its Bernoulli distribution over D=1 has height labeled by p(D=1|M=2), which is the false-alarm rate. At the top of the diagram is the base rate of the disease: the prior probability of having the disease, M=1, is \(\pi_1\).
Figure 1 shows disease diagnosis as model comparison analogous to the diagrams in Figures 10.1 and 10.2 of DBDA2E (pp. 267 and 269). The caption provides a description. Essentially, the model for "has disease" has likelihood function \(D \sim \mbox{dbern}( p(D\!\!=\!\!1|M\!\!=\!\!1) ) \) and the model for "no disease" has likelihood function \(D \sim \mbox{dbern}( p(D\!\!=\!\!1|M\!\!=\!\!2) ) \). The prior probability of the model indices is \(M \sim \mbox{dcat}( \pi_1 , 1\!-\!\pi_1 )\).
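For concreteness, here is the Figure 1 model as a minimal JAGS sketch, with \(\pi_1\) held at a fixed constant. (This is my simplified version of the full hierarchical script appended at the end of this post; pi1 and pD1gM would be supplied in the data list.)

modelStringFixed = "
model {
  D ~ dbern( pD1gM[m] ) # detection rate if m=1, false-alarm rate if m=2
  m ~ dcat( pM[1:2] )   # model index with fixed prior probabilities
  pM[1] <- pi1          # p(M=1), e.g., 0.01
  pM[2] <- 1 - pi1      # p(M=2)
}
"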

Decision rules in model comparison sometimes use only the Bayes factor, but should not

One practice in model comparison is to base decisions on the Bayes factor, forgetting about or at least de-emphasizing the prior probabilities (and hence the posterior probabilities) of the models. A typical decision rule is,

If the Bayes factor is >3 in favor of one model relative to the other, accept the better model and reject the other. 
Applying the Bayes-factor-only decision rule to disease diagnosis would lead us to decide that a person has the disease whenever the diagnostic test is positive, because the Bayes factor is 19.4, which is much greater than 3. Using only the Bayes factor for decision making, instead of the posterior probabilities of the models, ignores Bayes' rule and is not Bayesian despite the terminology. A Bayes factor is only Bayesian in the full context of Bayes' rule, which converts prior probabilities to posterior probabilities. (Note that I am not concerned with the value 3 in the decision rule. The decision threshold could be 3 or 10 or whatever. Rather, the problem is not taking into account the prior probabilities of the models.)

The usual justification for using the Bayes factor alone is twofold. First, if the prior probabilities of the models are all equal, then the posterior odds algebraically equal the Bayes factor. (You can see that for yourself from Equation \ref{eq:BF}, above.) Setting the prior probabilities to be equal seems to be an innocuous default for many model-comparison settings, even if it's an absurd assumption for random screening of diseases or drug usage. Second, even if unequal prior probabilities for the models are assumed, the Bayes factor tells us how to modify those prior odds to arrive at the corresponding posterior odds. Therefore it is informative to report the Bayes factor.

The two justifications must be used carefully. First, I doubt that most model-comparison settings really are well represented by assuming equal prior probabilities of the models. Setting the prior probabilities equal to each other is an assertion of equality, not an expression of uncertainty; the next section discusses uncertainty. Setting the prior probabilities to 0.50/0.50 should be based on as much prior information as setting the prior probabilities to 0.01/0.99. Second, even if the Bayes factor is reported to summarize the analysis, the Bayes factor should not be the basis for decision making. The decision should be based on the posterior probabilities of the models.

Modeling uncertainty in the prior probabilities of the models

The usual formal set-up, as in the equations above, assumes specific exact values for the prior probabilities of the models. That is to say, \(p(M\!\!=\!\!m)\) has a specific exact value such as
\(p(M\!\!=\!\!1) = 0.01\) for the "has disease" model. But what if we are uncertain about the prior probabilities of the models? Modeling uncertainty is exactly what Bayesian analysis is all about.

In a previous version of this post, I made a mathematical derivation that seemed reasonable but which did not match my intuition for how the answer should play out. I asked readers to let me know if they could point out errors in the derivation. In a series of emails, Tomasz Miąsko (tomasz.miasko@gmail.com) patiently helped me to reconceptualize the derivation, and he also provided me with a simple JAGS script to verify the result. Thank you, Tomasz! I finally have clarity and here I present an elaborated derivation and JAGS script.

Denote \(p(M\!\!=\!\!1)\) by the parameter \(\pi_1\), which can take on any value from 0 to 1. Of course, \(p(M\!\!=\!\!2) = 1-\pi_1\). We express our uncertainty in \(\pi_1\) as a probability distribution \(p(\pi_1)\). For example, \(p(\pi_1)\) could be a beta distribution that has mode \(\omega\) and concentration \(\kappa\). (Hence the shape parameters for the beta distribution are \(a=\omega (\kappa-2) + 1\) and \(b= (1-\omega) (\kappa-2) + 1\).) If we are extremely certain about the prior probability of the model, then \(\kappa\) has a very large value, meaning that the prior distribution on \(\pi_1\) is tightly concentrated at \(\omega\), and the analysis converges to the point value assumed in the equations above. If we are uncertain about the prior probability of the model, then \(\kappa\) is a small-ish value, with \(\kappa \ge 2\). When \(\kappa=2\) the prior is a uniform distribution and we have "utter" uncertainty in the value of \(\pi_1\). Figure 2 shows a diagram of the model.
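As a convenience, here is a small R helper (my own naming for this post) that converts the mode and concentration to the beta shape parameters, showing how the prior on \(\pi_1\) changes with \(\kappa\):

# Beta shape parameters from mode omega and concentration kappa:
betaShapesFromModeKappa = function( omega , kappa ) {
  stopifnot( kappa >= 2 )
  c( a = omega*(kappa-2)+1 , b = (1-omega)*(kappa-2)+1 )
}
betaShapesFromModeKappa( omega=0.01 , kappa=20000 ) # a=200.98, b=19799.02: tightly peaked
betaShapesFromModeKappa( omega=0.01 , kappa=20 )    # a=1.18, b=18.82: modest knowledge
betaShapesFromModeKappa( omega=0.01 , kappa=2 )     # a=1, b=1: uniform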

Figure 2: Disease diagnosis with uncertainty in the base rate of the disease. See Figure 1 for a description of the lower part of the diagram. Near the top of the diagram is the base rate of the disease, where the prior probability of having the disease is denoted \(\pi_1\). The uncertainty in the value of \(\pi_1\) is represented by the beta distribution at the top of the diagram, which has mode \(\omega\) and concentration \(\kappa\).

Notice that the overall model involves two parameters: \(M\) and \(\pi_1\). Consider Bayes' rule for the two-parameter space (as pointed out to me by Tomasz Miąsko):
\begin{align}
p( M , \pi_1 | D )
&= p( D | M , \pi_1 ) p( M , \pi_1 ) / p^*(D)
\notag
\\
&= p( D | M ) p( M | \pi_1 ) p( \pi_1 ) / p^*(D)
\label{eq:pMpi1gD}
\end{align}
where the second equality uses \(p( D | M , \pi_1 ) = p( D | M )\), which holds because the datum depends on \(\pi_1\) only through the model index \(M\), together with the factorization \(p( M , \pi_1 ) = p( M | \pi_1 ) p( \pi_1 )\). Crucially, \(p^*(D)\) is the marginal over the entire two-parameter space, a sum over \(m\) together with an integral over \(\pi_1\):
\begin{align}
p^*(D)
&=
\sum_m \int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( D | M ) p( M | \pi_1 ) p( \pi_1 )
\label{eq:pD2param}
\end{align}
Notice that \(p^*(D)\) is a constant, not a function of \(\pi_1\).

To compute the posterior probability of \(M\) given data \(D\), we marginalize the two-parameter posterior of Equation \ref{eq:pMpi1gD}:
\begin{align}
p(M|D)
&=
\int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( M , \pi_1 | D )
\notag
\\
&=
\int_{\pi_1=0}^{\pi_1=1} d\pi_1 p( D | M ) p( M | \pi_1 ) p( \pi_1 ) / p^*(D)
\notag
\\
&=
\frac{ p( D | M )}{ p^*(D) } \int_{\pi_1=0}^{\pi_1=1} d\pi_1  p( M | \pi_1 ) p( \pi_1 )
\label{eq:pMgD2param}
\end{align}
Notice in Equation \ref{eq:pMgD2param} (above) that \(p( D | M )\) and \(p^*(D)\) were pulled in front of the integral sign, which is valid because neither is a function of \(\pi_1\). For the two models, Equation \ref{eq:pMgD2param} becomes
\begin{align}
p(M=1|D)
&=
\frac{ p( D | M=1 )}{ p^*(D) }
\int_{\pi_1=0}^{\pi_1=1} d\pi_1  \pi_1 p( \pi_1 )
\notag
\\
&\mbox{and}
\notag
\\
p(M=2|D)
&=
\frac{ p( D | M=2 )}{ p^*(D) }
\int_{\pi_1=0}^{\pi_1=1} d\pi_1  (1-\pi_1) p( \pi_1 )
\notag
\end{align}
The posterior odds are therefore
\begin{align}
\frac{ p(M=1|D) }{ p(M=2|D) }
&=
\frac{ p( D | M=1 ) }{ p( D | M=2 ) }
\times
\frac{ \int_{\pi_1=0}^{\pi_1=1} d\pi_1  \pi_1 p( \pi_1 ) }{ \int_{\pi_1=0}^{\pi_1=1} d\pi_1  (1-\pi_1) p( \pi_1 )}
\label{eq:postOdds2param}
\end{align}

In particular, notice that the two integrals in Equation \ref{eq:postOdds2param} are simply \(E[\pi_1]\) and \(1-E[\pi_1]\), so the prior-odds factor is \(E[\pi_1]/(1-E[\pi_1])\). When the distribution \(p( \pi_1 )\) is uniform, with \(p( \pi_1 )=1.0\) everywhere, then \(E[\pi_1]=0.5\) and the prior odds in Equation \ref{eq:postOdds2param} are 1/1. Thus, when there is complete uncertainty in \(\pi_1\), as expressed by a uniform distribution, the posterior odds match the Bayes factor.
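Because the prior-odds factor reduces to \(E[\pi_1]/(1-E[\pi_1])\), and a beta distribution with shape parameters \(a\) and \(b\) has mean \(a/(a+b)\), the posterior odds for a beta prior can be computed in closed form. Here is a minimal R sketch (the function name is mine) that anticipates the JAGS results reported below:

# Analytic posterior odds from the posterior-odds equation above,
# for a beta prior on pi_1 with mode omega and concentration kappa:
postOdds = function( omega , kappa , BF=0.97/0.05 ) {
  a = omega*(kappa-2)+1      # beta shape parameters from
  b = (1-omega)*(kappa-2)+1  #   mode and concentration
  meanPi1 = a/(a+b)          # E[pi_1]
  BF * meanPi1/(1-meanPi1)   # Bayes factor times prior odds
}
postOdds( omega=0.01 , kappa=20000 ) # 0.197
postOdds( omega=0.01 , kappa=20 )    # 1.22
postOdds( omega=0.01 , kappa=10 )    # 2.35
postOdds( omega=0.01 , kappa=2 )     # 19.4, the Bayes factor itself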

[What was wrong with the derivation of the previous post superseded by this post? The previous derivation began with the wrong conditional, but I didn't realize it because the notation suppressed the conditioned variable and hid the problem. The previous derivation started with \(p(M=1|D)\) without explicitly noting that it was conditional on \(\pi_1\). Therefore it was easy to think, in fact difficult not to think, that the derivation began with \(p(M=1,\pi_1|D)\) when in fact it began with \(p(M=1|D,\pi_1)\). Thus, when integrating over \(\pi_1\), the result is \(\int d\pi_1 \, p(M=1|D,\pi_1) \) and not the desired \(\int d\pi_1 \, p(M=1,\pi_1|D) \). Thus, the derivation of the previous post derived some strange quantity and not the posterior probability of \(M\).]


Numerical results from R and JAGS

Here are some results from the R script for JAGS appended below. In all cases, the mode of the prior on \(\pi_1\) is \(\omega=0.01\).

When the prior on \(\pi_1\) is extremely concentrated around \(\omega=0.01\), with \(\kappa=20000\), a run of JAGS produces
P(M=1|D)=0.165; P(M=2|D)=0.835; P(M=1|D)/P(M=2|D)=0.198
This result essentially matches the numerical result in the opening paragraphs of this post, which assumed the single fixed value \(\pi_1 = 0.01\).

When the prior on \(\pi_1\) is only modestly concentrated around \(\omega=0.01\), with \(\kappa=20\), a run of JAGS produces
P(M=1|D)=0.547; P(M=2|D)=0.453; P(M=1|D)/P(M=2|D)=1.21
Notice that the posterior odds are now slightly in favor of having the disease, but are far from the Bayes factor of 19.4.


When the prior on \(\pi_1\) is weakly concentrated around \(\omega=0.01\), with \(\kappa=10\), a run of JAGS produces
P(M=1|D)=0.701; P(M=2|D)=0.299; P(M=1|D)/P(M=2|D)=2.34
Notice that the posterior odds are leaning in favor of having the disease, but have still not exceeded a ratio of 3, and are still far from the Bayes factor of 19.4.


When the prior on \(\pi_1\) is completely uncertain around \(\omega=0.01\), that is, a uniform distribution with \(\kappa=2\), a run of JAGS produces
P(M=1|D)=0.951; P(M=2|D)=0.0493; P(M=1|D)/P(M=2|D)=19.3
Here, with utter uncertainty in the prior probabilities of the models, the posterior odds match the Bayes factor, as was commented just after Equation \ref{eq:postOdds2param}.
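As an additional cross-check that does not use MCMC at all, the \(\kappa=2\) result can be reproduced by direct numerical integration of Equations \ref{eq:pD2param} and \ref{eq:pMgD2param} (a sketch; the function names are mine):

# Direct numerical integration for the uniform prior (kappa=2):
f1 = function(pi1) { 0.97 * pi1     * dbeta(pi1,1,1) } # m=1 term of p*(D=1)
f2 = function(pi1) { 0.05 * (1-pi1) * dbeta(pi1,1,1) } # m=2 term of p*(D=1)
pStarD = integrate(f1,0,1)$value + integrate(f2,0,1)$value # p*(D=1) = 0.51
integrate(f1,0,1)$value / pStarD # p(M=1|D=1) = 0.951, matching JAGS above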

Recap: The posterior odds of the models are not the same thing as the Bayes factor of the models. Decisions should be based on the posterior probabilities. The Bayes factor does not take into account the prior probabilities of the models (and despite its name the Bayes factor by itself does not even use Bayes' rule). The derivations and numerical examples above have shown that even with a lot of uncertainty in the prior probabilities of the models, the posterior odds can be quite different from the Bayes factor. The posterior odds numerically equal the Bayes factor when (i) there is utter uniform uncertainty in the prior probabilities of the models or (ii) there is complete prior knowledge that the models have exactly equal prior probability. I suspect that neither of those conditions occurs in reality. [Revised 28-Dec-2015, 9:35am] The posterior odds numerically equal the Bayes factor when (as a broad sufficient but not necessary condition) the uncertainty distribution of the model prior probability is symmetric around exactly 0.5, so that \(E[\pi_1]=0.5\) in Equation \ref{eq:postOdds2param}; I suspect that this rarely occurs in reality.

The R and JAGS script:

# Original script by Tomasz Miasko 24-Dec-2015 tomasz.miasko@gmail.com
# Elaborated by John Kruschke 26-Dec-2015 johnkruschke@gmail.com

rm(list=ls())

library(rjags)

# JAGS model for disease diagnosis example, 27-December-2015,
# http://doingbayesiandataanalysis.blogspot.com/

# Datum D is outcome of diagnostic test(s) for a single individual.
# D=1 suggests disease is present, D=0 suggests disease is absent.
# pD1gM[m] is probability of D=1 given model index m.
# m=1 means disease is truly present, m=2 means disease is truly absent.

modelString = "
model {
  for ( i in 1:length(D) ) {
    D[i] ~ dbern( pD1gM[m] ) # pD1gM[1:2] are constants provided in data list.
  }
  m ~ dcat( pM[1:2] )   # model index probability.
  pM[1] ~ dbeta( omega*(kappa-2)+1 ,
                (1-omega)*(kappa-2)+1 ) # prior probability for m=1
  pM[2] <- 1-pM[1]      # prior probability for m=2
  # omega and kappa are constants provided in data list.
}
"

# Specify constants:
# The result of the diagnostic test(s) for a single individual:
D = c(rep(1,1),rep(0,0)) # use D=NA for prior
# Probability D=1 for m=1,2; i.e.,
# correct detection rate and false alarm rate:
pD1gM = c(0.97,0.05) # high value for disease present, low value for absent
# Modal base rate of disease present:
omega = 0.01
# Certainty of modal base rate (kappa >= 2):
kappa = c(2,20,20000)[3]

model = jags.model(
  file=textConnection(modelString),
  data=list( D=D , pD1gM=pD1gM , omega=omega , kappa=kappa ),
  n.chains=4)
update( object=model, n.iter=1000 )
samples = coda.samples(
  model=model,
  variable.names=c("m","pM"),
  n.iter=125000) # total MCMC steps = n.chains*n.iter
chain = as.matrix(samples)

pM1gD = sum(chain[,"m"]==1)/length(chain[,"m"])
pM2gD = sum(chain[,"m"]==2)/length(chain[,"m"])

print( paste0( "For D = ", sum(D) , " 1's and " , length(D)-sum(D) , " 0's" ,
              ": P(M=1|D)=" , signif(pM1gD,3) ,
              "; P(M=2|D)=" , signif(pM2gD,3) ,
              "; P(M=1|D)/P(M=2|D)=", signif(pM1gD/pM2gD,3) ) )
hist( chain[,"pM[1]"] , breaks=51 ,
      main=bquote("p( pM[1] | D=" *.(sum(D))* " 1's and " 
                    *.(length(D)-sum(D))* " 0's )" ) ,
      col="grey" , border="grey" , xlim=c(0,1) )

4 comments:

  1. Thanks for a detailed post! :) For the longest time I actually thought Bayes factors referred to posterior odds, as that was how people used them...

    Btw, while it's a really good example in this post, I really don't like the widespread use of the disease-diagnosis situation to introduce Bayes. In my mind, Bayes is a lot about modeling uncertainty, but the disease-diagnosis example starts out by giving these, for some reason, extremely certain detection/false-alarm rates, which in any "real" situation would be pretty uncertain too. So it's a good example if you want to confuse people, imho :)

  2. Right, the disease-diagnosis example is so over-used (without further examples involving continuous parameters) that a lot of beginners think that Bayesian analysis only applies to discrete hypotheses.

    And yes, an even more realistic expansion of the model would not only include uncertainty in the base rates, as was done in this post, but would also include uncertainty in the detection rate and false-alarm rate.

  3. For another treatment of uncertainty in the prior probability along with uncertainty in the Bayes Factor (that is, the ratio of the test's sensitivity and specificity) see

    Mossman D, Berger JO: Intervals for posttest probabilities: a comparison of 5 methods. Med Decis Making 2001;21:498-507

    available here:
    http://mdm.sagepub.com/content/21/6/498.full.pdf

  4. In Equation (2), I need a bit of explanation of how the second equality holds.
    It seems to require p(D|M,π1)=p(D|M), but isn't that a marginal probability over π1?
