This post was provoked by a recent blog post at the New Yorker by Gary Marcus and Ernest Davis, criticizing Nate Silver's book, *The Signal and the Noise*. In the course of their critique, Marcus and Davis work through a standard textbook example involving diagnosis of breast cancer. They assume, for purposes of the example, that there is a well-known prior probability of breast cancer. They say,

A Bayesian approach is particularly useful when predicting outcome probabilities in cases where one has strong prior knowledge of a situation. ... But the Bayesian approach is much less helpful when there is no consensus about what the prior probabilities should be.

I think that the second sentence above is exactly wrong. The Bayesian approach is also helpful, perhaps uniquely helpful, when there is uncertainty in the prior. All we have to do is express the uncertainty in a mathematical model, and then let Bayesian inference tell us how to re-allocate uncertainty given the data. The present blog post is an explicit illustration of one way to do this.

Please note that this post is not defending or criticizing Nate Silver's book. This post is about what Marcus and Davis say about Bayesian methods, illustrated specifically by the case of disease diagnosis. This post is not about anything Silver does or doesn't say about Bayesian or non-Bayesian methods. My goal is to clarify what Bayesian methods can do, and specifically one way for expressing uncertainty in the case of disease diagnosis.

First, the standard textbook example, as provided by Marcus and Davis in their blog:

Suppose, for instance (borrowing an old example that Silver revives), that a woman in her forties goes for a mammogram and receives bad news: a "positive" mammogram. However, since not every positive result is real, what is the probability that she actually has breast cancer? To calculate this, we need to know four numbers. The fraction of women in their forties who have breast cancer is **0.014** [bold added], which is about one in seventy. The fraction who do not have breast cancer is therefore 1 - 0.014 = 0.986. These fractions are known as the prior probabilities. The probability that a woman who has breast cancer will get a positive result on a mammogram is 0.75. The probability that a woman who does not have breast cancer will get a false positive on a mammogram is 0.1. These are known as the conditional probabilities. Applying Bayes's theorem, we can conclude that, among women who get a positive result, the fraction who actually have breast cancer is (0.014 x 0.75) / ((0.014 x 0.75) + (0.986 x 0.1)) = 0.1, approximately. That is, once we have seen the test result, the chance is about ninety per cent that it is a false positive.

In the example above, the prior probability (emphasized by bold font) is assumed to be exactly 0.014. This probability presumably came from some gigantic previous survey of women, using some extremely accurate assay of the disease, so that the prior probability is conceived as having no uncertainty.
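The arithmetic of the quoted example is easy to check; this is just Bayes' rule applied to the stated numbers (a Python sketch of my own, not code from Marcus and Davis):

```python
# Bayes' rule for the standard textbook example quoted above.
prior = 0.014        # P(disease): assumed base rate for women in their forties
hit_rate = 0.75      # P(positive test | disease)
false_alarm = 0.10   # P(positive test | no disease)

# P(disease | positive) = P(pos|dis) * P(dis) / P(pos)
posterior = (prior * hit_rate) / (prior * hit_rate + (1 - prior) * false_alarm)
print(posterior)  # about 0.096, i.e., roughly one in ten
```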

But what if there is uncertainty about the prior probability of the disease? What if, instead of a gigantic previous survey, there was only a small survey, or no previous survey at all? Bayesian methods handle this situation well. I've got to introduce a little mathematical notation here, but the ideas are straightforward. Let's denote the true probability of the disease in the population by the symbol θ (Greek letter theta). The standard example above stated that θ=0.014. But in Bayesian methods, we can put a distribution of relative credibility across the entire range of θ values from 0 to 1. Instead of assuming we believe only in the exact value θ=0.014, we say there is a spectrum of possibilities, and the prior knowledge expresses how certain we are in the various possible values of θ. The distribution across the range of θ values could be very broad --- that's how we express *un*certainty. Or, the distribution could be sharply peaked over θ=0.014 --- that's how we express certainty.

Here is the standard example in which there is high certainty that θ=0.014, with the positive test result denoted mathematically as *y=1*.

The left panel shows a sharp spike over the value θ=0.014, indicating that all other values have essentially zero credibility. (Ignore the messy numbers under "95% HDI." They are irrelevant for present purposes.) This "spike" distribution can be thought of as being based on a previous survey of 20,000 women. That's why the header of the left panel says "Prior Certainty: 20000." The right panel above shows the posterior probability of having the disease, given the positive test result (y=1). The right bar, centered on disease=1.0, has a height of 0.097, which is the posterior probability of having the disease.

*This result closely matches the result stated by Marcus and Davis in their example*.

We can use Bayesian inference seamlessly when the prior is less certain. Suppose, for example, that the previous survey involved only 20 women. Then the prior is a broader distribution over possible underlying probabilities of the disease, and the result looks like this:

The left panel shows a distribution that is spread out over θ, meaning that we are not very certain what the value of θ is. (The displayed distribution is actually the posterior distribution of θ given the test result; the prior distribution looks similar.) In particular, lots of values of θ greater than 0.014 are considered to be possible. The right panel above shows the posterior probability of having the disease, given the positive test result. Notice that the Bayesian inference indicates that the probability of having the disease is about 0.336, much more than the result of 0.097 when the prior was highly certain. This makes sense; after all, the prior knowledge is uncertain, so the test result should weigh more heavily.

We can use Bayesian inference seamlessly even when the prior is hugely uncertain. Suppose that the previous survey involved only 2 women, which abstractly means merely that we know the disease can happen, but that's all we know. Then the prior distribution looks like this:

The prior, displayed above, shows that any value of θ is equally possible, and the probability of having the disease, according to this prior, is 50-50. The posterior distribution, given a positive test result, looks like this:

Notice that the posterior probability of having the disease is 0.882. Why so high? Because we have very uncertain prior knowledge about the disease, and all we have to go by is the outcome of the diagnostic test. (The Bayesian inference also automatically updates the distribution of credibility across θ as shown in the left panel above. It shows that, given the single positive test result, we should shift our beliefs about the underlying probability of the disease toward higher values.)
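To make the three examples concrete: as explained later in the comments, the figures were generated in JAGS with a beta prior on θ having mode 0.014 and "certainty" κ set to 20000, 20, or 2. For this particular model, the marginal posterior probability of disease can also be computed in closed form from the prior mean of θ. The Python sketch below (my own illustration, not the post's JAGS code) reproduces the three reported values up to simulation error:

```python
def p_disease_given_positive(kappa, mode=0.014, hit=0.75, fa=0.10):
    """Marginal P(disease | positive test) when the base rate theta has a
    beta prior with the given mode and certainty kappa (the
    parameterization described in the comments below)."""
    a = mode * (kappa - 2) + 1        # beta shape parameters derived
    b = (1 - mode) * (kappa - 2) + 1  # from mode and concentration kappa
    mean_theta = a / (a + b)
    # Marginalizing over theta, the answer depends on the prior
    # only through the mean of theta:
    return (mean_theta * hit) / (mean_theta * hit + (1 - mean_theta) * fa)

for kappa in (20000, 20, 2):
    print(kappa, p_disease_given_positive(kappa))
# kappa=20000 gives about 0.097; kappa=20 gives about 0.334 (post: 0.336);
# kappa=2 (a uniform prior) gives about 0.882.
```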

In this post I've tried to show that Bayesian inference is seamlessly useful regardless of how much uncertainty there is in prior knowledge. The simplistic framing of the standard textbook example should not be construed as the only way to do Bayesian analysis. Indeed, the whole point of Bayesian inference is to express uncertainty and then re-allocate credibility when given new knowledge. If there is lack of consensus in prior knowledge, then the prior should express the lack of consensus, for example with a broad prior distribution as illustrated in the examples above. If different camps have different strong priors, then Bayesian analysis can tell each camp how they should re-allocate their idiosyncratic beliefs. With enough data, the idiosyncratic posterior distributions will converge, despite starting with different priors.
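The convergence claim in the last sentence is easy to illustrate. Here is a Python sketch (my own, with hypothetical survey data) in which two very different beta priors, updated by the same observations, end up with nearly identical posterior means:

```python
# Two "camps" with different beta priors on the disease base rate theta.
# Beta priors are conjugate for Bernoulli data: observing s cases among n
# people updates Beta(a, b) to Beta(a + s, b + n - s).
camps = {"optimist": (1.0, 99.0),   # prior mean 0.01
         "skeptic":  (50.0, 50.0)}  # prior mean 0.50

s, n = 1400, 100000  # hypothetical large survey: 1,400 cases in 100,000 people

post_means = {}
for name, (a, b) in camps.items():
    # Posterior mean of Beta(a + s, b + n - s):
    post_means[name] = (a + s) / (a + b + n)
    print(name, round(post_means[name], 4))
# Both posterior means land close to the data proportion, 0.014.
```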

By the way, I am not aware of the sort of analysis I've provided above appearing elsewhere in the literature. But it's straightforward, and I imagine it must be "out there" somewhere. If any of you can point me to it, I'd appreciate it.

## 24 comments:

John --

It's an interesting and elegant example. I haven't seen it before. Our statement was carelessly worded. It is certainly the case that Bayesian analysis can work well when the prior distribution has a wide variance. But there is no question in your examples as to prior distribution should be. Our point, poorly expressed, was that Bayesian analysis becomes problematic when there is no consensus about the prior distribution, as in our example of assigning a prior to the Milgram experiments.

-- Ernie Davis

Sorry "as to prior distribution" should be "as to what the prior distribution ..."

Come to think of it, in a probability course I gave this fall, I gave a homework example of the broadly general structure: You have a bin of coins with specified frequencies of different weightings; you flip a coin twice and observe the result; then you ask what is the probability of heads on the next flip. So here the initial distribution is explicit and discrete, as opposed to implicit and uniform, but otherwise the structure of the inference is the same.

http://cs.nyu.edu/courses/Fall12/CSCI-GA.1180-001/ex4.pdf

"the probability of having the disease, according to this prior, is 50-50." This, I suppose, is to reassure readers who might be dismayed by 0.5? Do you really think that such a reader exists?

I wish you had given some idea of the mechanics of this. Still, I suppose I learned where the mechanics goes, at least, so thanks for that.

On third thought, my assignment has the revised prior but nothing that corresponds to the test and the posterior. The increased importance of the test as the prior becomes vaguer in your example is striking.

-- Ernie

Ernie:

Thanks very much for your thoughtful comments. As I understand it, your main point came in your first comment: "Our point ... was that Bayesian analysis becomes problematic when there is no consensus about the prior *distribution* ..."

Again, I think that is wrong. The foundational purpose of Bayesian analysis is to express uncertainty via a mathematical model and then let Bayes' rule do the correct re-allocation of uncertainty when given new data. When there is meaningful uncertainty, at any level, the model should be devised to express that uncertainty.

In particular, when there is uncertainty in what distribution to put on a parameter, such as the distribution on parameter theta in my example, then you simply express that uncertainty in a higher-level distribution. I put a beta density prior on theta (unstated in the blog post because I didn't want the forest obscured by the trees). The beta distribution has two shape constants; the values of those constants express our prior knowledge, which in this case can be thought of as the sample size of the previous survey and the proportion of disease cases in the previous survey. If we have uncertainty in those numbers, as you're suggesting, the Bayesian framework handles it: We just put a higher-level distribution on those values, such that the higher-level distribution describes our uncertainty.

You (or other readers) might think, "Aha, gotcha now! The approach suffers from an infinite regress of higher levels that never end, as we have meta-meta-uncertainty about meta-uncertainty about uncertainty, and therefore the hierarchical Bayesian approach is doomed never to get anywhere." I don't think so. It's infinite regress in much the same way that Zeno's paradox is infinite regress: Each level becomes less and less influential in the actual Bayesian result -- and the fact that Bayesian analysis shows the diminishing influence of meta-meta-uncertainty is itself a lovely revelation of Bayesian analysis. Moreover, in real applications, expressions of meta-meta-meta-uncertainty become meaningless.

To reiterate my statement above: The foundational purpose of Bayesian analysis is to express uncertainty via a mathematical model and then let Bayes' rule do the correct re-allocation of uncertainty when given new data. When there is meaningful uncertainty, at any level, the model should be devised to express that uncertainty.

Thank you again for your comments!

--John

Dear John

Thank you very much for the informative post.

I wonder if you could elaborate shortly on the methods you used?

As I understood it, you chose a beta prior with mean(?) 0.014 and with the equivalent information of a + b = 20000 (in the first example)? And then you used bayes' formula to calculate the posterior?

Cheers!

-- Serdash

A prior probability distribution might be useful for representing a single person's uncertainty at a given time, but how well can it work for representing the readers of a paper? Even if we were able to accurately represent a community's consensus opinion as a probability distribution, this consensus can change and the paper's analysis would rapidly become dated.

The "different camps" approach seems more promising. One could provide an optimist's and skeptic's analysis and show how the results differ. Some people might say that the skeptic's analysis isn't skeptical enough, but it should cover most readers.

[Deborah Mayo e-mailed me with this comment, which she reports she couldn't get the blogger system to post.]

The most serious problem with Bayesianism as far as being relevant for science is that it requires an exhaustive set of hypotheses and assignments of prior degrees of belief (or the like) to them. This renders it a closed universe, and inferences to a hypothesis H would vary greatly according to the prior given to “H is false”---the Bayesian catchall hypothesis. Is there a deflection of light of approximately the amount Einstein’s GTR predicts? Does prion infection occur without nucleic acid? As with most hypotheses of interest, these are not events, they are claims about some aspect of the phenomenon giving rise to the data. Scientists do not and would not want to assign weights to hypotheses and their denials, when they are setting out to learn new things. There are legitimate relative frequencies for disease incidence, and we might be able to determine the appropriate reference class for the event that someone has the disease but, despite Reichenbach’s hope for the future, few people harbor the idea that the evidence for a hypothesis is determinable by considering how often certain types of hypotheses have (apparently) held up (or not) in the past.

Despite years of trying to use methods of subjective elicitation, these attempts are considered unreliable and a diversion from the work of model building, even by leading Bayesians. The priors generally in use are “default” and not even intended to represent beliefs; instead the goal is for them to have a minimal effect on the analysis. Well I’ve said all this before (errorstatistics.com).

Thank you.

D. Mayo

Kruschke,

This example does not seem correct to me.

You claim that the "uncertainty" about theta will change the result, but that does not seem true.

What changes the result of P(d=1|y=1) is the mean of theta, not the distribution of theta!

See if I'm right.

We assume that P(d=1|theta)=theta.

So, the "unconditional" probability of having a disease is

P(d=1) = integral[P(d=1|theta)*P(theta)] = integral[theta*P(theta)]

P(d=1) = mean of theta

And that is what is used in calculating P(d=1|y=1).

That is, P(d=1|y=1) = P(y=1|d=1)P(d=1)/[P(d=1|y=1)P(d=1)+P(d=1|y=0)P(d=0)]

P(d=1|y=1) = P(y=1|d=1)*(mean of theta)/[P(d=1|y=1)(mean of theta)+P(d=1|y=0)(1-mean of theta)]

So, as far as I see, the only information needed to calculate P(d=1|y=1) is the mean of theta, regardless of how wide the distribution is.

So, in your first example, the mean of theta is 0.014.

And then P(d=1|y=1) = (0.75*0.014)/[(0.75*0.014)+(0.1*(1-0.014))] = 0.097 (or 0.096, depending on the decimals after 0.014)

In the third example, the mean of theta is 0.5.

Then, of course, P(d=1|y=1) = (0.75*0.5)/[(0.75*0.5)+(0.1*(1-0.5))] = 0.882

But this result came not because the "uncertainty" about theta was bigger, but just because the mean of theta was different. You assumed that the unconditional probability of a randomly selected person having the disease is 50%! Then, of course, the result will be much higher!

If instead of that prior we used a prior with a big standard deviation but with mean 0.014, as far as I see we would get the same result... right?

Thanks!

Carlos

**Correcting the Bayes rule:

P(d=1|y=1) = P(y=1|d=1)P(d=1)/[P(y=1|d=1)P(d=1)+P(y=1|d=0)P(d=0)]

P(d=1|y=1) = P(y=1|d=1)*(mean of theta)/[P(y=1|d=1)(mean of theta)+P(y=1|d=0)(1-mean of theta)]

Dear Carlos:

Perhaps this will clarify.

For any specific value of theta, it is straightforward to compute the posterior probability of having the disease, given the positive test result.

The example merely considers every possible value of theta, from 0 to 1, and computes the posterior probability of having the disease given a positive test result for every possible value of theta.

Then we simply integrate across those posterior probabilities, weighted by the prior probability of each value of theta.

--John

Ok, I see the mathematical difference between what I did and what you did.

Taking a Beta prior, what you did was something like this in R:

integrand <- function(x) {dbeta(x,a,b)*(0.75*x)/((0.75*x)+(0.1)*(1-x))}

integrate(integrand, lower = 0, upper = 1)

Right?

But is that the right calculation?

Let me give a simple example:

Suppose theta can be either 1 or 0 with 50/50 chance.

Then what is the chance that a randomly selected person who has tested positive has the disease?

I would calculate like this:

First, the person was selected randomly with the prior. Then the (unconditional) probability that she has a disease is:

P(d=1) = SumOverTheta[P(d=1|theta)] = 1*50%+0*50% = 50%

Then the person would be tested.

She tested positive and I would apply bayes, using P(d=1) = 50%:

P(d=1|y=1) = 88%

Now, if I understood right, you would do something like this:

First you would have each separate world test for the disease. That is, you would calculate:

P(d=1|y=1, theta)

For theta=1 and for theta=0.

Then you would randomly sample from everyone who tested positive in each group using the prior, that is, getting 50 of theta=1 and 50 of theta=0, and calculate the probability of selecting one who has the disease. You would get:

SumOverTheta[P(d=1|y=1, theta)] = 75%*50% + 0%*50% = 37.5%

It seems to me that this second calculation is not right. The calculation above, I think, is something akin to the probability of selecting a person with a disease if everyone in both worlds had taken the test and you only sampled from the ones who tested positive. This is different from the probability that a randomly selected person who tested positive has the disease, which is what we want to know, right?

Dear Carlos:

Let theta be the base rate (i.e., prior probability) of the disease. Then, from Bayes' rule,

p(disease|positive test) = (theta x 0.75) / ((theta x 0.75) + ((1.0-theta) x 0.1))

An example of that was given in the blog post (quoted from Marcus and Davis) for theta=0.014.

When there is uncertainty about theta, we simply compute that posterior probability on average across the values of theta:

p(disease|positive test) = Integral_theta p(theta) * (theta x 0.75) / ((theta x 0.75) + ((1.0-theta) x 0.1))

In the blog post, I simulated that in JAGS instead of doing the integral mathematically.

Hi, John, I understand

But even considering that p(disease|positive test) = Integral_theta p(theta) * (theta x 0.75) / ((theta x 0.75) + ((1.0-theta) x 0.1)) is the correct way to calculate the probability that a randomly selected person who just took a test and got positive has the disease, take the example above, where the prior for theta is P(theta=1)=P(theta=0)=50%

If we calculate p(disease | positive test) like you said, we get 50%. (The 37.5% above was wrong; I wrote the answer to another problem I was solving!) Or we could put a prior of 25%, 50%, 25% for thetas 0, 1/2 and 1, and this would give us about 69% on the prior, if I did not miscalculate again!

But if instead we are completely sure that theta is 50% then p(disease | positive test)=88%

So we have distributions with the same mean for theta, but one is completely certain. And the more uncertain ones have lower posteriors, not the opposite. That contradicts the claim that "Because we have very uncertain prior knowledge about the disease, and all we have to go by is the outcome of the diagnostic test".

It seems to me that the examples show that the test result will be "more informative" about the disease not because there is more uncertainty, but just because the priors are completely different, with the mean of theta several orders of magnitude higher!

Sorry to bother you with all the questions, and I really appreciate the help and support you give to the readers of your book!

Thanks

Carlos

***69% on the posterior

All I meant by "very uncertain prior knowledge" was a prior on theta that is very diffuse (e.g., uniform from 0 to 1). In that case, the relative hit rate and false alarm rate dominate the posterior inference. But that's just imprecise language trying to give intuitive interpretation to the math; ultimately the math drives the conclusion.

But this is not just about imprecise language; it's your core critique of Gary Marcus and Ernest Davis's claim that the Bayesian approach is not so useful when there's no consensus about the prior.

It seems to me that you tried to show them that it is easy to incorporate uncertainty about the prior and still get intuitive and agreeable results. But that's not what the results seem to show, because all three priors are completely different from each other!

The last prior is not just about less consensus, because it assumes that a randomly selected person has a 50/50 chance of having the disease, and its 95% HDI excludes 0.014, which is the mean of the first prior!

I used the R code below to calculate the posterior with different beta priors with the same mean, 50%, but different "certainties", that is, different a+b (don't know if the code is correct, but it seems to be):

integrand <- function(x) {dbeta(x,a,b)*(0.75*x)/((0.75*x)+(0.1)*(1-x))}

integrate(integrand, lower = 0, upper = 1)

If this code is right, when we have a+b = 20,000 and a/(a+b) = 50%, the posterior is 88.2%. Which is the posterior when we are exactly certain, as it should be!

Then if we have a+b = 2 and a/(a+b) = 50%, which is much, much more uncertain, then the posterior is 79.6%!

Maybe I'm not getting the math right, but if it is right, the result of more uncertainty in this case does not seem to make a compelling argument. It does not seem to show that it is easy to add uncertainty and get intuitive and agreeable results like the one you stated:

"This makes sense; after all, the prior knowledge is uncertain, so the the test result should weigh more heavily."

Hello again Carlos.

Well, I'm not sure where our disconnect is happening, but I'll try again with an explicit R program.

First, what I've been talking about in these Comments is actually the "prior predictive" for the disease and not the actual full posterior on theta and disease (which I got from a JAGS program). But the prior predictive is close, so still serves to illustrate the point.

Here's a program in R to closely approximate the prior predictive:

```r
# PrDis, defined next, is "theta" of the Comments.
# Create comb of PrDis values, where PrDis is base rate (prior) of disease:
nbin = 10000
binWidth = 1/nbin
PrDis = seq( binWidth/2 , 1.0-binWidth/2 , binWidth )
# Specify prior on PrDis as a beta distribution:
modePrDis = 0.014 # mode
kappaPrDis = 20   # certainty. Try 20000, 20, 2
pPrDis = dbeta( PrDis ,
                modePrDis*(kappaPrDis-2)+1 ,
                (1.0-modePrDis)*(kappaPrDis-2)+1 )
# Compute probability of disease given positive test:
pPosTestGivenDis = 0.75   # hit rate
pPosTestGivenNoDis = 0.10 # false alarm rate
postDis = sum( binWidth * pPrDis
               * (PrDis * pPosTestGivenDis)
               / ( (PrDis * pPosTestGivenDis)
                   + ((1.0-PrDis) * pPosTestGivenNoDis) ) )
# Display result:
show( c( kappaPrDis , postDis ) )
```

Results:

```
kappaPrDis  postDis
     20000   0.0965
        20   0.2928
         2   0.7962
```

Thus, as the prior on theta gets wider --more uncertain-- the probability of having the disease, given the positive result, gets larger.

Again, this is the prior predictive, which means just integrating across the prior on theta, without actually doing the Bayesian updating of theta. The JAGS results in the main blog post also updated the distribution over theta.

Hope this helps clarify.

John,

Thanks for the code.

Now it is easy to spot the difference.

See, the means of the prior distributions you are creating are not the same.

The mean for the first prior is 0.0140486

The mean for the second prior is 0.0626

And the mean for the third prior is 0.5000

That's why the posteriors are different, it is not due to the variance, but due to the mean!

That's what I was trying to say.

***That's why the posteriors are so different - and getting higher - it is not due to the variance, but due to the mean! With higher mean, higher posterior!

That's what I was trying to say.

Try this code, it will keep the mean of the beta distribution constant, that is, mean = a/(a+b) constant

```r
# PrDis, defined next, is "theta" of the Comments.
# Create comb of PrDis values, where PrDis is base rate (prior) of disease:
nbin = 10000
binWidth = 1/nbin
PrDis = seq( binWidth/2 , 1.0-binWidth/2 , binWidth )
# Specify prior on PrDis as a beta distribution:
meanPrDis = 0.014 # mean
kappaPrDis = 2    # certainty. Try 20000, 20, 2
a = meanPrDis*kappaPrDis
b = kappaPrDis-a
pPrDis = dbeta( PrDis, a, b )
# Compute probability of disease given positive test:
pPosTestGivenDis = 0.75   # hit rate
pPosTestGivenNoDis = 0.10 # false alarm rate
postDis = sum( binWidth * pPrDis
               * (PrDis * pPosTestGivenDis)
               / ( (PrDis * pPosTestGivenDis)
                   + ((1.0-PrDis) * pPosTestGivenNoDis) ) )
mean = sum( binWidth * pPrDis * PrDis )
# Display result:
show( c( kappaPrDis , postDis, mean ) )
```

The results are:

```
kappaPrDis -> postDis
     20000 -> 9.6%
        20 -> 7.8%
         2 -> 4.1%
```

That is, the wider priors generate the lower posteriors!

In your example that did not happen because the mean of the prior was getting larger when you reduced kappaPrDis.

PS: I think this counterintuitive result is happening because this is not the proper way to compute the posterior... I think you can't just average the posterior on theta. I'm writing new code on what I think may be the right way; I'll post it later.

Thanks,

Carlos

Carlos:

1. The point of my main post remains: The essence of Bayesian analysis is to specify uncertainty and then let Bayes' rule do re-allocation of the uncertainty when given new data. Thus, when Marcus and Davis said that Bayesian inference has troubles when the prior is uncertain, I said no, Bayesian analysis is completely at home when the prior is uncertain.

2. My examples in the main post are correct, as far as I know, and illustrative of particular choices of uncertainty on theta. There are other choices one can make to describe other forms of uncertainty, which will of course yield different numerical particulars. That does not alter my objection to Marcus & Davis in any way. It simply bolsters the fact that when you specify the uncertainty, Bayes' rule tells us how to re-allocate uncertainty.

3. I mentioned in my previous comment that the calculation I've done in the Comments is of the "prior predictive," not the posterior. The main post did the complete posterior using JAGS.

I appreciate your thoughtful engagement in this blog entry. Nevertheless, I need to leave this thread behind now, and focus on other obligations. Until next time!

Ok!

I think my point is just that the results of the example come from the higher mean of the prior, and not from the higher uncertainty (in the sense of variance).

With the new code I posted this gets clear: when we keep the mean the same, the higher variance yields lower "prior predictive" values, not higher!

Thanks for all the replies!

Best

Carlos

John

Just one more thing, how did you update theta?

Did you use:

Posterior = Prior * ((0.75*theta)+(0.1)*(1-theta))/((0.75*(meantheta)+(0.1)*(1-meantheta)))

This is how I'm doing it, and I think this is how you did it, because the update rule above matches the 62.7% in your third example, and it also matches the "posterior predictive" of 88.2% if you average over the posterior (this last one I think is not right, for the reasons I mentioned in the other comments; I think the correct "posterior predictive" would be 92.6%).

If you could share the complete code, that would be great!
