Friday, January 6, 2012

Complete example of right censoring in JAGS (with rjags)

In this post I provide a complete and simple example of how to do estimation of right-censored data in JAGS (with rjags). I could not find a complete, simple example anywhere (on the web or in print), and it took me lots of trial and error to figure out the tacit assumptions, which I try to make explicit here. (Of course, there still might be errors or infelicities. Let me know.)

Before launching into programming details, let me illustrate why this is important. Right-censored data can occur in many situations. For example, when measuring response times, the subject is given a limited temporal window in which to respond, and if no response has occurred by the end of the window, then the trial is aborted and the response is recorded as censored. If we assume that the subject was "on task" but merely didn't have enough time to respond before the experimenter got bored, then it would be inappropriate to omit this trial from the modeled data. In other words, the trial is providing genuine information about the response time, namely, that it takes longer than the experimenter's censoring limit. If we omit this trial from the modeled data, then our estimates will be too small!

Here is an example. The data are 500 values generated from a normal distribution, rescaled so that the sample mean is 100.0 and the sample standard deviation is 15.0. Then I marked any value above +1SD as NA. In other words, I censored about 15% of the data in the right tail. Suppose we model the data with a normal likelihood and estimate its μ and σ parameters. We would like the estimates to recover the true generating values of μ=100.0 and σ=15.0. But if we simply omit the censored values (as if they weren't part of the data), the estimates look like this:
Posterior estimates when censored data are omitted. Estimates are too low.
(ROPE is shown only for visual landmarks; it's not particularly meaningful in this case.)
 Notice that the estimated parameter values are too low. This makes sense, because the model is trying merely to describe the uncensored data shown in grey, without any attempt to take into account the censored data.

On the other hand, if we tell JAGS that there are censored values that it should take into account, the estimates look like this:
Posterior estimates when censored data are modeled. Estimates are accurate.
(ROPE is shown only for visual landmarks; it's not particularly meaningful in this case.)
Notice that the estimated parameter values are very accurate. This improvement in accuracy (and meaningfulness!) is why we want to be able to model censored data.

The basic idea for implementing in JAGS is presented briefly (one paragraph) in the JAGS user manual. The idea starts with defining an indicator vector that encodes, for every data value, whether the data value was censored or not. The indicator value, here named isCensored, is 1 for censored values and 0 for non-censored values. Then that indicator value is modeled thus:  isCensored ~ dinterval(y,censorLimit) It took me a while to infer from that syntax the key idea: When isCensored is 1 and y is missing, then all JAGS knows is that y is somewhere above censorLimit, so JAGS effectively imputes a random value for y from the likelihood function (defined separately, in the usual way). Here are the tacit assumptions that I eventually figured out:
  • Censored data must be recorded as NA, not as the value of censoring limit.
  • When explicitly initializing the chains, the censored values of the data must be explicitly initialized (to values above the censoring limits)!
The full program for generating the figures above can be found here. For details, read the comments in the top of the program. Please let me know of errors, improvements, etc. And please let me know if this was helpful.

4 comments:

jay said...

This post is fortuitous, because I was struggling with right-censored data this week. I've expanded this model into one which is slightly more complicated, in which the data have associated measurement uncertainties, so the body of the model is:

isCensored[i] ~ dinterval(yobs[i], censorLimitVec[i])
yobs[i] ~ dnorm(y[i],yerr[i])
y[i] ~ dnorm(mu, tau)

with "yobs" representing the observed values with associated uncertainties "yerr".

For this to run successfully the yerr value associated with the censored yobs points have to be defined---it won't accept "NA". Why this is true isn't immediately clear to me.

Anonymous said...

Very helpful!! Thanks!

Galeri B said...

Running your program results in an error due to a change in JAGS 3.3.0 that do not allow values to be overwritten during initialization. The error goes away if


# intial values of censored data:
yInit = y
yInit[isCensored] = censorLimitVec[isCensored]+1


is rewritten to


# intial values of censored data:
yInit = NA
yInit[isCensored] = censorLimitVec[isCensored]+1

Anonymous said...

In reply to Galeri B. For generality, a better way to do this


yinit=y
yinit[]=NA
yInit[isCensored] = censorLimitVec[isCensored]+1


So that you are guaranteed to keep the dimension, even if there are no censored values in the range