Before launching into programming details, let me illustrate why this is important. Right-censored data can occur in many situations. For example, when measuring response times, the subject is given a limited temporal window in which to respond, and if no response has occurred by the end of the window, then the trial is aborted and the response is recorded as censored. If we assume that the subject was "on task" but merely didn't have enough time to respond before the experimenter got bored, then it would be inappropriate to omit this trial from the modeled data. In other words, the trial is providing genuine information about the response time, namely, that it takes longer than the experimenter's censoring limit. If we omit this trial from the modeled data, then our estimates will be too small!
Here is an example. The data are 500 values generated from a normal distribution, rescaled so that the sample mean is 100.0 and the sample standard deviation is 15.0. Then I marked any value above +1SD as NA. In other words, I censored about 15% of the data in the right tail. Suppose we model the data with a normal likelihood and estimate its μ and σ parameters. We would like the estimates to recover the true generating values of μ=100.0 and σ=15.0. But if we simply omit the censored values (as if they weren't part of the data), the estimates look like this:
Posterior estimates when censored data are omitted. Estimates are too low. (ROPE is shown only for visual landmarks; it's not particularly meaningful in this case.) |
On the other hand, if we tell JAGS that there are censored values that it should take into account, the estimates look like this:
Posterior estimates when censored data are modeled. Estimates are accurate. (ROPE is shown only for visual landmarks; it's not particularly meaningful in this case.) |
The basic idea for implementing in JAGS is presented briefly (one paragraph) in the JAGS user manual. The idea starts with defining an indicator vector that encodes, for every data value, whether the data value was censored or not. The indicator value, here named isCensored, is 1 for censored values and 0 for non-censored values. Then that indicator value is modeled thus: isCensored ~ dinterval(y,censorLimit) It took me a while to infer from that syntax the key idea: When isCensored is 1 and y is missing, then all JAGS knows is that y is somewhere above censorLimit, so JAGS effectively imputes a random value for y from the likelihood function (defined separately, in the usual way). Here are the tacit assumptions that I eventually figured out:
- Censored data must be recorded as NA, not as the value of censoring limit.
- When explicitly initializing the chains, the censored values of the data must be explicitly initialized (to values above the censoring limits)!
This post is fortuitous, because I was struggling with right-censored data this week. I've expanded this model into one which is slightly more complicated, in which the data have associated measurement uncertainties, so the body of the model is:
ReplyDeleteisCensored[i] ~ dinterval(yobs[i], censorLimitVec[i])
yobs[i] ~ dnorm(y[i],yerr[i])
y[i] ~ dnorm(mu, tau)
with "yobs" representing the observed values with associated uncertainties "yerr".
For this to run successfully the yerr value associated with the censored yobs points have to be defined---it won't accept "NA". Why this is true isn't immediately clear to me.
Very helpful!! Thanks!
ReplyDeleteRunning your program results in an error due to a change in JAGS 3.3.0 that do not allow values to be overwritten during initialization. The error goes away if
ReplyDelete# intial values of censored data:
yInit = y
yInit[isCensored] = censorLimitVec[isCensored]+1
is rewritten to
# intial values of censored data:
yInit = NA
yInit[isCensored] = censorLimitVec[isCensored]+1
In reply to Galeri B. For generality, a better way to do this
ReplyDeleteyinit=y
yinit[]=NA
yInit[isCensored] = censorLimitVec[isCensored]+1
So that you are guaranteed to keep the dimension, even if there are no censored values in the range
What happens and how do you proceed if there is right censoring, but the number of censored values are not known?
ReplyDeleteDear hypergeometric (12 Sept 2013):
ReplyDeleteThe model assumes that you know when you've conducted a measurement trial but the measurement was censored. It other words, the model assumes that you know when you have started and stopped an observation window that defines a measurement trial. I think this is a ubiquitous, if tacit, assumption in all conceptions of data measurements.
Do you mean that you have missing data, and some of the data might be censored trials? Then you're getting into models of missing values, and into the issues of missing-at-random, or missing systematically, etc.
I'm trying to understand how to construct the censorLimitVec with a standard set of survival data. I have a vector of event times t and a vector of censored status c. I see that I must set t[which(c)] = NA, and is.censored = c. Can I put the actual censor time in censorLimitVec for censored events? What do I put in censorLimitVec for uncensored events? Or does it matter? If it doesn't matter, then can I just do is.censored[i] ~ dinterval(t, t)?
ReplyDeleteDavid et al.:
ReplyDeleteThe core of the JAGS model for censored data is this:
isCensored[i] ~ dinterval( y[i] , censorLimitVec[i] )
y[i] ~ dnorm( mu , tau )
The key to understanding what JAGS is doing is that JAGS automatically imputes a random value for any variable that is not specified as a constant in the data. Thus, when y[i] is NA (i.e., a missing value, not a constant), then JAGS imputes a random value for it.
But what value should it generate?
The second line of the model, above, says that y[i] should be randomly generated from a normal distribution with mean mu and precision tau.
But the first line of the model, above, puts another constraint on the randomly generated value of y[i]. That line says that whatever value of y[i] is randomly generated, it must fall on the side of censorLimitVec[i] dictated by the value of isCensored[i].
To understand this part, let's unpack the dinterval() distribution. Suppose that censorLimitVec has 3 values in it, not just 1:
censorLimitVec = c(10,20,30)
Then randomly generated values from dinterval(y,c(10,20,30)) will be either 0, 1, 2, or 3 depending on whether y<10, 10<y<20, 20<y<30, or 30<y. So, if y=15, dinterval(y,c(10,20,30)) has output of 1 with 100% probability. The trick is this: We instead specify the output of dinterval, and impute a random value of y that could produce it. Thus, if we say
1 ~ dinterval(y,c(10,20,30))
then y is imputed as a random value between 10 and 20.
Putting the two model statements together,
1 ~ dinterval( y , censorLimit )
y ~ dnorm( mu , tau )
means that y comes from a normal density and y must fall above the censorLimit.
Hope that helps!!
John,
ReplyDeleteThat was extremely helpful. Thank you. You might consider pasting it as an answer to my similar Cross-Validated question. Search for "Right-censored survival fit with JAGS."
Dr Kruschke,
ReplyDeleteThanks for your helpful comment.
Yes, I was getting at missing-at-random, missing-below-threshold kinds of situations.
-- Jan ("hypergeometric")
P.S. Thank you so much for putting up such an excellent compendium of resources on Bayesian calculation, as well as your book. (I personally own 2 copies, and convinced my department to purchase a third. We have JAGS and runjags widely installed, etc.)
Thanks for doing this!
ReplyDeleteDear Dr. Kruschke,
ReplyDeleteThank you very much for this blogpost (and your excellent book!).
Do you (or anyone reading this) know how to specify/change the model so that it deals with data that is both left and right censored?
Is it possible to also graph the full set of y's including imputed values? It would be a great posterior check and look great in two colors.
ReplyDeleteI suspect we need to somehow add y_imputed to the parameters in the RUN THE CHAINS section.
We can't use
# RUN THE CHAINS
parameters = c( "mu" , "sigma", "y" )
because the codaSamples will include all 500 y's and the non-missing values are 'constants.'
We also can't average each of the imputed y's to 'fill out' the distribution since all those y_imputed averages are around 122.4.
Anyone have ideas/code on how to graph the distribution of all 500 y's?
This comment has been removed by the author.
Delete@ Sean S:
DeletePlease see this new post:
https://doingbayesiandataanalysis.blogspot.com/2017/12/graphs-of-imputed-censored-y-values.html
<a href="https://doingbayesiandataanalysis.blogspot.com/2017/12/graphs-of-imputed-censored-y-values.html>here</a>