Q&A - PPV, NPV, Sens, Spec with censored data

QUESTION:

How can you estimate sensitivity, specificity, positive predictive value, negative predictive value, and prevalence with censored data?

ANSWER:

First, let's pretend I had complete data (no censoring, no missing data) on the random variables CRP and T0 measuring C-reactive protein levels and time to death, respectively.

We will use the event CRP > c as a "positive test".
We will use T0 < k as "disease".

Now, with cross-sectional sampling (i.e., I pre-specified neither the number of subjects with disease in my sample nor the number of subjects with a positive test in my sample), I can DIRECTLY estimate all of the following:

Prevalence of disease is Pr(T0 < k). With no censoring, I would just count up the number of people who had T0 < k and divide by the number of people.

     Prev = #(T0 < k) / #(in sample)

Sensitivity of test is Pr(CRP > c | T0 < k). With no censoring, I would consider only the people who had T0 < k. Then of those people, I would count how many had CRP > c, and divide by the total number of people with T0 < k. So

     Sens = #(CRP > c and T0 < k) / #(T0 < k)

Similarly, specificity of test is Pr(CRP < c | T0 > k). With no censoring, I would consider only the people who had T0 > k. Then of those people, I would count how many had CRP < c, and divide by the total number of people with T0 > k. So

     Spec = #(CRP < c and T0 > k) / #(T0 > k)

Predictive value of positive is Pr(T0 < k | CRP > c). With no censoring, I would consider only the people who had CRP > c. Then of those people, I would count how many had T0 < k, and divide by the total number of people with CRP > c. So

     PVP = #(CRP > c and T0 < k) / #(CRP > c)

Predictive value of negative is Pr(T0 > k | CRP < c). With no censoring, I would consider only the people who had CRP < c. Then of those people, I would count how many had T0 > k, and divide by the total number of people with CRP < c.
So

     PVN = #(CRP < c and T0 > k) / #(CRP < c)

Okay. So far so good.

Now what happens if we have censored data for T0? What can we estimate directly in this case?

-- We estimate Pr(T0 > t) DIRECTLY using the KM curve. (We can estimate the prevalence as 1 - Pr(T0 > k).)

-- We estimate Pr(T0 > t | CRP > c) DIRECTLY by restricting our data analysis to people with CRP > c, and then using a KM curve just in that subset. The same approach in the subset with CRP < c gives Pr(T0 > t | CRP < c). (We can estimate PVP and PVN.)

-- We cannot estimate Pr(CRP > c | T0 > t) DIRECTLY if there are some subjects censored before time t, because we do not have any way of restricting our analysis to the people who have T0 > t. (Some of the subjects censored before time t might have also died before time t.) (We cannot directly estimate sensitivity and specificity.)

Can we INDIRECTLY estimate Sens and Spec in this setting by using Bayes Rule? Writing Dis for T0 < k, Hlth for T0 > k, Pos for CRP > c, and Neg for CRP < c:

                                 Pr(Dis | Pos) Pr(Pos)
     Pr(Pos | Dis) = ---------------------------------------------
                     Pr(Dis | Pos) Pr(Pos) + Pr(Dis | Neg) Pr(Neg)

                                  Pr(Hlth | Neg) Pr(Neg)
     Pr(Neg | Hlth) = -----------------------------------------------
                      Pr(Hlth | Pos) Pr(Pos) + Pr(Hlth | Neg) Pr(Neg)

Provided we had cross-sectional sampling, we can estimate everything on the right hand side, so that would allow us to get a sensitivity and a specificity through these indirect means. (If we had sampled according to CRP results-- purposely getting a certain number of positive and negative subjects-- we could not do this, because estimating Pr(Pos) and Pr(Neg) would not have been possible.)

Now if we want to consider what would happen in a population with a different mix of diseased and non-diseased...

If we presume that sensitivity and specificity are always the same, then the PVP and PVN we estimate from this data depend upon the prevalence of disease in our sample. Hence, those estimates do not apply to our new population.
But we can use the estimated sensitivity and specificity for the timeframes involving the censored data, and then use Bayes Rule to get the PVP and PVN for the new prevalences. This works provided our new population would have the same sensitivity and specificity (an assumption we usually make, but it is not always strictly true-- see the example in class).

And lastly we need to consider the missing CRP data. If we assume they are "missing at random", we can safely ignore those cases-- basically pretend that those people were not in our sample. (If we cannot assume this or reliably impute the data in some other way, we are completely lost.)

Scott

#####################################################################
Scott S. Emerson, M.D., Ph.D.       Biost Dept: (O) 206-543-1044
Professor of Biostatistics                      (F) 206-543-3286
Department of Biostatistics
Box 357232                          ROC:        (O) 206-221-4185
University of Washington                        (F) 206-543-0131
Seattle, Washington 98195           semerson@u.washington.edu
#####################################################################
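
A rough sketch of the indirect route described above, in Python, in case it helps to see the arithmetic. Nothing here comes from the original answer beyond the formulas: the function names (km_survival, indirect_sens_spec, predictive_values), the toy data, and the choice to treat CRP equal to c as a negative test are all assumptions made for illustration. It computes a Kaplan-Meier estimate of Pr(T0 > k) within the CRP > c and CRP <= c subsets, plugs those into Bayes Rule together with the sample fraction of positives (which presumes cross-sectional sampling), and then re-applies Bayes Rule at a different prevalence. Subjects with missing CRP are assumed to have been dropped already.

```python
def km_survival(times, events, t):
    """Kaplan-Meier estimate of Pr(T0 > t).

    times  : observed follow-up times (time of death or of censoring)
    events : 1 if the death was observed, 0 if the subject was censored
    Deaths at exactly t count as having occurred by t, so the difference
    between T0 < t and T0 <= t is ignored (fine for continuous time).
    """
    surv = 1.0
    death_times = sorted({u for u, e in zip(times, events) if e and u <= t})
    for u in death_times:
        at_risk = sum(1 for v in times if v >= u)
        deaths = sum(1 for v, e in zip(times, events) if e and v == u)
        surv *= 1.0 - deaths / at_risk
    return surv


def indirect_sens_spec(times, events, crp, c, k):
    """Sens and Spec via Bayes Rule from KM-based PVP and PVN.

    Assumes cross-sectional sampling, so the sample fraction with
    CRP > c estimates Pr(Pos). CRP == c is treated as negative
    (an assumption; the text only uses CRP > c and CRP < c).
    """
    pos = [x > c for x in crp]
    p_pos = sum(pos) / len(pos)
    p_neg = 1.0 - p_pos

    t_pos = [t for t, p in zip(times, pos) if p]
    e_pos = [e for e, p in zip(events, pos) if p]
    t_neg = [t for t, p in zip(times, pos) if not p]
    e_neg = [e for e, p in zip(events, pos) if not p]

    pvp = 1.0 - km_survival(t_pos, e_pos, k)   # Pr(Dis | Pos)
    pvn = km_survival(t_neg, e_neg, k)         # Pr(Hlth | Neg)

    sens = pvp * p_pos / (pvp * p_pos + (1.0 - pvn) * p_neg)
    spec = pvn * p_neg / ((1.0 - pvp) * p_pos + pvn * p_neg)
    prev = 1.0 - km_survival(times, events, k)  # Pr(T0 < k), overall KM
    return {"prev": prev, "pvp": pvp, "pvn": pvn, "sens": sens, "spec": spec}


def predictive_values(sens, spec, prev):
    """PVP and PVN in a population with prevalence prev, assuming the
    same sensitivity and specificity carry over to that population."""
    pvp = sens * prev / (sens * prev + (1.0 - spec) * (1.0 - prev))
    pvn = spec * (1.0 - prev) / (spec * (1.0 - prev) + (1.0 - sens) * prev)
    return pvp, pvn


# Toy data (made up): follow-up time, death indicator, CRP level.
toy_times = [2, 3, 5, 7, 8, 10, 12, 12]
toy_events = [1, 0, 1, 1, 0, 1, 1, 0]
toy_crp = [9, 8, 2, 7, 1, 3, 6, 2]

est = indirect_sens_spec(toy_times, toy_events, toy_crp, c=5, k=6)
print(est)
# PVP and PVN if the new population had a prevalence of 10%
print(predictive_values(est["sens"], est["spec"], prev=0.10))
```

With no censoring at all (every death indicator equal to 1), the KM-based numbers reduce to the simple counting formulas given at the start of the answer, which is a handy sanity check before trusting the code on censored data.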