Censored Weibull

Hi all. I’m trying to adapt the approach to estimate incubation periods from this paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4757028/#pone.0148506.ref002
R code and data here: Dryad dataset "Association between the severity of influenza A(H7N9) virus infections and length of the incubation period".
However, I cannot get similar results to those reported in the paper, and I cannot figure out why. Here’s my code:

# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import pymc as pm
import arviz as az
import pytensor.tensor as pt
from scipy.special import gammainc as gammai

np.random.seed(27)

def weibull_cdf(x, kappa, theta):
    return 1 - pt.exp(-(x/theta)**kappa)

def censored(name, kappa, theta, lower, upper):
    L = weibull_cdf(lower, kappa, theta)
    U = weibull_cdf(upper, kappa, theta)
    return U - L

data = pd.read_csv("./original_data/data_h7n9_severity.csv")

lower = data.IncP_min.values #incubation periods lower boundary
upper = data.IncP_max.values #incubation periods upper boundary

fatal = data.death_status.values   

with pm.Model() as mod:
    # kappa = pm.HalfNormal('kappa', 5)
    # theta = pm.HalfNormal('theta', 5)
    kappa = pm.Uniform('kappa', 0, 100, shape=2) #as used in the paper
    theta = pm.Uniform('theta', 0, 100, shape=2) #as used in the paper
    mu = pm.Deterministic('mu', theta*pt.gamma(1 + 1 / kappa)) #incubation period mean
    y = pm.Potential('y', censored('censored', kappa[fatal], theta[fatal], lower, upper))
    
with mod:
    idata = pm.sample(2000, tune=2000, nuts_sampler='numpyro', random_seed=27)

az.summary(idata)

Out[27]: 
            mean      sd  hdi_3%  hdi_97%  ...  mcse_sd  ess_bulk  ess_tail  r_hat
kappa[0]  51.804  27.256   7.450   96.496  ...    0.643     831.0     917.0   1.01
kappa[1]  55.424  26.378  14.275   99.970  ...    0.574     969.0     829.0   1.00
theta[0]   3.569   1.385   1.252    6.081  ...    0.043     530.0     627.0   1.01
theta[1]   3.825   1.130   1.371    4.991  ...    0.048     423.0     681.0   1.01
f[0]       3.503   1.379   1.180    5.957  ...    0.042     546.0     738.0   1.01
f[1]       3.771   1.131   1.309    4.986  ...    0.046     427.0     721.0   1.01
mu[0]      3.508   1.363   1.260    5.991  ...    0.042     531.0     644.0   1.01
mu[1]      3.769   1.115   1.365    4.936  ...    0.047     420.0     698.0   1.01

[8 rows x 9 columns]

Any help will be really appreciated.

There is a wrapper for creating censored distributions

https://www.pymc.io/projects/docs/en/stable/api/distributions/censored.html

Also, for censoring, what happens is that you have a bunch of data, say [10, 10, 10, 180, 180, 1240],
with a lower and an upper boundary (say 10 and 1240). You supply this data as the observed input to your censored distribution, and 10 and 1240 as the lower and upper boundaries, so that the censored distribution decides how to handle each data point. The probability of 10 and 1240 is then automatically calculated with the CDF, whereas the 180s are calculated with the pdf.
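For instance, a minimal sketch of that wrapper (the Weibull and its priors here are placeholders, not your model):

import numpy as np
import pymc as pm

# example data from above, censored at 10 below and 1240 above
y_obs = np.array([10., 10., 10., 180., 180., 1240.])

with pm.Model():
    kappa = pm.HalfNormal("kappa", 5)  # Weibull shape, placeholder prior
    theta = pm.HalfNormal("theta", 5)  # Weibull scale, placeholder prior
    # pm.Censored wraps an unregistered distribution created with .dist();
    # observations equal to a bound contribute via the CDF, the rest via the pdf
    pm.Censored("y", pm.Weibull.dist(alpha=kappa, beta=theta),
                lower=10, upper=1240, observed=y_obs)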

Since I can't see what lower and upper look like in your case without the data, I am not sure what you are doing here, but it does not look like censoring in that sense. Can you post an example of what lower and upper look like?

Hi, thank you. Yes, sorry about that. When reading the paper I felt a similar confusion, as it does not seem like a censored approach 'proper'. But it is (partially, I guess) a standard way of estimating incubation periods in epidemiology. The main issue is that there are no unique lower and upper boundaries; instead, every observation has its own boundaries.

The common case is that each patient has a lower and an upper boundary for a hypothetical/estimated exposure period, where the incubation period is the time of symptom onset minus the time of exposure. So we have, for instance, a minimum exposure moment (minExp), a maximum exposure moment (maxExp), and the symptom onset (onset). The point is to estimate a 'true' value (often called 'exact') of the incubation period, i.e. one that lies between onset - maxExp and onset - minExp.
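To make the boundaries concrete, for a single patient it would be something like this (made-up numbers, just to illustrate; the Dryad file already ships the results as IncP_min and IncP_max):

min_exp, max_exp, onset = 2.0, 4.0, 8.0  # exposure window and symptom onset, in days
inc_min = onset - max_exp                # shortest possible incubation: 4.0 days
inc_max = onset - min_exp                # longest possible incubation: 6.0 days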

The approach in the paper is based on the CDFs of Weibull distributions. You can see that in the linked paper and R code; the data and code are in the second link of my initial post. But the key bit of their code is this one (I guess):

# prior function
logprior = function(current){
  current$logprior=dunif(current$k,0,100,log=TRUE)+dunif(current$theta,0,100,log=T)
  current
}

# The likelihood function
loglikelihood = function(current,data){
  current$loglikelihood <- log(pweibull(data$IncP_max,current$k,current$theta)-pweibull(data$IncP_min,current$k,current$theta))
  current
}

Which is what I'm trying to translate to PyMC. They sample it with Metropolis; I'd prefer to use HMC.
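If I read it right, each observation contributes the probability mass between its two bounds. In scipy terms (illustrative numbers only, not the paper's estimates), that term would be something like:

import numpy as np
from scipy.stats import weibull_min

# R's pweibull(q, shape, scale) corresponds to weibull_min.cdf(q, c=shape, scale=scale)
kappa, theta = 2.0, 3.0        # hypothetical Weibull shape and scale
inc_min, inc_max = 1.0, 4.0    # one patient's incubation interval, in days
p_interval = (weibull_min.cdf(inc_max, c=kappa, scale=theta)
              - weibull_min.cdf(inc_min, c=kappa, scale=theta))
loglik = np.log(p_interval)    # that observation's contribution to the log-likelihood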

I have checked the implementation of PyMC's Censored and multiple posts on the topic before posting. But it seems that the data I'm working with does not match its requirements. PyMC's Censored requires access to the censored data (e.g. a list of values) and two absolute boundaries (upper and lower), while the data I have corresponds to two lists of per-observation boundaries (395 lower boundaries and 395 upper boundaries).

PyMC's Censored requires access to the censored data (e.g. a list of values) and two absolute boundaries (upper and lower)

If you want to achieve this, it is easy with the Censored class. Let's say your data is

[10,10,20,180,180,1200]

If you supply

lower = [10, 10, 20, -np.inf, -np.inf, -np.inf] 
upper = [np.inf, np.inf, np.inf, np.inf, 180, 1200]

it will apply lower (left) censoring to 10, 10 and 20, and upper (right) censoring to the second 180 and to 1200 (the first 180 lies strictly inside its bounds, so it is evaluated with the pdf).
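Plugged into the Censored wrapper mentioned earlier, that would look roughly like this (the Weibull priors are just placeholders):

import numpy as np
import pymc as pm

# per-observation bounds from the example above; rows whose value equals a
# bound are treated as censored, the rest as ordinary observations
y_obs = np.array([10., 10., 20., 180., 180., 1200.])
lower = np.array([10., 10., 20., -np.inf, -np.inf, -np.inf])
upper = np.array([np.inf, np.inf, np.inf, np.inf, 180., 1200.])

with pm.Model():
    kappa = pm.HalfNormal("kappa", 5)  # placeholder priors
    theta = pm.HalfNormal("theta", 5)
    pm.Censored("y", pm.Weibull.dist(alpha=kappa, beta=theta),
                lower=lower, upper=upper, observed=y_obs)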

Regarding the code you posted

pweibull(data$IncP_max,current$k,current$theta)

I assume this is the cumulative distribution function. If so, perhaps you do not really want censoring. Do you just have measurements that you know fall within an interval (but you do not know the exact value)? In that case the difference between CDFs is indeed the right way to go, and I would also do it via Potential terms. I have looked at the data provided with the paper and there are some instances where the lower bound is equal to the upper bound, so for those cases they might have used the Weibull pdf (that is what I would have done, at least). So, for instance, something like this:

import pandas as pd
import numpy as np
import pymc as pm
import arviz as az
import pytensor.tensor as pt
from scipy.special import gammainc as gammai

np.random.seed(27)

def weibull_cdf(x, kappa, theta):
    return 1 - pt.exp(-(x/theta)**kappa)

def censored(name, kappa, theta, lower, upper):
    L = weibull_cdf(lower, kappa, theta)
    U = weibull_cdf(upper, kappa, theta)
    return U - L

data = pd.read_csv("./data_h7n9_severity.csv")

lower = data.IncP_min.values #incubation periods lower boundary
upper = data.IncP_max.values #incubation periods upper boundary

fatal = data.death_status.values

# split observations into interval-censored (lower < upper) and exact (lower == upper) cases
interval = np.array([[x,y,f] for x,y,f in zip(lower,upper,fatal) if x!=y])
exact = np.array([[x,f] for x,y,f in zip(lower, upper, fatal) if x==y])

with pm.Model() as mod:
    # kappa = pm.HalfNormal('kappa', 5)
    # theta = pm.HalfNormal('theta', 5)
    kappa = pm.Uniform('kappa', 0, 5, shape=2) #narrower than the paper's Uniform(0, 100)
    theta = pm.Uniform('theta', 0, 5, shape=2) #narrower than the paper's Uniform(0, 100)
    mu = pm.Deterministic('mu', theta*pt.gamma(1 + 1 / kappa)) #incubation period mean
    # cast the death-status column to int, since the mixed array is float
    y = pm.Potential('y', censored('censored', kappa[interval[:,2].astype(int)],
                                   theta[interval[:,2].astype(int)], interval[:,0],
                                   interval[:,1]))
    # exact (lower == upper) incubation periods use the Weibull density directly
    pm.Weibull("exact", kappa[exact[:,1].astype(int)], theta[exact[:,1].astype(int)],
               observed=exact[:,0])


with mod:
    idata = pm.sample(1000, tune=1000)#, nuts_sampler='numpyro')#, random_seed=27)

az.summary(idata)

az.plot_posterior(idata)

from which I get:

[posterior plot]


There is one other thing missing in your code which might help too. You are not taking the log of the difference, as far as I can see. You should return

return pm.math.log(U - L)

however, for this to work you have to either discard the data where U = L or do like I did (which I think is more sensible, if I did not misunderstand your data). Nevertheless, if you throw away the U = L cases and return the log, you get:

[posterior plot]
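In other words, the helper from the sketch above would become something like this (applied only to rows where the two bounds differ, so that U - L stays positive):

import pymc as pm
import pytensor.tensor as pt

def censored(name, kappa, theta, lower, upper):
    # Weibull CDF at each bound; return the log of the interval probability
    # so it can feed pm.Potential directly as a log-likelihood term
    L = 1 - pt.exp(-(lower / theta) ** kappa)
    U = 1 - pt.exp(-(upper / theta) ** kappa)
    return pm.math.log(U - L)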


Thank you very much for the quick and helpful reply. This solved the issue.

Indeed, I had removed the log because it was not working in my code, since I didn't have a sampling distribution for the 'exact' values. I couldn't guess that from their code, but they mention it in the paper somewhat implicitly. I should learn to look a bit more carefully at the data as well :sweat_smile: