Simple Exponential Survival Function

I’m having trouble replicating survival analysis in PyMC3. I’ve been following along with the Statistical Rethinking lectures, and there’s an example about survival analysis on cat adoptions that isn’t in the published book. I’m wondering how to convert it into a PyMC3 model, particularly this sort of logic:

[Screenshot of the lecture slide: the likelihood uses an Exponential density for observed adoptions and the exponential CCDF for censored cats]

If a cat was adopted, the observation isn't censored, so I can use a regular exponential distribution; if it isn't labeled as an adoption, the observation is censored, because the cat might still end up being adopted, so I'd use the exponential CCDF instead. How do I implement this sort of logic in PyMC3?

Here’s the Stan code for the model as well:

model{
    vector[22356] lambda;
    b ~ normal( 0 , 1 );
    a ~ normal( 0 , 1 );
    for ( i in 1:22356 ) {
        lambda[i] = a + b * black[i];
        lambda[i] = exp(lambda[i]);
    }
    for ( i in 1:22356 )
        if ( adopted[i] == 0 ) target += exponential_lccdf(days_to_event[i] | lambda[i]);
    for ( i in 1:22356 )
        if ( adopted[i] == 1 ) days_to_event[i] ~ exponential( lambda[i] );
}

You can use a Potential for that. Note that we have logcdf implemented for pm.Exponential but not the log survival function (i.e., the lccdf), so the best approach I can think of is:

import pymc3 as pm
import theano.tensor as tt

with pm.Model():
    a = pm.Normal('a', 0, 1)
    b = pm.Normal('b', 0, 1)
    log_rate = a + b * black
    y = pm.Exponential('observed', tt.exp(log_rate), observed=days_to_event)
    # exp_lccdf(t | lam) = exp_logp(t | lam) - log(lam), so subtracting log_rate
    # from the censored rows turns their density terms into survival terms
    pm.Potential('survival', -log_rate[adopted == 0])
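The identity behind this trick can be verified numerically (a standalone SciPy check, not part of the model): the exponential log survival function equals the log density minus log λ, which is why subtracting log_rate from the censored rows converts the Exponential likelihood into the CCDF term.

```python
import numpy as np
from scipy import stats

lam, t = 0.7, 3.2  # arbitrary rate and waiting time
# exponential_lccdf(t | lam) = -lam * t = logpdf(t | lam) - log(lam)
lccdf = stats.expon.logsf(t, scale=1/lam)
assert np.isclose(lccdf, -lam * t)
assert np.isclose(lccdf, stats.expon.logpdf(t, scale=1/lam) - np.log(lam))
```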

I couldn’t end up following your solution as well as I had hoped; part of it is being new to PyMC3, and part of it is needing to work on my theoretical knowledge more. I did, however, find this exponential survival function on the PyMC3 website (which I managed to derive myself afterwards to figure out how they got it):

[Screenshot of the censored exponential log-likelihood from the PyMC3 survival-analysis example]
and a corresponding function

def exponential_logp(censored, value):
    # Note: for this to match the Stan model, `censored` must be the event
    # indicator (1 = adoption observed, 0 = censored): observed events
    # contribute log(lam) - lam*t, censored rows contribute -lam*t
    return (censored * tt.log(lam) - value * lam).sum()
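A quick sanity check of this likelihood (a NumPy/SciPy stand-in for the Theano expression, with made-up data): multiplying log λ by the event indicator reproduces the piecewise density/CCDF logic of the Stan model.

```python
import numpy as np
from scipy import stats

lam = 0.5
adopted = np.array([1, 0, 1, 0])         # 1 = adoption observed, 0 = censored
days = np.array([2.0, 10.0, 1.5, 30.0])

# indicator form of the log-likelihood, as in exponential_logp
combined = (adopted * np.log(lam) - days * lam).sum()
# piecewise form: density for events, survival function for censored rows
piecewise = (stats.expon.logpdf(days[adopted == 1], scale=1/lam).sum()
             + stats.expon.logsf(days[adopted == 0], scale=1/lam).sum())
assert np.isclose(combined, piecewise)
```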

So the model ends up being

with pm.Model() as m1:
    a_c = pm.Normal('a_color', 0, 1, shape=len(df.color_id.unique()))
    mu = tt.exp(a_c[cid])
    lam = 1/mu
    exp_surv = pm.DensityDist('exp_surv', exponential_logp, observed={'censored':censored, 'value':days_to_event})
    
    m1_trace = pm.sample(draws=1000)

And this model returns the exact same inference as the Stan model. However, when I try to run
pm.sample_posterior_predictive(m1_trace)
I get the following error:
ValueError: Distribution was not passed any random method Define a custom random method and pass it as kwarg random

Are there any good resources on how to fix this issue and create a custom distribution from my exponential_logp function? I attempted to troubleshoot the issue but I haven’t been getting very far

edit: with a little more effort I ended up following along with your code, which, unlike my approach, allows for posterior predictive sampling. Thanks!


@john_c I have been trying to solve the same problem. Did you end up using Potential? If so, how did you incorporate it into your model? It also seems like you are missing priors for exponential_logp! Thanks


The error you’re getting with DensityDist is telling you that the model does not know how to generate a random sample from your custom distribution (see Custom Distribution).

You need to define a sampling function random, then pass DensityDist(..., random=random). In the cats example, the function would be:

import numpy as np
from scipy import stats

def random(*dist_params, rng=None, size=None):
    """The function from which to generate random samples.

    Parameters
    ----------
    dist_params : list
        The same parameters as given to `exponential_logp`: (t, lam) in this case.
    rng : Generator
        The random number generator used internally.
    size : tuple
        The desired size of the random draw.

    Returns
    -------
    result : (N,) ndarray
        A random sample of `size` from the desired distribution.
    """
    t, lam = dist_params
    return np.where(
        np.array(adopted == 1),
        stats.expon.rvs(scale=1/lam, size=size),                    # not censored: Exp(lam) draw
        stats.expon.isf(stats.uniform.rvs(size=size), scale=1/lam)  # censored: inverse survival of a uniform
    )

# NOTE: DensityDist -> CustomDist in the new API
exp_surv = pm.CustomDist('exp_surv', days_to_event, lam,
                         logp=exponential_logp, observed=adopted,
                         random=random)
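A side note on the two branches of that np.where (a plain SciPy check, independent of the model): passing a Uniform(0, 1) draw through the exponential inverse survival function is just inverse-transform sampling, so both branches draw from the same Exponential(λ) distribution; the split only matters if you later treat the censored branch differently.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
lam, n = 0.5, 200_000
# inverse-transform sampling: isf applied to uniforms gives Exponential draws
via_isf = stats.expon.isf(rng.uniform(size=n), scale=1/lam)
direct = stats.expon.rvs(scale=1/lam, size=n, random_state=rng)
# both samples should have mean close to 1/lam = 2
assert abs(via_isf.mean() - 1/lam) < 0.05
assert abs(direct.mean() - 1/lam) < 0.05
```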

This addition will allow you to sample from the posterior predictive. Note that I’ve only done a basic check on the mean of the posterior predictive, so I’m not sure if this function is entirely accurate.

@john_c Can you post your code for using pm.Potential to solve the issue? I cannot seem to get it to work properly either.

Update: I figured out the pm.Potential solution. In @junpenglao 's terms, the line is

pm.Potential('survival', pm.math.switch(adopted, 0, -log_rate))

In @john_c 's terms, the line is

pm.Potential('survival', pm.math.switch(adopted, 0, a_c[cid]))

For reasons discussed in McElreath’s book, use of the index variable cid in the latter is preferred over the indicator variable black in the former.

I get the following warning when I try to run pm.sample_posterior_predictive on your model with pm.Potential:

UserWarning: The effect of Potentials on other parameters is ignored during posterior predictive sampling. This is likely to lead to invalid or biased predictive samples.

Is there any further information on this warning? Since the Potential is only used to estimate the main parameter in this case, does the warning even matter?

If you’re using a Potential instead of a CustomDist for the likelihood, PyMC won’t know how to perform posterior predictive sampling of the censored variables (hence the warning). If you don’t care about censoring in the posterior predictive, you can ignore it.

Note that we now have a Censored class that can be used in many of these survival models Censored — PyMC dev documentation

This alleviates the need for CustomDist or Potentials. It doesn’t work with indicator variables but with explicit censoring bounds: specifying Censored(..., upper=[5, np.inf], observed=[5, 6]) corresponds to a case where the first observation was censored at a value of 5 and the second had an uncensored value of 6.
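To spell that convention out numerically (a SciPy sketch of the underlying logic, not PyMC internals): a censored observation sitting at its bound contributes the log survival function, while an uncensored one contributes the ordinary log density.

```python
import numpy as np
from scipy import stats

lam = 0.2
upper = np.array([5.0, np.inf])  # first obs censored at 5, second uncensored
obs = np.array([5.0, 6.0])

# censored logp: logsf at the bound for censored rows, logpdf elsewhere
logp = np.where(obs >= upper,
                stats.expon.logsf(obs, scale=1/lam),
                stats.expon.logpdf(obs, scale=1/lam))
assert np.isclose(logp[0], -lam * 5.0)               # censored: -lam*t
assert np.isclose(logp[1], np.log(lam) - lam * 6.0)  # uncensored density
```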

In case anyone else goes the Censored class route, blindly copies over the example, and gets an error about “non-finite logp”, that’s just due to the prior on the a’s being Normal(0,1). a is supposed to be the log average wait time in days, so setting it to something more reasonable, like just Normal(2,1), solves the issue. The wait times of ~1000 days do not compute (ha!) with that N(0,1) prior.