Censored Model with Hurdle Parameter

Hi!

I’m trying to build a statistical model for the following scenario.

There are N machines, each of which stops working at some point in time T. The machines are checked on some schedule to see whether they still work. When a machine is checked, the time of the check Tc and the state of the machine (0 = working, 1 = not working) are recorded. There is one complication: with probability p the sensor measuring the state may itself fail, in which case the observed state is always 0.
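
(So a recorded 1 is always a true failure, while a recorded 0 can mean either a working machine or a faulty sensor; spelled out, the observation model is observed_state = (1 - faulty) * 1{T ≤ Tc}, with faulty ~ Bernoulli(p).)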

I need to estimate the distribution of T and the probability p from the data. For simplicity, let's assume that each machine is checked only once, so we don't have to deal with repeated measurements of individual machines.

This problem is censored: I never observe the true value of T; it is either left-censored (when observed_failure = 1) or right-censored (when observed_failure = 0):

import numpy as np

df = df_simulated.assign(
    # pm.Censored treats a value at the lower bound as left-censored
    # (T <= check_day) and a value at the upper bound as right-censored
    # (T >= check_day).
    lower=lambda df: np.where(df["observed_failure"] == 1, df["check_day"], 0),
    upper=lambda df: np.where(df["observed_failure"] == 0, df["check_day"], np.inf),
)

I fit it like this:

import pymc as pm

with pm.Model() as m2:
    α = pm.Exponential("α", 1)
    β = pm.Exponential("β", 1)
    time_to_event = pm.Weibull.dist(alpha=α, beta=β)
    obs = pm.Censored("obs", dist=time_to_event, lower=df.lower, upper=df.upper, observed=df.check_day)
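
Fitting is then just the usual sampling call inside the model context:

with m2:
    idata = pm.sample()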

And it works beautifully for p = 0: I can recover the predefined failure rate from the simulation.

For p > 0 the model starts to over-estimate the time to failure, as it observes more "non-failures" than it should.

Any help would be great on how to add the hurdle parameter to the model. I understand that I need to modify the log-likelihood somehow, but my problem is that I have two observed variables: the failure indicator and the time. Should I use pm.Potential, or is there a cleaner way?

I believe @tcapretto built hurdle models into bambi recently. Maybe he can make some suggestions.

You may be able to do a Mixture between censored and DiracDelta(0). Can you show some code for how synthetic data would be generated?

Yeah, that's what I was thinking about, but I struggle to understand how to pass two variables (observed_failure and T) into a mixture.

Here is my data generation process:

import numpy as np
import pandas as pd
import scipy.stats as st


# Each machine gets its own typical time to failure, drawn from a Gamma,
# and the actual time to failure is Exponential around that.
typical_days_to_failure = st.gamma(a=5, scale=1).rvs(10000)
days_to_failure = st.expon(scale=typical_days_to_failure).rvs(10000)
p_faulty_sensor = 0.1
# Each machine is checked once, on a uniformly random day from 1 to 19
check_days = np.random.choice(np.arange(1, 20), size=10000)

df_simulated = pd.DataFrame().assign(
    true_failure=days_to_failure, check_day=check_days
).assign(
    failure=lambda df: (df.true_failure < df.check_day).astype(int),
    faulty_sensor=st.bernoulli(p=p_faulty_sensor).rvs(10000),
    observed_failure=lambda df: (1 - df.faulty_sensor) * df.failure,
)

Thinking about your unobserved failures (you can split the dataset into observed and unobserved failures), the likelihood for the unobserved ones is something like:

pm.Mixture(
    "unobserved_failure",
    w=[p_faulty_sensor, 1 - p_faulty_sensor],
    comp_dists=[
        # If the sensor is faulty, we have left censoring
        pm.Censored.dist(dist=time_to_event, lower=df.check_day, upper=None),
        # Otherwise, right censoring
        pm.Censored.dist(dist=time_to_event, lower=None, upper=df.check_day),
    ],
    observed=df.check_day,
)

For observed failures it’s just left censoring.
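
In code, for the subset with observed_failure == 1 (call it df_obs; the name is hypothetical), that would be something like:

pm.Censored(
    "observed_failure_times",
    dist=time_to_event,
    lower=df_obs.check_day,
    upper=None,
    observed=df_obs.check_day,
)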

The other thing you want to do is to inform p_faulty_sensor by the number of observed failures as well. That's an extra likelihood term of the form:

pm.Binomial("observed_failures", n=n_observed_failures, p=1-p_faulty_sensor, observed=n_observed_failures")

I would have to think a bit more about whether you can combine the three elements into a single "distribution". If not, a CustomDist with a manual logp may be the cleaner implementation.

This combines the time observations:

import pytensor.tensor as pt

with pm.Model() as m2:
    α = pm.Exponential("α", 1)
    β = pm.Exponential("β", 1)
    time_to_event = pm.Weibull.dist(alpha=α, beta=β)
    p_faulty_sensor = pm.Beta("p_faulty_sensor", 1, 1)

    p_observed_failure = pm.math.switch(
        df.observed_failure,
        1,  # If we observed a failure we are sure of that
        p_faulty_sensor,  # Otherwise the sensor may have failed
    )
    obs = pm.Mixture(
        "obs",
        # Per-row weights must be stacked on the last axis
        w=pt.stack([p_observed_failure, 1 - p_observed_failure], axis=-1),
        comp_dists=[
            pm.Censored.dist(time_to_event, lower=df.check_day, upper=None),
            pm.Censored.dist(time_to_event, lower=None, upper=df.check_day),
        ],
        observed=df.check_day,
    )

But I still suspect you need to inform p_faulty_sensor by the number of observed failures. If you observed many, the sensor is likely not failing that often?

If you have multiple measurements, you have even more information about the sensor's probability of failure.

I tried my suggestion here without complete success: Google Colab

The parameters seem less biased when I hard-code the known probability of the sensor failing, but if given the chance to infer it, the posterior goes to zero, so it behaves as when you ignore it.

Either the parameter is very hard to sample or, more likely, I did something wrong.

Hi, thanks for taking the time to look into this! The sensor failure probability is somewhat known, so I can put a tight prior around it. I put a Beta(100, 900) prior on it, and it samples quite well. It seems it's impossible to infer both the failure rate and the sensor failure probability at the same time without repeated observations.
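
In the model above, that tight prior is just a drop-in replacement for the flat Beta(1, 1):

# Tight prior with mean 0.1, matching the simulated p_faulty_sensor
p_faulty_sensor = pm.Beta("p_faulty_sensor", alpha=100, beta=900)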

That’s exactly my next step. I can check the sensors multiple times for each machine before the failure is registered. The assumption is that as time goes by, more machines will naturally fail, and only the faulty sensors will remain.

The problem is that each unit will now have a different number of observations - after a machine fails, there are no more observations. Is that a problem for the posterior as defined in your model?

Ideally, I probably want a multilevel model with a separate distribution for each machine, but for now I want it to work as a fully pooled model.

I appreciate your help.

It seems to depend on the failure rate. If it's something with a smaller right tail (alpha=3, beta=5), it estimates p well.

Instead, when the data (or prior) is compatible with a long right tail, the model can easily assume zero sensor-failure probability and attribute all unobserved failures to failures later down the road.

You also get a new source of information with repeated measures. Now when you detect a failure it’s not just left censoring, but interval censoring: the failure happened somewhere between the previous observation and now.

However, the machine may have already died before the previous observation, but due to the faulty sensor you didn't know it. So it's a mixture of all the possible censoring intervals, weighted by the sensor failure rate. You'll probably need a custom likelihood for this.
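
A rough sketch of what that custom likelihood could look like for a single machine, assuming sensor faults are independent across checks and the failure is first detected at the last check time (names and structure are hypothetical, untested):

import pymc as pm
import pytensor.tensor as pt

def interval_mixture_logp(check_times, alpha, beta, p_faulty):
    # check_times: t_0 < ... < t_k for one machine, failure first read at t_k.
    dist = pm.Weibull.dist(alpha=alpha, beta=beta)
    cdf = pt.exp(pm.logcdf(dist, check_times))      # F(t_j)
    prev = pt.concatenate([pt.zeros(1), cdf[:-1]])  # F(t_{j-1}), with F(0) = 0
    interval_probs = cdf - prev                     # P(T in (t_{j-1}, t_j])
    # If the failure happened in interval j, the k - j checks between it and
    # the detection all read 0 despite the failure (faulty sensor), and the
    # final check read 1 (working sensor).
    n_masked = check_times.shape[0] - 1 - pt.arange(check_times.shape[0])
    weights = p_faulty**n_masked * (1 - p_faulty)
    return pt.log(pt.sum(interval_probs * weights))

This could then be attached per machine via pm.Potential, e.g. pm.Potential("machine_i", interval_mixture_logp(t_i, α, β, p_faulty_sensor)).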