Censored Model with Hurdle Parameter

Hi!

I’m trying to build a statistical model for the following scenario.

There are N machines, each of which stops working at some point in time T. The machines are checked on some schedule to see whether they still work. When a machine is checked, the time of the check Tc and the state of the machine (0 = working, 1 = not working) are recorded. There is one complication: with probability p the sensor measuring the state may itself fail, in which case the observed state is always 0.
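
(So a recorded 1 is always a true failure, while a recorded 0 can mean either a working machine or a faulty sensor; spelled out, the observation model is observed_state = (1 - faulty) * 1{T ≤ Tc}, with faulty ~ Bernoulli(p).)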

I need to estimate the distribution of T and the probability p from the data. For simplicity, let's assume that each machine is checked only once, so we don't have to deal with repeated measurements of individual machines.

This problem is censored: I never observe the true value of T; it is either left-censored (when observed_failure = 1) or right-censored (when observed_failure = 0):

import numpy as np

df = df_simulated.assign(
    # pm.Censored treats a value at the lower bound as left-censored
    # (T <= check_day) and a value at the upper bound as right-censored
    # (T >= check_day).
    lower=lambda df: np.where(df["observed_failure"] == 1, df["check_day"], 0),
    upper=lambda df: np.where(df["observed_failure"] == 0, df["check_day"], np.inf),
)

I fit it like this:

import pymc as pm

with pm.Model() as m2:
    α = pm.Exponential("α", 1)
    β = pm.Exponential("β", 1)
    time_to_event = pm.Weibull.dist(alpha=α, beta=β)
    obs = pm.Censored("obs", dist=time_to_event, lower=df.lower, upper=df.upper, observed=df.check_day)
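
Fitting is then just the usual sampling call inside the model context:

with m2:
    idata = pm.sample()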

And it works beautifully for p = 0: I can recover the predefined failure rate from the simulation.

For p > 0 the model starts to over-estimate the time to failure, as it observes more "non-failures" than it should.

Any help would be great on how to add the hurdle parameter to the model. I understand that I need to modify the log-likelihood somehow, but my problem is that I have two observed variables: the failure indicator and the time. Should I use pm.Potential, or is there a cleaner way?

I believe @tcapretto built hurdle models into bambi recently. Maybe he can make some suggestions.

You may be able to do a Mixture between censored and DiracDelta(0). Can you show some code for how synthetic data would be generated?

Yeah, that's what I was thinking about, but I struggle to understand how to pass two variables (observed_failure and T) into a mixture.

Here is my data generation process:

import numpy as np
import pandas as pd
import scipy.stats as st


# Each machine gets its own typical time to failure, drawn from a Gamma,
# and the actual time to failure is Exponential around that.
typical_days_to_failure = st.gamma(a=5, scale=1).rvs(10000)
days_to_failure = st.expon(scale=typical_days_to_failure).rvs(10000)
p_faulty_sensor = 0.1
# Each machine is checked once, on a uniformly random day from 1 to 19
check_days = np.random.choice(np.arange(1, 20), size=10000)

df_simulated = pd.DataFrame().assign(
    true_failure=days_to_failure, check_day=check_days
).assign(
    failure=lambda df: (df.true_failure < df.check_day).astype(int),
    faulty_sensor=st.bernoulli(p=p_faulty_sensor).rvs(10000),
    observed_failure=lambda df: (1 - df.faulty_sensor) * df.failure,
)

Thinking about your unobserved failures (you can split the dataset into observed and unobserved failures), the likelihood for the unobserved ones is something like:

pm.Mixture(
    "unobserved_failure",
    w=[p_faulty_sensor, 1 - p_faulty_sensor],
    comp_dists=[
        # If the sensor is faulty, we have left censoring
        pm.Censored.dist(dist=time_to_event, lower=df.check_day, upper=None),
        # Otherwise, right censoring
        pm.Censored.dist(dist=time_to_event, lower=None, upper=df.check_day),
    ],
    observed=df.check_day,
)

For observed failures it’s just left censoring.
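
In code, for the subset with observed_failure == 1 (call it df_obs; the name is hypothetical), that would be something like:

pm.Censored(
    "observed_failure_times",
    dist=time_to_event,
    lower=df_obs.check_day,
    upper=None,
    observed=df_obs.check_day,
)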

The other thing you want to do is to inform p_faulty_sensor by the number of observed failures as well. That's an extra likelihood term of the form:

pm.Binomial("observed_failures", n=n_observed_failures, p=1-p_faulty_sensor, observed=n_observed_failures")

I would have to think a bit more about whether you can combine the three elements into a single "distribution". If not, a CustomDist with a manual logp may be the cleaner implementation.

This combines the time observations:

import pytensor.tensor as pt

with pm.Model() as m2:
    α = pm.Exponential("α", 1)
    β = pm.Exponential("β", 1)
    time_to_event = pm.Weibull.dist(alpha=α, beta=β)
    p_faulty_sensor = pm.Beta("p_faulty_sensor", 1, 1)

    p_observed_failure = pm.math.switch(
        df.observed_failure,
        1,  # If we observed a failure we are sure of that
        p_faulty_sensor,  # Otherwise the sensor may have failed
    )
    obs = pm.Mixture(
        "obs",
        # Per-row weights must be stacked on the last axis
        w=pt.stack([p_observed_failure, 1 - p_observed_failure], axis=-1),
        comp_dists=[
            pm.Censored.dist(time_to_event, lower=df.check_day, upper=None),
            pm.Censored.dist(time_to_event, lower=None, upper=df.check_day),
        ],
        observed=df.check_day,
    )

But I still suspect you need to inform p_faulty_sensor by the number of observed failures. If you observed many, the sensor is likely not failing that often?

If you have multiple measurements, you have even more information about the sensor's probability of failure.

I tried my suggestion here without complete success: Google Colab

The parameters seem less biased when I hard-code the known probability of the sensor failing, but if given the chance to infer it, the posterior goes to zero, so it behaves as when you ignore it.

Either the parameter is very hard to sample or, more likely, I did something wrong.

Hi, thanks for taking the time to look into this! The sensor failure probability is somewhat known, so I can put a tight prior around it. I put a Beta(100, 900) prior on it, and it samples quite well. It seems it's impossible to infer both the failure rate and the sensor failure probability at the same time without repeated observations.
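
In the model above, that tight prior is just a drop-in replacement for the flat Beta(1, 1):

# Tight prior with mean 0.1, matching the simulated p_faulty_sensor
p_faulty_sensor = pm.Beta("p_faulty_sensor", alpha=100, beta=900)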

That’s exactly my next step. I can check the sensors multiple times for each machine before the failure is registered. The assumption is that as time goes by, more machines will naturally fail, and only the faulty sensors will remain.

The problem is that each unit will now have a different number of observations - after a machine fails, there are no more observations. Is that a problem for the posterior as defined in your model?

Ideally, I probably want a multilevel model with a separate distribution for each machine, but for now I want it to work as a fully pooled model.

I appreciate your help.

It seems to depend on the failure rate. If it's something with a smaller right tail (alpha=3, beta=5), it estimates p well.

Instead, when the data (or prior) is compatible with a long right tail, the model can easily assume zero sensor-failure probability and attribute all unobserved failures to failures later down the road.

You also get a new source of information with repeated measures. Now when you detect a failure it’s not just left censoring, but interval censoring: the failure happened somewhere between the previous observation and now.

However, the machine may have already died before the previous observation, but due to the faulty sensor you didn't know it. So it's a mixture of all the possible censoring intervals, weighted by the sensor failure rate. You'll probably need a custom likelihood for this.
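
A rough sketch of what that custom likelihood could look like for a single machine, assuming sensor faults are independent across checks and the failure is first detected at the last check time (names and structure are hypothetical, untested):

import pymc as pm
import pytensor.tensor as pt

def interval_mixture_logp(check_times, alpha, beta, p_faulty):
    # check_times: t_0 < ... < t_k for one machine, failure first read at t_k.
    dist = pm.Weibull.dist(alpha=alpha, beta=beta)
    cdf = pt.exp(pm.logcdf(dist, check_times))      # F(t_j)
    prev = pt.concatenate([pt.zeros(1), cdf[:-1]])  # F(t_{j-1}), with F(0) = 0
    interval_probs = cdf - prev                     # P(T in (t_{j-1}, t_j])
    # If the failure happened in interval j, the k - j checks between it and
    # the detection all read 0 despite the failure (faulty sensor), and the
    # final check read 1 (working sensor).
    n_masked = check_times.shape[0] - 1 - pt.arange(check_times.shape[0])
    weights = p_faulty**n_masked * (1 - p_faulty)
    return pt.log(pt.sum(interval_probs * weights))

This could then be attached per machine via pm.Potential, e.g. pm.Potential("machine_i", interval_mixture_logp(t_i, α, β, p_faulty_sensor)).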