Hierarchical Bernoulli model for machine fault rate

Hi PyMC3 folks - first time user here so please excuse me if I say some daft things.

I’ve been tinkering with a toy modelling problem for a few days now and I seem to have hit a roadblock. Essentially I have some data of a manufacturing process where I have a product’s “path” through a manufacturing facility and a binary variable at the end denoting whether the product was manufactured successfully or whether a fault developed part-way through manufacturing.

The process is therefore only partially observable and I can’t estimate the marginal fault probability directly.

What I’m trying to model using PyMC3 is the marginal probability of a fault developing at each given machine (I want to estimate this figure non-conditioned on the other machines).

In case I’m not making sense, here’s a screenshot of some rows from a (synthetic) dataset I created.
Capture

Here’s the data-generating function I’m using…

f = 5
n = 2000
n_ = n // f

n_machines_per_stage = 2
data = {
    'stage_1': np.hstack([([0] * n_), np.random.randint(0, n_machines, n - n_)]),
    'stage_2': np.hstack([([0] * n_), np.random.randint(0, n_machines, n - n_)]),
    'stage_3': np.hstack([([0] * n_), np.random.randint(0, n_machines, n - n_)]),
    'manufactured_without_faults': np.random.binomial(n=1, p=0.85, size=n)
}
n_stages = len(data.keys())

df = pd.DataFrame(data).sample(frac=1).reset_index(drop=True)

target = 'manufactured_without_faults'
mask = (df['stage_2'] == 0)
df.loc[mask, target] = np.random.binomial(n=1, p=0.50, size=mask.sum())

Finally, here’s the model that I’m using - I’ve essentially adapted the log-linear random effect model from the Rugby analytics notebook in the PyMC3 docs to use a Bernoulli RV as the output.

with pm.Model() as model:
    mu = pm.StudentT('mu', nu=1, mu=(np.exp(1) / (1 + np.exp(1))), sd=1)
    sig = pm.HalfStudentT('sig', sd=1)

    # machine-specific model parameters
    deterministics = {}
    dists = {}
    thetas = []
    for key in data.keys():
        if key != target:
            dists[key] = pm.Normal(f'{key}_logit', mu=mu, sd=sig, shape=df[key].nunique())
            deterministics[key] = pm.Deterministic(key, pm.invlogit(dists[key]))
            thetas.append(dists[key][df[key].values])
    theta = T.sum(thetas, axis=0)
    
    # likelihood of observed data
    fault = pm.Bernoulli(target, p=pm.invlogit(theta), observed=df[target].values)

However, when I look at the posterior probabilities (by inspecting the inverse logit), they don’t appear “sensible” (they appear pushed too far towards 0 - which is making inference difficult).

Anyway. Is there anything glaringly obvious that I’m getting wrong here? I can’t figure out what I’m getting wrong…

I see a few potential problems.

  1. It looks to me like you’re sharing a prior on {key}_logit:mu and {key}_logit:sd across all stages. Is this intentional? You may want to create these priors per stage instead.
  2. You’re treating this essentially as hierarchical logistic regression, right? You may need to enforce something like sum-to-zero constraints on the coefficients. This notebook has some examples that might help you get started.
  3. (edit) Also, on further thought, I’m not sure using \frac{e}{1 + e} is the right value for the prior - should probably just be 0, and may want to lose the hyperprior {key}_mu entirely.
  4. (edit) There’s also no intercept here - that might be a helpful thing to include.
with pm.Model() as model:
    deterministics = {}
    dists = {}
    thetas = []

    for key in data.keys():
        if key != target:
            # Edit: may want to replace this with mu = 0
            mu = pm.StudentT(f'{key}_mu', nu=1, mu=0, sd=1)
            sig = pm.HalfStudentT(f'{key}_sig', sd=1)
            dists[key] = pm.Normal(f'{key}_logit', mu=mu, sd=sig, shape=df[key].nunique())
            deterministics[key] = pm.Deterministic(key, pm.invlogit(dists[key]))

            thetas.append(dists[key][df[key].values])

    intercept = pm.Normal('intercept', mu=0, sd=1)

    theta = intercept + T.sum(thetas, axis=0)
    
    # likelihood of observed data
    fault = pm.Bernoulli(target, p=pm.invlogit(theta), observed=df[target].values)

Thanks for your reply - you’ve actually gone into a lot of depth and this is really helpful.

I’ve noticed the sum-to-zero constraint in other models and I’m not fully sure why it’s used. I don’t think it would help in my particular case, so I omitted it. In my case, I’m trying to figure out the probability that an order will develop a fault given a partially observable process.

Basically I’m making the assumption that… if an order passes through 3 machines (stage 1, stage 2 and stage 3) then the overall probability of “success” for that order (i.e. no fault) is p_{succ}(s_{1}) \times p_{succ}(s_{2}) \times p_{succ}(s_{3}).

My model has been tweaked and given some promising results initially but I’m struggling with parameterising my priors (Gelman-Rubin statistic looks good but BFMI is typically hovering between 0.4 and 0.2 so it’s fairly low).

I’m wondering if I can bring the BFMI up by tightening up the priors and sampling for longer…

Given that this model has shown some promising results (can’t share as commercially sensitive), do you think I’d still need the sum-to-zero constraint?

with pm.Model() as model:
    # fairly wide hyperpriors, half-Cauchy for long tail
    mu = pm.Normal('mu', mu=df[target].mean(), sd=5)
    sigma = pm.HalfCauchy('sig', beta=3)

    # define model and priors
    dists = {}
    probs = {}
    thetas = []
    for i, key in enumerate(encoded_categorical):
        if key != target:
            dists[key] = pm.Normal(
                '{}_logit'.format(key),
                mu=mu, sd=sigma,
                shape=df[key].nunique()
            )
            probs[key] = pm.Deterministic(key, pm.invlogit(dists[key]))
            thetas.append(pm.invlogit(dists[key][df[key].values]))

    # product of individual success probabilities
    theta = T.prod(T.stack(thetas, axis=0), axis=0)
    fault = pm.Bernoulli(target, p=theta, observed=df[target].values)

For the model, probably can’t speak too well as to exactly why sampling is difficult, but here are my further thoughts.

Just do logistic regression?

I don’t think explicitly modeling the joint probability as product of individual machine probabilities here is actually going to do you any favors, especially when sampling.

I’m not sure it’s the right framing for the problem even: a failure at any machine means a failure at output, right? In that case, shouldn’t the probability of an error anywhere be the sum of the individual error probabilities, not the product?

But, practically speaking, if you add outputs of pm.invlogit, you may get a value > 1, which is invalid for theta.

So AFAICT, you really do just want to do more conventional logistic regression here, as you had originally.

EDIT

Ah crap, I overlooked that you’re modeling success probability instead of error probability. I see better where you came from here.

I still think doing a regular logistic regression here is probably better. Consider the interpretation of logistic regression coefficients - theta / (1-theta) = exp(b0 + b1*x1 ...) = exp(b0) * exp(b1 * x1) .... The product naturally ends up as a sum in the exponentiation.

It might be good to revisit the derivation of logistic regression if you have time to see better how this reflects the joint probability of all successes.

Intercept: I really think it will help!

I imagine for your problem there’s some substantial class imbalance - positive:negative ratio for flaw/no flaw probably 1:5 or lower? Without an intercept term, you’re basically saying the “base rate” is .5, and each machine’s coefficient must determine a deflection from that base rate. However, the base rate is probably much lower (I hope? :sweat_smile:) Introducing an intercept could potentially help a lot.

(See logistic function: do you really want x_0 fixed at .5?)

Prior width

Your priors seem super wide to me - I don’t think you need to have sd=5 for mu and a wide sigma prior. I would change sd=5 to sd=1.

You could also consider replacing HalfCauchy(...) with Gamma('sigma', 1.64, .32). (see Kruschke DBDA for rationale behind this prior choice)

Hierarchical priors and ST0

Similar to what I said before, when you share a prior on variance across machine types, you’re basically saying “all machine types’ coefficients are on the same scale”. Is that the case, or might there be large differences in scale between machine types? (Are some machine types much more likely than others to introduce flaws?)

I would definitely urge trying separate priors per machine type if you haven’t yet. By sharing the prior across individual machines, within machine type, you’re saying each individual machine’s coefficient is conditional on what type of machine it is - from my remote position, this looks like it reflects the problem structure better than a global prior.

(You could even go one step further and reintroduce a population-level hyperprior on the per-machine-type variance priors, if you feel it necessary).

This is also where sum-to-zero (ST0) becomes more relevant. As I understand it, ST0 has mainly to do with interpretability. You might use it here within machine type - so that each individual machine’s coefficient describes a deflection from the baseline error rate for that machine type. (But it’s not strictly necessary, and you may want to interpret your coefficients differently.)

Robust (beta-binomial) likelihood

Especially if you have imbalanced data, using output of logistic regression as Bernoulli theta directly may be a bit problematic - Bernoulli can’t estimate variance and expectation separately, so if expectation of an error is quite low, the variance will also be low and you might have some difficulty sampling. Using a beta prior on the Bernoulli likelihood gets you an robust likelihood function than might better accommodate sparser/overdispersed target events:

# ...

# Beta-binomial variance
kappa = pm.Gamma('kappa', .01, .01)

# mu is output of logistic regression
a = mu * kappa
b = (1 - mu) * kappa
theta = pm.Beta('theta', a, b)

fault = pm.Bernoulli(target, p=theta, observed=df[target].values)

This is really helpful and thorough - thanks for all the time you’ve put into your answer. I’ve got a lot to go away and think about (and I should probably stop putting off the purchase of DBDA…).

I’ll try out your suggestions and see how things turn out. In the meantime, thank you. :slight_smile: