Hi PyMC3 folks - first time user here so please excuse me if I say some daft things.

I’ve been tinkering with a toy modelling problem for a few days now and I seem to have hit a roadblock. Essentially I have data from a manufacturing process: each record is a product’s “path” through a manufacturing facility (which machine it passed through at each stage), plus a binary variable at the end denoting whether the product was manufactured successfully or whether a fault developed part-way through manufacturing.

The process is therefore only partially observable and I can’t estimate the marginal fault probability directly.

What I’m trying to model using PyMC3 is the marginal probability of a fault developing at *each given machine* (i.e. I want to estimate this figure without conditioning on the other machines).

In case I’m not making sense, here’s a screenshot of some rows from a (synthetic) dataset I created.

Here’s the data-generating function I’m using…

```python
import numpy as np
import pandas as pd

f = 5
n = 2000
n_ = n // f
n_machines_per_stage = 2
data = {
    # the first n_ products go through machine 0; the rest are assigned at random
    'stage_1': np.hstack([[0] * n_, np.random.randint(0, n_machines_per_stage, n - n_)]),
    'stage_2': np.hstack([[0] * n_, np.random.randint(0, n_machines_per_stage, n - n_)]),
    'stage_3': np.hstack([[0] * n_, np.random.randint(0, n_machines_per_stage, n - n_)]),
    'manufactured_without_faults': np.random.binomial(n=1, p=0.85, size=n)
}
n_stages = len(data) - 1  # exclude the target column
df = pd.DataFrame(data).sample(frac=1).reset_index(drop=True)
target = 'manufactured_without_faults'

# machine 0 at stage 2 is "faulty": its success rate drops to 50%
mask = (df['stage_2'] == 0)
df.loc[mask, target] = np.random.binomial(n=1, p=0.50, size=mask.sum())
```
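As a sanity check on the synthetic data (a standalone sketch assuming the same generating scheme as above, with made-up variable names), the empirical per-machine success rates can be read off directly with a `groupby`; machine 0 at stage 2 should sit near the injected 50% rate while the other machine stays near 85%:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n, n_machines_per_stage = 2000, 2
df = pd.DataFrame({
    'stage_2': rng.integers(0, n_machines_per_stage, n),
    'manufactured_without_faults': rng.binomial(1, 0.85, n),
})

# inject the fault: machine 0 at stage 2 only succeeds 50% of the time
mask = df['stage_2'] == 0
df.loc[mask, 'manufactured_without_faults'] = rng.binomial(1, 0.50, mask.sum())

# empirical success rate per machine at stage 2
rates = df.groupby('stage_2')['manufactured_without_faults'].mean()
print(rates)  # machine 0 near 0.50, machine 1 near 0.85
```

Any posterior estimate for stage 2 ought to roughly agree with these empirical figures, which gives a quick target to compare the model against.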

Finally, here’s the model that I’m using - I’ve essentially adapted the log-linear random effect model from the Rugby analytics notebook in the PyMC3 docs to use a Bernoulli RV as the output.

```python
import numpy as np
import pymc3 as pm
import theano.tensor as T

with pm.Model() as model:
    # hyperpriors shared by all machine-level effects
    # (note: np.exp(1) / (1 + np.exp(1)) ≈ 0.73 is invlogit(1), used here as
    # the prior mean on the logit scale)
    mu = pm.StudentT('mu', nu=1, mu=(np.exp(1) / (1 + np.exp(1))), sd=1)
    sig = pm.HalfStudentT('sig', sd=1)

    # machine-specific model parameters
    deterministics = {}
    dists = {}
    thetas = []
    for key in data.keys():
        if key != target:
            # one logit-scale effect per machine at this stage
            dists[key] = pm.Normal(f'{key}_logit', mu=mu, sd=sig,
                                   shape=df[key].nunique())
            deterministics[key] = pm.Deterministic(key, pm.math.invlogit(dists[key]))
            thetas.append(dists[key][df[key].values])
    theta = T.sum(thetas, axis=0)

    # likelihood of observed data
    fault = pm.Bernoulli(target, p=pm.math.invlogit(theta),
                         observed=df[target].values)
```

However, when I look at the posterior probabilities (by inspecting the inverse-logit deterministics), they don’t appear sensible: they’re pushed too far towards 0, which is making inference difficult.
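One plain-NumPy illustration of a possible cause (just a hypothesis, not a diagnosis of the model above): the likelihood applies the inverse logit to the *sum* of three per-stage logits, so even moderately negative per-stage values compound before being squashed, dragging the implied probability well below any single stage’s value:

```python
import numpy as np

def invlogit(x):
    """Inverse logit (sigmoid): maps a logit back to a probability."""
    return 1.0 / (1.0 + np.exp(-x))

per_stage = -0.5

# a single per-stage logit of -0.5 corresponds to p ≈ 0.38 on its own...
p_single = invlogit(per_stage)

# ...but summing three such contributions gives p ≈ 0.18
p_summed = invlogit(3 * per_stage)

print(p_single, p_summed)
```

So if each `{key}_logit` is interpreted as a per-machine probability on its own, the summed-logit parameterisation will systematically report something smaller.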

Anyway: is there anything glaringly obvious that I’m getting wrong here? I just can’t figure it out…