Logistic Regression of Biochemical Production Failures: General Model Questions (please roast my model!)

First off, I want to say hi to the PyMC community. I am really impressed with how supportive this community is. I’m a DS at a biotech company. I have been deep-diving into Bayes inference for the last 3-4 months. I just finished Statistical Rethinking and am almost caught up in the Learn Bayes Stats podcast catalog.

I am making a model of a biochemical production process that produces custom molecules for customers. I want to use QC fail/pass data to determine what aspects of the molecular designs impact the probability of failure. I currently have a dataset of ~100k observations.

I included a DAG below. Various design aspects (L, 5’, G+, …) affect the latent manufacturing complexity (C) of a given molecule. The perceived complexity then informed the manufacturing process used (S, P), which determines the manufacturing effectiveness (M), since more challenging molecules require more sophisticated methods to produce. Both (C) and (M) inform the probability of failure (F). There are also random sources of failure (R), such as operator error and reagent contamination. Based on (F), the actual outcome of a given run is observed (F_obs).

So starting simple, I’ve only focused on the molecular complexity (C) directly drives failure. The statistical model is:

F_{obs} \sim Binomial(p=F, n=1)
F = \frac{e^{C}}{1+e^{C}}
C = \beta_0 + \sum_{i=1}^n\beta_i x_i
\beta_0 \sim Normal(\mu=log(0.5), \sigma=0.25)
\beta_i \sim Half-Normal(\sigma=0.50)

I used half-normal distributions on the \beta_i params, since the teams only care to identify aspects of the design that increase failure. Features that have negative or negligible affect on failure can be ignored. Also some of the features are correlated, so I’m using this as a regularization hack.

I have run simulations and prior predictive checks with everything looking good. Posterior predictive checks look good as well. The predictions on temporal holdout data have poor predictive power (same with ML models). Sampling time is ~20 hours for 500 draws, 25 params, and ~100k observations.

Some questions, in order of importance:

  1. Is there any concerns or issues with my model as described? This is my first Bayes model, so I’m sure there are some problems.
  2. What is the best way to incorporate the random causes of failure (R). Is it as simple as adding R \sim Dist() and F = \frac{e^{C+R}}{1+e^{C+R}}? Or do I need some sort of mixture model, such as a Beta-binomial (12.1.1 in Rethinking Stats)?
  3. What is the best way to incorporate the manufacturing effectiveness (M)? Any re-parametrization issues?
  4. I have some features that are highly correlated. I see that McElreath’s book has some treatments for this by estimating the covariance matrix. Any sources on how to do this in PyMC?
  5. At some point I would like to upgrade to a Hierarchical model, since many of my features are one-hot encode categorical variables. When I’m feeling brave enough, I’ll go through the PyMC tutorials. In the mean time, any advice on how to do this for this particular model?

This is a beast of a post. Thanks for reading!

1 Like