Logistic Regression of Biochemical Production Failures: General Model Questions (please roast my model!)

meehljd · July 11, 2023, 4:11pm

First off, I want to say hi to the PyMC community. I am really impressed with how supportive this community is. I’m a DS at a biotech company. I have been deep-diving into Bayes inference for the last 3-4 months. I just finished Statistical Rethinking and am almost caught up in the Learn Bayes Stats podcast catalog.

I am making a model of a biochemical production process that produces custom molecules for customers. I want to use QC fail/pass data to determine what aspects of the molecular designs impact the probability of failure. I currently have a dataset of ~100k observations.

I included a DAG below. Various design aspects (L, 5’, G+, …) affect the latent manufacturing complexity (C) of a given molecule. The perceived complexity then informed the manufacturing process used (S, P), which determines the manufacturing effectiveness (M), since more challenging molecules require more sophisticated methods to produce. Both (C) and (M) inform the probability of failure (F). There are also random sources of failure (R), such as operator error and reagent contamination. Based on (F), the actual outcome of a given run is observed (F_obs).

So starting simple, I’ve only focused on the molecular complexity (C) directly drives failure. The statistical model is:

F_{obs} \sim Binomial(p=F, n=1)
F = \frac{e^{C}}{1+e^{C}}
C = \beta_0 + \sum_{i=1}^n\beta_i x_i
\beta_0 \sim Normal(\mu=log(0.5), \sigma=0.25)
\beta_i \sim Half-Normal(\sigma=0.50)

I used half-normal distributions on the \beta_i params, since the teams only care to identify aspects of the design that increase failure. Features that have negative or negligible affect on failure can be ignored. Also some of the features are correlated, so I’m using this as a regularization hack.

I have run simulations and prior predictive checks with everything looking good. Posterior predictive checks look good as well. The predictions on temporal holdout data have poor predictive power (same with ML models). Sampling time is ~20 hours for 500 draws, 25 params, and ~100k observations.

Some questions, in order of importance:

Is there any concerns or issues with my model as described? This is my first Bayes model, so I’m sure there are some problems.
What is the best way to incorporate the random causes of failure (R). Is it as simple as adding R \sim Dist() and F = \frac{e^{C+R}}{1+e^{C+R}}? Or do I need some sort of mixture model, such as a Beta-binomial (12.1.1 in Rethinking Stats)?
What is the best way to incorporate the manufacturing effectiveness (M)? Any re-parametrization issues?
I have some features that are highly correlated. I see that McElreath’s book has some treatments for this by estimating the covariance matrix. Any sources on how to do this in PyMC?
At some point I would like to upgrade to a Hierarchical model, since many of my features are one-hot encode categorical variables. When I’m feeling brave enough, I’ll go through the PyMC tutorials. In the mean time, any advice on how to do this for this particular model?

This is a beast of a post. Thanks for reading!

Topic		Replies	Views
(Online Meetup): Bayesian Modeling in Biotech: Using PyMC to Analyze Agricultural Data (July 27, 2022) Events community	1	653	August 10, 2022
Hierarchical Bernoulli model for machine fault rate Questions	4	1025	October 11, 2019
Microbial Cell Counting in a Noisy Environment by Cameron Davidson-Pilon PyMCon2020	7	1340	November 1, 2020
[Beginner level question on modeling] Bayesian analysis of F1 scores from two ML models v5 modeling	5	410	January 24, 2023
Engineering PyMC3 models into production ML systems Sharing	7	1493	May 18, 2021

Logistic Regression of Biochemical Production Failures: General Model Questions (please roast my model!)

Related topics