Hierarchical model for outlier detection on high-throughput screening assays

Hi everyone,

First post here. I’m a Bayesian noob, even though I’ve been aware of Bayesian stats for a long time and it’s always fascinated me. I’m making my way through my first book at the moment (“Think Bayes”), but I’m impatient and want to get my feet wet. I’ve got a problem that seems perfect for a hierarchical model, but I can’t quite figure out how to use it.

I really hope it’s not too basic for this forum, but I’m a biologist, not a statistician, and I have literally no one in real life who’s remotely interested in these methods to talk to.

A little bit of background, if anyone is curious. I’m a molecular biologist and I work in a synthetic biology company. Most of my work revolves around making bacteria produce some chemical. A very common approach is to mutate thousands of cells and screen them all. Most of the time nothing happens, but sometimes you get a mutation that makes the bacteria spit out more of the chemical.

We try to find the good ones by isolating thousands of mutants and growing them individually in well plates, then measuring the amount of the chemical in each well. We do this over multiple “experiments”, each with tens of plates of 96 wells each. Because it’s biology, there’s a lot of variability: between individual experiments/runs, between plates, plus instrument error and other noise.

This is what a typical dataset looks like: https://i.imgur.com/6jiLLaX.png
The x axis represents the different mutants (strains) and the y axis shows the measurements corresponding to each strain.

I want to: 1. normalize the data while keeping the between-group uncertainty, 2. classify the outliers, if there are any.

The goal is to identify the “good” bacteria, or the ones where the mutations actually did something. Assuming mutations did absolutely nothing to the overwhelming majority of the population, we are looking for outliers.

Picking an outlier that ends up not being one wastes a ton of time. That’s why I think uncertainty is extremely important here, hence my desire to try a Bayesian approach. To keep things simple, I tried making a hierarchical model that takes only the between-experiment variability into account. Since x is not really a variable, I fixed the slope to 0 and fit only the intercept and noise. I used normal distributions across the board, again for simplicity.

And this is what I got https://i.imgur.com/BTXyPtR.png
Very close to what I get if I just calculate the mean and std of each group, so that’s good.
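To make the partial-pooling idea concrete for anyone reading along, here is a toy sketch in plain NumPy (all numbers made up, and a conjugate normal-normal update standing in for actual MCMC): each experiment's mean gets pulled towards the grand mean, more strongly the less data it has.

```python
import numpy as np

# Hypothetical per-experiment summary: observed means and number of wells.
group_means = np.array([10.0, 12.0, 50.0, 11.0])  # one entry per experiment
n_per_group = np.array([96, 96, 4, 96])           # wells measured per experiment

sigma = 2.0                          # assumed within-experiment (measurement) sd
mu0 = group_means.mean()             # grand mean as the prior centre
tau = group_means.std(ddof=1)        # rough between-experiment sd estimate

# Conjugate normal-normal posterior for each experiment's true mean:
# a precision-weighted average of the group mean and the grand mean.
prec_data = n_per_group / sigma**2
prec_prior = 1.0 / tau**2
post_mean = (prec_data * group_means + prec_prior * mu0) / (prec_data + prec_prior)
post_sd = np.sqrt(1.0 / (prec_data + prec_prior))
# Small groups are shrunk towards mu0 more, and keep larger posterior sd.
```

A full model would also infer `sigma`, `mu0`, and `tau` rather than plugging them in, but the shrinkage behaviour is the same.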

The problem is, I don’t really know what to do with it! I thought about normalizing the dataset to the same “Experiment” mean, but then there’s no point in the whole thing, because I’m not using the uncertainty.

Any tips welcome!

And again, I apologize for such a basic post. It’s probably extremely obvious, and I probably wouldn’t be asking if I had read more, but I really wanted to get something useful done before I manage to finish some books.



Outlier detection can be handled implicitly or explicitly. It sounds like you want to model inliers/outliers explicitly, a scenario that is typically handled via a mixture model. You can take a look through this notebook or this one. You can also look at this notebook, which is perhaps a bit more involved than what you need, but might get you thinking about how this approach might (or might not) be appropriate for your scenario.
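This is not the notebooks' code, but the core of the explicit approach fits in a few lines: treat each measurement as coming either from a narrow "inlier" normal or a much wider "outlier" normal, and compute each point's posterior probability of belonging to the outlier component. All parameter values below are made up; in a real model you would infer them.

```python
import numpy as np

def normal_pdf(x, mu, sd):
    """Normal density (written out to avoid a scipy dependency)."""
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)

# Fake measurements: 50 "boring" strains near 10, plus two planted outliers.
y = np.concatenate([rng.normal(10.0, 1.0, size=50), [16.0, 18.0]])

# Assumed two-component mixture (hypothetical values, not inferred here):
p_out = 0.05                           # prior probability a strain is an outlier
mu, sd_in, sd_out = 10.0, 1.0, 5.0     # shared centre; narrow inlier, wide outlier

# Bayes' rule per point: posterior probability of the outlier component.
num = p_out * normal_pdf(y, mu, sd_out)
den = num + (1.0 - p_out) * normal_pdf(y, mu, sd_in)
prob_outlier = num / den
```

In a full Bayesian treatment you would put priors on the component parameters and the mixture weight and sample everything jointly, so `prob_outlier` comes with its own uncertainty — which is exactly what you want before spending bench time on a candidate.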

Regardless, if you want feedback on a particular model, it might be useful to generate some synthetic data that approximates the kind of thing displayed in the plot you shared. By generating the data yourself, you will have a much easier time knowing when your model is behaving “correctly”. And it will be easier for others to get a sense of the scenario you’re dealing with.
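A hypothetical generator for that kind of data might look like the following (every number is invented; the point is that you know up front which strains are truly "good", so you can check whether a model recovers them):

```python
import numpy as np

rng = np.random.default_rng(42)

n_experiments, strains_per_exp = 5, 96
rows, hits = [], set()
for exp in range(n_experiments):
    exp_offset = rng.normal(0.0, 2.0)            # between-experiment shift
    for strain in range(strains_per_exp):
        true_effect = 0.0
        if rng.random() < 0.02:                  # ~2% genuinely improved mutants
            true_effect = rng.uniform(3.0, 6.0)  # hypothetical size of a real hit
            hits.add((exp, strain))
        y = 10.0 + exp_offset + true_effect + rng.normal(0.0, 1.0)  # noisy measurement
        rows.append((exp, strain, y))

data = np.array([r[2] for r in rows])
```

You can then feed `rows` to your model and ask how often the strains in `hits` end up flagged, and how often boring strains get flagged by mistake.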