Hello,
I am pretty new to Bayesian modeling and very new to PyMC3, so this may very well be an easy question to answer. I apologize for the long post, but I tried to be complete. Any help is greatly appreciated!
Experimental design
I have 10 different cell lines and am testing 3 genes (the same 3 genes in each cell line). For each combination of cell line and gene, I knock out (i.e. remove) the gene in the cell line and then measure the growth rate of the cells. I run the experiment 3 times for each combination. The output is a log-fold-change (logFC) of the number of cells at the beginning of the experiment compared to the number at the end.
The model
I want to model the logFC as a function of the gene and the cell line. I think this means I need two varying intercepts, one for the gene and one for the cell line, each with its own set of hyper-parameters for partial pooling.
\log(FC) \sim \mathcal{N}(\mu_{g,c}, \sigma) \\ \mu_{g,c} = \alpha_g + \beta_c \\ \quad \alpha_g \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha) \\ \qquad \mu_\alpha \sim \mathcal{N}(0, 5) \quad \sigma_\alpha \sim \text{Exp}(1) \\ \quad \beta_c \sim \mathcal{N}(\mu_\beta, \sigma_\beta) \\ \qquad \mu_\beta \sim \mathcal{N}(0, 5) \quad \sigma_\beta \sim \text{Exp}(1) \\ \sigma \sim \text{Exp}(1)
Synthetic data
Below is how I created some synthetic data under this model.
import pandas as pd
import numpy as np
import plotnine as gg
import pymc3 as pm
import arviz as az
import seaborn as sns
import matplotlib.pyplot as plt
import string
from itertools import product
from numpy.random import normal, exponential
RANDOM_SEED = 103
np.random.seed(RANDOM_SEED)
# Data parameters.
num_cell_lines = 10
num_genes = 3
n_experiments = 3
# Model parameters.
real_params = {
"mu_alpha": -1,
"sigma_alpha": 0.5,
"mu_beta": 0,
"sigma_beta": 0.5,
"sigma": 0.05,
}
real_params["alpha_g"] = normal(
real_params["mu_alpha"], real_params["sigma_alpha"], num_genes
)
real_params["beta_c"] = normal(
real_params["mu_beta"], real_params["sigma_beta"], num_cell_lines
)
data = pd.DataFrame(
product(range(num_cell_lines), range(num_genes), range(n_experiments)),
columns=["cell_line", "gene", "expt"],
)
data["logfc"] = np.nan
for i in range(len(data)):
c = data.loc[i, "cell_line"]
g = data.loc[i, "gene"]
mu_gc = real_params["alpha_g"][g] + real_params["beta_c"][c]
data.loc[i, "logfc"] = normal(mu_gc, real_params["sigma"], 1)
data[10:]
|   | cell_line | gene | expt | logfc |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | -1.73762 |
| 1 | 0 | 0 | 1 | -1.87747 |
| 2 | 0 | 0 | 2 | -1.88619 |
| 3 | 0 | 1 | 0 | -1.27018 |
| 4 | 0 | 1 | 1 | -1.32484 |
| 5 | 0 | 1 | 2 | -1.28888 |
| 6 | 0 | 2 | 0 | -0.934375 |
| 7 | 0 | 2 | 1 | -0.936662 |
| 8 | 0 | 2 | 2 | -1.08875 |
| 9 | 1 | 0 | 0 | -2.13649 |
Here are some plots of the synthetic data. The first shows the logFC values by gene and the second shows them separated by cell line.
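For reference, plots along these lines can be made with seaborn (this is only a sketch of the kind of plot I mean, not necessarily the exact plotting code I used):

# logFC for each gene (every point is one experiment in one cell line).
sns.stripplot(data=data, x="gene", y="logfc")
plt.show()

# logFC for each cell line, colored by gene.
sns.stripplot(data=data, x="cell_line", y="logfc", hue="gene", dodge=True)
plt.show()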
Coding the model
cell_line_idx = data["cell_line"].values
gene_idx = data["gene"].values
with pm.Model() as model:
    # Hyper-priors
    mu_alpha = pm.Normal("mu_alpha", 0, 5)
    sigma_alpha = pm.Exponential("sigma_alpha", 1)
    mu_beta = pm.Normal("mu_beta", 0, 5)
    sigma_beta = pm.Exponential("sigma_beta", 1)

    # Priors
    alpha_g = pm.Normal("alpha_g", mu_alpha, sigma_alpha, shape=num_genes)
    beta_c = pm.Normal("beta_c", mu_beta, sigma_beta, shape=num_cell_lines)

    # Likelihood
    mu_gc = pm.Deterministic("mu_gc", alpha_g[gene_idx] + beta_c[cell_line_idx])
    sigma = pm.Exponential("sigma", 1)
    logfc = pm.Normal("logfc", mu_gc, sigma, observed=data["logfc"].values)

    # Sampling
    model_prior_check = pm.sample_prior_predictive(random_seed=RANDOM_SEED)
    model_trace = pm.sample(4000, tune=1000, random_seed=RANDOM_SEED)
    model_post_check = pm.sample_posterior_predictive(
        model_trace, random_seed=RANDOM_SEED
    )
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, beta_c, alpha_g, sigma_beta, mu_beta, sigma_alpha, mu_alpha]
Sampling 2 chains, 0 divergences: 100%|██████████| 10000/10000 [09:17<00:00, 17.94draws/s]
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The number of effective samples is smaller than 10% for some parameters.
100%|██████████| 8000/8000 [00:10<00:00, 769.95it/s]
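For reference, the effective-sample-size warning can be inspected by bundling everything into an ArviZ InferenceData object and summarizing it (a sketch; this was not part of my original code):

# Combine the trace, prior draws, and posterior predictive draws for ArviZ.
idata = az.from_pymc3(
    trace=model_trace,
    prior=model_prior_check,
    posterior_predictive=model_post_check,
    model=model,
)

# Effective sample sizes and r_hat for the intercepts and hyper-parameters.
az.summary(
    idata,
    var_names=["mu_alpha", "sigma_alpha", "alpha_g", "mu_beta", "sigma_beta", "beta_c", "sigma"],
)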
Posteriors
Below are the posterior distributions for the varying intercepts. They are very wide, and while some lean in the correct direction (toward the true values used to simulate the data), others lean in the wrong direction.
Below are the posterior distributions for the hyper-parameters. Again, they are very wide.
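Plots like these can be produced with ArviZ roughly as follows (a sketch using the `idata` object defined above, not my exact plotting code):

# Varying intercepts.
az.plot_posterior(idata, var_names=["alpha_g", "beta_c"])
plt.show()

# Hyper-parameters.
az.plot_posterior(idata, var_names=["mu_alpha", "sigma_alpha", "mu_beta", "sigma_beta", "sigma"])
plt.show()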
Finally, the following plot shows the posterior predictions for each data point. The original data points are in color (the color indicates the cell line and the shape indicates the gene) and the means of the posterior predictive distributions are in black. The predictions are very accurate, with incredibly narrow credible intervals (the little bars on the black points).
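A simplified sketch of how such a plot can be built from model_post_check (the 89% interval is an arbitrary choice here, and the color/shape styling is omitted):

# Posterior predictive draws for each observation: shape (n_samples, n_observations).
ppc = model_post_check["logfc"]

pred_mean = ppc.mean(axis=0)
pred_low, pred_high = np.percentile(ppc, [5.5, 94.5], axis=0)

x = np.arange(len(data))
plt.errorbar(
    x,
    pred_mean,
    yerr=(pred_mean - pred_low, pred_high - pred_mean),
    fmt="ok",
    markersize=3,
)
plt.scatter(x, data["logfc"], c=data["cell_line"], zorder=3)
plt.xlabel("data point")
plt.ylabel("logFC")
plt.show()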
To me, this indicated that there is a lot of correlation between the parameter values. So I plotted, for each MCMC draw, the average value of \alpha_g against the average value of \beta_c. They are perfectly correlated:
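Here is roughly how that plot was computed (a sketch; the plotting details are not my exact code):

# Average the gene intercepts and the cell line intercepts within each MCMC draw.
avg_alpha = model_trace["alpha_g"].mean(axis=1)
avg_beta = model_trace["beta_c"].mean(axis=1)

plt.scatter(avg_alpha, avg_beta, s=2, alpha=0.2)
plt.xlabel("average alpha_g per draw")
plt.ylabel("average beta_c per draw")
plt.show()

# Correlation between the per-draw averages.
print(np.corrcoef(avg_alpha, avg_beta)[0, 1])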
This correlation explains why the posterior distributions are so wide while the posterior predictions are so accurate, but I don’t understand why these varying intercepts are correlated. Is there a different way of modeling this data where I can still get the same information?