Modeling two varying intercepts

Hello,
I am pretty new to Bayesian modeling and very new to PyMC3, so this may very well be an easy question to answer. I apologize for the long post, but I tried to be complete. Any help is greatly appreciated!


Experimental design

I have 10 different cell lines and am testing 3 genes (the same 3 genes in each cell line). For each combination of cell line and gene, I knock out (i.e. remove) the gene in that cell line and then measure the growth rate of the cells. I run the experiment 3 times for each combination. The output is a log-fold-change (logFC) of the number of cells at the beginning of the experiment compared to the number at the end.

The model

I want to model the logFC depending on the gene and cell line. I think this means I need 2 varying intercepts, one for the gene and one for the cell line. Each has its own set of hyper-parameters for partial pooling.

\log(FC) \sim \mathcal{N}(\mu_{g,c}, \sigma) \\
\mu_{g,c} = \alpha_g + \beta_c \\
\quad \alpha_g \sim \mathcal{N}(\mu_\alpha, \sigma_\alpha), \quad \mu_\alpha \sim \mathcal{N}(0, 5), \quad \sigma_\alpha \sim \text{Exp}(1) \\
\quad \beta_c \sim \mathcal{N}(\mu_\beta, \sigma_\beta), \quad \mu_\beta \sim \mathcal{N}(0, 5), \quad \sigma_\beta \sim \text{Exp}(1) \\
\sigma \sim \text{Exp}(1)

Synthetic data

Below is how I created some synthetic data under this model.

import pandas as pd
import numpy as np
import plotnine as gg
import pymc3 as pm
import arviz as az
import seaborn as sns
import matplotlib.pyplot as plt
import string
from itertools import product
from numpy.random import normal, exponential

RANDOM_SEED = 103
np.random.seed(RANDOM_SEED)

# Data parameters.
num_cell_lines = 10
num_genes = 3
n_experiments = 3

# Model parameters.
real_params = {
    "mu_alpha": -1,
    "sigma_alpha": 0.5,
    "mu_beta": 0,
    "sigma_beta": 0.5,
    "sigma": 0.05,
}

real_params["alpha_g"] = normal(
    real_params["mu_alpha"], real_params["sigma_alpha"], num_genes
)
real_params["beta_c"] = normal(
    real_params["mu_beta"], real_params["sigma_beta"], num_cell_lines
)

data = pd.DataFrame(
    product(range(num_cell_lines), range(num_genes), range(n_experiments)),
    columns=["cell_line", "gene", "expt"],
)

data["logfc"] = np.nan
for i in range(len(data)):
    c = data.loc[i, "cell_line"]
    g = data.loc[i, "gene"]
    mu_gc = real_params["alpha_g"][g] + real_params["beta_c"][c]
    data.loc[i, "logfc"] = normal(mu_gc, real_params["sigma"], 1)

data.head(10)
   cell_line  gene  expt      logfc
0          0     0     0  -1.73762
1          0     0     1  -1.87747
2          0     0     2  -1.88619
3          0     1     0  -1.27018
4          0     1     1  -1.32484
5          0     1     2  -1.28888
6          0     2     0  -0.934375
7          0     2     1  -0.936662
8          0     2     2  -1.08875
9          1     0     0  -2.13649
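
As an aside, the generation loop can also be vectorized with NumPy fancy indexing, which mirrors the alpha_g[gene_idx] + beta_c[cell_line_idx] indexing used in the model below. This is just a sketch with the same names as above; it produces data of the same form, not the identical draws, because the random calls differ.

# Vectorized alternative to the generation loop above (illustrative sketch).
g = data["gene"].values
c = data["cell_line"].values
mu_gc = real_params["alpha_g"][g] + real_params["beta_c"][c]
logfc_vectorized = normal(mu_gc, real_params["sigma"])  # equivalent to data["logfc"]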

Here are some plots of the synthetic data: the first shows the logFC values by gene, and the second shows them separated by cell line.

[figure: synthetic logFC by gene and by cell line]

Coding the model

cell_line_idx = data["cell_line"].values
gene_idx = data["gene"].values


with pm.Model() as model:
    # Hyper-priors
    mu_alpha = pm.Normal("mu_alpha", 0, 5)
    sigma_alpha = pm.Exponential("sigma_alpha", 1)
    mu_beta = pm.Normal("mu_beta", 0, 5)
    sigma_beta = pm.Exponential("sigma_beta", 1)

    # Priors
    alpha_g = pm.Normal("alpha_g", mu_alpha, sigma_alpha, shape=num_genes)
    beta_c = pm.Normal("beta_c", mu_beta, sigma_beta, shape=num_cell_lines)

    # Likelihood
    mu_gc = pm.Deterministic("mu_gc", alpha_g[gene_idx] + beta_c[cell_line_idx])
    sigma = pm.Exponential("sigma", 1)
    logfc = pm.Normal("logfc", mu_gc, sigma, observed=data["logfc"].values)

    # Sampling
    model_prior_check = pm.sample_prior_predictive(random_seed=RANDOM_SEED)
    model_trace = pm.sample(4000, tune=1000, random_seed=RANDOM_SEED)
    model_post_check = pm.sample_posterior_predictive(
        model_trace, random_seed=RANDOM_SEED
    )
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (2 chains in 2 jobs)
NUTS: [sigma, beta_c, alpha_g, sigma_beta, mu_beta, sigma_alpha, mu_alpha]
Sampling 2 chains, 0 divergences: 100%|██████████| 10000/10000 [09:17<00:00, 17.94draws/s]
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The number of effective samples is smaller than 10% for some parameters.
100%|██████████| 8000/8000 [00:10<00:00, 769.95it/s]
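
A note on the warnings above: the suggested settings can be passed straight to pm.sample, which forwards them to NUTS. The values here are only illustrative, not tuned recommendations.

# Illustrative re-run with a higher target acceptance rate and a deeper
# maximum tree depth, as suggested by the sampler warnings.
with model:
    model_trace = pm.sample(
        4000,
        tune=2000,
        target_accept=0.95,
        max_treedepth=12,
        random_seed=RANDOM_SEED,
    )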

Posteriors

Below are the posterior distributions for the varying intercepts. They are very wide and some seem to lean in the correct direction, but others lean in the wrong direction.

[figure: posterior distributions of the varying intercepts alpha_g and beta_c]
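
For reference, a sketch of how a summary of these intercepts could be drawn with the arviz import from above (not the exact plot used in the post):

# Forest plot of the varying intercepts, combining both chains.
az.plot_forest(model_trace, var_names=["alpha_g", "beta_c"], combined=True)
plt.show()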

Below are the posterior distributions for the hyper-parameters. Again, they are very wide.

Finally, the following plot shows the posterior predictions for each data point. The original data points are in color (colored by cell line, with the shape corresponding to the gene) and the means of the posterior predictive distributions are in black. The predictions were very accurate, with incredibly small credible intervals (the little bars on the black points).

To me, this indicated that there was a lot of correlation between the parameter values. So I plotted, for each draw of the MCMC, the average value for \alpha_g against the average value for \beta_c. They are perfectly correlated:

[figure: per-draw mean of alpha_g plotted against per-draw mean of beta_c]
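
For reference, a sketch of how this diagnostic could be computed from model_trace (assuming the variable names from the model above):

# For each posterior draw, average the gene intercepts and the cell-line
# intercepts, then plot the two averages against each other.
alpha_draw_means = model_trace["alpha_g"].mean(axis=1)
beta_draw_means = model_trace["beta_c"].mean(axis=1)

plt.scatter(alpha_draw_means, beta_draw_means, s=2, alpha=0.3)
plt.xlabel("mean alpha_g per draw")
plt.ylabel("mean beta_c per draw")
plt.show()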

This correlation explains why the posterior distributions are so wide while the posterior predictions are so accurate, but I don’t understand why these varying intercepts are correlated. Is there a different way of modeling this data where I can still get the same information?

Usually, a posterior figure like yours indicates that your model has a multicollinearity problem; there is an example in Statistical Rethinking (6.1. Multicollinearity).

You can reparameterize the model into a regression-like design matrix (pm.glm is a good place to start).


To build on Junpeng’s answer, here is the link to the PyMC3 port of SR2, which you’ll probably find useful: https://github.com/pymc-devs/resources/tree/master/Rethinking_2

Thank you @junpenglao and @AlexAndorra (great podcast!) for your quick replies! I looked through Rethinking and don't think I found an analogous situation to the one I have here: the examples in the book involve collinear predictor variables, whereas my example is just two varying intercepts. Also, since I generated the synthetic data under a model where the two sets of intercepts are independent of each other, I was surprised to find this collinearity; I didn't include, and don't expect, any correlation between the two varying intercepts.

Would the design matrix effectively turn the model from a hierarchical model with varying intercepts into a single-level model with many indicator variables? Is there a way to still maintain the hierarchical model?

A bit more detail on the collinearity here: you have alpha_g[gene_idx] + beta_c[cell_line_idx] without an intercept, so the overall level gets split between the two terms. In effect, you can think of alpha_g + delta and beta_c - delta, where delta is a scalar: unless these terms are strongly regularized (with more informative priors), the two versions are basically equivalent over a range of delta, which is why you see a figure like the one above.

Formulating it as a design matrix helps because it will have an explicit Intercept, and you can place priors on the coefficients so that they are regularized around 0.
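
To make that concrete, here is a rough sketch (not from the thread) of one way to add an explicit intercept while keeping the varying-intercept indexing and the partial pooling; the name model_v2 and the prior scales are purely illustrative.

with pm.Model() as model_v2:
    # Explicit global intercept absorbs the overall level that was
    # previously free to shift between alpha_g and beta_c.
    intercept = pm.Normal("intercept", 0, 5)

    # Hyper-priors on the spread of each set of offsets.
    sigma_alpha = pm.Exponential("sigma_alpha", 1)
    sigma_beta = pm.Exponential("sigma_beta", 1)

    # Zero-centered offsets; partial pooling still happens through the
    # shared sigma_alpha and sigma_beta.
    alpha_g = pm.Normal("alpha_g", 0, sigma_alpha, shape=num_genes)
    beta_c = pm.Normal("beta_c", 0, sigma_beta, shape=num_cell_lines)

    mu_gc = pm.Deterministic(
        "mu_gc", intercept + alpha_g[gene_idx] + beta_c[cell_line_idx]
    )
    sigma = pm.Exponential("sigma", 1)
    logfc = pm.Normal("logfc", mu_gc, sigma, observed=data["logfc"].values)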


Thank you for that explanation - that makes a lot of sense. If I use a design matrix approach, however, don’t I lose the partial pooling that I would get from a hierarchical model?

I have nothing to add to Junpeng's very good explanation of the additive non-identifiability in your model, but I just want to point out that I'd advise building the hierarchical model in plain PyMC3 (as is done in the SR2 notebooks I shared above) and not with the glm module – otherwise the hierarchical structure could indeed be difficult to specify.
Hope this helps, and thanks a lot for your feedback on the podcast; I really appreciate it!