# Extended BEST model

Imagine we want to compare two group means, as in this example: https://docs.pymc.io/notebooks/BEST.html
But each data point is the average of 5 replicates. For example, keeping the format of the example I referred to, the data look like this:

``````
drug = (101, 100, 102, 104, 102, 97)
placebo = (99, 101, 100, 101, 102)
``````

Now imagine each data point was obtained by averaging 3 other numbers (e.g. 101 = (103 + 102 + 98)/3, and so on). My goal is to compare “drug” and “placebo”, and I am thinking of going one level higher and using a kind of hierarchical model that also accounts for the replicates behind each number in each group. How can I do that? And is it a good idea?
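For concreteness, the averaging step can be sketched like this (the raw replicate values below are made up for illustration; only the first triple appears in the text above):

``````
import numpy as np

# Hypothetical raw replicates: one row per reported data point,
# three replicates per row.
raw = np.array([
    [103, 102,  98],   # averages to 101
    [101,  99, 100],   # averages to 100
    [104, 101, 101],   # averages to 102
])
reported = raw.mean(axis=1)
print(reported)  # [101. 100. 102.]
``````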

It sounds like you’re describing some kind of nested trial design with replicates. I’ll assume there are two arms (“drug”, “placebo”) and one nesting variable (“knockout_A”, “knockout_B”, “knockout_C”, “WT”), each of which has some number of replicates. This is classical linear-model territory, and it is fairly straightforward.

That holds at least if the knockouts are to be compared. If you have the observations at the granular level (i.e. you’re not observing means), then BEST can be updated straightforwardly. The following generates some replicated data:

``````
import numpy as np
import pandas as pd

np.random.seed(42)  # for reproducibility

n_replicates = 4
# 8 rows per replicate block: 4 drug, 4 placebo, one row per knockout/WT
is_drug = np.hstack([np.ones(4), np.zeros(4)])
eye = np.diag(np.ones(4))
design = np.hstack([is_drug.reshape(8, 1), np.vstack([eye, eye])])
design = pd.DataFrame(np.vstack([design] * n_replicates))
design.columns = ['drug', 'ko_A', 'ko_B', 'ko_C', 'WT']

effects = np.array([0.4, -0.1, 0.3, 0.1, 0.0])
total_r2 = 0.75
true_values = np.dot(design.values, effects)
vtrue = np.var(true_values)
vobs = vtrue / total_r2  # total observed variance implied by the target R^2
verr = vobs - vtrue      # residual (error) variance
observed_values = true_values + np.random.normal(scale=np.sqrt(verr),
                                                 size=true_values.shape[0])
``````

The approach in the example parametrizes the problem via the difference of means between groups. We’re going to do that here as well, but making it hierarchical means treating this difference as a per-group random variable with a shared distribution.

``````
import pymc3 as pm

group_index = {gname: np.where(design.loc[:, gname] == 1)[0]
               for gname in design.columns[1:]}

with pm.Model() as hierarchical:
    drug_eff_overall = pm.Normal('drug_eff', 0., 1.)
    intercept_prior = pm.Normal('intercept', 0., 1.)
    err_sd_hp = pm.HalfNormal('error_sd_hp', 1.)
    # build the per-group submodel
    for grp in group_index:
        treatment = design.iloc[group_index[grp], 0]
        obs = observed_values[group_index[grp]]
        group_effect = pm.Normal('{}_eff'.format(grp), intercept_prior, 0.1)
        # tight sd: I have pretty good reason to believe the effects
        # are the same across groups
        group_drug_delta = pm.Normal('{}_drug_delta'.format(grp),
                                     drug_eff_overall, 0.1)
        lat_metric = pm.Deterministic('{}_lat_met'.format(grp),
                                      group_drug_delta * treatment + group_effect)
        obs_metric = pm.Normal('{}_metric'.format(grp), lat_metric, err_sd_hp,
                               observed=obs)

    fit_res = pm.sample(500, cores=1)
``````

Then the plots:

``````
import seaborn as sbn
from matplotlib import pyplot as plt

# drop transformed variables and the latent deterministics
wnames = [x for x in fit_res.varnames if x[-2:] != '__' and 'lat_' not in x]
vardf = pd.DataFrame(np.hstack([pd.DataFrame(fit_res[x]) for x in wnames]))
vardf.columns = wnames

def add0(xvals, **kwargs):
    # mark zero on each diagonal panel
    plt.gca().axvline(linewidth=2, color='r')

pp = sbn.pairplot(vardf.loc[:, np.array(['intercept', 'drug_eff'] +
                                        [x for x in wnames if 'delta' in x])])
pp.map_diag(add0)

plt.figure()
pp = sbn.pairplot(vardf.loc[:, np.array([x for x in wnames if 'eff' in x])])
pp.map_diag(add0)
``````

One thing I noticed is that it’s not super straightforward to set up hierarchical nesting given a design matrix. I’m used to syntax like `y ~ 1 + group + drug|group`. I guess I could create an auxiliary design matrix as `design.values[:, 0][:, np.newaxis] * design.values[:, 1:]` and then use

``````
import theano.tensor as tt

drug_eff = pm.Normal(..., shape=(4,))
lat = pm.Deterministic('lat', group_eff + tt.dot(aux_design, drug_eff))
``````

but that seems to introduce a lot of `0*variable` into the graph. Is there a “right” way to do this?
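At the numpy level, the auxiliary-design idea just zeroes out the group columns for non-drugged rows; a minimal sketch, rebuilding the simulated `design` from above so the snippet is self-contained:

``````
import numpy as np
import pandas as pd

# Rebuild the design matrix from the simulation: a "drug" column followed
# by four one-hot group columns, 8 rows stacked 4 times (32 rows total).
is_drug = np.hstack([np.ones(4), np.zeros(4)])
eye = np.diag(np.ones(4))
design = pd.DataFrame(np.vstack([np.hstack([is_drug.reshape(8, 1),
                                            np.vstack([eye, eye])])] * 4),
                      columns=['drug', 'ko_A', 'ko_B', 'ko_C', 'WT'])

# Auxiliary design: drug indicator times each group indicator, so column j
# is nonzero only for drugged rows belonging to group j.
aux_design = design.values[:, 0][:, np.newaxis] * design.values[:, 1:]
print(aux_design.shape)        # (32, 4)
print(aux_design.sum(axis=0))  # [4. 4. 4. 4.] — 4 drugged rows per group
``````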


This kind of linear formula parsing is not great in Python (unlike in R). You can use `patsy` to do some of it.
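`patsy` accepts R-style formula strings directly (e.g. `patsy.dmatrix`); as a dependency-light sketch of the same interaction, `pandas.get_dummies` can build the per-group drug columns by hand (the data frame below is hypothetical, just for illustration):

``````
import pandas as pd

# Hypothetical long-format data: one row per observation.
df = pd.DataFrame({
    'group': ['ko_A', 'ko_A', 'ko_B', 'WT'],
    'drug':  [1, 0, 1, 0],
})

# One-hot encode the group; multiplying by the drug indicator yields the
# per-group drug columns, i.e. the "drug|group" interaction.
dummies = pd.get_dummies(df['group'])
interaction = dummies.mul(df['drug'], axis=0)
print(interaction)
``````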


Thank you @chartl. My problem should be much simpler. Let me explain what I have:
group1={A,B,C,D,E}
But A is obtained from the average of (a1, a2, a3, a4, a5), B from the average of (b1, …, b5), and so on.
I was thinking that instead of the observed A, B, C, …
I could do something logically like this (the code is not correct, it just expresses what I have in mind):

``````
with pm.Model() as mymodel:
    mu_A = pm.Normal(something, something)
    sd_A = pm.HalfNormal(something)
    Obs_A = pm.Normal(mu_A, sd_A, observed=[a1, a2, a3, a4, a5])
``````

Similarly for B , C , D and E.
Then continue as:

``````
with mymodel:
    mu_group1 = pm.Normal(something, something)
    sd_group1 = pm.HalfNormal(something)
    nu = pm.Exponential(something) + 1
    obs_group1 = pm.StudentT(mu_group1, sd_group1, nu, observed=[mu_A, mu_B, ..., mu_D])
``````

This is the idea I am thinking of, but I am not sure how to do it correctly. I think it logically makes more sense than using the posteriors of mu_A, mu_B, …, mu_E as observed values for mu_group1, which would simply use their numeric average. Am I right?

You’re re-using information here. You observe `a1, ..., a5` (these are values), but you do not observe `mu_A` (this is a parameter). Multilevel modeling is about how to properly aggregate information from the value level to the parameter level, and what you’re describing is precisely a hierarchical model with

``````
group1_mu = pm.Normal('g1_mu', 0., 10.)
subgroup_sd = pm.HalfNormal('subgroup_sd_hp', 5.)
err_sd = pm.HalfNormal('err_sd', 5.)
mu_A = pm.Normal('mu_A', group1_mu, subgroup_sd)
obs_A = pm.Normal('A', mu_A, err_sd, observed=[a1, a2, a3, a4, a5])
...
``````

Now if `A` is present in both group1 and group2, you’ll have to expand into a nested model like the earlier example. But if letters are unique within groups, i.e. group1 = [A, …, E], group2 = [F, …, J], then the above should suffice.
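A quick numerical illustration of why modeling at the value level matters (hypothetical numbers): with unequal replicate counts, the mean of the per-subgroup means differs from the pooled mean of all raw values, and only a model fed the raw values weights each observation appropriately:

``````
import numpy as np

# Hypothetical replicates with unequal counts per subgroup.
a = np.array([103., 102., 98.])   # subgroup A, 3 replicates, mean 101
b = np.array([99., 101.])         # subgroup B, 2 replicates, mean 100

mean_of_means = np.mean([a.mean(), b.mean()])   # 100.5
pooled_mean = np.concatenate([a, b]).mean()     # 503/5 = 100.6
print(mean_of_means, pooled_mean)
``````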


Thank you Chris!