Hi All,
I hope the topic title is not too vague. I’ll preface this by saying I am describing a biochemical/biophysical problem below in some detail so that a) I can solve it and b) I can understand partial-pooling/multilevel models better. Below is quite a ‘tome’, so the tl;dr is: I can’t write a model where the noise and an affinity parameter are pooled while the individual experimental measurements, each of which is itself a series of data points, remain unpooled.
I am trying to estimate the binding affinity of a molecule (ligand) to a protein (in biochem jargon, a K_d). Without going into too much detail, the method lets me take measurements at many (229) locations on the protein. Each measurement is a shift that increases from an initial ligand concentration up to some saturation level; we record this movement as ‘d’ at each ligand concentration, and ‘d_max’ (the maximum shift possible, at ligand saturation) is a parameter to be estimated. There are 14 measurements of ‘d’ as the ligand concentration increases. Adding ligand also dilutes the protein, so the protein concentration changes as well. I have a vector of 14 ligand concentrations:
lc = np.asarray([0,0.014,0.028,0.042,0.06,0.083,0.109,0.149,0.192,0.243,0.307,0.389,0.594,0.811])
and protein concentration:
pc = np.asarray([0.046,0.046,0.046,0.046,0.045,0.045,0.045,0.045,0.044,0.044,0.043,0.042,0.041,0.039])
and an equation (model) that relates ‘d’ to ‘d_max’ and K_d given lc and pc:
d = d_max*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc)
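As a quick sanity check, the same equation can be wrapped in a plain NumPy function and evaluated for illustrative values (the K_d = 0.05 and d_max = 1 below are just assumptions for the sketch, not fitted results):

import numpy as np

def binding_curve(lc, pc, K_d, d_max):
    # Binding model from above: shift d given ligand (lc) and protein (pc) concentrations
    b = pc + lc + K_d
    return d_max * (b - np.sqrt(b**2 - 4 * pc * lc)) / (2 * pc)

d_example = binding_curve(lc, pc, K_d=0.05, d_max=1.0)  # shape (14,): 0 at lc=0, rising towards saturation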
Now, for each of the 229 sites on the protein, I have a series of 14 measurements of ‘d’ that correspond to the values in lc and pc. It is easy to fit each series individually, either with a standard least-squares approach or in a straightforward Bayesian PyMC way:
import numpy as np
import pymc as pm  # PyMC v4 (aesara backend)

with pm.Model() as model:
    # Priors: d_max centred on the last (most saturated) point; K_d and sigma are positive
    d_max = pm.Normal('d_max', mu=data.iloc[-1], sigma=1)
    K_d = pm.Exponential('K_d', 1)
    sigma = pm.Exponential('sigma', 1)
    # Binding model relating d to d_max and K_d given lc and pc
    d = d_max*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc)
    y = pm.Normal('y', mu=d, sigma=sigma, observed=data)
    trace = pm.sample(tune=1000, draws=1000, target_accept=0.99)
In words: the 14 data points (one series) are called ‘data’. d_max is modelled as a normal centred on the last value in the series (the point most likely to be closest to full saturation, i.e. d_max). A standard deviation of 1 is very wide here but works. K_d can only be positive, so it is modelled as an exponential; likely values given my data are around 0.05. sigma is the measurement error on the d values and appears in the likelihood ‘y’; I also like to estimate this value.
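As a rough prior check (just a sketch, drawing from the priors with NumPy), Exponential(1) comfortably covers K_d values around 0.05:

import numpy as np

rng = np.random.default_rng(0)
kd_prior = rng.exponential(scale=1.0, size=10_000)   # draws from the Exponential(1) prior on K_d
print(np.mean(kd_prior > 0.05))                      # ~0.95, so 0.05 sits well inside the prior
print(np.quantile(kd_prior, [0.05, 0.5, 0.95]))      # roughly [0.05, 0.69, 3.0]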
This samples just great, and I can iterate over all 229 sites and get K_d values.
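For completeness, a minimal sketch of that loop (assuming the 229x14 matrix is the data_229_systems array introduced below, and that pm.sample returns an InferenceData object as in PyMC v4):

K_d_estimates = []
for data in data_229_systems:                      # one 14-point series per site
    with pm.Model():
        d_max = pm.Normal('d_max', mu=data[-1], sigma=1)
        K_d = pm.Exponential('K_d', 1)
        sigma = pm.Exponential('sigma', 1)
        d = d_max*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc)
        pm.Normal('y', mu=d, sigma=sigma, observed=data)
        trace = pm.sample(tune=1000, draws=1000, target_accept=0.99)
    K_d_estimates.append(trace.posterior['K_d'].mean().item())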
What I want to do is write a model with a global (pooled) K_d and sigma, but with d_max fitted individually for each of the 229 data series. My first approach was:
N.B. ‘data_229_systems’ is a 229x14 numpy matrix.
with pm.Model() as model:
    # Pooled parameters shared by all 229 systems
    sigma = pm.Uniform('sigma', lower=0, upper=1)
    K_d = pm.Exponential('K_d', 1)
    for i, data in enumerate(data_229_systems):  # 'data' is the 14-point series for system i
        d_max_n = pm.Normal(f'd_max_{i}', mu=data[-1], sigma=1)
        d_n = d_max_n*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc)
        y_n = pm.Normal(f'y_{i}', mu=d_n, sigma=sigma, observed=data)
    trace = pm.sample(tune=1000, draws=1000, target_accept=0.99)
This does seem to work, but I have seen other approaches that use a group index, which seems more Pythonic and better suited to writing more complex multilevel models (see https://towardsdatascience.com/bayesian-hierarchical-modeling-in-pymc3-d113c97f5149 for the example I’m following). In my case I have 229 separate groups, indexed 0 through 228. So I tried code like this:
with pm.Model() as model:
    system_index = np.arange(data_229_systems.shape[0])  # 0..228, one index per system
    sigma = pm.Exponential('sigma', 10)
    K_d = pm.Exponential('K_d', 1)
    # One d_max per system, centred on the last point of each series
    d_max = pm.Normal('d_max', mu=data_229_systems[:, -1], sigma=1, shape=data_229_systems.shape[0])
    d = pm.Deterministic('d', d_max[system_index]*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc))
    y = pm.Normal('y', mu=d[system_index], sigma=sigma, observed=data_229_systems)
    trace = pm.sample(tune=1000, draws=1000, target_accept=0.99)
But aesara complains bitterly about this:
ERROR (aesara.graph.opt): Optimization failure due to: constant_folding
ERROR (aesara.graph.opt): node: Assert{msg=Could not broadcast dimensions}(ScalarConstant{229}, TensorConstant{False})
ERROR (aesara.graph.opt): TRACEBACK:
ERROR (aesara.graph.opt): Traceback (most recent call last):
At this point I am fairly lost. I don’t know how to use the shape argument of the pm.Normal for d_max to get 229 normals, one fitted to each of the 229 data series, while keeping a single global K_d and sigma. The problem seems to stem from each series of 14 data points being used to estimate a d_max that is just a single value per series.
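To illustrate what I mean in plain NumPy terms (a sketch with illustrative values only): d_max indexed by system_index is length 229, while the term built from pc and lc is length 14, so the product cannot broadcast, even though the observed data is 229x14:

d_max_vec = np.ones(229)                  # stand-in values: one d_max per system, shape (229,)
term = pc + lc + 0.05                     # the (pc + lc + K_d) piece, shape (14,)
try:
    d_max_vec * term                      # shapes (229,) and (14,) do not broadcast
except ValueError as err:
    print(err)                            # "operands could not be broadcast together ..."
print((d_max_vec[:, None] * term[None, :]).shape)   # (229, 14), matching the observed data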
So I will end by apologising for the length, and hope this post is interesting and clear enough to get some help/guidance.
Cheers
Scott