Pooling, Unpooling and Partial Pooling where each data point is a series of data

Hi All,

I hope the topic title is not too vague. I’ll preface this by saying I am describing a biochemical/biophysical problem below in some detail, so that a) I can solve it and b) I can understand partial-pooling/multilevel models better. Below is quite a ‘tome’, so the tl;dr is: I can’t write a model where the noise and an affinity parameter are pooled but the individual experimental measurements, each of which is itself a series of data points, are left unpooled.

I am trying to estimate the binding affinity of a molecule (ligand) to a protein (in biochem jargon, a K_d). Without going into too much detail, the method lets me take measurements at many (229) locations on the protein. Each measurement is a shift that grows from the initial ligand concentration up to some saturation level. We measure this shift as ‘d’ at each ligand concentration, where ‘d_max’ (the maximum shift possible, at ligand saturation) is a parameter to be estimated. There are 14 measurements of ‘d’ as the ligand concentration increases. Adding ligand also dilutes the protein, so the protein concentration changes as well. I have a vector of 14 values for the ligand concentration:

lc = np.asarray([0,0.014,0.028,0.042,0.06,0.083,0.109,0.149,0.192,0.243,0.307,0.389,0.594,0.811])

and protein concentration:

pc = np.asarray([0.046,0.046,0.046,0.046,0.045,0.045,0.045,0.045,0.044,0.044,0.043,0.042,0.041,0.039])

and an equation (model) that relates ‘d’ to ‘d_max’ and K_d given lc and pc:

d = d_max*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc)
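
Just to make the behaviour of that equation concrete, here is a quick standalone check using the lc and pc vectors above with purely illustrative d_max and K_d values (a K_d of roughly 0.05 is mentioned further down); this is only a sketch, not part of any fit:

    import numpy as np

    lc = np.asarray([0, 0.014, 0.028, 0.042, 0.06, 0.083, 0.109, 0.149,
                     0.192, 0.243, 0.307, 0.389, 0.594, 0.811])
    pc = np.asarray([0.046, 0.046, 0.046, 0.046, 0.045, 0.045, 0.045, 0.045,
                     0.044, 0.044, 0.043, 0.042, 0.041, 0.039])

    d_max, K_d = 1.0, 0.05  # illustrative values only, not fitted
    d = d_max * ((pc + lc + K_d) - np.sqrt((pc + lc + K_d)**2 - 4*pc*lc)) / (2*pc)
    print(d)  # starts at 0 and rises towards d_max as the ligand concentration increases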

Now for each of the 229 sites on the protein, I have a series of 14 measurements of d that correspond to the values in lc and pc. It is easy to fit these data individually, either with a standard least-squares approach or in a straightforward Bayesian way with pymc:

    with pm.Model() as model:
        d_max = pm.Normal('d_max', mu=data.iloc[-1], sigma=1)
        K_d = pm.Exponential('K_d', 1)
        sigma = pm.Exponential('sigma', 1)
        d = d_max*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc)
        y = pm.Normal('y', mu=d, sigma=sigma, observed=data)
    
        trace = pm.sample(tune=1000, draws=1000, target_accept=0.99)

In words: the series of 14 data points is called ‘data’. d_max is modeled as a normal centered on the last value in the series (the point most likely to be closest to full saturation, i.e. d_max). A sigma of 1 is very wide here but works. K_d can only be positive, so it is modeled as an exponential; likely values given my data are around 0.05. sigma is the measurement error on the d values and appears in the likelihood ‘y’; I also like to estimate this value.

This samples just great, and I can iterate over all the 229 sites and get K_d values.
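
For reference, here is a minimal sketch of that loop (assuming the measurements are stacked into the 229x14 array data_229_systems that I introduce below, and that pm.sample returns an InferenceData object, the default in recent PyMC versions; pm.math.sqrt is used here, though np.sqrt as above also works for me):

    import pymc as pm

    kd_estimates = []
    for data in data_229_systems:  # each row holds the 14 shift measurements for one site
        with pm.Model():
            d_max = pm.Normal('d_max', mu=data[-1], sigma=1)
            K_d = pm.Exponential('K_d', 1)
            sigma = pm.Exponential('sigma', 1)
            d = d_max*((pc+lc+K_d) - pm.math.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc)
            pm.Normal('y', mu=d, sigma=sigma, observed=data)
            trace = pm.sample(tune=1000, draws=1000, target_accept=0.99)
        kd_estimates.append(float(trace.posterior['K_d'].mean()))  # posterior mean per site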

What I want to do is write a model with global fitting for K_d and sigma, but d_max should be fitted individually for each of the 229 series of data. My approach was:

N.B. ‘data_229_systems’ is a 229x14 numpy matrix.

with pm.Model() as model:
    sigma = pm.Uniform('sigma', lower=0, upper=1)
    K_d = pm.Exponential('K_d', 1) 
    
    for i, data in enumerate(data_229_systems): # this pulls out the 14 points of data for each system into 'data'
        d_max_n = pm.Normal(f'd_max_{i}', mu=data[-1], sigma=1)     
        d_n = d_max_n*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc)
        y_n = pm.Normal(f'y_{i}', mu=d_n, sigma=sigma, observed=data)

    trace = pm.sample(tune=1000, draws=1000, target_accept=0.99)

This does seem to work, but I have seen other approaches that use grouping/indexing, which seems more pythonic and better suited to writing more complex multilevel models (see https://towardsdatascience.com/bayesian-hierarchical-modeling-in-pymc3-d113c97f5149 for the example I’m following). In my case I have 229 separate groups, indexed 0 through 228. So I tried code like this:

with pm.Model() as model:
    system_index = np.arange(data_229_systems.shape[0])
    sigma = pm.Exponential('sigma', 10)
    K_d = pm.Exponential('K_d', 1) 
    d_max = pm.Normal('d_max', mu=data[:, -1], sigma=1, shape=data_229_systems.shape[0])
    
    d = pm.Deterministic('d', d_max[system_index]*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc))
    
    y = pm.Normal('y', mu=d[system_index], sigma=sigma, observed=data_229_systems)    
    
    trace = pm.sample(tune=1000, draws=1000, target_accept=0.99)

But aesara complains bitterly about this:

ERROR (aesara.graph.opt): Optimization failure due to: constant_folding
ERROR (aesara.graph.opt): node: Assert{msg=Could not broadcast dimensions}(ScalarConstant{229}, TensorConstant{False})
ERROR (aesara.graph.opt): TRACEBACK:
ERROR (aesara.graph.opt): Traceback (most recent call last):

At this point I am fairly lost. I don’t know how to use the shape parameter in pm.Normal for d_max to get 229 normals, one fitted to each of the 229 series of data, while keeping a single global K_d and sigma. The difficulty seems to stem from each d_max being a single value that is estimated from a whole series of data.

So I will end by apologising again, and hope this post is interesting and clear enough to get some help/guidance.

Cheers

Scott

It seems you are doing too much indexing there. Since d in the last model is already the same shape as the observed data, you can do:

with pm.Model() as model:
    sigma = pm.Exponential('sigma', 10)
    K_d = pm.Exponential('K_d', 1) 
    d_max = pm.Normal('d_max', mu=data_229_systems[:, -1], sigma=1, shape=(data_229_systems.shape[0], 1))
    model_eq = ((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc)) / (2*pc)
    # d_max (shape = [229, 1]) * model_eq (shape= [14]) -> d (shape = [229, 14])
    d = pm.Deterministic('d', d_max * model_eq)
    
    y = pm.Normal('y', mu=d, sigma=sigma, observed=data_229_systems)

Tip: instead of treating the observations as a (229, 14) matrix, one solution is to flatten the response into a vector of shape (229*14,); that also makes indexing much easier.
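
Roughly, a sketch of that flattened layout might look like this (same lc, pc and data_229_systems as above; site_idx and obs_flat are just illustrative names):

    import numpy as np
    import pymc as pm

    n_sites, n_points = data_229_systems.shape           # (229, 14)
    site_idx = np.repeat(np.arange(n_sites), n_points)   # 0,0,...,0,1,1,...,228
    obs_flat = data_229_systems.reshape(-1)              # shape (229*14,)
    lc_flat = np.tile(lc, n_sites)                       # ligand concentration per observation
    pc_flat = np.tile(pc, n_sites)                       # protein concentration per observation

    with pm.Model() as model:
        sigma = pm.Exponential('sigma', 10)
        K_d = pm.Exponential('K_d', 1)
        d_max = pm.Normal('d_max', mu=data_229_systems[:, -1], sigma=1, shape=n_sites)
        s = pc_flat + lc_flat + K_d
        d = d_max[site_idx] * (s - pm.math.sqrt(s**2 - 4*pc_flat*lc_flat)) / (2*pc_flat)
        y = pm.Normal('y', mu=d, sigma=sigma, observed=obs_flat)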


Thanks for the feedback! I’ll try your code and suggestions. Those, and just writing the problem out, have exposed my own ignorance to me, and that’s just the first step to learning.

Cheers


The following code worked:

with pm.Model() as model:
    sigma = pm.Exponential('sigma', 100)
    K_d = pm.Exponential('K_d', 10) 
    d_max = pm.Normal('d_max', 
                      mu=data_229_systems[:, -1].reshape(229,1), 
                      sigma=0.1, 
                      shape=(data_229_systems.shape[0], 1))
    model_eq = ((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc)) / (2*pc)
    # d_max (shape = [229, 1]) * model_eq (shape= [14]) -> d (shape = [229, 14])
    d = pm.Deterministic('d', d_max * model_eq)
    y = pm.Normal('y', mu=d, sigma=sigma, observed=data_229_systems)
    trace = pm.sample(tune=1000, draws=1000, target_accept=0.99, progressbar=True) 

So I had to reshape the mu values in d_max = pm.Normal. Once I did that everything sampled fine.
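
In case it helps anyone else, here is a quick numpy-only illustration of the broadcasting that was tripping me up (just shapes, no model):

    import numpy as np

    # mu of shape (229,) against a variable of shape (229, 1) broadcasts to (229, 229),
    # which clashes with the declared shape:
    print(np.broadcast_shapes((229,), (229, 1)))    # (229, 229)

    # after the reshape the shapes line up, and (229, 1) * (14,) gives the (229, 14) matrix:
    print(np.broadcast_shapes((229, 1), (229, 1)))  # (229, 1)
    print(np.broadcast_shapes((229, 1), (14,)))     # (229, 14)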

Surprisingly, the values I get for K_d and sigma with this method are quite different from the ones I get when I iteratively add sub-models using the code I posted above… here it is again for clarity:

with pm.Model() as model:
    
    K_d = pm.Exponential('K_d', 1) # units ultimately depend on input values
    sigma_n = pm.Exponential('sigma', 10)
    
    for i, data in enumerate(data_229_systems):
        
        d_max_n = pm.Normal(f'd_max_{i}', mu=data[-1], sigma=1) # units in ppm usually     
        d_n = d_max_n*((pc+lc+K_d) - np.sqrt((pc+lc+K_d)**2 - 4*pc*lc))/(2*pc)
        y_n = pm.Normal(f'y_{i}', mu=d_n, sigma=sigma_n, observed=data)  
    
    trace = pm.sample(tune=1000, draws=1000, target_accept=0.99, progressbar=True) 

This samples much more slowly as well. Functionally, I don’t see a difference between the two approaches apart from using one vectorized normal for all the d_max values instead of many separate sub-models that share the same K_d and sigma but have different d_max values. I’m sure there is a good explanation and I will have a good think about it.
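
As a toy check of that intuition (made-up numbers, nothing to do with my data), a vectorized normal with shape=3 seems to give exactly the same log-density as three separate scalar normals, so the remaining differences should come down to the priors and the sampling speed:

    import numpy as np
    import pymc as pm

    mu = np.array([0.1, 0.2, 0.3])

    with pm.Model() as vec_model:        # one vectorized normal
        pm.Normal('x', mu=mu, sigma=1, shape=3)

    with pm.Model() as loop_model:       # three scalar normals
        for i in range(3):
            pm.Normal(f'x_{i}', mu=mu[i], sigma=1)

    point_vec = {'x': np.zeros(3)}
    point_loop = {f'x_{i}': 0.0 for i in range(3)}
    print(vec_model.compile_logp()(point_vec))    # the two log-probabilities match
    print(loop_model.compile_logp()(point_loop))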

Anyway once again thanks for the response. I guess it turns out I was just not putting all the variables together consistently and it would have taken me forever to figure that out on my own.

Cheers! I owe you a drink.

Scott


Happy to help!

I see some differences between the two models that might explain the different inference results:

  • The priors for sigma and K_d are different (Exponential(100) vs Exponential(10) for sigma, and Exponential(10) vs Exponential(1) for K_d)
  • The prior on d_max and d_max_n is different (sigma of 0.1 vs 1)

Also, some tips:

  • Instead of data_229_systems[:, -1].reshape(229,1), doing data_229_systems[:, -1:] gives you the same thing without the additional reshape (quick check below)
  • target_accept=0.99 is usually too high and actually makes your sampler much less efficient. I suggest you remove it and try the default.
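
A quick check of the slicing tip, with throwaway random numbers standing in for your matrix:

    import numpy as np

    data_229_systems = np.random.rand(229, 14)    # placeholder data, purely for illustration

    a = data_229_systems[:, -1].reshape(229, 1)   # reshape route
    b = data_229_systems[:, -1:]                  # slicing keeps the trailing axis: shape (229, 1)
    assert np.array_equal(a, b)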