Hello,

I am attempting to fit a simple exponential-decay model to measured data, e.g. drug elimination following an administered dose, using a hierarchical model to pool reported data across studies. However, each study will report measured concentrations as either the individual level (each time/concentration pair represents one animal) or at the summary-level (at every time, there is a reported mean and standard deviation averaged over N animals for each observed concentration). Here’s an example of what the data could look like for a single dose (individual animals: top, summary data: bottom).

To account for the differences in how these data are reported, but to also allow for study-level error in the likelihood, I’ve attempted to include two likelihoods in the model fitting. For the individual-level data, I simply use a typical lognormal distribution (`pm.Lognormal`

).

For the summary-level data, I calculate a custom lognormally distributed log-likelihood (LL) where m(t_i) is the exponential-decay model with an initial dose (D) and add this calculated LL using `pm.Potential`

. Here is the custom log-normally distributed log-likelihood I’m using for data that are reported at a summary level.

m(t_i) = \frac{D}{V}\frac{k_a}{k_a-k_e}(e^{-k_et_i} - e^{k_at_i})

LL = -\frac{N}{2}ln(2\pi) - \sum_{i=1}^{G} [N_i\bar{z}_{Li} + \frac{N_i}{2}ln(\sigma_{Lij}^2) +\frac{(N_i - 1)s_{Li}^2}{2\sigma_{Lij}^2} + \frac{N_i(\bar{z}_{Lij} - ln(m(t_i)))^2}{2\sigma_{Lij}^2}]

where

\bar{z}_{Li} = log-scale summary mean at time t_i (observed)

s_{Li}^2 = log-scale variance at time t_i (observed)

\sigma_{Lij}= log-scale likelihood standard deviation for time i within study j

N_i= Number of animals comprising the reported mean at time (t_i)

N = \sum N_i across all summary-level studies

G = Total # of time points reported across all summary-level studies

To include the individual- and summary-level likelihood, I use a typical `pm.Lognormal`

for the individual-level data and then a `pm.Potential`

to ‘layer’ on summary-level log-likelihood:

```
# Likelihood for individually reported data
pm.Lognormal("indiv_like", mu=indiv_conc, sigma=sigma[sidx_indiv], observed=y_indiv)
# Likelihood for summary level data
pm.Potential("summary_like", LL)
```

Putting it all together, here is the current workflow, starting with the dataframe (`df`

) that contains all the data:

time_cor | conc_mean | conc_sd | lnconc_mean | lnconc_sd | D | study_index | indiv_index | |
---|---|---|---|---|---|---|---|---|

0 | 1 | 6.47 | 0 | 1.86718 | 0 | 0.286 | 0 | True |

1 | 1 | 6.36 | 0 | 1.85003 | 0 | 0.284 | 0 | True |

2 | 1 | 4.84 | 0 | 1.57691 | 0 | 0.283 | 0 | True |

3 | 2 | 5.01 | 0 | 1.61144 | 0 | 0.286 | 0 | True |

4 | 2 | 5.29 | 0 | 1.66582 | 0 | 0.284 | 0 | True |

… | … | … | … | … | … | … | … | … |

95 | 8.06922 | 60.6459 | 8.27174 | 4.09584 | 0.135766 | 17.6518 | 1 | False |

96 | 10.027 | 57.7451 | 6.83516 | 4.04908 | 0.117956 | 17.6518 | 1 | False |

97 | 12.2033 | 55.8628 | 7.62303 | 4.01367 | 0.135831 | 17.6518 | 1 | False |

98 | 18.075 | 46.7062 | 7.23221 | 3.83203 | 0.153929 | 17.6518 | 1 | False |

99 | 56.0399 | 20.5761 | 6.42714 | 2.97758 | 0.305118 | 17.6518 | 1 | False |

`indiv_index`

is a mask on whether or not the reported concentration is an individual animal (`True`

) or summary-level concentration (`False`

).

Using `df`

, the hierarchical model is constructed:

```
indiv_index = df.indiv_index.values # Mask on whether or not concentration is individual or summary
N = np.sum(~indiv_index)*4 # 4 animals per summary-level concentration (Total N)
Ni = np.array([4]*np.sum(~indiv_index)) # 4 animals per summary time-point
n_studies = len(df.study_index.unique())
with pm.Model() as model:
t = pm.Data("t", df.time_cor.values, mutable=True)
D = pm.Data("D", df.D.values, mutable=True)
sidx = pm.Data("sidx", df.study_index.values, mutable=True)
# Parse the observed data from df
#--------------------------------------------
# Individual data (Use reported concentrations)
y_indiv = pm.Data("y_indiv", df.conc_mean.values[indiv_index], mutable=True)
sidx_indiv = sidx[indiv_index]
# Summary data (Use log-transformed concentrations and standard deviations)
y_summary = pm.Data("y_summary", df.lnconc_mean.values[~indiv_index], mutable=True)
sidx_summary = sidx[~indiv_index]
sd_summary = pm.Data("sd_summary", df.lnconc_sd[~indiv_index], mutable=True)
#--------------------------------------------
# Population mean
mu_lnke = pm.Uniform('mu_lnke', -8, 3)
mu_lnV = pm.Uniform('mu_lnV', -8, 3, testval=0)
mu_lnka = pm.Uniform('mu_lnka', -8, 3, testval=1)
# Population Variance
sigma_lnka = pm.HalfCauchy('sigma_lnka', 1.)
sigma_lnV = pm.HalfCauchy('sigma_lnV', 1.)
sigma_lnke = pm.HalfCauchy('sigma_lnke', 1.)
# Reparamaterization
lnka_offset = pm.Normal('lnka_offset', mu=0, sigma=1, shape=(n_studies,))
lnka = pm.Deterministic('lnka', mu_lnka + sigma_lnka*lnka_offset)
lnV_offset = pm.Normal('lnV_offset', mu=0, sigma=1, shape=(n_studies,))
lnV = pm.Deterministic('lnV', mu_lnV + sigma_lnV*lnV_offset)
lnke_offset = pm.Normal('lnke_offset', mu=0, sigma=1, shape=(n_studies,))
lnke = pm.Deterministic('lnke', mu_lnke + sigma_lnke*lnke_offset)
# Exponentiate parameters
ka = pm.Deterministic('ka', pm.math.exp(lnka))
V = pm.Deterministic('V', pm.math.exp(lnV))
ke = pm.Deterministic('ke', pm.math.exp(lnke))
# Predict concentrations at every reported time
model_conc = pm.Deterministic('model_conc', (D/V[sidx])*(ka[sidx]/(ka[sidx]-ke[sidx]))*(pm.math.exp(-ke[sidx]*t) - pm.math.exp(-ka[sidx]*t)))
indiv_conc = pm.Deterministic('indiv_conc', model_conc[indiv_index]) # Use mask to get predicted individual-level
summary_conc = pm.Deterministic('summary_conc', model_conc[~indiv_index]) # Use mask to get predicted summary-level
# Likelihood error for log-normal distribution (different for each study)
sigma = pm.HalfCauchy('sigma', 1, shape=(n_studies,))
# Likelihood for individual data
pm.Lognormal('indiv_like', mu=indiv_conc, sigma=sigma[sidx_indiv], observed=y_indiv)
# Likelihood for summary data
sigma_summary = sigma[sidx_summary]
term1 = (Ni/2.)*pm.math.log(sigma_summary**2)
term2 = ((Ni -1)*sd_summary**2)/(2*sigma_summary**2)
term3 = (Ni*(y_summary - summary_conc)**2)/(2*sigma_summary**2)
LL = (N/2)*pm.math.log(2*np.pi) - pm.math.sum((term1 + term2 + term3))
pm.Potential("summary_like", LL)
idata = pm.sample(draws=3000, tune=1000, target_accept=0.95)
```

This appears to work and I’m able to fit `ka`

, `V`

, and `ke`

and get reasonable fits to the original data.

My questions are:

- Is this the appropriate way to use
`pm.Potential`

to add a custom log-likelihood or am I sacrificing functionality by not using something like`pm.DensityDist`

to build my own custom distribution for the summary-level likelihood? - Is this a reasonable way to mix individual- and summary-level data for fitting or is there a more “pymc-type” way of handling these different types of observed data?

Thanks as always to the PyMC community’s help with these questions answered. I’ve been piecing together the use of `pm.Potential`

from several discourse threads, but apologies if this type of likelihood mixing has been discussed somewhere and I missed it.