Hello,
I am attempting to fit a simple exponential-decay model to measured data, e.g. drug elimination following an administered dose, using a hierarchical model to pool reported data across studies. However, each study reports measured concentrations either at the individual level (each time/concentration pair represents one animal) or at the summary level (at every time, there is a reported mean and standard deviation over the N animals measured at that time). Here’s an example of what the data could look like for a single dose (individual animals: top, summary data: bottom).
To account for the differences in how these data are reported, while still allowing for study-level error in the likelihood, I’ve included two likelihoods in the model fitting. For the individual-level data, I simply use a typical lognormal distribution (`pm.Lognormal`). For the summary-level data, I calculate a custom lognormally distributed log-likelihood (LL), where m(t_i) is the exponential-decay model with an initial dose D, and add this calculated LL using `pm.Potential`. Here is the custom lognormal log-likelihood I’m using for data reported at the summary level:
m(t_i) = \frac{D}{V}\frac{k_a}{k_a-k_e}\left(e^{-k_e t_i} - e^{-k_a t_i}\right)
LL = -\frac{N}{2}\ln(2\pi) - \sum_{i=1}^{G}\left[N_i\bar{z}_{Li} + \frac{N_i}{2}\ln(\sigma_{Lj}^2) + \frac{(N_i - 1)s_{Li}^2}{2\sigma_{Lj}^2} + \frac{N_i\left(\bar{z}_{Li} - \ln m(t_i)\right)^2}{2\sigma_{Lj}^2}\right]
where
\bar{z}_{Li} = log-scale summary mean at time t_i (observed)
s_{Li}^2 = log-scale sample variance at time t_i (observed)
\sigma_{Lj} = log-scale likelihood standard deviation for study j (the study reporting time point i)
N_i = number of animals comprising the reported mean at time t_i
N = \sum_i N_i across all summary-level studies
G = total number of time points reported across all summary-level studies
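For anyone checking the algebra: the bracketed terms come from collapsing the per-animal lognormal log-likelihood onto the sufficient statistics \bar{z}_{Li} and s_{Li}^2. Writing z_k = \ln y_k for each animal’s concentration and \mu_i = \ln m(t_i), the key step is the standard sum-of-squares decomposition
\sum_{k=1}^{N_i}(z_k - \mu_i)^2 = (N_i - 1)s_{Li}^2 + N_i\left(\bar{z}_{Li} - \mu_i\right)^2
and the leading N_i\bar{z}_{Li} term is the lognormal Jacobian \sum_k \ln y_k, which does not depend on the model parameters.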
To include both the individual- and summary-level likelihoods, I use a typical `pm.Lognormal` for the individual-level data and then a `pm.Potential` to ‘layer’ on the summary-level log-likelihood:
```python
# Likelihood for individually reported data (mu is the log of the predicted concentration)
pm.Lognormal("indiv_like", mu=pm.math.log(indiv_conc), sigma=sigma[sidx_indiv], observed=y_indiv)
# Likelihood for summary-level data, added as a Potential
pm.Potential("summary_like", LL)
```
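As a sanity check that the collapsed form matches a per-observation lognormal likelihood, here is a small standalone verification script I used on simulated raw data (not part of the model itself):

```python
# Verify: the collapsed summary-level LL equals the per-observation
# lognormal log-likelihood evaluated on the raw data it summarizes.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m_ti, sigma_L, N_i = 5.0, 0.3, 4  # model prediction, log-scale SD, animals per point
y = rng.lognormal(mean=np.log(m_ti), sigma=sigma_L, size=N_i)  # simulated animals

# Full log-likelihood: sum the lognormal logpdf over each animal
ll_full = stats.lognorm.logpdf(y, s=sigma_L, scale=m_ti).sum()

# Collapsed log-likelihood from the summary statistics only
zbar, s2 = np.log(y).mean(), np.log(y).var(ddof=1)
ll_summary = (-N_i / 2 * np.log(2 * np.pi)
              - N_i * zbar
              - N_i / 2 * np.log(sigma_L**2)
              - (N_i - 1) * s2 / (2 * sigma_L**2)
              - N_i * (zbar - np.log(m_ti))**2 / (2 * sigma_L**2))

assert np.isclose(ll_full, ll_summary)
```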
Putting it all together, here is the current workflow, starting with the dataframe (`df`) that contains all the data:
|    | time_cor | conc_mean | conc_sd | lnconc_mean | lnconc_sd | D | study_index | indiv_index |
|----|----------|-----------|---------|-------------|-----------|---|-------------|-------------|
| 0  | 1        | 6.47      | 0       | 1.86718     | 0         | 0.286   | 0 | True  |
| 1  | 1        | 6.36      | 0       | 1.85003     | 0         | 0.284   | 0 | True  |
| 2  | 1        | 4.84      | 0       | 1.57691     | 0         | 0.283   | 0 | True  |
| 3  | 2        | 5.01      | 0       | 1.61144     | 0         | 0.286   | 0 | True  |
| 4  | 2        | 5.29      | 0       | 1.66582     | 0         | 0.284   | 0 | True  |
| …  | …        | …         | …       | …           | …         | …       | … | …     |
| 95 | 8.06922  | 60.6459   | 8.27174 | 4.09584     | 0.135766  | 17.6518 | 1 | False |
| 96 | 10.027   | 57.7451   | 6.83516 | 4.04908     | 0.117956  | 17.6518 | 1 | False |
| 97 | 12.2033  | 55.8628   | 7.62303 | 4.01367     | 0.135831  | 17.6518 | 1 | False |
| 98 | 18.075   | 46.7062   | 7.23221 | 3.83203     | 0.153929  | 17.6518 | 1 | False |
| 99 | 56.0399  | 20.5761   | 6.42714 | 2.97758     | 0.305118  | 17.6518 | 1 | False |
`indiv_index` is a mask indicating whether the reported concentration is from an individual animal (`True`) or is a summary-level concentration (`False`).
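As a side note on how the log-scale columns were obtained: for the summary rows, `lnconc_mean` and `lnconc_sd` are consistent with the lognormal method-of-moments transformation of the reported natural-scale mean and SD. A minimal helper sketch (my own naming):

```python
import numpy as np

def natural_to_log_summary(mean, sd):
    """Convert a reported natural-scale mean/SD to log-scale summary statistics,
    assuming the underlying per-animal concentrations are lognormal."""
    cv2 = (sd / mean) ** 2               # squared coefficient of variation
    ln_var = np.log1p(cv2)               # log-scale variance: ln(1 + cv^2)
    ln_mean = np.log(mean) - ln_var / 2  # log-scale mean
    return ln_mean, np.sqrt(ln_var)

# e.g. row 95 above: (60.6459, 8.27174) -> (4.09584, 0.135766)
```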
Using `df`, the hierarchical model is constructed:
```python
import numpy as np
import pymc as pm

indiv_index = df.indiv_index.values        # Mask: individual (True) vs summary (False)
N = np.sum(~indiv_index) * 4               # 4 animals per summary-level concentration (total N)
Ni = np.array([4] * np.sum(~indiv_index))  # 4 animals per summary time point
n_studies = len(df.study_index.unique())

with pm.Model() as model:
    t = pm.Data("t", df.time_cor.values, mutable=True)
    D = pm.Data("D", df.D.values, mutable=True)
    sidx = pm.Data("sidx", df.study_index.values, mutable=True)

    # Parse the observed data from df
    # --------------------------------------------
    # Individual data (use reported concentrations)
    y_indiv = pm.Data("y_indiv", df.conc_mean.values[indiv_index], mutable=True)
    sidx_indiv = sidx[indiv_index]
    # Summary data (use log-transformed concentrations and standard deviations)
    y_summary = pm.Data("y_summary", df.lnconc_mean.values[~indiv_index], mutable=True)
    sidx_summary = sidx[~indiv_index]
    sd_summary = pm.Data("sd_summary", df.lnconc_sd.values[~indiv_index], mutable=True)
    # --------------------------------------------

    # Population means
    mu_lnke = pm.Uniform("mu_lnke", -8, 3)
    mu_lnV = pm.Uniform("mu_lnV", -8, 3, initval=0)
    mu_lnka = pm.Uniform("mu_lnka", -8, 3, initval=1)
    # Population variances
    sigma_lnka = pm.HalfCauchy("sigma_lnka", 1.0)
    sigma_lnV = pm.HalfCauchy("sigma_lnV", 1.0)
    sigma_lnke = pm.HalfCauchy("sigma_lnke", 1.0)

    # Non-centered reparameterization of the study-level parameters
    lnka_offset = pm.Normal("lnka_offset", mu=0, sigma=1, shape=(n_studies,))
    lnka = pm.Deterministic("lnka", mu_lnka + sigma_lnka * lnka_offset)
    lnV_offset = pm.Normal("lnV_offset", mu=0, sigma=1, shape=(n_studies,))
    lnV = pm.Deterministic("lnV", mu_lnV + sigma_lnV * lnV_offset)
    lnke_offset = pm.Normal("lnke_offset", mu=0, sigma=1, shape=(n_studies,))
    lnke = pm.Deterministic("lnke", mu_lnke + sigma_lnke * lnke_offset)

    # Exponentiate parameters
    ka = pm.Deterministic("ka", pm.math.exp(lnka))
    V = pm.Deterministic("V", pm.math.exp(lnV))
    ke = pm.Deterministic("ke", pm.math.exp(lnke))

    # Predict concentrations at every reported time
    model_conc = pm.Deterministic(
        "model_conc",
        (D / V[sidx])
        * (ka[sidx] / (ka[sidx] - ke[sidx]))
        * (pm.math.exp(-ke[sidx] * t) - pm.math.exp(-ka[sidx] * t)),
    )
    indiv_conc = pm.Deterministic("indiv_conc", model_conc[indiv_index])       # predicted individual-level
    summary_conc = pm.Deterministic("summary_conc", model_conc[~indiv_index])  # predicted summary-level

    # Likelihood error for the log-normal distribution (one per study)
    sigma = pm.HalfCauchy("sigma", 1, shape=(n_studies,))

    # Likelihood for individual data (mu is the log of the predicted concentration)
    pm.Lognormal("indiv_like", mu=pm.math.log(indiv_conc), sigma=sigma[sidx_indiv], observed=y_indiv)

    # Likelihood for summary data: the bracketed terms of the collapsed LL above.
    # The model prediction enters on the log scale to match ln(m(t_i)); the constant
    # Jacobian term sum(N_i * zbar_i) is omitted since it does not depend on the parameters.
    sigma_summary = sigma[sidx_summary]
    term1 = (Ni / 2.0) * pm.math.log(sigma_summary**2)
    term2 = ((Ni - 1) * sd_summary**2) / (2 * sigma_summary**2)
    term3 = (Ni * (y_summary - pm.math.log(summary_conc)) ** 2) / (2 * sigma_summary**2)
    LL = -(N / 2) * pm.math.log(2 * np.pi) - pm.math.sum(term1 + term2 + term3)
    pm.Potential("summary_like", LL)

    idata = pm.sample(draws=3000, tune=1000, target_accept=0.95)
```
This appears to work: I’m able to fit `ka`, `V`, and `ke` and get reasonable fits to the original data.
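(By “reasonable” I mean a quick diagnostic pass with ArviZ looks sensible, along these lines:)

```python
import arviz as az

# Convergence diagnostics and posterior summaries for the population-level parameters
print(az.summary(idata, var_names=["mu_lnka", "mu_lnV", "mu_lnke", "sigma"]))
az.plot_trace(idata, var_names=["mu_lnka", "mu_lnV", "mu_lnke"])
```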
My questions are:

- Is this the appropriate way to use `pm.Potential` to add a custom log-likelihood, or am I sacrificing functionality by not using something like `pm.DensityDist` to build my own custom distribution for the summary-level likelihood? (Rough sketch of what I mean below.)
- Is this a reasonable way to mix individual- and summary-level data for fitting, or is there a more “PyMC-style” way of handling these different types of observed data?
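For context on the first question, this is roughly the custom-distribution alternative I have in mind but have not tested (`pm.DensityDist` in PyMC v4 became `pm.CustomDist` in v5; `summary_logp` and its arguments are my own sketch):

```python
# Untested sketch: the same collapsed likelihood as a CustomDist instead of a Potential.
def summary_logp(value, mu_log, s, n, sigma):
    # value: observed log-scale mean zbar
    # mu_log: ln(m(t_i)), the model prediction on the log scale
    # s: observed log-scale SD (passed in as data, not a random variable)
    # n: animals per summary point; sigma: study-level log-scale SD
    return (-(n / 2) * pm.math.log(2 * np.pi)
            - n * value
            - (n / 2) * pm.math.log(sigma**2)
            - (n - 1) * s**2 / (2 * sigma**2)
            - n * (value - mu_log) ** 2 / (2 * sigma**2))

pm.CustomDist("summary_like",
              pm.math.log(summary_conc), sd_summary, Ni, sigma_summary,
              logp=summary_logp, observed=y_summary)
```

This would replace the `pm.Potential` line inside the model block; as far as I understand, one benefit would be that the summary likelihood then shows up in log-likelihood-based utilities (e.g. model comparison), which `Potential` terms do not.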
Thanks as always to the PyMC community for help with these questions. I’ve been piecing together the use of `pm.Potential` from several Discourse threads, but apologies if this type of likelihood mixing has been discussed somewhere and I missed it.