Combined likelihoods for fitting observed individual- and summary-level data


I am attempting to fit a simple exponential-decay model to measured data, e.g. drug elimination following an administered dose, using a hierarchical model to pool reported data across studies. However, each study will report measured concentrations as either the individual level (each time/concentration pair represents one animal) or at the summary-level (at every time, there is a reported mean and standard deviation averaged over N animals for each observed concentration). Here’s an example of what the data could look like for a single dose (individual animals: top, summary data: bottom).


To account for the differences in how these data are reported, but to also allow for study-level error in the likelihood, I’ve attempted to include two likelihoods in the model fitting. For the individual-level data, I simply use a typical lognormal distribution (pm.Lognormal).

For the summary-level data, I calculate a custom lognormally distributed log-likelihood (LL) where m(t_i) is the exponential-decay model with an initial dose (D) and add this calculated LL using pm.Potential. Here is the custom log-normally distributed log-likelihood I’m using for data that are reported at a summary level.

m(t_i) = \frac{D}{V}\frac{k_a}{k_a-k_e}(e^{-k_et_i} - e^{k_at_i})
LL = -\frac{N}{2}ln(2\pi) - \sum_{i=1}^{G} [N_i\bar{z}_{Li} + \frac{N_i}{2}ln(\sigma_{Lij}^2) +\frac{(N_i - 1)s_{Li}^2}{2\sigma_{Lij}^2} + \frac{N_i(\bar{z}_{Lij} - ln(m(t_i)))^2}{2\sigma_{Lij}^2}]

\bar{z}_{Li} = log-scale summary mean at time t_i (observed)
s_{Li}^2 = log-scale variance at time t_i (observed)
\sigma_{Lij}= log-scale likelihood standard deviation for time i within study j
N_i= Number of animals comprising the reported mean at time (t_i)
N = \sum N_i across all summary-level studies
G = Total # of time points reported across all summary-level studies

To include the individual- and summary-level likelihood, I use a typical pm.Lognormal for the individual-level data and then a pm.Potential to ‘layer’ on summary-level log-likelihood:

# Likelihood for individually reported data
pm.Lognormal("indiv_like", mu=indiv_conc, sigma=sigma[sidx_indiv], observed=y_indiv)

# Likelihood for summary level data
pm.Potential("summary_like", LL)

Putting it all together, here is the current workflow, starting with the dataframe (df) that contains all the data:

time_cor conc_mean conc_sd lnconc_mean lnconc_sd D study_index indiv_index
0 1 6.47 0 1.86718 0 0.286 0 True
1 1 6.36 0 1.85003 0 0.284 0 True
2 1 4.84 0 1.57691 0 0.283 0 True
3 2 5.01 0 1.61144 0 0.286 0 True
4 2 5.29 0 1.66582 0 0.284 0 True
95 8.06922 60.6459 8.27174 4.09584 0.135766 17.6518 1 False
96 10.027 57.7451 6.83516 4.04908 0.117956 17.6518 1 False
97 12.2033 55.8628 7.62303 4.01367 0.135831 17.6518 1 False
98 18.075 46.7062 7.23221 3.83203 0.153929 17.6518 1 False
99 56.0399 20.5761 6.42714 2.97758 0.305118 17.6518 1 False

indiv_index is a mask on whether or not the reported concentration is an individual animal (True) or summary-level concentration (False).

Using df, the hierarchical model is constructed:

indiv_index = df.indiv_index.values # Mask on whether or not concentration is individual or summary
N = np.sum(~indiv_index)*4 # 4 animals per summary-level concentration (Total N)
Ni = np.array([4]*np.sum(~indiv_index)) # 4 animals per summary time-point

n_studies = len(df.study_index.unique())
with pm.Model() as model:
    t = pm.Data("t", df.time_cor.values, mutable=True)
    D = pm.Data("D", df.D.values, mutable=True)
    sidx = pm.Data("sidx", df.study_index.values, mutable=True)
    # Parse the observed data from df
    # Individual data (Use reported concentrations)
    y_indiv = pm.Data("y_indiv", df.conc_mean.values[indiv_index], mutable=True)
    sidx_indiv = sidx[indiv_index]
    # Summary data (Use log-transformed concentrations and standard deviations)
    y_summary = pm.Data("y_summary", df.lnconc_mean.values[~indiv_index], mutable=True)
    sidx_summary = sidx[~indiv_index]
    sd_summary = pm.Data("sd_summary", df.lnconc_sd[~indiv_index], mutable=True)
    # Population mean
    mu_lnke = pm.Uniform('mu_lnke', -8, 3)
    mu_lnV = pm.Uniform('mu_lnV', -8, 3, testval=0)
    mu_lnka = pm.Uniform('mu_lnka', -8, 3, testval=1)

    # Population Variance
    sigma_lnka = pm.HalfCauchy('sigma_lnka', 1.)
    sigma_lnV = pm.HalfCauchy('sigma_lnV', 1.)
    sigma_lnke = pm.HalfCauchy('sigma_lnke', 1.)

    # Reparamaterization
    lnka_offset = pm.Normal('lnka_offset', mu=0, sigma=1, shape=(n_studies,))
    lnka = pm.Deterministic('lnka', mu_lnka + sigma_lnka*lnka_offset)
    lnV_offset = pm.Normal('lnV_offset', mu=0, sigma=1, shape=(n_studies,))
    lnV = pm.Deterministic('lnV', mu_lnV + sigma_lnV*lnV_offset)
    lnke_offset = pm.Normal('lnke_offset', mu=0, sigma=1, shape=(n_studies,))
    lnke = pm.Deterministic('lnke', mu_lnke + sigma_lnke*lnke_offset)

    # Exponentiate parameters
    ka = pm.Deterministic('ka', pm.math.exp(lnka))
    V = pm.Deterministic('V', pm.math.exp(lnV))
    ke = pm.Deterministic('ke', pm.math.exp(lnke))

    # Predict concentrations at every reported time
    model_conc = pm.Deterministic('model_conc', (D/V[sidx])*(ka[sidx]/(ka[sidx]-ke[sidx]))*(pm.math.exp(-ke[sidx]*t) - pm.math.exp(-ka[sidx]*t)))

    indiv_conc = pm.Deterministic('indiv_conc', model_conc[indiv_index]) # Use mask to get predicted individual-level
    summary_conc = pm.Deterministic('summary_conc', model_conc[~indiv_index]) # Use mask to get predicted summary-level

    # Likelihood error for log-normal distribution (different for each study)
    sigma = pm.HalfCauchy('sigma', 1, shape=(n_studies,))

    # Likelihood for individual data
    pm.Lognormal('indiv_like', mu=indiv_conc, sigma=sigma[sidx_indiv], observed=y_indiv)

    # Likelihood for summary data
    sigma_summary = sigma[sidx_summary]

    term1 = (Ni/2.)*pm.math.log(sigma_summary**2)
    term2 = ((Ni -1)*sd_summary**2)/(2*sigma_summary**2)
    term3 = (Ni*(y_summary - summary_conc)**2)/(2*sigma_summary**2)
    LL = (N/2)*pm.math.log(2*np.pi) - pm.math.sum((term1 + term2 + term3))
    pm.Potential("summary_like", LL)
    idata = pm.sample(draws=3000, tune=1000, target_accept=0.95)

This appears to work and I’m able to fit ka, V, and ke and get reasonable fits to the original data.

My questions are:

  1. Is this the appropriate way to use pm.Potential to add a custom log-likelihood or am I sacrificing functionality by not using something like pm.DensityDist to build my own custom distribution for the summary-level likelihood?
  2. Is this a reasonable way to mix individual- and summary-level data for fitting or is there a more “pymc-type” way of handling these different types of observed data?

Thanks as always to the PyMC community’s help with these questions answered. I’ve been piecing together the use of pm.Potential from several discourse threads, but apologies if this type of likelihood mixing has been discussed somewhere and I missed it.

1 Like

Late reply, but this looks like a really nice way to attack the problem! I see nothing wrong with the way you’ve laid it out and using Potential to make sure you get the weighting from the N counts in the summary data is about as clean a solution as I can think of, personally.

Thank you so much for the reply @ckrapu. I was having some difficulty distilling down the actual inputs for pm.Potential, but it does appear it’s a useful way for tacking on a log-likelihood from a custom likelihood function. That’s good to know going forward if I ever need a custom likelihood function. Thanks again for the confirmation.