Outputting loglikelihood of each parameter set

I am currently using a custom model implemented with DensityDist. However, I am running into an issue at the end of sampling, when the results are saved to an ArviZ InferenceData object. PyMC appears to re-calculate the loglikelihood of every single parameter set as it saves to the ArviZ object, which takes a significant amount of time. My custom likelihood function takes on the order of 1 ms to run, so having to recalculate the loglikelihood of all my data takes a prohibitive amount of time.

Does anyone know a workaround that will cause PyMC to save the loglikelihood as it is sampling, rather than recalculating everything at the end?

You can try passing log_likelihood=False to pm.sample to suppress that computation.
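For reference, a minimal sketch of how that option is passed in recent PyMC3 releases (this assumes idata_kwargs is available in your version, and uses a throwaway toy model just to show the call):

```python
import numpy as np
import pymc3 as pm

data = np.random.normal(size=100)

with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)
    pm.Normal("obs", mu=mu, sigma=1.0, observed=data)

    # Skip the pointwise log-likelihood computation when the trace is
    # converted to an ArviZ InferenceData object at the end of sampling.
    idata = pm.sample(
        return_inferencedata=True,
        idata_kwargs={"log_likelihood": False},
    )
```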

The problem is that I need the loglikelihood saved as it is calculated for further data analysis. Ideally, I would like it done while the computation is occurring since it takes multiple minutes to do afterwards.

PyMC computes the element-wise log likelihood (per observation) at the end because this is generally needed for model comparison.

This quantity is not computed at all during NUTS sampling, where only a combined scalar, the total log likelihood plus the log prior (and its gradient), is computed for all the parameters at once.
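Spelling that out as a worked formula (generic notation, with theta the parameters and y_1, ..., y_n the observations; this is just a restatement of the two quantities discussed above):

```latex
\[
  \underbrace{\log p(\theta, y)}_{\text{scalar used by the samplers}}
    = \underbrace{\sum_{i=1}^{n} \log p(y_i \mid \theta)}_{\text{total log likelihood}}
    + \underbrace{\log p(\theta)}_{\text{log prior}}
  \qquad\text{vs.}\qquad
  \underbrace{\log p(y_i \mid \theta),\; i = 1, \dots, n}_{\text{pointwise log likelihood, per draw and observation}}
\]
```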

Ah, I see. I'm using DEMetropolisZ, where the logp is calculated for each sample anyway. Do you know if there is a way around this? Maybe using pm.Deterministic, so that I can feed the calculated value into a Potential or DensityDist?

I am not sure it can be done. You would need to wrap it in a Deterministic, because you don't want it to affect sampling (a Potential or DensityDist would do that). Are you on V3 or V4?

You could try adding a

pm.Deterministic("loglike", obs.logp_elemwiset)

But even if that worked, I am not sure it would save you time. I think the computation would be duplicated, not reused. Your best chance may be to hack the sampler to make it save the loglikelihood as a sampling stat.
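For concreteness, a minimal sketch of that Deterministic idea in PyMC3 (the model, the placeholder likelihood, and the name "loglike" are all illustrative; as noted above, the value will likely just be recomputed when each draw is recorded rather than reused):

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

data = np.random.normal(size=100)

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0)

    # Stand-in for the custom likelihood (elementwise normal logp).
    def loglike(value):
        return -0.5 * tt.sqr(value - mu) - 0.5 * np.log(2 * np.pi)

    obs = pm.DensityDist("obs", loglike, observed=data)

    # Save the per-observation log likelihood of obs with every draw.
    pm.Deterministic("loglike", obs.logp_elemwiset)
```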

Yeah, that's the problem I've been running into. Whenever I use a Deterministic, it just re-calculates the logp, and I haven't been able to figure out whether there is a way to calculate the Deterministic and then use the value it outputs as an input to either a Potential or a DensityDist.

I’m still using V3.

Why do you want to do that?

For the problems I work on, I'm commonly fitting parameters to 50-100 individual data sets for multiple different models. To compare them, I might use something like a DIC calculation, which requires the loglikelihood. Since my loglikelihood function takes ~1 ms to run, I need to use a fast sampler like DEMetropolisZ, which requires ~50,000 samples to get good results back.

The end result is that it already takes minutes just to compute my loglikelihood during parameter inference. Since PyMC isn't saving the loglikelihood during parameter inference, I have to recalculate it later, which can take hours for all the data I'm working with. If I could save the loglikelihood values while it's sampling (without calculating everything twice, as the Deterministic does), the problem would go away.

The double calculation is not because of the Deterministic; it's because the sampler does not know about it. It has nothing to do with the quantity not being in a Potential or DensityDist.

As I was saying, your best chance is to hack the sampler if you need it to save that result.

Depending on how the sampler partitions the graph, the Deterministic could actually reuse the computation, but I can't say for sure without looking at the Theano graph and at what the sampler is doing.
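For anyone who wants to attempt the sampler hack mentioned above, here is a rough, untested sketch of one way it could look in PyMC3: subclass DEMetropolisZ, compile a function for the pointwise log likelihood of the observed variable, and record it as an extra sampler stat. The class name, the stat name "loglike", and the assumption of a single observed variable are all illustrative, and the pointwise likelihood is still evaluated once per draw; it is just done during sampling instead of in one large pass at the end.

```python
import pymc3 as pm


class DEMetropolisZWithLogLike(pm.DEMetropolisZ):
    """DEMetropolisZ that also records the pointwise log likelihood
    of a single observed variable as a sampler stat (sketch)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        model = pm.modelcontext(kwargs.get("model", None))
        obs = model.observed_RVs[0]  # assumes one observed variable
        # Compiled function: point dict -> per-observation log likelihood.
        self._loglike_fn = model.fn(obs.logp_elemwiset)
        # Register the extra stat so the trace backend allocates space for it.
        self.stats_dtypes = [dict(self.stats_dtypes[0], loglike=object)]

    def step(self, point):
        new_point, stats = super().step(point)
        # Evaluate the pointwise log likelihood at the current point and
        # attach it to this draw's sampler stats.
        stats[0]["loglike"] = self._loglike_fn(new_point)
        return new_point, stats


# Usage sketch (keeping the plain MultiTrace avoids the InferenceData
# conversion step entirely):
#
# with model:
#     step = DEMetropolisZWithLogLike()
#     trace = pm.sample(50_000, step=step, return_inferencedata=False)
# loglike_per_draw = trace.get_sampler_stats("loglike")
```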

To try to clarify the remarks made by Ricardo above:

All samplers compute the logp, but as far as I know, none of them computes the pointwise log likelihood (called logp_elemwiset internally by PyMC), which is what is needed for model comparison with information criteria like WAIC or LOO cross-validation.

The logp is the sum of the total log likelihood and the log prior. The pointwise log likelihood is the log likelihood of a single observation, computed for each sample and each observation.
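To make the distinction concrete, here is a small PyMC3 sketch (toy model with illustrative names) comparing the scalar the samplers use with the pointwise quantity:

```python
import numpy as np
import pymc3 as pm

data = np.random.normal(size=5)

with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0)
    obs = pm.Normal("obs", mu=mu, sigma=1.0, observed=data)

point = model.test_point  # e.g. {"mu": array(0.)}

# Scalar used by the samplers: total log likelihood + log prior.
print(model.fn(model.logpt)(point))

# Pointwise log likelihood needed for WAIC/LOO: one value per observation.
print(model.fn(obs.logp_elemwiset)(point))
```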
