Loglikelihood of posterior predictive samples

Hi everyone! I am trying to get the loglikelihood of the data generated by pm.sample_posterior_predictive under the corresponding posterior samples. E.g., I want the loglikelihood of the dataset simulated under the first trace sample, evaluated at that same first trace sample; the loglikelihood of the second simulated dataset under the second trace sample; and so on. However, the function does not behave as I would expect it to.

For instance, take the following model:

import numpy as np
import pymc3 as pm

dataset_np = np.random.sample(100)

with pm.Model() as testmodel:
    dataset = pm.Data('dataset', dataset_np)
    a = pm.Normal('a')
    pm.Normal('observed', mu=a, sigma=1, observed=dataset)
    test_trace = pm.sample(return_inferencedata=True)

What I am trying to do is find the loglikelihood directly while using sample_posterior_predictive, as follows:

with testmodel:
    # Register the model's data loglikelihood as a Deterministic so that
    # sample_posterior_predictive returns it alongside the draws.
    if 'datalogpt' not in testmodel.named_vars:
        pm.Deterministic('datalogpt', testmodel.datalogpt)
    ppt_loglik = pm.sample_posterior_predictive(
        test_trace,
        var_names=['datalogpt', 'observed'],
        random_seed=2
    )
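(For reference, ppt_loglik here is a plain dict of NumPy arrays keyed by variable name, with one row per posterior sample along the first axis:)

print(ppt_loglik['observed'].shape)   # e.g. (4000, 100) with 4 chains x 1000 draws
print(ppt_loglik['datalogpt'].shape)  # e.g. (4000,)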

Now, I can manually check the loglikelihood of the original data for the first posterior sample as follows:

testmodel.datalogpt.eval({a: test_trace.posterior.a[0,0]})

which prints out

array(-96.33916817)

And we can also check the loglikelihood of the simulated data for the first posterior sample as follows:

pm.set_data({'dataset': ppt_loglik['observed'][0]}, model=testmodel)
testmodel.datalogpt.eval({a: test_trace.posterior.a[0,0]})

which prints out

array(-146.20412808)

Finally, the loglikelihood recorded in ppt_loglik for the first posterior sample is:

ppt_loglik['datalogpt'][0]

which prints out -96.33916817158028. But this is not the loglikelihood of the simulated data for the first posterior sample; it is the loglikelihood of the actual data! (Presumably the Deterministic is evaluated against the data currently stored in the shared dataset variable, rather than against the freshly drawn 'observed' values.)

PS: It would of course be possible to get the relevant loglikelihood by iterating over the trace and running, for each index i:

pm.set_data({'dataset': ppt_loglik['observed'][i]}, model=testmodel)
testmodel.datalogpt.eval({a: test_trace.posterior.a.flatten()[i]})

but this is very slow, in contrast to pm.sample_posterior_predictive which is really fast.

Just to give more context, this is for calculating the Bayesian posterior predictive p-value (here’s one of the first papers about it, but its use has been advocated by Gelman among others). The idea is to use it as a measure of how well a posterior fits the observed data, irrespective of the other models, rather than using it as a decision criterion. Basically, we choose some statistic of the data, get a trace, get simulated data for each trace sample, and then check for each trace sample whether the statistic of the simulated data is at least as extreme as that of the actual data (under that trace sample). That is why I would like to get for each posterior sample the loglikelihood of data simulated from the sample. However, the method above with pm.set_data is so slow!
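For concreteness, here is a minimal sketch of the p-value computation I have in mind, assuming the per-draw loglikelihoods of the observed and simulated data were already available as arrays (the values below are just stand-ins):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-draw loglikelihoods, one entry per posterior sample:
#   loglik_obs[i] = log p(observed data | theta_i)
#   loglik_sim[i] = log p(data simulated under theta_i | theta_i)
loglik_obs = rng.normal(-100, 5, size=4000)  # stand-in values
loglik_sim = rng.normal(-100, 5, size=4000)  # stand-in values

# Fraction of draws where the simulated data is at least as extreme
# (here: at least as improbable) as the observed data under the same draw.
bpv = np.mean(loglik_sim <= loglik_obs)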

There very well may be a clever way to do this faster, but my intuition is that this operation is going to be inherently slow because of the need to calculate the logp of a single data set associated with a single set of model parameters. But finding an efficient solution would be useful for the reasons you mention. Maybe @ricardoV94 or @junpenglao have some suggestions for speeding things up?

If you are interested in posterior predictive checks and what is sometimes called the Bayesian p-value, you should take a look at arviz.plot_bpv and arviz.plot_loo_pit in the ArviZ documentation.
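For example, assuming idata is an InferenceData object that already has a posterior_predictive group and the observed variable is called "observed", something like:

import arviz as az

az.plot_bpv(idata, kind="p_value")    # posterior predictive p-values
az.plot_loo_pit(idata, y="observed")  # LOO-PIT check for the observed variable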

@OriolAbril Is it possible to access the raw data that goes into those plots?

For plot_loo_pit you can use loo_pit; for bpv there is no dedicated helper, but the values can be calculated in 1-2 lines, either with xarray itself or with xarray-einstats (whose introductory blog post I still have to write).

The raw data for the last example can be generated from something like:

# Median as the test statistic: compare T(observed data) to T per posterior draw.
obs_t = data.observed_data.quantile(.5)
pp_t = data.posterior_predictive.quantile(.5, dim="y_dim_0")
bpv = (pp_t >= obs_t).mean()  # fraction of draws at least as extreme

We also have arviz.apply_test_function for test functions that take a 1d array; it helps apply them with chain and draw as batch dimensions, so it works on both observed_data and posterior_predictive at the same time.
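For example, with one of the ArviZ example datasets and np.min as the test function:

import arviz as az
import numpy as np

idata = az.load_arviz_data("centered_eight")
# Applies the test function with chain and draw as batch dimensions and
# stores the result in both observed_data and posterior_predictive.
az.apply_test_function(idata, lambda y: np.min(y))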


Calculating logp is always going to be slower [citation needed] than taking random draws, which is what sample_posterior_predictive does. Iterating over the trace is not so bad; just make sure you cache the logp function. Every time you call testmodel.datalogpt.eval() it compiles the same function again.

You might want to try something like this (assuming you aren’t doing it already):

# Compile the logp function once and reuse it, instead of recompiling
# with .eval() on every call.
datalogp_fn = testmodel.fastfn(testmodel.datalogpt)
flattened_trace_numpy = test_trace.posterior.a.values.flatten()
logp = [datalogp_fn({"a": trace_value}) for trace_value in flattened_trace_numpy]

Apologies for the broken pseudo-code.

Ah thank you, it’s good to get a confirmation that I’m not missing something obvious. I have a related question about dealing with variables here, namely:

In the current version I am using testmodel.datalogpt.eval, so I can pass a dictionary with the untransformed variables as keys and their values at the point of interest as values (in the test model above there are no transformed variables, but there are some in the model I want to use this for). Therefore, I can:

  • Get the trace with trace_ = pm.util.dataset_to_point_dict(test_trace.posterior)
  • Convert the keys from variable names to actual variables. For each point index i run: {testmodel.named_vars[key]: value for key, value in trace_[i].items()}
  • Feed each point to testmodel.datalogpt.eval (a sketch follows this list).
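Put together, a minimal sketch of those three steps, reusing testmodel and test_trace from above:

trace_ = pm.util.dataset_to_point_dict(test_trace.posterior)
points = [
    {testmodel.named_vars[name]: value for name, value in point.items()}
    for point in trace_
]
# Note: .eval() recompiles the function on every call, which is the slow part.
loglik = [testmodel.datalogpt.eval(point) for point in points]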

However, if I use testmodel.fastfn(testmodel.datalogpt), the resulting function needs a dictionary with the transformed variable names as keys and the transformed values at the point of interest as values. I was wondering if there is a nice, simple way / already-implemented function to turn a dictionary of untransformed variables and their values into the corresponding transformed variables and values.

Thank you again for all the help!

Perhaps @junpenglao knows this? I am a bit out of touch with PyMC V3, so I don't know it off the top of my head.

Also a small additional problem: changing the line where a is defined to a = pm.HalfNormal('a') to get a transformed variable, and then using datalogp_fn = testmodel.fastfn(testmodel.datalogpt), raises TypeError: Unknown input or state: a. The function has 2 named inputs (a_log__, dataset). So it looks like the resulting function now also requires the observed data, in contrast to the function from the initial model, which only required a.
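For concreteness, a minimal sketch of what triggers this, reusing dataset_np from above (testmodel2 is just a fresh copy of the model):

with pm.Model() as testmodel2:
    dataset = pm.Data('dataset', dataset_np)
    a = pm.HalfNormal('a')  # sampled on the log scale as a_log__
    pm.Normal('observed', mu=a, sigma=1, observed=dataset)

datalogp_fn = testmodel2.fastfn(testmodel2.datalogpt)
datalogp_fn({'a': 1.0})
# TypeError: Unknown input or state: a. The function has 2 named inputs
# (a_log__, dataset).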