Loglikelihood of posterior predictive samples

Hi everyone! I am trying to get the loglikelihood of the data generated by pm.sample_posterior_predictive under the corresponding posterior samples. E.g., I want the loglikelihood of the dataset simulated under the first trace sample, evaluated at that same first trace sample; the loglikelihood of the second simulated dataset under the second trace sample; and so on. However, the function does not behave as I would expect it to.

For instance, take the following model:

import numpy as np
import pymc3 as pm

dataset_np = np.random.sample(100)

with pm.Model() as testmodel:
    dataset = pm.Data('dataset', dataset_np)
    a = pm.Normal('a')
    pm.Normal('observed', mu=a, sigma=1, observed=dataset)
    test_trace = pm.sample(return_inferencedata=True)

What I am trying to do is find the loglikelihood directly while using sample_posterior_predictive, as follows:

with testmodel:
    # Register the model's data loglikelihood as a Deterministic so that
    # sample_posterior_predictive returns it alongside the draws.
    if 'datalogpt' not in testmodel.named_vars:
        pm.Deterministic('datalogpt', testmodel.datalogpt)
    ppt_loglik = pm.sample_posterior_predictive(
        test_trace,
        var_names=['datalogpt', 'observed'],
        random_seed=2
    )
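(For reference, ppt_loglik here is a plain dict of NumPy arrays keyed by variable name, with one row per posterior sample along the first axis:)

print(ppt_loglik['observed'].shape)   # e.g. (4000, 100) with 4 chains x 1000 draws
print(ppt_loglik['datalogpt'].shape)  # e.g. (4000,)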

Now, I can manually check the loglikelihood of the original data for the first posterior sample as follows:

testmodel.datalogpt.eval({a: test_trace.posterior.a[0,0]})

which prints out

array(-96.33916817)

And we can also check the loglikelihood of the simulated data for the first posterior sample as follows:

pm.set_data({'dataset': ppt_loglik['observed'][0]}, model=testmodel)
testmodel.datalogpt.eval({a: test_trace.posterior.a[0,0]})

which prints out

array(-146.20412808)

Finally, the loglikelihood recorded in ppt_loglik for the first posterior sample is:

ppt_loglik['datalogpt'][0]

which prints out -96.33916817158028. But this is not the loglikelihood of the simulated data for the first posterior sample; it is the loglikelihood of the actual data! (Presumably the Deterministic is evaluated against the data currently stored in the shared dataset variable, rather than against the freshly drawn 'observed' values.)

PS: It would of course be possible to get the relevant loglikelihood by iterating over the trace and running, for each index i:

pm.set_data({'dataset': ppt_loglik['observed'][i]}, model=testmodel)
testmodel.datalogpt.eval({a: test_trace.posterior.a.flatten()[i]})

but this is very slow, in contrast to pm.sample_posterior_predictive which is really fast.

Just to give more context, this is for calculating the Bayesian posterior predictive p-value (here’s one of the first papers about it, but its use has been advocated by Gelman among others). The idea is to use it as a measure of how well a posterior fits the observed data, irrespective of the other models, rather than using it as a decision criterion. Basically, we choose some statistic of the data, get a trace, get simulated data for each trace sample, and then check for each trace sample whether the statistic of the simulated data is at least as extreme as that of the actual data (under that trace sample). That is why I would like to get for each posterior sample the loglikelihood of data simulated from the sample. However, the method above with pm.set_data is so slow!
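For concreteness, here is a minimal sketch of the p-value computation I have in mind, assuming the per-draw loglikelihoods of the observed and simulated data were already available as arrays (the values below are just stand-ins):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-draw loglikelihoods, one entry per posterior sample:
#   loglik_obs[i] = log p(observed data | theta_i)
#   loglik_sim[i] = log p(data simulated under theta_i | theta_i)
loglik_obs = rng.normal(-100, 5, size=4000)  # stand-in values
loglik_sim = rng.normal(-100, 5, size=4000)  # stand-in values

# Fraction of draws where the simulated data is at least as extreme
# (here: at least as improbable) as the observed data under the same draw.
bpv = np.mean(loglik_sim <= loglik_obs)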

There very well may be a clever way to do this faster, but my intuition is that this operation is going to be inherently slow because of the need to calculate the logp of a single data set associated with a single set of model parameters. But finding an efficient solution would be useful for the reasons you mention. Maybe @ricardoV94 or @junpenglao have some suggestions for speeding things up?

If you are interested in posterior predictive checks and what is sometimes called the Bayesian p-value, you should take a look at arviz.plot_bpv and arviz.plot_loo_pit in the ArviZ documentation.
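For example, assuming idata is an InferenceData object that already has a posterior_predictive group and the observed variable is called "observed", something like:

import arviz as az

az.plot_bpv(idata, kind="p_value")    # posterior predictive p-values
az.plot_loo_pit(idata, y="observed")  # LOO-PIT check for the observed variable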

@OriolAbril Is it possible to access the raw data that goes into those plots?

For plot_loo_pit you can use loo_pit; for bpv there is no dedicated helper, but the values can be calculated in 1-2 lines, either with xarray itself or with xarray-einstats (whose introductory blog post I still have to write).

The raw data for the last example can be generated from something like:

# Median as the test statistic: compare T(observed data) to T per posterior draw.
obs_t = data.observed_data.quantile(.5)
pp_t = data.posterior_predictive.quantile(.5, dim="y_dim_0")
bpv = (pp_t >= obs_t).mean()  # fraction of draws at least as extreme

We also have arviz.apply_test_function for test functions that take a 1d array; it helps apply them with chain and draw as batch dimensions, so it works on both observed_data and posterior_predictive at the same time.
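For example, with one of the ArviZ example datasets and np.min as the test function:

import arviz as az
import numpy as np

idata = az.load_arviz_data("centered_eight")
# Applies the test function with chain and draw as batch dimensions and
# stores the result in both observed_data and posterior_predictive.
az.apply_test_function(idata, lambda y: np.min(y))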


Calculating logp is always going to be slower [citation needed] than taking random draws, which is what sample_posterior_predictive does. Iterating over the trace is not so bad; just make sure you cache the logp function. Every time you call testmodel.datalogpt.eval() it compiles the same function again.

You might want to try something like this (assuming you aren’t doing it already):

# Compile the logp function once and reuse it, instead of recompiling
# with .eval() on every call.
datalogp_fn = testmodel.fastfn(testmodel.datalogpt)
flattened_trace_numpy = test_trace.posterior.a.values.flatten()
logp = [datalogp_fn({"a": trace_value}) for trace_value in flattened_trace_numpy]

Apologies for the broken pseudo-code.

Ah thank you, it’s good to get a confirmation that I’m not missing something obvious. I have a related question about dealing with variables here, namely:

In the current version I am using testmodel.datalogpt.eval, so I can pass a dictionary with the untransformed variables as keys and their values at the point of interest as values (in the test model above there are no transformed variables, but there are some in the model I want to use this for). Therefore, I can:

  • Get the trace with trace_ = pm.util.dataset_to_point_dict(test_trace.posterior)
  • Convert the keys from variable names to actual variables. For each point index i run: {testmodel.named_vars[key]: value for key, value in trace_[i].items()}
  • Feed each point to testmodel.datalogpt.eval (a sketch follows this list).
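Put together, a minimal sketch of those three steps, reusing testmodel and test_trace from above:

trace_ = pm.util.dataset_to_point_dict(test_trace.posterior)
points = [
    {testmodel.named_vars[name]: value for name, value in point.items()}
    for point in trace_
]
# Note: .eval() recompiles the function on every call, which is the slow part.
loglik = [testmodel.datalogpt.eval(point) for point in points]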

However, if I use testmodel.fastfn(testmodel.datalogpt), the resulting function needs a dictionary with the transformed variable names as keys and the transformed values at the point of interest as values. I was wondering if there is a nice, simple way / already-implemented function to turn a dictionary of untransformed variables and their values into the corresponding transformed variables and values.

Thank you again for all the help!

Perhaps @junpenglao knows this? I am a bit out of touch with PyMC V3, so I don't know it off the top of my head.

Also a small additional problem: changing the line where a is defined to a = pm.HalfNormal('a') to get a transformed variable, and then using datalogp_fn = testmodel.fastfn(testmodel.datalogpt), raises TypeError: Unknown input or state: a. The function has 2 named inputs (a_log__, dataset). So it looks like the resulting function now also requires the observed data, in contrast to the function from the initial model, which only required a.
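For concreteness, a minimal sketch of what triggers this, reusing dataset_np from above (testmodel2 is just a fresh copy of the model):

with pm.Model() as testmodel2:
    dataset = pm.Data('dataset', dataset_np)
    a = pm.HalfNormal('a')  # sampled on the log scale as a_log__
    pm.Normal('observed', mu=a, sigma=1, observed=dataset)

datalogp_fn = testmodel2.fastfn(testmodel2.datalogpt)
datalogp_fn({'a': 1.0})
# TypeError: Unknown input or state: a. The function has 2 named inputs
# (a_log__, dataset).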