Calculating model comparison using log pointwise predictive density

I know that PyMC3 provides criteria for model comparison using WAIC and LOO, yet I’m interested in using the log pointwise predictive density (note that this is different from LPPD) for comparing PyMC3 models.

When I was trying to understand the log predictive density, I noticed that its equation is the same as the equation used for posterior predictive checks (PPC), if we take the log.

Therefore, can we use the posterior predictive checks to determine the log predictive density of the model? If not, how can we calculate the log predictive density of PyMC3 models?

Thanks

Which log pointwise predictive density are you referring to in the paper?

This one:

[image: equation]

Here p_{post}(\theta) = p(\theta|y),
or

[image: equation]

Note that both are the same, just different ways of looking at it (the 2nd equation is used with draws from a distribution).
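
The two equation images are not reproduced here. Assuming they show the standard definitions of the log pointwise predictive density (an assumption, since the images themselves are unavailable), they would read roughly as follows, with the first being the integral form and the second the version computed from S posterior draws \theta^s:

```latex
\mathrm{lppd} = \sum_{i=1}^{n} \log \int p(y_i \mid \theta)\, p_{post}(\theta)\, d\theta

\text{computed lppd} = \sum_{i=1}^{n} \log \left( \frac{1}{S} \sum_{s=1}^{S} p(y_i \mid \theta^s) \right)
```

In both forms y_i is an observed data point, which is why the quantity can be evaluated without simulating any new data.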

Isn’t this exactly the lppd? It is computed in PyMC3 as part of the computation of LOO (and WAIC).

It does not require new (simulated or real) data, as the y_i is the original observed data
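
As a rough illustration of that point, here is a minimal sketch of the lppd computed from posterior draws and the original observations only. It is not PyMC3's internal code: it assumes a toy normal likelihood with known sigma, and the helper names (`normal_logpdf`, `log_mean_exp`, `lppd`) and the hand-written draws are hypothetical.

```python
import math

def normal_logpdf(y, mu, sigma=1.0):
    # Log density of Normal(mu, sigma) evaluated at y.
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def log_mean_exp(values):
    # Numerically stable log of the average of exp(values).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def lppd(y_obs, mu_draws, sigma=1.0):
    # For each observed y_i, average p(y_i | theta^s) over the draws,
    # take the log, and sum over the observations.
    return sum(
        log_mean_exp([normal_logpdf(y_i, mu_s, sigma) for mu_s in mu_draws])
        for y_i in y_obs
    )

# Hypothetical values standing in for observed data and a posterior trace of mu.
y_obs = [-0.3, 0.1, 0.8]
mu_draws = [-0.1, 0.0, 0.2, 0.3]
print(lppd(y_obs, mu_draws))
```

Only `y_obs` and the posterior draws enter the calculation; nothing is sampled from the likelihood.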

I thought “log pointwise predictive density” and “log pointwise posterior predictive density” are different.

But you are correct, junpenglao. It seems they are the same. Thanks for the clarification.

Could you please answer these questions as well?

Is there any connection between lppd and posterior predictive checks (PPC)? Don’t we use the same process when sampling the posterior in PPC? Can’t we take the mean of those samples and take the log of them to compute the lppd?

It does not require new (simulated or real) data, as the y_i is the original observed data

Does that mean that WAIC should be calculated with the same data that was used for sampling (i.e. to train the model)?

They are computed differently - in PPC you generate samples conditioned on the posterior. However, it is an interesting idea to compute the lppd using the PPC samples. You can ask Aki Vehtari on the Stan discourse - he would have better insight.

Yes.

I’ll ask on Stan discourse.

Thanks junpenglao. This discussion was really helpful for me to understand certain aspects of the lppd.

The answers to these questions were provided by @junpenglao. I’m just listing them.

“Log pointwise predictive density” and “Log pointwise posterior predictive density (LPPD)” are the same, and therefore we can use the LPPD computed during the computation of WAIC (or LOO).

PPC and LPPD are computed differently - they may have a connection. We need to discuss this with someone who has more insight, probably on the Stan discourse.

Posting the answer from Stan discourse here:

This is a good point: basically, PPC samples are generated from the same posterior predictive density that the lppd evaluates at the observed data. So going back to @Nadheesh’s original question: you cannot use the PPC samples (generated from pm.sample_ppc) to evaluate the lppd, as PPC samples are actually generated by the density defined in the lppd.

@junpenglao thanks for pointing that out.

I thought about it a bit: since we have \bar{y} as the output of PPC, can’t we just use likelihood.logp(\bar{y})?

That would be the posterior expectation of the observed. The point here is that you want some distance measure of the observation against the fitted model, which is why you always need some actual data (the observed data you used to fit the model, or a testing set, or data collected in the future).

@junpenglao I took some time to understand what you said, but I still don’t get the point.

When I look into the implementations of lppd and PPC,

  • In lppd we take the posterior samples \theta^s and use them to compute observedRV.logp(\theta^s), which is log(p(y_i^k|\theta^s)), where y_i^k is the k^{th} sample drawn from the i^{th} observed data point.
  • In PPC we randomly draw posterior samples \theta^s and call observedRV.random(\theta^s) to draw a sample y_i^k|\theta^s from y_i|\theta^s, hoping that those samples give an approximation of y_i|\theta^s. So if we take the logp, we still get log(p(y_i^k|\theta^s)).

The only difference I see is the random selection of posterior samples \theta^s in PPC, instead of using each of them exactly once as in the lppd calculation. However, if we use the exact same sample set without randomly drawing from it, isn’t the output exactly the same when we take the logp of the PPC?

Please correct me if I’m wrong. Sorry about troubling you with this again and again :slight_smile:

In PPC, the samples are not y_i \mid \theta^{'} (that would be the case only if you were taking y_i from the observations). Instead, they are new draws y_i^{'} \sim p(y_i \mid \theta^{'}), with \theta^{'} drawn from the posterior.

Imagine you have a really bad model that fits the data poorly. You will see that lppd gives a very bad output. But logp(PPC) will not distinguish a good model from a bad one, as the output is always more or less the same.
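
To make this concrete, here is a small sketch under stated assumptions: a toy normal likelihood with sigma = 1, and hand-made "posterior" draws of mu standing in for a good and a bad fit. None of this comes from an actual PyMC3 trace, and all names are hypothetical.

```python
import math
import random

def normal_logpdf(y, mu, sigma=1.0):
    # Log density of Normal(mu, sigma) evaluated at y.
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def log_mean_exp(values):
    # Numerically stable log of the average of exp(values).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def lppd(y_obs, mu_draws):
    # Log pointwise predictive density evaluated at the observed data.
    return sum(
        log_mean_exp([normal_logpdf(y_i, mu_s) for mu_s in mu_draws])
        for y_i in y_obs
    )

def mean_logp_of_ppc(mu_draws, n_rep, rng):
    # Simulate replicated data from each draw and score it under the
    # same draw that generated it (the "logp of the PPC" idea).
    total, count = 0.0, 0
    for mu_s in mu_draws:
        for _ in range(n_rep):
            y_rep = rng.gauss(mu_s, 1.0)
            total += normal_logpdf(y_rep, mu_s)
            count += 1
    return total / count

rng = random.Random(0)
y_obs = [rng.gauss(0.0, 1.0) for _ in range(50)]        # data centred at 0
good_draws = [rng.gauss(0.0, 0.1) for _ in range(200)]  # stand-in for a good fit
bad_draws = [rng.gauss(5.0, 0.1) for _ in range(200)]   # stand-in for a bad fit

lppd_good, lppd_bad = lppd(y_obs, good_draws), lppd(y_obs, bad_draws)
ppc_good = mean_logp_of_ppc(good_draws, len(y_obs), rng)
ppc_bad = mean_logp_of_ppc(bad_draws, len(y_obs), rng)
print("lppd:", lppd_good, lppd_bad)
print("mean logp of PPC draws:", ppc_good, ppc_bad)
```

The lppd values differ sharply because they are anchored to the observed data, while the logp of the simulated draws is about the same for both models, since each draw is scored under the parameter that produced it.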

Now I get it. :smiley:

So when using observedRV.random(\theta^s), each draw is a single point randomly selected from y_i|\theta^s, and y_i^{'} is built by taking such random draws for multiple \theta^s. So the samples do not all come from the same distribution, but from different distributions.

Thanks for the help @junpenglao

Yep, the key here is to realize that PPC is somewhat equivalent to the log posterior, and lppd is a more formal way to quantify the visualization of plotting the observed y_i on top of the PPC.

I did not see the relationship before but now I fully understand it.

When using PPC we extract a single sample from y_i|\theta^s while changing \theta^s. That is why it should be equivalent to the log posterior.

Thanks a lot. :slight_smile: