"Unreasonable" model log likelihood in GMM

Hi!

I am using ADVI for a GMM model on a 2-D dataset. My dataset comprises 10,000 2-D points, and the ADVI inference gives a pretty good cluster-assignment result, as indicated by the picture below:
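For context, here is a minimal sketch of the kind of setup I mean; K=3, the priors, and the placeholder data are made up for illustration (the real model and data are in the notebook linked below):

```python
import numpy as np
import pymc3 as pm

K = 3
data = np.random.randn(10000, 2)  # placeholder for the real 2-D dataset

with pm.Model() as model:
    w = pm.Dirichlet("w", a=np.ones(K))                     # mixture weights
    mu = pm.Normal("mu", mu=0.0, sigma=10.0, shape=(K, 2))  # component means
    tau = pm.Gamma("tau", alpha=1.0, beta=1.0, shape=K)     # component precisions
    comp_dists = [pm.MvNormal.dist(mu=mu[k], tau=tau[k] * np.eye(2)) for k in range(K)]
    obs = pm.Mixture("obs", w=w, comp_dists=comp_dists, observed=data)

    approx = pm.fit(n=30000, method="advi")                 # ADVI fit
```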

However, when I check the model likelihood, which is p(\theta, y), where \theta refers to the latent variables and y refers to the observed data, it gives me an extremely low value.

More specifically, I sampled points from the posterior and evaluated approx.model.logp(point) on each.
For 10 sampled points, I get the log joint likelihoods below:

```
array([-21182.10855688, -21284.10041797, -21996.51888791, -22184.84194186,
       -21503.81905424, -21611.31545723, -21688.10228115, -21476.98249261,
       -21166.65294749, -21232.66723859])
```
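For reference, this is roughly how I obtained these values; a sketch assuming `model` and `approx` come from an ADVI fit as above:

```python
import numpy as np

trace = approx.sample(10)                          # 10 draws from the ADVI posterior
points = [trace.point(i) for i in range(10)]       # parameter dicts, one per draw
logps = np.array([model.logp(p) for p in points])  # log joint p(theta, y) per draw
```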

This seems really low even though we have 10,000 data points. On average, each data point contributes approximately -20,000/10,000 = -2 to the log-likelihood, which is a likelihood of e^{-2} ≈ 0.135. This is below my expectation.
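The per-point figure is just the average log joint divided by the number of data points; a quick check, reusing the (rounded) values printed above:

```python
import numpy as np

# the ten log joint values printed above, rounded
logps = np.array([-21182.11, -21284.10, -21996.52, -22184.84, -21503.82,
                  -21611.32, -21688.10, -21476.98, -21166.65, -21232.67])

per_point = logps.mean() / 10000     # average log joint per data point
print(per_point, np.exp(per_point))  # roughly -2.15 and e^{-2.15} ≈ 0.116
```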

So my question is, does this model likelihood seem normal?
You can find my notebook here, which includes the toy-dataset generating process.

I don't think there is much reason for alarm :slight_smile: The logp of pymc3 (and likely most PPLs) is unnormalized, meaning that it might not exactly equal what you compute by hand.
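If you want to see how the logp compares to a hand computation, here is a simple check with a single Normal (scipy's logpdf as the normalized reference; just an illustration, not your GMM):

```python
import pymc3 as pm
import scipy.stats as st

x = 1.5
pm_val = pm.Normal.dist(mu=0.0, sigma=1.0).logp(x).eval()  # pymc3's logp
sp_val = st.norm.logpdf(x, loc=0.0, scale=1.0)             # normalized log density
print(pm_val, sp_val)  # any mismatch is the constant in question
```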

I see. What's the constant normalizing factor of pymc3's logp that would make it the true p(\theta_s, y)?

It depends on the model; I'm not completely sure in this case.