pm.traceplot not working when the observations are incomplete

I am using a MatrixNormal distribution in my model:

obs = pm.MatrixNormal('obs', mu=mean, rowchol=R_chol, colchol=cross_chol, observed=df, shape=df.shape)

The dataframe df contains incomplete data, as a lot of its entries are None. As suggested by https://nbviewer.jupyter.org/github/fonnesbeck/scipy2014_tutorial/blob/master/3_Introduction-to-PyMC.ipynb, the model handles the missing observations and performs inference. However, when I use pm.traceplot(trace), I get the following error:

log_likelihood_vals_point
log_like_val = log_like_val[~var.observations.mask]
IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed
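For context, this IndexError is what NumPy raises when a 1-D array is boolean-indexed with a 2-D mask. A minimal sketch of the mismatch (the names here are illustrative, not the actual PyMC3 internals):

```python
import numpy as np

# The log likelihood values have been flattened to 1-D, but the mask
# of the 2-D observed DataFrame is still 2-D, so indexing fails.
log_like_val = np.zeros(6)           # 1-D, as in the traceback
mask = np.zeros((2, 3), dtype=bool)  # 2-D mask from a 2-D observed array

try:
    log_like_val[~mask]
except IndexError as e:
    print(e)  # same error as in the traceback above
```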

This happens even if I set the var_names of the traceplot and exclude obs_missing.

Any ideas on how to fix this and plot the trace?

Your error is actually happening when trying to convert the data to InferenceData. PyMC3 now delegates plotting, stats and diagnostics to ArviZ; it still offers the plotting functions as pm.function, but these are just aliases to ArviZ functions, exposed from the pymc3 namespace for convenience.

You should convert your samples to InferenceData first; this blog post of mine should help you: PyMC3 with labeled coords and dims | Oriol unraveled. Then call az.plot_trace or pm.plot_trace (pm.traceplot is deprecated and will stop working with the next pymc3 release).

All ArviZ functions also work with pymc3 MultiTraces, because they do the conversion under the hood. In cases like yours, though, where special care is needed when converting, this automatic conversion can break.
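As a rough sketch of the explicit-conversion route: once you hold an InferenceData object yourself, plotting no longer relies on the fragile automatic MultiTrace conversion. The draws below are made up for illustration; in practice they would come from your pymc3 trace (e.g. via az.from_pymc3 or the blog post above).

```python
import numpy as np
import arviz as az

# Illustrative posterior draws with shape (chains, draws);
# replace this dict with a conversion of your actual trace.
rng = np.random.default_rng(0)
idata = az.from_dict(posterior={"mu": rng.normal(size=(4, 500))})

# Plotting an InferenceData object directly skips the automatic
# conversion step where your error was raised.
az.plot_trace(idata, var_names=["mu"])
```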

Lastly, I think we added some nice warnings and error messages to help when errors like these are triggered. Which versions of PyMC3 and ArviZ are you using?

Thanks a lot for your response.

I was using PyMC3 v3.8 at that time. I just upgraded to v3.11.1 and everything works great. By the way, the version upgrade led to an impressive speed increase. Was this to be expected?

There have been many improvements since v3.8, so it’s not unexpected that you now see speed-ups somewhere. What I can’t know though (because I don’t remember all the changes since 3.8, nor do I know your model) is where these speed-ups take effect. Is it during sampling? Plotting?

I actually noticed the speed-up during sampling. In a simple test model, inference used to take ~60 seconds with v3.8, whereas with v3.11.1 it takes ~40 seconds. What’s even more important is that v3.11.1 can handle a large model with lots of incomplete observations, where v3.8 used to crash.

If you have a large number of observations and don’t plan on doing model comparison with any of az.compare, az.loo or az.waic, you can accelerate the conversion process by passing idata_kwargs={"log_likelihood": False} to pm.sample, if you are not doing so already. The main improvement will probably be in memory usage rather than time, though, as this avoids storing the pointwise log likelihood values, which is an array of shape (nchains, ndraws, nobservations).
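To get a feel for the savings: that array grows multiplicatively in all three dimensions. With made-up but plausible sizes:

```python
# Hypothetical run sizes, purely for illustration.
nchains, ndraws, nobservations = 4, 1000, 50_000

# float64 pointwise log likelihood: one value per chain/draw/observation.
n_bytes = nchains * ndraws * nobservations * 8
print(f"{n_bytes / 2**30:.1f} GiB")  # 1.5 GiB skipped by log_likelihood=False
```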

Thanks for your advice, I will start using this option. Indeed, there was no difference in speed, but I guess the improved memory usage will help, especially in larger models.

Just to mention it: my model still crashes when the number of missing observations becomes too large. The message I get is The derivative of RV obs_missing.ravel()[7576] is zero. for all my obs_missing variables. With smaller numbers of missing observations, v3.11.1 works well, as mentioned previously.

Does this make sense?

I don’t really know much about the missing variable imputation internals. Maybe this talk helps you better understand what is going on.

FWIW, I recently included missing values in a model I’m building. I followed junpeng’s example that Oriol noted above and everything works as expected. I think the most important aspects are standardizing the dataset and using a hierarchical prior on the missing data.

I have an example here as part of a slightly different topic: MRE_missing_values · GitHub
