Xarray serialization error when marginal likelihood depends on random variables

I am unable to save traces to file using to_netcdf when gp.marginal_likelihood is conditioned on random variables. To see this, consider the following toy example:

import numpy as np
import pymc3 as pm

x = np.linspace(-1, 1, 1000)
y = np.random.normal(loc=0.0, scale=0.001, size=len(x)) + 0.1 + np.sin(x)

with pm.Model() as model:
    mean = pm.Normal("mean", mu=0.1, sd=0.5)

    gp = pm.gp.Marginal(cov_func=pm.gp.cov.Linear(input_dim=1, c=1))

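    # note: y.flatten() - mean is a symbolic TensorVariable, not a numpy array, because mean is a random variable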
    y_pred = gp.marginal_likelihood('y_pred', X=x.reshape(-1, 1), y=y.flatten() - mean, noise=0.1)

    # Sample
    trace = pm.sample(return_inferencedata=True, tune=100, draws=300)
    trace.to_netcdf("test")

This raises the error

ValueError: unable to infer dtype on variable 'gp'; xarray cannot serialize arbitrary Python objects

Has anyone encountered this problem before, knows how it is meant to be solved in PyMC, or has a suggestion for a workaround?
Thank you all very much in advance.


@OriolAbril

You should be able to fix this by adding trace.posterior = trace.posterior.astype(float) between creation and saving.
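
For instance, a minimal sketch in the context of the example above:

trace = pm.sample(return_inferencedata=True, tune=100, draws=300)
# cast all posterior variables to float so xarray can infer a numeric dtype
trace.posterior = trace.posterior.astype(float)
trace.to_netcdf("test")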

Still, can you share the output of print(trace.posterior)? It is very strange that it gets assigned an object dtype instead of a numeric one, and that must come from earlier on: when converting to InferenceData, ArviZ leaves the input arrays completely untouched, and you can, for example, pass dask arrays instead of numpy arrays without any problem (xarray supports both).


print(trace.posterior) outputs

<xarray.Dataset>
Dimensions:  (chain: 2, draw: 300)
Coordinates:
  * chain    (chain) int64 0 1
  * draw     (draw) int64 0 1 2 3 4 5 6 7 8 ... 292 293 294 295 296 297 298 299
Data variables:
    mean     (chain, draw) float64 0.9928 0.997 0.999 ... 0.9992 1.02 0.9917
Attributes:
    created_at:                 2022-04-13T16:41:41.419915
    arviz_version:              0.12.0
    inference_library:          pymc3
    inference_library_version:  3.11.5
    sampling_time:              25.90023398399353
    tuning_steps:               100

The problem, however, is not with the posterior but with trace.observed_data.
Calling trace.observed_data raises

AttributeError: 'TensorVariable' object has no attribute 'item'

The following works fine but doesn’t save the observed data:

trace.to_netcdf(filename='test', groups=["posterior", "log_likelihood", "sample_stats"])

Something like trace.observed_data.astype(np.array) or trace.observed_data.astype(float) doesn’t work either, raising

Cannot interpret '<built-in function array>' as a data type
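
A workaround that might do in the meantime is to drop the broken group and re-attach the numeric observations by hand before saving. An untested sketch, where the variable name "y_pred" and the dimension name are assumptions:

import xarray as xr

# replace the symbolic observed_data group with the plain numpy observations
del trace.observed_data
trace.add_groups(observed_data=xr.Dataset({"y_pred": ("y_pred_dim_0", y)}))
trace.to_netcdf("test")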

I’m not sure what is going on here; maybe @michaelosthege could help? It looks like gp.marginal_likelihood registers an observed variable that is not retrieved properly.

My only guess is that it’s because the other distributions don’t allow FreeRVs to be passed via the observed kwarg, so the converter doesn’t support this case. I’m not sure why it is allowed here, though.

Yes, your y parameter to gp.marginal_likelihood is a TensorVariable.
It looks like PyMC can deal with that when it comes to differentiation and sampling, but as Oriol correctly pointed out, the code that converts to InferenceData never expected that the observed data might be symbolic.

For PyMC v3 this will be a wontfix bug, but there’s a chance the problem persists in v4, in which case something should be done about it.

However, there may be a valid reparametrization of the model as a pm.gp.Latent Gaussian process, essentially something in this direction:

cov = pm.gp.cov.Linear(input_dim=1, c=1)
gp = pm.gp.Latent(cov_func=cov)  # zero mean function; the mean enters via mu below
pred = gp.prior("pred", X=x.reshape(-1, 1))
L = pm.Normal("likelihood", mu=pred + mean, sigma=0.1, observed=y)

A non-zero mean_func=pm.gp.mean.Constant(mean) should also work here.
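
For the example above, that could look like the following sketch (untested). Since a Latent GP samples all 1000 latent function values with NUTS, keeping pm.gp.Marginal and moving the mean into the mean function is likely the cheaper route:

with pm.Model() as model:
    mean = pm.Normal("mean", mu=0.1, sd=0.5)

    gp = pm.gp.Marginal(
        mean_func=pm.gp.mean.Constant(mean),
        cov_func=pm.gp.cov.Linear(input_dim=1, c=1),
    )

    # y stays a plain numpy array, so observed_data serializes without issues
    y_pred = gp.marginal_likelihood("y_pred", X=x.reshape(-1, 1), y=y, noise=0.1)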
