Did something change with the var_names keyword for sample_posterior_predictive from v3 to v5?

I’ve fitted my model and am trying to compute posterior predictive distributions for new data. Currently, sample_posterior_predictive returns an empty object when I run it with the defaults, i.e. when I just give it the training trace. If, on the other hand, I specify var_names=[some_variables…], then I get a seemingly useful trace out. Is this the correct behaviour, or am I doing something wrong? Previously, in v3 (which I believe was the last version I used), sample_posterior_predictive would give predictive distributions for all the variables in the model by default. Is that no longer the case, i.e. do I now have to list them all explicitly? Additionally, my actual prediction variable, the “likelihood” variable, doesn’t seem to be included in this list, so I can’t get a trace for the thing I want most.

I know I haven’t given you the model yet, so for now I’m just happy to be informed about the general expected behaviour of this function :). If what I’m seeing sounds wrong, I’ll work on creating a minimal example.

Edit: Ok, I did more experimenting with simpler models, and there everything works as I expected, i.e. there is no need to specify var_names for sample_posterior_predictive. So I guess it is something in my model (though it samples just fine, so I am confused). Any idea why sample_posterior_predictive would return an empty dataset?

Edit 2: sample_posterior_predictive works fine when I use the training dataset again, so I guess something is wrong with the new input data and/or the model definition. It’s strange though, because there is no error or anything, just empty output…

Ah, ok, I think I figured it out: it seems to be because I set the “y” values to NaN in my out-of-sample dataset. Somehow that triggers PyMC to drop the whole dataset so nothing happens, or some such? It does work, though, if I specify var_names for variables that aren’t affected by those NaNs.

Anyway, if I set the dummy “y” values to zero instead of NaN, then at least something happens. I haven’t checked that the output is sensible yet, but it made sense dimensionally.

Is that the expected behaviour, i.e. that NaNs should not be used to indicate absent dependent variable values? A warning message or something might be helpful here.

Setting values to NaN triggers automatic imputation of the NaN entries (so they become vanilla unobserved variables).

But that wasn’t what happened. Instead nothing happened.

Some environments like VSCode and Colab now hide the warning logs emitted by PyMC. Anyway, you can check whether imputation was done with pm.model_to_graphviz or by just checking what model.named_vars looks like.
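
For example, something like this toy model (not yours; the exact name PyMC gives the imputed part, e.g. something like y_missing or y_unobserved, can differ between versions):

import numpy as np
import pymc as pm

with pm.Model() as m:
    mu = pm.Normal("mu", 0, 1)
    # A single NaN in the observed data is enough to trigger automatic imputation
    pm.Normal("y", mu=mu, sigma=1, observed=np.array([1.0, 2.0, np.nan]))

# If imputation happened, there will be extra variables besides "mu" and "y"
print(m.named_vars.keys())

# Or inspect the graph directly
pm.model_to_graphviz(m)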

To answer your original question, the behavior has changed. Any variables listed in var_names (as well as any variables that depend on them) will be resampled from the prior rather than taken from the trace. If you can share your model and the exact problems you’re seeing, it will be easier to help.
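
Roughly, with placeholder names (“y” an observed variable, “sigma” a free parent of it; model and trace stand for your model and fitted trace):

with model:
    # "y" (the observed/likelihood variable) is redrawn using the posterior samples
    # of its parents stored in the trace
    pp = pm.sample_posterior_predictive(trace, var_names=["y"])

    # "sigma" is in var_names, so it is NOT taken from the trace: it is redrawn from
    # its prior, and anything downstream of it (here "y") is redrawn using those values,
    # which is usually not what you want for posterior predictions
    pp2 = pm.sample_posterior_predictive(trace, var_names=["sigma", "y"])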

Ok I reproduced it in a very simple example:

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)

# Based on https://www.pymc.io/projects/examples/en/latest/generalized_linear_models/GLM-robust.html

# Toy data
#---------
size = 100
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
# y = a + b*x
true_regression_line = true_intercept + true_slope * x
# add noise
y = true_regression_line + rng.normal(scale=0.5, size=size)

# Add outliers
x_out = np.append(x, [0.1, 0.15, 0.2])
y_out = np.append(y, [8, 6, 9])

data = pd.DataFrame(dict(x=x_out, y=y_out))

# Use subset for training
x_train = x_out[0:-1:10]
y_train = y_out[0:-1:10]

# PyMC Model
#------------
def build_model(x, y):
    with pm.Model() as model:
        xdata = pm.Data("x", x, dims="obs_id", mutable=False)

        # define priors
        intercept = pm.Normal("intercept", mu=0, sigma=1)
        slope = pm.Normal("slope", mu=0, sigma=1)
        sigma = pm.HalfCauchy("sigma", beta=10)

        mu = intercept + slope * xdata

        # define likelihood
        likelihood = pm.Normal("y", mu=mu, sigma=sigma, observed=y, dims="obs_id")
    return model

refit = True
if refit:
    with build_model(x_train, y_train):
        # inference
        trace = pm.sample(tune=2000)
    
        print("trace:")
        print(trace)
        print("trace.posterior:")
        print(trace.posterior)
    
        # Save trace
        trace.to_netcdf("training_trace.nc") 

# Reload training trace for forward predictions
training_trace = az.InferenceData.from_netcdf("training_trace.nc") 
print("training_trace:")
print(training_trace)
nan_y = y_out * np.nan
zero_y = y_out * 0.
with build_model(x_out, nan_y):
    pp_trace = pm.sample_posterior_predictive(training_trace)
    print("pp_trace:")
    print(pp_trace)

Building the model with nan_y, the output is:

training_trace:
Inference data with groups:
	> posterior
	> sample_stats
	> observed_data
	> constant_data
pp_trace:
Inference data with groups:
	> 

But doing it with zero_y, the output is:

training_trace:
Inference data with groups:
	> posterior
	> sample_stats
	> observed_data
	> constant_data
Sampling: [y]
|█████████████████████████████████████████████████████| 100.00% [4000/4000 00:00<00:00]
pp_trace:
Inference data with groups:
	> posterior_predictive
	> observed_data
	> constant_data

I.e. it does absolutely nothing if the "y" values are all NaN, which I find confusing since they are just a placeholder; in my understanding they shouldn’t have any impact on how the posterior predictions work. I guess the actual values don’t matter, but somehow NaN is treated differently because of the imputation logic or something?
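
As a sanity check along the lines of the named_vars suggestion above, I guess one could compare the two builds directly (a sketch; I haven’t verified what the extra imputed variable ends up being called, it depends on the PyMC version):

nan_model = build_model(x_out, nan_y)
zero_model = build_model(x_out, zero_y)

# With nan_y I'd expect an extra automatically-imputed variable alongside "y",
# with zero_y just the original free variables plus "y"
print(sorted(nan_model.named_vars.keys()))
print(sorted(zero_model.named_vars.keys()))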