Does sample_posterior_predictive() ignore arrays in the trace when used as model factory?

@lucianopaz shows how to use a model factory to sample in a second model using values copied from the trace of a first model. (Thanks Luciano, for pioneering that valuable technique!)

When employing a model factory I am seeing a strange difference in the effect of copied arrays versus copied scalars. Considering this simple example:

import pymc3 as pm

def model_factory():
    """Create a model with a scalar normal and an array normal"""
    with pm.Model() as model:
        foo = pm.Normal('foo', 0, 1)
        foo_array = pm.Normal('foo_array', 0, 1, shape=10)
        pm.Deterministic('bar', foo)
        pm.Deterministic('bar_array', foo_array)
    return model

# Create a model and sample it
with model_factory() as antecedent:
    ant_trace = pm.sample(draws=600, chains=2, tune=500)

# Create a second model, copy the values for foo and foo_array
# to the model, and then sample posterior
with model_factory() as consequent:
    ant_trace_df = pm.trace_to_dataframe(ant_trace, varnames=['foo', 'foo_array'])
    conseq_trace = pm.sample_posterior_predictive(
        trace=ant_trace_df.to_dict('records'),
        var_names=['bar', 'bar_array'],
        samples=len(ant_trace_df))

This example creates two identical models, antecedent and consequent. antecedent is sampled, and the values for foo and foo_array are copied to consequent. Then consequent is sampled.

Take a look at one sample (e.g. 42) in the traces of antecedent and consequent. Note that we cannot look at the value for foo in consequent, as it does not exist in the trace. But we can look at bar, a deterministic that takes the value of foo. The values for bar are identical in antecedent and consequent, as expected:

foo_array is just like foo, except it is an array instead of a scalar. And bar_array is just like bar, except it is also an array instead of a scalar. But the values for bar_array are completely different between antecedent and consequent:

sample_posterior_predictive seems to be ignoring the values of foo_array copied in from antecedent, and instead resampling the normal distribution.

Is this expected behavior? Is this a bug in sample_posterior_predictive()? (fast_sample_posterior_predictive() exhibits the same behavior, albeit much faster.)

PyMC3 version 3.9.3

Reported as bug.

1 Like

Elsewhere @lucianopaz figured out the problem:

The problem is caused by pm.trace_to_dataframe(ant_trace, varnames=['foo', 'foo_array']). This creates a pandas dataframe where the entries for foo_array are split across columns named foo_array_0, foo_array_1, etc. These are then passed on to sample_posterior_predictive but there is no way for the sampler to understand that foo_array_0, foo_array_1, and the rest should be stacked together into a single foo_array value.

There are two ways to address this problem:

  1. Don’t convert to a DataFrame and drop the values from the MultiTrace directly:
with model_factory() as consequent:
    ant_trace.remove_values("bar")
    ant_trace.remove_values("bar_array")
    conseq_trace = pm.sample_posterior_predictive(
        trace=ant_trace,
        var_names=['bar', 'bar_array'],
    )
  1. Use arviz.InferenceData that stores the posterior samples in an xarray.Dataset, that can understand the multidimensional foo_array and bar_array without problems.
with model_factory() as antecedent:
    ant_trace = pm.sample(draws=600, chains=2, tune=500, return_inferencedata=True)

# Create a second model, copy the values for foo and foo_array
# from the first model to the second to the model. Then sample posterior
with model_factory() as consequent:
    ant_trace.posterior = ant_trace.posterior.drop_vars(["bar", "bar_array"])
    conseq_trace = pm.sample_posterior_predictive(
        trace=ant_trace,
        var_names=['bar', 'bar_array'],
    )