Why were the observed values in the out-of-sample prediction the true values of the training set, rather than the true values of the test set?

When I study the Out-of-Sample Predictions at Biking with BART case, I found a question about the posterior predictive check in train and test sets.
Why are the observed values ​​of the posterior predictions of the training set and the test set both the true values ​​of the training samples?
In other machine learning methods, we usually use the model built based on the training set to predict the test set. In the out-of-sample prediction of pymc, it seems that the test set is used to build the model and predict the training samples.

Looking forward to your reply. Thank you very much.

The code as follows:

# modeling and train sets prediction
with pm.Model() as model_oos_regression:
    X = pm.MutableData("X", X_train)
    Y = Y_train
    α = pm.Exponential("α", 1)
    μ = pmb.BART("μ", X, np.log(Y))
    y = pm.NegativeBinomial("y", mu=pm.math.exp(μ), alpha=α, observed=Y, shape=μ.shape)
    idata_oos_regression = pm.sample(random_seed=RANDOM_SEED)
    posterior_predictive_oos_regression_train = pm.sample_posterior_predictive(
        trace=idata_oos_regression, random_seed=RANDOM_SEED
    )
# test sets prediction
with model_oos_regression:
    X.set_value(X_test)
    posterior_predictive_oos_regression_test = pm.sample_posterior_predictive(
        trace=idata_oos_regression, random_seed=RANDOM_SEED
    )

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
)
# compare the posterior predictive distribution with the observed data
az.plot_ppc(
    data=posterior_predictive_oos_regression_train, kind="cumulative", observed_rug=True, ax=ax[0]
)
ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))

az.plot_ppc(
    data=posterior_predictive_oos_regression_test, kind="cumulative", observed_rug=True, ax=ax[1]
)
ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));

@aloctavodia

If I changed the code as follows, it dose not work. Maybe the pymc-bart is not support the pm.MuatableData for Y_data.
Looking forward to your reply. Thank you very much.

with pm.Model() as model:
    X = pm.MutableData("X", X_train)
    Y = pm.MutableData("Y", Y_train)
    α = pm.Exponential("α", 1)
    μ = pmb.BART("μ", X, np.log(Y))
    y = pm.NegativeBinomial("y", mu=pm.math.exp(μ), alpha=α, observed=Y)
    trace = pm.sample(random_seed=RANDOM_SEED)
    posterior_predictive_train = pm.sample_posterior_predictive(trace, random_seed=RANDOM_SEED)

    pm.set_data({"X": X_test, "Y": Y_test})
    
    posterior_predictive_test = pm.sample_posterior_predictive(trace, random_seed=RANDOM_SEED)

errors:

TypeError                                 Traceback (most recent call last)
File ~\.conda\envs\pymc_env\Lib\site-packages\pytensor\link\vm.py:407, in Loop.__call__(self)
    404 for thunk, node, old_storage in zip_longest(
    405     self.thunks, self.nodes, self.post_thunk_clear, fillvalue=()
    406 ):
--> 407     thunk()
    408     for old_s in old_storage:

File ~\.conda\envs\pymc_env\Lib\site-packages\pytensor\graph\op.py:515, in Op.make_py_thunk.<locals>.rval(p, i, o, n)
    513 @is_thunk_type
    514 def rval(p=p, i=node_input_storage, o=node_output_storage, n=node):
--> 515     r = p(n, [x[0] for x in i], o)
    516     for o in node.outputs:

File ~\.conda\envs\pymc_env\Lib\site-packages\pytensor\tensor\random\op.py:330, in RandomVariable.perform(self, node, inputs, outputs)
    328 rng_var_out[0] = rng
--> 330 smpl_val = self.rng_fn(rng, *([*args, size]))
    332 if (
    333     not isinstance(smpl_val, np.ndarray)
    334     or str(smpl_val.dtype) != out_var.type.dtype
    335 ):

File ~\.conda\envs\pymc_env\Lib\site-packages\pymc_bart\bart.py:57, in BARTRV.rng_fn(cls, rng, X, Y, m, alpha, beta, split_prior, size)
     56     else:
---> 57         return np.full(cls.Y.shape[0], cls.Y.mean())
     58 else:

File ~\.conda\envs\pymc_env\Lib\site-packages\numpy\core\numeric.py:344, in full(shape, fill_value, dtype, order, like)
    343     dtype = fill_value.dtype
--> 344 a = empty(shape, dtype, order)
    345 multiarray.copyto(a, fill_value, casting='unsafe')

TypeError: expected a sequence of integers or a single integer, got 'Subtensor{i}.0'

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
Cell In[4], line 10
      5 μ = pmb.BART("μ", X, np.log(Y))
      7 y = pm.NegativeBinomial("y", mu=pm.math.exp(μ), alpha=α, observed=Y)
---> 10 trace = pm.sample(random_seed=RANDOM_SEED)
     12 posterior_predictive_train = pm.sample_posterior_predictive(trace, random_seed=RANDOM_SEED)
     14 pm.set_data({"X": X_test, "Y": Y_test})

File ~\.conda\envs\pymc_env\Lib\site-packages\pymc\sampling\mcmc.py:681, in sample(draws, tune, chains, cores, random_seed, progressbar, progressbar_theme, step, var_names, nuts_sampler, initvals, init, jitter_max_retries, n_init, trace, discard_tuned_samples, compute_convergence_checks, keep_warning_stat, return_inferencedata, idata_kwargs, nuts_sampler_kwargs, callback, mp_ctx, model, **kwargs)
    678         auto_nuts_init = False
    680 initial_points = None
--> 681 step = assign_step_methods(model, step, methods=pm.STEP_METHODS, step_kwargs=kwargs)
    683 if nuts_sampler != "pymc":
    684     if not isinstance(step, NUTS):

File ~\.conda\envs\pymc_env\Lib\site-packages\pymc\sampling\mcmc.py:234, in assign_step_methods(model, step, methods, step_kwargs)
    226         selected = max(
    227             methods_list,
    228             key=lambda method, var=rv_var, has_gradient=has_gradient: method._competence(  # type: ignore
    229                 var, has_gradient
    230             ),
    231         )
    232         selected_steps.setdefault(selected, []).append(var)
--> 234 return instantiate_steppers(model, steps, selected_steps, step_kwargs)

File ~\.conda\envs\pymc_env\Lib\site-packages\pymc\sampling\mcmc.py:135, in instantiate_steppers(model, steps, selected_steps, step_kwargs)
    133         args = step_kwargs.get(name, {})
    134         used_keys.add(name)
--> 135         step = step_class(vars=vars, model=model, **args)
    136         steps.append(step)
    138 unused_args = set(step_kwargs).difference(used_keys)

File ~\.conda\envs\pymc_env\Lib\site-packages\pymc_bart\pgbart.py:127, in PGBART.__init__(self, vars, num_particles, batch, model)
    119 def __init__(  # noqa: PLR0915
    120     self,
    121     vars=None,  # pylint: disable=redefined-builtin
   (...)
    124     model: Optional[Model] = None,
    125 ):
    126     model = modelcontext(model)
--> 127     initial_values = model.initial_point()
    128     if vars is None:
    129         vars = model.value_vars

File ~\.conda\envs\pymc_env\Lib\site-packages\pymc\model\core.py:1084, in Model.initial_point(self, random_seed)
   1071 """Computes the initial point of the model.
   1072 
   1073 Parameters
   (...)
   1081     Maps names of transformed variables to numeric initial values in the transformed space.
   1082 """
   1083 fn = make_initial_point_fn(model=self, return_transformed=True)
-> 1084 return Point(fn(random_seed), model=self)

File ~\.conda\envs\pymc_env\Lib\site-packages\pymc\initial_point.py:169, in make_initial_point_fn.<locals>.make_seeded_function.<locals>.inner(seed, *args, **kwargs)
    166 @functools.wraps(func)
    167 def inner(seed, *args, **kwargs):
    168     reseed_rngs(rngs, seed)
--> 169     values = func(*args, **kwargs)
    170     return dict(zip(varnames, values))

File ~\.conda\envs\pymc_env\Lib\site-packages\pytensor\compile\function\types.py:970, in Function.__call__(self, *args, **kwargs)
    967 t0_fn = time.perf_counter()
    968 try:
    969     outputs = (
--> 970         self.vm()
    971         if output_subset is None
    972         else self.vm(output_subset=output_subset)
    973     )
    974 except Exception:
    975     restore_defaults()

File ~\.conda\envs\pymc_env\Lib\site-packages\pytensor\link\vm.py:411, in Loop.__call__(self)
    409                 old_s[0] = None
    410     except Exception:
--> 411         raise_with_op(self.fgraph, node, thunk)
    413 return self.perform_updates()

File ~\.conda\envs\pymc_env\Lib\site-packages\pytensor\link\utils.py:523, in raise_with_op(fgraph, node, thunk, exc_info, storage_map)
    518     warnings.warn(
    519         f"{exc_type} error does not allow us to add an extra error message"
    520     )
    521     # Some exception need extra parameter in inputs. So forget the
    522     # extra long error message in that case.
--> 523 raise exc_value.with_traceback(exc_trace)

File ~\.conda\envs\pymc_env\Lib\site-packages\pytensor\link\vm.py:407, in Loop.__call__(self)
    403 try:
    404     for thunk, node, old_storage in zip_longest(
    405         self.thunks, self.nodes, self.post_thunk_clear, fillvalue=()
    406     ):
--> 407         thunk()
    408         for old_s in old_storage:
    409             old_s[0] = None

File ~\.conda\envs\pymc_env\Lib\site-packages\pytensor\graph\op.py:515, in Op.make_py_thunk.<locals>.rval(p, i, o, n)
    513 @is_thunk_type
    514 def rval(p=p, i=node_input_storage, o=node_output_storage, n=node):
--> 515     r = p(n, [x[0] for x in i], o)
    516     for o in node.outputs:
    517         compute_map[o][0] = True

File ~\.conda\envs\pymc_env\Lib\site-packages\pytensor\tensor\random\op.py:330, in RandomVariable.perform(self, node, inputs, outputs)
    326     rng = copy(rng)
    328 rng_var_out[0] = rng
--> 330 smpl_val = self.rng_fn(rng, *([*args, size]))
    332 if (
    333     not isinstance(smpl_val, np.ndarray)
    334     or str(smpl_val.dtype) != out_var.type.dtype
    335 ):
    336     smpl_val = _asarray(smpl_val, dtype=out_var.type.dtype)

File ~\.conda\envs\pymc_env\Lib\site-packages\pymc_bart\bart.py:57, in BARTRV.rng_fn(cls, rng, X, Y, m, alpha, beta, split_prior, size)
     55         return np.full((size[0], cls.Y.shape[0]), cls.Y.mean())
     56     else:
---> 57         return np.full(cls.Y.shape[0], cls.Y.mean())
     58 else:
     59     if size is not None:

File ~\.conda\envs\pymc_env\Lib\site-packages\numpy\core\numeric.py:344, in full(shape, fill_value, dtype, order, like)
    342     fill_value = asarray(fill_value)
    343     dtype = fill_value.dtype
--> 344 a = empty(shape, dtype, order)
    345 multiarray.copyto(a, fill_value, casting='unsafe')
    346 return a

TypeError: expected a sequence of integers or a single integer, got 'Subtensor{i}.0'
Apply node that caused the error: BART_rv{1, (2, 1, 0, 0, 0, 1), floatX, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x1EA465D6180>), [], 11, X, Log.0, 50, 0.95, 2.0, [])
Toposort index: 7
Inputs types: [RandomGeneratorType, TensorType(int64, shape=(0,)), TensorType(int64, shape=()), TensorType(float64, shape=(None, None)), TensorType(float64, shape=(None,)), TensorType(int8, shape=()), TensorType(float64, shape=()), TensorType(float32, shape=()), TensorType(float64, shape=(0,))]
Inputs shapes: ['No shapes', (0,), (), (100, 5), (100,), (), (), (), (0,)]
Inputs strides: ['No strides', (0,), (), (40, 8), (8,), (), (), (), (0,)]
Inputs values: [Generator(PCG64) at 0x1EA465D6180, array([], dtype=int64), array(11, dtype=int64), 'not shown', 'not shown', array(50, dtype=int8), array(0.95), array(2., dtype=float32), array([], dtype=float64)]
Outputs clients: [['output'], [Second(μ, True_div.0)]]

HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

Hi @JJ_Py

The value of “Y” that you pass to pmb.BART, is only used to compute an initialization value for the leaf_nodes (then PyMC-BART will try to find better ones during tuning). It does not need to be Y, but it happens that passing Y or a transformation like log(Y) is a good starting point. So you can do:

with pm.Model() as model:
    X = pm.MutableData("X", X_train)
    Y = pm.MutableData("Y", Y_train)
    α = pm.Exponential("α", 1)
    μ = pmb.BART("μ", X, np.log(Y_train))

Regarding the comparison probably that was just a matter of convenience when writing the example. @juanitorduz?

Notice if you can also do this to get the the comparison you want.

fig, ax = plt.subplots(
    nrows=2, ncols=1, figsize=(8, 7), sharex=True, sharey=True, layout="constrained"
)

az.plot_ppc(
    data=posterior_predictive_oos_regression_train, kind="cumulative", observed=False,  num_pp_samples=100, ax=ax[0]
)
ax[0].ecdf(Y_train, label="observed", color="k")
ax[0].set(title="Posterior Predictive Check (train)", xlim=(0, 1_000))

az.plot_ppc(
    data=posterior_predictive_oos_regression_test, kind="cumulative", observed=False,  num_pp_samples=100, ax=ax[1]
)
ax[1].ecdf(Y_test, label="observed", color="k")
ax[1].set(title="Posterior Predictive Check (test)", xlim=(0, 1_000));
2 Likes

Thank you very much. And I found that: in the training set, the empirical cumulative distribution plots obtained by ax.ecdf() and az.plot_ppc() are almost the same; in the test set, there are small differences.

@JJ_Py Sorry for the late reply. As mentioned by @aloctavodia the Y_train is used for the variance. Note that the observed (likelihood) term just depends on the shape of mu and, therefore, the shape of X_test. To avoid confusion I like to work with. coordinates to make this even clearer. See, for example, this model Cohort Retention Analysis with BART - Dr. Juan Camilo Orduz where I do out-of-sample predictions of a BART model using coordinates.