Out of sample prediction with new category

Hi,

New to PyMC here.

I’ve been trying to follow A Primer on Bayesian Methods for Multilevel Modeling — PyMC3 3.11.5 documentation

I see that to predict on test data, I need to do:

prediction_coords = {"obs_id": ["ST LOUIS", "KANABEC"]}
with contextual_effect:
    pm.set_data({"county_idx": np.array([69, 31]), "floor_idx": np.array([1, 1])})
    stl_pred = pm.fast_sample_posterior_predictive(
        contextual_effect_idata.posterior, random_seed=RANDOM_SEED
    )
    az.from_pymc3_predictions(
        stl_pred, idata_orig=contextual_effect_idata, inplace=True, coords=prediction_coords
    )

However, how do I do a prediction if my county_idx is new and never seen before in the training data?

I tried to pass a new county_idx and got the following error:

IndexError: index 10 is out of bounds for axis 0 with size 10
Apply node that caused the error: AdvancedSubtensor1(intercept_county, county_id_index)
Toposort index: 0
Inputs types: [TensorType(float64, (None,)), TensorType(int32, (None,))]
Inputs shapes: [(10,), (10,)]
Inputs strides: [(8,), (4,)]
Inputs values: ['not shown', 'not shown']
Outputs clients: [[normal_rv{0, (0, 0), floatX, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x15C2F2F20>), TensorConstant{[]}, TensorConstant{11}, AdvancedSubtensor1.0, sigma)]]

HINT: Use the Aesara flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

I saw How do we predict on new unseen groups in a hierarchical model in PyMC3? - #2 by lucianopaz, but I'm not sure how to apply that to this simple model:

with pm.Model(coords=coords) as varying_intercept:
    floor_idx = pm.Data("floor_idx", floor, dims="obs_id")
    county_idx = pm.Data("county_idx", county, dims="obs_id")
    # Hyperpriors:
    a = pm.Normal("a", mu=0.0, sigma=10.0)
    sigma_a = pm.Exponential("sigma_a", 1.0)

    # Varying intercepts:
    a_county = pm.Normal("a_county", mu=a, sigma=sigma_a, dims="County")
    # Common slope:
    b = pm.Normal("b", mu=0.0, sigma=10.0)

    # Expected value per county:
    theta = a_county[county_idx] + b * floor_idx
    # Model error:
    sigma = pm.Exponential("sigma", 1.0)

    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")
pm.model_to_graphviz(varying_intercept)
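For context, here is a minimal sketch of the `coords`, `county`, and `floor` objects the model above assumes (the county names and values are illustrative, not the radon data):

```python
import numpy as np

# Hypothetical stand-ins for the radon dataset's grouping structure
county_names = ["AITKIN", "ANOKA", "BECKER"]   # one entry per county
county = np.array([0, 0, 1, 2, 2])             # county index per observation
floor = np.array([1, 0, 0, 1, 0])              # floor indicator per observation

coords = {
    "County": county_names,                    # dimension for a_county
    "obs_id": np.arange(len(county)),          # dimension for the observations
}
```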

One idea I had is to train with an additional fake index (which won’t show up in observations) and assign any new counties to that index. That way it should still get the benefit of the partial pooling and have a prediction (albeit highly uncertain).

Any help is appreciated!

Your last idea is basically correct. You need to add an “other” category to your training data, and use that to make predictions for any categories not in the training data. You don’t want it to just be an empty “fake index”, though, otherwise doing inference with it is pointless.
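One way to sketch that remapping in plain numpy (the county names and the choice of which counties to pool into "other" are hypothetical; in practice you would assign your rare counties to it so the category has real observations behind it):

```python
import numpy as np

# Observed county label for each training row (illustrative)
counties = np.array(["AITKIN", "ANOKA", "BECKER", "AITKIN", "RAREVILLE"])
known = ["AITKIN", "ANOKA", "BECKER"]  # counties that keep their own group

# Indices 0..len(known)-1 for known counties; one extra index for "other"
other_idx = len(known)
lookup = {name: i for i, name in enumerate(known)}
county_idx = np.array([lookup.get(name, other_idx) for name in counties])
# RAREVILLE falls into the shared "other" bucket -> index 3
```

At prediction time, any county name absent from `lookup` gets mapped to `other_idx` the same way, so it stays within the bounds of `a_county`.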

Another option is to set all the county offsets to zero, which amounts to using the hyper-mean (a in the example code) to say something about an unknown county.


Thanks for the confirmation.

What do you mean by "You don't want it to just be an empty 'fake index', though, otherwise doing inference with it is pointless"?

As for using the hyper-mean: do you know of any example code for that?

I mean that there needs to be data assigned to the “other” category when you estimate the model.

For setting all the offsets to zero: what I called the "hyper-mean" (the mean of the hierarchical prior distribution) is a in the model from the OP. b has no hierarchical structure, so we can use it directly. If you had more covariates, you would just use the hierarchical prior mean for all of them.

(To see why this is equivalent to setting all the offsets to zero, consider the non-centered parameterization, a_county = a + sigma_a * offset_a)
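A quick numpy check of that equivalence, using arbitrary illustrative values for the hyper-parameters (nothing here comes from the fitted model):

```python
import numpy as np

rng = np.random.default_rng(0)
a, sigma_a = 1.5, 0.7           # illustrative hyper-mean and hyper-scale
offset_a = rng.normal(size=5)   # standard-normal offsets, one per county

# Non-centered parameterization: each county's intercept is the
# hyper-mean plus a scaled standard-normal offset
a_county = a + sigma_a * offset_a

# With the offset set to zero, the county intercept is exactly the hyper-mean
a_unknown = a + sigma_a * 0.0
```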

So to make predictions using only this mean value, use the posterior samples from your trace:

posterior = az.extract_dataset(trace)
theta_mean = posterior['a'] + posterior['b'] * oos_floor_idx[:, None]
predictions = np.random.normal(loc=theta_mean, scale=posterior['sigma'])

Where oos_floor_idx are the floor_idx values for the data you want to estimate, with shape (n_to_estimate,). posterior['a'] and posterior['b'] should both have shape (n_samples,), so I added a dimension to oos_floor_idx so the whole expression would broadcast to (n_to_estimate, n_samples).
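A small shape check of that broadcasting, with random stand-ins for the posterior samples (the sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_to_estimate = 4000, 2
a = rng.normal(size=n_samples)        # stand-in for posterior['a']
b = rng.normal(size=n_samples)        # stand-in for posterior['b']
oos_floor_idx = np.array([1, 0])      # floor indicator for two new points

# (2, 1) broadcasts against (4000,) to give (2, 4000):
theta_mean = a + b * oos_floor_idx[:, None]
```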


Another way you can try is adding new nodes to the existing model you have already run inference on, then drawing posterior predictive samples from the new variables. In this case:

with varying_intercept:
    # Draw a fresh county intercept from the hierarchical prior
    a_county_new = pm.Normal("a_county_new", mu=a, sigma=sigma_a)
    theta_new = a_county_new + b * floor_idx_new

    y_new = pm.Normal("y_new", theta_new, sigma=sigma)
    ppc = pm.sample_posterior_predictive(trace, var_names=["y_new"])

ppc.posterior_predictive['y_new']

Thanks, I successfully used the hyper-mean as suggested. @junpenglao's suggestion is also interesting and may save me a bunch of code. I might try that next. Thanks all!