How does PyMC internally treat index variables?

I am a bit confused about how PyMC treats index variables, specifically when trying to produce OOS predictions. For example, it seems like PyMC will happily generate OOS predictions even when an index value was never present in the in-sample fit. Below is a runnable example that shows what I mean.

from typing import Tuple

import pymc as pm
import polars as pl
import numpy as np

def factorize(series: pl.Series) -> Tuple[np.ndarray, np.ndarray]:
    """
    Factorize a polars series into integer codes and unique values.
    """
    name = series.name
    df = series.to_frame()
    df = df.fill_null("<NA>")
    df_ranked = df.unique().sort(name).with_row_index(name=f"{name}_index")
    uniques = df_ranked[name].to_numpy()
    codes = df.join(df_ranked, how="left", on=name)[f"{name}_index"].to_numpy()
    return codes, uniques

# Simulate simple data
X = np.array(["a"] * 20 + ["b"] * 20 + ["c"] * 20)
y_prob = {"a": 0.2, "b": 0.8, "c": 0.5}
y = [np.random.binomial(n=1, p=y_prob[p]) for p in X]

# Split into train + test making sure group c is only in test
X_train = pl.DataFrame(X[:40], schema=["group"])
y_train = y[:40]
X_test = pl.DataFrame(X[40:], schema=["group"])
y_test = y[40:]

# factorize categorical variable
train_group_idx, train_group = factorize(X_train["group"])
train_n_obs = np.arange(X_train.shape[0])

# define coords
coords = {
    "n_obs": train_n_obs,
    "group": train_group
}

# define simple model
with pm.Model(coords=coords) as model:
    group_data = pm.Data("group_data", train_group_idx, dims="n_obs")

    beta_group = pm.Normal("beta_group", 0, 1, dims="group")

    logit_p = pm.Deterministic("logit_p", beta_group[group_data], dims="n_obs")

    pm.Bernoulli("likelihood", logit_p=logit_p, observed=y_train, dims="n_obs")

# sample
with model:
    idata = pm.sample()

# Factorize test data group variable
test_group_idx, test_group = factorize(X_test["group"])
test_n_obs = np.arange(X_test.shape[0])

# HERE IS THE ISSUE. Group "c" was not in the training data
# Compute predictions
with model:
    pm.set_data(
        new_data={"group_data": test_group_idx},
        coords={"n_obs": test_n_obs, "group": test_group}
    )
    predictions = pm.sample_posterior_predictive(idata, predictions=True)

Shouldn’t this example raise an error when computing the OOS predictions? Am I missing something here? I tried looking for documentation on how to handle index variables in OOS predictions but couldn’t find anything.

You are calling your factorize function on the out-of-sample data, which assigns new, inconsistent indices to the groups. A stateful transformer (for example sklearn’s OrdinalEncoder), or even a simple dictionary-based mapping built from the training data, will guard against errors like this.
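As a minimal sketch of the dictionary-based approach (the data here is made up for illustration): the mapping is built once from the training data and reused on the test data, so an unseen group fails loudly instead of silently receiving a fresh index.

```python
import numpy as np

train_groups = np.array(["a", "a", "b", "b"])
test_groups = np.array(["a", "b", "c"])  # "c" was never seen in training

# Build the code mapping from the training data only
mapping = {g: i for i, g in enumerate(np.unique(train_groups))}

train_idx = np.array([mapping[g] for g in train_groups])

# Re-applying the same mapping to test data raises KeyError for "c",
# instead of quietly re-numbering the groups like a second factorize call would
try:
    test_idx = np.array([mapping[g] for g in test_groups])
except KeyError as e:
    print(f"Unseen group: {e}")
```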

If you assign a new index value to group “c” by hand, you will get an error as expected:

n_test = X_test.shape[0]
k_train_groups = train_group_idx.max()

with model:
    pm.set_data(
        new_data={"group_data": np.full((n_test,), k_train_groups + 1, dtype=int)},
        coords={"n_obs": np.arange(n_test), "group": ["c"]}
    )
    predictions = pm.sample_posterior_predictive(idata, predictions=True)

Thank you, Jesse! That makes a lot of sense. I wasn’t sure how PyMC was handling the indexing internally, and I wasn’t sure whether I would end up with an index-out-of-range error if I used an ordinal encoder in cases where I had just a handful of test samples, let’s say 10, but an encoded value of 100, so that I would be trying to index a 10-sample array at index 100.

Yes, you will get an error in that case (and you should). Here’s an article about sampling groups you’ve never seen before, if that’s something you’re worried about. Otherwise you can just do stratified sampling in your train-test split to ensure all groups are seen at training time (if the set of groups is closed, for example).

Edit:

To answer your question “how does PyMC handle indexing” super directly:

  1. It doesn’t. PyTensor handles all computation in a PyMC model.
  2. PyTensor follows exactly the same rules that numpy does. So if it works in numpy it works, and if it errors in numpy it errors.
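The numpy rule in point 2 can be checked directly (the values here are just illustrative, standing in for the two group coefficients fit on groups "a" and "b"):

```python
import numpy as np

beta = np.array([0.1, 0.5])  # two "group" parameters, as after fitting on a, b

# In-range fancy indexing works, and repeats are fine
print(beta[[0, 1, 0]])

# Out-of-range indices raise IndexError, exactly as in the PyMC example above
try:
    beta[[0, 2]]  # index 2 is out of bounds for a length-2 array
except IndexError:
    print("IndexError, just like when predicting on an unseen group")
```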

Thank you, that resource is very helpful!

I think I understand now. I keep forgetting that I am not actually indexing the data; it is the parameters that I am indexing. So even if I have 100 groups and only 10 test samples, I should still have 100 parameters, one for each group level.

Right, the model will have 100 parameters (one for each group), but you will only actually use a subset of them when you predict on new data.
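A quick numpy sketch of that point (the sizes and values here are made up for illustration): the parameter vector keeps all 100 entries, and indexing just selects the entries the test observations need.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_group = rng.normal(size=100)          # one parameter per group level
test_group_idx = rng.integers(0, 100, 10)  # group index for each of 10 test rows

# Indexing the parameter vector yields one value per test observation,
# touching only a subset of the 100 parameters
logit_p = beta_group[test_group_idx]
print(logit_p.shape)  # (10,)
```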
