How does PyMC internally treat index variables?

I am a bit confused about how PyMC treats index variables, specifically when trying to produce OOS predictions. For example, it seems like PyMC will happily generate OOS predictions even when an index value was never present in the in-sample fit. Below is a runnable example that shows what I mean.

from typing import Tuple

import pymc as pm
import polars as pl
import numpy as np

def factorize(series: pl.Series) -> Tuple[np.ndarray, np.ndarray]:
    """
    Factorize a polars series into integer codes and unique values.
    """
    name = series.name
    df = series.to_frame()
    df = df.fill_null("<NA>")
    df_ranked = df.unique().sort(name).with_row_index(name=f"{name}_index")
    uniques = df_ranked[name].to_numpy()
    codes = df.join(df_ranked, how="left", on=name)[f"{name}_index"].to_numpy()
    return codes, uniques

# Simulate simple data
X = np.array(["a"] * 20 + ["b"] * 20 + ["c"] * 20)
y_prob = {"a": 0.2, "b": 0.8, "c": 0.5}
y = [np.random.binomial(n=1, p=y_prob[p]) for p in X]

# Split into train + test making sure group c is only in test
X_train = pl.DataFrame(X[:40], schema=["group"])
y_train = y[:40]
X_test = pl.DataFrame(X[40:], schema=["group"])
y_test = y[40:]

# factorize categorical variable
train_group_idx, train_group = factorize(X_train["group"])
train_n_obs = np.arange(X_train.shape[0])

# define coords
coords = {
    "n_obs": train_n_obs,
    "group": train_group
}

# define simple model
with pm.Model(coords=coords) as model:
    group_data = pm.Data("group_data", train_group_idx, dims="n_obs")

    beta_group = pm.Normal("beta_group", 0, 1, dims="group")

    logit_p = pm.Deterministic("logit_p", beta_group[group_data], dims="n_obs")

    pm.Bernoulli("likelihood", logit_p=logit_p, observed=y_train, dims="n_obs")

# sample
with model:
    idata = pm.sample()

# Factorize test data group variable
test_group_idx, test_group = factorize(X_test["group"])
test_n_obs = np.arange(X_test.shape[0])

# HERE IS THE ISSUE. Group "c" was not in the training data
# Compute predictions
with model:
    pm.set_data(
        new_data={"group_data": test_group_idx},
        coords={"n_obs": test_n_obs, "group": test_group}
    )
    predictions = pm.sample_posterior_predictive(idata, predictions=True)

Shouldn’t this example raise an error when computing the OOS predictions? Am I missing something here? I tried looking for documentation on how to handle index variables in OOS predictions but couldn’t find anything.

You are calling your factorize function on the out-of-sample data, which assigns new, inconsistent indices to the groups. A stateful transformer (for example sklearn’s OrdinalEncoder), or even a simple dictionary-based mapping built from the training data, will guard against errors like this.
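As a minimal sketch of the dictionary-based approach (the data here is made up for illustration): the mapping is built once from the training data and reused on the test data, so an unseen group fails loudly instead of silently receiving a fresh index.

```python
import numpy as np

train_groups = np.array(["a", "a", "b", "b"])
test_groups = np.array(["a", "b", "c"])  # "c" was never seen in training

# Build the code mapping from the training data only
mapping = {g: i for i, g in enumerate(np.unique(train_groups))}

train_idx = np.array([mapping[g] for g in train_groups])

# Re-applying the same mapping to test data raises KeyError for "c",
# instead of quietly re-numbering the groups like a second factorize call would
try:
    test_idx = np.array([mapping[g] for g in test_groups])
except KeyError as e:
    print(f"Unseen group: {e}")
```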

If you assign a new index value to group “c” by hand, you will get an error as expected:

n_test = X_test.shape[0]
k_train_groups = train_group_idx.max()

with model:
    pm.set_data(
        new_data={"group_data": np.full((n_test,), k_train_groups + 1, dtype=int)},
        coords={"n_obs": np.arange(n_test), "group": ["c"]}
    )
    predictions = pm.sample_posterior_predictive(idata, predictions=True)

Thank you, Jesse! That makes a lot of sense. I wasn’t sure how PyMC was handling the indexing internally, and I wasn’t sure whether I would end up with an index-out-of-range error if I used an ordinal encoder in cases where I had just a handful of test samples, let’s say 10, but an encoded value of 100, so that I would be trying to index a 10-sample array at index 100.

Yes, you will get an error in that case (and you should). Here’s an article about sampling groups you’ve never seen before, if that’s something you’re worried about. Otherwise you can just do stratified sampling in your train-test split to ensure all groups are seen at training time (if the set of groups is closed, for example).

Edit:

To answer your question “how does PyMC handle indexing” super directly:

  1. It doesn’t. PyTensor handles all computation in a PyMC model.
  2. PyTensor follows exactly the same rules that numpy does. So if it works in numpy it works, and if it errors in numpy it errors.
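The numpy rule in point 2 can be checked directly (the values here are just illustrative, standing in for the two group coefficients fit on groups "a" and "b"):

```python
import numpy as np

beta = np.array([0.1, 0.5])  # two "group" parameters, as after fitting on a, b

# In-range fancy indexing works, and repeats are fine
print(beta[[0, 1, 0]])

# Out-of-range indices raise IndexError, exactly as in the PyMC example above
try:
    beta[[0, 2]]  # index 2 is out of bounds for a length-2 array
except IndexError:
    print("IndexError, just like when predicting on an unseen group")
```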

Thank you, that resource is very helpful!

I think I understand now. I keep forgetting that I am not actually indexing the data; it is the parameters that I am indexing. So even if I have 100 groups and only 10 test samples, I should still have 100 parameters, one for each group level.

Right, the model will have 100 parameters (one for each group), but you will only actually use a subset of them when you predict on new data.
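A quick numpy sketch of that point (the sizes and values here are made up for illustration): the parameter vector keeps all 100 entries, and indexing just selects the entries the test observations need.

```python
import numpy as np

rng = np.random.default_rng(0)

beta_group = rng.normal(size=100)          # one parameter per group level
test_group_idx = rng.integers(0, 100, 10)  # group index for each of 10 test rows

# Indexing the parameter vector yields one value per test observation,
# touching only a subset of the 100 parameters
logit_p = beta_group[test_group_idx]
print(logit_p.shape)  # (10,)
```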
