Out-of-sample prediction issue

The following code is adapted from the BART bikes example; I changed it into a classifier. It reports a shape mismatch, and it seems that pm.set_data with X_test has no effect.

from pathlib import Path
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
import pymc_bart as pmb
from sklearn.model_selection import train_test_split
bikes = pd.read_csv(pm.get_data("bikes.csv"))
features = ["hour", "temperature", "humidity", "workingday"]
X = bikes[features]
Y = bikes["count"]
Y2 = Y.apply(lambda x:1 if x>180 else 0)
RANDOM_SEED=100
X_train, X_test, Y_train, Y_test = train_test_split(X, Y2, test_size=0.2, random_state=RANDOM_SEED)
with pm.Model() as model_oos_regression:
    X1 = pm.MutableData("X", X_train.values)
    Y1 = Y_train.values.flatten()
    #α = pm.Exponential("α", 1)
    μ = pmb.BART("μ", X1, Y1)
    #y = pm.NegativeBinomial("y", mu=pm.math.exp(μ), alpha=α, observed=Y, shape=μ.shape)
    #y = pm.Deterministic("y", pm.invlogit(μ))
    pm.Bernoulli("y",observed=Y1,p=pm.Deterministic("p1", pm.invlogit(μ)))
    idata = pm.sample(random_seed=RANDOM_SEED)
    #idata_oos_regression = pm.fit(method=pm.ADVI()).sample()

    # predict out of sample
    pm.set_data({"X":X_test.values})
    # posterior_predictive_oos_regression_test = pm.sample_posterior_predictive(
    #     trace=idata_oos_regression, random_seed=RANDOM_SEED,
    #     var_names=['y'],
    #     return_inferencedata=True,
    #     predictions=True
    # )
    idata.extend(pm.sample_posterior_predictive(idata))
    #pred = posterior_predictive_oos_regression_test.predictions
    yHat = idata.posterior_predictive['y'].mean(("chain", "draw")).to_numpy()
    print(f"yHat-len={len(yHat)},X_test-len={len(X_test)}")
    assert len(yHat)==len(X_test)

You have to specify how the shape of y depends on its parameters. It’s illustrated in the examples here: pymc.set_data — PyMC 5.5.0 documentation

Otherwise, you need to provide dummy values for y with the correct shape.

Thank you, Ricardo, but it is still not clear to me.
pm.Bernoulli("y", observed=Y1, p=pm.Deterministic("p1", pm.invlogit(μ)), shape=Y1.shape)
This is what I changed, and it still fails.
In the set_data documentation, x and y have the same shape, but that is not my case.

The shape should depend on μ somehow? Otherwise, shape=Y1.shape is the default anyway. If the observations are your only source of shape information, you will need to provide dummy values when doing posterior predictive sampling, to force the right shape.

Yes, μ.shape works! Thank you, Ricardo.
Another question:
pm.Bernoulli("y", observed=Y1, p=pm.Deterministic("p1", pm.invlogit(μ)), shape=μ.shape)
For prediction, which variable should I predict, "y" or "p1"?
My understanding is that p1 is invlogit(μ), i.e. the sigmoid value, which should correspond to a classifier's predict_proba. But what is "y"?

y is a Bernoulli draw from p1. You can predict whichever is more useful for you (or both).
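
To make the difference concrete, here is a minimal sketch in plain Python (no PyMC involved). It assumes made-up posterior draws of μ for three test rows; mu_draws, p1_draws, y_draws and the chosen means are hypothetical illustrations, not output of the model above. It shows that p1 holds probabilities, y holds 0/1 draws from those probabilities, and that averaging y over draws approximates averaging p1.

```python
import math
import random


def invlogit(x):
    """Logistic sigmoid, the transform that pm.invlogit applies to μ."""
    return 1.0 / (1.0 + math.exp(-x))


rng = random.Random(100)

# Hypothetical posterior draws of μ: 2000 draws (rows) for 3 test rows (columns).
mu_draws = [[rng.gauss(m, 0.3) for m in (-2.0, 0.5, 2.0)] for _ in range(2000)]

# "p1": one probability per draw and per test row.
p1_draws = [[invlogit(m) for m in row] for row in mu_draws]

# "y": one Bernoulli (0/1) draw for each of those probabilities.
y_draws = [[1 if rng.random() < p else 0 for p in row] for row in p1_draws]

# Average over draws, column by column (one value per test row).
n_draws = len(mu_draws)
p1_mean = [sum(col) / n_draws for col in zip(*p1_draws)]  # like predict_proba
y_mean = [sum(col) / n_draws for col in zip(*y_draws)]    # noisier estimate of the same quantity

# A hard class label, using an assumed 0.5 threshold on the mean probability.
y_label = [int(p > 0.5) for p in p1_mean]

print(p1_mean, y_mean, y_label)
```

So the mean of y over chains and draws (as computed for yHat in the original code) estimates the same probability as the mean of p1, just with extra sampling noise. To get p1 for the test rows as well, it should be enough to request it explicitly, e.g. var_names=["p1", "y"] in pm.sample_posterior_predictive, though I have not run that against this exact model.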

Linking to the new entry in the FAQ for future readers: Frequently Asked Questions - #18 by ricardoV94