pm.set_data in 4.2.0

This is about pm.set_data: https://www.pymc.io/projects/docs/en/stable/api/generated/pymc.set_data.html

It seems that in version 4.2.0 we can only assign new data x of the same size as the old data x in the model, whereas in 4.1.7 it was fine to use a different size. I think it is more reasonable to allow setting new data of a different size: if we are doing a train/test split, a common proportion is 80%/20%, so x_train and x_test will have different sizes in most cases.

Can you provide a minimal example to reproduce the problem?

This is what I get with 4.2:

import pymc as pm

with pm.Model() as model:
    y = pm.MutableData('y', [1., 2., 3.])  # observed data stored as a mutable data container
    beta = pm.Normal('beta', 0, 1)
    obs = pm.Normal('obs', beta, 1, observed=y)
    idata = pm.sample(1000, tune=1000)
    
with model:
    pm.set_data({'y': [1, 2, 3, 4]})  # new data with a different length (4 instead of 3)
    y_test = pm.sample_posterior_predictive(idata)

y_test.posterior_predictive['obs'].mean(('chain', 'draw'))

#<xarray.DataArray 'obs' (obs_dim_0: 4)>
#array([1.5213813 , 1.50321493, 1.51028904, 1.52245995])
#Coordinates:
#  * obs_dim_0  (obs_dim_0) int64 0 1 2 3
Here is my code. I'm splitting the data with scikit-learn:

from sklearn.model_selection import train_test_split

X = df[['first_feature', 'second_feature']]  # the two feature columns (df has 200 rows in total)
y = df['indicator']                          # binary target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with pm.Model() as logistic_model_pred:
    beta_0 = pm.Uniform('beta_0', -100, 100)
    beta_1 = pm.Normal('beta_1', -0.5, 1)
    beta_2 = pm.Normal('beta_2', 2, 1)
    # mutable data containers so the features can be swapped out at prediction time
    first_feature = pm.Data("first_feature", value=X_train['first_feature'], mutable=True)
    second_feature = pm.Data("second_feature", value=X_train['second_feature'], mutable=True)
    observed = pm.Bernoulli(
        "indicator",
        pm.math.sigmoid(beta_0 + beta_1 * first_feature + beta_2 * second_feature),
        observed=y_train,
    )
    step = pm.Metropolis()
    pred_trace = pm.sample(random_seed=[1, 10, 100, 1000], step=step, init='auto')
    
with logistic_model_pred:
    pm.set_data({'first_feature': X_test['first_feature']})
    pm.set_data({'second_feature': X_test['second_feature']})
    ppc = pm.sample_posterior_predictive(trace=pred_trace)

y_score = ppc['posterior_predictive']['indicator'].mean(('chain', 'draw'))
print(y_score)

To provide more details: I have two features and one binary target variable, and there are 200 observations in total in the df. I did an 80% training / 20% testing split. But I got an error like this:

ValueError: size does not match the broadcast shape of the parameters. (160,), (160,), (40,)

My guess is that the error is due to the size difference between the training and testing sets: my training set has 160 observations and my testing set has 40, and the likelihood was built with observed=y_train, which still has length 160 after the features are swapped.

(I'm not working on my local computer; this is running in an online environment. Not sure if this info helps.)
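
A quick check along those lines (a sketch, run in the same session after the failing block above; first_feature and second_feature are the mutable data variables from the model):

# the shared data containers were resized to the 40-row test set by set_data ...
print(first_feature.get_value().shape)   # (40,)
print(second_feature.get_value().shape)  # (40,)
# ... but the likelihood was built with observed=y_train, whose length is fixed
print(len(y_train))                      # 160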

If you make the observed data a mutable data container as well, so its length can be updated along with the features:

    obs_data = pm.Data("obs_data", value=y_train, mutable=True)
    observed = pm.Bernoulli(
        "indicator",
        pm.math.sigmoid(beta_0 + beta_1 * first_feature + beta_2 * second_feature),
        observed=obs_data,
    )

and then swap the observed data:

with logistic_model_pred:
    pm.set_data({'first_feature': X_test['first_feature']})
    pm.set_data({'second_feature': X_test['second_feature']})
    pm.set_data({'obs_data': y_test})
    ppc = pm.sample_posterior_predictive(trace=pred_trace)

I think it should work?
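
Putting it together, the prediction block would look roughly like this (a sketch assuming the mutable obs_data version of the model above; the values in obs_data shouldn't affect the predictive draws here, only its new length does):

with logistic_model_pred:
    pm.set_data({
        'first_feature': X_test['first_feature'],
        'second_feature': X_test['second_feature'],
        'obs_data': y_test,   # only the length (40) matters for the predictive draws
    })
    ppc = pm.sample_posterior_predictive(trace=pred_trace)

# per-observation predicted probability of indicator == 1 on the test set
y_score = ppc.posterior_predictive['indicator'].mean(('chain', 'draw'))
print(y_score.shape)   # (40,), matching y_test, instead of the broadcast error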