import pymc as pm
from sklearn.model_selection import train_test_split

X = df[['first_feature', 'second_feature']]  # assumed: the two feature columns
y = df['indicator']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
with pm.Model() as logistic_model_pred:
    beta_0 = pm.Uniform('beta_0', -100, 100)
    beta_1 = pm.Normal('beta_1', -0.5, 1)
    beta_2 = pm.Normal('beta_2', 2, 1)
    first_feature = pm.Data('first_feature', value=X_train['first_feature'], mutable=True)
    second_feature = pm.Data('second_feature', value=X_train['second_feature'], mutable=True)
    observed = pm.Bernoulli('indicator', pm.math.sigmoid(beta_0 + beta_1 * first_feature + beta_2 * second_feature), observed=y_train)
    step = pm.Metropolis()
    pred_trace = pm.sample(random_seed=[1, 10, 100, 1000], step=step, init='auto')
with logistic_model_pred:
    pm.set_data({'first_feature': X_test['first_feature']})
    pm.set_data({'second_feature': X_test['second_feature']})
    ppc = pm.sample_posterior_predictive(trace=pred_trace)
y_score = ppc['posterior_predictive']['indicator'].mean(('chain', 'draw'))
print(y_score)
To provide more details: I have two features and one binary target variable, and there are 200 observations in df in total. I used an 80% training / 20% testing split. But I got this error:
ValueError: size does not match the broadcast shape of the parameters. (160,), (160,), (40,)
My guess is that the error comes from the size difference between the training and testing sets: in this case the training set has 160 observations and the testing set has 40.
(I'm not working on my local computer; this ran in an online environment. Not sure if that information helps.)