Thanks a lot in advance for taking the time to respond to this long and possibly daft question.
Suppose I have a hierarchical model that predicts price of some items in different countries. How can I get prediction of the price of a hold-out item for different countries?
Here are some details. I first create the right indexing for the two stages of the hierarchy as follows.
country_index = df_copy.groupby('Country').all().reset_index().reset_index()[['index', 'Country']]
country_category_index = df_copy.groupby(['Country', 'Product Category']).all().reset_index().reset_index()[['index', 'Country', 'Product Category']]
country_category_indexes_df = pd.merge(country_index, country_category_index, how='inner', on='Country', suffixes=('_Co', '_CoCa'))
indexed_df = pd.merge(df_copy, country_category_indexes_df, how='inner', on=['Country', 'Product Category'])
And prepare the indices in the following way:
country_indexes = country_index['index'].values
country_count = len(country_indexes)
country_category_indexes = country_category_indexes_df['index_Co'].values
country_category_count = len(country_category_indexes)
Then make the shared variables
x_1_shared = theano.shared(indexed_df['spec_1'].values)
x_2_shared = theano.shared(indexed_df['spec_2'].values)
y_shared = theano.shared(indexed_df['label'].values)
My model would look like this:
with pm.Model() as model:
global_m = pm.Normal('global_m', mu=0, sd=100)
global_m_sd = pm.HalfNormal('global_m_sd', 5.)
global_b = pm.Normal('global_b', mu=0, sd=100)
global_b_sd = pm.HalfNormal('global_b_sd', 5.)
global_c = pm.Normal('global_c', mu=0, sd=100)
global_c_sd = pm.HalfNormal('global_c_sd', 5.)
country_m = pm.Normal('country_m', mu=global_m, sd=global_m_sd, shape=country_count)
country_m_sd = pm.HalfNormal('country_m_sd', 5.,shape=country_count)
country_b = pm.Normal('country_b', mu=global_b, sd=global_b_sd, shape=country_count)
country_b_sd = pm.HalfNormal('country_b_sd', 5.,shape=country_count)
country_c = pm.Normal('country_c', mu=global_c, sd=global_c_sd, shape=country_count)
country_c_sd = pm.HalfNormal('country_c_sd', 5.,shape=country_count)
country_category_m = pm.Normal('country_category_m', mu=country_m[country_category_indexes], sd=country_m_sd[country_category_indexes], shape=country_category_count)
country_category_b = pm.Normal('country_category_b', mu=country_b[country_category_indexes], sd=country_b_sd[country_category_indexes], shape=country_category_count)
country_category_c = pm.Normal('country_category_c', mu=country_c[country_category_indexes], sd=country_c_sd[country_category_indexes], shape=country_category_count)
error = pm.HalfCauchy('error', 5.)
y_prediction = country_category_m[indexed_df['index_CoCa'].values] * x_1_shared + country_category_c[indexed_df['index_CoCa'].values] * x_2_shared + country_category_b[indexed_df['index_CoCa'].values]
y_like = pm.Normal('y_like', mu=y_prediction,
sigma=error, observed=y_shared)
Then I sample, with some pre-set parameters, SAMPLE_KWARGS_2:
with model:
model_trace_country_category = pm.sample(**SAMPLE_KWARGS_2)
I could then get predictions on some new data to generate the price curve, relative to spec_1, for a fixed value of spec_2=135, in the following way
X_1 = np.linspace(np.max(indexed_df['spec_1'].values), np.min(indexed_df['spec_1'].values), len(indexed_df)).reshape(-1,1)[:,0]
X_2 = np.array([135]*len(indexed_df)).reshape(-1,1)[:,0]
Y = np.array([0]*len(indexed_df)).reshape(-1,1)[:,0]
x_1_shared.set_value(X_1)
x_2_shared.set_value(X_2)
y_shared.set_value(Y)
with model:
post_preds = pm.sample_posterior_predictive(model_trace_country_category, model= model,random_seed=123)['y_like']
predictions = post_preds.mean(0)
I can see that there is something about it I don’t understand. How can I now make the sample_posterior_predictive predict prices for different countries separately?