Data containers, hierarchical models and minibatches

Hi everyone,

I’m having an issue getting a hierarchical model to work with pm.Data containers. I’m using data containers because, per the model documentation (https://docs.pymc.io/api/model.html#pymc3.model.set_data), if you want to predict on new data you need to use pm.set_data() to switch the shared variable in the model from the training data to the new data. I have seen examples of this for ordinary, non-hierarchical models, but not for hierarchical ones. pm.set_data() works with pickled models too, so you can save a model using pickle, load it back up, and use it to predict on new data.

The issue I am having is that the line logit_p_formula = constant[group_identifier_shared] in the code below throws a TypeError: index must be integers. This is because group_identifier_shared has been cast to float, relating to this issue (https://github.com/pymc-devs/pymc3/pull/3611). To note, the model works fine if x_train_shared, y_train_shared and group_identifier_shared are set as plain shared variables, e.g. x_train_shared = shared(x_train.values), but then you can't save the model, load it up, and make predictions on new data.

## define the number of groups
num_groups = group_identifier.nunique()

with pm.Model() as model:
    ## turn x/y_train and group identifier into a data container, for model definition and predictions 
    x_train_shared = pm.Data('x', x_train.values)
    y_train_shared = pm.Data('y', y_train.values)
    group_identifier_shared = pm.Data('group_identifier', group_identifier.values)
    
    ## model hyperparameters
    constant_mu        = pm.Normal('constant_mu', mu=0, sigma=2.5)
    constant_sigma     = pm.HalfCauchy('constant_sigma', beta=2.5)

    coefficients_mu    = pm.Normal('coefficients_mu', mu=0, sigma=2.5, shape=x_train.shape[1])
    coefficients_sigma = pm.HalfCauchy('coefficients_sigma', beta=5, shape=x_train.shape[1])

    ## group-level parameters
    coefficients = pm.Normal('coefficients', mu=coefficients_mu, sigma=coefficients_sigma, shape=(num_groups, x_train.shape[1]))
    constant     = pm.Normal('constant', mu=constant_mu, sigma=constant_sigma, shape=num_groups)

    ## define logit_p_formula
    logit_p_formula = constant[group_identifier_shared]

    for i, col in enumerate(x_train.columns):
        logit_p_formula = logit_p_formula + (coefficients[group_identifier_shared, i] * x_train_shared[:, i])

    outcome     = pm.Bernoulli('outcome',
                                logit_p     = logit_p_formula, 
                                observed    = y_train_shared)

I have a few questions:

  1. If the issue with pm.Data containers being cast to float by default were solved, i.e. the data container kept the dtype of the data given to it, would the hierarchical model work?
  2. Am I doing something wrong?
  3. Do data containers and minibatches work together like they do with plain shared theano variables?

Thanks for any help you can provide.

Hi,
This is indeed the subject of an ongoing PR. In the meantime, you can do pm.intX(pm.Data('group_identifier', group_identifier.values)). This will cast the shared variable back to int, allowing you to use it as an index variable.
At first glance, I don’t see any obvious issue with the model (though your priors may be too wide), so the best thing would be to try it out.
Hope this helps :vulcan_salute:

Hi Alex, I’ve just tested that and it works, thanks. I managed to train the model, save it, reload it, and predict on new data with it. Thanks a lot.
