Data containers, hierarchical models and minibatches

Hi everyone,

I’m having an issue getting a hierarchical model to work with pm.Data containers. I’m using data containers because, per the model documentation (https://docs.pymc.io/api/model.html#pymc3.model.set_data), if you want to predict on new data you need to use pm.set_data() to switch the shared variable in the model from the training data to the new data. I have seen examples of this for ordinary, non-hierarchical models, but not for hierarchical ones. pm.set_data() works with pickled models too, so you can save a model using pickle, load it back up, and use it to predict on new data.

The issue I am having is that the line logit_p_formula = constant[group_identifier_shared] in the code below throws a TypeError: index must be integers. This is because group_identifier_shared has been cast to float, relating to this issue (https://github.com/pymc-devs/pymc3/pull/3611). To note, the model works fine if x_train_shared, y_train_shared and group_identifier_shared are set as plain shared variables, e.g. x_train_shared = shared(x_train.values), but then you can't save the model, load it up, and make predictions on new data.

## define the number of groups
num_groups = group_identifier.nunique()

with pm.Model() as model:
    ## turn x/y_train and group identifier into a data container, for model definition and predictions 
    x_train_shared = pm.Data('x', x_train.values)
    y_train_shared = pm.Data('y', y_train.values)
    group_identifier_shared = pm.Data('group_identifier', group_identifier.values)
    
    ## model hyperparameters
    constant_mu        = pm.Normal('constant_mu', mu=0, sigma=2.5)
    constant_sigma     = pm.HalfCauchy('constant_sigma', beta=2.5)

    coefficients_mu    = pm.Normal('coefficients_mu', mu=0, sigma=2.5, shape=x_train.shape[1])
    coefficients_sigma = pm.HalfCauchy('coefficients_sigma', beta=5, shape=x_train.shape[1])

    ## group-level parameters
    coefficients = pm.Normal('coefficients', mu=coefficients_mu, sigma=coefficients_sigma, shape=(num_groups, x_train.shape[1]))
    constant     = pm.Normal('constant', mu=constant_mu, sigma=constant_sigma, shape=num_groups)

    ## define logit_p_formula
    logit_p_formula = constant[group_identifier_shared]

    for i, col in enumerate(x_train.columns):
        logit_p_formula = logit_p_formula + (coefficients[group_identifier_shared, i] * x_train_shared[:, i])

    outcome     = pm.Bernoulli('outcome',
                                logit_p     = logit_p_formula, 
                                observed    = y_train_shared)

I have a few questions:

  1. If the issue with pm.Data containers being cast to float by default were solved, i.e. the data container kept the dtype of the data given to it, would the hierarchical model work?
  2. Am I doing something wrong?
  3. Do data containers and minibatches work together like they do with plain shared theano variables?

Thanks for any help you can provide.

Hi,
This is indeed the subject of an ongoing PR. In the meantime, you can do pm.intX(pm.Data('group_identifier', group_identifier.values)). This will cast the shared variable back to int, allowing you to use it as an index variable.
At first glance, I don’t see any obvious issue with the model (though your priors may be too wide), so the best thing would be to try it out.
Hope this helps :vulcan_salute:

Hi Alex, I’ve just tested that and it works, thanks. I managed to train the model, save it, reload it, and predict on new data with it. Thanks a lot.
