Running a model on multiple datasets error

Hi Folks -
I need to run the same model on ‘jittered’ versions of the same dataset. This notebook pretty much walks through how it’s done, and I’m able to run that specific example; however, I run into problems when expanding the example to include other predictors as well.

I’ve tried to create a reproducible example that hopefully you can replicate on your machine. Added the error I’m getting at the bottom.

import numpy as np
import pandas as pd
import pymc3 as pm

print("Running on PyMC3 v{}".format(pm.__version__))

Running on PyMC3 v3.10.0

Create Fake data to work with

n = 40
fake_df = {'id' : np.repeat([0, 1, 2, 3], repeats = [n // 4] * 4),  # integer repeats; np.repeat rejects floats like n/4
          'y' : np.random.normal(0, 10, size = n), 
          'x' : np.random.normal(0, 5, size = n)} 

fake_df = pd.DataFrame(fake_df)
fake_df['id'] = pd.Categorical(fake_df['id'])

Create simulated datasets

these are the datasets on which the same model will be run.

def add_bias(df):
    # shift each group's values by a random offset
    df['x'] = df['x'] + np.random.uniform(-3, 3) 
    df['y'] = df['y'] + np.random.uniform(-10, 10)
    return df 

def create_biased_data(data):
    # work on a copy so repeated calls don't mutate the original frame
    tmp = data.copy().groupby('id').apply(add_bias)
    tmp['id'] = pd.Categorical(tmp['id'])
    return tmp 

tmp_dfs = [create_biased_data(fake_df) for _ in range(10)]

Model Statement

with pm.Model() as m_fake_multiple: 
    
    tmp_data = pm.Data('tmp_data', tmp_dfs[0]) 
    
    a = pm.Normal('a', mu = 0, sigma = 10, shape = 4) 
    b = pm.Normal('b', mu = 0, sigma = 10, shape = 1) 
    
    error = pm.HalfCauchy('error', beta = 1, shape = 4)
    
    lp = a[tmp_data['id']] + b*tmp_data['x'] 
    
    _ = pm.Normal('y', mu = lp, sigma = error[fake_df['id']], observed = tmp_data['y'])
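(Aside, for readers unfamiliar with the varying-intercept pattern: `a[tmp_data['id']]` is integer fancy indexing, where each observation picks up the intercept of its group. A pure-Python sketch of the arithmetic, with illustrative values not taken from the thread:)

```python
# Illustrative sketch of the linear predictor a[id] + b * x:
# each observation selects its group's intercept via an integer index.
a = [1.0, 2.0, 3.0, 4.0]       # one intercept per group (4 groups)
b = 0.5                        # shared slope
ids = [0, 0, 1, 2, 3, 3]       # group index for each observation
x = [1.0, 2.0, 1.0, 0.0, 2.0, 4.0]

lp = [a[g] + b * xi for g, xi in zip(ids, x)]
print(lp)  # [1.5, 2.0, 2.5, 3.0, 5.0, 6.0]
```

This is also why the index must be an integer array: it is used directly to select entries of `a`.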

The difference between the example on the documentation page and this one is that the data and model here contain additional predictors, rather than only estimating the mean.

What I have tried

At first I thought the problem might be that the example used numpy arrays instead of dataframes, so I tweaked the example to create dataframes with single columns instead of arrays, and it still worked. So I don’t think the issue is that I’m creating dataframes instead of arrays.

ERROR

RecursionError                            Traceback (most recent call last)
<ipython-input-37-984402ed8f87> in <module>
      8    error = pm.HalfCauchy('error', beta = 1, shape = 4)
      9 
---> 10    lp = a[tmp_data['id']] + b*tmp_data['x']
     11 
     12    _ = pm.Normal('y', mu = lp, sigma = error[fake_df['id']], observed = tmp_data['y'])

~/anaconda3/envs/pymc3_2021/lib/python3.9/site-packages/theano/tensor/var.py in __getitem__(self, args)
    480                 # indexing, it is safe to throw an error if we encounter
    481                 # any of these difficult cases.
--> 482                 if includes_bool(arg):
    483                     raise TypeError(
    484                         "TensorType does not support Python bools "

~/anaconda3/envs/pymc3_2021/lib/python3.9/site-packages/theano/tensor/var.py in includes_bool(args_el)
    448             ):
    449                 for el in args_el:
--> 450                     if includes_bool(el):
    451                         return True
    452             return False

... last 1 frames repeated, from the frame below ...

~/anaconda3/envs/pymc3_2021/lib/python3.9/site-packages/theano/tensor/var.py in includes_bool(args_el)
    448             ):
    449                 for el in args_el:
--> 450                     if includes_bool(el):
    451                         return True
    452             return False

RecursionError: maximum recursion depth exceeded in comparison

It appears that indexing into a tensor variable is what is causing the problem. This example is sufficient to trigger the error:

with pm.Model() as model:
    pm_data = pm.Data('pm_data', [0,1,2,3])
    temp = pm_data['id']

I am not entirely sure what the intended behavior is when you pack a multidimensional structure into a pm.Data() container (e.g., it’s not obvious that the containerized data should provide an interface that matches the original, native data structure). But one potential solution is to break out the individual variables into separate pm.Data containers:

id = pm.Data('id', tmp_df['id'])
x = pm.Data('x', tmp_df['x'])
y = pm.Data('y', tmp_df['y'])
# etc.

@cluhmann Thanks! That’s spot on. When I break it out into individual containers, it works, but only with continuous predictors. For instance, this works:

with pm.Model() as m_fake_multiple:   
      
    x_pred = pm.Data('x_pred', tmp_dfs[0]['x'])   
    y_obs = pm.Data('y_obs', tmp_dfs[0]['y'])   
    
    b = pm.Normal('b', mu = 0, sigma = 10, shape = 1)  
    
    ## linear predictor continuous var
    lp = b*x_pred 
    
    _ = pm.Normal('y', mu = lp, sigma = 1, observed = y_obs)

traces = []
for data_vals in tmp_dfs:
    with m_fake_multiple:
        # Switch out the observed dataset
        pm.set_data({"x_pred": data_vals['x']})
        pm.set_data({"y_obs": data_vals['y']})
        traces.append(pm.sample(return_inferencedata=False))

Using an index container, however, doesn’t work:

with pm.Model() as m_fake_multiple:   
      
    x_pred = pm.Data('x_pred', tmp_dfs[0]['x'])   
    y_obs = pm.Data('y_obs', tmp_dfs[0]['y'])   

    ### This is the categorical index with 4 levels 
    run_id = pm.Data('run_id', tmp_dfs[0]['id'])  
    
    a = pm.Normal('a', mu = 0, sigma = 10, shape = 4)  
    b = pm.Normal('b', mu = 0, sigma = 10, shape = 1)  
    
    ### linear predictor a function of both categorical index and continuous 
    lp = a[run_id] + b*x_pred  
    
    _ = pm.Normal('y', mu = lp, sigma = 1, observed = y_obs) 

Error: `index must be integers`.

How do I convert the index to integers within pm.Data? I tried using astype(int), but that didn’t work either.


Hmmm. This hack works, but it requires re-creating the index:

with pm.Model() as m_fake_multiple:   
      
    x_pred = pm.Data('x_pred', tmp_dfs[0]['x'])   
    y_obs = pm.Data('y_obs', tmp_dfs[0]['y'])   
    
    #run_id = pm.Data('run_id', tmp_dfs[0]['id']) 
    
    ## Instead create the index again
    run_id = pm.Data('run_id', np.repeat([0, 1, 2, 3], repeats = [n // 4] * 4))  # integer repeats, using n defined above
    
    a = pm.Normal('a', mu = 0, sigma = 10, shape = 4)  
    b = pm.Normal('b', mu = 0, sigma = 10, shape = 1)  
    
    
    lp = a[run_id] + b*x_pred  
    
    _ = pm.Normal('y', mu = lp, sigma = 1, observed = y_obs)

Yup, creating the id column as integers from the very beginning (as you did here) would certainly work. If you need to convert a categorical column to integers, this should work:

df['id'] = pd.to_numeric(df['id'])
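Another option (not tested against this thread’s exact data) is pandas’ `Series.cat.codes`, which maps each category to its integer position. Conceptually, in pure Python:

```python
# Pure-Python sketch of what pandas' .cat.codes does:
# assign each distinct label an integer code (here, in sorted order).
def to_codes(values):
    categories = sorted(set(values))
    lookup = {cat: i for i, cat in enumerate(categories)}
    return [lookup[v] for v in values]

print(to_codes([0, 0, 1, 1, 2, 3]))    # [0, 0, 1, 1, 2, 3]
print(to_codes(['b', 'a', 'b', 'c']))  # [1, 0, 1, 2]
```

In the thread’s example that would be `tmp_dfs[0]['id'].cat.codes.values`, which yields an integer array safe to pass to pm.Data and to use for indexing.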

There are also some discussions on how to better support multidimensional data containers in New Feature: Add a TidyData Class · Issue #3953 · pymc-devs/pymc3 · GitHub
