Running a model on multiple datasets error

Hi Folks -
I need to run the same model on ‘jittered’ versions of the same dataset. This notebook pretty much walks through how it’s done, and I’m able to run that specific example; however, I run into problems when expanding the example to include other predictors as well.

I’ve tried to create a reproducible example that hopefully you can replicate on your machine. Added the error I’m getting at the bottom.

import numpy as np
import pandas as pd
import pymc3 as pm

print("Running on PyMC3 v{}".format(pm.__version__))

Running on PyMC3 v3.10.0

Create Fake data to work with

n = 40
fake_df = {'id' : np.repeat([0, 1, 2, 3], repeats = [n // 4] * 4),  # integer repeats; np.repeat rejects floats like n/4
          'y' : np.random.normal(0, 10, size = n), 
          'x' : np.random.normal(0, 5, size = n)} 

fake_df = pd.DataFrame(fake_df)
fake_df['id'] = pd.Categorical(fake_df['id'])

Create simulated datasets

these are the datasets on which the same model will be run.

def add_bias(df):
    # shift each group's values by a random offset
    df['x'] = df['x'] + np.random.uniform(-3, 3) 
    df['y'] = df['y'] + np.random.uniform(-10, 10)
    return df 

def create_biased_data(data):
    # work on a copy so repeated calls don't mutate the original frame
    tmp = data.copy().groupby('id').apply(add_bias)
    tmp['id'] = pd.Categorical(tmp['id'])
    return tmp 

tmp_dfs = [create_biased_data(fake_df) for _ in range(10)]

Model Statement

with pm.Model() as m_fake_multiple: 
    
    tmp_data = pm.Data('tmp_data', tmp_dfs[0]) 
    
    a = pm.Normal('a', mu = 0, sigma = 10, shape = 4) 
    b = pm.Normal('b', mu = 0, sigma = 10, shape = 1) 
    
    error = pm.HalfCauchy('error', beta = 1, shape = 4)
    
    lp = a[tmp_data['id']] + b*tmp_data['x'] 
    
    _ = pm.Normal('y', mu = lp, sigma = error[fake_df['id']], observed = tmp_data['y'])
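(Aside, for readers unfamiliar with the varying-intercept pattern: `a[tmp_data['id']]` is integer fancy indexing, where each observation picks up the intercept of its group. A pure-Python sketch of the arithmetic, with illustrative values not taken from the thread:)

```python
# Illustrative sketch of the linear predictor a[id] + b * x:
# each observation selects its group's intercept via an integer index.
a = [1.0, 2.0, 3.0, 4.0]       # one intercept per group (4 groups)
b = 0.5                        # shared slope
ids = [0, 0, 1, 2, 3, 3]       # group index for each observation
x = [1.0, 2.0, 1.0, 0.0, 2.0, 4.0]

lp = [a[g] + b * xi for g, xi in zip(ids, x)]
print(lp)  # [1.5, 2.0, 2.5, 3.0, 5.0, 6.0]
```

This is also why the index must be an integer array: it is used directly to select entries of `a`.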

The difference between the example on the documentation page and this one is that the data and model here contain additional predictors, rather than only estimating the mean.

What I have tried

At first I thought the problem might be that the example used numpy arrays instead of dataframes, so I tweaked the example to create dataframes with single columns instead of arrays, and it still worked. So I don’t think the issue is that I’m creating dataframes instead of arrays.

ERROR

RecursionError                            Traceback (most recent call last)
<ipython-input-37-984402ed8f87> in <module>
      8    error = pm.HalfCauchy('error', beta = 1, shape = 4)
      9 
---> 10    lp = a[tmp_data['id']] + b*tmp_data['x']
     11 
     12    _ = pm.Normal('y', mu = lp, sigma = error[fake_df['id']], observed = tmp_data['y'])

~/anaconda3/envs/pymc3_2021/lib/python3.9/site-packages/theano/tensor/var.py in __getitem__(self, args)
    480                 # indexing, it is safe to throw an error if we encounter
    481                 # any of these difficult cases.
--> 482                 if includes_bool(arg):
    483                     raise TypeError(
    484                         "TensorType does not support Python bools "

~/anaconda3/envs/pymc3_2021/lib/python3.9/site-packages/theano/tensor/var.py in includes_bool(args_el)
    448             ):
    449                 for el in args_el:
--> 450                     if includes_bool(el):
    451                         return True
    452             return False

... last 1 frames repeated, from the frame below ...

~/anaconda3/envs/pymc3_2021/lib/python3.9/site-packages/theano/tensor/var.py in includes_bool(args_el)
    448             ):
    449                 for el in args_el:
--> 450                     if includes_bool(el):
    451                         return True
    452             return False

RecursionError: maximum recursion depth exceeded in comparison

It appears that indexing into a tensor variable is what is causing the problem. This example is sufficient to trigger the error:

with pm.Model() as model:
    pm_data = pm.Data('pm_data', [0,1,2,3])
    temp = pm_data['id']

I am not entirely sure what the intended behavior is when you pack a multidimensional structure into a pm.Data() container (e.g., it’s not obvious that the containerized data should provide an interface that matches the original, native data structure). But one potential solution is to break out the individual variables into separate pm.Data containers:

id = pm.Data('id', tmp_df['id'])
x = pm.Data('x', tmp_df['x'])
y = pm.Data('y', tmp_df['y'])
# etc.

@cluhmann Thanks! That’s spot on. When I break it out into individual containers, it works, but only with continuous predictors. For instance, this works:

with pm.Model() as m_fake_multiple:   
      
    x_pred = pm.Data('x_pred', tmp_dfs[0]['x'])   
    y_obs = pm.Data('y_obs', tmp_dfs[0]['y'])   
    
    b = pm.Normal('b', mu = 0, sigma = 10, shape = 1)  
    
    ## linear predictor continuous var
    lp = b*x_pred 
    
    _ = pm.Normal('y', mu = lp, sigma = 1, observed = y_obs)

traces = []
for data_vals in tmp_dfs:
    with m_fake_multiple:
        # Switch out the observed dataset
        pm.set_data({"x_pred": data_vals['x']})
        pm.set_data({"y_obs": data_vals['y']})
        traces.append(pm.sample(return_inferencedata=False))

Using an index container, however, doesn’t work:

with pm.Model() as m_fake_multiple:   
      
    x_pred = pm.Data('x_pred', tmp_dfs[0]['x'])   
    y_obs = pm.Data('y_obs', tmp_dfs[0]['y'])   

    ### This is the categorical index with 4 levels 
    run_id = pm.Data('run_id', tmp_dfs[0]['id'])  
    
    a = pm.Normal('a', mu = 0, sigma = 10, shape = 4)  
    b = pm.Normal('b', mu = 0, sigma = 10, shape = 1)  
    
    ### linear predictor a function of both categorical index and continuous 
    lp = a[run_id] + b*x_pred  
    
    _ = pm.Normal('y', mu = lp, sigma = 1, observed = y_obs) 

Error: `index must be integers`.

How do I convert the index to integers within pm.Data? I tried using astype(int), but that didn’t work either.


Hmmm. This hack works, but it requires re-creating the index:

with pm.Model() as m_fake_multiple:   
      
    x_pred = pm.Data('x_pred', tmp_dfs[0]['x'])   
    y_obs = pm.Data('y_obs', tmp_dfs[0]['y'])   
    
    #run_id = pm.Data('run_id', tmp_dfs[0]['id']) 
    
    ## Instead create the index again
    run_id = pm.Data('run_id', np.repeat([0, 1, 2, 3], repeats = [n // 4] * 4))  # integer repeats, using n defined above
    
    a = pm.Normal('a', mu = 0, sigma = 10, shape = 4)  
    b = pm.Normal('b', mu = 0, sigma = 10, shape = 1)  
    
    
    lp = a[run_id] + b*x_pred  
    
    _ = pm.Normal('y', mu = lp, sigma = 1, observed = y_obs)

Yup, creating the id column as integers from the very beginning (as you did here) would certainly work. If you need to convert a categorical column to integers, this should work:

df['id'] = pd.to_numeric(df['id'])
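Another option (not tested against this thread’s exact data) is pandas’ `Series.cat.codes`, which maps each category to its integer position. Conceptually, in pure Python:

```python
# Pure-Python sketch of what pandas' .cat.codes does:
# assign each distinct label an integer code (here, in sorted order).
def to_codes(values):
    categories = sorted(set(values))
    lookup = {cat: i for i, cat in enumerate(categories)}
    return [lookup[v] for v in values]

print(to_codes([0, 0, 1, 1, 2, 3]))    # [0, 0, 1, 1, 2, 3]
print(to_codes(['b', 'a', 'b', 'c']))  # [1, 0, 1, 2]
```

In the thread’s example that would be `tmp_dfs[0]['id'].cat.codes.values`, which yields an integer array safe to pass to pm.Data and to use for indexing.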

There are also some discussions on how to better support multidimensional data containers in New Feature: Add a TidyData Class · Issue #3953 · pymc-devs/pymc3 · GitHub
