Using Pandas within Pymc3 context?

Thanks Chartl. But it Still didn’t work. Error message:

KeyError Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/theano/tensor/type.py in dtype_specs(self)
268 ‘complex64’: (complex, ‘theano_complex64’, ‘NPY_COMPLEX64’)
–> 269 }[self.dtype]
270 except KeyError:

KeyError: ‘object’
During handling of the above exception, another exception occurred:
TypeError: Unsupported dtype for TensorType: object

Let’s just focus on one line of the code:
xxdat['ad'] = ((xxdat['pde_norm'].values) * (lam[xxdat['chnl_cd'].values-1]**(xxdat['gap_nrx_promo'].values))).astype(T.config.floatX)

I add a print function after that, and the got this table result:

<class ‘pandas.core.frame.DataFrame’>
Int64Index: 1000 entries, 1029454 to 607395
Data columns (total 8 columns):
spcl_st_cd 1000 non-null int64
nrx_wk 1000 non-null int64
nrx_norm 1000 non-null float64
chnl_cd 1000 non-null int64
promo_wk 1000 non-null int64
pde_norm 1000 non-null float64
gap_nrx_promo 1000 non-null int64
ad 1000 non-null object
dtypes: float64(2), int64(5), object(1)
memory usage: 70.3+ KB
None

So even we change the data type, the generated column is still object type. Do you know the reason for this?

You can always try to unwrap into an array

xxdat_ad = np.array([float(u) for u in xxdat['ad']])

and see where the error occurs.

By further checking the column it generated, is looks like this: That is the reason it is the object type

You’re mixing theano and numpy. ad_sum_r is a random variable, and so will all downstream computations that involve it. These computations should be performed in theano to register them on the DAG, rather than through numpy/pandas which won’t know what to do with a theano variable.

I think you are right. I never used theano before so it is a bit confusing here for me.

In a nutshell, what I want to do here is to generate a column based on the random variable lam, and use this generated column to further do some aggregations, the final column generated which is bx will be directly used for calculating the likelihood.

In that case, how can i separate pandas/numpy calculations from theano?

In this case it would mean manually grouping and aggregating, so that you can use theano.tensor.sum instead of np.sum. and theano.tensor.mean instead of np.mean. I don’t think that the theano functions will play nicely with pandas groupby/agg; but I could be wrong.

Hi Chartl,

Thanks man. But for grouping it is not that easy to do without pandas

Hi Chris,

I have revised my code based on your suggestion, here is my main code now:


with pm.Model() as model:
    
    ### prior for unknown parameters
    # prior for lambda
    BoundedNormal_lam = pm.Bound(pm.Normal, lower=lam_lbd, upper=lam_ubd)
    lam = BoundedNormal_lam('lam', mu=lam_mean, sd=lam_sd, shape = 8)
    # prior for gamma
    BoundedNormal_gamma = pm.Bound(pm.Normal, lower=0.2, upper=0.8)
    gamma = BoundedNormal_gamma('gamma', mu=0.5, sd=0.1)
    # prior for beta
    BoundedNormal_beta = pm.Bound(pm.Normal, lower=b_lbd, upper=b_ubd)
    beta = BoundedNormal_beta('beta', mu=b_mean, sd=b_sd, shape = 8)
    # prior for v
    v = pm.Uniform('v', lower = 1e-2, upper =1e2)
    
    # decayed promotion: pde * lam_j^{t-l}
    xxdat['ad'] = (xxdat['pde_norm'].values) * (lam[xxdat['chnl_cd'].values-1]**(xxdat['gap_nrx_promo'].values))
    # Sum up of decayed promotion for each NRx + each channel: \sum_t pde * lam_j^{t-l}
    yydat = xxdat.groupby(['spcl_st_cd','nrx_wk', 'chnl_cd', 'nrx_norm'], as_index= False).agg({'ad':test_sum})
    yydat.rename(columns = {'ad': 'ad_sum'}, inplace = True)
    yydat['ad_sum_r'] = (yydat['ad_sum'] ** gamma) # Add adstock diminishing factor (\sum_t pde * lam_j^{t-l})^r
    
    # normalize each channel's adstock
    yydat['ad_norm'] = (yydat['ad_sum_r'].values)/(yydat.groupby('chnl_cd', as_index=False).agg({'ad_sum_r':test_mean})['ad_sum_r'][yydat['chnl_cd'].values-1]).values
    # sum up adstock multiplying impact factor over channels
    yydat['bx'] = (beta[yydat['chnl_cd'].values -1]*(yydat['ad_norm']))
    zzdat = yydat.groupby(['spcl_st_cd', 'nrx_wk', 'nrx_norm'], as_index = False).agg({'bx':test_sum})
    # Get the final mu
    mu = test_mean(zzdat['bx'])
    
    ### likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd = v, observed = zzdat['nrx_norm'])

And here are these two test functions:


def test_sum(x):
    return tt.sum(x.to_list())

def test_mean(x):
    return tt.mean(x.to_list())

It can run successfully with input data having row number < 256. But once it is larger than it, i got this error:

fatal error: bracket nesting level exceeded maximum of 256. 

I have tried changed the maximum number to some larger number but it give some other errors. So i think still i need to revise my code. Do you have any more idea about my case? Thank you!

The type conversion is likely quite costly even if the pandas aggregation function works, I think you should considered doing the conversion and data manipulation outside of pm.Model() and inside the model only operates on numpy arrays

Hi Junpeng,

Thanks for your suggestion. But the thing here is that all the aggregations are related to the prior parameters, so it cannot be moved out of the context. for example the first row:


xxdat['ad'] = (xxdat['pde_norm'].values) * (lam[xxdat['chnl_cd'].values-1]**(xxdat['gap_nrx_promo'].values))

it generate the new column ad by using parameter lam, and then this column will be used into the groupby function in the next row.

I am also feel like too many aggregations within the pm.model context can definitely lower the speed but could not find a better way to solve this :frowning_face:

My suggestion would be not to group using pandas; but instead do it yourself. My guess is that the aggregation is implemented like a reduce, so theano is getting the equivalent of

tt.sum(a[0], tt.sum(a[1], tt.sum(a[2], ...

My suggestion is to convert the initial aggregated data frame (yydat – which initially contains only data so can use np.sum) into a dictionary keyed by your grouping variables, and to operate on the dictionary instead of using pandas functions.

Hi Chris,

Thanks for your suggestion! I have decided to change all pandas operations and calculation to numpy but still i faced some problems. Here are the old code:


xxdat['ad'] = (xxdat['pde_norm'].values) * (lam[xxdat['chnl_cd'].values-1]**(xxdat['gap_nrx_promo'].values))
yydat = xxdat.groupby(['spcl_st_cd','nrx_wk', 'chnl_cd', 'nrx_norm'], as_index= False).agg({'ad':test_sum})

and for the first row, the problem is that lam is a symbolic tensor, so even though we have for example 100 rows, after the multiplication it will still result in only one value: Elemwise{mul,no_inplace}.0. But behind the scene it has the shape (100,1). So I decided to do this calculation row by row with below code and also create groupby function using unique:


basic_model = pm.Model()
with basic_model:
    
    ### prior for unknown parameters
    # prior for lambda
    BoundedNormal_lam = pm.Bound(pm.Normal, lower=lam_lbd, upper=lam_ubd)
    lam = BoundedNormal_lam('lam', mu=lam_mean, sd=lam_sd, shape = 8)
    # prior for gamma
    BoundedNormal_gamma = pm.Bound(pm.Normal, lower=0.2, upper=0.8)
    gamma = BoundedNormal_gamma('gamma', mu=0.5, sd=0.1)
    # prior for beta
    BoundedNormal_beta = pm.Bound(pm.Normal, lower=b_lbd, upper=b_ubd)
    beta = BoundedNormal_beta('beta', mu=b_mean, sd=b_sd, shape = 8)
    # prior for v
    v = pm.Uniform('v', lower = 1e-2, upper =1e2)
    

    # decayed promotion: pde * lam_j^{t-l}
    ad = np.apply_along_axis(caclculate_ad, 1, xxdat_matrix)
    xxdat_matrix = np.append(xxdat_matrix,ad.reshape(-1,1),axis = 1)
    
    yydat_matrix = np.empty((len(GroupId1), 5), dtype = 'object')
    
    for index, group in enumerate(GroupId1):
        sub_group = xxdat_matrix[GroupId1 == group]
        sum_ad = tt.sum(sub_group[:,-1].tolist())
        sum_ad_gammma = sum_ad**gamma
        yydat_matrix[index,:] = np.append(sub_group[:,[0,1,2,3]][0].reshape(1,-1), np.array([sum_ad_gammma]).reshape(1,-1), axis = 1)
    
    bx =  np.apply_along_axis(calculate_bx, 1, yydat_matrix)
    mu = tt.mean(bx.tolist())
    
    ### likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd = v, observed = yydat_matrix[:,2])

where some pre-calculated staff and functions are as below:


### Transfer pandas data frame into numpy matrix
xxdat_matrix = xxdat.values
### Get the group ID 
GroupId1 = np.unique(xxdat_matrix[:,[0,1,2,3]], axis = 0, return_inverse=True)[1]

def caclculate_ad(x):
    '''x is a numpy array contains all the basic information'''
    pde_norm = x[5]
    channel = int(x[3])-1
    return pde_norm*((lam[channel])**(x[6]))
def calculate_bx(x):
    '''x is a numpy array contains ad_sum_r information'''
    channel =  int(x[3])-1
    ad_sum_r = x[4]
    return beta[channel]*ad_sum_r

Now within pymc3.basic_model context it is working fine. But the issues is, when i try to sample from it, lets say using find_MAP, i got this error:


Exception: ('The following error happened while compiling the node', Elemwise{Composite{(Switch(Cast{int8}((GE(i0, i1) * LE(i0, i2))), (i3 * i4 * (i5 - i4) * (i6 + i7)), i8) + (i9 * (Switch(EQ(i10, i8), i8, (i11 * log(i10) * i12)) + Switch(EQ(i13, i8), i8, (i14 * log(i13) * i15)) 

It is a very long error and not intuitive at all to me… any idea on this?

Thanks so much for your time!

Best,
Fan

I’m actually quite surprised this is working at all because of mixing numpy and theano. Try using

yydat_matrix = tt.zeros((len(GroupId1), 5), dtype=theano.config.floatX)

and tt.set_subtensor(yydat_matrix, ...) in place of yydat_matrix[index,:] =

And you’ll probably need to hand-implement apply_along_axis (see https://stackoverflow.com/questions/49581702/what-is-the-apply-along-axis-tensorflow-equivalent)

1 Like

Hi Chris,

Thank you so much! I finally got some results based on your suggestions. Really learned a lot from you these days!

If there are some further issues i may bother you again :smiley:

Best,
Fan

Hi Chris,

I just faced another issue that maybe only related to theano: One of my groupby task was implemented by the following code:


zzdat_matrix = tt.zeros((len(GroupId3), 4), dtype=T.config.floatX)

for index,group in enumerate(GroupId3):
    sub_group = yydat_matrix[tt.all(tt.eq(yydat_matrix[:,[0,1,2]].astype('float32'), group), axis = 1)]
    bx = tt.sum(beta[(sub_group[:,3].astype('int32'))-1]*sub_group[:,4])
    tt.set_subtensor(zzdat_matrix[index,[0,1,2]], group)  
    tt.set_subtensor(zzdat_matrix[index,3], bx)

But it will run forever without any error. Then I found out that maybe there is something wrong with the sub_group because if i tried sub_group[0], it gave me the error that index is out of bound. But i dont see anything wrong with my boolean indexing here for yydat_matrix.

To have a quick recap, yydat_matrix is a symbolic matrix with 5 columns, GroupId3 is a numpy array with shape 3xn, since tt.eq works element-wise, i need to add a tt.all after that.

Do you have any idea what’s going wrong here? Thanks!

Best,
Fan

Take a look at the documentation for set_subtensor. It does not alter the input variable, but rather returns a copy with the changes made

I have changed it like this:


for index, group in enumerate(GroupId1):
    sub_group = xxdat_matrix[GroupId1 == group]
    sum_ad = tt.sum(sub_group[:,-1].tolist())
    sum_ad_gammma = sum_ad**gamma
    yydat_matrix = tt.set_subtensor(yydat_matrix[index,[0,1,2,3]], group)  
    yydat_matrix = tt.set_subtensor(yydat_matrix[index,4], sum_ad_gammma)

But now it gives me this error:


RecursionError: maximum recursion depth exceeded

I increased the recursion limit number but still cannot get the correct results.

Almost there. How many groups do you have? If it’s quite a few then the recursion limit probably comes from these self-referential set_subtensor calls. You can do some thing like

g_idx, g_vals, gam_vals = list(), list(), list()
for index, group in enumerate(GroupId1):
    sub_group = xxdat_matrix[GroupId1 == group]
    sum_ad = tt.sum(sub_group[:,-1].tolist())
    sum_ad_gammma = sum_ad**gamma
    g_idx.extend(index)
    g_vals.extend([group]*len(index)]
    gam_vals.extend(sum_ad_gamma)

yydat_matrix = tt.set_subtensor(yydat_matrix[g_idx,[0,1,2,3]], g_vals)  
yydat_matrix = tt.set_subtensor(yydat_matrix[g_idx,4], gam-vals)

Now list() likely won’t work. My guess is you’ll need a correctly-shaped matrix or array; but you get the idea. This way you’re only adding two set nodes to the computation graph.

1 Like

Hi Chris,

Thanks! I got your idea and tried two ways:

  1. First used the method you suggested to create a array:

ls_ad_gamma = np.zeros((len(GroupId1), 1), dtype=object)
for index, group in enumerate(GroupId1):
    sub_group = xxdat_matrix[np.all(xxdat_matrix[:,[0,1,2,3]] == group, axis = 1)]
    sum_ad = tt.sum(sub_group[:,-1].tolist())
    ls_ad_gamma[index] = sum_ad**gamma
yydat_matrix = tt.set_subtensor(yydat_matrix[:,4], ls_ad_gamma)    
yydat_matrix = tt.set_subtensor(yydat_matrix[:,[0,1,2,3]], GroupId1)

Problem of this is that, the numpy array ls_ad_gamma cannot be added to the yydat_matrix tensor. It looks like this:


array([[Elemwise{pow,no_inplace}.0],
       [Elemwise{pow,no_inplace}.0],
       [Elemwise{pow,no_inplace}.0],
       ...,
       [Elemwise{pow,no_inplace}.0]], dtype=object)

and got this error:


('Cannot convert [[Elemwise{pow,no_inplace}.0]\n [Elemwise{pow,no_inplace}.0]\n [Elemwise{pow,no_inplace}.0]\n ...\n [Elemwise{pow,no_inplace}.0]\n [Elemwise{pow,no_inplace}.0]\n [Elemwise{pow,no_inplace}.0]] to TensorType', )

change ls_ad_gamma to ls_ad_gamma.tolist() still cannot resolve it. And also if we create a tensor type ls_ad_gamma, then we will do this recursion item assignment again.

  1. Then I tried to simplify my looping:

    yydat_matrix = tt.zeros((len(GroupId1), 5), dtype=T.config.floatX)
    
    for index, group in enumerate(GroupId1):
        sub_group = xxdat_matrix[np.all(xxdat_matrix[:,[0,1,2,3]] == group, axis = 1)]
        sum_ad = tt.sum(sub_group[:,-1].tolist())
        sum_ad_gamma = sum_ad**gamma
        yydat_matrix = tt.set_subtensor(yydat_matrix[index,4], sum_ad_gamma)

    yydat_matrix = tt.set_subtensor(yydat_matrix[:,[0,1,2,3]], GroupId1)

but it still got the recursion limit error. Any idea about this?

Thanks!
Fan

Hi Chirs, just wondering do you have time to think about it this week? Thank you!