Using Pandas within Pymc3 context?

Hi All,

I am a newbie for using pymc3 and now in the early stage of transforming previous R MCMC code to python.

My problems are:

  1. Is it possible to use pandas functions inside of pymc context? Cause I have seen several post saying that it is not recommended to use.
  2. Running this code will result in dead kernel and it will use lots of memory. I don’t know why this will happen

Here are the main part of my code:

with model:
### prior for unknown parameters
# prior for lambda
BoundedNormal_lam = pm.Bound(pm.Normal, lower=lam_lbd, upper=lam_ubd)
lam = BoundedNormal_lam(‘lam’, mu=lam_mean, sd=lam_sd, shape = 8)
# prior for gamma
BoundedNormal_gamma = pm.Bound(pm.Normal, lower=0.2, upper=0.8)
gamma = BoundedNormal_gamma(‘gamma’, mu=0.5, sd=0.1)
# prior for beta
BoundedNormal_beta = pm.Bound(pm.Normal, lower=b_lbd, upper=b_ubd)
beta = BoundedNormal_beta(‘beta’, mu=b_mean, sd=b_sd, shape = 8)
# prior for v
v = pm.Uniform(‘v’, lower = 1e-2, upper =1e2)

### Get expected value of outcome
xxdat = tpdat
# decayed promotion: pde * lam_j^{t-l}
xxdat['ad'] = (xxdat['pde_norm'].values) * (lam[xxdat['chnl_cd'].values-1]**(xxdat['gap_nrx_promo'].values))  
# Sum up of decayed promotion for each NRx + each channel: \sum_t pde * lam_j^{t-l}
yydat = xxdat.groupby(['spcl_st_cd','nrx_wk', 'chnl_cd', 'nrx_norm'], as_index= False).agg({'ad':np.sum})
yydat.rename(columns = {'ad': 'ad_sum'}, inplace = True)
yydat['ad_sum_r'] = yydat['ad_sum'].values ** gamma # Add adstock diminishing factor (\sum_t pde * lam_j^{t-l})^r
# normalize each channel's adstock
yydat['ad_norm'] = np.divide(yydat['ad_sum_r'],yydat.groupby('chnl_cd', as_index=False).agg({'ad_sum_r':np.mean})['ad_sum_r'][yydat['chnl_cd'].values-1])
# sum up adstock multiplying impact factor over channels
yydat['bx'] = beta[yydat['chnl_cd'].values -1]*(yydat['ad_norm'])
zzdat = yydat.groupby(['spcl_st_cd', 'nrx_wk', 'nrx_norm'], as_index = False).agg({'bx':np.sum})
# Get the final mu
mu = np.mean(zzdat['bx'])

### likelihood (sampling distribution) of observations
Y_obs = pm.Normal('Y_obs', mu=mu, sd = v, observed = zzdat['nrx_norm'])

Any suggestions are welcomed! Thanks in advance.

Best,
Fan

Use zzdat['nrx_norm'].values to access the underlying numpy array. Whenever you pass a dataframe or series off to pymc3, use .values, and it will work (if the dtype isn’t Object).

Hi Chartl,

Thanks for your help. After using ‘.values’ I still received the dead kernel error. The size of the data table is (320894, 7), do you think it is because the data size is too large?

Thanks

Possibly. Try running with a subset .values[:10000,:] just to see if it works. If it does, the issue may be due to the data size.

I tried with 1000 rows and here is the error i got:

KeyError Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/theano/tensor/type.py in dtype_specs(self)
268 ‘complex64’: (complex, ‘theano_complex64’, ‘NPY_COMPLEX64’)
–> 269 }[self.dtype]
270 except KeyError:

KeyError: ‘object’

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/theano/tensor/basic.py in constant(x, name, ndim, dtype)
245 try:
–> 246 ttype = TensorType(dtype=x_.dtype, broadcastable=bcastable)
247 if not constant.enable:

/anaconda3/lib/python3.7/site-packages/theano/tensor/type.py in init(self, dtype, broadcastable, name, sparse_grad)
50 self.broadcastable = tuple(bool(b) for b in broadcastable)
—> 51 self.dtype_specs() # error checking is done there
52 self.name = name

/anaconda3/lib/python3.7/site-packages/theano/tensor/type.py in dtype_specs(self)
271 raise TypeError(“Unsupported dtype for %s: %s”
–> 272 % (self.class.name, self.dtype))
273

TypeError: Unsupported dtype for TensorType: object

During handling of the above exception, another exception occurred:

TypeError Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/theano/tensor/basic.py in as_tensor_variable(x, name, ndim)
193 try:
–> 194 return constant(x, name=name, ndim=ndim)
195 except TypeError:

/anaconda3/lib/python3.7/site-packages/theano/tensor/basic.py in constant(x, name, ndim, dtype)
265 except Exception:
–> 266 raise TypeError(“Could not convert %s to TensorType” % x, type(x))
267

TypeError: ('Could not convert [Elemwise{mul,no_inplace}.0 Elemwise{mul,no_inplace}.0\n Elemwise{mul,no_inplace}.0\n Elemwise{mul,no_inplace}.0 Elemwise{mul,no_inplace}.0\n Elemwise{mul,no_inplace}.0 Elemwise{mul,no_inplace}.0\n Elemwise{mul,no_inplace}.0

Seems like it is something wrong with the ‘object’ type for the new column added to the table, so then tried to change the data type to np.float64, then got this error:

KeyError Traceback (most recent call last)
in
15
16 # decayed promotion: pde * lam_j^{t-l}
—> 17 xxdat[‘ad’] = ((xxdat[‘pde_norm’].values) * (lam[xxdat[‘chnl_cd’].values-1]**(xxdat[‘gap_nrx_promo’].values))).astype(np.float64)
18 # Sum up of decayed promotion for each NRx + each channel: \sum_t pde * lam_j^{t-l}
19 yydat = xxdat.groupby([‘spcl_st_cd’,‘nrx_wk’, ‘chnl_cd’, ‘nrx_norm’], as_index= False).agg({‘ad’:np.sum})

/anaconda3/lib/python3.7/site-packages/theano/tensor/var.py in astype(self, dtype)
456 # CASTING
457 def astype(self, dtype):
–> 458 return theano.tensor.cast(self, dtype)
459
460 # SLICING/INDEXING

/anaconda3/lib/python3.7/site-packages/theano/tensor/basic.py in cast(x, dtype)
1264 'Casting from complex to real is ambiguous: consider real(), ’
1265 ‘imag(), angle() or abs()’))
-> 1266 return _cast_mappingdtype
1267
1268 ##########################

KeyError: <class ‘numpy.float64’>

Sorry that my reply looks a little bit messy…

Try using the type theano.config.floatX ?

Thanks Chartl. But it Still didn’t work. Error message:

KeyError Traceback (most recent call last)
/anaconda3/lib/python3.7/site-packages/theano/tensor/type.py in dtype_specs(self)
268 ‘complex64’: (complex, ‘theano_complex64’, ‘NPY_COMPLEX64’)
–> 269 }[self.dtype]
270 except KeyError:

KeyError: ‘object’
During handling of the above exception, another exception occurred:
TypeError: Unsupported dtype for TensorType: object

Let’s just focus on one line of the code:
xxdat['ad'] = ((xxdat['pde_norm'].values) * (lam[xxdat['chnl_cd'].values-1]**(xxdat['gap_nrx_promo'].values))).astype(T.config.floatX)

I add a print function after that, and the got this table result:

<class ‘pandas.core.frame.DataFrame’>
Int64Index: 1000 entries, 1029454 to 607395
Data columns (total 8 columns):
spcl_st_cd 1000 non-null int64
nrx_wk 1000 non-null int64
nrx_norm 1000 non-null float64
chnl_cd 1000 non-null int64
promo_wk 1000 non-null int64
pde_norm 1000 non-null float64
gap_nrx_promo 1000 non-null int64
ad 1000 non-null object
dtypes: float64(2), int64(5), object(1)
memory usage: 70.3+ KB
None

So even we change the data type, the generated column is still object type. Do you know the reason for this?

You can always try to unwrap into an array

xxdat_ad = np.array([float(u) for u in xxdat['ad']])

and see where the error occurs.

By further checking the column it generated, is looks like this: That is the reason it is the object type

You’re mixing theano and numpy. ad_sum_r is a random variable, and so will all downstream computations that involve it. These computations should be performed in theano to register them on the DAG, rather than through numpy/pandas which won’t know what to do with a theano variable.

I think you are right. I never used theano before so it is a bit confusing here for me.

In a nutshell, what I want to do here is to generate a column based on the random variable lam, and use this generated column to further do some aggregations, the final column generated which is bx will be directly used for calculating the likelihood.

In that case, how can i separate pandas/numpy calculations from theano?

In this case it would mean manually grouping and aggregating, so that you can use theano.tensor.sum instead of np.sum. and theano.tensor.mean instead of np.mean. I don’t think that the theano functions will play nicely with pandas groupby/agg; but I could be wrong.

Hi Chartl,

Thanks man. But for grouping it is not that easy to do without pandas

Hi Chris,

I have revised my code based on your suggestion, here is my main code now:


with pm.Model() as model:
    
    ### prior for unknown parameters
    # prior for lambda
    BoundedNormal_lam = pm.Bound(pm.Normal, lower=lam_lbd, upper=lam_ubd)
    lam = BoundedNormal_lam('lam', mu=lam_mean, sd=lam_sd, shape = 8)
    # prior for gamma
    BoundedNormal_gamma = pm.Bound(pm.Normal, lower=0.2, upper=0.8)
    gamma = BoundedNormal_gamma('gamma', mu=0.5, sd=0.1)
    # prior for beta
    BoundedNormal_beta = pm.Bound(pm.Normal, lower=b_lbd, upper=b_ubd)
    beta = BoundedNormal_beta('beta', mu=b_mean, sd=b_sd, shape = 8)
    # prior for v
    v = pm.Uniform('v', lower = 1e-2, upper =1e2)
    
    # decayed promotion: pde * lam_j^{t-l}
    xxdat['ad'] = (xxdat['pde_norm'].values) * (lam[xxdat['chnl_cd'].values-1]**(xxdat['gap_nrx_promo'].values))
    # Sum up of decayed promotion for each NRx + each channel: \sum_t pde * lam_j^{t-l}
    yydat = xxdat.groupby(['spcl_st_cd','nrx_wk', 'chnl_cd', 'nrx_norm'], as_index= False).agg({'ad':test_sum})
    yydat.rename(columns = {'ad': 'ad_sum'}, inplace = True)
    yydat['ad_sum_r'] = (yydat['ad_sum'] ** gamma) # Add adstock diminishing factor (\sum_t pde * lam_j^{t-l})^r
    
    # normalize each channel's adstock
    yydat['ad_norm'] = (yydat['ad_sum_r'].values)/(yydat.groupby('chnl_cd', as_index=False).agg({'ad_sum_r':test_mean})['ad_sum_r'][yydat['chnl_cd'].values-1]).values
    # sum up adstock multiplying impact factor over channels
    yydat['bx'] = (beta[yydat['chnl_cd'].values -1]*(yydat['ad_norm']))
    zzdat = yydat.groupby(['spcl_st_cd', 'nrx_wk', 'nrx_norm'], as_index = False).agg({'bx':test_sum})
    # Get the final mu
    mu = test_mean(zzdat['bx'])
    
    ### likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd = v, observed = zzdat['nrx_norm'])

And here are these two test functions:


def test_sum(x):
    return tt.sum(x.to_list())

def test_mean(x):
    return tt.mean(x.to_list())

It can run successfully with input data having row number < 256. But once it is larger than it, i got this error:

fatal error: bracket nesting level exceeded maximum of 256. 

I have tried changed the maximum number to some larger number but it give some other errors. So i think still i need to revise my code. Do you have any more idea about my case? Thank you!

The type conversion is likely quite costly even if the pandas aggregation function works, I think you should considered doing the conversion and data manipulation outside of pm.Model() and inside the model only operates on numpy arrays

Hi Junpeng,

Thanks for your suggestion. But the thing here is that all the aggregations are related to the prior parameters, so it cannot be moved out of the context. for example the first row:


xxdat['ad'] = (xxdat['pde_norm'].values) * (lam[xxdat['chnl_cd'].values-1]**(xxdat['gap_nrx_promo'].values))

it generate the new column ad by using parameter lam, and then this column will be used into the groupby function in the next row.

I am also feel like too many aggregations within the pm.model context can definitely lower the speed but could not find a better way to solve this :frowning_face:

My suggestion would be not to group using pandas; but instead do it yourself. My guess is that the aggregation is implemented like a reduce, so theano is getting the equivalent of

tt.sum(a[0], tt.sum(a[1], tt.sum(a[2], ...

My suggestion is to convert the initial aggregated data frame (yydat – which initially contains only data so can use np.sum) into a dictionary keyed by your grouping variables, and to operate on the dictionary instead of using pandas functions.

Hi Chris,

Thanks for your suggestion! I have decided to change all pandas operations and calculation to numpy but still i faced some problems. Here are the old code:


xxdat['ad'] = (xxdat['pde_norm'].values) * (lam[xxdat['chnl_cd'].values-1]**(xxdat['gap_nrx_promo'].values))
yydat = xxdat.groupby(['spcl_st_cd','nrx_wk', 'chnl_cd', 'nrx_norm'], as_index= False).agg({'ad':test_sum})

and for the first row, the problem is that lam is a symbolic tensor, so even though we have for example 100 rows, after the multiplication it will still result in only one value: Elemwise{mul,no_inplace}.0. But behind the scene it has the shape (100,1). So I decided to do this calculation row by row with below code and also create groupby function using unique:


basic_model = pm.Model()
with basic_model:
    
    ### prior for unknown parameters
    # prior for lambda
    BoundedNormal_lam = pm.Bound(pm.Normal, lower=lam_lbd, upper=lam_ubd)
    lam = BoundedNormal_lam('lam', mu=lam_mean, sd=lam_sd, shape = 8)
    # prior for gamma
    BoundedNormal_gamma = pm.Bound(pm.Normal, lower=0.2, upper=0.8)
    gamma = BoundedNormal_gamma('gamma', mu=0.5, sd=0.1)
    # prior for beta
    BoundedNormal_beta = pm.Bound(pm.Normal, lower=b_lbd, upper=b_ubd)
    beta = BoundedNormal_beta('beta', mu=b_mean, sd=b_sd, shape = 8)
    # prior for v
    v = pm.Uniform('v', lower = 1e-2, upper =1e2)
    

    # decayed promotion: pde * lam_j^{t-l}
    ad = np.apply_along_axis(caclculate_ad, 1, xxdat_matrix)
    xxdat_matrix = np.append(xxdat_matrix,ad.reshape(-1,1),axis = 1)
    
    yydat_matrix = np.empty((len(GroupId1), 5), dtype = 'object')
    
    for index, group in enumerate(GroupId1):
        sub_group = xxdat_matrix[GroupId1 == group]
        sum_ad = tt.sum(sub_group[:,-1].tolist())
        sum_ad_gammma = sum_ad**gamma
        yydat_matrix[index,:] = np.append(sub_group[:,[0,1,2,3]][0].reshape(1,-1), np.array([sum_ad_gammma]).reshape(1,-1), axis = 1)
    
    bx =  np.apply_along_axis(calculate_bx, 1, yydat_matrix)
    mu = tt.mean(bx.tolist())
    
    ### likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd = v, observed = yydat_matrix[:,2])

where some pre-calculated staff and functions are as below:


### Transfer pandas data frame into numpy matrix
xxdat_matrix = xxdat.values
### Get the group ID 
GroupId1 = np.unique(xxdat_matrix[:,[0,1,2,3]], axis = 0, return_inverse=True)[1]

def caclculate_ad(x):
    '''x is a numpy array contains all the basic information'''
    pde_norm = x[5]
    channel = int(x[3])-1
    return pde_norm*((lam[channel])**(x[6]))
def calculate_bx(x):
    '''x is a numpy array contains ad_sum_r information'''
    channel =  int(x[3])-1
    ad_sum_r = x[4]
    return beta[channel]*ad_sum_r

Now within pymc3.basic_model context it is working fine. But the issues is, when i try to sample from it, lets say using find_MAP, i got this error:


Exception: ('The following error happened while compiling the node', Elemwise{Composite{(Switch(Cast{int8}((GE(i0, i1) * LE(i0, i2))), (i3 * i4 * (i5 - i4) * (i6 + i7)), i8) + (i9 * (Switch(EQ(i10, i8), i8, (i11 * log(i10) * i12)) + Switch(EQ(i13, i8), i8, (i14 * log(i13) * i15)) 

It is a very long error and not intuitive at all to me… any idea on this?

Thanks so much for your time!

Best,
Fan

I’m actually quite surprised this is working at all because of mixing numpy and theano. Try using

yydat_matrix = tt.zeros((len(GroupId1), 5), dtype=theano.config.floatX)

and tt.set_subtensor(yydat_matrix, ...) in place of yydat_matrix[index,:] =

And you’ll probably need to hand-implement apply_along_axis (see https://stackoverflow.com/questions/49581702/what-is-the-apply-along-axis-tensorflow-equivalent)

1 Like