Is it possible to convert column from pandas to theano.shared?


#1

Good time of the day.

I’m trying to predict values using pymc3.sample_ppc, for which I need shared variables, according to the notebook.

But if I try to run something like
DISTANCE = theano.shared(BR['mil_km'])
, where BR is a pandas dataframe, I get the following error during the model specification:

Traceback (most recent call last):
  File "rail_0.py", line 25, in <module>
    theta = (A + BD * DISTANCE + BY * YEAR)
TypeError: unsupported operand type(s) for *: 'FreeRV' and 'SharedVariable'

If I print(DISTANCE), the output is simply <Generic> instead of a column of data, so I assume that is the problem. What is the correct way to convert a pandas column to theano.shared? Most of the examples I’ve seen simply generate initial data to use as shared predictors.

Here’s the problem I’m trying to solve if it might be of help:
I have a dataset with 2 periods. It contains year, distance and result. I want to train the model on the years and distance from the first period and then predict values for the second period.

Here’s the way I’m trying to implement that:

import pymc3 as pm
import pandas as pd
import numpy as np
import theano

RAILS = pd.read_csv('./brit_rail_acc.csv')
BR = RAILS[21:47]  # Data prior to privatisation, but after the steam engine
ABR = RAILS[48:]  # Data after privatisation

DISTANCE = theano.shared(BR['mil_km'])
YEAR = theano.shared(BR['year'])

with pm.Model() as MODEL_BR_0:
    A = pm.Normal('alpha', mu=0, sd=100)
    BD = pm.Normal('distance', mu=0, sd=10)
    BY = pm.Normal('year', mu=0, sd=10)

    theta = (A + BD * DISTANCE + BY * YEAR)
    y = pm.Poisson('accidents', mu=np.exp(theta),
            observed=BR['cdo_acc'].values)

    trace = pm.sample(5000, n_init=5000)

DISTANCE.set_value(ABR['mil_km'])
YEAR.set_value(ABR['year'])

ppc = pm.sample_ppc(trace, model=MODEL_BR_0)

On a semi-related note, should I worry if in about 50% of cases I get the following error, while trying to run the basic model without the shared elements?

WARNING (theano.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
  0%|                                                                                                            | 0/5500 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "rail_0_nt.py", line 24, in <module>
    trace = pm.sample(5000, n_init=5000)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/sampling.py", line 285, in sample
    return sample_func(**sample_args)[discard:]
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/sampling.py", line 332, in _sample
    for it, strace in enumerate(sampling):
  File "/home/eichhorn/.local/lib/python3.6/site-packages/tqdm/_tqdm.py", line 955, in __iter__
    for obj in iterable:
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/sampling.py", line 430, in _iter_sample
    point, states = step.step(point)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 175, in step
    apoint, stats = self.astep(array)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/step_methods/hmc/nuts.py", line 182, in astep
    'might be misspecified.' % start.energy)
ValueError: Bad initial energy: nan. The model might be misspecified.

#2

You can get a numpy array of a column by using .values

This should work

DISTANCE = theano.shared(BR['mil_km'].values)
YEAR = theano.shared(BR['year'].values)
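As a quick sanity check (toy data with made-up values; only the column names come from the thread), a bare column is a pandas Series, while .values unwraps the underlying NumPy array that theano.shared accepts:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the BR dataframe; only the column names match the thread.
BR = pd.DataFrame({'mil_km': [0.43, 0.45, 0.44],
                   'year': [1968, 1969, 1970]})

# BR['mil_km'] is a pandas Series, not an array...
assert isinstance(BR['mil_km'], pd.Series)

# ...while .values exposes the underlying NumPy ndarray,
# which is what theano.shared expects.
assert isinstance(BR['mil_km'].values, np.ndarray)
```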


#3

This worked perfectly, thank you!

Unfortunately, now I get the misspecification error a lot more often, but that’s not really a big problem, as long as I can run it eventually.


#4

I’ve found that you can often guard against the bad initial energy error by initializing NUTS with advi+adapt_diag, i.e. passing init="advi+adapt_diag" to pm.sample.

So basically this uses advi to find the typical set, and NUTS to explore it.


#5

Thank you once more! I’ll read about it.

EDIT: I get the following error when trying to run it with advi+adapt_diag:

Auto-assigning NUTS sampler...
Initializing NUTS using advi+adapt_diag...
Average Loss = inf:   0%|                                                                                        | 0/5000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "rail_0.py", line 30, in <module>
    trace = pm.sample(5000, n_init=5000, init="advi+adapt_diag")
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/sampling.py", line 252, in sample
    progressbar=progressbar, **args)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/sampling.py", line 815, in init_nuts
    obj_optimizer=pm.adagrad_window,
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/variational/inference.py", line 756, in fit
    return inference.fit(n, **kwargs)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/variational/inference.py", line 135, in fit
    state = self._iterate_with_loss(0, n, step_func, progress, callbacks)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/variational/inference.py", line 181, in _iterate_with_loss
    raise FloatingPointError('NaN occurred in optimization.')
FloatingPointError: NaN occurred in optimization.

#6

Could you try

import theano.tensor as tt
    theta = A + BD * DISTANCE + BY * YEAR
    y = pm.Poisson('accidents', mu=tt.exp(theta),
            observed=BR['cdo_acc'].values)
    trace = pm.sample(1000, tune=1000)

I would suggest sticking with the default jitter+adapt_diag initialization. If you are still getting the misspecification error, that is an indication of problems in your model. One possible problem is that the prior sd is too large (did you center YEAR?), which makes exp(theta) go to inf.


#7

I tried, but the problem persists. It’s the same even with sd=1.

I did not center YEAR, and I’m not sure it is reasonable for the model (if I understand you correctly). But I’ll double check.

May I ask you another question about pymc3.sample_ppc? The help states that I can specify the number of random draws from the distribution by setting the size argument.
Do I understand correctly that this is the number of rows in the array that I get as a result?
I’ve tried to get 10, but instead I get the following error:

Traceback (most recent call last):
  File "rail_0.py", line 34, in <module>
    ppc = pm.sample_ppc(trace, model=MODEL_BR_0, size=10)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/sampling.py", line 572, in sample_ppc
    size=size))
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/distributions/discrete.py", line 293, in random
    size=size)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 415, in generate_samples
    return reshape_sampled(samples, size, dist_shape)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 331, in reshape_sampled
    return np.reshape(sampled, repeat_shape + dist_shape)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 232, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "/home/eichhorn/.local/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 57, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 100 into shape (10,)

The error is "cannot reshape array of size 40 into shape (4,)" if I set size=4.
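For what it’s worth, the arithmetic of that error can be reproduced with NumPy alone (assuming, purely for illustration, 10 held-out observations): size=10 draws produce 10 × 10 = 100 values, which fit shape (10, 10) but not (10,):

```python
import numpy as np

sampled = np.zeros(100)  # 10 draws x 10 observations (illustrative numbers)

# 100 values cannot be reshaped into (10,)...
try:
    np.reshape(sampled, (10,))
except ValueError as err:
    print(err)  # cannot reshape array of size 100 into shape (10,)

# ...but the draws-by-observations shape works fine.
print(np.reshape(sampled, (10, 10)).shape)  # (10, 10)
```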


#8

It might be easier if you could share the full code with data :wink:


#9

Sorry, here it is.

code:

import pymc3 as pm
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import theano

RAILS = pd.read_csv('./brit_rail_acc.csv')
BR = RAILS[21:47]  # Data prior to privatisation, but after the steam engine
ABR = RAILS[48:]  # Data after privatisation
GBR = RAILS[21:]  # Data for the comparison plot

DISTANCE = theano.shared(BR['mil_km'].values)
YEAR = theano.shared(BR['year'].values)

with pm.Model() as MODEL_BR_0:
    A = pm.Normal('alpha', mu=0, sd=100)
    BD = pm.Normal('distance', mu=0, sd=10)
    BY = pm.Normal('year', mu=0, sd=10)
# There is no reason to assume that year affects the outcome, but as is stated,
# British Rail was running new safety measures and they are assumed to continue,
# therefore there must be at least some weak effect present

    theta = (A + BD * DISTANCE + BY * YEAR)
    y = pm.Poisson('accidents', mu=np.exp(theta),
            observed=BR['cdo_acc'])

    trace = pm.sample(5000, n_init=5000)

DISTANCE.set_value(ABR['mil_km'].values)
YEAR.set_value(ABR['year'].values)

ppc = pm.sample_ppc(trace, model=MODEL_BR_0)

#    sns.regplot(x="year", y="cdo_acc", data=BR)
#    plt.show()

data: brit_rail_acc.csv
Originally from here, but I converted it into .CSV.

I was trying to run trace = pm.sample(100, size=4) for a quick overview, before running the CPU intensive variant.

There is also another version; the only difference is that it uses theano.tensor.exp instead of the numpy one, but the result is the same.


#10

Preprocessing makes a big difference: when the predictors contain large values, the coefficients usually become very small, and you will see all kinds of numerical errors (e.g., matrix conditioning problems).

The code below should work (tested from pymc3 master):

import theano.tensor as tt
from scipy.stats import zscore
RAILS['mil_km_z'] = zscore(RAILS['mil_km'])
RAILS['year_z'] = zscore(RAILS['year'])

BR = RAILS[21:47]  # Data prior to privatisation, but after the steam engine
ABR = RAILS[48:]  # Data after privatisation
GBR = RAILS[21:]  # Data for the comparison plot

DISTANCE = theano.shared(BR['mil_km_z'].values)
YEAR = theano.shared(BR['year_z'].values)

with pm.Model() as MODEL_BR_0:
    A = pm.Normal('alpha', mu=0, sd=1)
    BD = pm.Normal('distance', mu=0, sd=1)
    BY = pm.Normal('year', mu=0, sd=1)
# There is no reason to assume that year affects the outcome, but as is stated,
# British Rail was running new safety measures and they are assumed to continue,
# therefore there must be at least some weak effect present

    theta = A + BD * DISTANCE + BY * YEAR
    y = pm.Poisson('accidents', mu=tt.exp(theta),
            observed=BR['cdo_acc'].values)

    trace = pm.sample(1000, tune=1000)

DISTANCE.set_value(ABR['mil_km_z'].values)
YEAR.set_value(ABR['year_z'].values)

ppc = pm.sample_ppc(trace, model=MODEL_BR_0)

#11

Thank you, it works without errors every time now!

I’m mostly used to eViews and a bit of R, but I had never encountered pre-processing before. Probably because we were given good data to begin with during my studies.

I’ll be sure to check all my other models.


#12

No problem.
Also, be careful: more often than not, statistical packages perform some preprocessing or preconditioning of your matrix internally. It is a subtle point that is easily overlooked :slight_smile:
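To make the conditioning point concrete, here is a sketch with synthetic year/distance values (loosely modeled on the thread's data, not the actual dataset): the raw design matrix is badly conditioned, while the z-scored one is nearly ideal:

```python
import numpy as np

rng = np.random.default_rng(0)
year = np.arange(1968, 1994, dtype=float)               # large values, small spread
distance = 0.4 + 0.05 * rng.standard_normal(year.size)  # small values

def zscore(x):
    # Same convention as scipy.stats.zscore (population std, ddof=0).
    return (x - x.mean()) / x.std()

# Design matrices with an intercept column, as in theta = A + BD*d + BY*y.
X_raw = np.column_stack([np.ones_like(year), distance, year])
X_std = np.column_stack([np.ones_like(year), zscore(distance), zscore(year)])

# The raw matrix is ill-conditioned; standardizing fixes it.
print(np.linalg.cond(X_raw) > 1e3)  # True
print(np.linalg.cond(X_std) < 10)   # True
```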