Using categorical data in pm.Minibatch

What is the proper way to pass categorical data in a call to pm.Minibatch()?

I have a dataframe that contains predictors of the type pandas.core.categorical.Categorical. I currently have a model set up to accept theano.shared variables containing these predictor vectors and can then sample and draw from the posterior.

I am creating the shared tensor as follows:

x_1 = shared(train.genre_id.values)

and accessing the number of categories and the codes associated with the predictor, for use in the model specification, as follows:

x_1.eval().dtype.categories.size
x_1.eval().dtype.codes

However, when I create the vector x_1 with pm.Minibatch(), it raises the following error:

AttributeError: 'SharedVariable' object has no attribute 'shape'

I looked at the code in pymc3.data and it looks like it’s creating the shared tensor with the passed data so I’m wondering if there is any way I can have the same behavior with Minibatch that I am getting from using theano.shared.

You cannot create a Minibatch out of a theano.shared variable, because Minibatch is itself a theano.shared.
Instead, for your use case you can try the following:

x_1 = pm.Minibatch(train.genre_id.values)
...
# set up the model and fit it
...
# replace the values for prediction
x_1.set_value(test.genre_id.values)

Right, but that exact call produces the AttributeError described.

What is the output of train.genre_id.values? It works fine in, e.g., the example below:

import numpy as np
import pandas as pd
import pymc3 as pm
train = pd.DataFrame(data=dict(genre_id=np.random.choice(10, size=100),
                               x=np.random.randn(100)))
x_1 = pm.Minibatch(train.genre_id.values)

The output is a pandas.core.categorical.Categorical.

I see. Try encoding it as a NumPy array, then.

I can create the Minibatch object using the coded values in a numpy array as follows:

x_2 = pm.Minibatch(train.genre_id.cat.codes.values, batch_size=200)

However, doing so does not allow me to access the underlying category dtype anymore, e.g. the following throws an error:

x_2.eval().dtype.categories.size

I believe this is because the values are implicitly stored as dtype int8 instead of the pandas categorical.
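
To make the dtype point concrete, here is a minimal sketch using only pandas and NumPy. The `genre` series is a hypothetical stand-in for train.genre_id, and the labels are made up for illustration. It suggests that `.cat.codes.values` yields a plain integer array, while the category labels remain on the pandas object.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train.genre_id: a small categorical column.
genre = pd.Series(["rock", "jazz", "rock", "pop"], dtype="category")

# .values keeps the pandas Categorical; .cat.codes.values is a plain array.
codes = genre.cat.codes.values
print(type(genre.values).__name__)         # Categorical
print(type(codes).__name__, codes.dtype)   # ndarray int8

# The integer dtype carries no category metadata; that stays on the Series.
print(hasattr(codes.dtype, "categories"))  # False
print(genre.cat.categories.size)           # 3
```

So the number of categories would need to be read from the pandas side (for example `genre.cat.categories.size`) before the codes are handed to anything that only sees the array.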

I’m just confused as to why I can create a shared tensor this way and still access the underlying pandas categorical datatype, but can’t do the same with Minibatch, since, as you said, Minibatch creates a shared tensor under the hood.

I am actually surprised that theano.shared works with a pandas DataFrame. I think the workaround for now is to use a NumPy array and save the mapping of category labels. But I think we can add a condition in the code to make sure it also works with pandas. PR welcome!
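
The workaround described above might look roughly like the sketch below. `encode_categorical` is a hypothetical helper, not part of PyMC3, and the `genre` data is invented for illustration. The pm.Minibatch call is left as a comment so that the snippet runs with pandas and NumPy alone.

```python
import numpy as np
import pandas as pd


def encode_categorical(values):
    """Return (integer codes, category labels) as plain NumPy arrays.

    Hypothetical helper for illustration; not part of PyMC3.
    """
    # pd.Categorical keeps the existing categories if the input is
    # already categorical, and infers them otherwise.
    cat = pd.Categorical(values)
    return np.asarray(cat.codes), np.asarray(cat.categories)


# Hypothetical stand-in for train.genre_id.
genre = pd.Series(["rock", "jazz", "rock", "pop"], dtype="category")

codes, labels = encode_categorical(genre)
n_genres = len(labels)  # size of any per-genre parameter vector in the model

# In the thread's setup, the codes would then be passed on, e.g.:
# x_2 = pm.Minibatch(codes, batch_size=200)

# The saved mapping turns codes back into labels when needed.
# Note: pandas encodes missing values as -1, which this lookup does not handle.
decoded = labels[codes]
print(decoded.tolist())
```

Keeping `labels` next to the model is what replaces the `dtype.categories` lookup that is no longer available on the integer array.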

I would be more than happy to create a PR, but am unsure where to start. I’ll check out the theano.shared object and see if I can replicate what’s done there in the creation of Minibatch.

I suspect that instead of accessing shape directly with the self.shared object I can use the .eval() method to access the internal representation of the vector, but again I’m not sure if that’s the correct thing to do.

Further, is there a wiki page describing how testing for this project is done so that I can ensure I don’t break something already functioning?

Sorry, we don’t have a page explaining the tests. But don’t worry, the devs will help you once you get the PR going. :wink: