What is the proper way to include categorical data within the signature call to pm.Minibatch()?
I have a dataframe that contains predictors of the type pandas.core.categorical.Categorical. I currently have a model set up to accept theano.shared variables containing these predictor vectors and can then sample and draw from the posterior.
I am creating the shared tensor as follows:
x_1 = shared(train.genre_id.values)
and accessing the number of predictors and codes associated to the predictors to use in the model specification as follows:
x_1.eval().dtype.categories.size
x_1.eval().dtype.codes
However, when I create the vector x_1 with pm.Minibatch(), it returns an AttributeError stating that
AttributeError: 'SharedVariable' object has no attribute 'shape'
I looked at the code in pymc3.data and it looks like it’s creating the shared tensor with the passed data so I’m wondering if there is any way I can have the same behavior with Minibatch that I am getting from using theano.shared.
You cannot create a Minibatch out of a theano.shared variable, because Minibatch is itself a theano.shared.
Instead, for your use case you can try the following:
x_1 = Minibatch(train.genre_id.values)
...
# set up the model and do the fitting
...
# replace value for prediction
x_1.set_value(test.genre_id.values)
Right, but that exact call produces the AttributeError described.
What is the output of train.genre_id.values? It works fine, e.g. below:
import numpy as np
import pandas as pd
import pymc3 as pm
train = pd.DataFrame(data=dict(genre_id=np.random.choice(10, size=100),
                               x=np.random.randn(100)))
x_1 = pm.Minibatch(train.genre_id.values)
The output is a pandas.core.categorical.Categorical:
I see, try coding it as a numpy array then.
I can create the Minibatch object using the coded values in a numpy array as follows:
x_2 = pm.Minibatch(train.genre_id.cat.codes.values, batch_size=200)
However, doing so does not allow me to access the underlying category dtype anymore, e.g. the following throws an error:
x_2.eval().dtype.categories.size
I believe this is because the values are implicitly stored as dtype int8 instead of the pandas categorical.
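To illustrate the point about the dtype, here is a minimal pandas-only sketch. The `genre_id` series and its labels are hypothetical stand-ins for `train.genre_id`, and no PyMC3 or Theano call is involved; it only shows that the array returned by `.cat.codes.values` is a plain integer array, while the category information stays on the pandas side.

```python
import pandas as pd

# Hypothetical stand-in for train.genre_id (labels are made up for illustration)
genre_id = pd.Series(["rock", "jazz", "rock", "pop"], dtype="category")

# Plain numpy integer array: this is what would be handed to pm.Minibatch
codes = genre_id.cat.codes.values

# The category information lives on the pandas object, not on the codes array
n_categories = genre_id.cat.categories.size
codes_know_categories = hasattr(codes.dtype, "categories")

print(codes.dtype, n_categories, codes_know_categories)
```

Under these assumptions, the number of categories would have to be read from the pandas object before the codes are passed on, rather than from the evaluated tensor.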
I’m just confused as to why I can create a shared tensor this way and access the underlying pandas categorical datatype but can’t with Minibatch since as you said Minibatch creates a shared tensor under-the-hood.
I am actually surprised that theano.shared works with a pandas Categorical. I think the workaround for now is to use a numpy array and save the mapping of category labels. But I think we can add a condition in the code to make sure it also works with pandas. PR welcome!
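A rough sketch of that workaround, using pandas and numpy only: keep the category index next to the integer codes, reuse it when encoding the test data, and use it to decode codes back to labels. The series and the variable names (`train_genre`, `test_genre`, `categories`) are hypothetical and chosen for illustration; the resulting code arrays are what one would pass to pm.Minibatch and later to set_value.

```python
import pandas as pd

# Hypothetical stand-ins for train.genre_id and test.genre_id
train_genre = pd.Series(["rock", "jazz", "rock", "pop"], dtype="category")
test_genre = pd.Series(["pop", "rock"])

# Save the mapping once: position in `categories` is the integer code
categories = train_genre.cat.categories
train_codes = train_genre.cat.codes.values

# Encode the test data with the SAME categories so the codes line up
test_codes = pd.Categorical(test_genre, categories=categories).codes

# Decode integer codes back to labels using the saved mapping
decoded = categories[train_codes]

print(len(categories), list(test_codes), list(decoded))
```

Reusing the training categories for the test data matters here: encoding the test column on its own could assign different integers to the same label.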
I would be more than happy to create a PR, but am unsure where to start. I’ll check out the theano.shared object and see if I can replicate what’s done there in the creation of Minibatch.
I suspect that instead of accessing shape directly with the self.shared object I can use the .eval() method to access the internal representation of the vector, but again I’m not sure if that’s the correct thing to do.
Further, is there a wiki page describing how testing for this project is done so that I can ensure I don’t break something already functioning?
Sorry, we don’t have a page explaining the tests. But don’t worry, the devs will help you once you get the PR going.