Hello,
What is the proper way to specify a theano shared variable that represents a nominal predictor in a given model?
I wish to be able to perform the model inference using the shared variable on the train set and then sample from the posterior predictive (PPC) on the test set.
I anticipate that the shape of the nominal predictor vector will be different in the train data than in the test data. Note that I am currently using a pandas category series for the nominal predictor vector.
See below for the model that I wish to specify:
```python
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    a_0 = pm.Normal('a_0', mu=0.0, tau=1 / (2 ** 2), shape=1)
    a_1_sigma = pm.Gamma('a_1_sigma', 1.64, 0.32)
    a_1 = pm.Normal(
        'a_1', mu=0.0, tau=1 / a_1_sigma ** 2,
        shape=train.content_id.cat.categories.size
    )
    # Parameters for categories
    link_argument = pm.Deterministic(
        'link_argument',
        a_0 + a_1[train.content_id.cat.codes.values]
    )
    omega = pm.Deterministic('omega', pm.invlogit(link_argument))
    kappa = pm.Gamma('kappa', 0.01, 0.01)
    alpha = pm.Deterministic('alpha', omega * kappa + 1)
    beta = pm.Deterministic('beta', (1 - omega) * kappa + 1)
    # Parameter for individuals
    mu = pm.Beta('mu', alpha, beta, shape=train.content_id.cat.codes.size)
    y = pm.Binomial('y', p=mu, n=train.n.values, observed=train.y.values)
    b_0 = pm.Deterministic('b_0', tt.mean(link_argument))
    b_1 = pm.Deterministic(
        'b_1', link_argument[train.content_id.cat.codes.values] - b_0
    )
```
If by "different shape" you mean that the length of the vector differs, then this is a typical use case and you can follow the similar example in the docs.
If you mean that you have new, previously unobserved categories (e.g., new individuals that are not in the training set), then in your model block, instead of using `train.content_id.cat.codes.size` as the shape, you should use the total number of individuals. Also, in that case a hyperprior would help (which you are already doing for `mu`, the parameter for individuals).
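A minimal sketch of that sizing logic, using hypothetical `train`/`test` frames (the `pm.Normal` line is shown as a comment because it only runs inside a model context):

```python
import pandas as pd

# Hypothetical data; in practice these come from your train/test split.
train = pd.DataFrame({'content_id': pd.Series(['a', 'a', 'a', 'b', 'b'],
                                              dtype='category')})
test = pd.DataFrame({'content_id': pd.Series(['c', 'b'], dtype='category')})

# Size the coefficient vector by the union of categories, not train alone:
n_total = pd.concat([train.content_id.astype(str),
                     test.content_id.astype(str)]).nunique()
print(n_total)  # 3 here: 'a', 'b', 'c'

# Inside the model block, a_1 would then be declared as, e.g.:
# a_1 = pm.Normal('a_1', mu=0.0, tau=1 / a_1_sigma ** 2, shape=n_total)
```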
Hi @junpenglao,
I'm not sure I follow. I am definitely interested in the last case, but am unsure of what you mean.
Suppose that `train.content_id = [0, 0, 0, 1, 1]`. There are 2 categories and the length of the vector is 5. Are you saying that instead of specifying the variable `a_1` as

```python
a_1 = pm.Normal(
    'a_1', mu=0.0, tau=1 / a_1_sigma ** 2,
    shape=train.content_id.cat.categories.size
)
```

I should specify the shape as `len(train.content_id)`? How would that work?
Further, do you have any literature covering examples of shared variables with nominal predictors? I could only find examples involving metric predictors here: http://docs.pymc.io/notebooks/posterior_predictive.html.
Thanks!
Nope, what I meant is: specify `a_1` as

```python
a_1 = pm.Normal('a_1', mu=0.0, tau=1 / a_1_sigma ** 2, shape=2)
```

However, if your test set contains a 3rd category, this will break the code, as indexing `a_1[2]` will return an error. In that case, you need to instead specify `a_1` as

```python
a_1 = pm.Normal('a_1', mu=0.0, tau=1 / a_1_sigma ** 2, shape=3)
```
As you can see, if your training data does not contain the new category, it will sample from the prior. Thus, one way to improve is to use a more informative prior, or hyperpriors:

```python
hypermu = ...
hypertau = ...
a_1 = pm.Normal('a_1', mu=hypermu, tau=hypertau, shape=3)
```
It shouldn't make a big difference with nominal predictors; if there are corner cases in terms of indexing a tensor, try transforming the nominal predictor into a matrix and doing a matrix multiplication.
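As a sketch of that matrix approach in plain NumPy (the coefficient values here are made up): indexing into `a_1` and multiplying a one-hot design matrix by `a_1` give the same per-observation values:

```python
import numpy as np

codes = np.array([0, 1, 1, 2, 1])   # encoded nominal predictor
n_cat = 3
a_1 = np.array([0.5, -1.0, 2.0])    # made-up per-category coefficients

one_hot = np.eye(n_cat)[codes]      # (5, 3) one-hot design matrix
by_index = a_1[codes]               # fancy indexing
by_matmul = one_hot @ a_1           # matrix multiplication

assert np.allclose(by_index, by_matmul)
```

The same pattern carries over to theano tensors with `tt.dot`.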
@junpenglao, I see. So, something like

```python
from theano import shared
x_1 = shared(train.content_id)
```

and then `x_1.set_value(test.content_id)` to draw from the posterior, and everything will work as expected with the new category.
How do I access the size of the vector in the underlying tensor, though? Also, it should probably be

```python
x_1 = shared(train.content_id.values)
```

because casting a `pd.Series` to `theano.shared` might not work.
`x_1.eval().shape` would be the easiest way to check for a `theano.shared` variable.
Got it. As another question, how do I maintain the mapping of indices between the test and train datasets for a nominal predictor?
You should do the mapping first, then split the array into training and testing sets. Alternatively, save the mapping somewhere if the testing data is not available yet.
Ok, I see. So I encode the categorical values across the union and use that mapping for the train and test sets.
As an example, if I have as my dataset `['blue', 'green', 'green', 'red', 'green']`, I encode this as `[0, 1, 1, 2, 1]`, then split into `train = [0, 1, 1]` and `test = [2, 1]`, and use the number of unique categories across the union for the shape values?
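That encoding step can be sketched with `pd.Categorical` (the colour names are just the example above):

```python
import numpy as np
import pandas as pd

data = pd.Series(['blue', 'green', 'green', 'red', 'green'])
cat = pd.Categorical(data)          # categories: ['blue', 'green', 'red']
codes = np.asarray(cat.codes)       # [0, 1, 1, 2, 1]

train, test = codes[:3], codes[3:]  # train = [0, 1, 1], test = [2, 1]
n_categories = len(cat.categories)  # 3 -- use this for the shape
```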
Exactly, although you always need to be careful extrapolating to things that are not originally in the model (i.e., test = [2] in this case).
Well, that's precisely a case we want to handle, though. Shouldn't it draw from the prior or do something sensible for that `test = [2]`?
It should, but there is no guarantee that case `[2]` will follow the pattern of the other categories. For example, you can always artificially make the case `[2]` wildly different from the common pattern in the training data as a counterexample. Just something to keep in mind, that's all.
But, in the face of no data for `test = [2]`, shouldn't it just draw from `N(0, 1 / a_1_sigma ** 2)`?
Yes, so if the mu of `test = [2]` is far away from your prior (`N(0, 1 / a_1_sigma ** 2)`), you will not have a good prediction.
But if we have no data for `test = [2]`, then that prior should be our best indicator of what the prediction should be, right? If not, the model itself is the problem, correct?
Ok, thank you for all of your help!