Hello,
What is the proper way to specify a theano shared variable that represents a nominal predictor in a given model?
I wish to be able to perform the model inference using the shared variable on the train set and then sample from the posterior predictive (PPC) on the test set.
I anticipate that the shape of the nominal predictor vector will be different in the train data than in the test data. Note that I am currently using a pandas category series for the nominal predictor vector.
See below for the model that I wish to specify:
```python
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    a_0 = pm.Normal('a_0', mu=0.0, tau=1 / (2 ** 2), shape=1)
    a_1_sigma = pm.Gamma('a_1_sigma', 1.64, 0.32)
    a_1 = pm.Normal(
        'a_1', mu=0.0, tau=1 / a_1_sigma ** 2,
        shape=train.content_id.cat.categories.size
    )
    # Parameters for categories
    link_argument = pm.Deterministic(
        'link_argument',
        a_0 + a_1[train.content_id.cat.codes.values]
    )
    omega = pm.Deterministic('omega', pm.invlogit(link_argument))
    kappa = pm.Gamma('kappa', 0.01, 0.01)
    alpha = pm.Deterministic('alpha', omega * kappa + 1)
    beta = pm.Deterministic('beta', (1 - omega) * kappa + 1)
    # Parameter for individuals
    mu = pm.Beta('mu', alpha, beta, shape=train.content_id.cat.codes.size)
    y = pm.Binomial('y', p=mu, n=train.n.values, observed=train.y.values)
    b_0 = pm.Deterministic('b_0', tt.mean(link_argument))
    b_1 = pm.Deterministic(
        'b_1', link_argument[train.content_id.cat.codes.values] - b_0
    )
```
If by "different shape" you mean that the length of the vector differs, then this is a typical use case and you can follow the similar example in the docs.
If you mean that you have new, previously unobserved categories (e.g., new individuals that are not in the training set), then in your model block, instead of using `train.content_id.cat.codes.size` as the shape, you should use the total number of individuals. Also, in that case a hyperprior would help (which you are already doing for `mu`, the parameter for individuals).
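A minimal sketch of that sizing logic, using hypothetical `train`/`test` frames (the `pm.Normal` line is shown as a comment because it only runs inside a model context):

```python
import pandas as pd

# Hypothetical data; in practice these come from your train/test split.
train = pd.DataFrame({'content_id': pd.Series(['a', 'a', 'a', 'b', 'b'],
                                              dtype='category')})
test = pd.DataFrame({'content_id': pd.Series(['c', 'b'], dtype='category')})

# Size the coefficient vector by the union of categories, not train alone:
n_total = pd.concat([train.content_id.astype(str),
                     test.content_id.astype(str)]).nunique()
print(n_total)  # 3 here: 'a', 'b', 'c'

# Inside the model block, a_1 would then be declared as, e.g.:
# a_1 = pm.Normal('a_1', mu=0.0, tau=1 / a_1_sigma ** 2, shape=n_total)
```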
Hi @junpenglao,
I'm not sure I follow. I am definitely interested in the last case, but am unsure of what you mean.
Suppose that `train.content_id = [0, 0, 0, 1, 1]`. There are 2 categories and the length of the vector is 5. Are you saying that instead of specifying the variable `a_1` as

```python
a_1 = pm.Normal(
    'a_1', mu=0.0, tau=1 / a_1_sigma ** 2,
    shape=train.content_id.cat.categories.size
)
```

I should specify the shape as `len(train.content_id)`? How would that work?
Further, do you have any literature covering examples of shared variables with nominal predictors? I could only find examples involving metric predictors here: http://docs.pymc.io/notebooks/posterior_predictive.html.
Thanks!
Nope, what I meant is: specify `a_1` as

```python
a_1 = pm.Normal('a_1', mu=0.0, tau=1 / a_1_sigma ** 2, shape=2)
```

However, if your test set contains a 3rd category, this will break the code, as indexing `a_1[2]` will return an error. In that case, you need to instead specify `a_1` as

```python
a_1 = pm.Normal('a_1', mu=0.0, tau=1 / a_1_sigma ** 2, shape=3)
```
As you can see, if your training data does not contain the new category, it will sample from the prior. Thus, one way to improve is to use a more informative prior, or hyperpriors:

```python
hypermu = ...
hypertau = ...
a_1 = pm.Normal('a_1', mu=hypermu, tau=hypertau, shape=3)
```
It shouldn't make a big difference with nominal predictors; if there are corner cases in terms of indexing a tensor, try transforming the nominal predictor into a matrix and doing a matrix multiplication.
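As a sketch of that matrix approach in plain NumPy (the coefficient values here are made up): indexing into `a_1` and multiplying a one-hot design matrix by `a_1` give the same per-observation values:

```python
import numpy as np

codes = np.array([0, 1, 1, 2, 1])   # encoded nominal predictor
n_cat = 3
a_1 = np.array([0.5, -1.0, 2.0])    # made-up per-category coefficients

one_hot = np.eye(n_cat)[codes]      # (5, 3) one-hot design matrix
by_index = a_1[codes]               # fancy indexing
by_matmul = one_hot @ a_1           # matrix multiplication

assert np.allclose(by_index, by_matmul)
```

The same pattern carries over to theano tensors with `tt.dot`.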
@junpenglao, I see. So, something like

```python
from theano import shared
x_1 = shared(train.content_id)
```

and then `x_1.set_value(test.content_id)` to draw from the posterior, and everything will work as expected with the new category.
How do I access the size of the vector in the underlying tensor, though? Also, it should probably be

```python
x_1 = shared(train.content_id.values)
```

because casting a `pd.Series` to `theano.shared` might not work.
`x_1.eval().shape` would be the easiest way to check for a `theano.shared` variable.
Got it. As another question, how do I maintain the mapping of indices between the test and train datasets for a nominal predictor?
You should do the mapping first, then split the array into training and testing sets. Alternatively, save the mapping somewhere if the testing data is not available yet.
Ok, I see. So I encode the categorical values across the union and use that mapping for the train and test sets.
As an example, if I have as my dataset `['blue', 'green', 'green', 'red', 'green']`, I encode this as `[0, 1, 1, 2, 1]`, then split into `train = [0, 1, 1]` and `test = [2, 1]`, and use the number of unique categories across the union for the shape values?
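That encoding step can be sketched with `pd.Categorical` (the colour names are just the example above):

```python
import numpy as np
import pandas as pd

data = pd.Series(['blue', 'green', 'green', 'red', 'green'])
cat = pd.Categorical(data)          # categories: ['blue', 'green', 'red']
codes = np.asarray(cat.codes)       # [0, 1, 1, 2, 1]

train, test = codes[:3], codes[3:]  # train = [0, 1, 1], test = [2, 1]
n_categories = len(cat.categories)  # 3 -- use this for the shape
```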
Exactly, although you always need to be careful extrapolating to things that are not originally in the model (i.e., test = [2] in this case).
Well, that's precisely a case we want to handle, though. Shouldn't it draw from the prior or do something sensible for that `test = [2]`?
It should, but there is no guarantee that case `[2]` will follow the pattern of the other categories. For example, you can always artificially make the case `[2]` wildly different from the common pattern in the training data as a counterexample. Just something to keep in mind, that's all.
But, in the face of no data for `test = [2]`, shouldn't it just draw from `N(0, 1 / a_1_sigma ** 2)`?
Yes, so if the mu of `test = [2]` is far away from your prior (`N(0, 1 / a_1_sigma ** 2)`), you will not have a good prediction.
But if we have no data for `test = [2]`, then that prior should be our best indicator of what the prediction should be, right? If not, the model itself is the problem, correct?
Ok, thank you for all of your help!