"Out of sample" predictions with the GLM sub-module

gsmafra · January 23, 2018, 12:27pm

I don’t know if I’m using the right vocabulary here but I want to use a model I’m fitting with GLM to give me the posterior predictive distribution of a variable that was not observed - equivalent to a train/test split. Is there a way to do this and continue using the compact syntax of the GLM sub-module?

junpenglao · January 23, 2018, 2:53pm

Yes, you can wrap X and y in theano.shared variables for fitting/sampling, and then swap in the test values (set_value) for prediction. For more information see:
http://docs.pymc.io/notebooks/posterior_predictive.html#Prediction
http://docs.pymc.io/notebooks/api_quickstart.html#4.1-Predicting-on-hold-out-data

benyi-mikara · July 13, 2018, 6:47am

Hi @junpenglao is there an example of this using the GLM module?

I tried creating a shared variable for a model using GLM, received this error:

PatsyError: Error evaluating factor: TypeError: The generic ‘SharedVariable’ object is not subscriptable. This shared variable contains an object of type: <class ‘pandas.core.frame.DataFrame’>. Did you forget to cast it into a Numpy array before calling theano.shared()?

The variable is a dataframe (and the model runs fine without the shared variable).

thanks

junpenglao · July 13, 2018, 10:02am

I don't think that is possible with a pandas DataFrame. You can try passing the shared arrays as x and y specifically, but I am not sure it would work as:

jean-phi66 · October 6, 2018, 4:59pm

Hello,

Just to be sure about the outcome of this discussion, as I am currently facing the same issue: can we use theano shared variables to make out-of-sample (OOS) predictions with GLM?
I have tried several configurations without success…
Thank you in advance.

DoctorRad · May 9, 2019, 3:08pm

This issue does not appear to have been resolved as yet. The following snippet…

pca_embed = pca.fit_transform(X_train)
pca_data = dict(x=pca_embed.astype(np.float32), y=shared(y_train.reshape(-1)))

with Model() as model:
    glm.GLM.from_formula('y ~ x', pca_data)

… gives me the following error:

ValueError: length not known: <TensorType(float64, vector)> [id A]

Since y is not a numpy array, I assume this is causing the error. It would be useful if this could be fixed to allow the simplicity of using GLM with the ability to easily perform out-of-sample prediction.

Is there a workaround? Should I raise an issue on Github?

Nicky · December 2, 2020, 4:32am

This was so frustrating for me too! I tried theano.shared, ran into issues, and was about to give up when I found what I think is a workaround using pm.Data. You have to pass the labels when you call pm.glm.GLM…

with pm.Model() as logistic_model:
    pred = pm.Data("pred", X_train)
    pm.glm.GLM(pred, y_train, family=pm.glm.families.Binomial(), labels=X_train.columns.tolist())
    trace = pm.sample(1000, tune=2000, init="adapt_diag", cores=1)

If you only do pred = pm.Data("pred", X_train) you'll get an error; I think you also have to pass labels=X_train.columns.tolist() to pm.glm.GLM.

This should allow you to sample from the posterior predictive distribution:
pm.set_data({"pred": X_test}, model=logistic_model)
# pm.sample_ppc is the older name of pm.sample_posterior_predictive
ppc = pm.sample_posterior_predictive(trace, model=logistic_model, samples=500)
predValues = np.rint(ppc["y"].mean(axis=0)).astype("int")
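
To make the last step concrete, here is a minimal NumPy-only sketch of turning posterior predictive draws into class labels. The array fake_ppc_y is a made-up stand-in for the posterior predictive draws of y, assumed to have shape (draws, observations) with 0/1 entries; no PyMC3 call is involved and the probabilities are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the posterior predictive draws of y:
# 500 draws for 4 held-out rows, each row simulated with its own
# (made-up) success probability.
true_p = np.array([0.05, 0.30, 0.70, 0.95])
fake_ppc_y = rng.binomial(n=1, p=true_p, size=(500, 4))

# Averaging over the draw axis estimates P(y = 1) for each held-out row.
pred_prob = fake_ppc_y.mean(axis=0)

# Rounding turns each probability into a hard 0/1 label. np.rint rounds
# exact ties to the nearest even value, so a probability of exactly 0.5
# becomes 0.
pred_values = np.rint(pred_prob).astype("int")
```

Keeping the averaged probabilities around, rather than only the rounded labels, is probably worthwhile if you want a decision threshold other than 0.5.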

juanitorduz · January 3, 2021, 10:52am

Hey all! Thank you for your input! I had the same question, in particular about using patsy formulas to generate out-of-sample predictions. I was not able to do it directly and had to generate the features outside the model. I wrote a little post explaining the procedure (I took your suggestions and referenced them accordingly): https://juanitorduz.github.io/glm_pymc3/
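
The idea of generating the features outside the model can be sketched without any formula machinery. This is only an illustration and not the code from the linked post: make_features is a hypothetical helper that hand-builds the columns a formula such as 'y ~ x1 * x2' would expand to (two main effects plus their interaction, assuming the model adds its own intercept), and the data are random placeholders.

```python
import numpy as np


def make_features(x):
    """Build the columns [x1, x2, x1*x2] from a two-column array.

    Hypothetical helper: it stands in for whatever feature
    construction a formula would otherwise do inside the model.
    """
    x = np.asarray(x, dtype=float)
    x1, x2 = x[:, 0], x[:, 1]
    return np.column_stack([x1, x2, x1 * x2])


rng = np.random.default_rng(42)
x_raw = rng.normal(size=(100, 2))  # placeholder predictors

# Simple 70/30 train/test split of the placeholder data.
x_train_raw, x_test_raw = x_raw[:70], x_raw[70:]

# Applying the same function to both splits guarantees that train and
# test share the same columns in the same order.
x_train = make_features(x_train_raw)
x_test = make_features(x_test_raw)
labels = ["x1", "x2", "x1:x2"]
```

Arrays like x_train could then be handed to pm.Data, with labels passed to pm.glm.GLM as in the workaround above, and x_test swapped in with pm.set_data before sampling the posterior predictive.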

cstoafer · February 3, 2021, 9:26pm

@juanitorduz Your post was very helpful! You should consider adding the notebook procedure to the pymc3 docs.

juanitorduz · February 4, 2021, 8:48am

Thanks! I am happy you found it useful! I created an issue about it to scope the content of the example.
