"Out of sample" predictions with the GLM sub-module

I don’t know if I’m using the right vocabulary here but I want to use a model I’m fitting with GLM to give me the posterior predictive distribution of a variable that was not observed - equivalent to a train/test split. Is there a way to do this and continue using the compact syntax of the GLM sub-module?

Yes, you can feed a theano.shared X and y for fitting/sampling, and then replace the test value for prediction. For more information see:

Hi @junpenglao is there an example of this using the GLM module?

I tried creating a shared variable for a model using GLM, received this error:

PatsyError: Error evaluating factor: TypeError: The generic 'SharedVariable' object is not subscriptable. This shared variable contains an object of type: <class 'pandas.core.frame.DataFrame'>. Did you forget to cast it into a Numpy array before calling theano.shared()?

The variable is a dataframe (and the model runs fine without the shared variable).


I don't think that is possible for a pandas DataFrame. You can try passing the shared arrays to x and y specifically, but I am not sure it would work.


Just to be sure about the outcome of this discussion, as I am currently facing the same issue: can we use shared Theano variables to make OOS predictions with GLM?
I have tried several configurations without success…
Thank you in advance.


This issue does not appear to have been resolved as yet. The following snippet…

pca_embed = pca.fit_transform(X_train)
pca_data = dict(x=pca_embed.astype(np.float32), y=shared(y_train.reshape(-1)))

with Model() as model:
    glm.GLM.from_formula('y ~ x', pca_data)

… gives me the following error:

ValueError: length not known: <TensorType(float64, vector)> [id A]

Since y is not a numpy array, I assume that is what causes the error. It would be useful if this could be fixed, to combine the simplicity of the GLM module with the ability to easily perform out-of-sample prediction.

Is there a workaround? Should I raise an issue on Github?

This was so frustrating for me too! I tried theano.shared and had issues, and was almost ready to give up when I found a workaround using pm.Data instead. Note that you have to pass the labels when you call pm.glm.GLM:

with pm.Model() as logistic_model:
    pred = pm.Data("pred", X_train)
    pm.glm.GLM(pred, y_train, family=pm.glm.families.Binomial(),
               labels=X_train.columns.tolist())
    trace = pm.sample(1000, tune=2000, init='adapt_diag', cores=1)

If you just do pred = pm.Data("pred", X_train) without the labels, you'll get an error; I think you have to add labels=X_train.columns.tolist() to pm.glm.GLM.

This should allow you to sample from the posterior predictive distribution:

pm.set_data({"pred": X_test}, model=logistic_model)
ppc = pm.sample_ppc(trace, model=logistic_model, samples=500)
predValues = np.rint(ppc['y'].mean(axis=0)).astype('int')

Hey all! Thank you for your input! I had the same question, in particular about using patsy formulas to generate out-of-sample predictions. I was not able to do it directly, and instead had to generate the features outside the model. I wrote a little post explaining the procedure (I took your suggestions and referenced them accordingly): https://juanitorduz.github.io/glm_pymc3/

@juanitorduz Your post was very helpful! You should consider adding the notebook procedure to the pymc3 docs.

Thanks! I am happy you found it useful! I created an issue about it to scope the content of the example.