Testing a hierarchical model on unseen data

Theano shared variables with ppc seem to be the preferred way to test your model with unseen data. But there are only very basic examples of how to use them.

For example, the last section of this notebook shows how to make a prediction by specifying a specific index for the county of a new home.

St Louis county prediction

yhat_stl = Normal('yhat_stl', mu=a[69] + b, tau=tau_y)

But this seems very inefficient for testing a lot of unseen data.

Taking the radon-by-county data as an example, how could we fit a hierarchical model with shared variables and then sample a ppc for a dataset of unseen data, where each data point (a home) is associated with a particular county?


I think the best approach is to first work out how to express the model as linear functions: y_hat = X*beta + Z*b. Prediction is then done by setting new values on the Theano shared variables and running sample_ppc.

So instead of doing:
y_hat = a[county]
transform county into a design matrix Xcounty (you can use patsy, from patsy import dmatrices; see some examples here), and do y_hat = tt.dot(Xcounty, a)
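To make the equivalence concrete, here is a minimal NumPy sketch (not PyMC3 or Theano code) showing that a one-hot design matrix times the vector of county intercepts gives the same result as indexing, and that the same matrix handles many unseen homes and many draws at once. All numbers are made-up placeholders: `a_draws`, `b_draws` and `sigma_draws` stand in for posterior samples from a fitted model, and `floor_new` is a hypothetical floor indicator for the new homes.

```python
import numpy as np

rng = np.random.default_rng(0)

n_counties = 4
# Hypothetical county index and floor indicator for six unseen homes.
county_new = np.array([0, 2, 2, 3, 1, 0])
floor_new = np.array([0, 1, 0, 1, 1, 0])

# One-hot design matrix: row i has a single 1 in column county_new[i].
X_county = np.eye(n_counties)[county_new]

# Placeholder per-county intercepts (stand-in for one draw of `a`).
a = np.array([1.2, 0.8, 1.5, 1.0])

# The dot product picks out the same intercepts as indexing does.
assert np.allclose(X_county @ a, a[county_new])

# Placeholder "posterior draws"; a real workflow would take these from the trace.
n_draws = 500
a_draws = rng.normal(loc=a, scale=0.1, size=(n_draws, n_counties))
b_draws = rng.normal(loc=-0.6, scale=0.05, size=n_draws)
sigma_draws = np.abs(rng.normal(loc=0.7, scale=0.02, size=n_draws))

# Mean for every (draw, home) pair in one shot: shape (n_draws, n_homes).
mu = a_draws @ X_county.T + b_draws[:, None] * floor_new

# One predictive draw per (draw, home) pair.
y_pred = rng.normal(mu, sigma_draws[:, None])
print(y_pred.shape)  # (500, 6)
```

In the actual model the design matrix would live in a shared variable, so swapping in the matrix for the unseen homes only changes the data and not the model definition.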

I will investigate design matrices; this is something I was not aware of. Thank you!

In your code example, this would be the relevant line, right?

_, L = dmatrices('tempresp ~ -1+subj', data=tbltest, return_type='matrix')

Where you don't care about the _ variable, and L is the design matrix for 'subj', which is your categorical variable?

Yep exactly :wink:
You can also use one-hot encoding from scikit-learn for a similar purpose.
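A rough sketch of the scikit-learn route, assuming scikit-learn is installed. The county names are illustrative only. The point is to fit the encoder on the training counties so that the columns of the matrix for unseen homes line up with the county intercepts learned by the model.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative county labels for the training homes and the unseen homes.
county_train = np.array(["AITKIN", "ANOKA", "ST LOUIS", "ANOKA"])
county_new = np.array(["ST LOUIS", "AITKIN", "ST LOUIS"])

# Fit on the training counties so the column order is fixed by the training data.
# handle_unknown="ignore" gives an all-zero row for a county never seen in training.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(county_train.reshape(-1, 1))

# The default output is a sparse matrix; convert it to a dense array.
X_new = encoder.transform(county_new.reshape(-1, 1)).toarray()

print(encoder.categories_[0])  # column order of the design matrix
print(X_new)
```

A county that was not in the training data has no fitted intercept, so an all-zero row would silently drop the county term for that home; it may be worth checking for such rows before predicting.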