How to predict new values on hold-out data

I apologize – this was a false positive. My colleague just told me that the problem was an error in post-processing, not in the sampling.
I will verify that and let you know.


This is a nice workaround. How can I adapt this solution when working with pm.GLM.from_formula?

Here is my model building block that I prefer not to change:

with pm.Model() as model:
    priors = {
        "Intercept": pm.Normal.dist(mu=start_mu['Intercept'], sigma=start_sig['Intercept']),
        "a": pm.Normal.dist(mu=start_mu['a'], sigma=start_sig['a']),
        "b": pm.Normal.dist(mu=start_mu['b'], sigma=start_sig['b']),
        "c": pm.Normal.dist(mu=start_mu['c'], sigma=start_sig['c']),
        "d": pm.Normal.dist(mu=start_mu['d'], sigma=start_sig['d']),
    }
    family = pm.glm.families.StudentT()
    pm.GLM.from_formula(formula='y ~ a + b + c + d',
                        data=train_df,
                        priors=priors,
                        family=family)
    trace = pm.sample(draws=11000,
                      tune=700,
                      init='advi',
                      start=None,
                      cores=ncores,
                      chains=ncores,
                      random_seed=[123 for _ in range(ncores)],
                      discard_tuned_samples=True,
                      compute_convergence_checks=True)

Thanks!

You just have to move your model’s definition into a function that receives either the training or the test data:

def model_factory(data):
    with pm.Model() as model:
        priors = {
            "Intercept": pm.Normal.dist(mu=start_mu['Intercept'], sigma=start_sig['Intercept']),
            "a": pm.Normal.dist(mu=start_mu['a'], sigma=start_sig['a']),
            "b": pm.Normal.dist(mu=start_mu['b'], sigma=start_sig['b']),
            "c": pm.Normal.dist(mu=start_mu['c'], sigma=start_sig['c']),
            "d": pm.Normal.dist(mu=start_mu['d'], sigma=start_sig['d']),
        }
        family = pm.glm.families.StudentT()
        pm.GLM.from_formula(formula='y ~ a + b + c + d',
                            data=data,
                            priors=priors,
                            family=family)
    return model

with model_factory(train_data):
    trace = pm.sample(draws=11000,
                      tune=700,
                      init='advi',
                      start=None,
                      cores=ncores,
                      chains=ncores,
                      random_seed=[123 for _ in range(ncores)],
                      discard_tuned_samples=True,
                      compute_convergence_checks=True)

with model_factory(test_data):
    ppc = pm.sample_posterior_predictive(trace) #or whatever
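
If it helps, here is a small, hedged sketch of how the posterior predictive samples could be checked against the held-out targets. It assumes the observed variable is named 'y' (pm.GLM.from_formula’s default) and that test_data still carries the true y column; the summary statistics are illustrative, not part of the original answer:

import numpy as np

# Summarize the posterior predictive draws; ppc['y'] has shape
# (n_samples, n_test) because 'y' is the GLM's observed variable.
y_pred = ppc['y']
y_mean = y_pred.mean(axis=0)                             # mean prediction per test row
y_lo, y_hi = np.percentile(y_pred, [2.5, 97.5], axis=0)  # 95% predictive interval

# Compare against the true hold-out targets (assumed to be in test_data['y'])
rmse = np.sqrt(np.mean((y_mean - test_data['y'].values) ** 2))
print(f"hold-out RMSE: {rmse:.3f}")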

Thank you for this! It was super helpful.

  1. Would it be possible to show an example of a hierarchical linear model as well? I’m finding it hard to plot the predictions against the train and test data.
  2. Also, could you point to any article that helps with building and predicting on hierarchical data that’s not necessarily linear?
    Thank you for your time!

Hi @Sam_Anand!
I think the PyMC3 port of Rethinking_2 chapters 13 and 14 (hierarchical models) could help you with that :wink:
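In the meantime, here is a minimal sketch of the same model-factory pattern applied to a varying-intercept hierarchical linear model. The column names (x, y, and an integer group index g) and the prior choices are illustrative assumptions, not taken from your model:

def hierarchical_model_factory(data, n_groups):
    with pm.Model() as model:
        # Hyperpriors shared across all groups (illustrative choices)
        mu_a = pm.Normal('mu_a', mu=0.0, sigma=10.0)
        sigma_a = pm.HalfNormal('sigma_a', sigma=5.0)

        # One intercept per group, a common slope, and a noise scale
        a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=n_groups)
        b = pm.Normal('b', mu=0.0, sigma=10.0)
        sigma = pm.HalfNormal('sigma', sigma=5.0)

        mu = a[data['g'].values] + b * data['x'].values
        # On test data the observed values only fix the shape;
        # sample_posterior_predictive draws fresh predictions.
        pm.Normal('y', mu=mu, sigma=sigma, observed=data['y'].values)
    return model

n_groups = train_data['g'].nunique()

with hierarchical_model_factory(train_data, n_groups):
    trace = pm.sample()

with hierarchical_model_factory(test_data, n_groups):
    ppc = pm.sample_posterior_predictive(trace)

Note that this assumes every group in test_data was already seen in training, so the same n_groups applies to both models.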
PyMCheers :vulcan_salute: