How do we predict on new unseen groups in a hierarchical model in PyMC3?

I posted this previously on Stack Overflow, but realized it might be more appropriate here.

If we have a hierarchical model with data from different sites as different groups in the model, how do we predict on new groups (new sites that we haven’t seen before)?
e.g. using the following logistic regression model:

from pymc3 import Model, sample, Normal, HalfCauchy, Bernoulli
import theano.tensor as tt

with Model() as varying_slope:
    mu_beta = Normal('mu_beta', mu=0., sd=1e5)
    sigma_beta = HalfCauchy('sigma_beta', 5)
    a = Normal('a', mu=0., sd=1e5)
    betas = Normal('b', mu=mu_beta, sd=sigma_beta, shape=(n_features, n_site))
    y_hat = a + tt.dot(X_shared, betas[:, site_shared])
    y_like = Bernoulli('y_like', logit_p=y_hat, observed=train_y)

After we fit this model, we can predict on new data (i.e. sample from the posterior predictive) from a specific site using:

site_to_predict = 1
samples = 500

import numpy as np
import theano

x = tt.matrix('X', dtype='float64')
new_site = tt.vector('new_site', dtype='int32')
n_samples = tt.iscalar('n_samples')
x.tag.test_value = np.empty(shape=(1, X.shape[1]))
new_site.tag.test_value = np.empty(shape=(1,), dtype='int32')

# 'approx' is assumed to be the result of a variational fit, e.g. approx = pm.fit()
_sample_proba = approx.sample_node(varying_slope.y_like.distribution.p,
                                   size=n_samples,
                                   more_replacements={X_shared: x, site_shared: new_site})

sample_proba = theano.function([x, new_site, n_samples], _sample_proba)

pred_test = sample_proba(test_X.reshape(1, -1), np.array(site_to_predict).reshape(-1), samples)

but what is the correct way to sample from the posterior predictive distribution if we have a new, unseen site? Should we just randomly sample sites from a uniform distribution?

First of all, beware of the centered hierarchical parametrization you are using: it may lead to divergences and difficulties while fitting.
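For reference, here is a minimal sketch of the non-centered reparametrization in the notation of your model (the name b_offset is mine; the full example below uses the same trick):

# Centered (what you have): prone to funnel geometry and divergences
#   betas = Normal('b', mu=mu_beta, sd=sigma_beta, shape=(n_features, n_site))
# Non-centered equivalent: draw standard-normal offsets, then shift and scale
b_offset = Normal('b_offset', mu=0., sd=1., shape=(n_features, n_site))
betas = mu_beta + sigma_beta * b_offset  # same prior on betas, easier geometry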

That being said, your model looks more or less like a GLM with priors mu_beta and sigma_beta shared across features and sites. Once you get a posterior distribution over those two, your predictions should look something like

y_hat = a + dot(X_shared, Normal(mu=mu_beta, sigma=sigma_beta))
y_like = Bernoulli('y_like', logit_p=y_hat)

So, we will aim to get that.

The way we usually recommend doing out-of-sample posterior predictive checks is with theano.shared's. I'll use a different approach, inspired by the functional API that is the core design idea for pymc4. There are many differences between pymc3 and the skeleton of pymc4 that I won't go into, but one thing I have started to use more is factory functions to get the Model instances. Instead of trying to define things inside the model with theano.shared's, I just create a new model with the new data and draw posterior predictive samples from it. I recently posted about this here.
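For completeness, the theano.shared route would look roughly like this (a sketch only, reusing the toy variable names from the example below; held_out_X and held_out_sites are placeholders, and in pymc3 the swapped-in data generally has to keep the training shapes, which is one reason I prefer the factory approach):

import theano

# Build the model once against shared containers
X_sh = theano.shared(train_X)
site_sh = theano.shared(train_site_shared)
with pm.Model() as shared_model:
    mu_beta = pm.Normal('mu_beta', mu=0., sd=1)
    sigma_beta = pm.HalfCauchy('sigma_beta', 5)
    a = pm.Normal('a', mu=0., sd=1)
    b = pm.Normal('b', mu=0, sd=1, shape=(n_features, ntrain_site))
    betas = mu_beta + sigma_beta * b
    y_hat = a + tt.dot(X_sh, betas[:, site_sh])
    pm.Bernoulli('y_like', logit_p=y_hat, observed=train_Y)
    trace = pm.sample()

# Swap in held-out data of the same shape, then draw posterior predictive samples
X_sh.set_value(held_out_X)
site_sh.set_value(held_out_sites)
with shared_model:
    ppc = pm.sample_posterior_predictive(trace)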

The idea is to create the model with the training data and sample from it to get a trace. Then you have to extract from the trace the hierarchical part that is shared with the unseen site: mu_beta, sigma_beta and a. Finally, you create a new model using the new data of the test site, and sample from the posterior predictive using a list of dictionaries that hold the mu_beta, sigma_beta and a part of the training trace. Here's a self-contained example:

import numpy as np
import pymc3 as pm
from theano import tensor as tt
from matplotlib import pyplot as plt


def model_factory(X, y, site_shared, n_site, n_features=None):
    if n_features is None:
        n_features = X.shape[-1]
    with pm.Model() as model:
        mu_beta = pm.Normal('mu_beta', mu=0., sd=1)
        sigma_beta = pm.HalfCauchy('sigma_beta', 5)
        a = pm.Normal('a', mu=0., sd=1)
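        # Non-centered parametrization: standard-normal offsets, shifted and scaled below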
        b = pm.Normal('b', mu=0, sd=1, shape=(n_features, n_site))
        betas = mu_beta + sigma_beta * b
        y_hat = a + tt.dot(X, betas[:, site_shared])
        pm.Bernoulli('y_like', logit_p=y_hat, observed=y)
    return model


# First I generate some training X data
n_features = 10
ntrain_site = 5
ntrain_obs = 100
ntest_site = 1
ntest_obs = 1
train_X = np.random.randn(ntrain_obs, n_features)
train_site_shared = np.random.randint(ntrain_site, size=ntrain_obs)
new_site_X = np.random.randn(ntest_obs, n_features)
test_site_shared = np.zeros(ntest_obs, dtype=np.int32)
# Now I generate the training and test y data with a sample from the prior
with model_factory(X=train_X,
                   y=np.empty(ntrain_obs, dtype=np.int32),
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_y_generator:
    train_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]
with model_factory(X=new_site_X,
                   y=np.empty(ntest_obs, dtype=np.int32),
                   site_shared=test_site_shared,
                   n_site=ntest_site) as test_y_generator:
    new_site_Y = pm.sample_prior_predictive(1, vars=['y_like'])['y_like'][0]

# The previous part is just to get some toy data to fit
# Now comes the important parts. First training
with model_factory(X=train_X,
                   y=train_Y,
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_model:
    train_trace = pm.sample()

# Second comes the hold out data posterior predictive
with model_factory(X=new_site_X,
                   y=new_site_Y,
                   site_shared=test_site_shared,
                   n_site=ntrain_site) as test_model:
    # We first have to extract the learnt global effect from the train_trace
    df = pm.trace_to_dataframe(train_trace,
                               varnames=['mu_beta', 'sigma_beta', 'a'],
                               include_transformed=True)
    # We have to supply the samples kwarg because it cannot be inferred if the
    # input trace is not a MultiTrace instance
    ppc = pm.sample_posterior_predictive(trace=df.to_dict('records'),
                                         samples=len(df))
    plt.figure()
    plt.hist(ppc['y_like'], 30)
    plt.axvline(new_site_Y, linestyle='--', color='r')

The posterior predictive I get looks like this:
[Figure: histogram of the posterior predictive samples of y_like for the new site, with new_site_Y marked by a dashed red line]

Of course, I don't know what kind of data you concretely have as your X_shared, site_shared or train_y, so I just made up some toy data at the beginning of the code; you should replace that with your actual data.


Thanks so much, this is very clear! Is it the case then that the corresponding non-centered model using different priors for different features is just given by:


def model_factory(X, y, site_shared, n_site, n_features=None):
    if n_features is None:
        n_features = X.shape[-1]
    with pm.Model() as model:
        mu_beta = pm.Normal('mu_beta', mu=0., sd=1, shape=(n_features, 1))
        sigma_beta = pm.HalfCauchy('sigma_beta', 5, shape=(n_features, 1))
        a = pm.Normal('a', mu=0., sd=1)
        b = pm.Normal('b', mu=0, sd=1, shape=(n_features, n_site))
        betas = mu_beta + sigma_beta * b
        y_hat = a + tt.dot(X, betas[:, site_shared])
        pm.Bernoulli('y_like', logit_p=y_hat, observed=y)
    return model

Yes, your non-centered parametrization is correct.


FYI, the issue regarding deterministics of shared values was resolved by this PR. Now all the deterministics stored in the trace are ignored in sample_posterior_predictive, so you don't have to remove them from the trace by hand.


This was a really helpful post for beginning to understand how to sample the posterior distributions of a hierarchical model. However, I have two questions: one regarding the determination of y_hat for multiple observations, and one about how to sample different levels of a hierarchical model.

  1. With the current construction, the dot product in y_hat produces an n × n matrix for n observations. From my understanding, shouldn't y_hat be the same shape as y, i.e. n × 1?

I've modified this non-centered model, using the HierarchicalLogisticRegression module from Nicole Carlson's "parsing-science" project as a guide, to arrive at the following hierarchical model:

def model_factory(X, y, site_shared, n_site, n_features=None):
    if n_features is None:
        n_features = X.shape[-1]
    with pm.Model() as model:
        mu_beta = pm.Normal('mu_beta', mu=0., sd=1, shape=(n_features,))
        sigma_beta = pm.HalfCauchy('sigma_beta', 5, shape=(n_features,))
        a = pm.Normal('a', mu=0., sd=1)
        b = pm.Normal('b', mu=0, sd=1, shape=(n_site, n_features))
        betas = mu_beta + sigma_beta * b
        y_hat = a + tt.sum(betas[site_shared] * X, 1)
        pm.Bernoulli('y_like', logit_p=y_hat, observed=y)
    return model

Is this an appropriate modification to the original model to create a non-centered hierarchical model using different priors for each feature across multiple observations?

  2. I'm still a little confused about how to sample the posterior distribution for the population level of the hierarchy.

Say you want to make a prediction for unseen data at different levels of the hierarchy. In the first case, you don't know which site your data belongs to and you simply want to sample the population-level parameters for a prediction. Is this when you sample using only ['mu_beta', 'sigma_beta', 'a'] and run sample_posterior_predictive on the model using those traces?

# Population-based prediction: site_shared doesn't matter
with model_factory(X=new_site_X,
                   y=np.empty(ntest_obs, dtype=np.int32),
                   site_shared=np.random.randint(ntrain_site, size=ntest_obs),  # placeholder, the actual site isn't known
                   n_site=ntrain_site) as pop_model:

    df = pm.trace_to_dataframe(train_trace,
                               varnames=['mu_beta', 'sigma_beta', 'a'],
                               include_transformed=True)
    ppc_pop = pm.sample_posterior_predictive(trace=df.to_dict('records'),
                                             samples=1000)

Conversely, say you know the site (site=0) and want to predict using the site-specific posterior distributions. Would you sample like this?

# Site-specific prediction: set site_shared to 0 for each observation and sample from the original trace
with model_factory(X=new_site_X,
                   y=np.empty(ntest_obs, dtype=np.int32),
                   site_shared=np.array([0] * ntest_obs),  # only use betas for site == 0
                   n_site=ntrain_site) as lvl0_model:

    ppc_level0 = pm.sample_posterior_predictive(trace=train_trace,  # use all levels of the trace
                                                samples=1000)

@zult,
For #1 and #2, you are correct: the dot product doesn't work because betas[:, site_shared] now has shape (n_features, n_observations).

As I understand it, by removing the group offset parameter ('b' in @lucianopaz's example) from the trace, you end up effectively sampling the group offset from its prior (a pm.Normal(0, 1)), which will generally be around 0. That drops out the sigma_beta "group spread" term, leaving betas as mu_beta. Very nice :)
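(A rough sketch of the mechanics as I understand them, in the notation of the example above: any variable missing from the supplied trace records is drawn fresh from its prior by sample_posterior_predictive, conditioned on the values that are present.)

# Roughly what happens for each trace record {mu_beta, sigma_beta, a}:
# 'b' is not in the record, so it is drawn from its prior each time
b = np.random.normal(0., 1., size=(n_features, n_site))  # centered at 0
betas = mu_beta + sigma_beta * b   # ~ mu_beta on average, plus group-spread uncertainty
y_hat = a + X.dot(betas[:, site_shared])
# and then y_like ~ Bernoulli(logit_p=y_hat)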

Assuming you are using theano.shared's, is there any problem with just reusing the model after the sample step, instead of re-instantiating a new model via the model factory? Like so:

with model_factory(X=train_X,
                   y=train_Y,
                   site_shared=train_site_shared,
                   n_site=ntrain_site) as train_model:
    train_trace = pm.sample()
    df = pm.trace_to_dataframe(train_trace,
                               varnames=['mu_beta', 'sigma_beta', 'a'],
                               include_transformed=True)
    #SET YOUR SHARED INPUTS HERE, then...
    ppc = pm.sample_posterior_predictive(trace=df.to_dict('records'),
                                         samples=len(df))

I'm assuming everything pm.sample produces is contained in the trace object, and that a fresh model from the model factory behaves the same as the model after pm.sample()?

I am trying to adapt the toy example to my own dataset. It seems that sampling time increases sharply with the number of sites and observations.
@lucianopaz's toy setup indeed samples quite fast (08:52, 4.44 draws/s). However, my dataset has ~100 observations per group. I tried to run @lucianopaz's code with the following setup:

n_features = 6
ntrain_site = 5    # Although I have more groups. But I am testing with just five
ntrain_obs = 500   # 100 observations per group totaling 500 observations 
ntest_site = 1
ntest_obs = 1

The above setup took 2 hours (1.3 s/draw) to complete sampling.

Further increasing the number of groups/observations makes sampling even more time-consuming. What is the source of this slowdown, and how could it be improved if I have many groups and observations in my training data?

This example is helpful; thanks, @lucianopaz.

Now that pm.trace_to_dataframe has been removed in pymc, what's the new way to do the following? Is there anything subtle needed for the include_transformed part?

    df = pm.trace_to_dataframe(train_trace,
                               varnames=['mu_beta', 'sigma_beta', 'a'],
                               include_transformed=True)

The prediction section of the tutorial on multilevel models says that one can make predictions on existing and new groups. But it only gives an example of predicting on an existing group. It would be neat to add an example of predicting on a new group to that tutorial.


Completely agree. Do you want to open an issue asking for that here? GitHub - pymc-devs/pymc-examples: Examples of PyMC models, including a library of Jupyter notebooks.

Good idea, done! Here's the new issue.