Prediction using sample_ppc in Hierarchical model

Github issue #2352 by nikzadb:

I have the attached data and following Hierarchical model (as a toy example of another model) and trying to draw posterior samples from it (of course to predict new values).
new_data.txt (1.5 KB)

Hierarchical model:

import numpy as np
import pandas as pd
import pymc3 as pm3
from theano import shared
import theano.tensor as tt
import matplotlib.pylab as plt

%matplotlib inline

new_data = pd.read_csv('new_data.csv', sep=',')

train = new_data.iloc[indexes[:-15], :]
test = new_data.iloc[indexes[-15:], :]

masked_x1 = shared(train.x1)
masked_x2 = shared(train.x2)
idx = train.idx.values.astype(int)
num_items = len(idx)
with pm3.Model() as model:
    intercept_prior = pm3.Normal('intercept_prior',
                                 mu=20,
                                 sd=100,
                                 shape=num_items)

    mu_x1 = pm3.Normal('mu_x1', mu=0, sd=20)
    std_x1 = pm3.Uniform('std_x1', lower=0, upper=100)
    x1_prior = pm3.Normal('x1_prior',
                                    mu=mu_x1,
                                    sd=std_x1,
                                    shape=num_items)

    std_x2 = pm3.Uniform('std_x2', lower=0, upper=100)
    x2_prior = pm3.HalfNormal('x2_prior',
                                    sd=std_x2,
                                    shape=num_items)

    y_est = (
        intercept_prior[idx] +
        x1_prior[idx] * masked_x1.get_value() +
        x2_prior[idx] * masked_x2.get_value()
    )

    y_est = pm3.Deterministic('y_est', y_est)

    pm3.Normal('y_pred',
               mu=y_est,
               sd=100,
               observed=train.y)    
    start, sds, elbo = pm3.variational.advi(n=20000)
    step = pm3.Metropolis()
    trace = pm3.sample(10000, start=start, step=step)

varnames = ['mu_x1', 'std_x1', 'std_x2']
pm3.traceplot(trace[-1000:], varnames)
pm3.plot_posterior(trace[-1000:], varnames)

in which the size of trace[‘y_pred’] is (10000, 43) since there are 10000 traces and 43 data in train set.

Prediction

masked_x1.set_value(test.x1)
masked_x2.set_value(test.x2) 
ppc = pm3.sample_ppc(trace, model=model, samples=500)

What I expect here is the shape of ppc[‘y_pred’] be (500, 15), where 15 is the size of test dataset, but instead I get (500, 43) which 43 is the size of train dataset.
(I followed the suggestions at https://github.com/pymc-devs/pymc3/issues/1001 and ran the hierarchical model for rugby prediction example. For this example, everything is fine but can not figure out why the above model still goes wrong.)

In your model the item specific effect is indexing through [idx], so you want the number of coefficients equal to the number of unique items:
num_items = len(idx) --> num_items = len(np.unique(idx))
and the indexing should be correspondent to the input data in

y_est = (
        intercept_prior[idx] +
        x1_prior[idx] * masked_x1.get_value() +
        x2_prior[idx] * masked_x2.get_value()
    )

the [idx] above should be correctly correspondent to masked_x1 and masked_x2

Try the code below, it will give the correct format of output now:

new_data = pd.read_csv('new_data.txt', sep='\t')

idx = np.asarray(new_data.idx)
num_items = len(np.unique(idx))

train = new_data[:-15]
test = new_data[-15:]

masked_x1 = shared(np.asarray(train.x1))
masked_x2 = shared(np.asarray(train.x2))
masked_idx = shared(np.asarray(train.idx))

with pm3.Model() as model:
    intercept_prior = pm3.Normal('intercept_prior',
                                 mu=20,
                                 sd=100,
                                 shape=num_items)

    mu_x1 = pm3.Normal('mu_x1', mu=0, sd=20)
    std_x1 = pm3.Uniform('std_x1', lower=0, upper=100)
    x1_prior = pm3.Normal('x1_prior',
                                    mu=mu_x1,
                                    sd=std_x1,
                                    shape=num_items)

    std_x2 = pm3.Uniform('std_x2', lower=0, upper=100)
    x2_prior = pm3.HalfNormal('x2_prior',
                                    sd=std_x2,
                                    shape=num_items)

    y_est = pm3.Deterministic('y_est',
        intercept_prior[masked_idx] +
        x1_prior[masked_idx] * masked_x1 +
        x2_prior[masked_idx] * masked_x2
    )

    pm3.Normal('y_pred',
               mu=y_est,
               sd=100,
               observed=train.y)    
    trace = pm3.sample(3000, njobs=2)

varnames = ['mu_x1', 'std_x1', 'std_x2']
pm3.traceplot(trace[2000:], varnames)
pm3.plot_posterior(trace[2000:], varnames)
#%%
masked_x1.set_value(np.asarray(test.x1))
masked_x2.set_value(np.asarray(test.x2))
masked_idx.set_value(np.asarray(test.idx))
ppc = pm3.sample_ppc(trace, model=model, samples=500)

Also, some hyper prior on intercept_prior would also improve the model.

1 Like

Thanks @junpenglao. It was great and useful.

Well I have another question about how the final trace in PyMC3 works (another application). I used hpd() function to find 95% interval but almost all true values are outside of this interval. What that means? how can I improve the results?

Before running the model, I standardized all columns and then reversed to the original sale (using scikit-learn MinMaxScaler)

bd = trace[-1000:][‘y_est’]
predictions = pd.DataFrame()
predictions[‘pred’] = bd.mean(axis=0)
predictions = predictions[~predictions.pred.isnull()]
predictions[‘y’] = new_data.y
predictions[‘residual’] = predictions.pred - predictions.y

bd_hpd = pm3.hpd(bd)

predictions[‘prctile_95’] = bd_hpd[:, 1]
predictions[‘prctile_5’] = bd_hpd[:,0]

Hi @junpenglao,

It’s Charles again. I am hitting a problem with sample_ppc that hierarchical multinomial/categorical model we were talking about yesterday.

I followed your suggestions:

#set in shared variables for prediction and testing later
input_var = theano.shared(model_data.values)
target_var = theano.shared(all_data['idx'].values)
masked_idx = theano.shared(asarray(labels))

#Test the model
input_var.set_value(test_input.values)
masked_idx.set_value(test_input['label'].values)
n_test = test_input.values.shape[0]

#have to put target_var that has same length as test_input
#just fake values because this is what I want to predict
target_var.set_value(ones(n_test).astype('int32'))

ppc = pm.sample_ppc(trace, model=multinomial_model)

I get this error:

ppc = pm.sample_ppc(trace, model=multinomial_model)
  0%|          | 0/2000 [00:00<?, ?it/s]
Traceback (most recent call last):

  File "<ipython-input-77-a52e52683e3c>", line 1, in <module>
    ppc = pm.sample_ppc(trace, model=multinomial_model)

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py", line 572, in sample_ppc
    size=size))

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/distributions/discrete.py", line 520, in random
    size=size)

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 403, in generate_samples
    samples = generator(size=1, *args, **kwargs)

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/distributions/discrete.py", line 510, in random_choice
    for p in kwargs['p']]

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/distributions/discrete.py", line 510, in <listcomp>
    for p in kwargs['p']]

  File "mtrand.pyx", line 1137, in mtrand.RandomState.choice

ValueError: a and p must have same size 

I’ve looked through the pymc3 files to try and figure out what “a” is, but I can’t find it. I guess it is some thing in that mtrand.pyx file that is being held over? I don’t wanna jack anything up and delete the .pyx file, but is that the thing to do?

Thanks so much for your help.

-C

Unfortunately it might be an issue with Multinomial, see eg this issue: https://github.com/pymc-devs/pymc3/issues/2614

Hm, ok. That’s no fun. Thanks for the info tho.