Prediction using sample_ppc in Hierarchical model

Github issue #2352 by nikzadb:

I have the attached data and following Hierarchical model (as a toy example of another model) and trying to draw posterior samples from it (of course to predict new values).
new_data.txt (1.5 KB)

Hierarchical model:

import numpy as np
import pandas as pd
import pymc3 as pm3
from theano import shared
import theano.tensor as tt
import matplotlib.pylab as plt

%matplotlib inline

new_data = pd.read_csv('new_data.csv', sep=',')

train = new_data.iloc[indexes[:-15], :]
test = new_data.iloc[indexes[-15:], :]

masked_x1 = shared(train.x1)
masked_x2 = shared(train.x2)
idx = train.idx.values.astype(int)
num_items = len(idx)
with pm3.Model() as model:
    intercept_prior = pm3.Normal('intercept_prior',
                                 mu=20,
                                 sd=100,
                                 shape=num_items)

    mu_x1 = pm3.Normal('mu_x1', mu=0, sd=20)
    std_x1 = pm3.Uniform('std_x1', lower=0, upper=100)
    x1_prior = pm3.Normal('x1_prior',
                                    mu=mu_x1,
                                    sd=std_x1,
                                    shape=num_items)

    std_x2 = pm3.Uniform('std_x2', lower=0, upper=100)
    x2_prior = pm3.HalfNormal('x2_prior',
                                    sd=std_x2,
                                    shape=num_items)

    y_est = (
        intercept_prior[idx] +
        x1_prior[idx] * masked_x1.get_value() +
        x2_prior[idx] * masked_x2.get_value()
    )

    y_est = pm3.Deterministic('y_est', y_est)

    pm3.Normal('y_pred',
               mu=y_est,
               sd=100,
               observed=train.y)    
    start, sds, elbo = pm3.variational.advi(n=20000)
    step = pm3.Metropolis()
    trace = pm3.sample(10000, start=start, step=step)

varnames = ['mu_x1', 'std_x1', 'std_x2']
pm3.traceplot(trace[-1000:], varnames)
pm3.plot_posterior(trace[-1000:], varnames)

in which the size of trace[ā€˜y_predā€™] is (10000, 43) since there are 10000 traces and 43 data in train set.

Prediction

masked_x1.set_value(test.x1)
masked_x2.set_value(test.x2) 
ppc = pm3.sample_ppc(trace, model=model, samples=500)

What I expect here is the shape of ppc[ā€˜y_predā€™] be (500, 15), where 15 is the size of test dataset, but instead I get (500, 43) which 43 is the size of train dataset.
(I followed the suggestions at https://github.com/pymc-devs/pymc3/issues/1001 and ran the hierarchical model for rugby prediction example. For this example, everything is fine but can not figure out why the above model still goes wrong.)

1 Like

In your model the item specific effect is indexing through [idx], so you want the number of coefficients equal to the number of unique items:
num_items = len(idx) --> num_items = len(np.unique(idx))
and the indexing should be correspondent to the input data in

y_est = (
        intercept_prior[idx] +
        x1_prior[idx] * masked_x1.get_value() +
        x2_prior[idx] * masked_x2.get_value()
    )

the [idx] above should be correctly correspondent to masked_x1 and masked_x2

Try the code below, it will give the correct format of output now:

new_data = pd.read_csv('new_data.txt', sep='\t')

idx = np.asarray(new_data.idx)
num_items = len(np.unique(idx))

train = new_data[:-15]
test = new_data[-15:]

masked_x1 = shared(np.asarray(train.x1))
masked_x2 = shared(np.asarray(train.x2))
masked_idx = shared(np.asarray(train.idx))

with pm3.Model() as model:
    intercept_prior = pm3.Normal('intercept_prior',
                                 mu=20,
                                 sd=100,
                                 shape=num_items)

    mu_x1 = pm3.Normal('mu_x1', mu=0, sd=20)
    std_x1 = pm3.Uniform('std_x1', lower=0, upper=100)
    x1_prior = pm3.Normal('x1_prior',
                                    mu=mu_x1,
                                    sd=std_x1,
                                    shape=num_items)

    std_x2 = pm3.Uniform('std_x2', lower=0, upper=100)
    x2_prior = pm3.HalfNormal('x2_prior',
                                    sd=std_x2,
                                    shape=num_items)

    y_est = pm3.Deterministic('y_est',
        intercept_prior[masked_idx] +
        x1_prior[masked_idx] * masked_x1 +
        x2_prior[masked_idx] * masked_x2
    )

    pm3.Normal('y_pred',
               mu=y_est,
               sd=100,
               observed=train.y)    
    trace = pm3.sample(3000, njobs=2)

varnames = ['mu_x1', 'std_x1', 'std_x2']
pm3.traceplot(trace[2000:], varnames)
pm3.plot_posterior(trace[2000:], varnames)
#%%
masked_x1.set_value(np.asarray(test.x1))
masked_x2.set_value(np.asarray(test.x2))
masked_idx.set_value(np.asarray(test.idx))
ppc = pm3.sample_ppc(trace, model=model, samples=500)

Also, some hyper prior on intercept_prior would also improve the model.

1 Like

Thanks @junpenglao. It was great and useful.

Well I have another question about how the final trace in PyMC3 works (another application). I used hpd() function to find 95% interval but almost all true values are outside of this interval. What that means? how can I improve the results?

Before running the model, I standardized all columns and then reversed to the original sale (using scikit-learn MinMaxScaler)

bd = trace[-1000:][ā€˜y_estā€™]
predictions = pd.DataFrame()
predictions[ā€˜predā€™] = bd.mean(axis=0)
predictions = predictions[~predictions.pred.isnull()]
predictions[ā€˜yā€™] = new_data.y
predictions[ā€˜residualā€™] = predictions.pred - predictions.y

bd_hpd = pm3.hpd(bd)

predictions[ā€˜prctile_95ā€™] = bd_hpd[:, 1]
predictions[ā€˜prctile_5ā€™] = bd_hpd[:,0]

Hi @junpenglao,

Itā€™s Charles again. I am hitting a problem with sample_ppc that hierarchical multinomial/categorical model we were talking about yesterday.

I followed your suggestions:

#set in shared variables for prediction and testing later
input_var = theano.shared(model_data.values)
target_var = theano.shared(all_data['idx'].values)
masked_idx = theano.shared(asarray(labels))

#Test the model
input_var.set_value(test_input.values)
masked_idx.set_value(test_input['label'].values)
n_test = test_input.values.shape[0]

#have to put target_var that has same length as test_input
#just fake values because this is what I want to predict
target_var.set_value(ones(n_test).astype('int32'))

ppc = pm.sample_ppc(trace, model=multinomial_model)

I get this error:

ppc = pm.sample_ppc(trace, model=multinomial_model)
  0%|          | 0/2000 [00:00<?, ?it/s]
Traceback (most recent call last):

  File "<ipython-input-77-a52e52683e3c>", line 1, in <module>
    ppc = pm.sample_ppc(trace, model=multinomial_model)

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py", line 572, in sample_ppc
    size=size))

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/distributions/discrete.py", line 520, in random
    size=size)

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 403, in generate_samples
    samples = generator(size=1, *args, **kwargs)

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/distributions/discrete.py", line 510, in random_choice
    for p in kwargs['p']]

  File "/home/ccrawfor/anaconda3/lib/python3.6/site-packages/pymc3/distributions/discrete.py", line 510, in <listcomp>
    for p in kwargs['p']]

  File "mtrand.pyx", line 1137, in mtrand.RandomState.choice

ValueError: a and p must have same size 

Iā€™ve looked through the pymc3 files to try and figure out what ā€œaā€ is, but I canā€™t find it. I guess it is some thing in that mtrand.pyx file that is being held over? I donā€™t wanna jack anything up and delete the .pyx file, but is that the thing to do?

Thanks so much for your help.

-C

Unfortunately it might be an issue with Multinomial, see eg this issue: https://github.com/pymc-devs/pymc3/issues/2614

Hm, ok. Thatā€™s no fun. Thanks for the info tho.