I have a simple m-by-n table of categoricals:
   x  y
0  a  A
1  b  B
2  c  C
3  a  C
I want to model that the y column depends on the x column.
I made an attempt, and I expected each row of the posterior-predictive values to show which categorical y value was predicted for the given x. But the predicted rows actually have four columns, and don’t indicate which of the y values is predicted.
I’m not sure if my model is wrong or if I’m using sample_posterior_predictive() incorrectly. How can I get y predictions for given values of x?
Here’s my attempt.
import arviz
import pandas as pd
import pymc3
import numpy as np
import theano
import theano.tensor
data = pd.DataFrame(
    {
        'x': pd.Categorical(['a', 'b', 'c', 'a']),
        'y': pd.Categorical(['A', 'B', 'C', 'C']),
    }
)


def build(x, y, ncats_x):
    with pymc3.Model() as mod:
        p = pymc3.Dirichlet("p", a=np.array([1] * ncats_x))
        out = pymc3.Categorical("out", p=p[x], observed=y)
    return mod


np.set_printoptions(threshold=np.inf)
cats_x = len(set(data["x"]))
x_shared = theano.shared(data["x"].cat.codes.values)
model = build(x=x_shared, y=data["y"].cat.codes.values, ncats_x=cats_x)
trace = pymc3.sample(model=model)
inference = arviz.from_pymc3(trace)
print(arviz.summary(inference))
out_of_sample_x = np.random.choice(data["x"].cat.codes.values, size=2)
x_shared.set_value(out_of_sample_x)
pred = pymc3.sample_posterior_predictive(trace=trace, model=model,)
print(pred)
# mean sd hpd_3% hpd_97% ... ess_sd ess_bulk ess_tail r_hat
# p[0] 0.219 0.143 0.009 0.484 ... 1632.0 1620.0 1184.0 1.0
# p[1] 0.311 0.168 0.030 0.608 ... 1953.0 1901.0 1033.0 1.0
# p[2] 0.470 0.180 0.156 0.803 ... 1624.0 1683.0 1168.0 1.0
# [3 rows x 11 columns]
# {'out': array([[0, 0, 0, 0],
# [1, 0, 0, 1],
# [0, 0, 1, 1],
# [1, 0, 0, 1],
# [0, 1, 0, 1], ...
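
To make concrete what I mean by "y depends on x", here is a dependency-free sketch (plain Python, no PyMC3) of the kind of output I have in mind. It assumes a separate Dirichlet(1, ..., 1) prior over the y categories for each x category, updated by simple counting. That is an assumption made for illustration and not necessarily how the PyMC3 model should be written; `sample_dirichlet` and `posterior_predictive` are hypothetical helper names.

```python
import random
from collections import Counter

# Toy data from the table at the top of the post.
xs = ["a", "b", "c", "a"]
ys = ["A", "B", "C", "C"]

x_levels = sorted(set(xs))
y_levels = sorted(set(ys))


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def posterior_predictive(new_xs, n_draws=1000, prior=1.0, seed=0):
    """Return n_draws rows, each with one predicted y label per entry of new_xs."""
    rng = random.Random(seed)
    counts = Counter(zip(xs, ys))  # how often each (x, y) pair was observed
    rows = []
    for _ in range(n_draws):
        # Assumed model: one probability vector over y per x category,
        # with posterior Dirichlet(prior + observed counts).
        p_by_x = {
            x: sample_dirichlet([prior + counts[(x, y)] for y in y_levels], rng)
            for x in x_levels
        }
        rows.append(
            [rng.choices(y_levels, weights=p_by_x[x], k=1)[0] for x in new_xs]
        )
    return rows


pred = posterior_predictive(["a", "c"])
print(len(pred), len(pred[0]))  # 1000 2: one row per draw, one column per x
print(Counter(row[1] for row in pred).most_common())  # y labels drawn for x = "c"
```

Each row holds one y label per requested x value, which is the shape I expected from sample_posterior_predictive().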