I have a series of events in different categories. Each event is associated with a binary outcome. I would like to estimate the outcome probability given the category. This is the code that creates the dataset:
category_str = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']
category_numerical = [0, 0, 0, 1, 1, 1, 2, 2]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
Here, we have three categories. Therefore, I would like to create a vector of three probabilities, one per category. Here’s what I tried:
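# Imports assumed from the pm / tt aliases (PyMC3 with Theano)
import pymc3 as pm
import theano.tensor as tt

n_categories = len(set(category_str))
with pm.Model() as model:
    # One Beta(1, 1) prior per category
    probs = []
    for cat in ('a', 'b', 'c'):
        _p = pm.Beta(name=f'cat_{cat}', alpha=1, beta=1)
        probs.append(_p)
    p = tt.stack(probs)
    # Look up each observation's probability by its category index
    p_conditional = pm.Deterministic('p_conditional', p[category_numerical])
    label = pm.Bernoulli('label', p=p_conditional, observed=outcome)
    trace = pm.sample(1000, tune=1000, chains=1)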
At this point, I assumed that trace['p_conditional'] would contain 1,000 three-element values (one for each category), but the shape of this trace is completely different: it is (1000, 8), where 8 corresponds to the number of observations.
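To illustrate where the 8 seems to come from, here is a small plain-Python sketch of the indexing step alone, with no PyMC involved. The values in `p_draw` are made up for illustration and are not model output; the list comprehension is only a stand-in for `p[category_numerical]`.

```python
# Plain-Python stand-in for the indexing step p[category_numerical].
category_numerical = [0, 0, 0, 1, 1, 1, 2, 2]

# Hypothetical values for a single draw of (cat_a, cat_b, cat_c); not real output.
p_draw = [0.2, 0.6, 0.75]

# Indexing by observation repeats each category's probability once per event.
p_conditional_draw = [p_draw[i] for i in category_numerical]

print(len(p_draw))              # one entry per category
print(len(p_conditional_draw))  # one entry per observation
```

So each draw of `p_conditional` has one entry per observation rather than one per category. With the model as written, the per-category draws would instead be stored under the names `cat_a`, `cat_b`, and `cat_c`.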
What is the right way to get a trace of the three per-category probabilities?