# Proper way to estimate conditional probabilities

I have a series of events in different categories. Each event is associated with a binary outcome. I would like to estimate the outcome probability given the category. Here is the code that creates the dataset:

```python
category_str = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']
category_numerical = [0, 0, 0, 1, 1, 1, 2, 2]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
```

Here, we have three categories, so I would like to end up with a vector of three probabilities. Here's what I tried:

```python
import pymc3 as pm
import theano.tensor as tt

n_categories = len(set(category_str))
with pm.Model() as model:
    # One Beta(1, 1) prior per category
    probs = []
    for cat in ('a', 'b', 'c'):
        _p = pm.Beta(name=f'cat_{cat}', alpha=1, beta=1)
        probs.append(_p)
    p = tt.stack(probs)
    # Map each observation to its category's probability
    p_conditional = pm.Deterministic('p_conditional', p[category_numerical])
    label = pm.Bernoulli('label', p=p_conditional, observed=outcome)
    trace = pm.sample(1000, tune=1000, chains=1)
```

At this point, I assumed that `trace['p_conditional']` would contain 1,000 three-element vectors (one probability per category), but the shape of this trace is completely different: it's `(1000, 8)`, where 8 is the number of observations.

What is the right way to achieve this?

Hi Boris!
Yeah, that’s the expected behavior: you need to give a length-8 vector of probabilities (your `p_conditional` vector) to the likelihood, since you have 8 observations – this matches each data point with its corresponding category probability.
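In case it helps to see the indexing mechanics outside of PyMC, here's a plain-NumPy sketch of what `p[category_numerical]` does (the probability values are made up for illustration):

```python
import numpy as np

# Hypothetical category-level probabilities (illustrative values only)
p = np.array([0.1, 0.5, 0.9])
category_numerical = np.array([0, 0, 0, 1, 1, 1, 2, 2])

# Integer (fancy) indexing expands the 3 category probabilities into
# one probability per observation -- hence the trace shape (1000, 8)
p_conditional = p[category_numerical]
print(p_conditional)        # [0.1 0.1 0.1 0.5 0.5 0.5 0.9 0.9]
print(p_conditional.shape)  # (8,)
```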

But you’re actually interested in the underlying probability of each category (your `p` vector). So I think you’re almost there, you just need to tell PyMC to keep it in the trace (and actually maybe you don’t need `p_conditional` in the trace):

```python
with pm.Model() as model:
    p = pm.Beta('p_cat', alpha=1, beta=1, shape=n_categories)
    label = pm.Bernoulli('label', p=p[category_numerical], observed=outcome)
```

This should give you what you’re looking for.
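One quick sanity check you can do without any sampling: the Beta(1, 1) prior is conjugate to the Bernoulli likelihood, so the exact posterior for each category is Beta(1 + successes, 1 + failures). A plain-NumPy sketch of the empirical frequencies and exact posterior means, which the means of `trace['p_cat']` should land close to:

```python
import numpy as np

category_str = np.array(['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c'])
outcome = np.array([0, 0, 0, 1, 0, 1, 1, 1])

# Empirical outcome frequency per category
emp = {cat: outcome[category_str == cat].mean() for cat in ('a', 'b', 'c')}
print(emp)  # {'a': 0.0, 'b': 0.666..., 'c': 1.0}

# Exact posterior means under the Beta(1, 1) prior:
# mean of Beta(1 + s, 1 + n - s) is (1 + s) / (2 + n)
post_mean = {cat: (1 + outcome[category_str == cat].sum())
                  / (2 + (category_str == cat).sum())
             for cat in ('a', 'b', 'c')}
print(post_mean)  # {'a': 0.2, 'b': 0.6, 'c': 0.75}
```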
Just as an aside, note that the three probabilities won’t sum to 1 here – I know that’s not your question, just pointing it out.
Hope that helps

I came to a slightly different solution which, although it worked, was much slower on my toy example. However, when applied to real data (100,000 observations with 15 categories), “my” approach is much faster than yours. Do you have any idea why?

```python
import numpy as np

category_str = np.array(category_str)
outcome = np.array(outcome)
with pm.Model() as model:
    for cat in ('a', 'b', 'c'):
        p_ = pm.Beta(f'p_{cat}', alpha=1, beta=1)
        # Separate likelihood per category, each with a scalar probability
        pm.Bernoulli(f'outcome_{cat}', p=p_, observed=outcome[category_str == cat])
    trace = pm.sample(1000, tune=1000)
```

Mmh, not really.
Maybe because you give a probability vector of shape 1 to each likelihood instead of a vector of length `len(outcome)` as before? That may take better advantage of vectorized operations under the hood.
Really not sure, but glad you found a solution!
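For what it's worth, the two parameterizations do imply the same joint likelihood, so whichever samples faster is a fair choice. A NumPy sketch (with made-up probability values) showing that one vectorized Bernoulli log-likelihood over the fancy-indexed vector equals the sum of three per-category ones:

```python
import numpy as np

p = np.array([0.2, 0.6, 0.75])  # hypothetical category probabilities
category_numerical = np.array([0, 0, 0, 1, 1, 1, 2, 2])
outcome = np.array([0, 0, 0, 1, 0, 1, 1, 1])

def bern_logp(p, y):
    # Bernoulli log-likelihood; broadcasts over scalar or vector p
    return np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))

# One vectorized likelihood over the fancy-indexed probability vector...
logp_indexed = bern_logp(p[category_numerical], outcome)

# ...versus three smaller likelihoods, one per category (the second approach)
logp_grouped = sum(bern_logp(p[i], outcome[category_numerical == i])
                   for i in range(3))

print(np.isclose(logp_indexed, logp_grouped))  # True
```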