Proper way to estimate conditional probabilities

I have a series of events in different categories. Each event is associated with a binary outcome. We would like to estimate outcome probability, given the category. This is the code that creates a dataset

category_str = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']
category_numerical = [0, 0, 0, 1, 1, 1, 2, 2]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]

Here, we have three categories. Therefore, I would like to create a vector of three probabilities. Here’s what I tried to do

n_categories = len(set(category_str))
with pm.Model() as model:
    probs = []
    for cat in ('a', 'b', 'c'):
        _p = pm.Beta(name=f'cat_{cat}', alpha=1, beta=1)
    p = tt.stack(probs)
    p_conditional = pm.Deterministic('p_conditional', p[category_numerical])
    label = pm.Bernoulli('label', p=p_conditional, observed=outcome)
    trace = pm.sample(1000, tune=1000, chains=1)

At this point, I assumed that trace['p_conditional'] will contain 1,000 three-element values (one for each category), but the shape of this trace is completely different. It’s (1000, 8) (where 8 corresponds to the number of observations.

What is the right way to get my goal?

Hi Boris!
Yeah, that’s the expected behavior: you need to give an 8 long vector of probabilities (your p_conditional vector) to the likelihood, as you have 8 observations – this matches each data point with the corresponding category probability.

But you’re actually interested in the underlying probability of each category (your p vector). So I think you’re almost there, you just need to tell PyMC to keep it in the trace (and actually maybe you don’t need p_conditional in the trace):

with pm.Model() as model:
    p = pm.Beta('p_cat', alpha=1, beta=1, shape=n_categories)
    label = pm.Bernoulli('label', p=p[category_numerical], observed=outcome)

This should give you what you’re looking for.
Just as an aside, note that the three probabilities won’t sum to 1 here – I know it’s not your question, just pointing that out :wink:
Hope that helps :vulcan_salute:

I came to a slightly different solution which, although worked, is much slower in my toy example. However, when applied to real data (100,000 observations with 15 categories), “my” approach is much faster than yours. Do you have any idea why?

category_str = np.array(category_str)
outcome = np.array(outcome)
with pm.Model() as model:
    probs = []
    for cat in ('a', 'b', 'c'):
        p_ = pm.Beta(f'p_{cat}', alpha=1, beta=1)
        cat_outcomes = pm.Bernoulli(f'outcome_{cat}', p=p_, observed=outcome[category_str == cat])
    trace = pm.sample(1000, tune=1000)
1 Like

Mmh not really :thinking:
Maybe because you just give a probability vector of shape 1 to the likelihood instead of a vector of length len(outcome) as before? This maybe takes advantage of vectorized operations under the hood?
Really not sure but glad you found a solution :tada: