Proper way to estimate conditional probabilities

I have a series of events in different categories, and each event is associated with a binary outcome. We would like to estimate the outcome probability given the category. This is the code that creates the dataset:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

category_str = ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c']
category_numerical = [0, 0, 0, 1, 1, 1, 2, 2]
outcome = [0, 0, 0, 1, 0, 1, 1, 1]
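
For reference, the raw per-category frequencies in this toy dataset can be computed directly (a quick numpy check, not part of the model):

cats = np.array(category_str)
ys = np.array(outcome)
for cat in ('a', 'b', 'c'):
    print(cat, ys[cats == cat].mean())  # a: 0.0, b: ~0.667, c: 1.0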

Here, we have three categories. Therefore, I would like to create a vector of three probabilities. Here’s what I tried:

n_categories = len(set(category_str))  # 3
with pm.Model() as model:
    # One independent Beta(1, 1) prior per category
    probs = []
    for cat in ('a', 'b', 'c'):
        _p = pm.Beta(name=f'cat_{cat}', alpha=1, beta=1)
        probs.append(_p)
    p = tt.stack(probs)
    # Pick out, for each observation, the probability of its category
    p_conditional = pm.Deterministic('p_conditional', p[category_numerical])
    label = pm.Bernoulli('label', p=p_conditional, observed=outcome)
    trace = pm.sample(1000, tune=1000, chains=1)

At this point, I assumed that trace['p_conditional'] would contain 1,000 three-element samples (one probability per category), but the shape of this trace is completely different: it’s (1000, 8), where 8 corresponds to the number of observations.
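
For reference, this is how the shape can be checked:

print(trace['p_conditional'].shape)  # (1000, 8): one row per draw, one column per observation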

What is the right way to achieve my goal?

Hi Boris!
Yeah, that’s the expected behavior: you need to give a vector of 8 probabilities (your p_conditional vector) to the likelihood, as you have 8 observations – this matches each data point with its category’s probability.

But what you’re actually interested in is the underlying probability of each category (your p vector). So I think you’re almost there – you just need to tell PyMC to keep it in the trace (and you probably don’t need p_conditional in the trace at all):

with pm.Model() as model:
    # One Beta prior of shape (n_categories,); named free variables are kept in the trace
    p = pm.Beta('p_cat', alpha=1, beta=1, shape=n_categories)
    # Index into p so each observation gets its category's probability
    label = pm.Bernoulli('label', p=p[category_numerical], observed=outcome)

This should give you what you’re looking for.
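
For instance, sampling from this model gives a trace of the shape you expected (a minimal check – p_cat is the variable defined in the model above):

with model:
    trace = pm.sample(1000, tune=1000, chains=1)

print(trace['p_cat'].shape)         # (1000, 3): one column per category
print(trace['p_cat'].mean(axis=0))  # posterior mean probability for a, b, c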
Just as an aside, note that the three probabilities won’t sum to 1 here – I know it’s not your question, just pointing that out :wink:
Hope that helps :vulcan_salute:

I came to a slightly different solution which, although it worked, was much slower on my toy example. However, when applied to real data (100,000 observations with 15 categories), “my” approach is much faster than yours. Do you have any idea why?

category_str = np.array(category_str)
outcome = np.array(outcome)
with pm.Model() as model:
    probs = []
    for cat in ('a', 'b', 'c'):
        p_ = pm.Beta(f'p_{cat}', alpha=1, beta=1)
        # A separate likelihood per category, conditioned only on that category's outcomes
        cat_outcomes = pm.Bernoulli(f'outcome_{cat}', p=p_, observed=outcome[category_str == cat])
        probs.append(cat_outcomes)
    trace = pm.sample(1000, tune=1000)

Mmh not really :thinking:
Maybe because here you give each per-category likelihood a single scalar probability, instead of indexing into a vector of length len(outcome) as before? This maybe takes better advantage of vectorized operations under the hood?
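
If you want to test that guess, a rough timing comparison on synthetic data of about your real size could look like this (everything below is made up for illustration – the sizes just mirror your 100,000 observations and 15 categories):

import time
import numpy as np
import pymc3 as pm

rng = np.random.default_rng(0)
n_obs, n_cats = 100_000, 15
cat_idx = rng.integers(0, n_cats, size=n_obs)  # fake category index per observation
y = rng.integers(0, 2, size=n_obs)             # fake binary outcomes

# Version 1: one Beta vector, advanced indexing into it
with pm.Model() as indexed_model:
    p = pm.Beta('p_cat', alpha=1, beta=1, shape=n_cats)
    pm.Bernoulli('label', p=p[cat_idx], observed=y)

# Version 2: one scalar Beta and one likelihood per category
with pm.Model() as split_model:
    for k in range(n_cats):
        p_k = pm.Beta(f'p_{k}', alpha=1, beta=1)
        pm.Bernoulli(f'outcome_{k}', p=p_k, observed=y[cat_idx == k])

for name, m in (('indexed', indexed_model), ('split', split_model)):
    start = time.time()
    with m:
        pm.sample(500, tune=500, chains=1, progressbar=False)
    print(name, f'{time.time() - start:.1f}s')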
Really not sure but glad you found a solution :tada: