# PyMC3 Conditioning Random Variable on Multiple Discrete Parents

I recently started doing probabilistic programming using PyMC3. In my scenario, I have 3 random variables: On, Triangle, and X, such that X depends on Triangle and On. Triangle and On both follow Bernoulli distributions, and depending on which value they take, the value of X, which follows a Normal, changes.

I wrote up some mock code to test this concept but it doesn’t work. I just started working in this framework, and I know there are potentially many things wrong with this code, but I’m posting this here so you can see what I’ve done.

```python
import numpy
import pymc3

with pymc3.Model() as model:
    on = pymc3.distributions.discrete.Bernoulli('on', p=.7)
    x_on = pymc3.distributions.continuous.Normal('x_on', 10, .4)

    pTriangle_given_on = 1
    pTriangle_given_not_on = .7
    pTriangle = pymc3.math.switch(on, pTriangle_given_on, pTriangle_given_not_on)
    triangle = pymc3.distributions.discrete.Bernoulli('triangle', p=pTriangle)

    name = None
    x_triangle = None

    if triangle:
        name = pymc3.distributions.discrete.Categorical('name', p=[.3, .2, .1, .1, .2, .1])
    else:
        name = pymc3.distributions.discrete.Categorical('name', p=[.1, .5, .4])

    if on:
        x_triangle = pymc3.Deterministic('x_triangle', x_on)
    elif triangle:
        x_triangle = pymc3.Normal('x_triangle', 5, 1)
    else:
        x_triangle = numpy.nan

    trace = pymc3.sample()
    pymc3.plot_posterior(trace)
```

I’m not sure how to specify the conditional dependence of X on Triangle and On. Any thoughts from you all would be much appreciated.

Hello and welcome. The core reason the above code doesn’t work is that what you want the `if` statements to mean is "if the sampled value of `on` is `1`"; but what they’re really asking is whether the Python object `on` is truthy (e.g. not `None`), which has nothing to do with its sampled value. (As a caveat: I’m not sure what the “proper” way to handle NA-valued outputs is, but below is one way.)

Ultimately, the value of a random variable is not available until the sampler is running, so you can’t use `if` on it directly. The `switch` you already use for `pTriangle` is the appropriate construct; you could rewrite all of the branching with `switch` and it would work just fine.

However, when dealing with Bernoulli distributions, it is often more efficient to exploit the fact that multiplication works effectively as a switch. For instance, the (incorrect) code

```python
if on:
    return 5
else:
    return 2
```

could be implemented correctly as

`2 + on * 3`
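To sanity-check the trick, here is a small pure-Python sketch (no PyMC3 needed) comparing the branch version against the arithmetic version for both Bernoulli outcomes:

```python
def branch_version(on):
    # The (conceptual) if/else we want to express
    if on:
        return 5
    return 2

def switch_version(on):
    # Multiplication by a 0/1 variable acts as a switch:
    # on=0 -> 2, on=1 -> 2 + 3 = 5
    return 2 + on * 3

for on in (0, 1):
    assert branch_version(on) == switch_version(on)
```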

With that in mind, the joint distribution could be defined with:

```python
import pymc3 as pm

p_on = 0.3
p_tri_giv_on = 1.
p_tri_giv_not_on = 0.7
tri_delta_on = p_tri_giv_on - p_tri_giv_not_on

mean_on, sd_on = 10., 0.4
mean_tri, sd_tri = 5., 1.
NA_ENCODING = -10.

with pm.Model() as model:
    on = pm.Bernoulli('on', p_on)
    triangle = pm.Bernoulli('triangle', p_tri_giv_not_on + on * tri_delta_on)
    x_lat = pm.Normal('x_l', 0., 1.)  # will transform with `on` and `triangle`
    x = pm.Deterministic('x', NA_ENCODING
                         + on * (-NA_ENCODING + x_lat * sd_on + mean_on)
                         + (1 - on) * triangle * (-NA_ENCODING + x_lat * sd_tri + mean_tri))
    res = pm.sample_prior_predictive(1000)

import seaborn as sbn
import pandas as pd
from matplotlib import pyplot as plt

rdf = pd.DataFrame(res)
rdf['sample_num'] = rdf.index

sbn.pointplot(x='sample_num', y='x', data=rdf, join=False)
```


Thank you @chartl !

Quick question @chartl: why is the quoted value 10 rather than 20? Because I see `-(-10) + 0 + 10 = 20`.

The first term of `x` is `NA_ENCODING` itself (added with a plus sign), which gives the default value (i.e. for `on=0` and `triangle=0`). When `on=1` you get

`NA_ENCODING + (-NA_ENCODING + x_lat*sd_on + mean_on) = x_lat*sd_on + mean_on`
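For completeness, here is a plain-Python sketch of the same algebra, evaluating the `x` expression for all three parent configurations (constants as in the model above; `x_lat` fixed at 0 so `x` equals the conditional mean):

```python
NA_ENCODING = -10.0
mean_on, sd_on = 10.0, 0.4
mean_tri, sd_tri = 5.0, 1.0

def x_value(on, triangle, x_lat=0.0):
    # Same arithmetic as in the pm.Deterministic above
    return (NA_ENCODING
            + on * (-NA_ENCODING + x_lat * sd_on + mean_on)
            + (1 - on) * triangle * (-NA_ENCODING + x_lat * sd_tri + mean_tri))

print(x_value(on=1, triangle=0))  # 10.0  (the x_on regime; NA_ENCODING cancels)
print(x_value(on=0, triangle=1))  # 5.0   (the triangle regime)
print(x_value(on=0, triangle=0))  # -10.0 (the NA encoding)
```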

Ahh ok, yes I see now! Thanks!

I am trying to implement a similar case where a (continuous) random variable depends on multiple discrete parents, using what I believe is essentially the same approach suggested by @chartl. However, when I try to sample, I get errors.

Here is my model:

```python
with pm.Model() as model:
    data = pm.Data('data', train['k'].values)
    k = pm.Normal('k', mu=0, sigma=train['k'].std(), observed=data)

    a_w1 = pm.Normal('a_w1', mu=0, sigma=1)
    b_w1 = pm.Normal('b_w1', mu=0, sigma=1)
    w_1 = pm.Data('w_1', train['w1'].values)
    w1 = pm.Bernoulli('w1', p=pm.invlogit(a_w1 + b_w1 * k), observed=w_1)

    a_w2 = pm.Normal('a_w2', mu=0, sigma=1)
    b_w2 = pm.Normal('b_w2', mu=0, sigma=1)
    w_2 = pm.Data('w_2', train['w2'].values)
    w2 = pm.Bernoulli('w2', p=pm.invlogit(a_w2 + b_w2 * k), observed=w_2)

    a_e = pm.Normal('a_e', mu=0, sigma=1)
    b_e1 = pm.Normal('b_e1', mu=0, sigma=1)
    b_e2 = pm.Normal('b_e2', mu=0, sigma=1)

    e_ = pm.Data('education', train['e'].values)
    e = pm.Normal('e', mu=a_e + b_e1 * w1 + b_e2 * w2, sigma=train['e'].std(), observed=e_)

    trace = pm.sample(1000, tune=2000, cores=1, target_accept=0.9)

az.plot_trace(trace)
plt.tight_layout()
plt.show()
```

Here I want the mean of the `e` variable to be conditioned on the outcomes of the `w1` and `w2` variables. Here is part of the error traceback I get when attempting to sample:

```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/opt/anaconda3/envs/habitus/lib/python3.9/site-packages/theano/compile/function/types.py in __call__(self, *args, **kwargs)
    973             outputs = (
--> 974                 self.fn()
    975                 if output_subset is None

TypeError: expected type_num 12 (NPY_FLOAT64) got 7
```

I noticed the error is similar to issues 994 and 3751 on GitHub, but both of those are closed and none of the suggested fixes seem to work for me.

I could be wrong, but it seems the issue in my case is that Bernoulli RVs return integer types while Normal RVs return float types, and Theano cannot handle operations on tensors of mixed types?

Is there a way to achieve what I am trying to do, without running into this issue?

What does the data in `w_1` and `w_2` look like?

@michaelosthege You chimed in on at least one of those referenced issues. Any idea?

Maybe try to comment out variables (starting at the bottom) until the error goes away.
Probably a mix of dtypes somewhere.
Don’t forget to check `train['e'].std()` too.

They are series of 1’s and 0’s. They are integers, but the issue persists even when I try to coerce them to floats.

The issue is the `e` variable. I also think the problem is a mix of dtypes. I am not sure how this works exactly in the background, but I think somewhere along the way in the terms `b_e1 * w1` and `b_e2 * w2` there is a mix of integers and floats. Is it not possible to mix them like that in this context? And if not, is there a workaround? Thanks!

You mean the `train['e'].std()` ?
Try to do `.astype("float64")` before passing it to PyMC/Theano, because the conversion to tensors will try to infer the type from the input.

Or you could cast things explicitly with `theano.tensor.cast`.
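The dtype-inference point can be illustrated without Theano at all; NumPy follows the same convention of inferring the array dtype from the input, and an explicit `astype` pins it down before the data is handed off (a minimal sketch, assuming only NumPy):

```python
import numpy as np

raw = [0, 1, 1, 0]                             # integer literals -> integer array
as_ints = np.asarray(raw)
as_floats = np.asarray(raw).astype("float64")  # explicit cast before passing along

print(as_ints.dtype.kind)    # 'i' (an integer dtype; the exact width is platform-dependent)
print(as_floats.dtype)       # float64
```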

I think I figured out the issue. The Bernoulli variable expected observations as `int64`. For whatever reason `pm.Data` was casting the observed values of `w1` and `w2` from the dataframe into `int32` (even if I explicitly cast the `w1` and `w2` columns as `int64` beforehand). Casting the resulting tensors seemed to do the trick, like this:

```python
import theano.tensor as tt

w_1 = pm.Data('w_1', train['w1'].values)
w1 = pm.Bernoulli('w1', p=pm.invlogit(a_w1 + b_w1 * k), observed=tt.cast(w_1, 'int64'))
```