Bayesian network with multiple observed variables

Beginner in PyMC3 here. I am trying to implement the Bayesian network described in this paper.

[Figure: the fair Bayesian network from the paper]

The Bayesian network described in the paper contains the sensitive attributes S, the latent fair labels D_f, the observed labels D, and the features X. I want to do inference on P(D_f|X, S) to predict the fair label from observed data using a trained model.
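If I read the network structure correctly, the joint distribution factorizes as P(S, D_f, D, X) = P(S) P(D_f) P(D | D_f, S) P(X | D_f, S), so P(D_f | X, S) can be obtained from the learned conditionals via Bayes' rule.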

I am using the Adult dataset, where S = gender, X = workclass, and D = income. I am building the model step by step, but when I try to model P(D | D_f, S) I run into issues:

with pm.Model() as model:
    # Data
    S_shared = pm.Data("S_observed", sensitive['gender'][:100])
    D_shared = pm.Data("D_observed", y[:100])
    W_shared = pm.Data("W_observed", X['workclass'][:100])

    # Sensitive Attribute
    beta_s = pm.Beta("beta_s", alpha=1, beta=1, shape=2)
    S = pm.Categorical("S", p=beta_s, observed=S_shared)

    # Fair Labels
    beta_f = pm.Beta("beta_f", alpha=1, beta=1, shape=2)
    Df = pm.Categorical("Df", p=beta_f)

    # Data Labels P(D | Df, S)
    intercept_D = pm.Normal("intercept_D", mu=0, sd=1)
    beta1_D = pm.Normal("beta1_D", mu=0, sd=1)
    beta2_D = pm.Normal("beta2_D", mu=0, sd=1)
    mu_D = pm.math.invlogit(intercept_D + beta1_D*S + beta2_D*Df)
    D = pm.Bernoulli("D", p=mu_D, observed=D_shared)

    # Sample (Only 50 for testing purposes)
    trace = pm.sample(50)

When I change the line

    mu_D = pm.math.invlogit(intercept_D + beta1_D*S + beta2_D*Df)

to

    mu_D = pm.math.invlogit(intercept_D + beta2_D*Df)

it works. Otherwise I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
C:\tools\Anaconda3\envs\forseti\lib\site-packages\theano\compile\function\types.py in __call__(self, *args, **kwargs)
    973             outputs = (
--> 974                 self.fn()
    975                 if output_subset is None

TypeError: expected type_num 1 (NPY_INT8) got 9

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18340/784928753.py in <module>
     26 
     27     # Sample
---> 28     trace = pm.sample(50)

So apparently I am misunderstanding how to treat observed variables and how to model Bayesian networks in PyMC3. I model the relationship from the two categorical variables to the binary label as a logistic regression. For the categorical attributes in X I plan to use a logistic regression as well.

Could someone more experienced with PyMC3 give me some tips on how to implement the described Bayesian network and how to treat observed variables?


I tried to reproduce your error with the code snippet below but couldn’t get it to occur with version 3.11.4. You might check the dtypes of all the variables involved and cast them to what you think they should be (using .astype('float64'), .astype('uint8'), or a similar command). You also probably want to be careful about your usage of categorical predictors here. Once they have more than two levels, you should consider making dummy variables out of them, so that you aren’t using values of 0, 1, 2, ... as predictors unless your workclass is actually ordered in some way.

import pymc3 as pm
import numpy as np

n = 10

with pm.Model() as model:
    # Data
    S_shared = pm.Data("S_observed", np.random.randint(low=0,high=2,size=n))
    D_shared = pm.Data("D_observed", np.random.randint(low=0,high=2,size=n))

    # Sensitive Attribute
    beta_s = pm.Beta("beta_s", alpha=1, beta=1, shape=2)
    S = pm.Categorical("S", p=beta_s, observed=S_shared)

    # Fair Labels
    beta_f = pm.Beta("beta_f", alpha=1, beta=1, shape=2)
    Df = pm.Categorical("Df", p=beta_f)

    # Data Labels P(D | Df, S)
    intercept_D = pm.Normal("intercept_D", mu=0, sd=1)
    beta1_D = pm.Normal("beta1_D", mu=0, sd=1)
    beta2_D = pm.Normal("beta2_D", mu=0, sd=1)
    mu_D = pm.math.invlogit(intercept_D + beta1_D*S + beta2_D*Df)
    D = pm.Bernoulli("D", p=mu_D, observed=D_shared)

    # Sample (Only 50 for testing purposes)
    trace = pm.sample(50)

Thank you! The problem was indeed the data types. The dtypes of the sensitive attribute, the label, and workclass were a mix of int8 and int64. Converting them all to int64 resolved the issue; with int8 the problem persists.
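For reference, the casts looked roughly like this (same columns as in the snippet above):

    # cast everything to int64 before handing it to pm.Data
    sensitive['gender'] = sensitive['gender'].astype('int64')
    y = y.astype('int64')
    X['workclass'] = X['workclass'].astype('int64')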

Regarding your last comment on categorical predictors: I am trying to implement P(W | S, Df), where Df and S are binary. I thought I could model this relationship as a multinomial logistic regression, since workclass has 9 categories. Should I rather use dummy variables for the 9 categories?

When I try to implement this by adding the following lines to the code above:

    # assumes `import theano.tensor as tt` at the top of the file
    W_shared_T = pm.Data("W_observed:T", X['workclass'][:100].values.reshape(-1, 1))
    W_shared = pm.Data("W_observed", X['workclass'][:100])

    # Data info
    nparam = 1
    nclass = len(X['workclass'].unique())

    # Data Labels P(W | Df, S)
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=nclass)
    beta = pm.Normal('beta', mu=0, sd=1, shape=(nparam, nclass))
    mu = tt.dot(W_shared_T, beta)
    W = pm.Categorical('W', p=mu, observed=W_shared)

I get this error:

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'beta_s_stickbreaking__': array([0.]), 'beta_f_stickbreaking__': array([0.]), 'Df': array(0), 'intercept_D': array(0.), 'beta1_D': array(0.), 'beta2_D': array(0.), 'alpha': array([0., 0., 0., 0., 0., 0., 0., 0., 0.]), 'beta': array([[0., 0., 0., 0., 0., 0., 0., 0., 0.]])}

Initial evaluation results:
beta_s_stickbreaking__    -0.69
beta_f_stickbreaking__    -0.69
Df                        -0.69
intercept_D               -0.92
beta1_D                   -0.92
beta2_D                   -0.92
alpha                     -8.27
beta                      -8.27
S                        -69.31
D                        -69.31
W                          -inf
Name: Log-probability of test_point, dtype: float64

“I thought I could model this relationship as a multinomial logistic regression since workclass has 9 categories.”

I think this makes sense only if the categories are clearly ordered in a linear way. For example, if the jobs were software engineer I, II, III and principal engineer I, II, III, and so on, then maybe. That probably isn’t the case here, so I would recommend one-hot coding. You can get the one-hot encoding in Theano with theano.tensor.extra_ops.to_one_hot, which returns an (n, nclass) matrix of 0-1 entries, and then use a parameter vector with shape (nclass,) to get the probabilities for the categorical variable; see the sketch below.
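For example, a minimal sketch (assuming W_shared holds the integer codes 0-8):

    from theano.tensor.extra_ops import to_one_hot

    # (n, nclass) matrix of 0/1 dummies for the 9 workclass levels
    W_onehot = to_one_hot(W_shared, nclass)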

Also, you aren’t using normalized probabilities: some of the entries in mu fall outside of 0-1, which is why W evaluates to -inf at the starting point. You want to apply a softmax or similar transformation to mu here:

    mu = tt.dot(W_shared_T, beta)
    W = pm.Categorical('W', p=mu, observed=W_shared)
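Putting it together, here is a rough sketch of what the P(W | S, Df) block could look like with per-class coefficients and a softmax (the names beta_S, beta_Df, eta, and p are mine, not from your code):

    # assumes `import theano.tensor as tt` at the top of the file
    # per-class intercepts and per-class effects of the two binary parents
    alpha = pm.Normal('alpha', mu=0, sd=1, shape=nclass)
    beta_S = pm.Normal('beta_S', mu=0, sd=1, shape=nclass)
    beta_Df = pm.Normal('beta_Df', mu=0, sd=1, shape=nclass)

    # (n, nclass) logits: broadcast the length-n S over the class dimension
    eta = alpha + beta_S * S_shared[:, None] + beta_Df * Df

    # row-wise softmax turns each row into a valid probability vector
    p = tt.nnet.softmax(eta)
    W = pm.Categorical('W', p=p, observed=W_shared)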