Observed data in Bayesian networks

I’m struggling to understand how observed data works in pymc3. From the information I’ve found so far, these two examples have been the most helpful for getting me as far as I have, but I can’t get my model to work.

As an example of what I’m trying to do, say I have records from customers at a restaurant, recording the temperature of the day on a categorical rating scale out of 5, and whether or not they ordered a main meal, a side, or a beverage. I’ve set up some mock data like so:

import numpy as np
import pymc3 as pm
from theano import shared

no_of_root_categories = 5
no_of_samples = 1000
hot_day = np.random.randint(no_of_root_categories, size=no_of_samples)
option_labels = ["Main", "Side", "Beverage"]
meal_options = np.random.randint(2, size=(len(option_labels), no_of_samples))

I want to model it as a simple Bayesian network, like so:

bn

where the shaded nodes are observed.

This is what I’ve got:

with pm.Model() as crashing_model:
    shape = (no_of_samples, no_of_root_categories)
    alpha = (1 / no_of_root_categories) * np.ones(shape)
    root_prior = pm.Dirichlet("Temperature Rating Prior", alpha, shape=shape)
    root = pm.Categorical('Temperature Rating', p=root_prior, shape=no_of_samples, observed=hot_day)

    for item, label in enumerate(option_labels):
        node_data = meal_options[item, :]
        node_prior = pm.Beta(f"{label} Prior",
                             mu=root,
                             sigma=root,
                             shape=no_of_samples,
                             testval=np.random.randint(1, size=no_of_samples))
        pm.Binomial(label, p=node_prior, n=no_of_samples, observed=node_data)

which works, but when I try

with crashing_model:
    trace = pm.sample(1000, random_seed=0)

Python exits with a ‘Bad initial energy’ error.

I can create a model which seems to work without the latent variables

with pm.Model() as working_model: # seems to work
    root_values = [np.where(hot_day == i)[0].tolist() for i in range(no_of_root_categories)]
    root_p = [len(i) / 1000 for i in root_values]
    root = pm.Categorical('Temperature Rating', p=root_p)

    shared_proportions = shared(np.array([len(hot_day[i]) for i in root_values]))
    for item, label in enumerate(option_labels):
        node_probs = [sum([meal_options[item, idx] for idx in category]) / len(category) for category in root_values]
        theano_probs = shared(np.array(node_probs))
        pm.Binomial(label, p=theano_probs[root], n=shared_proportions[root])

but I’m not sure how to translate what I’ve done there to work with the latent variables. Any help would be appreciated.

I posted this question on SO too before realising that this is a better place for it. Thinking further about it since then, I realised there’s more that I haven’t wrapped my head around. The documentation has been somewhat helpful, but there’s still a lot I’m unsure of, such as how the output from each node works - i.e. why it can be used in the probability of the next node, but also as an index for the theano shared object. I’m new to both pymc3 and theano, so if anyone has some good resources they can share that’d be great too.

Hi Elenchus

I think you are having problems in relation to the shape of the variables. I will assist on just the temperature part and see if that helps your understanding.

You have 1000 days of temperature data, where you encode the data using ints 0-4. You want to have a posterior for which of the 5 states is more or less likely. So your prior for temperate needs to have total size of 5, not 1000x5=5000. Together with this prior you need to feed into the model the output from 1000 days using the observed keyword.

with pm.Model() as model:
    alpha = np.ones(no_of_root_categories)
    root_prior = pm.Dirichlet("Temperature Rating Prior", alpha)
    root = pm.Categorical('Temperature Rating', p=root_prior, observed=hot_day)
        
    trace = pm.sample(1000, random_seed=0)
az.summary(trace)

and the summary shows

A few general guidelines:

  • I personally start with the most simple aspect of my model, get it to work using something simple and then either add some complexity or add another section of the model. Here I started with just the temperature.
  • I generally don’t use the shape parameter unless something errors and I always start off without it. I sample and then look at the output and the shape and dimensions of the output. Here the size of the temperature rating is 5 with just 1 dimension. And ask yourself if that makes sense.
  • Of course sometimes I do need to add shape.
  • Looking at the second piece of your code quickly, try not to use python loops inside PyMC3. Usually you can vectorise which will be much faster.
1 Like

Thanks @nkaimcaudle, that was just what I needed. I’d somehow got the idea that I needed to make sure the shapes were the same for all the nodes, and going round in circles with that was stopping me seeing other problems. I’ll look into vectorising next - I couldn’t find a way to iterate making nodes other than with the loops, so if you’ve got some pointers for that I’d appreciate it, but at least now I have a model I can sample from. Working code below for anyone else who stumbles on this.

with pm.Model() as model:
    shape = (no_of_samples, no_of_root_categories)
    alpha = np.ones(no_of_root_categories)
    root_prior = pm.Dirichlet("Temperature Rating Prior", alpha)
    root = pm.Categorical('Temperature Rating', p=root_p, observed=hot_day)

    for item, label in enumerate(option_labels):
        node_data = meal_options[item, :]
        theano_alpha_probs = shared(np.ones(no_of_root_categories))
        theano_beta_probs = shared(np.ones(no_of_root_categories))
        node_prior = pm.Beta(f"{label} Prior",
                             alpha=theano_alpha_probs[root],
                             beta=theano_beta_probs[root],
                             testval=0.5)
        pm.Bernoulli(label, p=node_prior, observed=node_data)

Ok, here is the full code that I think is what you want.

no_of_root_categories = 5
no_of_samples = 1000
hot_day = np.random.choice(no_of_root_categories, size=no_of_samples, p=[0.1, 0.2, 0.3, 0.2, 0.2])
option_labels = ["Main", "Side", "Beverage"]
#meal_options = np.random.choice(2, size=(no_of_samples, len(option_labels)), p=[0.3, 0.7])
A = np.array([[0.9, 0.8, 0.9],
              [0.8, 0.6, 0.7],
              [0.6, 0.7, 0.7],
              [0.4, 0.8, 0.8],
              [0.2, 0.9, 0.9]])
meal_options = np.zeros((no_of_samples, len(option_labels)), dtype=np.int)
for i in range(no_of_samples):
    meal_options[i, :] = np.random.binomial(1, p=A[hot_day[i], :])
    
with pm.Model() as model_new:
    alpha = np.ones(no_of_root_categories)
    temperature_rating = pm.Dirichlet('temperature_rating', alpha)
    obs_temperature = pm.Categorical('obs_temperature', p=temperature_rating, observed=hot_day)
    
    food_order = pm.Beta('food_order', 1, 1, shape=(5, 3))
    obs = pm.Bernoulli('obs', p=food_order[hot_day], observed=meal_options)
    
    idata = pm.sample(random_seed=0, return_inferencedata=True)
az.summary(idata)

I assume you want the probability of an item being ordered depending on the temperature rating. So I made food_order with shape=(5, 3). So each row will correspond to a temperature rating and then the three numbers will be the probability of the order, these 3 don’t need to sum to one.

A few more general points:

  • With testing data it’s best to introduce some variability, otherwise you don’t know if the model is achieving what you want. Using your code everything was coming up 0.5 and as a modeler I don’t know if that’s expected or not.
  • I find it’s best to always have the first dimension of my test data as the number of samples.
  • Similar to my points before, I don’t set testval on my first attempt. If the model is struggling to converge (seen by lots of divergences, very low ess, high rhat) only then I might set testval.
4 Likes

Fantastic.

Given the way I’d created the mock data 0.5 was what I was expecting, but I only set testval because it asked me to - possibly that was indicating that there’s something wrong in the setup I had.

This is really neat, and makes sense of the shape for me. I didn’t realised the distributions could work that way. Thank you very much.

I spent yesterday implementing this for my actual use case, and it’s raised a couple of (hopefully quick) questions.

Firstly, it seems that the only connection between the temperature part of the model and the meal part of the model is through the indices of the observed data. Is there any way to indicate that any sample drawn from the second half of the model should be preceded by a sample drawn from the first half? At the least, the figure from graph = pm.model_to_graphviz(model) doesn’t match the Bayesian network figure from my original question, but also I’m curious as to whether (and if so, how) the model will know that the two parts are connected other than when sampling from the observed data.

Secondly, I thought that my example was analogous to my use case, but there was one key difference, and that is that my observed categories, while integers, are not sequential and don’t start at 0. I had assumed that pm.Categorical was working out the categories from the data, but it seems not. I fixed the issue by passing in pandas.Categorical(data).codes - is that the right approach?

1/ I dont have graphviz installed, can you show how it looks?

2/ You can use pd.factorize(data, sort=True), this will return a tuple with the first item being the relevant indicies and the second being the labels in order.

Oh, of course. If I understand what’s happening correctly, then at least for pm.sample it should be operating as intended, but it’d be nice to be nice to know that the model would always link them
model

where procedure is analogous to temperature rating and item is analogous to meal option

Ok, I get your point.

Though I am not sure if it’s a real problem or not? Anyone else with any ideas? Each obs_procedure realisation will prohibit the obs_items matrix to just one row. I’m not sure if this is sufficient.

Btw I was thinking, in the example where you have a restaurant and orders then what we’ve modelled isn’t strictly correct. As the fact you have an observation implies that the customer ordered at least one item. Whereas in our model all the ordered items can be zero at the same time. I’m not sure how to model this though.

1 Like

Yeah, I’m not sure either. I had a thought though which makes me a fair bit more comfortable - I swapped obs = pm.Bernoulli('obs', p=food_order[hot_day], observed=meal_options) for obs = pm.Bernoulli('obs', p=food_order[obs_temperature], observed=meal_options) which I believe is equivalent. Still not quite the ideal diagram, but there’s a link between the parts of the model now.

Testing it raised another question though. Should I get the same trace summary with pm.sample(random_seed=0) if I re-run the whole script ? I’m getting slight variations most times at the moment, but I thought the purpose of random_seed generally was to allow for repeatable results.

Interesting other thought, too. Shouldn’t be a problem for my use case (all 0s is an unlikely but valid observation) but definitely something to keep in mind for the future.

EDIT:
I’ve fixed the figure by adding a deterministic node between food_order and obs which takes var=food_order[obs_temperature], so now the implied conditional independence is correct. Success. Thanks again for all your help.

1 Like