Modelling a Categorical Variable

Update:
Using pm.Multinomial instead of pm.Categorical works (with n=1, a Multinomial draw is exactly a one-hot-encoded categorical draw), i.e.

pm_choice = pm.Multinomial('pm_choice', 
                           n=1,
                           p=pm_data_prob,
                           shape=11,
                           observed=data_simulated) 

I also need to change init when sampling, otherwise I get a bad energy error:

trace = pm.sample(10000, init='adapt_diag')

The resulting posterior looks as expected. Moreover, I get the following warning:

C:\Users\...\Anaconda3\lib\site-packages\theano\tensor\subtensor.py:2197: FutureWarning: 
Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` 
instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`,
which will result either in an error or a different result.
rval = inputs[0].__getitem__(inputs[1:])

Any ideas on the warning, and on why I need to change init when sampling?

Original Post:
I am new to PyMC3 and trying to model the latent factors behind the realisation of a categorical variable.

import numpy as np
import pandas as pd
import pymc3 as pm
from theano import tensor as T
from scipy.stats import bernoulli, logistic

Here is some simulated data; it is generated in a way that corresponds to the model specification below. You may think of the data-generating process like this:

  • A customer, e.g. on the Amazon checkout page, has 11 choices
    – Choices 1-9: order and get delivery in one of 9 pre-defined time windows
    – Choice 10: order, but do not choose a pre-defined time window
    – Choice 11: leave the checkout and do not order
  • One of the above choices is always made
  • The data below shows the choices made per customer: each row is a customer, each column a choice dummy (e.g. customer 1 chose to order at time window 9)
data_simulated = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]]) 
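
A quick sanity check on this structure:

# every row is a valid one-hot choice over the 11 alternatives
assert data_simulated.shape[1] == 11
assert (data_simulated.sum(axis=1) == 1).all()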

However, not all choices are always available:

  • Choices 10 and 11 are always available
  • Choices 1-9 may or may not be available
  • The data below shows the available choices for each customer (only 10 columns here, as the 11th “do not order” choice is excluded).
slots_available = np.array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
       [0, 1, 0, 1, 1, 1, 1, 1, 1, 1],
       [1, 0, 0, 1, 0, 0, 1, 0, 1, 1],
       [0, 0, 1, 0, 0, 1, 1, 1, 1, 1],
       [0, 1, 0, 1, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 0, 1, 1, 0, 1],
       [0, 0, 1, 0, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
       [0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
       [1, 0, 0, 0, 1, 1, 1, 0, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
       [0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
       [1, 0, 0, 1, 1, 0, 1, 1, 0, 1],
       [0, 0, 0, 1, 1, 0, 1, 1, 1, 1],
       [0, 1, 0, 0, 1, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
       [0, 0, 1, 0, 1, 1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
       [1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
       [0, 0, 0, 1, 1, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
       [0, 0, 1, 0, 1, 0, 0, 1, 1, 1],
       [0, 0, 0, 1, 0, 0, 1, 1, 1, 1],
       [0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
       [0, 0, 0, 1, 1, 1, 0, 1, 1, 1]])
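
The availability matrix can be sanity-checked the same way:

# 10 availability columns (choices 1-10); choice 10 is always available
assert slots_available.shape[1] == 10
assert slots_available[:, -1].all()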

More samples may be generated by running the following code:

# Number of customers
N = 3000

# Latent preferences for choices 1-10: if a choice is not available,
# the probabilities of the remaining choices increase proportionally to
# preference_lat. preference_lat shall later be estimated from the data;
# this part does not work yet in my PyMC3 implementation.
preference_lat = np.array([0.25, 0.125, 0.125, 0.10, 0.10, 0.10, 0.10, 0.05, 0.025, 0.025])
# Probability that choices 1-10 are available. The probability for choice 10
# is set to 1, so it is always available. The estimation of this part works
# in my PyMC3 implementation.
p_slots_avail = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.8, 0.8, 0.8, 1])
# Draw from choice space
slots_available = bernoulli.rvs(p_slots_avail, size=(N, 10))
# Number of available choices per customer
no_slots = slots_available.sum(axis=1)

# Probability of choice 11: when the choice space is larger, the probability
# of actually ordering increases (i.e. the probability of not ordering
# decreases). The estimation of a and b through logistic regression works
# well in my PyMC3 implementation.
a = -2
b = 0.8
p_conv = logistic.cdf(a + b*no_slots)

# Keep only preferences for available choices
slot_prob = slots_available * preference_lat
# Normalize preferences to probability of choice 1 to 10 given that customer orders
slot_prob_normalized = slot_prob / (slots_available*preference_lat).sum(axis=1)[..., np.newaxis]
# Add the probability of not ordering (logistic output) and renormalize such that 
# all probabilities sum to 1
slot_prob_norm2 = p_conv[..., np.newaxis] * slot_prob_normalized
data_prob = np.concatenate((slot_prob_norm2, 1 - p_conv[..., np.newaxis]), axis=1)

# Draw from the choice space. Each customer (row) faces a different
# choice distribution, as the available choices differ between customers
# (according to p_slots_avail)
def mnomial(slc):
    return np.random.multinomial(1, slc)
data_simulated = np.apply_along_axis(mnomial, 1, data_prob)
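
Two consistency checks on the simulated data (they hold by construction):

# each row of data_prob is a proper probability distribution
np.testing.assert_allclose(data_prob.sum(axis=1), 1.0)
# elementwise implication on booleans: a chosen slot must be an available slot
assert ((data_simulated[:, :10] == 1) <= (slots_available == 1)).all()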

When I try to run the following code, I get this error message, caused by the last step (pm_choice):

ValueError: Input dimension mis-match. (input[0].shape[1] = 11, input[1].shape[1] = 30)

with pm.Model() as model:
    # This part works as intended, the posteriors are around the true values of the 
    # choice availability distributions
    pm_p_slots_avail = pm.Uniform('pm_p_slots_avail', shape=10, lower=0, upper=1)
    pm_slots_available = pm.Bernoulli('pm_slots_available', p=pm_p_slots_avail, shape=10, 
                                       observed=slots_available)
    
    # This logistic regression part works as intended, the posteriors of pm_a and pm_b are 
    # around the true values of a and b
    pm_no_slots = pm.Deterministic('pm_no_slots', pm_slots_available.sum(axis=1))
    pm_a = pm.Uniform('pm_a', lower=-10, upper=10)
    pm_b = pm.Uniform('pm_b', lower=-10, upper=10)
    pm_p_conv = pm.Deterministic('pm_p_conv', pm.math.sigmoid(pm_a + pm_b*pm_no_slots))
    pm_order = pm.Bernoulli('pm_order', 
                            p=pm_p_conv,
                            observed=1-data_simulated[:, 10])
    
    # This part does not work / sampling for pm_choice is extremely slow
    pm_preference_lat = pm.Dirichlet('pm_preference_lat', a=np.ones(10), shape=10)
    pm_slot_prob = pm_slots_available * pm_preference_lat
    pm_slot_prob_normalized = pm_slot_prob / (pm_slots_available*pm_preference_lat).sum(axis=1)[..., np.newaxis] 
    pm_slot_prob_norm2 = pm_p_conv[..., np.newaxis] * pm_slot_prob_normalized
    pm_data_prob = T.concatenate((pm_slot_prob_norm2, 1 - pm_p_conv[..., np.newaxis]), axis=1)
    pm_choice = pm.Categorical('pm_choice', 
                               p=pm_data_prob,
                               observed=data_simulated)

If I ‘un-one-hot-encode’ the choice columns and replace the observed data in pm_choice like this …

data_simulated_df = pd.DataFrame(data_simulated)

def get_choice(row):
    for c in data_simulated_df.columns:
        if row[c]==1:
            return c

data_simulated_rev = np.asarray(data_simulated_df.apply(get_choice, axis=1))
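
(Equivalently, since each row contains exactly one 1, the loop can be replaced by a one-liner:)

# index of the single 1 in each row, same result as get_choice
data_simulated_rev = data_simulated.argmax(axis=1)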

with model:
    # This part does not work / sampling for pm_choice is extremely slow
    pm_preference_lat = pm.Dirichlet('pm_preference_lat', a=np.ones(10), shape=10)
    pm_slot_prob = pm_slots_available * pm_preference_lat
    pm_slot_prob_normalized = pm_slot_prob / (pm_slots_available*pm_preference_lat).sum(axis=1)[..., np.newaxis] 
    pm_slot_prob_norm2 = pm_p_conv[..., np.newaxis] * pm_slot_prob_normalized
    pm_data_prob = T.concatenate((pm_slot_prob_norm2, 1 - pm_p_conv[..., np.newaxis]), axis=1)
    pm_choice = pm.Categorical('pm_choice', 
                               p=pm_data_prob,
                               observed=data_simulated_rev)

… I can get the model to run with Metropolis sampling, but it is extremely slow. NUTS raises the bad energy exception, i.e. there must be some mis-specification? NUTS works (extremely slowly) if I pass pm_data_prob + 0.01 instead of pm_data_prob.
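
For reference, this is what the workaround amounts to; flooring the probabilities with T.clip instead of adding a constant would distort them less (a sketch replacing the pm_choice line above; the 1e-6 floor is an arbitrary choice):

# floor the probabilities so that no entry is exactly zero, which seems
# to be what trips up NUTS here; larger entries are left untouched
pm_data_prob_safe = T.clip(pm_data_prob, 1e-6, 1.0)
pm_choice = pm.Categorical('pm_choice',
                           p=pm_data_prob_safe,
                           observed=data_simulated_rev)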

Does anyone have an idea how to fix this problem? In the end, I need the posterior of pm_preference_lat. The posteriors of pm_p_slots_avail, pm_a and pm_b already correspond to the values I used to simulate the data.

Some background on the model above:

The variable of interest can take 11 different realisations, i.e. think of an individual who can choose between 11 alternatives.
The last choice (number 11) is always available; however, the first 10 choices are not necessarily available for each draw / individual.

The availability for each choice was simulated with the parameter set

p_slots_avail = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.8, 0.8, 0.8, 1])

I.e. choice 1 is available with P = 10% and choice 10 with P = 100% (it is always available). The posterior of pm_p_slots_avail lies around these values when I simulate a big enough dataset.

The probability of choice 11 is determined by the number of remaining available choices. The data was simulated with

a = -2
b = 0.8

(Actually, the probability of not choosing choice 11 was modelled, i.e. 1 - choice 11.)
The posteriors of pm_a and pm_b lie around these values when I simulate a big enough dataset.

The probability of choices 1-10 is determined by the latent factors

preference_lat = np.array([0.25, 0.125, 0.125, 0.10, 0.10, 0.10, 0.10, 0.05, 0.025, 0.025])

and the probability of a choice, given that it is available, is proportional to this vector. I.e. if choice 1 is not available and choice 11 is not taken, choice 2 will be chosen with P = 16.6% (= 0.125 / np.array([0.125, 0.125, 0.10, 0.10, 0.10, 0.10, 0.05, 0.025, 0.025]).sum()).
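
In numpy, this conditional renormalization is simply (reproducing the numbers above):

pref = np.array([0.25, 0.125, 0.125, 0.10, 0.10, 0.10, 0.10, 0.05, 0.025, 0.025])
avail = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 1])  # choice 1 not available
p = avail * pref
p = p / p.sum()  # renormalize over the available choices
print(p[1])      # 0.1666..., i.e. choice 2 is chosen with P = 16.6%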

Hi! Here are some thoughts, but I’m not sure I understood your model specification, so take them with a grain of salt:

  • ValueError: you seem to have a shape problem somewhere. You can check the shape of any PyMC variable with print("p: ", p.tag.test_value.shape), for instance. This prints the shape during compilation of the model, and I usually find that useful (see the sketch after this list)!

  • I think your priors are too wide, which could explain NUTS having a hard time sampling when your data are in multinomial shape. In multinomial regressions, priors on the parameter scale get completely distorted on the probability scale (mainly because of floor/ceiling effects). I’d try much more regularizing priors, combined with prior predictive checks (see the second sketch after this list).

  • I don’t know if it’s on purpose, but I only see 10 slots here, with p_1 = 0.1 and p_10 = 1. Maybe that is causing your shape problem?

  • I’m actually working on a case that looks a little like yours. I’m still fighting with how to implement the model in PyMC, but maybe you’ll find something useful - maybe we’ll even help each other out!

  • If I can give a personal opinion, your model specification can be hard to read because of the pm_ prefixes, as they look a lot like the PyMC calls. If you could trim them - and maybe even comment the important steps - I think that would make it easier to read and understand :)
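
To illustrate the shape check from the first point (a sketch using the names from your model; Theano test values are available while the model is being built):

# right after pm_data_prob is defined inside the model block:
print("pm_data_prob:", pm_data_prob.tag.test_value.shape)
print("observed:", data_simulated.shape)

And for the priors, something like the following would be far more regularizing than Uniform(-10, 10) (again a sketch; the scale 2.5 is just a guess to be checked with prior predictive samples):

pm_a = pm.Normal('pm_a', mu=0., sd=2.5)
pm_b = pm.Normal('pm_b', mu=0., sd=2.5)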

Hope all of that helps you!

Hi Alex,

thanks for your reply! I added some comments above; maybe that makes the DGP and the model a bit clearer.

My main problem is that I do not understand why pm.Categorical can’t deal with 0 probabilities in pm_data_prob. And even if I add 0.01 to each element, sampling is extremely slow.

On your points:

  • ValueError: Any problems must come from the last part (pm_preference_lat). If I simulate 3000 observations, pm_data_prob has the shape (3000, 11) and data_simulated_rev (3000,). Sampling like this works (if I guarantee strictly positive probabilities for each choice), but it is extremely slow.

  • Priors: The uniform priors with range (-10, 10) actually work fine; I get posteriors quite close to the true values for pm_a, pm_b and the 10 elements of pm_p_slots_avail. For the Dirichlet prior of pm_preference_lat, I suppose I can only vary “a” or use other starting values?

  • Yep, the pm_ prefix is probably not such a good idea; I somehow needed to distinguish the simulation variables from the PyMC3 variables. I’ll change that.