Update:
Using Multinomial instead of Categorical works, i.e.
pm_choice = pm.Multinomial('pm_choice',
                           n=1,
                           p=pm_data_prob,
                           shape=11,
                           observed=data_simulated)
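(Side note on why this works, my own quick check rather than anything from the PyMC3 docs: a Multinomial with n=1 draws exactly one one-hot row per trial, which matches the encoding of data_simulated.)
import numpy as np

# With n=1, each draw is a one-hot vector: exactly one entry is 1
draws = np.random.multinomial(1, [0.2, 0.3, 0.5], size=5)
print(draws)                # five one-hot rows
print(draws.sum(axis=1))    # -> [1 1 1 1 1]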
I also need to change init when sampling, or I get the 'Bad initial energy' error:
trace = pm.sample(10000, init='adapt_diag')
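For debugging the bad energy, this sketch (assuming the model object defined in the original post below) prints each variable's log-probability at the initial point; a -inf entry flags the variable whose starting value has zero probability:
# Log-probability of every variable at the initial point; a -inf entry
# indicates the variable responsible for the 'Bad initial energy' error
print(model.check_test_point())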
The resulting posterior looks as expected. Moreover, I get the following warning:
C:\Users\...\Anaconda3\lib\site-packages\theano\tensor\subtensor.py:2197: FutureWarning:
Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]`
instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`,
which will result either in an error or a different result.
rval = inputs[0].__getitem__(inputs[1:])
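I can silence it with a standard warnings filter, but that only hides the message rather than fixing anything (a workaround sketch):
import warnings
# Hide the Theano FutureWarning about non-tuple indexing; behaviour is unchanged
warnings.filterwarnings('ignore', category=FutureWarning)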
Any ideas on the warning and why I need to change init when sampling?
Original Post:
I am new to PyMC3 and trying to model the latent factors behind the realisation of a categorical variable.
import numpy as np
import pandas as pd
import pymc3 as pm
from theano import tensor as T
from scipy.stats import bernoulli, logistic
Here is some simulated data. The data is simulated in a way corresponding to the model specification below. You may think of the data generating process like this:
- A customer, e.g. on the Amazon checkout page, has 11 choices:
– Choices 1-9: Order and get delivery in one of 9 pre-defined time windows
– Choice 10: Order, but do not choose a pre-defined time window
– Choice 11: Leave the checkout and do not order
- One of the above choices is always made
- The data below shows the choices made per customer; each row is a customer, each column a choice dummy (e.g. customer 1 chose to order at time window 9)
data_simulated = np.array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]])
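As a quick consistency check (my own snippet, not part of the model), every row should be exactly one-hot, and argmax recovers the chosen alternative per customer:
# Each row contains exactly one 1 (one choice per customer) ...
assert (data_simulated.sum(axis=1) == 1).all()
# ... and argmax gives the 0-based index of the chosen alternative
print(data_simulated.argmax(axis=1))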
However, not all choices are always available:
- Choice 10 and 11 are always available
- Choice 1-9 may or may not be available
- The data below shows for each customer the available choices (only 10 columns here, as the 11th “do not order” choice is always available and therefore excluded).
slots_available = np.array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
[0, 1, 0, 1, 1, 1, 1, 1, 1, 1],
[1, 0, 0, 1, 0, 0, 1, 0, 1, 1],
[0, 0, 1, 0, 0, 1, 1, 1, 1, 1],
[0, 1, 0, 1, 0, 0, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 0, 1, 1, 0, 1],
[0, 0, 1, 0, 0, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
[0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
[1, 0, 0, 0, 1, 1, 1, 0, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 1, 0, 1, 1, 1, 1],
[0, 0, 0, 1, 1, 0, 0, 1, 1, 1],
[1, 0, 0, 1, 1, 0, 1, 1, 0, 1],
[0, 0, 0, 1, 1, 0, 1, 1, 1, 1],
[0, 1, 0, 0, 1, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 1, 0, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
[1, 1, 0, 1, 0, 0, 1, 0, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 0, 1],
[0, 0, 0, 1, 1, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
[0, 0, 1, 0, 1, 0, 0, 1, 1, 1],
[0, 0, 0, 1, 0, 0, 1, 1, 1, 1],
[0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
[0, 0, 0, 1, 1, 1, 0, 1, 1, 1]])
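Another sanity check I find useful (again my own snippet): whenever a customer picked one of choices 1-10, that slot must have been available to them:
chosen = data_simulated.argmax(axis=1)   # 0-based chosen column per customer
ordered = chosen < 10                    # rows where one of choices 1-10 was taken
# Every chosen slot (columns 0-9) was available to that customer
assert slots_available[np.where(ordered)[0], chosen[ordered]].all()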
More samples may be generated by running the following code:
# Number of customers
N = 3000
# Latent preferences for choices 1-10: if a choice is not available,
# the probabilities of the other choices increase proportionally to preference_lat.
# preference_lat shall later be estimated from the data; this does not work yet in
# my PyMC3 implementation
preference_lat = np.array([0.25, 0.125, 0.125, 0.10, 0.10, 0.10, 0.10, 0.05, 0.025, 0.025])
# Probability that choices 1-10 are available. The probability of choice 10 is set
# to 1, so it is always available. The estimation of this in my PyMC3 implementation works
p_slots_avail = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.8, 0.8, 0.8, 1])
# Draw the available choice space per customer
slots_available = bernoulli.rvs(p_slots_avail, size=(N, 10))
# Number of available choices per customer
no_slots = slots_available.sum(axis=1)
# Probability of ordering vs. not ordering (choice 11): when the choice space is
# larger, the probability of actually ordering increases (the probability of
# not ordering decreases). The estimation of a and b in my PyMC3 implementation
# through logistic regression works well
a = -2
b = 0.8
p_conv = logistic.cdf(a + b*no_slots)
# Keep only preferences for available choices
slot_prob = slots_available * preference_lat
# Normalize preferences to probabilities of choices 1 to 10, given that the customer orders
slot_prob_normalized = slot_prob / (slots_available*preference_lat).sum(axis=1)[..., np.newaxis]
# Add the probability of not ordering (logistic output) and renormalize such that
# all probabilities sum to 1
slot_prob_norm2 = p_conv[..., np.newaxis] * slot_prob_normalized
data_prob = np.concatenate((slot_prob_norm2, 1 - p_conv[..., np.newaxis]), axis=1)
# Draw from the choice space. Each customer (row) faces a different
# choice distribution, as the available choices differ between customers
# (according to p_slots_avail)
def mnomial(slc):
    return np.random.multinomial(1, slc)
data_simulated = np.apply_along_axis(mnomial, 1, data_prob)
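To verify the simulation (a quick check of my own), the empirical choice frequencies should be close to the average of the per-customer choice probabilities for large N:
# Empirical choice frequencies vs. the average of the per-customer
# probabilities; for large N the two vectors should be close
print(data_simulated.mean(axis=0))
print(data_prob.mean(axis=0))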
When I try to run the following code, I get this error message due to the last step (pm_choice):
ValueError: Input dimension mis-match. (input[0].shape[1] = 11, input[1].shape[1] = 30)
with pm.Model() as model:
    # This part works as intended, the posteriors are around the true values of the
    # choice availability distributions
    pm_p_slots_avail = pm.Uniform('pm_p_slots_avail', shape=10, lower=0, upper=1)
    pm_slots_available = pm.Bernoulli('pm_slots_available', p=pm_p_slots_avail, shape=10,
                                      observed=slots_available)
    # This logistic regression part works as intended, the posteriors of pm_a and pm_b are
    # around the true values of a and b
    pm_no_slots = pm.Deterministic('pm_no_slots', pm_slots_available.sum(axis=1))
    pm_a = pm.Uniform('pm_a', lower=-10, upper=10)
    pm_b = pm.Uniform('pm_b', lower=-10, upper=10)
    pm_p_conv = pm.Deterministic('pm_p_conv', pm.math.sigmoid(pm_a + pm_b*pm_no_slots))
    pm_order = pm.Bernoulli('pm_order',
                            p=pm_p_conv,
                            observed=1-data_simulated[:, 10])
    # This part does not work / sampling for pm_choice is extremely slow
    pm_preference_lat = pm.Dirichlet('pm_preference_lat', a=np.ones(10), shape=10)
    pm_slot_prob = pm_slots_available * pm_preference_lat
    pm_slot_prob_normalized = pm_slot_prob / (pm_slots_available*pm_preference_lat).sum(axis=1)[..., np.newaxis]
    pm_slot_prob_norm2 = pm_p_conv[..., np.newaxis] * pm_slot_prob_normalized
    pm_data_prob = T.concatenate((pm_slot_prob_norm2, 1 - pm_p_conv[..., np.newaxis]), axis=1)
    pm_choice = pm.Categorical('pm_choice',
                               p=pm_data_prob,
                               observed=data_simulated)
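My understanding (possibly wrong) is that pm.Categorical expects integer-coded observations of shape (N,) rather than one-hot rows, which would explain the mismatch between the 11 columns of the probabilities and the 30 one-hot rows. The pandas conversion below does exactly this; a numpy one-liner would also work (data_simulated_int is my own name, not from the model above):
# Hypothetical one-line equivalent of the pandas conversion below:
# argmax over each one-hot row recovers the 0-based choice index
data_simulated_int = data_simulated.argmax(axis=1)  # shape (30,)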
If I ‘un-one-hot-encode’ the choice columns and replace the observed data in pm_choice like this …
data_simulated_df = pd.DataFrame(data_simulated)
def get_choice(row):
    for c in data_simulated_df.columns:
        if row[c] == 1:
            return c
data_simulated_rev = np.asarray(data_simulated_df.apply(get_choice, axis=1))
with model:
    # This part does not work / sampling for pm_choice is extremely slow
    pm_preference_lat = pm.Dirichlet('pm_preference_lat', a=np.ones(10), shape=10)
    pm_slot_prob = pm_slots_available * pm_preference_lat
    pm_slot_prob_normalized = pm_slot_prob / (pm_slots_available*pm_preference_lat).sum(axis=1)[..., np.newaxis]
    pm_slot_prob_norm2 = pm_p_conv[..., np.newaxis] * pm_slot_prob_normalized
    pm_data_prob = T.concatenate((pm_slot_prob_norm2, 1 - pm_p_conv[..., np.newaxis]), axis=1)
    pm_choice = pm.Categorical('pm_choice',
                               p=pm_data_prob,
                               observed=data_simulated_rev)
… I can get the model to run with Metropolis sampling, however extremely slowly. NUTS raises the bad energy exception, i.e. there must be some misspecification? NUTS works (still extremely slowly) if I pass pm_data_prob + 0.01 instead of pm_data_prob.
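My guess (unverified) is that unavailable slots get probability exactly 0, so NUTS hits log(0) at initialisation; clipping and renormalizing might be a cleaner workaround than adding 0.01 (pm_data_prob_safe is my own name):
# Clip zero probabilities to a tiny epsilon and renormalize so each row
# still sums to 1; pm_data_prob_safe is a hypothetical name
pm_data_prob_safe = T.clip(pm_data_prob, 1e-9, 1.0)
pm_data_prob_safe = pm_data_prob_safe / pm_data_prob_safe.sum(axis=1, keepdims=True)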
Does anyone have any idea how to fix this problem? In the end I need the posterior of pm_preference_lat. The posteriors of pm_p_slots_avail, pm_a and pm_b correspond to the values I used in simulating the data.
Some background on the model above:
The variable of interest can take 11 different realisations, i.e. think of an individual who can choose between 11 alternatives.
The last choice (number 11) is always available, however the first 10 choices are not necessarily available for each draw / individual.
The availability for each choice was simulated with the parameter set
p_slots_avail = np.array([0.1, 0.1, 0.1, 0.5, 0.5, 0.5, 0.8, 0.8, 0.8, 1])
I.e. choice 1 is available with P = 10%, choice 10 with P = 100%. The posterior of pm_p_slots_avail lies around these values when I simulate a big enough dataset.
The probability of choice 11 is determined by the number of remaining available choices. The data was simulated with
a = -2
b = 0.8
(actually the probability of not choosing choice 11 was modelled, i.e. 1-choice11)
The posterior of pm_a and pm_b lies around these values when I simulate a big enough dataset.
The probability of choices 1-10 is determined by the latent factors
preference_lat = np.array([0.25, 0.125, 0.125, 0.10, 0.10, 0.10, 0.10, 0.05, 0.025, 0.025])
and the probability of a choice, given that it is available, is proportional to this vector. I.e. if choice 1 is not available and choice 11 is not taken, choice 2 will be chosen with P ≈ 16.7% (= 0.125/np.array([0.125, 0.125, 0.10, 0.10, 0.10, 0.10, 0.05, 0.025, 0.025]).sum()).
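For completeness, that renormalization can be checked numerically (my own snippet):
# Preference of choice 2 renormalized over the nine remaining available
# choices: 0.125 / 0.75 ≈ 0.1667
remaining = np.array([0.125, 0.125, 0.10, 0.10, 0.10, 0.10, 0.05, 0.025, 0.025])
print(0.125 / remaining.sum())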