I am trying to model an interaction between two categorical variables. To illustrate, suppose a travel agency models the amount of upgrades purchased
as a function of the flight destination and of whether or not the flight has a connection. The two predictors are not independent: some regions will have more non-stop flights than others.
Here’s a toy dataset:
    import pandas as pd
    import pymc as pm  # assumption: on older installs this is "import pymc3 as pm"

    data = pd.DataFrame({
        'destination_region': [0, 0, 1, 1, 1, 2, 2, 2, 2],
        'has_a_connection_flight': [1, 1, 0, 0, 0, 1, 0, 1, 0],
        'sum_of_upgrades': [1, 10, 0, 10, 2, 2, 0, 100, 4]
    })
    n_regions = data.destination_region.nunique()
Modelling the sum of upgrades as a function of each independent variable on its own is straightforward:
    # One (mu, sigma) pair per destination region.
    with pm.Model() as destination_model:
        mu = pm.Uniform('mu', lower=0.1, upper=10, shape=n_regions)
        sigma = pm.Uniform('sd', lower=0.1, upper=10, shape=n_regions)
        upgrades = pm.Lognormal(
            'upgrades',
            mu=mu[data.destination_region.values],
            sigma=sigma[data.destination_region.values],
            observed=data.sum_of_upgrades + 1  # shift by 1 so zero counts are valid for a lognormal
        )
        trace_destination = pm.sample(200, tune=50)
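    # One (mu, sigma) pair per connection status (0 = non-stop, 1 = has a connection).
    with pm.Model() as connection_model:
        mu = pm.Uniform('mu', lower=0.1, upper=10, shape=2)
        sigma = pm.Uniform('sd', lower=0.1, upper=10, shape=2)
        upgrades = pm.Lognormal(
            'upgrades',
            mu=mu[data.has_a_connection_flight.values],
            sigma=sigma[data.has_a_connection_flight.values],
            observed=data.sum_of_upgrades + 1  # shift by 1 so zero counts are valid for a lognormal
        )
        trace_upgrades = pm.sample(200, tune=50)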
However, what approach should I take to model the interaction between the two variables?
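To make the question concrete, here is a minimal sketch of one possible starting point, not a settled approach: treat every (region, connection) pair as its own cell and build a single flat index for it. The names `cell_idx`, `cell_counts` and `empty_cells` are hypothetical and introduced only for illustration. The sketch uses plain Python lists copied from the toy dataset so that it runs on its own.

```python
from collections import Counter

# Toy data copied from the dataset above.
destination_region = [0, 0, 1, 1, 1, 2, 2, 2, 2]
has_a_connection_flight = [1, 1, 0, 0, 0, 1, 0, 1, 0]

n_regions = len(set(destination_region))
n_connection_levels = 2  # 0 = non-stop, 1 = has a connection

# Flat cell index: one integer per (region, connection) combination.
cell_idx = [
    r * n_connection_levels + c
    for r, c in zip(destination_region, has_a_connection_flight)
]

# How many observations fall into each cell.
cell_counts = Counter(zip(destination_region, has_a_connection_flight))

# Cells with no observations at all.
empty_cells = [
    (r, c)
    for r in range(n_regions)
    for c in range(n_connection_levels)
    if (r, c) not in cell_counts
]

print(cell_idx)
print(dict(cell_counts))
print(empty_cells)
```

Under this assumption, `mu` and `sigma` could be declared with `shape=n_regions * 2` and indexed by `cell_idx`, in the same way the single-variable models index by region or by connection status. Note that some cells in the toy data have no observations, so their parameters would be informed by the prior alone.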