Modelling interactions between two categorical variables

I face a problem of interaction between two categorical variables. To better understand my question, let’s say a travel agency models the amount of upgrades purchased
as a function of the flight destination and of whether or not the flight has a connection. Obviously, some regions will have more non-stop flights, while others will have fewer of them.

Here’s a toy dataset

data = pd.DataFrame({
  'destination_region':      [0, 0, 1, 1, 1, 2, 2, 2, 2],
  'has_a_connection_flight': [1, 1, 0, 0, 0, 1, 0, 1, 0],
  'sum_of_upgrades'        : [1, 10, 0, 10, 2, 2, 0, 100, 4]
n_regions = data.destination_region.nunique()

Now, modelling the sum of upgrades as a function of each of the independent variable is trivial

with pm.Model() as destination_model:
  mu = pm.Uniform('mu', lower=0.1, upper=10, shape=n_regions)
  sigma = pm.Uniform('sd', lower=0.1, upper=10, shape=n_regions)
  upgrades = pm.Lognormal(
      'upgrades', mu=mu[data.destination_region], sigma=sigma[data.destination_region], 
  trace_destination = pm.sample(200, tune=50)
with pm.Model() as connection_model:
  mu = pm.Uniform('mu', lower=0.1, upper=10, shape=2)
  sigma = pm.Uniform('sd', lower=0.1, upper=10, shape=2)
  upgrades = pm.Lognormal(
      'upgrades', mu=mu[data.destination_region], sigma=sigma[data.destination_region], 
  trace_upgrades = pm.sample(200, tune=50)    

However, what approach should I take in order to model the interplay between the variables?

Hi Ivan

You might want to consider the proportion of customers who upgrade rather than the absolute number who do.

Here is one model that treats the connection as a covariate

with pm.Model() as combined_model:
    mu_destination = pm.Uniform('mu_destination', lower=0.1, upper=10, shape=n_regions)
    beta_connection = pm.Uniform('beta_connection', lower=0.1, upper=10)
    sigma = pm.Gamma('sd', 2, 1)
    mu = mu_destination[data.destination_region] + beta_connection*data.has_a_connection_flight
    upgrades = pm.Lognormal(
      'upgrades', mu=mu, sigma=sigma, 
    trace_combined = pm.sample()

Which such few data it is hard to make much inferences and the flight with 100 upgrades seems like an outlier compared to the others. Perhaps using the proportion of people who upgrade will help here. I used a single variable sigma rather than have a sigma for each destination.

In your code you had only 200 draws and 50 tuning steps. This is far too low for MCMC. Unless you have good reasons to deviate you should just stick with the defaults in PyMC3.

Thanks. Obviously, this is a toy data thus the small data size and the outliers.