Modelling interactions between two categorical variables

IvanSanych · March 12, 2020, 8:57am

I face a problem of interaction between two categorical variables. To better understand my question, let’s say a travel agency models the amount of upgrades purchased
as a function of the flight destination and of whether or not the flight has a connection. Obviously, some regions will have more non-stop flights, while others will have fewer of them.

Here’s a toy dataset

data = pd.DataFrame({
  'destination_region':      [0, 0, 1, 1, 1, 2, 2, 2, 2],
  'has_a_connection_flight': [1, 1, 0, 0, 0, 1, 0, 1, 0],
  'sum_of_upgrades'        : [1, 10, 0, 10, 2, 2, 0, 100, 4]
  })
n_regions = data.destination_region.nunique()

Now, modelling the sum of upgrades as a function of each of the independent variable is trivial

with pm.Model() as destination_model:
  mu = pm.Uniform('mu', lower=0.1, upper=10, shape=n_regions)
  sigma = pm.Uniform('sd', lower=0.1, upper=10, shape=n_regions)
  upgrades = pm.Lognormal(
      'upgrades', mu=mu[data.destination_region], sigma=sigma[data.destination_region], 
      observed=data.sum_of_upgrades+1
  )
  trace_destination = pm.sample(200, tune=50)
  
with pm.Model() as connection_model:
  mu = pm.Uniform('mu', lower=0.1, upper=10, shape=2)
  sigma = pm.Uniform('sd', lower=0.1, upper=10, shape=2)
  upgrades = pm.Lognormal(
      'upgrades', mu=mu[data.destination_region], sigma=sigma[data.destination_region], 
      observed=data.sum_of_upgrades+1
  )
  trace_upgrades = pm.sample(200, tune=50)

However, what approach should I take in order to model the interplay between the variables?

nkaimcaudle · March 14, 2020, 12:37am

Hi Ivan

You might want to consider the proportion of customers who upgrade rather than the absolute number who do.

Here is one model that treats the connection as a covariate

with pm.Model() as combined_model:
    mu_destination = pm.Uniform('mu_destination', lower=0.1, upper=10, shape=n_regions)
    beta_connection = pm.Uniform('beta_connection', lower=0.1, upper=10)
    sigma = pm.Gamma('sd', 2, 1)
    
    mu = mu_destination[data.destination_region] + beta_connection*data.has_a_connection_flight
    upgrades = pm.Lognormal(
      'upgrades', mu=mu, sigma=sigma, 
      observed=data.sum_of_upgrades+1
    )
    
    trace_combined = pm.sample()

Which such few data it is hard to make much inferences and the flight with 100 upgrades seems like an outlier compared to the others. Perhaps using the proportion of people who upgrade will help here. I used a single variable sigma rather than have a sigma for each destination.

In your code you had only 200 draws and 50 tuning steps. This is far too low for MCMC. Unless you have good reasons to deviate you should just stick with the defaults in PyMC3.

IvanSanych · March 15, 2020, 9:47am

Thanks. Obviously, this is a toy data thus the small data size and the outliers.

Topic		Replies	Views
Modeling interactions between categorical variables using coords and dims? v5 modeling	2	170	June 17, 2024
Modeling a DAG with Discrete and Categorical Variables modeling	0	58	March 8, 2025
Why my multinomial model with categorical predictors and response differs so much from results in R? version agnostic modeling	5	97	January 26, 2025
Hierarchical model, categorical outcome Questions	0	537	September 15, 2020
Multivariate Bernoulli modelling question v5 modeling	2	355	July 24, 2023

Modelling interactions between two categorical variables

Related topics