Multi-label categorical predictor?

I’d like to build a model with a multi-label categorical predictor (with N unique values) for a scalar target variable. In other words, if the categorical feature is one-hot encoded, there can be up to N True values in an observation. The target variable would then be the sum of the latent per-category coefficients selected by the one-hot encoding. I tried looking for a PyMC example with this sort of predictor, but I wasn’t able to find one. Could someone provide a basic template for how to define this sort of model with PyMC? Is PyMC the best package to use, or would Bambi be better?
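
To restate that in symbols, with one-hot matrix $X \in \{0,1\}^{M \times N}$ and latent per-category coefficients $\beta_1, \dots, \beta_N$, I mean something like

$$y_i = \sum_{j=1}^{N} X_{ij}\,\beta_j$$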

The main problem I’m not sure how to tackle is defining N priors that sum to produce the target variable. I want to avoid hard-coding each prior; instead, I want to duplicate a single weak prior N times. Thanks for any help!

I’m not sure I understand what you mean by “multi-label categorical predictor”. Is it a categorical predictor with multiple levels where each observation takes exactly one level at a time, or one where each observation can take one or more levels at a time? I’m thinking it could be one of those cases where someone is presented a question and has to choose all options that apply. If that’s the case, what you have is a set of N binary variables.
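
For example, “choose all that apply” data could be encoded as a binary matrix like this (made-up values, just to illustrate):

import numpy as np

# Rows are observations, columns are the N binary indicators.
# Unlike single-label one-hot encoding, a row can contain several 1s.
X = np.array([
    [1, 0, 1],  # options 1 and 3 apply
    [0, 1, 0],  # only option 2 applies
    [1, 1, 1],  # all options apply
])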

Anyway, I’m not sure I understood. If you could provide an example, it would be great.

Yes, it is essentially a case where there are N binary features. But is there a way to define the priors for these N binaries as a list comprehension, or does PyMC require that I hardcode each prior uniquely?

As an example, let’s try to predict the total of a grocery bill. In this example we are able to categorize all of the items on every person’s receipt. There are N categories. But to make the example interesting, we do not have data about how much each item costs. All we know is the total of the grocery bill and which item categories were purchased for each receipt.

In this case, there could be 30 or more categories, so hard-coding a prior for each category would be very tedious. I’d like to make a model where every item category is given the same weak prior in some kind of for loop. I took a shot at prototyping this example below.

# bool_array is M receipts by N categories
# price_totals is a length-M vector
import pymc as pm

cat_names = ["dairy", "frozen food", "candy", "beverage", "cleaning", "toys", "hygiene", "clothing", ...]
with pm.Model() as model:
    data = pm.Data("data", bool_array)
    y_components = [
        pm.Normal(
            f"y_{cat}",
            pm.Normal(f"μ_{cat}", 20, 5),
            pm.HalfNormal(f"σ_{cat}", 10),
        )
        for cat in cat_names
    ]

    # I'm unsure how to define the likelihood as a dot product of
    # data & y_components with the observed outcomes = price_totals
    y = pm.Deterministic("y", pm.math.dot(y_components, data.T))
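
For reference, my best guess at turning this into a likelihood would be something like the following (assuming the full cat_names list, and a single noise term, which I'm not sure is right):

with model:
    # Stack the list of scalar RVs into a single length-N vector
    components = pm.math.stack(y_components)
    # data is (M, N), so the dot product gives one mean per receipt
    mu = pm.math.dot(data, components)
    sigma = pm.HalfNormal("sigma", 10)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=price_totals)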

Have a look at Distribution Dimensionality — PyMC 5.19.1 documentation

You don’t need to write such a for loop. Have a look at the following example, which includes a model fit with simulated data.

import arviz as az
import numpy as np
import pymc as pm

# Set of categories
categories = [
    "dairy",
    "frozen food",
    "candy",
    "beverage",
    "cleaning",
    "toys",
    "hygiene",
    "clothing"
]

# Number of categories
p = len(categories)

# Number of observations
n = 200

# Initialize a random number generator
rng = np.random.default_rng(1234)

# Simulate values for the "contribution" of each category
beta_categories_theoretical = pm.draw(pm.HalfNormal.dist(), draws=p, random_seed=rng) * 10

# Determine the probability each category is present in a purchase
categories_p = pm.draw(pm.Beta.dist(3, 3), draws=p, random_seed=rng)

# Simulate the categories present in the 'n' purchases
X = pm.draw(pm.Bernoulli.dist(categories_p), draws=n) # shape: (n, p)

# Compute the mean for each purchase
mu_theoretical = X @ beta_categories_theoretical

# Add some random noise, by sampling from a normal centered around mu with some sigma
y_observed = pm.draw(pm.Normal.dist(mu=mu_theoretical, sigma=2.5), random_seed=rng)


# Initialize coordinates dictionary
coords = {
    "__obs__": np.arange(n),  # a simple observation index from 0 to n-1
    "categories": categories
}

with pm.Model(coords=coords) as model:
    # This will be of length equal to the number of categories 'p'
    beta_categories = pm.HalfNormal("beta_categories", dims="categories")

    # This is of shape (n,)
    mu = pm.Deterministic("mu", pm.math.dot(X, beta_categories), dims="__obs__")

    # A prior for the irreducible error
    sigma = pm.HalfNormal("sigma")

    # Observational model
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_observed, dims="__obs__")

# See the model graph representation (display() assumes an IPython/Jupyter session)
display(model.to_graphviz())

# Obtain draws from the posterior
with model:
    idata = pm.sample(random_seed=1234)

# Visualize marginal posteriors
az.plot_posterior(idata, var_names=["beta_categories"], ref_val=beta_categories_theoretical.tolist());

The key is to use dims in PyMC to get a vector of priors (here, one HalfNormal per category) instead of just a single scalar prior.
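
For instance, here’s a minimal standalone sketch of the difference (the category names are placeholders):

import pymc as pm

coords = {"categories": ["dairy", "candy", "toys"]}
with pm.Model(coords=coords):
    # Without dims: a single scalar prior
    scalar_beta = pm.HalfNormal("scalar_beta")
    # With dims: one prior per category, i.e. a length-3 vector
    vector_beta = pm.HalfNormal("vector_beta", dims="categories")

print(scalar_beta.ndim, vector_beta.ndim)  # 0 1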

A few caveats:

  • The model does not contain an intercept (I’m not sure if this applies to your case)
  • All parameters are constrained to be positive (since the presence of a category can only add to the total, and the constraint may help remove some non-identifiability).
  • In your real case, be aware of category aliasing (e.g. categories that always appear together), which will cause parameter non-identifiability; see the sketch below for one way to check.
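
Here’s one way to check for that kind of aliasing before fitting: look for duplicate columns in the design matrix (a minimal sketch, using the X and categories from the example above):

import numpy as np

# Categories whose columns in X are identical always co-occur in the
# data, so their individual coefficients cannot be identified.
_, group_idx = np.unique(X, axis=1, return_inverse=True)
for g in np.unique(group_idx):
    members = [categories[j] for j in np.where(group_idx == g)[0]]
    if len(members) > 1:
        print("aliased categories:", members)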

After thinking about using the dims argument some more, I’ve realized it does eliminate the need for a for loop in my use case. But I’m still unsure how to implement the dot product in the likelihood. Here’s an updated code example:

import numpy as np
import pymc as pm

# make synthetic data
cat_names = ["dairy", "frozen food", "candy", "beverage", "cleaning", "toys", "hygiene", "clothing"]
n_labels = len(cat_names)
n_obs = 500
sparsity = 0.2
bool_array = np.array([np.random.choice([0, 1], n_labels, p=[1 - sparsity, sparsity]) for _ in range(n_obs)])
price_totals = np.random.lognormal(5, 1, n_obs)

coords = {"categories": cat_names}
with pm.Model(coords=coords) as model:
    data = pm.Data("data", bool_array)
    components = pm.Normal(
        "components",
        pm.Normal("μ", 20, 5),
        pm.HalfNormal("σ", 10),
        dims="categories"
    )

    # I'm unsure how to define the likelihood as a dot product of
    # data & components with the observed outcomes = price_totals
    y = pm.Deterministic("y", pm.math.dot(components, data.T))

I’m not sure I understand your question about the likelihood. Have you seen the example I shared? That shows how to do it.

Thanks so much @tcapretto! Somehow I hadn’t seen your reply even though I was checking this page every day. Your model approach makes a lot more sense than my approach! Somehow I had gotten locked into the idea that the Normal distribution for each category needed to be fully defined before applying the dot product.

Regarding your point “parameters are constrained to be positive,” I’m also planning to change the likelihood to a LogNormal to ensure that the predicted totals are always positive. I suspect my real data is closer to log-normal anyway; I was just sticking with Normal in the example to keep it simple.
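
In case it helps anyone later, here’s a rough sketch of what I have in mind, reusing X, y_observed, and coords from the simulated-data example above (untested, and it assumes every observation has at least one category):

import pymc as pm

with pm.Model(coords=coords) as lognormal_model:
    beta_categories = pm.HalfNormal("beta_categories", dims="categories")
    # Dot product gives one positive "total" per observation
    mu = pm.math.dot(X, beta_categories)
    sigma = pm.HalfNormal("sigma")
    # LogNormal's mu parameter lives on the log scale, so passing
    # log(mu) makes each observation's median equal to the dot product.
    # Note: log(0) = -inf, so this breaks for rows of X with no categories.
    pm.LogNormal("y", mu=pm.math.log(mu), sigma=sigma, observed=y_observed, dims="__obs__")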

Thanks so much for the detailed example code, with thorough comments and caveats!
