# Can someone help me model this data...?

Hi all!
This is my first time using PyMC and my first question on this site. As a bit of background, I completed my master's in stats about 5 years ago and haven't really touched it since (so bear with me as I'm rusty!), until now…

A problem has arisen where I've been tasked with creating some predictions from a very limited set of data. However, we have people on the team with lots of domain experience in the field, so I've seen an opportunity to dust off my Bayesian statistics and use some priors!

Here is the problem (reworded in a fun way):

A fruit seller has a selection of produce, including oranges, apples, pears and bananas. She wants to model the proportion of fruit that she sells to women in the different cities where her stores are located. We have records of the following data:

• City
• Age of store (in days)
• Number of employees working
• Volume of rain of that day
• Total amount spent on item that day
• Total amount spent on item that day by women

For each record, we have the data for each of the items. The priors would then be based on the city and item. For example, if we knew women in Liverpool bought the majority of the bananas at the store, we would choose a Beta prior whose mean is centered at a high value (say 0.9), with the alpha and beta values chosen to best represent our own confidence in this prior. (I understand this will lead to a high number of priors if we increase the number of cities and items.)
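One convenient way to pick alpha and beta is the mean/concentration parameterisation, sketched below; the concentration value here is purely illustrative:

```python
# Sketch: choosing Beta(alpha, beta) from a prior mean and a confidence level.
# mu is the prior mean; kappa (the "concentration") is an assumed value --
# larger kappa encodes more confidence in the prior.
mu, kappa = 0.9, 20
alpha = mu * kappa         # 18.0
beta_ = (1 - mu) * kappa   # 2.0

mean = alpha / (alpha + beta_)   # recovers the prior mean of 0.9
```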

Reminder:
We want to model the proportion of the amount spent on each food item by women in each city.

I have thought about using a Binomial or Bernoulli distribution with a beta prior and I have also thought about using some sort of logistic regression. I was wondering if anyone could answer the following:

1. What's the best method or model to allow us to place priors on each item for each city?
2. Will the cities with no data just return a draw from the prior for that city? And how about an unknown city?
3. How do we predict on a new observation? If someone told us the city, item, days of rain, etc., could we predict the proportion of sales to women for that item?
(I'm sure I'll have more, so I'll update when I think of them!)

I've put together an example of the data, so if anyone wants to jump in and see if they can help, you're more than welcome! (Hopefully it gives you an idea of what I'm trying to achieve.)

```python
from pandas import DataFrame, concat
import random
import numpy as np

# Store / city level
cities = ["Liverpool", "London", "Manchester", "Birmingham", "Leeds", "Glasgow", "Sheffield", "Edinburgh", "Bristol", "Cardiff"]
female_prop = [0.4, 0.55, 0.6, 0.5, 0.45, 0.43, 0.7, 0.6, 0.4, 0.5]
age_of_store = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Averages to simulate from
rain_averages = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]
employee_averages = [2, 4, 3, 5, 3, 4, 5, 6, 3, 6]

# Values to simulate sales from (for index 0, apples would be the most popular purchase and pears the least)
fruit_ratios = [(7, 4, 2, 5), (8, 5, 3, 4), (4, 6, 4, 5), (5, 7, 5, 7), (6, 8, 4, 9), (7, 9, 7, 4), (3, 7, 8, 6), (6, 4, 8, 5), (3, 6, 5, 4), (7, 7, 2, 5)]
fruit_names = ["Apples", "Oranges", "Pears", "Bananas"]

df = DataFrame()
row = 0
for i in range(500):
    city = random.choice(cities)
    city_index = cities.index(city)
    age = age_of_store[city_index]
    fp = female_prop[city_index]

    rain = max([0, np.random.normal(rain_averages[city_index], 40)])
    employees = int(max([1, np.random.normal(employee_averages[city_index] + 3, 3)]))

    fruit_batch = fruit_ratios[city_index]
    for j, fruit in enumerate(fruit_names):
        total_sales = round(max([0, np.random.normal(fruit_batch[j] * employees, 5)]), 2)
        female_sales = round(np.clip(np.random.normal(fp, 0.03), 0, 1) * total_sales, 2)

        tmp = {
            "city": city,
            "age": age,
            "rain": rain,
            "employees": employees,
            "item": fruit,
            "total_sales": total_sales,
            "female_sales": female_sales,
        }

        df = concat([df, DataFrame(tmp, index=[row])])
        row += 1

df
```


This data vaguely mimics the data that we’re actually working with.

The final aim would be to pass in some new data (of the format; city, number of employees, amount of rain…) and be able to get posterior distributions/predictions for each of the fruits.

Hopefully this has made sense and there's someone out there that's able to help!

Thanks

Welcome!

Your modeling problem seems like a good candidate for a hierarchical model. You can see examples of how they are implemented here (this one is extremely thorough) and here. I strongly suggest having a look at these, because they go over everything you need to estimate a model like the one you describe. This is the answer to your first question, and using a hierarchical model also suggests an answer to your second question. When you don't have data for a city or fruit, your prediction becomes the mean of the distribution from which all the city or fruit effects come.

As for choosing priors and a likelihood, you can reach for either a beta or a binomial; both are appropriate for modeling rates. Note that a regression with a binomial likelihood function is logistic regression!

One thing I don’t like about these two likelihood functions is that they imply a relationship between the mean and the variance of the likelihood, which may or may not be appropriate. I have had success modeling rates with the Logistic Normal distribution, which has distinct location and scale parameters.

Whatever likelihood function you choose, I would not suggest trying to model the mean using priors that are constrained between 0 and 1. Instead, use whatever priors you like, then pass the result through a link function to constrain the result between 0 and 1. This seems like it’s the same thing, but it’s not: it will be much easier to run your model if the priors are first allowed to be anything. The link function you want is the inverse logistic. You can read about this type of modeling under the name “GLM”, for Generalized Linear Models.
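To make the link-function idea concrete, here's a minimal sketch of the inverse logistic (sigmoid):

```python
import numpy as np

def inv_logit(eta):
    """Inverse logistic link: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

# Unconstrained linear predictors -- priors can live anywhere on the real line
eta = np.array([-10.0, 0.0, 10.0])
p = inv_logit(eta)  # every value now lies strictly between 0 and 1
```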

There’s a very powerful package called Bambi built on top of PyMC that can handle this class of models for you. Examples of hierarchical models in Bambi, along with beta and logistic regression, are here.

Prediction with GLMs is extremely easy, especially if you use Bambi. There’s a .predict method built into Bambi models that lets you show the model new data and get estimates for sales. If you are feeling adventurous and want to do it yourself, it’s just a single matrix multiplication: \hat y = f(X \beta), where f is the link function, \beta is the vector of estimated parameters, and X is a matrix of data you want to predict from.
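As a sketch of that matrix multiplication (the coefficient values below are made up purely for illustration):

```python
import numpy as np

def inv_logit(eta):
    """Inverse logistic link function f."""
    return 1.0 / (1.0 + np.exp(-eta))

# Hypothetical fitted coefficients beta: [intercept, rain, employees]
beta = np.array([0.2, -0.0005, 0.05])

# New observations as a design matrix X, one row each (constant column first)
X_new = np.array([
    [1.0, 1000.0, 2.0],
    [1.0,  100.0, 6.0],
])

y_hat = inv_logit(X_new @ beta)  # one predicted proportion per row
```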

It will be easier to answer more specific questions once you’ve made some modeling choices and you have some model code to discuss. I’d also think carefully about your data generating process. The code you provided doesn’t produce sales values that depend on rain or age, for example, although this might have been your intention. A simple linear model, good old fashioned OLS, performs pretty much perfectly on your test data:


```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

X = pd.get_dummies(df, columns=['city', 'item'], drop_first=True, dtype=float).drop(columns=['female_sales'])
y = df['female_sales']

mod = sm.OLS(y, X.assign(constant=1))
res = mod.fit(cov_type='HC0')

predict_df = pd.DataFrame(index=pd.MultiIndex.from_product([cities, fruit_names]),
                          columns=['predicted', 'true'])

for idx in predict_df.index:
    city, item = idx
    mask = (df.city == city) & (df.item == item)

    # Predict at the mean of the features for this city-item group
    x_mean = X.loc[mask].assign(constant=1).mean()
    y_hat = float(np.atleast_1d(res.predict(x_mean))[0])
    y_true = df.loc[mask, 'female_sales'].mean()

    predict_df.loc[idx, :] = np.array([y_hat, y_true])
```

Predictions vs true proportions for each fruit-city pair:

If your real data are like this, I would consider not worrying about the fancy likelihood function and just start with a hierarchical model with a normal distribution, then get more complex from there. If your data are quite good, a normally distributed likelihood function will fit well enough that the boundary constraints aren’t binding, and you can ignore them. This is the case in the generated data.


Hi Jesse, thanks for the prompt reply! I’m getting into the Bambi examples at the moment and it seems very interesting, so thank you for the suggested articles.

I was going to share some of my code but was hesitant because it’s most probably completely wrong. I’ve put something together using the linked articles and will share it here:

```python
import bambi as bmb

formula_hier = "p(female_sales, total_sales) ~ 1 + (1|item) + (item|city)"

priors = {
    "Intercept": bmb.Prior("Normal", mu=0, sigma=1),
    "1|item": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("HalfNormal", sigma=1)),
    "item|city": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("HalfNormal", sigma=1)),
}

model_bmb_hier = bmb.Model(formula_hier, df, priors=priors, family="binomial")
model_bmb_hier
```


I’m not quite sure on how best to incorporate the other variables or even the priors I’ve selected.

After fitting the model I get quite a nice plot by running:

```python
import arviz as az

az.plot_forest(fitted_hier, var_names='item|city', r_hat=True, combined=True, textsize=8);
```


To then create predictions I can then pass a table of unseen data into the model like so…

```python
pred_df = DataFrame([
    {"city": "Liverpool", "age": 10, "rain": 1000, "employees": 2, "item": "Apples"},
    {"city": "Liverpool", "age": 10, "rain": 1000, "employees": 2, "item": "Pears"},
    {"city": "Liverpool", "age": 10, "rain": 1000, "employees": 2, "item": "Bananas"},
    {"city": "Liverpool", "age": 10, "rain": 1000, "employees": 2, "item": "Oranges"},
])
pred_hier = model_bmb_hier.predict(fitted_hier, data=pred_df, kind='mean', inplace=False)
az.plot_forest(pred_hier, var_names='p(female_sales, total_sales)_mean', r_hat=True, combined=True, textsize=8);
```


On the point of the data generation, my function was sloppy and written at the latter end of the working day, so apologies for that! An updated version is as follows (hopefully it’s more appropriate!):

```python
from pandas import DataFrame, concat
import random
import numpy as np

# Store / city level
cities = ["Liverpool", "London", "Manchester", "Birmingham", "Leeds", "Glasgow", "Sheffield", "Edinburgh", "Bristol", "Cardiff"]
female_prop = [0.4, 0.55, 0.6, 0.5, 0.45, 0.43, 0.7, 0.6, 0.4, 0.5]
age_of_store = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Averages to simulate from
rain_averages = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]
employee_averages = [2, 4, 3, 5, 3, 4, 5, 6, 3, 6]

# Values to simulate sales from (for index 0, apples would be the most popular purchase and pears the least)
fruit_ratios = [(7, 4, 2, 5), (8, 5, 3, 4), (4, 6, 4, 5), (5, 7, 5, 7), (6, 8, 4, 9), (7, 9, 7, 4), (3, 7, 8, 6), (6, 4, 8, 5), (3, 6, 5, 4), (7, 7, 2, 5)]
fruit_names = ["Apples", "Oranges", "Pears", "Bananas"]

df = DataFrame()
row = 0
for i in range(500):
    city = random.choice(cities)
    city_index = cities.index(city)
    age = age_of_store[city_index]
    fp = female_prop[city_index]

    rain = max([0, np.random.normal(rain_averages[city_index], 40)])
    employees = int(max([1, np.random.normal(employee_averages[city_index] + 3, 3)]))

    fruit_batch = fruit_ratios[city_index]
    for j, fruit in enumerate(fruit_names):
        total_sales = round(max([0, np.random.normal(fruit_batch[j] * employees, 5) + (np.random.normal(age, 5) * np.sqrt(rain))]), 2)
        female_sales = round(np.clip(np.random.normal(fp, 0.03), 0, 1) * total_sales, 2)

        tmp = {
            "city": city,
            "age": age,
            "rain": rain,
            "employees": employees,
            "item": fruit,
            "total_sales": total_sales,
            "female_sales": female_sales,
        }

        df = concat([df, DataFrame(tmp, index=[row])])
        row += 1
```

1. Supposing I know that in Liverpool the proportion of oranges bought by women is usually around 0.6, how could I incorporate this into the model?
2. If we don’t have training data for a specific city that we later want a prediction for, I get the error “The levels xx in ‘city’ are not present in the original data set.” That’s understandable, but I was hoping there was a way to generalize?

I hope this is a step in the right direction (albeit a baby one) and I look forward to your reply. Thanks


All my code is a mess too, don’t be shy about sharing. It looks like you have a good start!

1. For setting the priors in PyMC, rather than passing a scalar value as the mean of an effect, you can pass an array. Say you have 5 cities, the 3rd city is Liverpool, and you want to center the prior for Liverpool at 1.0; then you can write pm.Normal("city_effect", mu=np.array([0, 0, 1, 0, 0]), sigma=sigma, dims='city'). I am not 100% sure of the Bambi syntax for something like this; someone with more experience would need to chime in there.

If your prior is over conditional information, as in your example that oranges in Liverpool sell better, you would tweak the mean of the item-given-city effect.

2. If you want to make a prediction for a city you don’t have data for, and which isn’t in your model, you’re asking for too much. You have two options:

First, you could gather all the smaller cities in your dataset into a single “other cities” label, then use this label to generate predictions for unseen cities. The idea here is that smaller markets are all similar to each other and all different from the population mean.
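In pandas, that relabelling might look something like this (the data and the cutoff are arbitrary):

```python
import pandas as pd

# Toy data: one large city and two cities with very few observations
df = pd.DataFrame({"city": ["London"] * 50 + ["Leeds"] * 3 + ["Cardiff"] * 2})

counts = df["city"].value_counts()
small = counts[counts < 10].index  # cities with too few observations

# Keep the big cities; pool everything else into a single "Other" label
df["city_grouped"] = df["city"].where(~df["city"].isin(small), "Other")
```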

Second, you could set all the city effects to zero. This would have the effect of drawing from the mean of the city effects distribution. You would be saying that “our best guess of the effect of an unknown city is the average city effect”.

Again, I apologize because I don’t know the Bambi-specific syntax here. Looking at the predict method I linked, it seems like it would involve setting include_group_specific = False, but I’m not sure if you could turn off city effects but leave on item effects, for example.