Can someone help me model this data...?

Hi Jesse, thanks for the prompt reply! I’m working through the Bambi examples at the moment and they look very interesting, so thank you for the suggested articles.

I was going to share some of my code but was hesitant because it’s most probably completely wrong. I’ve put something together using the linked articles and will share it here:

import arviz as az
import bambi as bmb

formula_hier = "p(female_sales, total_sales) ~ 1 + (1|item) + (item|city)"

priors = {
    "Intercept": bmb.Prior("Normal", mu=0, sigma=1),
    "1|item": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("HalfNormal", sigma=1)),
    "item|city": bmb.Prior("Normal", mu=0, sigma=bmb.Prior("HalfNormal", sigma=1))
}

model_bmb_hier = bmb.Model(formula_hier, df, priors=priors, family="binomial")
model_bmb_hier

# Fitting step (default sampler settings assumed)
fitted_hier = model_bmb_hier.fit()

I’m not quite sure how best to incorporate the other variables, or whether the priors I’ve selected are sensible.

After fitting the model I get quite a nice plot by running:

az.plot_forest(fitted_hier, var_names='item|city', r_hat=True, combined=True, textsize=8);

To create predictions, I can then pass a table of unseen data into the model like so:

pred_df = DataFrame(
    [
        {
            "city": "Liverpool",
            "age": 10,
            "rain": 1000,
            "employees": 2,
            "item": "Apples"
        },
        {
            "city": "Liverpool",
            "age": 10,
            "rain": 1000,
            "employees": 2,
            "item": "Pears"
        },
        {
            "city": "Liverpool",
            "age": 10,
            "rain": 1000,
            "employees": 2,
            "item": "Bananas"
        },
        {
            "city": "Liverpool",
            "age": 10,
            "rain": 1000,
            "employees": 2,
            "item": "Oranges"
        },
    ]
)
pred_hier = model_bmb_hier.predict(fitted_hier, data=pred_df, kind='mean', inplace=False)
az.plot_forest(pred_hier, var_names='p(female_sales, total_sales)_mean', r_hat=True, combined=True, textsize=8);
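
Before calling `predict`, it can be useful to check that every `city` and `item` in the new rows also appears in the training data, since the group-specific terms are estimated per level. A minimal sketch using plain Python records (the helper `unseen_levels` and the sample rows are hypothetical, for illustration only; with pandas, `set(pred_df["city"]) - set(df["city"])` gives the same information):

```python
def unseen_levels(train_rows, new_rows, column):
    """Return the sorted values of `column` in new_rows that never occur in train_rows."""
    seen = {row[column] for row in train_rows}
    return sorted({row[column] for row in new_rows} - seen)


# Illustrative records only, not the real data.
train_rows = [
    {"city": "Liverpool", "item": "Apples"},
    {"city": "London", "item": "Pears"},
]
new_rows = [
    {"city": "Liverpool", "item": "Apples"},
    {"city": "York", "item": "Apples"},
]

missing_cities = unseen_levels(train_rows, new_rows, "city")
missing_items = unseen_levels(train_rows, new_rows, "item")
```

An empty list means every level in the new rows was seen during training.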

On the point of the data generation: my function was sloppy and written at the end of the working day, so apologies for that! An updated version is as follows (hopefully it’s more appropriate!):

from pandas import DataFrame, concat
import random
import numpy as np

# Store / city level
cities = ["Liverpool", "London", "Manchester", "Birmingham", "Leeds", "Glasgow", "Sheffield", "Edinburgh", "Bristol", "Cardiff"]
female_prop = [0.4, 0.55, 0.6, 0.5, 0.45, 0.43, 0.7, 0.6, 0.4, 0.5]
age_of_store = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Averages to simulate from
rain_averages = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]
employee_averages = [2,4,3,5,3,4,5,6,3,6]

# Values to simulate sales from (for index 0 - apples would be the most popular purchase and pears the least)
fruit_ratios = [(7, 4, 2, 5), (8, 5, 3, 4), (4, 6, 4, 5), (5, 7, 5, 7), (6, 8, 4, 9), (7, 9, 7, 4), (3, 7, 8, 6), (6, 4, 8, 5), (3, 6, 5, 4), (7, 7, 2, 5)]
fruit_names = ["Apples", "Oranges", "Pears", "Bananas"]

rows = []
for _ in range(500):
    city = random.choice(cities)
    city_index = cities.index(city)
    age = age_of_store[city_index]
    fp = female_prop[city_index]

    # Store-level covariates, drawn once per simulated store
    rain = max(0, np.random.normal(rain_averages[city_index], 40))
    employees = int(max(1, np.random.normal(employee_averages[city_index] + 3, 3)))

    fruit_batch = fruit_ratios[city_index]
    for j, fruit in enumerate(fruit_names):
        total_sales = round(max(0, np.random.normal(fruit_batch[j] * employees, 5) + np.random.normal(age, 5) * np.sqrt(rain)), 2)
        female_sales = round(np.clip(np.random.normal(fp, 0.03), 0, 1) * total_sales, 2)

        rows.append({
            "city": city,
            "age": age,
            "rain": rain,
            "employees": employees,
            "item": fruit,
            "total_sales": total_sales,
            "female_sales": female_sales
        })

# Build the DataFrame once instead of concatenating inside the loop
df = DataFrame(rows)
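
A note on the simulated values: as far as I understand, the binomial family treats `p(successes, trials)` as whole-number counts with successes no larger than trials, whereas `female_sales` and `total_sales` above are floats rounded to two decimals. A minimal sketch of one way to convert them, assuming rounding to the nearest unit is acceptable (the helper `to_counts` is hypothetical and not part of Bambi):

```python
def to_counts(female_sales, total_sales):
    """Convert a (female_sales, total_sales) pair of floats to integer counts.

    Hypothetical helper: rounds both values to the nearest whole unit,
    floors them at zero, and caps successes at the number of trials.
    """
    trials = max(0, int(round(total_sales)))
    successes = min(trials, max(0, int(round(female_sales))))
    return successes, trials


successes, trials = to_counts(123.45, 250.6)
```

Applied row by row, this would give two integer columns that could stand in for `female_sales` and `total_sales` in the formula.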

  1. Supposing I know that in Liverpool the proportion of oranges bought by women is usually around 0.6, how could I incorporate this into the model?
  2. If we don’t have training data for a specific city that we later want a prediction for, I get the error “The levels xx in ‘city’ are not present in the original data set.” That is understandable, but is there a way to generalize to unseen cities?
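
On question 1, one detail that seems relevant: Bambi's binomial family uses a logit link by default, so the priors on the intercept and group effects live on the log-odds scale rather than the 0 to 1 scale. A small sketch of the conversion using only the standard library (the function names are mine, for illustration); a proportion of 0.6 is roughly 0.41 in log-odds:

```python
import math


def logit(p):
    """Map a proportion in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map log-odds back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


prior_mean = logit(0.6)            # about 0.405 on the log-odds scale
round_trip = inv_logit(prior_mean)  # back to roughly 0.6
```

This only converts the scale. The Liverpool/oranges proportion is the sum of several terms in the formula above (intercept, item effect, city effects), so it does not map onto a single prior directly.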

I hope this is a step in the right direction (albeit a baby one) and I look forward to your reply. Thanks
