Hi all!
This is my first time using PyMC and my first question on this site. As a bit of background, I completed my master’s in stats about 5 years ago and haven’t really touched it since (so bear with me as I’m rusty!), until now…
A problem has arisen where I’ve been tasked with making predictions from a very limited set of data. However, we have people on the team with lots of domain experience in the field, so I’ve seen an opportunity to dust off my Bayesian statistics and use some priors!
Here is the problem (reworded in a fun way):
A fruit seller has a selection of produce, including oranges, apples and pears. She wants to model the proportion of fruit that she sells to women in the different cities her stores are located in. We have records of the following data:
- City
- Age of store (in days)
- Number of employees working
- Volume of rain that day
- Total amount spent on item that day
- Total amount spent on item that day by women
Each record holds this data for one item. The priors would then be based on the city and item: for example, if we knew that women in Liverpool bought the majority of the bananas at the store, we would choose a Beta prior with a high mean (say 0.9), with the alpha and beta values chosen to reflect our own confidence in that belief. (I understand this will lead to a large number of priors as the number of cities and items grows.)
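To make the "mean plus confidence" idea concrete, here is a minimal sketch of one common way to turn a prior mean and a confidence level into alpha and beta. The function names and the `strength` parameter (a pseudo-observation count, equal to alpha + beta) are my own illustrative choices, not anything from PyMC, and the numbers are made up.

```python
def beta_from_mean_strength(mean, strength):
    """Convert a prior mean and a 'strength' (alpha + beta) into Beta parameters.

    A larger strength means a tighter prior around the same mean.
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    if strength <= 0:
        raise ValueError("strength must be positive")
    alpha = mean * strength
    beta = (1.0 - mean) * strength
    return alpha, beta


def beta_sd(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    return (alpha * beta / (total ** 2 * (total + 1.0))) ** 0.5


# Hypothetical belief: women in Liverpool buy about 90% of the bananas.
weak = beta_from_mean_strength(0.9, 5)     # we are not very sure
strong = beta_from_mean_strength(0.9, 50)  # we are fairly sure

for label, (a, b) in [("weak", weak), ("strong", strong)]:
    print(f"{label}: alpha={a:.2f}, beta={b:.2f}, sd={beta_sd(a, b):.3f}")
```

Both priors have the same mean of 0.9; only the spread changes, so the domain experts would only need to supply a best guess and a rough measure of how sure they are.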
Reminder:
We want to model the proportion of the amount spent on each food item by women in each city.
I have thought about using a Binomial or Bernoulli distribution with a beta prior and I have also thought about using some sort of logistic regression. I was wondering if anyone could answer the following:
- What’s the best method or model to allow us to place priors on each item for each city?
- Will a city with no data just return a draw from the prior for that city? And what about an unknown city?
- How do we predict on a new observation? If someone told us the city, item, amount of rain etc., could we predict the proportion of sales to women for that item?
(I’m sure I’ll have more, so I will update when I think of them!)
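To show what I mean by the Beta prior idea and the "no data" question, here is a rough sketch of the textbook conjugate Beta-Binomial update in plain Python. It is a simplification and not a proposed solution: it treats sales as unit counts (our real data are amounts spent), it ignores the covariates, and every prior, count and the fallback `DEFAULT_PRIOR` are made-up placeholders.

```python
# Hypothetical priors, keyed by (city, item), stored as Beta (alpha, beta) pairs.
priors = {
    ("Liverpool", "Bananas"): (18.0, 2.0),  # prior mean 0.9
    ("London", "Apples"): (5.0, 5.0),       # prior mean 0.5
}

# Hypothetical fallback for a city/item pair nobody has an opinion about.
DEFAULT_PRIOR = (1.0, 1.0)  # flat Beta(1, 1)

# Made-up observations: (units bought by women, total units sold).
observed = {
    ("Liverpool", "Bananas"): (30, 40),
    # ("London", "Apples") is deliberately missing: no data for this pair.
}


def posterior(key):
    """Return the Beta posterior (alpha, beta) for a (city, item) key."""
    alpha, beta = priors.get(key, DEFAULT_PRIOR)
    women, total = observed.get(key, (0, 0))
    # Conjugate update: add successes to alpha and failures to beta.
    return alpha + women, beta + (total - women)


for key in [("Liverpool", "Bananas"), ("London", "Apples"), ("Leeds", "Pears")]:
    a, b = posterior(key)
    print(key, f"Beta({a:.1f}, {b:.1f}), mean = {a / (a + b):.3f}")
```

In this simple independent setup, a pair with no data has a posterior identical to its prior, and an unknown city only gets whatever fallback prior we choose to give it. I assume a hierarchical model would behave differently, because cities would share information, which is partly why I am asking.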
I’ve put together an example of the data, so if anyone wants to jump in and see if they can help, you’re more than welcome! (Hopefully it gives you an idea of what I’m trying to achieve.)
from pandas import DataFrame
import random
import numpy as np

# Store / city level
cities = ["Liverpool", "London", "Manchester", "Birmingham", "Leeds", "Glasgow", "Sheffield", "Edinburgh", "Bristol", "Cardiff"]
female_prop = [0.4, 0.55, 0.6, 0.5, 0.45, 0.43, 0.7, 0.6, 0.4, 0.5]
age_of_store = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# Averages to simulate from
rain_averages = [1000, 900, 800, 700, 600, 500, 400, 300, 200, 100]
employee_averages = [2, 4, 3, 5, 3, 4, 5, 6, 3, 6]

# Values to simulate sales from (for index 0 - apples would be the most popular purchase and pears the least)
fruit_ratios = [(7, 4, 2, 5), (8, 5, 3, 4), (4, 6, 4, 5), (5, 7, 5, 7), (6, 8, 4, 9), (7, 9, 7, 4), (3, 7, 8, 6), (6, 4, 8, 5), (3, 6, 5, 4), (7, 7, 2, 5)]
fruit_names = ["Apples", "Oranges", "Pears", "Bananas"]

# Collect one dict per (store-day, fruit) and build the DataFrame once at the end
rows = []
for i in range(500):
    # Pick a city and look up its store-level values
    city = random.choice(cities)
    city_index = cities.index(city)
    age = age_of_store[city_index]
    fp = female_prop[city_index]

    # Simulate the day-level covariates (rain cannot be negative, at least 1 employee)
    rain = max([0, np.random.normal(rain_averages[city_index], 40)])
    employees = int(max([1, np.random.normal(employee_averages[city_index] + 3, 3)]))
    fruit_batch = fruit_ratios[city_index]

    for j, fruit in enumerate(fruit_names):
        # Total sales scale with the number of employees; the female share is noisy around fp
        total_sales = round(max([0, np.random.normal(fruit_batch[j] * employees, 5)]), 2)
        female_sales = round(np.clip(np.random.normal(fp, 0.03), 0, 1) * total_sales, 2)
        rows.append({
            "city": city,
            "age": age,
            "rain": rain,
            "employees": employees,
            "item": fruit,
            "total_sales": total_sales,
            "female_sales": female_sales,
        })

df = DataFrame(rows)
df
This data vaguely mimics the data that we’re actually working with.
The final aim would be to pass in some new data (in the format: city, number of employees, amount of rain…) and get posterior distributions/predictions of the proportion of sales to women for each of the fruits.
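To illustrate the kind of prediction step I have in mind for the logistic regression idea, here is a small sketch. It assumes we already have posterior draws of an intercept for one city/item pair and of two slopes. The draws below are random placeholders generated with `random.gauss`, not output from any fitted model, and the names `intercept`, `b_rain` and `b_employees` are hypothetical.

```python
import math
import random


def inv_logit(x):
    """Map a value on the log-odds scale to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


random.seed(0)

# Placeholder "posterior draws" for one (city, item) pair.
# In practice these would come from the fitted model, not from random.gauss.
n_draws = 1000
draws = [
    {
        "intercept": random.gauss(0.4, 0.1),
        "b_rain": random.gauss(-0.0005, 0.0001),
        "b_employees": random.gauss(0.02, 0.01),
    }
    for _ in range(n_draws)
]

# A new observation for that city and item (made-up values).
new_obs = {"rain": 800.0, "employees": 4}

# One predicted proportion per posterior draw, sorted for quantiles.
predicted = sorted(
    inv_logit(
        d["intercept"]
        + d["b_rain"] * new_obs["rain"]
        + d["b_employees"] * new_obs["employees"]
    )
    for d in draws
)

mean = sum(predicted) / n_draws
lower = predicted[int(0.025 * n_draws)]
upper = predicted[int(0.975 * n_draws) - 1]
print(f"predicted proportion: mean={mean:.3f}, 95% interval=({lower:.3f}, {upper:.3f})")
```

So the output I am after is a whole distribution of proportions for the new observation (summarised here by a mean and an interval), rather than a single number.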
Hopefully this has made sense and there’s someone out there who’s able to help!
Thanks