Welcome!
Your modeling problem seems like a good candidate for a hierarchical model. You can see examples of how they are implemented here (this one is extremely thorough) and here. I strongly suggest having a look at these, because they go over everything you need to estimate a model like the one you describe. This answers your first question, and using a hierarchical model also suggests an answer to your second question: when you don’t have data for a city or fruit, your prediction becomes the mean of the distribution from which all the city or fruit effects are drawn.
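To make that last point concrete, here is a minimal sketch in plain NumPy. The group-level parameters `mu_city` and `sigma_city` are hypothetical values for illustration, not estimates from your data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical group-level parameters (illustrative, not estimated from data):
mu_city, sigma_city = 0.3, 0.1

# In a hierarchical model, each observed city's effect is a draw from this
# shared distribution. For a city with no data, the model falls back on the
# distribution itself, so the point prediction is its mean:
unseen_city_draws = rng.normal(mu_city, sigma_city, size=100_000)
prediction_for_unseen_city = unseen_city_draws.mean()  # close to mu_city
```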
As for choosing priors and a likelihood, you can reach for either a beta or a binomial; both are appropriate for modeling rates. Note that a regression with a binomial likelihood function is logistic regression!
One thing I don’t like about these two likelihood functions is that they imply a relationship between the mean and the variance of the likelihood, which may or may not be appropriate. I have had success modeling rates with the Logistic Normal distribution, which has distinct location and scale parameters.
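A Logistic-Normal (logit-normal) rate is just the inverse logit of a Normal draw, which is what makes the location and scale separately tunable. A quick sketch (the values of `mu` and `sigma` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Distinct location and scale parameters, so the variance of the rate is not
# tied to its mean the way it is for a Binomial:
mu, sigma = 0.0, 1.0
z = rng.normal(mu, sigma, size=10_000)
rates = 1 / (1 + np.exp(-z))  # logit-normal draws, always in (0, 1)
```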
Whatever likelihood function you choose, I would not suggest trying to model the mean using priors that are constrained between 0 and 1. Instead, use whatever priors you like, then pass the result through a link function that constrains it to lie between 0 and 1. This seems like the same thing, but it’s not: your model will be much easier to sample if the priors are first allowed to take any value on the real line. The link function you want is the inverse logistic. You can read about this type of modeling under the name “GLM”, for Generalized Linear Models.
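Here is what that looks like in miniature, using `scipy.special.expit` as the inverse logistic (the prior scale of 2.0 is just an example):

```python
import numpy as np
from scipy.special import expit  # the inverse logistic link

rng = np.random.default_rng(0)

# Priors on the linear scale can live anywhere on the real line...
theta = rng.normal(0.0, 2.0, size=1_000)

# ...and the link function maps them into (0, 1), where a rate must live:
rate = expit(theta)
```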
There’s a very powerful package called Bambi, built on top of PyMC, that can handle this class of models for you. There are examples of hierarchical models in Bambi, along with beta and logistic regression, here.
Prediction with GLMs is extremely easy, especially if you use Bambi. There’s a .predict method built into Bambi models that lets you show the model new data and get estimates for sales. If you are feeling adventurous and want to do it yourself, it’s just a single matrix multiplication: \hat y = f(X \beta), where f is the link function, \beta is the vector of estimated parameters, and X is a matrix of data you want to predict from.
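For the do-it-yourself route, the whole thing is one line. The coefficients and design matrix below are made up for illustration:

```python
import numpy as np
from scipy.special import expit  # the inverse logistic link, f

rng = np.random.default_rng(0)

# Hypothetical fitted coefficients: intercept plus two predictors
beta = np.array([0.2, -1.0, 0.7])

# New data to predict from: a column of ones for the intercept, two features
X_new = np.column_stack([np.ones(4), rng.normal(size=(4, 2))])

# Prediction is a single matrix multiplication passed through the link:
y_hat = expit(X_new @ beta)
```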
It will be easier to answer more specific questions once you’ve made some modeling choices and you have some model code to discuss. I’d also think carefully about your data generating process. The code you provided doesn’t produce sales values that depend on rain or age, for example, although this might have been your intention. A simple linear model, good old fashioned OLS, performs pretty much perfectly on your test data:
import numpy as np
import pandas as pd
import statsmodels.api as sm

# One-hot encode city and item, dropping one level of each to avoid collinearity
X = pd.get_dummies(df, columns=['city', 'item'], drop_first=True).drop(columns=['female_sales'])
y = df['female_sales']

mod = sm.OLS(y, X.assign(constant=1))
res = mod.fit(cov_type='HC0')

predict_df = pd.DataFrame(index=pd.MultiIndex.from_product([cities, fruit_names]),
                          columns=['predicted', 'true'])
for idx in predict_df.index:
    city, item = idx
    city_mask = df.city == city
    fruit_mask = df.item == item
    city_df = X[city_mask & fruit_mask].copy()
    y_hat = res.predict(city_df.assign(constant=1).mean())
    y_true = y[city_mask & fruit_mask].mean()
    predict_df.loc[idx, :] = np.array([y_hat[0], y_true])
Predictions vs true proportions for each fruit-city pair:
If your real data look like this, I would consider not worrying about a fancy likelihood function: just start with a hierarchical model with a normal likelihood, then get more complex from there. If your data are well behaved, a normal likelihood will fit well enough that the boundary constraints aren’t binding, and you can ignore them. That is the case in the generated data.
