People, hey!
I wrote a model to describe the difference in ARPU (average revenue per user) between two countries.
The data is stored in a pandas DataFrame called data,
where every row is one user, with these columns:
- player_id – unique user id
- country_code – US or JP
- revenue_7 – cumulative revenue up to the 7th day of the user’s life (~95% of users are non-payers, so their value is 0)
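In case it helps to reproduce the setup, here is a minimal sketch of synthetic data with the same layout. The sample size, payer shares, and gamma parameters are made-up assumptions for illustration, not values from my real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
n_users = 10_000                # hypothetical sample size

# Hypothetical payer shares and gamma parameters (assumptions, not estimates)
payer_share = {'US': 0.05, 'JP': 0.04}
gamma_shape, gamma_scale = 2.0, 10.0

# One row per user: country, payer flag, then revenue (0 for non-payers)
country_code = rng.choice(['US', 'JP'], size=n_users)
p_pay = pd.Series(country_code).map(payer_share).to_numpy()
is_payer = rng.random(n_users) < p_pay
revenue_7 = np.where(is_payer, rng.gamma(gamma_shape, gamma_scale, size=n_users), 0.0)

data = pd.DataFrame({
    'player_id': np.arange(n_users),
    'country_code': country_code,
    'revenue_7': revenue_7,
})
```

With these settings roughly 95% of the rows have revenue_7 equal to 0, which gives the same spike at zero as in my real data.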
Model:
import numpy as np
import pandas as pd
import pymc as pm

country = np.array(['US', 'JP'])
# Integer code per user: 0 = US, 1 = JP
country_idx = pd.Categorical(data['country_code'], categories=country).codes
coords = {'country': country, 'country_flat': country[country_idx]}
with pm.Model(coords=coords) as model:
    psi = pm.Beta('psi', alpha=1, beta=1, dims='country')  # share of payers
    mu = pm.HalfNormal('mu', sigma=10, dims='country')  # mean revenue of a payer
    sigma = pm.HalfNormal('sigma', sigma=15, dims='country')
    y = pm.HurdleGamma('y', mu=mu[country_idx], sigma=sigma[country_idx], psi=psi[country_idx], observed=data['revenue_7'])
    # ARPU per country and its difference (US minus JP)
    revenue = pm.Deterministic('revenue', psi * mu, dims='country')
    diff = pm.Deterministic('diff', revenue[0] - revenue[1])
    idata = pm.sample()
    idata.extend(pm.sample_posterior_predictive(idata))
Could you please help me with the following questions:
- How can I plot posterior predictive checks separately for each dimension (the 2 countries) using az.plot_ppc?
- How can I ignore zero values in the visualization? This is needed because psi is less than 5%, so checking the revenue spread is impossible: only the spike at x = 0 is visible and I can't see anything else. I want to look at the KDE for payers only, i.e. the revenue tail (hope you understand). Or can you suggest a better model setup that makes the inference data better structured for post-analysis?
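To make the second question concrete, this is roughly the payers-only view I am after, done by hand with NumPy rather than through az.plot_ppc. The function name payers_only_by_country and the toy arrays are made up for illustration. In my case I assume ppc would be idata.posterior_predictive['y'].values with shape (chain, draw, user) and obs would be data['revenue_7'].to_numpy().

```python
import numpy as np

def payers_only_by_country(obs, ppc, country_idx, n_countries):
    """Split observed and predicted revenue by country and drop the zeros.

    obs: shape (user,); ppc: shape (chain, draw, user);
    country_idx: integer country code per user.
    """
    out = {}
    for k in range(n_countries):
        mask = country_idx == k
        obs_k = obs[mask]
        ppc_k = ppc[..., mask].ravel()  # pools chains, draws and users together
        out[k] = (obs_k[obs_k > 0], ppc_k[ppc_k > 0])
    return out

# Toy stand-ins for the real arrays (hypothetical values)
obs = np.array([0.0, 5.0, 0.0, 12.0])
country_idx = np.array([0, 0, 1, 1])
ppc = np.array([[[0.0, 4.0, 0.0, 0.0],
                 [3.0, 0.0, 9.0, 0.0]]])  # 1 chain, 2 draws, 4 users

split = payers_only_by_country(obs, ppc, country_idx, n_countries=2)
```

Each pair in split could then go into a histogram or KDE per country. The downside is that pooling all draws loses the per-draw curves that az.plot_ppc shows, so I would still prefer a way to do this inside ArviZ.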