I have these variables
expense_chance = np.random.binomial(1, 0.1, size=1000)
expense_size = np.random.exponential(200, size=1000)
expenses = expense_size * expense_chance
df_data = pd.DataFrame({'expenses': expenses})
Then
and the final distribution of the data looks like this:
I have created the following model which should? reproduce the original distribution after sampling:
import pymc3 as pm
import theano.tensor as tt
with pm.Model() as testing_model2:
zero = pm.Deterministic('zero', tt.constant(0))
p_expense = pm.Beta('p_expense', alpha=2.0, beta=2.0)
# p_expense = pm.Uniform('p_expense')
expense_fact = pm.Bernoulli(
'expense_fact',
p=p_expense,
)
lam_expense = pm.Exponential('lam_expense', lam=1.0, )
expense = pm.Exponential(
'expense',
lam=lam_expense,
)
# switch = tt.stack([no_transaction, expense])
final_expense = pm.Normal(
'final_expense',
mu=pm.math.switch(expense_fact, expense, zero),
sigma=1,
observed=df_data['expenses'],
shape=(1,)
)
final_expense_sim = pm.Normal(
'final_expense_sim',
mu=pm.math.switch(expense_fact, expense, zero),
sigma=1,
shape=(1,)
)
steps = [
pm.Metropolis([
lam_expense,
final_expense_sim,
final_expense,
p_expense,
expense,
]),
]
trace = pm.sample(
5000,
step=steps,
# cores=1,
chains=4,
# init='adapt_diag',
# tune=2000,
)
The mean for p_expense doesn’t match but the mean for the expense itself does:
Also, the final distribution isn’t matching:
The question is 1), what I am doing wrong? I would like to get the final distribution to be close to the one I’ve generated initially?
I could try doing this:
expense_fact = pm.Bernoulli(
'expense_fact',
p=p_expense,
observed=(df_data['expenses'] > 0).astype(int),
shape=(1,)
)
The p_expense probability becomes correct.
But I do not think this should be necessary. IMO this should basically be a mixture of two distros: no expense and some expense.
2) If I do the above, final_expense_sim fails with:
TypeError: For compute_test_value, one input test value does not have the requested type.
The error when converting the test value to that variable type:
Non-unit value on shape on a broadcastable dimension.
(1000,)
(True,)
Why does this happen?
3) If do not specify sampler to be Metropolis, the entire Jupyter crashes, what could be the problem?
The notebook is here https://colab.research.google.com/drive/1JuD7Gd2stqovb5-NQh_IzuHdZUH9pkP0?usp=sharing