Sampling Bernoulli + Exponential cannot reproduce original distribution

tadas-subonis · December 9, 2020, 11:06am

I have these variables

expense_chance = np.random.binomial(1, 0.1, size=1000)
expense_size = np.random.exponential(200, size=1000)
expenses = expense_size * expense_chance
df_data = pd.DataFrame({'expenses': expenses})

Then

and the final distribution of the data looks like this:

I have created the following model which should? reproduce the original distribution after sampling:

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as testing_model2:
    zero = pm.Deterministic('zero',  tt.constant(0))

    p_expense = pm.Beta('p_expense', alpha=2.0, beta=2.0)
    # p_expense = pm.Uniform('p_expense')
    expense_fact = pm.Bernoulli(
        'expense_fact',
        p=p_expense,
    )

    lam_expense = pm.Exponential('lam_expense', lam=1.0, )
    expense = pm.Exponential(
        'expense',
        lam=lam_expense,
    )

    # switch = tt.stack([no_transaction, expense])
    final_expense = pm.Normal(
        'final_expense',
        mu=pm.math.switch(expense_fact, expense, zero),
        sigma=1,
        observed=df_data['expenses'],
        shape=(1,)
    )

    final_expense_sim = pm.Normal(
        'final_expense_sim',
        mu=pm.math.switch(expense_fact, expense, zero),
        sigma=1,
        shape=(1,)
    )

    steps = [
        pm.Metropolis([
            lam_expense,
            final_expense_sim,
            final_expense,
            p_expense,
            expense,
        ]),
    ]
    trace = pm.sample(
        5000,
        step=steps,
        # cores=1,
        chains=4,
        # init='adapt_diag',
        # tune=2000,
    )

The mean for p_expense doesn’t match but the mean for the expense itself does:

Also, the final distribution isn’t matching:

The question is 1), what I am doing wrong? I would like to get the final distribution to be close to the one I’ve generated initially?

I could try doing this:

expense_fact = pm.Bernoulli(
    'expense_fact',
    p=p_expense,
    observed=(df_data['expenses'] > 0).astype(int),
    shape=(1,)
)

The p_expense probability becomes correct.

But I do not think this should be necessary. IMO this should basically be a mixture of two distros: no expense and some expense.

2) If I do the above, final_expense_sim fails with:
TypeError: For compute_test_value, one input test value does not have the requested type.

The error when converting the test value to that variable type:
Non-unit value on shape on a broadcastable dimension.
(1000,)
(True,)

Why does this happen?

3) If do not specify sampler to be Metropolis, the entire Jupyter crashes, what could be the problem?

The notebook is here https://colab.research.google.com/drive/1JuD7Gd2stqovb5-NQh_IzuHdZUH9pkP0?usp=sharing

cluhmann · December 10, 2020, 3:54pm

It’s not entirely clear to me why you have done what you have done, but I suspect that your inability to recover the parameter values you plugged into the data generating process have to do with the fact that your model does not resemble the data generating process. You are modeling the expenses as a mixture of two normally distributed variables when they are actually a mixture of a spike at 0 and an exponentially distributed set of non-zero expenses from (roughly) zero to 100. Approximating the spike at zero with a normal might be ok (though I wouldn’t recommend it), but the exponentially distributed non-zero expenses will be poorly captured by a normal distribution (particularly one in which sigma=1).

tadas-subonis · December 10, 2020, 4:06pm

I’ve used that Gaussian as a proxy to:

add some observational noise
with it I could specify observed=data values - not sure how to do that on the switch directly

was that a wrong approach?

cluhmann · December 10, 2020, 4:56pm

The exponential in this line

expense = pm.Exponential(
        'expense',
        lam=lam_expense,
    )

is noise. Otherwise, all expenses would be either zero or exactly 1/\lambda.

Right now, you are specifying that the non-zero expenses are normally distributed (which they are not). You are further specifying that you are uncertain where the mean of that normal distribution is, but specifying that your prior belief about the value of the mean of that normally-distrbuted variable is described as an exponential.

tadas-subonis · December 10, 2020, 6:16pm

OK. Thanks. That makes sense.

However, how do you suggest to combine 0 or exponential in a mixture that’s observed. Should I use pm.Mixture() here?

cluhmann · December 10, 2020, 9:40pm

I might suggest something like what I have below, in which you tackle the 2 components separately. First, you figure out what proportion of expenses are non-zero. Second, you figure out how to describe the non-zero expenses. I don’t know how much faith you have that this particular set of assumptions is correct, but it seems reasonably consistent with your earlier discussion.

The zero portion of the mixture is a slightly awkward because it seems like you are (absolutely) confident that expenses of zero are exactly zero. This precision/certainty isn’t entirely natural to deal with in a Bayesian approach (in which everything is assumed to be uncertain). But I think the model below captures what you were going for. If not, let me know.

with pm.Model() as model:
    p_expense = pm.Beta("p_expense", alpha=2.0, beta=2.0)

    lam_expense = pm.Exponential("lam_expense", lam=1.0)
    expense_magnitude = pm.Exponential(
        "expense_magnitude",
        lam=lam_expense,
        observed=df_data.loc[~(df_data["expenses"] == 0)]["expenses"],
    )

    expense_fact = pm.Bernoulli(
        "expense_fact", p=p_expense, observed=(df_data["expenses"] > 0).astype(int)
    )

    trace = pm.sample()

tadas-subonis · December 13, 2020, 4:21pm

Yeah. That will work.

I’ve also tried this and it seems to do the job quite well:

    zero = pm.Normal.dist(
        mu=0,
        sigma=1,
    )

    lam_expense = pm.Exponential('lam_expense', lam=1.0, )
    expense = pm.Exponential.dist(
        lam=lam_expense,
    )

    w = pm.Dirichlet('w', a=np.array([1, 1]))
    components = [zero, expense]

    final_expense = pm.Mixture('final_expense', w=w, comp_dists=components, observed=df_data['expenses'])
    final_expense_sim = pm.Mixture('final_expense_sim', w=w, comp_dists=components)

The biggest problem was that I had that mess up with Gaussian output where I had expense as mu instead of just adding Gaussian noise on top of it - thanks for pointing that out.

cluhmann · December 13, 2020, 8:07pm

Yes. Although you should note that your model implies that your zero expenses are only roughly zero and your prior suggests that the “non-zero” expenses can reasonably be expected to be negative (or positive with equal probability). But if that isn’t too far off of your actual application, it might be fine. But if your posterior admits negative “non-zero” expenses, things could get a bit weird in a predictive context. Just something to be aware of.

tadas-subonis · December 14, 2020, 1:48pm

Thanks. I’ll have that in mind. I could replace that with some positive-only distribution. Probably even Beta would work here.

Topic		Replies	Views
Mixture with custom distribution class Questions	1	596	October 25, 2018
Sampling many dimensions, with different PDF in each dimension version agnostic modeling	13	685	March 30, 2023
Multiplicative Factor Model (degradations) version agnostic	10	267	February 21, 2024
Bernoulli samples all the same? Questions	1	2440	May 29, 2018
Mixture of multivariate Bernoullis Questions	11	3270	June 13, 2022

Sampling Bernoulli + Exponential cannot reproduce original distribution

Related topics