Fitting a mixture of binomials

OK, so that makes sense. Is your DGP1 bimodal as well? It should be, I think.

Backing up 3 posts, where you said

There is a peak at approximately 0.025. But the model can’t place a peak there, because the model is a mixture of 0.0001 and 0.05. The data is a mixture of zeros (when simulated_data["sample_type"] == 0) and a blend of 0.0001 and 0.05 (when simulated_data["sample_type"] == 1).

It’s not zeros when sample_type is 0; it’s observed_count drawn from Bin(n, background_rate) when sample_type is 0. So those samples should fully identify/resolve p0.

When sample_type is 1, with the fraction fixed at 0.5, observed_count = 0.5 * Bin(n, p1) + background counts. The background count is fully resolved from the samples above, so p1 is fully identifiable/resolvable.

And while the peak is at 0.025, since it comes from 0.5 * Bin(n, p1), p1 itself should be 0.05.
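A quick back-of-envelope check of that arithmetic (my numbers, using the simulation settings from the code further down):

# N ≈ 10_000 flips, p1 = 0.05, fraction = 0.5, background_rate = 1e-4
signal_part = 0.5 * 10_000 * 0.05  # ≈ 250 expected signal successes
background_part = 10_000 * 1e-4    # ≈ 1 expected background success
print((signal_part + background_part) / 10_000)  # ≈ 0.025, so p1 itself is ~0.05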

Is this clear/do you disagree?

It is not, as seen in the plot I posted, and if you’re not sure why, I would sit with that code until it’s clearer why. I feel like that’s a critical piece missing from this.

It’s not because sample_type isn’t there, I see. I think that neither scenario you are presenting is the true scenario.

There should be DGP1 plus samples drawn purely from background_rate; see below where I added DGP3:

# %%
import numpy as np
import pandas as pd
import scipy.stats as ss
import matplotlib.pyplot as plt

# %%
background_rate = 1e-4
signal_rate = 5e-2

num_obs = 500
simulated_data = pd.DataFrame(
    data=[ss.norm.rvs(loc=10_000, scale=1_500, size=num_obs)], index=["N"]
).T

# number of flips per observation
simulated_data["N"] = simulated_data["N"].astype(int)

# mixing fraction
simulated_data["fraction"] = 0.5

# counts of "background" successes
simulated_data["background_counts"] = ss.binom.rvs(
    n=simulated_data["N"], p=background_rate
)

# counts of "signal" successes
simulated_data["signal_counts"] = ss.binom.rvs(n=simulated_data["N"], p=signal_rate)

# %%
# DGP #1
# for each observation, mix the signal and background successes according to the mixing fraction
simulated_data["observed_counts_DGP1"] = (
    (simulated_data["signal_counts"] * simulated_data["fraction"])
    + (simulated_data["background_counts"] * (1 - simulated_data["fraction"]))
).astype(int)

# %%
# DGP #2
# for each observation, select "background" or "signal" successes according to the mixing fraction
switch = ss.binom.rvs(n=1, p=simulated_data["fraction"])
simulated_data["observed_counts_DGP2"] = np.where(
    switch, simulated_data["signal_counts"], simulated_data["background_counts"]
)

# %%
# DGP #3
# for each observation in class 1, mix the signal and background successes according to the mixing fraction
# for each observation in class 2, only use background successes
simulated_data["observed_counts_DGP3"] = (
    (simulated_data["signal_counts"] * simulated_data["fraction"])
    + (simulated_data["background_counts"] * (1 - simulated_data["fraction"]))
).astype(int)

# note: .loc slicing is label-inclusive, so :249 assigns the first 250 observations
simulated_data.loc[:249, "observed_counts_DGP3"] = simulated_data.loc[:249, "background_counts"]


# %%
plt.hist(simulated_data["observed_counts_DGP1"], color="r", alpha=0.5)
plt.hist(simulated_data["observed_counts_DGP2"], color="g", alpha=0.5)
plt.hist(simulated_data["observed_counts_DGP3"], color="b", alpha=0.5)
plt.xlabel("# of successes per observation")
plt.ylabel("# observations")
plt.legend(["DGP1", "DGP2", "DGP3"])
plt.show()

That’s fine, but your model still reflects DGP 2, so it’s still misspecified and thus not doing what you want it to.

Why is my model DGP2 and not DGP3?

DGP2 is DGP3 but without the 0.5 mixing term for the sample_type == 1 observations.
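To spell out the difference in my own notation (a paraphrase, not code from the thread):

# DGP2: each observation is drawn entirely from one component ("switching"):
#   y_i ~ Binomial(n_i, p0)   with probability 1 - w
#   y_i ~ Binomial(n_i, p1)   with probability w
#
# DGP3: class-2 observations are pure background, and class-1 observations
# are a within-observation 50-50 blend, i.e. approximately
#   y_i ~ Binomial(n_i, p0)                    for class-2 observations
#   y_i ≈ Binomial(n_i, 0.5 * p0 + 0.5 * p1)   for class-1 observations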

Maybe the best formulation of my question is: Could you please help me code the right model for DGP3, if the issue is that I am not coding it right.

Snippet:

import pymc as pm
import pytensor.tensor as pt

with pm.Model(coords={"observation": scores.index.values}) as model:
    n = pm.ConstantData("n", value=scores["N"], dims="observation")
    mix_frac = pm.ConstantData("mix_frac", value=scores["fraction"], dims="observation")

    p_type0 = pm.Uniform("p_type0")
    p_type1 = pm.Uniform("p_type1")
    observed = scores["observed_counts"]

    # two binomial components sharing the same n
    component_t0 = pm.Binomial.dist(n=n, p=p_type0)
    component_t1 = pm.Binomial.dist(n=n, p=p_type1)

    # per-observation weights: [P(type 0), P(type 1)]
    observed_counts = pm.Mixture(
        "counts",
        w=pt.stack([1 - mix_frac, mix_frac], axis=1),
        comp_dists=[component_t0, component_t1],
        observed=observed,
    )

    idata = pm.sample()

I agree that the model is mis-specified somehow. If I set p0 and just try to infer p1, it still cannot estimate it properly. I assume I am not mixing them properly in pm.Mixture?

With p0 set to 1e-4 and mixing_fraction fixed, the only free parameter is p1, and that should be fittable from the data. Instead, p1 is getting fit to roughly 0.3.

%%capture --no-display
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import plotly.express as px
import pymc as pm
import pytensor.tensor as pt
import seaborn as sns

scores = simulated_data.reset_index(drop=True)
coords = {"observation": scores.index.values}
for mode in ["fit","predict"]:
    model = pm.Model(coords=coords)
    with model:

        n = pm.ConstantData("n",value=scores["N"],dims="observation")
        mix_frac = pm.ConstantData("mix_frac",value=scores["fraction"],dims="observation")

        if mode == "fit":
            p_type0 = 1e-4
            p_type1 = pm.Uniform("p_type1")
            observed = scores["observed_counts"]
            dims = "observation"

        else:
            p_type0 = 1e-4
            p_type1 = params["p_type1"]
            observed = None
            dims = None

        component_t0 = pm.Binomial.dist(
                       n = n,
                       p = p_type0, 
                    )
        
        component_t1 = pm.Binomial.dist(
                       n = n,
                       p = p_type1, 
                    )
        
        observed_counts = pm.Mixture(
            "counts",
            w = pt.stack([1-mix_frac,mix_frac],axis=1),
            comp_dists = [
                component_t0,
                component_t1
            ],
            observed = observed,
        )

        idata = pm.sample()
        pm.sample_posterior_predictive(idata, extend_inferencedata=True)

        params = save_model_params(idata)
        display(params)

    if "fit" in mode:
        display(az.plot_ppc(idata))
    else:
        # posterior-mean predicted counts; +1 keeps zeros visible on the log axes below
        predicted_counts = list(
            idata.posterior["counts"].mean(dim=("chain", "draw")).astype(int).values + 1
        )
        observed_counts = list(scores["observed_counts"])
        tf = list(scores["fraction"])

        df = scores.copy()
        df["observed"] = observed_counts
        df["predicted"] = predicted_counts
        df["sample_type"] = df["sample_type"].astype(str)
        df["obs-pred_sign"] = (((df["observed"] - df["predicted"]))>0).apply(lambda x: 1 if x else -1)
        df["obs-pred_value"] = ((df["observed"] - df["predicted"]).abs()+1).apply(np.log10)
        df["obs-pred"] = df["obs-pred_sign"] * df["obs-pred_value"]

        fig = px.scatter(
            data_frame = df,
            x = "observed",
            y = "predicted",
            color = "sample_type",
            width = 750,
            height = 750,
            log_x=True,
            log_y=True,
        )

        fig.add_shape(type="line",line=dict(dash="dash",color="black"),x0=1,y0=1,x1=10_000,y1=10_000)
        display(fig)


        fig = px.scatter(
            data_frame = df,
            x = "observed",
            y = "fraction",
            color = "sample_type",
            width = 750,
            height = 750,
            log_x=True,
            log_y=True,
        )

        display(fig)

        plt.figure(figsize=(7, 5))
        # sns.distplot is deprecated in recent seaborn; histplot is the replacement
        display(sns.histplot(df["obs-pred"], kde=True))

        display(df["obs-pred"].abs().describe(percentiles=[i/100 for i in range(0,100,10)]))

One easy way to check that your model is what you intended is to use it to generate synthetic data.

@tcapretto has a nice blog post on this approach: Data simulation with PyMC – Tomi Capretto
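For example, a minimal sketch of that idea applied to the mixture model from the snippet above (my code, with the true rates plugged in): forward-sample it with no observed data and compare the draws against the DGP3 counts; the mismatch makes the misspecification visible.

import pymc as pm
import pytensor.tensor as pt

with pm.Model() as gen_model:
    n = pm.ConstantData("n", value=simulated_data["N"].values)
    mix_frac = simulated_data["fraction"].values

    component_t0 = pm.Binomial.dist(n=n, p=1e-4)  # background
    component_t1 = pm.Binomial.dist(n=n, p=5e-2)  # signal

    # no observed= here: the model itself generates the data
    pm.Mixture(
        "counts",
        w=pt.stack([1 - mix_frac, mix_frac], axis=1),
        comp_dists=[component_t0, component_t1],
    )
    synthetic = pm.sample_prior_predictive()

# compare synthetic.prior["counts"] to simulated_data["observed_counts_DGP3"]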


Thanks, I will look into this! 🙂

I assume it would be something like this:

import numpy as np
import pymc as pm
import pytensor.tensor as pt

with pm.Model() as model:
    n = pm.Data("n", value=simulated_data["N"])
    observed = simulated_data["observed_counts"]
    mix_frac = pm.Beta("mix_frac", alpha=1, beta=1)
    p_type0 = background_rate
    p_type1 = signal_rate

    # some observations reflect p_type0
    component_t0 = pm.Binomial.dist(
        n=n,
        p=p_type0,
    )
    # some observations reflect a 50-50 blend of p_type0 and p_type1
    component_t1 = pm.Binomial.dist(
        n=n,
        p=np.mean([p_type0, p_type1]),  # = 0.5 * p_type0 + 0.5 * p_type1
    )

    observed_counts = pm.Mixture(
        "observed_counts",
        # mix_frac is a scalar here, so stack along the default axis to get a
        # length-2 weight vector that broadcasts across observations
        w=pt.stack([1 - mix_frac, mix_frac]),
        comp_dists=[component_t0, component_t1],
        observed=observed,
    )
    
    idata = pm.sample()

This assumes that the sample_type == 1 mixture is 50-50, and the model works well when the data mirrors this assumption.
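(Why the blend belongs inside p, as I understand it: the within-observation sum 0.5 * Bin(n, p_type0) + 0.5 * Bin(n, p_type1) has mean n * (0.5 * p_type0 + 0.5 * p_type1), so a single binomial with the blended p matches the blended counts in expectation, though not exactly in variance.)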

If you wish to infer the mixture for the sample_type == 1 observations, you can do something like this:

    other_mix_frac = pm.Beta("other_mix_frac", alpha=1, beta=1)
    component_t1 = pm.Binomial.dist(
        n=n,
        p=other_mix_frac * p_type0 + (1 - other_mix_frac) * p_type1,
    )

This assumes that there is a single other_mix_frac governing all observations, and it works when the data mirrors this assumption (here I have set simulated_data["fraction"] = 0.75).

However, in your original formulation, each observation had its own unique other_mix_frac (what you called simulated_data["fraction"]).

If you want to assume that other_mix_frac (i.e., simulated_data["fraction"]) varies across observations, you can do this:

    other_mix_frac = pm.Beta(
        "other_mix_frac", alpha=1, beta=1, shape=len(simulated_data["N"])
    )
    component_t1 = pm.Binomial.dist(
        n=n,
        p=(other_mix_frac * p_type0 + (1 - other_mix_frac) * p_type1),
        shape=len(simulated_data["N"]),
    )

This again works well when the data mirrors this assumption.


Thanks cluhmann. I hadn’t thought to do the mixture in the p variable itself, though I have done it in other cases. This is very helpful.

My last question is: is there a way to do your second setup with a Mixture instead of within a binomial? The reason I ask is that a mixture is so generic you could change the example to any two types of distributions, whereas this approach is only valid for two binomials. Thinking ahead, at some point these two distributions may expand in my model to something more complex.

Or, is the Mixture object not set up for this and it has to be done this way?

Thanks!

You can build hierarchical mixtures, but it’s not clear to me that it works here because you have continuous parameters generating discrete observations. You might be able to model a binomial mixture of Bernoulli likelihoods, but that would require you to model every single observation individually (rather than N and p, you would just have p, and every observation would be a single “flip”). Maybe someone more clever than I would be able to work out a better solution.
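For what it’s worth, here is a rough sketch of that per-flip idea (my own illustration, with p0 fixed at 1e-4 and a single global weight; note it expands every observation into N rows, so it gets expensive fast):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

# expand each observation's N trials into individual 0/1 outcomes
flips = np.concatenate(
    [
        np.r_[np.ones(k, dtype=int), np.zeros(n - k, dtype=int)]
        for n, k in zip(simulated_data["N"], simulated_data["observed_counts_DGP3"])
    ]
)

with pm.Model() as flip_model:
    w = pm.Beta("w", alpha=1, beta=1)
    p_type1 = pm.Uniform("p_type1")

    # each flip is a mixture of a background and a signal Bernoulli
    pm.Mixture(
        "flips",
        w=pt.stack([1 - w, w]),
        comp_dists=[
            pm.Bernoulli.dist(p=1e-4),     # background
            pm.Bernoulli.dist(p=p_type1),  # signal
        ],
        observed=flips,
    )
    idata = pm.sample()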