Fitting a mixture of binomials

Hello,

I am looking for some help fitting a mixture of binomials. My data is of the form
Counts = w*Bin(p1,n) + (1-w)*Bin(p2,n)

however, when I do inference on the mixture, w is always mis-estimated. Any help is appreciated, and I can give more code if that is useful. The main goal is to estimate the mixing weight of two known distributions.

When I simulate some dummy data with, say, a mixing weight w of 0.4, I always get back a w of 0.6.
Snippet:

# assumes `import pymc as pm` and `import pytensor.tensor as pt`,
# running inside a pm.Model() context
n = pm.ConstantData("n", value=row["N"])
mix_frac = pm.Beta("mix_frac", alpha=1, beta=1)

p_type0 = 1e-4
p_type1 = 5e-2
observed = row["observed_counts"]
dims = "observation"

component_t0 = pm.Binomial.dist(n=n, p=p_type0)
component_t1 = pm.Binomial.dist(n=n, p=p_type1)

observed_counts = pm.Mixture(
    "observed_counts",
    w=pt.stack([1 - mix_frac, mix_frac], axis=0),  # length-2 weight vector
    comp_dists=[component_t0, component_t1],
    observed=observed,
)

idata = pm.sample(progressbar=False)

I have the same issue if I try to infer the p’s and specify the w a priori, though this is less important to me right now.
Snippet:

# assumes the same imports and pm.Model() context as the snippet above
n = pm.ConstantData("n", value=scores["N"], dims="observation")
mix_frac = pm.ConstantData("mix_frac", value=scores["fraction"], dims="observation")

component_t0 = pm.Binomial.dist(n=n, p=p_type0)
component_t1 = pm.Binomial.dist(n=n, p=p_type1)

# note: pm.Deterministic accepts no observed= argument, so this block
# fails as written; the weighted sum also needed explicit parentheses
observed_counts = pm.Deterministic(
    "counts",
    (1 - mix_frac) * component_t0 + mix_frac * component_t1,
    observed=observed,
    dims=dims,
)

idata = pm.sample()
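
For what it’s worth, a minimal sketch of how known per-row weights could be passed to pm.Mixture instead of pm.Deterministic (my own illustration of the pattern, not tested against this data; it reuses the components and observed from the snippet above):

# sketch: keep pm.Mixture and feed the known fractions in as data;
# pm.Deterministic has no likelihood, so it cannot take observations
w_row = pm.ConstantData("w_row", scores["fraction"], dims="observation")
observed_counts = pm.Mixture(
    "observed_counts",
    w=pt.stack([1 - w_row, w_row], axis=-1),  # shape (observation, 2)
    comp_dists=[component_t0, component_t1],
    observed=observed,
    dims="observation",
)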

If you set the following, your model seems to do ok:

    n = pm.Data("n", value=[10,10])
    p_type0 = 0.00001
    p_type1 = 1-p_type0
    observed = [2,2,2]

So perhaps it has to do with the specifics of your data and p values (ha)?

If you also try to infer the value of the ps, I suspect that you will run into identifiability issues. But that’s just my intuition. Or maybe I am misunderstanding the setup.
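
(For what it’s worth, if you do go down that road, one standard trick is an ordered constraint on the ps so the components can’t swap labels. A minimal sketch, assuming PyMC v5’s transforms module; the initvals are mine:)

import pymc as pm
from pymc.distributions.transforms import ordered

with pm.Model() as m:
    # forcing p[0] < p[1] pins "background" to the first component and
    # "signal" to the second, removing the label-switching ambiguity
    p = pm.Beta("p", alpha=1, beta=1, shape=2,
                transform=ordered, initval=[1e-4, 5e-2])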

Hey cluhmann,

Any insight into why this data has issues? I appreciate your help.

Focusing on inference first (I’ll ask more about fitting later but want to separate the questions for ease):
Dummy data:

N                    11696.000000
fraction                 0.219695
background_counts        2.000000
signal_counts          600.000000
observed_counts        134

(Note that this data is contrived by setting the p0 and p1 I specify below, setting fraction, and then manually drawing from the two distributions and summing them together, so there is a known ground truth.)

inference:

model = pm.Model(coords=coords)
with model:

    n = pm.ConstantData("n", value=row["N"])
    mix_frac = pm.Beta("mix_frac", alpha=1, beta=1)

    p_type0 = 1e-4
    p_type1 = 5e-2
    observed = row["observed_counts"]
    dims = "observation"

    component_t0 = pm.Binomial.dist(n=n, p=p_type0)
    component_t1 = pm.Binomial.dist(n=n, p=p_type1)

    observed_counts = pm.Mixture(
        "observed_counts",
        w=pt.stack([1 - mix_frac, mix_frac], axis=0),
        comp_dists=[component_t0, component_t1],
        observed=observed,
    )

    idata = pm.sample(progressbar=False)
    pm.sample_posterior_predictive(idata, extend_inferencedata=True,progressbar=False)

    params = {}
    for param_name in list(idata.posterior.data_vars.keys()):
        param_value = idata.posterior.data_vars[param_name].mean().values
        params[param_name]=param_value

and I get:
params
{'mix_frac': array(0.67010995)}

If you can supply the data generation code, it would probably be easier to debug.

For sure, thank you!

import numpy as np
import pandas as pd
import scipy.stats as ss
import matplotlib.pyplot as plt
import seaborn as sns

background_rate = 1e-4
signal_rate = 5e-2

simulated_data = pd.DataFrame(
    data=[ss.norm.rvs(loc=10_000, scale=1_500, size=500)],
    index=["N"],
).T

simulated_data["N"] = simulated_data["N"].astype(int)
simulated_data["sample_type"] = [np.random.randint(2) for i in range(len(simulated_data))]

simulated_data["fraction"] = ss.uniform.rvs(size=500)

simulated_data["background_counts"] = ss.binom.rvs(n=simulated_data["N"], p=background_rate)

simulated_data["signal_counts"] = ss.binom.rvs(n=simulated_data["N"], p=signal_rate)
cnd = simulated_data["sample_type"] == 0
simulated_data.loc[cnd, "signal_counts"] = 0

simulated_data["observed_counts"] = (
    simulated_data["signal_counts"] * simulated_data["fraction"]
    + simulated_data["background_counts"] * (1 - simulated_data["fraction"])
)
simulated_data["observed_counts"] = simulated_data["observed_counts"].astype(int)

for col in simulated_data:
    plt.figure(figsize=(7, 5))
    sns.histplot(simulated_data[col])

Right now I am just asking about inference on the sample_type == 1 samples. The other group is to make fitting the parameters easier, but I will ask about that later.

Also, while I am treating background counts as a static number (i.e., I’m not multiplying by 1-w), it is so small relative to signal counts that it shouldn’t be affecting the estimation of w by so much. I am retrying right now with 1-w added to confirm and will edit my post soon.

Edit: yeah, it doesn’t matter if you add 1-w, but I updated the above code to include it just to be thorough.

I think you are simply running into identifiability issues. You are mixing ps in every observation. So every observation is something like this (in expectation):

N * (w * background_rate + (1 - w) * signal_rate)

But then your model is asking for the value of w that led to this particular observation (which is, itself, a mixture). The model you built is mixing across observations: formally, the mixture likelihood for each row is p(y) = w0 * Binom(y | n, p0) + w1 * Binom(y | n, p1), a statement about which component a whole observation came from, not a blend of counts within an observation. So some observations will reflect background_rate whereas others will reflect signal_rate (though you won’t know which is which a priori). You can see this by checking out the posterior predictive plot:

As you can see, the model expects there to be some observations with large values of observed_counts and other observations with small values of observed_counts. But the data is unimodal (suggesting that the mixture you are modeling is unnecessary).

So I think you need to align your model with the data generating process you are envisioning.

I agree that sampling counts is harder, which is why I wanted to put that part down for now and just focus on inferring weights. The distributions are so different that I could understand if, identifiability-wise, you can’t fully resolve w, but shouldn’t it be almost correct given that all your counts basically have to come from signal_rate? With the rates the way they are and these Ns, it’s roughly impossible to get more than 5 counts from background_rate, meaning the model should put w at whatever value of signal_rate*w makes the observed counts most likely (even if it can’t then resolve the last 0.00X decimal of weight). Being off by 0.4 in w seems like too much even in this case?
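
(To put rough numbers on that claim, a quick check with scipy, using the N from the dummy row above:)

import scipy.stats as ss

# chance of more than 5 background successes at N = 11696, p = 1e-4
print(ss.binom.sf(5, n=11696, p=1e-4))  # ~1e-3, i.e. essentially never
# expected signal successes at the same N
print(11696 * 5e-2)                     # ~585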

Separately though, in your case, are you trying to use all samples or only the sample_type==0 samples? In principle, with the simulation code above, sample_type 0 samples have a fraction of 0, so they should completely identify background_rate as they are drawn entirely from it. That means the remaining samples should be able to completely identify signal_rate, since background_rate has been solved out. Or is this understanding wrong?

It seems like my understanding must be wrong, but I can’t figure out why.

I only simulated sample_type==1. Apologies if I got it backwards.

# %%
background_rate = 1e-4
signal_rate = 5e-2

num_obs = 500
simulated_data = pd.DataFrame(
    data=[ss.norm.rvs(loc=10_000, scale=1_500, size=num_obs)], index=["N"]
).T

simulated_data["N"] = simulated_data["N"].astype(int)
simulated_data["sample_type"] = [
    np.random.randint(2) for i in range(len(simulated_data))
]

simulated_data["fraction"] = ss.uniform.rvs(size=num_obs)
simulated_data["fraction"] = 0.5

simulated_data["background_counts"] = ss.binom.rvs(
    n=simulated_data["N"], p=background_rate
)

simulated_data["signal_counts"] = ss.binom.rvs(n=simulated_data["N"], p=signal_rate)
cnd = simulated_data["sample_type"] == 0
simulated_data.loc[cnd, "signal_counts"] = 0

simulated_data["observed_counts"] = (
    simulated_data["signal_counts"] * simulated_data["fraction"]
) + (simulated_data["background_counts"] * (1 - simulated_data["fraction"]))

simulated_data["observed_counts"] = simulated_data["observed_counts"].astype(int)

simulated_data = simulated_data[simulated_data["sample_type"] == 1]

Even in the case you had, shouldn’t the mixing fraction come out closer to 0.5? I am a bit confused how this is not identifiable, given that the background rate cannot possibly generate the observed counts at the fractions predicted.

I guess also, if you look at the estimated fractions, can you confirm they are still wrong?

Last question: can you comment on the logic that having both sample_types improves identifiability? Does that seem right or wrong?

Maybe this helps (or helps show why I am misunderstanding). Below, I lay out two different data generating processes (DGP#1 and DGP#2).

Code
# %%
import numpy as np
import pandas as pd
import scipy.stats as ss
import matplotlib.pyplot as plt

# %%
background_rate = 1e-4
signal_rate = 5e-2

num_obs = 500
simulated_data = pd.DataFrame(
    data=[ss.norm.rvs(loc=10_000, scale=1_500, size=num_obs)], index=["N"]
).T

# number of flips per observation
simulated_data["N"] = simulated_data["N"].astype(int)

# mixing fraction
simulated_data["fraction"] = 0.5

# counts of "background" successes
simulated_data["background_counts"] = ss.binom.rvs(
    n=simulated_data["N"], p=background_rate
)

# counts of "signal" successes
simulated_data["signal_counts"] = ss.binom.rvs(n=simulated_data["N"], p=signal_rate)

# %%
# DGP #1
# for each observation, mix the signal and background successes according to the mixing fraction
simulated_data["observed_counts_DGP1"] = (
    (simulated_data["signal_counts"] * simulated_data["fraction"])
    + (simulated_data["background_counts"] * (1 - simulated_data["fraction"]))
).astype(int)

# %%
# DGP #2
# for each observation, select "background" or "signal" successes according to the mixing fraction
switch = ss.binom.rvs(n=1, p=simulated_data["fraction"])
simulated_data["observed_counts_DGP2"] = np.where(
    switch, simulated_data["signal_counts"], simulated_data["background_counts"]
)
# %%
plt.hist(simulated_data["observed_counts_DGP1"], color="r", alpha=0.5)
plt.hist(simulated_data["observed_counts_DGP2"], color="g", alpha=0.5)
plt.xlabel("# of successes per observation")
plt.ylabel("# observations")
plt.legend(["DGP1", "DGP2"])
plt.show()

DGP#1 is roughly what you wrote in your scipy/numpy code. DGP#2 is roughly what your model reflects. As the resulting plot shows, these processes are not the same and are not expected to produce related patterns of data. So what I may have mistaken for an identification problem seems to (also) be a misspecification problem. But I am not sure which DGP is the one you actually believe reflects your application. If it’s DGP#1, then your model is misspecified, and I suspect you will have identification problems even if you specify your model correctly. If it’s DGP#2, then you are generating your synthetic data inappropriately.
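
If DGP#1 is what you actually believe, a model closer to it might collapse the mixture into a single blended rate (a sketch under my assumption that blending the rates in expectation is acceptable, reusing the names from the simulation code above):

import pymc as pm

# DGP#1 in expectation: each observation's successes arrive at one
# blended rate, so the likelihood is a single Binomial with a convex
# combination of the two rates rather than a two-component Mixture
with pm.Model(coords={"observation": simulated_data.index}) as blended_model:
    n = pm.ConstantData("n", simulated_data["N"], dims="observation")
    mix_frac = pm.Beta("mix_frac", alpha=1, beta=1)
    p_blend = mix_frac * signal_rate + (1 - mix_frac) * background_rate
    pm.Binomial(
        "observed_counts",
        n=n,
        p=p_blend,
        observed=simulated_data["observed_counts_DGP1"],
        dims="observation",
    )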


It should be DGP2. And that is what I see when I run my code:

(figures omitted)

And, for thoroughness, in case I typo’d earlier somehow: I split out non_fractional_signal_counts below to generate the above figure for debugging.

background_rate = 1e-4
signal_rate = 5e-2

simulated_data = pd.DataFrame(
    data=[ss.norm.rvs(loc=10_000, scale=1_500, size=500)],
    index=["N"],
).T

simulated_data["N"] = simulated_data["N"].astype(int)
simulated_data["sample_type"] = [np.random.randint(2) for i in range(len(simulated_data))]

simulated_data["fraction"] = ss.uniform.rvs(size=500)

simulated_data["background_counts"] = ss.binom.rvs(n=simulated_data["N"], p=background_rate)

simulated_data["non_fractional_signal_counts"] = ss.binom.rvs(n=simulated_data["N"], p=signal_rate)
cnd = simulated_data["sample_type"] == 0
# zero out the pre-fraction counts; zeroing "signal_counts" here (as an
# earlier draft did) would be overwritten two lines below
simulated_data.loc[cnd, "non_fractional_signal_counts"] = 0

simulated_data["signal_counts"] = simulated_data["non_fractional_signal_counts"] * simulated_data["fraction"]
simulated_data["observed_counts"] = simulated_data["signal_counts"] + simulated_data["background_counts"] * (1 - simulated_data["fraction"])

simulated_data["observed_counts"] = simulated_data["observed_counts"].astype(int)
for col in simulated_data:
    plt.figure(figsize=(7, 5))
    sns.histplot(simulated_data[col])

I simplified your code to fix the fraction parameter at 0.5 instead of letting it vary across observations. When you let it vary, what variation is causing what (or caused by what) becomes ambiguous. But the point about incompatibility between your DGP and the model remains.

Even in that case, I get DGP2, not DGP1.

And this should be solvable when you include both sample types, since sample_type 0 fully defines p0.

I’m not sure what you mean by “get DGP2”. Where did that plot come from?

Sorry – you posted two figures and asked if it was DGP1 or DGP2. The scenario should be like DGP2 in principle.

The figure above is what I get from my simulations with a fixed fraction of 0.5 (matching how you locked it).

See below

background_rate = 1e-4
signal_rate=5e-2

simulated_data = pd.DataFrame(
    data = [ss.norm.rvs(loc = 10_000, scale = 1_500, size = 500)],
    index = ["N"]
).T

simulated_data["N"] = simulated_data["N"].astype(int)
simulated_data["sample_type"] = [np.random.randint(2) for i in range(len(simulated_data))]

simulated_data["fraction"] = 0.5#ss.uniform.rvs(size=500)

simulated_data["background_counts"] = ss.binom.rvs(n=simulated_data["N"],p=background_rate)

simulated_data["non_fractional_signal_counts"] = ss.binom.rvs(n=simulated_data["N"],p=signal_rate)
cnd = simulated_data["sample_type"] == 0
simulated_data.loc[cnd,"non_fractional_signal_counts"] = 0

simulated_data["signal_counts"] = simulated_data["non_fractional_signal_counts"]*simulated_data["fraction"]
simulated_data["observed_counts"] = simulated_data["signal_counts"] + simulated_data["background_counts"]*(1-simulated_data["fraction"])

simulated_data["observed_counts"] = simulated_data["observed_counts"].astype(int)
for col in simulated_data:
    plt.figure(figsize=(7,5))
    sns.histplot(simulated_data[col])

Let’s say I want to write a mixture model for what I am simulating (it sounds like maybe the issue is that I wrote the model wrong in PyMC). Could you help me with that?

If you model the entire data set you generated (i.e., if you don’t throw out the rows where simulated_data["sample_type"] == 0), then you get bimodal data:

and the model fits ok for me:

but the posterior predictive check looks bad:

because (I assume) you are still mixing the ps on a per-observation basis for all observations where simulated_data["sample_type"] == 1.

Here is the same observed data as plotted above, but plotted as proportions:

There is a peak at approximately 0.025. But the model can’t place a peak there, because the model is a mixture of 0.0001 and 0.05. The data is a mixture of zeros (when simulated_data["sample_type"] == 0) and a mixture of 0.0001 and 0.05 (when simulated_data["sample_type"] == 1).
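
(One way to encode that zeros-versus-mixture structure, sketched under my own assumptions rather than taken verbatim from this thread: give every row its own weight and pin it to zero for the sample_type == 0 rows, so those rows only inform the background component:)

import pymc as pm
import pytensor.tensor as pt

is_signal = (simulated_data["sample_type"] == 1).to_numpy().astype(float)
with pm.Model(coords={"observation": simulated_data.index}) as split_model:
    n = pm.ConstantData("n", simulated_data["N"], dims="observation")
    mix_frac = pm.Beta("mix_frac", alpha=1, beta=1)
    w_row = is_signal * mix_frac  # 0 for background-only rows
    pm.Mixture(
        "observed_counts",
        w=pt.stack([1 - w_row, w_row], axis=-1),
        comp_dists=[pm.Binomial.dist(n=n, p=1e-4),
                    pm.Binomial.dist(n=n, p=5e-2)],
        observed=simulated_data["observed_counts"],
        dims="observation",
    )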


Thanks – two questions:

  1. What do you mean by mixing the p’s on a per-observation basis?

  2. The peak at ~200 counts should still be resolvable given a weight of 0.5 for all the signal samples, since p0 is forced to be very low by the fraction-0 samples.

A note: I realized that the fraction for sample_type 0 should be set to 0. I was doing this in a different spot in my code, so I moved it into the right place below:

simulated_data["non_fractional_signal_counts"] = ss.binom.rvs(n=simulated_data["N"],p=signal_rate)
cnd = simulated_data["sample_type"] == 0
simulated_data.loc[cnd,"non_fractional_signal_counts"] = 0
simulated_data.loc[cnd,"fraction"] = 0

That would be as in DGP#1 in my code here.

When mix_frac is 0.5, the model behaves as seen here.