Fitting mixture of binomials

cluhmann · January 30, 2025, 7:39pm

Maybe this helps (or helps show why I am misunderstanding). Below, I lay out two different data generating processes (DGP#1 and DGP#2).

Code

# %%
import numpy as np
import pandas as pd
import scipy.stats as ss
import matplotlib.pyplot as plt

# %%
background_rate = 1e-4
signal_rate = 5e-2

num_obs = 500
# number of flips per observation
simulated_data = pd.DataFrame(
    {"N": ss.norm.rvs(loc=10_000, scale=1_500, size=num_obs).astype(int)}
)

# mixing fraction
simulated_data["fraction"] = 0.5

# counts of "background" successes
simulated_data["background_counts"] = ss.binom.rvs(
    n=simulated_data["N"], p=background_rate
)

# counts of "signal" successes
simulated_data["signal_counts"] = ss.binom.rvs(n=simulated_data["N"], p=signal_rate)

# %%
# DGP #1
# for each observation, mix the signal and background successes according to the mixing fraction
simulated_data["observed_counts_DGP1"] = (
    (simulated_data["signal_counts"] * simulated_data["fraction"])
    + (simulated_data["background_counts"] * (1 - simulated_data["fraction"]))
).astype(int)

# %%
# DGP #2
# for each observation, select "background" or "signal" successes according to the mixing fraction
switch = ss.binom.rvs(n=1, p=simulated_data["fraction"])
simulated_data["observed_counts_DGP2"] = np.where(
    switch, simulated_data["signal_counts"], simulated_data["background_counts"]
)

# %%
plt.hist(simulated_data["observed_counts_DGP1"], color="r", alpha=0.5)
plt.hist(simulated_data["observed_counts_DGP2"], color="g", alpha=0.5)
plt.xlabel("# of successes per observation")
plt.ylabel("# observations")
plt.legend(["DGP1", "DGP2"])
plt.show()

DGP#1 is roughly what you wrote in your scipy/numpy code. DGP#2 is roughly what your model reflects. As the resulting plot shows, these processes are not the same and should not be expected to produce similar patterns of data: DGP#1 averages the signal and background counts within every observation, so its counts form a single cluster between the two, whereas DGP#2 draws each observation entirely from one component or the other, so its counts split into two clusters. So what I may have mistaken for an identification problem seems to (also) be a misspecification problem. But I am not sure which DGP you actually believe reflects your application. If it’s DGP#1, then your model is misspecified, and I suspect you will have identification problems even if you specify your model correctly. If it’s DGP#2, then you are generating your synthetic data inappropriately.
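To make the mismatch concrete, here is a minimal sketch that scores both kinds of data under a two-component binomial mixture likelihood. It assumes that is roughly what the model in question encodes. The fixed `n_trials`, the seed, the larger `num_obs`, and the helper `mixture_loglik` are my own simplifications for illustration, not part of the code above.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)  # fixed seed (assumption, for reproducibility)

n_trials = 10_000  # hypothetical fixed N, rather than the varying N above
background_rate = 1e-4
signal_rate = 5e-2
fraction = 0.5
num_obs = 5_000

background = ss.binom.rvs(
    n=n_trials, p=background_rate, size=num_obs, random_state=rng
)
signal = ss.binom.rvs(n=n_trials, p=signal_rate, size=num_obs, random_state=rng)

# DGP #1: weighted average of the two counts within each observation
dgp1 = (fraction * signal + (1 - fraction) * background).astype(int)

# DGP #2: each observation is entirely "signal" or entirely "background"
switch = ss.binom.rvs(n=1, p=fraction, size=num_obs, random_state=rng)
dgp2 = np.where(switch, signal, background)


def mixture_loglik(counts):
    """Log-likelihood under f * Binom(N, p_signal) + (1 - f) * Binom(N, p_background)."""
    log_sig = np.log(fraction) + ss.binom.logpmf(counts, n_trials, signal_rate)
    log_bkg = np.log1p(-fraction) + ss.binom.logpmf(counts, n_trials, background_rate)
    return np.logaddexp(log_sig, log_bkg).sum()


# The means are close, but the spreads are not
print("means:", dgp1.mean(), dgp2.mean())
print("stds: ", dgp1.std(), dgp2.std())

# The mixture likelihood matches DGP #2 and fits DGP #1 very poorly
print("loglik:", mixture_loglik(dgp1), mixture_loglik(dgp2))
```

The two data sets have nearly the same mean, which is why they can look interchangeable at first glance. But the mixture likelihood assigns very little probability to counts that sit halfway between the two components, and that is exactly where DGP#1 puts almost all of its data.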
