Maybe this helps (or at least shows where I am misunderstanding). Below, I lay out two different data generating processes (DGP#1 and DGP#2).
Code
# %%
import numpy as np
import pandas as pd
import scipy.stats as ss
import matplotlib.pyplot as plt
# %%
background_rate = 1e-4
signal_rate = 5e-2
num_obs = 500
simulated_data = pd.DataFrame(
    data=[ss.norm.rvs(loc=10_000, scale=1_500, size=num_obs)], index=["N"]
).T
# number of flips per observation
simulated_data["N"] = simulated_data["N"].astype(int)
# mixing fraction
simulated_data["fraction"] = 0.5
# counts of "background" successes
simulated_data["background_counts"] = ss.binom.rvs(
    n=simulated_data["N"], p=background_rate
)
# counts of "signal" successes
simulated_data["signal_counts"] = ss.binom.rvs(n=simulated_data["N"], p=signal_rate)
# %%
# DGP #1
# for each observation, take a weighted average of the signal and background
# successes according to the mixing fraction (truncated to a whole count)
simulated_data["observed_counts_DGP1"] = (
    (simulated_data["signal_counts"] * simulated_data["fraction"])
    + (simulated_data["background_counts"] * (1 - simulated_data["fraction"]))
).astype(int)
# %%
# DGP #2
# for each observation, select either the "background" or the "signal" successes,
# choosing "signal" with probability equal to the mixing fraction
switch = ss.binom.rvs(n=1, p=simulated_data["fraction"])
simulated_data["observed_counts_DGP2"] = np.where(
    switch, simulated_data["signal_counts"], simulated_data["background_counts"]
)
# %%
plt.hist(simulated_data["observed_counts_DGP1"], color="r", alpha=0.5)
plt.hist(simulated_data["observed_counts_DGP2"], color="g", alpha=0.5)
plt.xlabel("# of successes per observation")
plt.ylabel("# observations")
plt.legend(["DGP1", "DGP2"])
plt.show()
DGP#1 is roughly what you wrote in your scipy/numpy code. DGP#2 is roughly what your model reflects. As the resulting plot shows, these processes are not the same and should not be expected to produce similar patterns of data. With N around 10,000 and a mixing fraction of 0.5, DGP#1 averages the two counts, so every observation lands near 250 successes. DGP#2 picks one of the two counts, so each observation lands either near 1 (background) or near 500 (signal), and the histogram is bimodal.

So what I may have mistaken for an identification problem seems to (also) be a misspecification problem. But I am not sure which DGP you actually believe reflects your application. If it’s DGP#1, then your model is misspecified, and I suspect you will have identification problems even if you specify your model correctly. If it’s DGP#2, then you are generating your synthetic data inappropriately.
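To put a number on the mismatch, here is a minimal sketch that scores both synthetic data sets under a two-component binomial mixture likelihood. I am assuming that this mixture is what the model amounts to, which is my inference rather than something I have checked against your model code. The helper `mixture_loglik`, the seeded `rng`, and the variable names are mine, for illustration only.

```python
import numpy as np
import scipy.stats as ss
from scipy.special import logsumexp

rng = np.random.default_rng(0)  # fixed seed (my choice) so the sketch is reproducible

background_rate = 1e-4
signal_rate = 5e-2
fraction = 0.5
num_obs = 500

# number of flips per observation
N = rng.normal(loc=10_000, scale=1_500, size=num_obs).astype(int)
background_counts = rng.binomial(N, background_rate)
signal_counts = rng.binomial(N, signal_rate)

# DGP #1: weighted average of the two counts, truncated to a whole count
dgp1 = (fraction * signal_counts + (1 - fraction) * background_counts).astype(int)

# DGP #2: pick the signal count with probability `fraction`, else the background count
switch = rng.binomial(1, fraction, size=num_obs).astype(bool)
dgp2 = np.where(switch, signal_counts, background_counts)


def mixture_loglik(counts, n, fraction, signal_rate, background_rate):
    """Total log-likelihood under an (assumed) two-component binomial mixture."""
    log_signal = np.log(fraction) + ss.binom.logpmf(counts, n, signal_rate)
    log_background = np.log1p(-fraction) + ss.binom.logpmf(counts, n, background_rate)
    # log(f * p_signal + (1 - f) * p_background), summed over observations
    return logsumexp([log_signal, log_background], axis=0).sum()


loglik_dgp1 = mixture_loglik(dgp1, N, fraction, signal_rate, background_rate)
loglik_dgp2 = mixture_loglik(dgp2, N, fraction, signal_rate, background_rate)
print(f"mixture log-likelihood, DGP1 data: {loglik_dgp1:.1f}")
print(f"mixture log-likelihood, DGP2 data: {loglik_dgp2:.1f}")
```

Even at the true parameter values, the DGP#1 data should score far worse under the mixture than the DGP#2 data, because counts near 250 are very unlikely under either binomial component. That is the misspecification in a single number.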