Fitting conflates signal component with background in Mixture model

The issue I am running into seems to be a general one: given a mixture model in which one of the components is Uniform (representing some background), a sufficiently high background biases the width parameters of the remaining components (say a Normal peak representing the signal).

In other words, fitting the mixture (peak signal + uniform background) will result in a wider peak and a lower background weight than expected.

This appears to happen irrespective of the type of sampler/optimizer I am using (ADVI, NUTS, etc).

My question is this: Are there any hyperparameter-setting strategies or tricks that can help account for this bias? I have tried adjusting the model priors, but that only gets me so far and has not proven to be a particularly robust solution.

To give you a sense of what I’m talking about, I’ve included the following simple example.

Generate some test data

The example I’m using is a two-component model consisting of a Normal distribution centered at zero with width sigma and a Uniform background component. The component weights are (w_signal, w_back) = (0.1, 0.9). I generate some test data using scipy:

import numpy as np
import scipy as sp
import scipy.stats  # ensure sp.stats is available

data_range = [-20000, 20000]
N = 10000
w_back = 0.9

N_back = int(N*w_back)
N_signal = int(N*(1-w_back))

mu = 0
sigma = 2000

background = np.round(sp.stats.uniform.rvs(loc=data_range[0], scale=data_range[1] - data_range[0], size=N_back, random_state=42))
signal = np.round(sp.stats.norm.rvs(loc=mu, scale=sigma, size=N_signal, random_state=42))

Here I’ve chosen a total number of data points N and split them between the peak signal and the background according to the background weight w_back. Plotting the above:

import matplotlib.pyplot as plt

plt.hist(np.concatenate([signal,background]), bins=np.linspace(data_range[0], data_range[1], 300))
plt.xlim(data_range)
plt.show()

[background_example: histogram of the simulated data, showing the Normal peak at zero on top of the uniform background]

Here you can see the background is very high, but the peak is still clearly visible.

Fitting the PyMC model

In order to bias the fit toward smaller values of sigma I’m using an Exponential prior on that parameter. I am also skewing the Dirichlet prior on the weights to assign more weight to the background. Here I’m showing the results using ADVI as the optimizer, but the same thing happens with MCMC samplers like NUTS.

import pymc as pm

with pm.Model() as mod:

    # Signal component: Normal peak with unknown location and width
    peak_mu = pm.Uniform("mu", lower=data_range[0], upper=data_range[1])
    peak_sig = pm.Exponential("sig", lam=1/2000)

    peak_pdf = pm.Normal.dist(mu=peak_mu, sigma=peak_sig)
    # Background component: fixed Uniform over the full data range
    back_pdf = pm.Uniform.dist(lower=data_range[0], upper=data_range[1])

    # Mixture weights, with the prior skewed toward the background component
    weights = pm.Dirichlet("w", a=np.array([1, 3]))

    model = pm.Mixture("mod", w=weights, comp_dists=[peak_pdf, back_pdf],
                       observed=np.concatenate([signal, background]))

with mod:
    vi = pm.ADVI()
    ivi = vi.fit(50000)        # fit the mean-field ADVI approximation

idata = ivi.sample(10000)      # draw posterior samples from the fitted approximation
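
For completeness, the NUTS comparison mentioned above amounts to something like the following (a sketch; the draw/tune/chain settings and the idata_nuts name are arbitrary):

# Sample the same model with NUTS instead of ADVI
with mod:
    idata_nuts = pm.sample(draws=2000, tune=2000, chains=4, random_seed=42)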

Fitting results

Here we see that the fitting results show two complementary systematic biases: 1) the peak width sig is biased toward larger values (posterior mean = 2131, whereas the ground truth is sigma = 2000), and 2) the background weight w1 is biased toward smaller values (posterior mean = 0.89, whereas the ground truth is w_back = 0.9).

Effectively, what appears to be happening is that the optimization WIDENS the signal peak (sig) to absorb some fraction of the uniform background, which results in a DECREASE in the inferred background weight (w1).

import arviz as az

az.plot_trace(idata);
az.plot_posterior(idata);

[trace_plot: ArviZ trace and posterior plots for mu, sig, and the weights w]
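
For a numeric check of the posterior means quoted above, a plain ArviZ summary works (nothing model-specific here):

# Posterior means and HDIs for the peak parameters and the mixture weights
az.summary(idata, var_names=["mu", "sig", "w"], kind="stats")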

A few criticisms I could imagine a reader might present:

  • But the background is so high to begin with.
  • The bias in the result really isn’t all that significant given that the ground truth values are still within the HDI for each posterior.

Yes, the background is high, but looking at the data the peak is still clearly visible, so I think it should still be possible to infer it correctly. Furthermore, the Dirichlet weights allow for the possibility that all of the data is attributed to the uniform background, so the correct result is well within the model’s parameter space.

Yes, in this case the true value is still within the HDI of each posterior, but the fact that the bias is systematic, persists regardless of how I tune the prior hyperparameters, and shows up in every case I’ve tried with a mixture of a localized signal plus a uniform background seems like a big problem.

Also, what I am presenting here is just the simplest example I could think of that reproduces the issue. My actual use case, which is a more complicated mixture model, shows the same bias to a much more extreme degree, such that the ground truth parameter values fall far outside the HDIs of their respective posteriors.

Has anyone else seen this correlated bias between background weight and other model parameters? Is there any robust way of adjusting for this? Thanks for any insights!

I would have a hard time calling those biases, but you said you’ve tried many examples. Did you use different random seeds for your fake data?

If there’s a consistent bias and it’s not a data issue, it could come from the priors. I am surprised the uncertainty around the Normal parameters is so large with ~1000 datapoints for that component.

Have you tried giving a more informative prior to mu or sigma? Maybe something like a Gamma or LogNormal instead of the Exponential for the latter?
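
Something along these lines, for example; the hyperparameter values and the mod_alt name below are purely illustrative, and you’d want to set them from what you actually know about the peak width:

# Sketch: same model, but with a more informative prior on the peak width.
# The Gamma (or LogNormal) hyperparameters are illustrative only.
with pm.Model() as mod_alt:
    peak_mu = pm.Uniform("mu", lower=data_range[0], upper=data_range[1])
    peak_sig = pm.Gamma("sig", mu=2000, sigma=1000)
    # or: peak_sig = pm.LogNormal("sig", mu=np.log(2000), sigma=0.5)

    peak_pdf = pm.Normal.dist(mu=peak_mu, sigma=peak_sig)
    back_pdf = pm.Uniform.dist(lower=data_range[0], upper=data_range[1])

    weights = pm.Dirichlet("w", a=np.array([1, 3]))
    pm.Mixture("mod", w=weights, comp_dists=[peak_pdf, back_pdf],
               observed=np.concatenate([signal, background]))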

Also, more generally, I suggest using different random seeds (or, probably better, a single numpy Generator) when simulating draws from the Normal and Uniform. Otherwise you might induce weird correlations, depending on how Scipy takes uniform and normal draws.

In your example you used 42 for both.
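
i.e. something like this (just a sketch of the Generator approach, reusing the variables from your data-generation block):

# Share one Generator across both calls instead of passing the integer seed 42 twice
rng = np.random.default_rng(42)
background = np.round(sp.stats.uniform.rvs(loc=data_range[0], scale=data_range[1] - data_range[0], size=N_back, random_state=rng))
signal = np.round(sp.stats.norm.rvs(loc=mu, scale=sigma, size=N_signal, random_state=rng))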

@ricardoV94 ah, it seems you’re right: it’s a data issue! It turns out the bias I was claiming above was just an artifact of the particular random samples I generated. I had generated a few data sets (each with a different manually chosen seed), and by bad luck they all showed approximately the same bias (which surprises me given that each has 10k samples). I then took the more rigorous approach of using a numpy Generator and reran the simulation above a couple dozen times; the resulting mean posterior values fell pretty evenly on both sides of the ground truth.
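
Roughly, the repeated check looked like the following (a sketch rather than the exact script I ran; the rng, obs, approx, and sig_means names are just for illustration):

# Fresh data from a single Generator each iteration, refit with ADVI,
# and record the posterior mean of sig
rng = np.random.default_rng()
sig_means = []
for _ in range(25):
    background = np.round(sp.stats.uniform.rvs(loc=data_range[0], scale=data_range[1] - data_range[0], size=N_back, random_state=rng))
    signal = np.round(sp.stats.norm.rvs(loc=mu, scale=sigma, size=N_signal, random_state=rng))
    obs = np.concatenate([signal, background])

    with pm.Model():
        peak_mu = pm.Uniform("mu", lower=data_range[0], upper=data_range[1])
        peak_sig = pm.Exponential("sig", lam=1/2000)
        peak_pdf = pm.Normal.dist(mu=peak_mu, sigma=peak_sig)
        back_pdf = pm.Uniform.dist(lower=data_range[0], upper=data_range[1])
        weights = pm.Dirichlet("w", a=np.array([1, 3]))
        pm.Mixture("mod", w=weights, comp_dists=[peak_pdf, back_pdf], observed=obs)
        approx = pm.fit(50000, method="advi")

    sig_means.append(approx.sample(2000).posterior["sig"].mean().item())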

It seems my example was a red herring. Unfortunately, this doesn’t explain my actual case, where the behavior I described above does happen. I’ll keep working on it. Thanks for the help!
