Gaussian Mixture Model fitting


I have been having some problems with reusing PyMC3 models iteratively. I have seen some PyMC3 notebooks that attempt this (e.g. this), but the Interpolated() in from_posterior() does not work for me because my data is multidimensional. Also, since it is a Gaussian Mixture Model, gaussian_kde() is not a good choice either.

OK, so in the first iteration I fit a GMM where the only free parameters are the weights (the mus and sigmas are fixed). Then I sample circa 300 points from this distribution and try to fit a GMM to them, this time with all parameters free: weights, mus, and sigmas. This does not seem to work properly. Could you please direct me to a better solution? Am I making some obvious mistake?

Thank you for your time!

I attach code and also visualizations of the “proposed fit”.

The picture below shows the proposed Gaussian mixture fit on top of the original histogram (note that the GMM is scaled so that it is visible next to the histogram).


The following picture shows the posterior predictive. For the majority of draws (and also the mean, the dashed line) it looks okay, but the gaussian in the middle concerns me…


The code:

import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
import arviz as az
from scipy.stats import norm

TRUE_MUS = np.array([100, 300])
TRUE_SIGS = np.array([50, 60])

# creation of samples from a previously created PyMC3 model
with pm.Model() as model:
    w = np.array([1.0/len(TRUE_MUS) for _ in TRUE_MUS])
    y = pm.NormalMixture('y', w, mu=TRUE_MUS, sigma=TRUE_SIGS)

    trace = pm.sample(800)
    burn_in = 500  # skips the n iters, where the MCMC is just wandering
    chain = trace[burn_in:]
    ppc = pm.sample_posterior_predictive(chain, var_names=['y'], model=model)
    data = ppc['y'].transpose()

    n = len(data)
    bins = int(np.sqrt(n))

    plt.hist(data, bins=bins, label="Data")

# fitting of a Gaussian Mixture Model based on samples from previously built model
with pm.Model() as GMMfit:

    starting_mu = np.mean(data)
    w = pm.Dirichlet('w', np.ones(2))
    mus = pm.Normal('mus', starting_mu, 100, shape=2)
    sigs = pm.HalfCauchy('sigs', beta=5.0, shape=2)
    GMM = pm.NormalMixture('gmm', w=w, mu=mus, sigma=sigs, observed=data)

    trace = pm.sample(1000, step=pm.Metropolis())
    burn_in = 500  
    chain = trace[burn_in:]

    ppc = pm.sample_posterior_predictive(chain, var_names=['w', 'mus', 'sigs', 'gmm'], model=GMMfit)
    idata_pymc3 = az.from_pymc3(chain, posterior_predictive=ppc)
    mus_infered = []
    sigs_infered = []
    weights_infered = []

    # posterior means of the fitted parameters (averaged over the chain)
    for i in range(2):
        mus_infered.append(chain['mus'][:, i].mean())
        sigs_infered.append(chain['sigs'][:, i].mean())
        weights_infered.append(chain['w'][:, i].mean())

    # creation of the predicted gaussian mixture
    x = np.linspace(0, 600, 600)
    y = np.zeros_like(x)
    for i in range(2):
        y += norm.pdf(x, mus_infered[i], sigs_infered[i]) * weights_infered[i]
    y *= 14900  # scale the density so it is visible next to the histogram
    plt.plot(x, y, label="Rec. GMM", color="tab:red", lw=2, ls="--")
    plt.hist(data, bins=bins, label="Data")

What is that gaussian in the middle exactly? If you are plotting the mean of your ppc, it would make sense that it looks like a bump between the two modes.

Hi ricardoV94, thanks for your time.

I am not exactly sure. The gaussian in the middle of the ppc (where a lot of the blue lines are) does not show up every time; even when I run the same code multiple times, it sometimes appears and sometimes does not. Maybe it is just some kind of bug.

Nevertheless, the “posterior predictive mean gmm” (the dashed line in the bottom graph) does follow the observations closely. The inferred means of the two underlying gaussians, however, do not fit the data. (See figure 1, red dashed line: the gaussian mixture reconstructed from the MCMC fit. It is not one gaussian, but two gaussians with very close means and almost 50:50 weights.)
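That symptom (two components with nearly identical means and roughly 50:50 weights) is typical of label switching: mixture components are exchangeable, so the chain can swap which component is “first”, and averaging over the whole chain then pulls both means toward the overall center. A minimal NumPy simulation with hypothetical posterior draws illustrates the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 4000

# hypothetical posterior draws for the two component means; label switching
# randomly assigns the 100-mode and the 300-mode to slot 0 or slot 1
raw = np.stack([rng.normal(100, 5, n_draws),
                rng.normal(300, 5, n_draws)], axis=1)
swap = rng.random(n_draws) < 0.5
raw[swap] = raw[swap, ::-1]

naive = raw.mean(axis=0)                          # both entries near 200
sorted_means = np.sort(raw, axis=1).mean(axis=0)  # near 100 and 300
```

Sorting the component parameters within each draw (or putting an ordering constraint on the means in the model itself) removes the ambiguity before averaging.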

I tried to fit the GMM with sklearn and it fit almost perfectly (original mean: 100, fit by sklearn: 101 on average). With sklearn I was able to reconstruct the GMM more quickly and more precisely.
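For comparison, the sklearn fit went along these lines (a sketch with synthetic data regenerated from the same true parameters, since the original samples are not reproduced here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# regenerate circa 300 samples from the true two-component mixture
rng = np.random.default_rng(0)
n = 300
comp = rng.integers(0, 2, size=n)  # equal-weight component assignments
mus = np.array([100.0, 300.0])
sigs = np.array([50.0, 60.0])
samples = rng.normal(mus[comp], sigs[comp]).reshape(-1, 1)

# EM fit; a point estimate, so no label-switching problem when averaging
gm = GaussianMixture(n_components=2, random_state=0).fit(samples)
fitted_mus = np.sort(gm.means_.ravel())  # roughly [100, 300]
```

Since EM returns a single point estimate rather than a chain of draws, the component labels cannot switch mid-fit, which is one reason the sklearn result looks cleaner.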

I am still wondering what I am doing wrong here, or whether I am using MCMC methods for something they are not meant to be used for…

Thank you,


Is there a reason you are using Metropolis for your sampling in GMMfit? 1000 samples from Metropolis won’t get close to converging, though it might with the default sampler.

I have no particular reason; I am using Metropolis in other parts of the code, where it is significantly faster than NUTS. Are you referring to NUTS as the “default sampler”? I have read about some problems with the initialization that required using ADVI, which is allegedly at its best with large (and multidimensional) data (according to this post).

Metropolis may be faster per sample, but it will usually take orders of magnitude more samples to converge than NUTS (which is the default). Can you try using NUTS instead? You may have some other difficulties if the problem isn’t identifiable, but in general it should work better.
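To see why the sample count matters so much with Metropolis, here is a minimal pure-NumPy random-walk Metropolis sampler (the step size is a hypothetical choice for illustration) targeting the same two-component mixture density. Its lag-1 autocorrelation is close to 1, so 1000 draws contain far fewer effectively independent samples than 1000 NUTS draws typically would:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

def mix_logpdf(x):
    # log density of the 0.5*N(100, 50) + 0.5*N(300, 60) mixture
    return np.logaddexp(norm.logpdf(x, 100, 50) + np.log(0.5),
                        norm.logpdf(x, 300, 60) + np.log(0.5))

def metropolis(n, step=10.0, x0=100.0):
    # plain random-walk Metropolis with a Gaussian proposal
    xs = np.empty(n)
    x, lp = x0, mix_logpdf(x0)
    for i in range(n):
        prop = x + rng.normal(0, step)
        lp_prop = mix_logpdf(prop)
        if np.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        xs[i] = x
    return xs

draws = metropolis(5000)

# lag-1 autocorrelation: near 1 for a sticky random walk,
# near 0 for independent draws
rho = np.corrcoef(draws[:-1], draws[1:])[0, 1]
```

With a step size that is small relative to the spread of the target, the chain accepts often but barely moves, so consecutive draws are almost identical and mode-to-mode transitions are rare; this is the slow convergence referred to above.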