Fitting a histogram

I am trying to build a model of data that seemingly consists of two normal distributions. My data, however, is binned. Therefore I believe a Gaussian mixture model cannot be used directly(?). Is there a way to fit two normal distributions to binned data like this in PyMC3?

[image: histogram of the binned data]


There is definitely a less hacky way to do this, but what about just sampling N_b points uniformly from the interval within each bin b and then using the GMM directly?
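A minimal numpy sketch of that idea (the bin edges and counts here are made up for illustration): draw each bin's count of points uniformly within that bin, pool them, and treat the pooled sample as raw data for a mixture model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bin edges and counts standing in for the histogram above.
edges = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
counts = np.array([5, 20, 12, 3])

# Draw `n` points uniformly within each bin [low, high), then pool them.
samples = np.concatenate([
    rng.uniform(low, high, size=n)
    for low, high, n in zip(edges[:-1], edges[1:], counts)
])

print(samples.size)  # one pseudo-observation per original count
```

The pooled `samples` array can then be passed as `observed` to a `NormalMixture` as if it were the original unbinned data.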


Hi, thanks for the reply! I feel this is on the hackier side of things…

You can also try curve fitting, with the curve being the pdf of a GMM.


You might find this discussion helpful: Fitting a spectra of gaussians


Hi again, thank you for the suggestions!
I found the linked discussion very useful, and your notebook example using a mixture_density function was exactly what I was looking for.

I am hoping you can help me clarify one thing, @junpenglao:
I was testing robustness to different magnitudes of the bin heights, so I passed a histogram computed with density=True as my data, essentially scaling down the bin heights. The model is not able to approximate this, nor can it find a fit when I scale the bin heights by, say, 100. I have tried changing my w prior to compensate, but I am getting very strange results.

Is this something that you can explain?
In advance, thank you so much for answering!
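For reference, the difference between raw counts and density=True heights is just a constant factor of n_samples * bin_width, which is what the fitted pdf's magnitude has to absorb. A small numpy check (synthetic data; the variable names are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=1000)

counts, edges = np.histogram(data, bins=10)             # raw counts
density, _ = np.histogram(data, bins=10, density=True)  # heights integrate to 1

width = edges[1] - edges[0]
# density = counts / (n_samples * bin_width), so the two representations
# differ only by this constant factor:
rescaled = density * data.size * width
```

Here `rescaled` matches `counts` exactly, so switching to density=True rescales every bin height by the same constant.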

You probably should adjust the prior - but if you have access to the binning, it would be easier to fit a GMM directly, unless I am missing something here?

Hi, thanks again for getting back to me.
By access to the binning I guess you mean the original data that is being binned, but in a real-life case I do not have access to this data, only the binned results.

I have tried changing your mixture density example slightly and am now running the following code, where the observed data is the bin heights:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

def mixture_density(w, mu, sd, x):
    # Mixture pdf at x: exponentiate the NormalMixture log-density.
    logp = pm.NormalMixture.dist(w, mu, sd).logp(x)
    return tt.exp(logp)

# centers holds the bin centers, x the evaluation points, and
# df['y'][0] the observed bin heights.
with pm.Model() as m:
    w = pm.Dirichlet('w', np.ones_like(centers) * 0.5)
    mu = pm.Normal('mu', 0., 5., shape=centers.size)
    tau = pm.HalfCauchy('tau', 1., shape=centers.size)
    y = mixture_density(w, mu, tau, x)
    y_obs = pm.Normal('y_obs', mu=y, observed=df['y'][0])

I am sorry for not understanding what is going wrong, but do you see something that may cause this to fail? It is giving very strange results, especially for the density=True cases of the binning operation.

One thing you can try is to add a scaling factor to mixture_density, as the histogram (even with density=True) might not give bin heights that match the magnitude of the mixture distribution you specified.

def mixture_density(w, mu, sigma, scaling, x):
    # Same mixture pdf as before, multiplied by a free scaling factor.
    logp = pm.NormalMixture.dist(w, mu, sigma).logp(x)
    return tt.exp(logp) * scaling

with pm.Model() as m:
    w = pm.Dirichlet('w', np.ones_like(centers) * 0.5)
    mu = pm.Normal('mu', 0., 5., shape=centers.size)
    sigma = pm.HalfCauchy('sigma', 1., shape=centers.size)
    scaling = pm.TruncatedNormal('scaling', 1., 0.1, lower=0)
    y = mixture_density(w, mu, sigma, scaling, x)
    y_obs = pm.Normal('y_obs', mu=y, observed=df['y'][0])