Fitting a Gaussian line from posterior trace samples to a double-peak Gaussian data graph

Hello everyone,

I am trying to overlay the regression line implied by the posterior trace onto my data, to show how good the model fit is, but I am running into a few problems.

When I use the trace sd values, the plotted line comes out flat: the sd posteriors are nowhere near the true values, and the model returns very large values for amp and cen whenever I try to push the trace sd means close to 1, which also badly distorts the corner plot.
Therefore, to focus on just the cen and amp trace values, I used the true sd values for the line fit instead. That worked for another model with randomly generated data, but not for this one.

Here is the code; feel free to skip ahead to the model:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import pymc as pm
import corner
import arviz as az
from scipy.optimize import curve_fit
from scipy.signal import find_peaks

# Importing data
o_data = pd.read_csv('PMMA_Oxygen_Data.csv', names=['x', 'y'], header=None)
x = o_data['x']
y = o_data['y']
# Data into array
x = np.array(x)
y = np.array(y)
# Extracting true values of peaks using find peaks 
peaks, _ = find_peaks(y, height=2397, distance = 1)
# Extracting the x & y values 
xpeaks = x[peaks]
ypeaks = y[peaks]
# Finding true standard deviations peaks using Gaussian function
def gaussian(x, amp, cen, sd):
    return (amp/np.sqrt(2.0*np.pi*sd**2)) * np.exp(-(x-cen)**2 / (2.0*sd**2))
# curve_fit fits Gaussian to each peak
peak_1, _ = curve_fit(gaussian, x, y, p0 = [ypeaks[0], xpeaks[0], 1])
peak_2, _ = curve_fit(gaussian, x, y, p0 = [ypeaks[1], xpeaks[1], 1])
# Assigning variables to true values 
t_amp1 = ypeaks[1]
t_cen1 = xpeaks[1]
t_sd1 = peak_1[2]
t_amp2 = ypeaks[0]
t_delta = 2
t_sd2 = peak_2[2]

# MODEL
with pm.Model() as gaussian_model:

    # Defines prior parameters of the counts & corrected binding energy (eV)
    amp1 = pm.Normal('amp1', mu=2519, sigma=20, initval=2515)
    cen1 = pm.Normal('cen1', mu=511, sigma=5, initval=509)
    sd1 = pm.HalfNormal('sd1', sigma=1)
    amp2 = pm.Normal('amp2', mu=2365.5, sigma=20, initval=2360)
    delta = pm.LogNormal('delta', mu=0.7, sigma=0.1)
    sd2 = pm.HalfNormal('sd2', sigma=1)

    # Expected value of outcome from Gaussian equation
    mu = (amp1/np.sqrt(2.0*np.pi*sd1**2)) * np.exp(-(x-cen1)**2 / (2.0*sd1**2)) + \
         (amp2/np.sqrt(2.0*np.pi*sd2**2)) * np.exp(-(x-(cen1+delta))**2 / (2.0*sd2**2))

    # Defines likelihood
    peak_1 = pm.Normal('peak_1', mu=mu[0], sigma=sd1, observed=y)
    peak_2 = pm.Normal('peak_2', mu=mu[1], sigma=sd2, observed=y)

    # Sampling from posterior
    trace = pm.sample(
        draws=5500, tune=5500, chains=5, cores=5, return_inferencedata=True
    )

# Estimating values (MAP)
map_estimate = pm.find_MAP(model=gaussian_model)

pm.summary(trace)

# Plotting data into scatter graph 
fig, ax = plt.subplots(figsize=(15, 10))
ax.set_xlabel('Centre', fontsize = 22.5)
ax.set_ylabel('Amp', fontsize = 22.5)
plt.xticks(fontsize = 18)
plt.yticks(fontsize = 18)
plt.grid(linewidth = 1)
plt.plot(x, y, linestyle='-', marker='o',  c='b');

# Bimodal Gaussian function
def gaussian2(x, amp1, cen1, sd1, amp2, delta_cen, sd2):
    return (amp1/np.sqrt(2.0*np.pi*sd1**2)) * np.exp(-(x-cen1)**2 / (2.0*sd1**2)) + \
            (amp2/np.sqrt(2.0*np.pi*sd2**2)) * np.exp(-(x-(cen1+delta_cen))**2 / (2.0*sd2**2))

# Using posterior trace values to fit Gaussian line 
trace_means = pm.summary(trace).loc[:,'mean']
plt.plot(x, gaussian2(x,
                      trace_means['amp1'], trace_means['cen1'], t_sd1,
                      trace_means['amp2'], trace_means['delta'], t_sd2),
         'g--', linewidth = 2.5)

ax.set_xbound(lower=520, upper=528)
ax.set_ybound(lower=230, upper=2750)
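
(As an aside, a minimal sketch of overlaying a subset of individual posterior draws instead of only the posterior means, to visualise the fit uncertainty; hypothetical code, not part of the original script, assuming the trace and gaussian2 above and keeping the true sd values as in the mean-line plot:)

# Stack chains/draws and plot roughly every 500th draw as a faint green line
post = trace.posterior.stack(sample=("chain", "draw"))
for i in range(0, post.sizes["sample"], 500):
    plt.plot(x, gaussian2(x,
                          float(post['amp1'][i]), float(post['cen1'][i]), t_sd1,
                          float(post['amp2'][i]), float(post['delta'][i]), t_sd2),
             color='g', alpha=0.1)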

To explain the model a bit: pm.summary does return the true values, and the delta parameter sets the separation between the peaks, so no explicit ordering of the centre values is needed (the 2nd peak is positioned at cen1 + delta). A LogNormal is used so that delta stays positive (this method was suggested by my uni mentor).
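(As a quick check of that prior, an aside not from the original post: a LogNormal with mu=0.7 has median

e^{0.7} \approx 2.01,

which lines up with the peak separation of about 2 that t_delta encodes above.)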
This is the graph it returns:


This is the pm.summary:
This is what my mentor wants me to aim for (or better, if possible):

And here is the data too.
Oxygen_Data.csv (6.8 KB)

This is what I achieved for a different model with randomly generated data:

Hello,

Is it possible that what you want is a Gaussian mixture model?

https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.Mixture.html

Hey there,

In terms of the model itself, it mostly works for the intended purpose, which is to return a corner plot with the main parameters (amplitude and centre) all concentrated at their true values. But if the mixture model can help with the high sd values in the posterior, which might in turn help my regression line fit, then I can try that.
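
(For context, the corner plot mentioned here comes from something along these lines; a sketch, assuming a corner version recent enough to accept the InferenceData trace directly:)

corner.corner(trace, var_names=['amp1', 'cen1', 'amp2', 'delta'])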

In a Gaussian mixture setting it will make a difference whether you use two normals or a mixture. See for instance:

import pymc as pm
import numpy as np
import matplotlib.pyplot as plt
import arviz as az

trans = pm.distributions.transforms.ordered

seed = 0

rng = np.random.default_rng(seed)
N1 = 100
N2 = 100
data1 = rng.normal(0, 0.2, size=N1)
data2 = rng.normal(1, 0.2, size=N2)

data = np.concatenate([data1,data2])

plt.hist(data)

with pm.Model() as model1:
  
  w = pm.Dirichlet("w", [10,10])
  mu = pm.Normal("mu", 0, 1, size=2, transform=trans, initval=[0,1])
  sd = pm.HalfNormal("sd", 1, size=2)
  
  comps = [pm.Normal.dist(mu[0], sd[0]), 
           pm.Normal.dist(mu[1], sd[1])]
  
  pm.Mixture("obs", w, comps, observed=data)
  
  trace1 = pm.sample(2000, tune=1000)
  

az.plot_posterior(trace1, var_names=["mu","sd"])
  
with pm.Model() as model2:
  
  mu = pm.Normal("mu", 0, 1, size=2, transform=trans, initval=[0,1])
  sd = pm.HalfNormal("sd", 1, size=2)
  
  pm.Normal("obs1", mu[0], sd[0], observed=data)
  pm.Normal("obs2", mu[1], sd[1], observed=data)

  trace2 = pm.sample(2000, tune=1000)

az.plot_posterior(trace2, var_names=["mu","sd"])

In the latter model, not only are the mu parameters almost identical to each other, the sd is also larger. It also has difficulty sampling (divergences). The mixture, on the other hand, returns the correct parameters. If you are trying to fit a multimodal Gaussian to some data then you are in this domain too (which is what I deduced from your last plot). I suspect your centres might be returning more reasonable results than here because the priors that define them look very informative. In fact, running the following:


with pm.Model() as model2:
  
  mu = pm.Normal("mu", [0,1], [0.01,0.01], size=2, transform=trans, initval=[0,1])
  sd = pm.HalfNormal("sd", 1, size=2)
  
  
  pm.Normal("obs1", mu[0], sd[0], observed=data)
  pm.Normal("obs2", mu[1], sd[1], observed=data)

  trace2 = pm.sample(2000, tune=1000)

az.plot_posterior(trace2, var_names=["mu","sd"])

returns an inference with correct mus but an sd larger than the correct value.

From a mathematical point of view, the likelihood of two normals will be their product, whereas the likelihood of a mixture will be their sum. In the former case, the sampler will try to find two normals that each individually approximate all of the data as well as they can. This can result in very similar parameters, especially if the priors are not informative (unlike your case). In the latter case, it will find two normals whose mixture approximates the data as well as it can.
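To make that concrete, the same point in symbols (with mixture weights w_1 + w_2 = 1):

L_{two normals} = \prod_i N(y_i | mu_1, sd_1) \, N(y_i | mu_2, sd_2)

L_{mixture} = \prod_i [ w_1 N(y_i | mu_1, sd_1) + w_2 N(y_i | mu_2, sd_2) ]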

I’ve spent a good chunk of the day trying this out. The mixture model itself decreased the sd trace values up to a point, which was an improvement over the pm.Normal model, but unfortunately the only way I have gotten it even semi-close was to make the sigma values very small, as seen below. Even then the values were 18 and 14 with a run time of over an hour, so I am really not sure how to actually achieve the true sd values, which are ~1.

sd1 = pm.HalfNormal('sd1', sigma=0.1)

Run time:

Trace summary:

True values:

I had a more detailed look at your model. Is it possible that you are misspecifying something? The mu variable you create is a tensor with 256 elements (since it derives from x, which itself has 256 elements), yet you only use mu[0] and mu[1] as the centres of the normals that define the likelihood. If one looks at the posterior predictive plot from plot_ppc, there is a strong signal that some aspect of the model is misspecified:

[posterior predictive plot]

The posterior predictive you get looks like that because, if you sample mu, its first two elements (out of 256) are very small (around 8), so it is no surprise that the ppc plot lands somewhere near 0 and sd then tries to be very large to accommodate the observations, which are far away from the means of your mixture.
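
(For reference, a plot like the one above can be produced with something along these lines; a sketch, assuming the gaussian_model and trace from the first post:)

with gaussian_model:
    # add a posterior_predictive group to the InferenceData
    pm.sample_posterior_predictive(trace, extend_inferencedata=True)
# compare simulated data against the observed y
az.plot_ppc(trace)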

Note that fitting a Gaussian mixture in the simplest case would look like this:

with pm.Model() as gaussian_model:

    w = pm.Dirichlet("w", [0.5, 0.5])
    mu1 = pm.Normal('mu1', mu=500, sigma=50, initval=500)
    sd1 = pm.HalfNormal('sd1', sigma=1)

    mu2 = pm.Normal('mu2', mu=2365.5, sigma=20, initval=2360)
    sd2 = pm.HalfNormal('sd2', sigma=1)


    peak_1 = pm.Normal.dist(mu=mu1, sigma=sd1)
    peak_2 = pm.Normal.dist(mu=mu2, sigma=sd2)

    counts = pm.Mixture("counts1", w, [peak_1, peak_2], observed=y)

    # Sampling from posterior
    trace = pm.sample(
        draws=5500, tune=5500, chains=5, cores=5, return_inferencedata=True
    )

In a case like yours, the centre parameters mu1, mu2 might be derived from some equations that depend on other priors (rather than being independent themselves). However, in your case the distribution of y does not really look like a mixture of two normals either, so even though the peaks of the posterior predictive are placed correctly, they don’t look that great:

[posterior predictive plot, mixture model]

sd is also still large here, ~50. This is expected because here sd measures the width of the normals you see in the plot above; it does not approximate the sd value of ~1 that goes into the Gaussian function you defined before.

I am now wondering instead whether what you are trying to do is really some sort of regression rather than a mixture of Gaussians. In such a case, what you have is some observables y, some covariates x, and a function that depends on some parameters, such that

y \sim N(f(x, p_1, p_2, ...), sd)

If that is the case, you will have only a single normal likelihood, and the sd there will be some sort of measurement error, not the sd1, sd2 you use in model generation.

Indeed when I run something like this:

with pm.Model() as gaussian_model:

    # Defines prior parameters of the counts & corrected binding energy (eV)
    w = pm.Dirichlet("w", [0.5, 0.5])
    amp1 = pm.Normal('amp1', mu=2519, sigma=20, initval=2515)
    cen1 = pm.Normal('cen1', mu=511, sigma=5, initval=509)
    sd1 = pm.HalfNormal('sd1', sigma=1)
    amp2 = pm.Normal('amp2', mu=2365.5, sigma=20, initval=2360)
    delta = pm.LogNormal('delta', mu=0.7, sigma=0.1)
    sd2 = pm.HalfNormal('sd2', sigma=1)

    sd = pm.HalfNormal("sd", 1)

    # Expected value of outcome from Gaussian equation
    mu = (amp1/np.sqrt(2.0*np.pi*sd1**2)) * np.exp(-(x-cen1)**2 / (2.0*sd1**2)) + \
         (amp2/np.sqrt(2.0*np.pi*sd2**2)) * np.exp(-(x-(cen1+delta))**2 / (2.0*sd2**2))

    # Defines likelihood
    obs = pm.Normal("obs", mu=mu, sigma=sd, observed=y)

    # Sampling from posterior
    trace = pm.sample(
        draws=5500, tune=5500, chains=5, cores=5, return_inferencedata=True
    )

sd1 and sd2 come out around 0.5, whereas sd comes out around 81. I suspect that you are confusing the sd1, sd2 that you used in data generation with the measurement error sd which goes into the likelihood. However, the posterior predictive plot, even though it has two peaks now (without a mixture model), has the peaks placed at the wrong spots:

[posterior predictive plot]

So you might want to have a look at the equation defining mu to double-check that everything is correct there.

Using a mixture is a nice idea. You could try using the new MarginalModel to marginalize over the mixture index to get better results. I tried a mixture of two Gaussian PDFs plus a deterministic trend (the data seemed upward sloping to me outside of the bump region). Here’s what I came up with:

import pandas as pd
import pymc as pm
import arviz as az
import numpy as np
import pytensor.tensor as pt
import matplotlib.pyplot as plt
from pymc_experimental import MarginalModel
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)

df = pd.read_csv('data/Oxygen_Data.csv', header=None)
df.columns = ['x', 'y']
df = (df - df.mean()) / df.std()


coords = {
    'params': ['intercept', 'slope'],
    'peak': [0, 1],
    'group': ['trend', 'peak_1', 'peak_2'],
    'obs_idx': np.arange(df.shape[0]),
}

with MarginalModel(coords=coords) as m:
    x_data = pm.ConstantData('x', df.x, dims=['obs_idx'])
    y_data = pm.ConstantData('y', df.y, dims=['obs_idx'])

    X = pt.concatenate([pt.ones_like(x_data[:, None]), x_data[:, None]], axis=-1)
    
    beta = pm.Normal('beta', dims='params')
    mu_trend = X @ beta

    mu_peak = pm.Normal('mu_peak', 
                        dims=['peak'], 
                        transform=pm.distributions.transforms.ordered,
                        initval=[-1, -0.5])
    sigma_peak = pm.Exponential('sigma_peak', 1, dims=['peak'])
    p_x = pt.exp(pm.logp(pm.Normal.dist(mu=mu_peak, sigma=sigma_peak), x_data[:, None]))

    logit_p_group = pm.Normal('logit_p_group', dims=['group'])
    group_idx = pm.Categorical('group_idx', logit_p=logit_p_group, dims=['obs_idx'])
    sigma = pm.Exponential('sigma', 1)

    mu = pt.switch(pt.lt(group_idx, 1), 
                   mu_trend,
                   pt.switch(pt.lt(group_idx, 2), 
                             p_x[:, 0], 
                             p_x[:, 1])
                  )
    
    y_hat = pm.Normal('y_hat', 
                      mu = mu,
                      sigma = sigma,
                      observed=y_data,
                      dims=['obs_idx'])

Then you have to know some boilerplate for using MarginalModel, but it’s not so hard. Tutorial is here.

with m:
    m.marginalize(['group_idx'])
    idata = pm.sample()
    m.recover_marginals(idata)
    m.unmarginalize(m.marginalized_rvs)
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)

Here’s what I got:

                mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
mu_peak[0]    -0.635  0.003  -0.640   -0.631        0.0      0.0    3417.0    2632.0    1.0
mu_peak[1]    -0.546  0.003  -0.551   -0.541        0.0      0.0    4565.0    3245.0    1.0
sigma_peak[0]  0.119  0.002   0.115    0.123        0.0      0.0    3980.0    3139.0    1.0
sigma_peak[1]  0.129  0.002   0.125    0.132        0.0      0.0    4844.0    3142.0    1.0
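
(For reference, a table like this comes from the ArviZ summary restricted to the peak parameters, e.g. something like:)

az.summary(idata, var_names=["mu_peak", "sigma_peak"])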

Some failures I hit on the way:

  • My first instinct was to use a Dirichlet to parameterize p for the Categorical, but this ends up not letting you use NUTS after you marginalize (@ricardoV94 is this a known bug/limitation?)
  • The ordered transform on the means of the mixture components ends up being important. I guess this is quite common in mixtures, see this discussion. In this case, when I didn’t include it, the model couldn’t pick out the two modes (I guess due to mode switching, but they were so close I didn’t see it in the trace plots [also I didn’t look that carefully]).
  • I wanted to have a linear model of X on p_logit, but MarginalModel didn’t allow that either.

I suspect you are also using curve_fit wrong (unless I misunderstood your aim). It should probably be:

def gaussian(x, amp1, cen1, sd1, amp2, cen2, sd2):
    return (amp1/np.sqrt(2.0*np.pi*sd1**2)) * np.exp(-(x-cen1)**2 / (2.0*sd1**2))+\
      amp2/np.sqrt(2.0*np.pi*sd2**2) * np.exp(-(x-cen2)**2 / (2.0*sd2**2))

fit_params, _ = curve_fit(gaussian, x, y, p0 = [ypeaks[0], xpeaks[0], 1, ypeaks[1], xpeaks[1], 1])

This returns

array([8.21429274e+03, 5.23757513e+02, 1.37229654e+00, 5.25917580e+02,
       5.22756367e+02, 3.13048145e-01])

which look like the parameters you want.
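
A quick way to sanity-check that fit is to overlay it on the data (a sketch, reusing the two-peak gaussian and fit_params from just above):

plt.plot(x, y, 'bo', label='data')
plt.plot(x, gaussian(x, *fit_params), 'g--', label='curve_fit')
plt.legend()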

This is a known bug, also mentioned here:

BUG: Possible OpFromGraph Related Issue for Choosing Samplers in MarginalModel


Thank you for this code. I managed to tweak the mu values to get closer to the true values and to get an actual regression fit, but as you said, and as seen below, it is a bit off.
If you look at the csv, the true parameters are different from what that last reply returned: amp1 is ~2551, cen1 is ~523, amp2 is ~2397, and cen2 is ~524 (or delta is 2).
I took out the delta parameter and tried to see if using cen2 would make a difference, but it returned the exact same plot, which I guess is as expected.
As you said, I can try checking the mu equation, but it is just the sum of two Gaussian equations making a bimodal Gaussian function, so I’m not sure there is much wrong with it.
With delta:

mu = (amp1/np.sqrt(2.0*np.pi*sd1**2)) * np.exp(-(x-cen1)**2 / (2.0*sd1**2)) + \
            (amp2/np.sqrt(2.0*np.pi*sd2**2)) * np.exp(-(x-(cen1+delta))**2 / (2.0*sd2**2))

With cen2:

mu = (amp1/np.sqrt(2.0*np.pi*sd1**2)) * np.exp(-(x-cen1)**2 / (2.0*sd1**2)) + \
            (amp2/np.sqrt(2.0*np.pi*sd2**2)) * np.exp(-(x-cen2)**2 / (2.0*sd2**2))

Fit:

For now I will keep trying, and I can probably get some guidance from my mentor after the April holidays. If you have any more ideas for getting a better fit that would be great, but you have helped a lot already, so thank you!