Two kinds of data in one model

Hi all,
I have two kinds of data: a vector of counts and a vector of ENERGIES. The counts vector has the shape of a sum of Gaussians when plotted against its indices:
counts[x] = sum(somegaussians[x])
and I expect the ENERGIES to be a linear function of the means of the Gaussians:
ENERGIES = intercept + slope * mu
My code is below. I’d like to know how to incorporate the very low uncertainty in the energies (the sigma inside energy_obs) without the Markov chain failing:

import numpy as np
import pymc3 as pm
import read_data

DATA_BINS = 4096
ENERGIES = [5.34036, 5.42315, 5.68537, 6.05078, 6.28808, 6.7783, 8.78486]  # MeV
THRESHOLD = 5
N = 7

# read data from file:
datafile = 'Alpha_Radiation/cumulated_histogram.txt'
counts = read_data.read_data_file(datafile, (0, DATA_BINS))
bins = np.linspace(1/2, DATA_BINS+1/2, DATA_BINS)

# Clean the data: keep counts above the threshold inside the region of interest
indices = np.where((THRESHOLD < counts) & (2722 < bins) & (bins < 2800))
counts = counts[indices]
bins = bins[indices]

mod_seven_gaussians = pm.Model()
with mod_seven_gaussians:

    # priors
    amp = pm.Uniform('amp', amp_est_mean-50, amp_est_mean+50.0, shape=N)
    mu = pm.Uniform('mu', mu_est_mean-50, mu_est_mean+50.0, shape=N)
    sig = pm.Uniform('sig', sig_est_mean-7, sig_est_mean+8, shape=N)
    intercept = pm.Uniform('intercept', intercept_est_mean-1, intercept_est_mean+1)
    slope = pm.Uniform('slope', slope_est_mean-1e-2, slope_est_mean+1e-2)

    # likelihood
    counts_obs = pm.Normal('counts',
                            mu=(amp[0]*np.exp(-(bins-mu[0])**2 / sig[0]**2)
                            + amp[1]*np.exp(-(bins-mu[1])**2 / sig[1]**2)
                            + amp[2]*np.exp(-(bins-mu[2])**2 / sig[2]**2)
                            + amp[3]*np.exp(-(bins-mu[3])**2 / sig[3]**2)
                            + amp[4]*np.exp(-(bins-mu[4])**2 / sig[4]**2)
                            + amp[5]*np.exp(-(bins-mu[5])**2 / sig[5]**2)
                            + amp[6]*np.exp(-(bins-mu[6])**2 / sig[6]**2)
                            ),
                            sigma=np.sqrt(counts),
                            observed=counts)
    
    energy_obs = pm.Normal('energies',
                            mu=intercept + slope*mu,
                            sigma=1e-6,
                            observed=ENERGIES)
    
    trace = pm.sample(20000, return_inferencedata=False)

Thanks a lot,
Inbar

A few thoughts:

  • You can’t use NumPy math operations on PyMC3 variables; use the Theano / PyMC3 operations instead (the two are equivalent - PyMC3 just wraps the Theano code)

  • For readability, you could rewrite the distribution for counts_obs as

gaussian_sum = pm.math.sum(amp*pm.math.exp(-(bins-mu)**2 / sig**2))
counts_obs = pm.Normal('counts', mu=gaussian_sum, sigma=np.sqrt(counts), observed=counts)
  • Using uniform priors is usually not a good idea - I suggest replacing all of those pm.Uniform variables with normal priors, using the distance between the uniform bounds as the standard deviation for the normal.

  • Using such a small standard deviation means that a tiny change in the parameter values leads to a huge jump in the log-likelihood, which is probably why the sampler isn’t happy. Can you just work in a different set of units? E.g., if you are using joules or eV, just work in microjoules or meV; that way, you can use a sigma of 1. A sketch combining this with the normal priors suggested above follows this list.
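
For concreteness, here is a sketch of what the rescaled model with normal priors could look like - not a drop-in fix: the *_est_mean values are the same placeholders as in the original code, the prior widths follow the bound-distance rule above, and the energies are rescaled from MeV to eV so that the 1e-6 MeV uncertainty becomes exactly 1:

import numpy as np
import pymc3 as pm

# Work in eV instead of MeV, so sigma = 1e-6 MeV becomes sigma = 1 eV.
energies_eV = np.asarray(ENERGIES) * 1e6

with pm.Model() as model:
    # Normal priors centred on the rough estimates; the width of each
    # old uniform interval becomes the standard deviation.
    amp = pm.Normal('amp', mu=amp_est_mean, sigma=100.0, shape=N)
    mu = pm.Normal('mu', mu=mu_est_mean, sigma=100.0, shape=N)
    sig = pm.Normal('sig', mu=sig_est_mean, sigma=15.0, shape=N)
    # The calibration parameters are rescaled to eV as well.
    intercept = pm.Normal('intercept', mu=intercept_est_mean*1e6, sigma=2e6)
    slope = pm.Normal('slope', mu=slope_est_mean*1e6, sigma=2e4)

    energy_obs = pm.Normal('energies',
                           mu=intercept + slope*mu,
                           sigma=1.0,            # was 1e-6 in MeV
                           observed=energies_eV)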

Hi Chris,
Thanks for the suggestions!
bins is already a numpy array, so I can’t do the sum you suggested. I mean, inside the definition of counts_obs it just says that the normal distribution is len(bins)-dimensional, doesn’t it? But I did take your advice: apparently I can sum over a generator, roughly as sketched below. I also changed the prior distributions and the scale, but the chains still fail. The relative uncertainty in the energies is apparently just too low for the algorithm.
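
For reference, summing over a generator looks roughly like this (a sketch reusing the variable names from the model above):

# Sum the N Gaussian components Python-side; each term is a
# Theano expression of shape (len(bins),), and so is the sum.
gaussian_sum = sum(amp[i]*pm.math.exp(-(bins-mu[i])**2 / sig[i]**2)
                   for i in range(N))
counts_obs = pm.Normal('counts', mu=gaussian_sum,
                       sigma=np.sqrt(counts), observed=counts)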

As for the vectorized sum you suggested - when I put in

pm.math.exp(-(bins-mu)**2 / sig**2)

it raises

ValueError: Input dimension mis-match. (input[0].shape[0] = 228, input[1].shape[0] = 7)

where 228 is the number of my bins.

Is the dimension of counts equal to 228? If so, we would want to do some broadcasting inside the math operations, so that we take arrays with shapes (228, 1) and (1, 7), broadcast them to a 2-D array of shape (228, 7), and then sum over the second axis so that the resulting vector is 228 long. For example (a sketch using the variable names from your model):
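
# bins[:, None] has shape (228, 1); mu, sig and amp have shape (7,),
# so the arithmetic broadcasts to (228, 7). Summing over axis 1
# collapses the 7 components into a length-228 expected-counts vector.
gaussian_sum = pm.math.sum(
    amp * pm.math.exp(-(bins[:, None] - mu)**2 / sig**2),
    axis=1)
counts_obs = pm.Normal('counts', mu=gaussian_sum,
                       sigma=np.sqrt(counts), observed=counts)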

As for the energy error, it’s quite hard to diagnose this without access to your data, as there are multiple potential causes. Would it be possible for you to define a list / NumPy array providing the data within the script itself? For example, you would write counts = np.asarray([10, 2, 5, ... ,]). This can be done relatively quickly if you print out the value of that variable and then copy/paste its string representation. It’s okay if it turns into a long file - you can just make a gist which I can copy and run locally.
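
A sketch of one hypothetical way to produce that string representation (assuming counts is the NumPy array from your script):

# Dump the full array as a plain Python list, which prints without
# '...' elision and can be pasted straight back into the script.
print(counts.tolist())   # then: counts = np.asarray([<pasted list>])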