Two kinds of data in one model

Hi all,
I have two kinds of data: a vector of counts and a vector of ENERGIES. The counts vector has the shape of a sum of Gaussians when plotted against its indices:
counts[x] = sum(somegaussians[x])
and I expect the ENERGIES to be a linear function of the means of the Gaussians:
ENERGIES = intercept + slope * mu
My code is below. I’d like to know how to incorporate the very low uncertainty in the energies (the sigma inside energy_obs) without the Markov chain failing:

import numpy as np
import pymc3 as pm
import read_data

DATA_BINS = 4096
ENERGIES = [5.34036, 5.42315, 5.68537, 6.05078, 6.28808, 6.7783, 8.78486]  # MeV
THRESHOLD = 5
N = 7

# read data from file:
datafile = 'Alpha_Radiation/cumulated_histogram.txt'
counts = read_data.read_data_file(datafile, (0, DATA_BINS))
bins = np.linspace(1/2, DATA_BINS+1/2, DATA_BINS)

# Clean the data: keep counts above the threshold inside the region of interest
indices = np.where((THRESHOLD < counts) & (2722 < bins) & (bins < 2800))
counts = counts[indices]
bins = bins[indices]

mod_seven_gaussians = pm.Model()
with mod_seven_gaussians:

    # priors
    amp = pm.Uniform('amp', amp_est_mean-50, amp_est_mean+50.0, shape=N)
    mu = pm.Uniform('mu', mu_est_mean-50, mu_est_mean+50.0, shape=N)
    sig = pm.Uniform('sig', sig_est_mean-7, sig_est_mean+8, shape=N)
    intercept = pm.Uniform('intercept', intercept_est_mean-1, intercept_est_mean+1)
    slope = pm.Uniform('slope', slope_est_mean-1e-2, slope_est_mean+1e-2)

    # likelihood
    counts_obs = pm.Normal('counts',
                            mu=(amp[0]*np.exp(-(bins-mu[0])**2 / sig[0]**2)
                            + amp[1]*np.exp(-(bins-mu[1])**2 / sig[1]**2)
                            + amp[2]*np.exp(-(bins-mu[2])**2 / sig[2]**2)
                            + amp[3]*np.exp(-(bins-mu[3])**2 / sig[3]**2)
                            + amp[4]*np.exp(-(bins-mu[4])**2 / sig[4]**2)
                            + amp[5]*np.exp(-(bins-mu[5])**2 / sig[5]**2)
                            + amp[6]*np.exp(-(bins-mu[6])**2 / sig[6]**2)
                            ),
                            sigma=np.sqrt(counts),
                            observed=counts)
    
    energy_obs = pm.Normal('energies',
                            mu=intercept + slope*mu,
                            sigma=1e-6,
                            observed=ENERGIES)
    
    trace = pm.sample(20000, return_inferencedata=False)

Thanks a lot,
Inbar

A few thoughts:

  • You can’t use NumPy math operations on PyMC3 variables; use the Theano / PyMC3 operations instead (the two are equivalent - PyMC3 just wraps the Theano code)

  • For readability, you could rewrite the distribution for counts_obs as

gaussian_sum = pm.math.sum(amp*pm.math.exp(-(bins-mu)**2 / sig**2))
counts_obs = pm.Normal('counts', mu=gaussian_sum, sigma=np.sqrt(counts), observed=counts)
  • Using uniform priors is usually not a good idea - I suggest replacing all of those pm.Uniform variables with normal priors, using the distance between the uniform bounds as the standard deviation for the normal.

  • Using such a small standard deviation means that a tiny change in the parameter values leads to a huge jump in the log-likelihood, which is probably why the sampler isn’t happy. Can you just work in a different set of units? E.g., if you are using joules or eV, just work in microjoules or meV; that way, you can use a sigma of 1. A sketch combining this with the normal priors suggested above follows this list.
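
For concreteness, here is a sketch of what the rescaled model with normal priors could look like - not a drop-in fix: the *_est_mean values are the same placeholders as in the original code, the prior widths follow the bound-distance rule above, and the energies are rescaled from MeV to eV so that the 1e-6 MeV uncertainty becomes exactly 1:

import numpy as np
import pymc3 as pm

# Work in eV instead of MeV, so sigma = 1e-6 MeV becomes sigma = 1 eV.
energies_eV = np.asarray(ENERGIES) * 1e6

with pm.Model() as model:
    # Normal priors centred on the rough estimates; the width of each
    # old uniform interval becomes the standard deviation.
    amp = pm.Normal('amp', mu=amp_est_mean, sigma=100.0, shape=N)
    mu = pm.Normal('mu', mu=mu_est_mean, sigma=100.0, shape=N)
    sig = pm.Normal('sig', mu=sig_est_mean, sigma=15.0, shape=N)
    # The calibration parameters are rescaled to eV as well.
    intercept = pm.Normal('intercept', mu=intercept_est_mean*1e6, sigma=2e6)
    slope = pm.Normal('slope', mu=slope_est_mean*1e6, sigma=2e4)

    energy_obs = pm.Normal('energies',
                           mu=intercept + slope*mu,
                           sigma=1.0,            # was 1e-6 in MeV
                           observed=energies_eV)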

Hi Chris,
Thanks for the suggestions!
bins is already a numpy array, so I can’t do the sum you suggested. I mean, inside the definition of counts_obs it just says that the normal distribution is len(bins)-dimensional, doesn’t it? But I did take your advice: apparently I can sum over a generator, roughly as sketched below. I also changed the prior distributions and the scale, but the chains still fail. The relative uncertainty in the energies is apparently just too low for the algorithm.
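
For reference, summing over a generator looks roughly like this (a sketch reusing the variable names from the model above):

# Sum the N Gaussian components Python-side; each term is a
# Theano expression of shape (len(bins),), and so is the sum.
gaussian_sum = sum(amp[i]*pm.math.exp(-(bins-mu[i])**2 / sig[i]**2)
                   for i in range(N))
counts_obs = pm.Normal('counts', mu=gaussian_sum,
                       sigma=np.sqrt(counts), observed=counts)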

As for the vectorized sum you suggested - when I put in

pm.math.exp(-(bins-mu)**2 / sig**2)

it raises

ValueError: Input dimension mis-match. (input[0].shape[0] = 228, input[1].shape[0] = 7)

where 228 is the number of my bins.

Is the dimension of counts equal to 228? If so, we would want to do some broadcasting inside the math operations, so that we take arrays with shapes (228, 1) and (1, 7), broadcast them to a 2-D array of shape (228, 7), and then sum over the second axis so that the resulting vector is 228 long. For example (a sketch using the variable names from your model):
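
# bins[:, None] has shape (228, 1); mu, sig and amp have shape (7,),
# so the arithmetic broadcasts to (228, 7). Summing over axis 1
# collapses the 7 components into a length-228 expected-counts vector.
gaussian_sum = pm.math.sum(
    amp * pm.math.exp(-(bins[:, None] - mu)**2 / sig**2),
    axis=1)
counts_obs = pm.Normal('counts', mu=gaussian_sum,
                       sigma=np.sqrt(counts), observed=counts)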

As for the energy error, it’s quite hard to diagnose this without access to your data, as there are multiple potential causes. Would it be possible for you to define a list / NumPy array providing the data within the script itself? For example, you would write counts = np.asarray([10, 2, 5, ... ,]). This can be done relatively quickly if you print out the value of that variable and then copy/paste its string representation. It’s okay if it turns into a long file - you can just make a gist which I can copy and run locally.
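
A sketch of one hypothetical way to produce that string representation (assuming counts is the NumPy array from your script):

# Dump the full array as a plain Python list, which prints without
# '...' elision and can be pasted straight back into the script.
print(counts.tolist())   # then: counts = np.asarray([<pasted list>])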