 Using unobserved distributions to bound observed truncated normal distributions

Hi all, I’m not sure what I am doing commits any cardinal sins in the realm of Probabilistic Programming or just plain wrong or impossible. I’m new to PyMC3 and probabilistic programming, so please forgive me for my ignorance. I would love to be enlightened. With that said he we go…

I have data that is multimodal (4-peaks). For example:

This data can be generated with this:

import matplotlib.pyplot as plt
import pymc3 as pm
import theano.tensor as tt
import numpy as np

# example data:
mu1, mu2, mu3, mu4 = 0, 2.4, 4.7, 7.4
sigma1, sigma2, sigma3, sigma4 = 1, 0.6, 0.9, 0.7

dataset1 = np.random.normal(mu1, sigma1, 1000)
dataset2 = np.random.normal(mu2, sigma2, 1000)
dataset3 = np.random.normal(mu3, sigma3, 1000)
dataset4 = np.random.normal(mu4, sigma4, 1000)

plt.figure()
count, bins1, ignored = plt.hist(dataset1, 30, density=True, alpha=0.5)
count, bins2, ignored = plt.hist(dataset2, 30, density=True, alpha=0.5)
count, bins3, ignored = plt.hist(dataset3, 30, density=True, alpha=0.5)
count, bins4, ignored = plt.hist(dataset4, 30, density=True, alpha=0.5)
plt.show()

Building the model:

model = pm.Model()
with model:
# in between peaks distributions
between1 = pm.Normal('between1', mu=1.3, sigma=0.9)
between2 = pm.Normal('between2', mu=3.8, sigma=0.8)
between3 = pm.Normal('between3', mu=6.1, sigma=0.85)

# the observed distributions
p1 = pm.TruncatedNormal('p1', lower=np.min(dataset1), upper=between1, observed=dataset1)
p2 = pm.TruncatedNormal('p2', lower=between1, upper=between2, observed=dataset2)
p3 = pm.TruncatedNormal('p3', lower=between2, upper=between3, observed=dataset3)
p4 = pm.TruncatedNormal('p4', lower=between3, upper=np.max(dataset4), observed=dataset4)

trace = pm.sample()

So at this point, the code gets an error: pymc3.parallel_sampling.ParallelSamplingError: Bad initial energy. I suspected that this error is caused by the fact that the observed data exceeds the bounds defined in the truncated normal distributions.

As a result, I tried to mitigate this issue by filtering the data such that the data values are within upper and lower bounds of the truncated normal distribution via Theano slicing. So now the model code looks like:

model = pm.Model()
with model:
# in between peaks distributions
between1 = pm.Normal('between1', mu=1.3, sigma=0.9)
between2 = pm.Normal('between2', mu=3.8, sigma=0.8)
between3 = pm.Normal('between3', mu=6.1, sigma=0.85)

# filter the observed data:
dataset1_tt = tt._shared(dataset1)
dataset2_tt = tt._shared(dataset2)
dataset3_tt = tt._shared(dataset3)
dataset4_tt = tt._shared(dataset4)

filtered_data1 = dataset1_tt[(dataset1_tt < between1).nonzero() and (dataset1_tt > np.min(dataset1)).nonzero()]
filtered_data2 = dataset2_tt[(dataset2_tt < between2).nonzero() and (dataset1_tt > between1).nonzero()]
filtered_data3 = dataset3_tt[(dataset3_tt < between3).nonzero() and (dataset3_tt > between2).nonzero()]
filtered_data4 = dataset4_tt[(dataset4_tt < np.max(dataset4)).nonzero() and (dataset4_tt > between3).nonzero()]

# the observed distributions
p1 = pm.TruncatedNormal('p1', lower=np.min(dataset1), upper=between1, observed=filtered_data1)
p2 = pm.TruncatedNormal('p2', lower=between1, upper=between2, observed=filtered_data2)
p3 = pm.TruncatedNormal('p3', lower=between2, upper=between3, observed=filtered_data3)
p4 = pm.TruncatedNormal('p4', lower=between3, upper=np.max(dataset4), observed=filtered_data4)

trace = pm.sample()

Slicing the observed data such that they fall within the bounds each truncated normal didn’t seem to work because I still get the Bad Initial Energy error. Got any ideas how to resolve this?

Hi, what are you trying to model?

I think you should use gaussian mixture model

https://docs.pymc.io/notebooks/gaussian_mixture_model.html

1 Like

Hi, I’m trying to model pixel values and trying to force separability between the histograms because they are overlapping.

The Gaussian mixture model makes sense if I want to model all of the data together, but I’m trying to model a distribution that is separable.

I’m not really understanding this Bad initial Energy error. Do you know what it means?

Hi,
Often it stems from NaNs somewhere in the data. I think you’ll find useful information here.
Hope it helps Okay, I will look at the data, but I’m fairly certain that the data doesn’t have any NaNs.

One thing that I observed is that putting the model creation code (i.e. with model statement) inside a try-except inside a while loop set to true, one iteration eventually starts to trace without issue. For example,

trials = 0
while True:
try:
model = pm.Model()
with model:
# in between peaks distributions
between1 = pm.Normal('between1', mu=1.3, sigma=0.9)
between2 = pm.Normal('between2', mu=3.8, sigma=0.8)
between3 = pm.Normal('between3', mu=6.1, sigma=0.85)

# filter the observed data:
dataset1_tt = tt._shared(dataset1)
dataset2_tt = tt._shared(dataset2)
dataset3_tt = tt._shared(dataset3)
dataset4_tt = tt._shared(dataset4)

filtered_data1 = dataset1_tt[(dataset1_tt < between1).nonzero() and (dataset1_tt > np.min(dataset1)).nonzero()]
filtered_data2 = dataset2_tt[(dataset2_tt < between2).nonzero() and (dataset1_tt > between1).nonzero()]
filtered_data3 = dataset3_tt[(dataset3_tt < between3).nonzero() and (dataset3_tt > between2).nonzero()]
filtered_data4 = dataset4_tt[(dataset4_tt < np.max(dataset4)).nonzero() and (dataset4_tt > between3).nonzero()]

# the observed distributions
p1 = pm.TruncatedNormal('p1', lower=np.min(dataset1), upper=between1, observed=filtered_data1)
p2 = pm.TruncatedNormal('p2', lower=between1, upper=between2, observed=filtered_data2)
p3 = pm.TruncatedNormal('p3', lower=between2, upper=between3, observed=filtered_data3)
p4 = pm.TruncatedNormal('p4', lower=between3, upper=np.max(dataset4), observed=filtered_data4)

trace = pm.sample()
break
except:
print(f'{trials}. Keep Trying')

Does this mean that I have an initial-point problem, where the trace needs to start somewhere that will eventually converge?

Yeah, if I remember correctly, if the initial value has probability zero, then the log likelihood will be -inf and sampling can’t start

Do you have any suggestions on how to mitigate a log likelihood of -inf? Would I need to perform a search on the initial value and evaluate the log-likelihood and sample only if the log-likelihood is not equal to 0?

I don’t want to misguide you, so I think it’s time I use my super power… @junpenglao, would you have any insight on this issue?  There are a few post in general about dealing with Bad initial energy:

In your case, the observed value is outside of the bound of between1, between2… (in general, these kind of parameters you need to make sure they are ordered constrained). For example, p1 is bounded between lower=np.min(dataset1) and upper=between1, but the default of between1 is actually < np.max(dataset1).

I would suggest you to remove the TruncatedNormal and use Normal instead.

1 Like