Hi all, I’m not sure what I am doing commits any cardinal sins in the realm of Probabilistic Programming or just plain wrong or impossible. I’m new to PyMC3 and probabilistic programming, so please forgive me for my ignorance. I would love to be enlightened. With that said he we go…
I have data that is multimodal (4-peaks). For example:
This data can be generated with this:
import matplotlib.pyplot as plt
import pymc3 as pm
import theano.tensor as tt
import numpy as np
# example data:
mu1, mu2, mu3, mu4 = 0, 2.4, 4.7, 7.4
sigma1, sigma2, sigma3, sigma4 = 1, 0.6, 0.9, 0.7
dataset1 = np.random.normal(mu1, sigma1, 1000)
dataset2 = np.random.normal(mu2, sigma2, 1000)
dataset3 = np.random.normal(mu3, sigma3, 1000)
dataset4 = np.random.normal(mu4, sigma4, 1000)
plt.figure()
count, bins1, ignored = plt.hist(dataset1, 30, density=True, alpha=0.5)
count, bins2, ignored = plt.hist(dataset2, 30, density=True, alpha=0.5)
count, bins3, ignored = plt.hist(dataset3, 30, density=True, alpha=0.5)
count, bins4, ignored = plt.hist(dataset4, 30, density=True, alpha=0.5)
plt.show()
Building the model:
model = pm.Model()
with model:
# in between peaks distributions
between1 = pm.Normal('between1', mu=1.3, sigma=0.9)
between2 = pm.Normal('between2', mu=3.8, sigma=0.8)
between3 = pm.Normal('between3', mu=6.1, sigma=0.85)
# the observed distributions
p1 = pm.TruncatedNormal('p1', lower=np.min(dataset1), upper=between1, observed=dataset1)
p2 = pm.TruncatedNormal('p2', lower=between1, upper=between2, observed=dataset2)
p3 = pm.TruncatedNormal('p3', lower=between2, upper=between3, observed=dataset3)
p4 = pm.TruncatedNormal('p4', lower=between3, upper=np.max(dataset4), observed=dataset4)
trace = pm.sample()
So at this point, the code gets an error: pymc3.parallel_sampling.ParallelSamplingError: Bad initial energy
. I suspected that this error is caused by the fact that the observed data exceeds the bounds defined in the truncated normal distributions.
As a result, I tried to mitigate this issue by filtering the data such that the data values are within upper and lower bounds of the truncated normal distribution via Theano slicing. So now the model code looks like:
model = pm.Model()
with model:
# in between peaks distributions
between1 = pm.Normal('between1', mu=1.3, sigma=0.9)
between2 = pm.Normal('between2', mu=3.8, sigma=0.8)
between3 = pm.Normal('between3', mu=6.1, sigma=0.85)
# filter the observed data:
dataset1_tt = tt._shared(dataset1)
dataset2_tt = tt._shared(dataset2)
dataset3_tt = tt._shared(dataset3)
dataset4_tt = tt._shared(dataset4)
filtered_data1 = dataset1_tt[(dataset1_tt < between1).nonzero() and (dataset1_tt > np.min(dataset1)).nonzero()]
filtered_data2 = dataset2_tt[(dataset2_tt < between2).nonzero() and (dataset1_tt > between1).nonzero()]
filtered_data3 = dataset3_tt[(dataset3_tt < between3).nonzero() and (dataset3_tt > between2).nonzero()]
filtered_data4 = dataset4_tt[(dataset4_tt < np.max(dataset4)).nonzero() and (dataset4_tt > between3).nonzero()]
# the observed distributions
p1 = pm.TruncatedNormal('p1', lower=np.min(dataset1), upper=between1, observed=filtered_data1)
p2 = pm.TruncatedNormal('p2', lower=between1, upper=between2, observed=filtered_data2)
p3 = pm.TruncatedNormal('p3', lower=between2, upper=between3, observed=filtered_data3)
p4 = pm.TruncatedNormal('p4', lower=between3, upper=np.max(dataset4), observed=filtered_data4)
trace = pm.sample()
Slicing the observed data such that they fall within the bounds each truncated normal didn’t seem to work because I still get the Bad Initial Energy
error. Got any ideas how to resolve this?