Hello,
I am a newcomer to PyMC3, and would be grateful if I could get some guidance on how to model truncated normal distribution with optimal truncation thresholds.
Toy data was generated as following:
- Generate normal distribution (sample).
- Truncate the distribution, i.e. sample[(sample>a) & (sample<b)]
- Add few points beyond the upper threshold b, from the original sample (it simulates a real business problem, since we don’t know the real truncation thresholds, i.e. thresholds a and b are free variables). The same could be done for a, but omitted it for simplicity.
- The goal is to get posteriors for 1) truncated normal distribution (mu, sigma) and 2) optimal truncation points (a,b).
Minimum reproducible example:
import theano.tensor as tt
import numpy as np
import numpy as np
import matplotlib.pyplot as plt
import pymc3 as pm
a = 5
b = 16
niter = 4000
sigma = 5.
mu = 13.
size=5000
samples = np.random.normal(mu, sigma, size)
# truncate and add few points beyond the point of truncation
random_high = samples[samples>b][:20]
truncated = samples[(samples > a) & (samples < b)]
truncated = np.hstack([truncated, random_high])
def cdf(x):
"""Cumulative distribution function of standard normal distribution"""
return 0.5*(1.0+tt.erf(x/tt.sqrt(2)))
def pdf(x):
"""Cumulative distribution function of standard normal distribution"""
return 1.0/tt.sqrt(2*np.pi)*tt.exp(-0.5*(x**2))
def logp_tr(x, mu, std,a,b):
""" Define logp based on https://en.wikipedia.org/wiki/Truncated_normal_distribution"""
f = pdf((x-mu)/std)/std/((cdf((b-mu)/std) - cdf((a-mu)/std)))
return tt.switch(np.any(x>b) | np.any(x<a), -1e10, tt.log(f))
with pm.Model() as tmodel:
# Just some generic priors
mu = pm.Normal('mu', mu=25, sd=25)
std = pm.Uniform('std', lower=0.1, upper=80)
a=pm.Uniform('low', lower=0, upper =10)
b=pm.Uniform('high', lower=10, upper =40)
x= pm.DensityDist('dist', logp_tr, observed = {'x':truncated, 'mu':mu, 'std':std, 'a':a, 'b':b})
start = pm.find_MAP()
# Sampler
step = pm.Metropolis()
# Sampling
trace = pm.sample(niter, step=step, njobs=4)
pm.traceplot(trace[1000:], varnames=['mu', 'std','low','high']);
Trace that I am getting is:
As can be seen, posterior for b is wrong (and as a result posterior of mu is shifted). I suspect I am not defining correctly the logp_tr function for custom distribution. The pdf is based on wikipedia and seems to be correct (based on previous attempts with fixed a and b).
The correct posteriors should have maximums near mu=13, sigma=5, a = 5, b=16 (based on distribution it was generated from).
Any suggestions on how to fix the problem?
Thanks in advance