Modelling a distribution with a portion removed from it

So, I took a shot at using the model I suggested above:

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import pymc as pm
import arviz as az
from pytensor.tensor import TensorVariable

def sample(pdf, x, rng, N):
  # Inverse-transform sampling: build the discrete CDF on the grid x,
  # draw N uniforms and map them back through the CDF.
  cdf = np.cumsum(pdf)
  cdf = cdf / cdf[-1]
  values = rng.random(N)
  observed = x[np.searchsorted(cdf, values)]

  return observed


# "True" parameters used to generate the synthetic data
s = 1
mu = 0
tau = 2
eps = 0.2
x = 0.5
seed = 0


rng = np.random.default_rng(seed)
r = np.linspace(0, 100, 10000)
# Lognormal pdf with a Gaussian notch of depth x carved out around tau
y = scipy.stats.lognorm.pdf(r, s, mu) * (1 - x*(scipy.stats.norm.pdf(r, tau, eps)))
observed = sample(y, r, rng, 1000)

fig,ax = plt.subplots()
ax.hist(observed, bins=np.arange(0,10,0.1))


def normal_pdf(x, mu, s):
  # Normal pdf written with pytensor ops; note the normalization is 1/(s*sqrt(2*pi))
  return 1/(s*pm.math.sqrt(2*np.pi))*pm.math.exp(-0.5*((x-mu)/s)**2)

def logp(value: TensorVariable, s: TensorVariable, mu: TensorVariable,
         tau: TensorVariable, eps: TensorVariable, x: TensorVariable) -> TensorVariable:
  # Log-density of a lognormal with a fraction x of a Gaussian
  # (centred at tau, width eps) removed from it
  return pm.LogNormal.logp(value, mu, s) + pm.math.log(1 - x*(normal_pdf(value, tau, eps)))


with pm.Model() as sel_model:

  s = pm.Exponential("s", 1)
  mu = pm.Normal("mu", 0, 1)
  tau = 0.1 + pm.HalfNormal("tau", 1)   # shifted away from 0
  eps = 0.2                             # fixed at the true value
  x = 0.5                               # fixed at the true value

  sel_dist = pm.CustomDist('sel_dist', s, mu, tau, eps, x, logp=logp,
                           observed=observed)

  trace = pm.sample(draws=3000, tune=3000, target_accept=0.95)

q = az.plot_posterior(trace)

Note that you also need to write a small function to sample data from your pdf (the sample helper above). After that, if only tau, s and mu are free variables, it generally works fine.

However, if you add any other variable to the system, you start getting multi-modal posteriors, divergences and what not. I am not 100% sure what the reason is, but one possibility is sampling variance combined with non-uniqueness: depending on your samples, regions in the tail of your distribution can sometimes look like they are missing some mass, so once in a while a tau placed there gets selected for. Indeed, even in the model with three parameters, changing the seed sometimes gives you small hills at larger tau values like 5, which might be due to this.

This identification problem might mean that you need a stricter prior on tau, for instance. You can test this by changing seeds (start with maybe three parameters and work your way up) and checking whether your posteriors depend on the seed. You can also try stricter priors around the expected values to verify this further. Another option is to look at individual chains, see which combinations of parameters are plausible, and generate pdfs for those combinations to see what they look like (a sketch of this check is given after the figures below). For instance, the data that generated the posteriors above has a histogram that looks like

[hist1: histogram of the generated data (seed=0)]

Whereas when I use seed=10, the generated data and posterior look like

[hist2: histogram and posterior for seed=10]

Notice the gap near 6 in the histogram, which is where the model also thinks tau could be with some probability; I think this hints at an identifiability problem with such an approach.
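As a concrete version of the "plug the parameters back in" check mentioned above, here is a minimal sketch: take a parameter combination from the chain you are suspicious of (the values below are just placeholders, not actual posterior draws) and overlay the implied pdf on the data histogram.

# Placeholder values read off a suspicious chain; substitute your own draws.
s_c, mu_c, tau_c, eps_c, x_c = 1.0, 0.0, 5.0, 0.2, 0.5

r_grid = np.linspace(0, 20, 2000)
# Same functional form as the generative pdf used above
pdf_c = scipy.stats.lognorm.pdf(r_grid, s_c, mu_c) * (1 - x_c*scipy.stats.norm.pdf(r_grid, tau_c, eps_c))
pdf_c = pdf_c / np.trapz(pdf_c, r_grid)  # renormalize so it is comparable to a density histogram

fig, ax = plt.subplots()
ax.hist(observed, bins=np.arange(0, 10, 0.1), density=True, alpha=0.5)
ax.plot(r_grid, pdf_c)

If the curve for that combination fills the apparent gap in the data, you have a visual explanation for why the sampler likes that mode.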

Consider another case (seed=0). Here the only difference between the two results is the tau prior: the first uses tau sigma=1, the second tau sigma=2, but they both use the same data:

hist: [hist3: data histogram, seed=0]

case 1 (tau sigma=1): [posterior plot]

case 2 (tau sigma=2): [posterior plot]

Notice that this time there is also somewhat of a gap near 5, but it was only caught when the prior was flexible enough (since the prior also affects your chance of sampling that region). Increasing N seems to help in this regard, which is another signal that what we observe is sampling variance plus non-identifiability, but it is not a silver bullet either. The following is another case with tau sigma=2 and 10000 data points (seed=10):

[hist4: histogram and posterior, tau sigma=2, N=10000, seed=10]
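If you want to systematize this seed/data-size check, a loop along these lines does the job. This is just a sketch: it reuses sample, logp, r and y from the code above, and it refits the model several times, so it is slow.

# Sketch: refit the model for a few seeds and data sizes and compare the tau posteriors.
for seed in [0, 1, 10]:
  for N in [1000, 10000]:
    rng = np.random.default_rng(seed)
    obs = sample(y, r, rng, N)
    with pm.Model():
      s = pm.Exponential("s", 1)
      mu = pm.Normal("mu", 0, 1)
      tau = 0.1 + pm.HalfNormal("tau", 2)   # wider prior, sigma=2
      pm.CustomDist("sel_dist", s, mu, tau, 0.2, 0.5, logp=logp, observed=obs)
      idata = pm.sample(draws=1000, tune=1000, target_accept=0.95)
    print(seed, N)
    print(az.summary(idata, var_names=["tau"]))

If the tau posterior moves around a lot between seeds at N=1000 but stabilizes at N=10000, that points to sampling variance rather than a problem with the model itself.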

So maybe you can restrict your priors, or, if you have quite a lot of data points, just give it a try and see. Alternatively, you can remodel.

I would also be interested to know what other people think would be a better approach to this. I initially suggested switch points, but that doesn't sound right. Mixture models seem more appropriate here, where you model the likelihood as a mixture of two lognormals, say:

https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.Mixture.html

Make sure to use an ordering constraint on the centers of the lognormals (see the linked example for how this is done in the normal case).
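Here is a rough sketch of what that might look like: two lognormal components with an ordering constraint on the centres so the components cannot swap labels. The priors, initial values and variable names below are just illustrative, and depending on your PyMC version the transform may be called univariate_ordered instead of ordered.

# Sketch of a two-component lognormal mixture; reuses `observed` from the code above.
with pm.Model() as mix_model:
  w = pm.Dirichlet("w", a=np.ones(2))              # mixture weights
  # Ordering constraint on the (log-space) centres of the two components
  mu = pm.Normal("mu", 0, 1, shape=2,
                 transform=pm.distributions.transforms.ordered,
                 initval=np.array([-0.5, 0.5]))
  sigma = pm.Exponential("sigma", 1, shape=2)
  components = pm.LogNormal.dist(mu=mu, sigma=sigma)
  pm.Mixture("obs", w=w, comp_dists=components, observed=observed)

  mix_trace = pm.sample()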
