Modelling a distribution with a portion removed from it

I am an archaeologist studying an ancient quarry comprised of a number of blocks. I have measurements of the size of these blocks. Generally these measurements should follow a LogNormal or Weibull distribution (i.e. positive, bigger blocks rarer etc.). I have reason to believe a portion of the observed distribution of blocks is missing, these are the blocks that were deemed useful or correct by past quarriers and thus removed from the quarry. I would like to infer the portion of missing blocks and a description of them (e.g. the average that was removed and the variance).

Untitled

The graph above illustrates my research question. I can observe the red distribution, I want to estimate the blue Missing Portion and describe it.

I have attached a pdf with my most recent efforts to model this process, including my attempts at coding custom log-probabilities.

selection_model.pdf (94.5 KB)

Would greatly appreciate any insights the community has, my problem has confounded my uni statistics support, so at least it should be interesting.

To start with, I can tell you what is wrong with your code. In scipy when you do

r = np.linspace(0,5,1000)
y = scipy.stats.lognorm.pdf(r, 1,0) * (1 - 0.5*(scipy.stats.norm.pdf(r,2, 0.2)))

This combines the probability distributions of the two random variables evaluated at the same observed value (the resulting distribution is no longer a probability distribution function though). What you did in pymc i.e

def selection_d(mu:TensorVariable,
                phi: TensorVariable,
                x: TensorVariable,
                tau: TensorVariable,
                epsilon: TensorVariable,
                size: TensorVariable) -> TensorVariable:


  return pm.Lognormal.dist(mu,phi) * (1 - x*pm.Normal.dist(tau,epsilon))

is not equivalent to that. What this random variable would do I think is sample normal and lognormal independently and plug them into the operation you defined. To understand this with a simpler example, you can try sampling

 pm.Normal.dist(0,1) -  pm.Normal.dist(0,1)

would not give you a (non-probability) distribution which is 0 everywhere but instead a normal distribution with variance 2 where as following would be 0 everywhere

scipy.stats.norm.pdf(r,1, 0)) - scipy.stats.norm.pdf(r,1, 0))

Now, I have actually never worked with custom distributions before but looking at the function reference page for custom dist:

https://www.pymc.io/projects/docs/en/v5.10.3/api/distributions/generated/pymc.CustomDist.html

suggests that what you want can be achieved by defining the logp of your distribution and using sample, since that is the one which gets evaluated at the observed data point:

import numpy as np
import scipy
import matplotlib.pyplot as plt
import pymc as pm
import scipy.stats
import arviz as az
from pytensor.tensor import TensorVariable

s = 1
mu = 0
x = 0.5
tau=2
eps=0.2
r = np.linspace(0,5,1000)
y = scipy.stats.lognorm.pdf(r, s,mu) * (1 - x*(scipy.stats.norm.pdf(r,tau,eps)))
z = scipy.stats.lognorm.pdf(r,s,mu)
plt.plot(r, y,'r-', lw=5, alpha=0.6, label='selection pdf')
plt.plot(r,z,'g',lw=5,alpha=0.6,linestyle='--',label='Lognorm pdf')
plt.fill_between(r,y,z,where=(z>y),label='Missing Portion')
plt.legend()



def normal_pdf(x, mu, s):

  return 1/(s*2*pm.math.sqrt(np.pi))*pm.math.exp(-0.5*((x-mu)/s)**2)

def logp(value: TensorVariable, s:TensorVariable, mu: TensorVariable,
         tau: TensorVariable, eps: TensorVariable, x:TensorVariable) -> TensorVariable:

  return pm.LogNormal.logp(value, mu, s) + pm.math.log( 1 - x*(normal_pdf(value,tau,eps)))


with pm.Model() as sel_model:
  phi = 1
  mu = 0
  x = 0.5
  tau=2
  eps=0.2

  sel_dist = pm.CustomDist('sel_dist',
                          s,
                          mu,
                          tau,
                          eps,
                          x,
                          logp=logp,
                          )

  draws = pm.sample(draws=10000)

q = az.plot_posterior(draws)
q.set_xlim([0,5])

gives
sel_dist

Note that I have used pm.LogNormal.logp since there is some boiler plate involved in dealing with x=0 and I have defined pdf of the normal because I need its pdf first and not logpdf (I log it after summing it with 1). If you compare it with the formula you use which is:

pm.logp(a,value)*(1-x*pm.logp(b,value))

you will notice that you have not applied log quite correctly, after all you have something like
p = p1*(1-xp2) so when logging it you get logp1 + log(1-xp2), so I suspect that is also another source of error in your code. But as I said, never worked with custom dists before so don’t really know things like when you need to define the dist and when not, when you need to define logp and when not and how that effects sample vs draw. You can I guess play around and see. Also perhaps the following is useful:

https://www.pymc.io/projects/docs/en/stable/contributing/implementing_distribution.html

2 Likes

So, I took a shot at using the model I suggested above:

import numpy as np
import scipy
import matplotlib.pyplot as plt
import pymc as pm
import scipy.stats
import arviz as az
from pytensor.tensor import TensorVariable

def sample(pdf, x, rng, N):

  cdf = np.cumsum(pdf)
  cdf = cdf / cdf[-1]
  values = rng.random(N)
  observed = x[np.searchsorted(cdf, values)]

  return observed


s = 1
mu = 0
tau=2
eps=0.2
x = 0.5
seed = 0


rng = np.random.default_rng(seed)
r = np.linspace(0,100,10000)
y = scipy.stats.lognorm.pdf(r, s,mu) * (1 - x*(scipy.stats.norm.pdf(r,tau,eps)))
observed = sample(y, r, rng, 1000)

fig,ax = plt.subplots()
ax.hist(observed, bins=np.arange(0,10,0.1))


def normal_pdf(x, mu, s):

  return 1/(s*2*pm.math.sqrt(np.pi))*pm.math.exp(-0.5*((x-mu)/s)**2)

def logp(value: TensorVariable, s:TensorVariable, mu: TensorVariable,
         tau: TensorVariable, eps: TensorVariable, x:TensorVariable) -> TensorVariable:

  return pm.LogNormal.logp(value, mu, s) + pm.math.log( 1 - x*(normal_pdf(value,tau,eps)))


with pm.Model() as sel_model:

  s = pm.Exponential("s", 1)
  mu = pm.Normal("mu", 0, 1)
  tau = 0.1 + pm.HalfNormal("tau", 1)
  eps = 0.2
  x = 0.5

  sel_dist = pm.CustomDist('sel_dist', s, mu, tau, eps, x, logp=logp,
                           observed=observed)

  trace = pm.sample(draws=3000, tune=3000, target_accept=0.95)

q = az.plot_posterior(trace)

Note that you also need to write a small function to sample data from your pdf. After that if only tau,s and mu are variables it generally works fine:,

However if you add any other variable to the system then you start getting multi-modal posteriors, divergences and what not. Not %100 sure what is the reason but I guess one possibility is sampling variance combined with non-uniqueness, depending on your samples, regions at the tail of your distribution can sometimes look like it is missing some parts and so once in a while tau being there is selected for. Indeed even in the model with three parameters, if you change seed you sometimes get some small hills at larger tau values like 5, which might be due to this. This identification problem might mean that you might need a stricter prior for tau for instance. You can test this by changing seeds (start with maybe three parameters and work your way up) and see if your posteriors depend on seed. And you can also try stricter priors around the expected values to further verify this. You can look at individual chains and look at what combination of parameters are plausible and try to generate pdfs for those combinations to see how it looks like. For instance the data that generated the posteriors above has an histogram that looks like

hist1

Where as when I use seed=10 the generated data and posterior looks like

hist2

Notice the gap near 6 in the histogram where the model also thinks tau could be with some chance, which I think hints at an identifiability problem with such an approach.

Consider another case (seed=0), in this case difference between the two results are the tau priors, first one has a tau sigma=1 second one tau sigma=2 but they both use the same data:

hist:
hist3

case 1:

case 2:

Notice this time there is also somewhat of a gap near 5 but it was only caught when the prior was flexible enough (since that also affects your chance of sampling). Increasing N seems to help in this regard, which is again another signal that what we observe is sampling variance + non-identifiability but it also not a silver bullet. Following is another case where tau sigma = 2, number of data points is 10000 (seed=10):

hist4

So maybe you can restrict your priors, or maybe you have quite a lot of data points and can give it a try and see. Alternatively you can remodel.

I would also be interested to know what other people think would be a better approach to this. I initially suggested switch points but that doesn’t sound right. Mixture models would be more appropriate for this where you model the likelihood as a sum of two lognormals maybe:

https://www.pymc.io/projects/docs/en/stable/api/distributions/generated/pymc.Mixture.html

Make sure to use an ordering constraint on the centers of the lognormals. See:

for a normal case.

2 Likes

Thankyou very much for the detailed and helpful responses!

Yes, I have considered identifiability to be a potential problem and I would expect the model to struggle if there is little selection going on. I am fine with restricting the priors a bit more such that it becomes less of an issue, I have some prior information of what the selected material may be so will use that.

I am also using a mixture model as an alternative to model the distribution as it is also interpretable from another perspective. The mixture is much easier to code. My main motivation for this model was the interpretability of the parameters and the potential flexibility of it (I can hopefully replace the parent distribution with alternatives or use a more informed selection function for example).

I have also tried many forms of censoring and truncation which couldn’t quite capture the final distribution.

Another huge issue I came across was the endless papers discussing MNAR problems with very rare solutions.

Thanks again, I’ll be back if I come up with any more interesting issues.

Does this give some ideas? Modeling with soft truncation

Sounds similar on the surface except with interval truncation instead of edge truncation

Since you care about those bounds I would try to model them directly.

1 Like

Yes it does thanks. I’ll have a go at trying to produce two soft truncations.