Observed data with uncertainty described by an irregular distribution

Hi, I would like to solve an inference problem with NUTS. I have observed data that, after uncertainty propagation from input parameters not designated for updating, has a very irregular distribution. How do I assimilate observed data with an irregular distribution?
To be more specific, the observed data is modeled with the equations:

x1 = 3*a**2 + 2*b**2 + c + d**2
x2 = a**(3/2) + b**2 + 2*c + d
x3 = a**(5/2) + 4*b**(3/2) + 2*c + 2*d

where a and b are designated for Bayesian updating while c and d are not, so their uncertainties are propagated to the observed data. The measured values of the observed data are [1238.5, 215, 1544], with measurement uncertainty described by a normal distribution with sd=1. All distributions of a, b, c, d are uniform. The uncertainty propagation is done by simulating observed data while sampling c, d and the measurement uncertainty, and keeping a, b constant:

import numpy as np

# propagate the uncertainty in c, d (and the measurement noise) to the observed data,
# keeping a and b fixed at their nominal values
c = np.random.uniform(5, 9, 10000)
d = np.random.uniform(2, 6, 10000)
a = 14
b = 16
samples1 = 3*a**2 + 2*b**2 + c + d**2 + np.random.normal(0, 1, 10000)
samples2 = a**(3/2) + b**2 + 2*c + d + np.random.normal(0, 1, 10000)
samples3 = a**(5/2) + 4*b**(3/2) + 2*c + 2*d + np.random.normal(0, 1, 10000)

This problem can be solved with SMC-ABC without explicit error propagation (by modeling the un-updated parameters as noise in the simulator), but I am curious whether it is possible with NUTS, since NUTS is much more efficient.
The SMC-ABC code:

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc as pm

az.style.use("arviz-darkgrid")
data = [1238.5, 215, 1544]
rng = np.random.default_rng()

def normal_sim(rng, a, b, size=1):
    c = rng.uniform(5, 9)
    d = rng.uniform(2, 6)
    return  [3*a*a+2*b*b+c+d**2, a**(3/2)+b*b+2*c+d, a**(5/2)+4*b**(3/2)+2*c+2*d]

with pm.Model() as example:
    a = pm.Uniform("a", lower=0.1*16, upper=3*16)
    b = pm.Uniform("b", lower=0.1*14, upper=3*14)
    s = pm.Simulator("s", normal_sim, params=(a, b), sum_stat="sort", epsilon=1, observed=data)
    idata = pm.sample_smc(10000, cores=1, chains=1, threshold=0.6)
    idata.extend(pm.sample_posterior_predictive(idata))

az.plot_trace(idata, kind="rank_vlines");
print(az.summary(idata, kind="stats"))

So if I understand you correctly, you want c and d to be sources of random noise rather than priors that get updated, and you do not want to use SMC-ABC. In that case, there is only one way (that I am aware of) to achieve this, and that is defining your own updating step for c and d so that they are not updated like normal priors. This was discussed recently in

See also the links there. As a starter, you can try the following:

import numpy as np
import pymc as pm
from pymc.step_methods.arraystep import BlockedStep

class NormalNoiseStep(BlockedStep):
    """Custom step that overwrites the variable with a fresh normal draw at every iteration."""
    def __init__(self, var, mu, sd, size):
        model = pm.modelcontext(None)
        value_var = model.rvs_to_values[var]
        self.vars = [value_var]
        self.name = value_var.name
        self.mu = mu
        self.sd = sd
        self.size = size

    def step(self, point: dict):
        draw = np.random.normal(self.mu, self.sd, size=self.size)
        point[self.name] = draw
        return point, []

In your model, you would then use it like this:

with pm.Model() as model:
    a = pm.Uniform("a", lower=0.1*16, upper=3*16)
    b = pm.Uniform("b", lower=0.1*14, upper=3*14)
    # mu, sd and N are placeholders for the noise parameters you want for c and d
    c = pm.Normal("c", mu, sd, size=N)
    d = pm.Normal("d", mu, sd, size=N)

    steps = [NormalNoiseStep(c, mu, sd, N), NormalNoiseStep(d, mu, sd, N)]

    # you can now use c and d as any tensor in your operations
    trace = pm.sample(**sample_parameters, step=steps)

This is outside the boundaries of regular Bayesian modelling though, so things like convergence statistics and r_hat may or may not be meaningful in this context. Perhaps a good thing to do is indeed to cross-compare results with SMC-ABC. I have never used this method in detail, so I don't have intuition about it.


The solutions obtained with BlockedStep fail convergence tests or cannot recreate the SMC-ABC results, or there are divergences, depending on the settings. A few times I have been able to obtain posterior distributions somewhat similar to SMC-ABC, but there is very large variation from run to run.
This problem would be absent if the noise could be propagated to the observed data uncertainty, but that approach appears not to work unless the resulting observed data distribution is MvNormal or Normal (as in the solution in the other post you mentioned). Is there a formal limitation in the theory of Bayesian inference that restricts the observed data distribution to MvNormal/Normal?
Here is the current code; is there any parameter I can tweak, other than target_accept, to avoid the divergences or convergence issues?:

import pymc as pm
import numpy as np
import arviz as az
rng = np.random.default_rng()
from pymc.step_methods.arraystep import BlockedStep

class NormalNoiseStep(BlockedStep):
    """Custom step that overwrites c/d with a fresh uniform draw at every iteration
    (named after the earlier normal-noise version, but it draws from a uniform)."""
    def __init__(self, var, lower, upper, size):
        model = pm.modelcontext(None)
        value_var = model.rvs_to_values[var]
        self.vars = [value_var]
        self.name = value_var.name
        self.lower = lower
        self.upper = upper
        self.size = size

    def step(self, point: dict):
        draw = rng.uniform(self.lower, self.upper, size=self.size)
        point[self.name] = draw
        return point, []

data = [1238.5, 215, 1544]
with pm.Model() as model:
    # Define priors for a and b
    a = pm.Uniform("a", lower=0.1*16, upper=3*16)
    b = pm.Uniform("b", lower=0.1*14, upper=3*14)
    c = pm.Uniform("c", 5 ,9, size=1)
    d = pm.Uniform("d", 2 ,6, size=1)
    steps = [NormalNoiseStep(c, 5, 9, 1), NormalNoiseStep(d, 2, 6, 1)]
    mu=[3*a**2 + 2*b**2 + c[0] + d[0]**2, a**(3/2) + b**2 + 2*c[0] + d[0], a**(5/2) + 4*b**(3/2) + 2*c[0] + 2*d[0]]
    Y_obs = pm.Normal("Y_obs", mu=mu, sigma=1, observed=data)
    trace = pm.sample(tune=2500, chains=4, draws=15000, step=steps, nuts={'target_accept':0.95})

az.plot_trace(trace, kind="rank_vlines")
print(az.summary(trace, kind="stats"))

No.

What is the epistemological meaning of a random variable that you don't update based on experience? One way to view Bayes is that randomness arises due to the subjective limitations of the observer, and "fitting a model" is a process of learning about the system via exposure to the system (data). What does it mean to forbid learning about certain elements of a system?

I don't mean that learning about certain elements of the system is forbidden. There is the uncertainty in the prior input parameters (like a and b in my problem), which can take the form of any distribution. But there is also the uncertainty of the data, which appears to be limited to being described only by MvNormal and Normal (this is expressed in the line "Y_obs = pm.Normal("Y_obs", mu=mu, sigma=1, observed=data)"). The question is whether the data uncertainty could take the form of other distributions (like a uniform, or a custom one that can't be fitted to any traditional distribution)?

Sure, you can use a pm.CustomDist to express any arbitrary distribution, as long as you can write down its logp function. The Bambi documentation gives a lot of examples of different types of data distributions, because the GLM literature is quite engaged with using different distributions to represent different types of data, for example Bernoulli, beta, gamma, and negative binomial. This notebook (presented as a series of puzzles) gives a collection of examples of arbitrary logp functions that PyMC can automatically deduce (and which could thus be used to model observed data).
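For instance, here is a minimal sketch of a bimodal observation distribution written directly as a logp for pm.CustomDist (all names and numbers are illustrative, not from your model):

import numpy as np
import pymc as pm
import pytensor.tensor as pt

# hypothetical bimodal measurement noise: an equal-weight mixture of two
# normals centred at mu - offset and mu + offset
def bimodal_logp(value, mu, offset, sigma):
    logp1 = pm.logp(pm.Normal.dist(mu - offset, sigma), value)
    logp2 = pm.logp(pm.Normal.dist(mu + offset, sigma), value)
    # numerically stable log(0.5*exp(logp1) + 0.5*exp(logp2))
    m = pt.maximum(logp1, logp2)
    return m + pt.log(0.5 * pt.exp(logp1 - m) + 0.5 * pt.exp(logp2 - m))

data = np.array([9.2, 11.1, 8.9, 10.8])  # made-up observations

with pm.Model() as custom_model:
    mu = pm.Normal("mu", 10, 5)
    pm.CustomDist("y", mu, 1.0, 0.2, logp=bimodal_logp, observed=data)
    idata = pm.sample()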


Okay, let's assume that the data's uncertainty took the form of a uniform distribution. What would the line "Y_obs = pm.Normal("Y_obs", mu=mu, sigma=1, observed=data)" change to?

pm.Uniform("Y_obs", lower=lower, upper=upper, observed=data)

But mu = [3*a**2 + 2*b**2 + c[0] + d[0]**2, a**(3/2) + b**2 + 2*c[0] + d[0], a**(5/2) + 4*b**(3/2) + 2*c[0] + 2*d[0]]. This is the mathematical model simulating the data. Where is it in pm.Uniform("Y_obs", lower=lower, upper=upper, observed=data)?

Per Wikipedia, you can parameterize a uniform distribution as:

f(x) = \begin{cases} \frac{1}{2\sigma\sqrt{3}} & \text{for } -\sigma \sqrt{3} \leq x - \mu \leq \sigma \sqrt{3} \\ 0 & \text{otherwise} \end{cases}

or

f(x) = \begin{cases} \frac{1}{b-a} & \text{for} \, x \in [a, b] \\ 0 & \text{otherwise} \end{cases}

After some algebra and pattern matching, you can find a mapping between the \mu, \sigma parameterization and the a, b parameterization:

\begin{align} b &= \sqrt{3} \sigma + \mu \\ a &= 2\mu - b \end{align}
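As a sketch, this is how your model-predicted mu would enter the Uniform likelihood through that mapping (sigma = 1 is your measurement sd; the c/d noise terms are left out for brevity, and note that such hard, parameter-dependent bounds tend to be difficult for NUTS in practice):

import pymc as pm
import pytensor.tensor as pt

data = [1238.5, 215, 1544]
sigma = 1.0  # measurement sd from the original post

with pm.Model() as uniform_obs_model:
    a = pm.Uniform("a", lower=0.1 * 16, upper=3 * 16)
    b = pm.Uniform("b", lower=0.1 * 14, upper=3 * 14)
    # model-predicted means (c and d omitted here for brevity)
    mu = pt.stack([
        3 * a**2 + 2 * b**2,
        a ** (3 / 2) + b**2,
        a ** (5 / 2) + 4 * b ** (3 / 2),
    ])
    # map (mu, sigma) to the (lower, upper) parameterization
    upper = pt.sqrt(3.0) * sigma + mu
    lower = 2 * mu - upper  # equivalently mu - sqrt(3) * sigma
    Y_obs = pm.Uniform("Y_obs", lower=lower, upper=upper, observed=data)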

Hm, okay, I will look into that!

It seems that the problem might be in how you set up the priors and how you use the observables, not the model itself. Once these are fixed, the model behaves reasonably:

1- I have noticed that you are supplying an array of length 3 as your observed data, whereas in your first post you generated samples of length 10000 by adding noise. I suspect those should be your observables, and not

data = [1238.5, 215, 1544]

so I took the liberty of fixing this (though I used N=100; I leave it up to you to try more or less data).

2- Another problem is the scale of your observed data, which is in the 100s or 1000s. Adding noise with sd=1 (as you do in your initial post) to this is problematic for the sampler. I have seen instances in the past where, if the noise is too small compared to the observed data, the sampler has a difficult time.

With these fixed, I tried different amounts of noise for c and d, ranging from 0.1 to 2, with no target_accept constraint. I do not get any divergences and, as expected, as the noise on c and d increases, the posteriors for a and b get wider while still covering the true parameters. I have also plotted the posteriors for the observation noises (there are three separate ones); they do not increase, which I guess is a nice sign that the model is not trying to compensate for the noise in c and d by adding more observation noise to the system (which could very well have happened).

The model is here:

import pymc as pm
import numpy as np
import arviz as az
import matplotlib.pyplot as plt
from pymc.step_methods.arraystep import BlockedStep

def fun1(a, b, c, d):
  return 3*a**2 + 2*b**2 + c + d**2

def fun2(a, b, c, d):
  return a**(3/2) + b**2 + 2*c + d

def fun3(a, b, c, d):
  return a**(5/2) + 4*b**(3/2) + 2*c + 2*d

class NormalNoiseStep(BlockedStep):
    def __init__(self, var, mean, sd, size):
        model = pm.modelcontext(None)
        value_var = model.rvs_to_values[var]
        self.vars = [value_var]
        self.name = value_var.name
        self.mean = mean
        self.size = size
        self.sd = sd

    def step(self, point: dict):
        draw = rng.normal(self.mean, self.sd, size=self.size)
        point[self.name] = draw
        return point, []

a_real = 14
b_real = 16
sd_obs = [100, 50, 200]
seed = 111

rng = np.random.default_rng(seed)
mean_c = 7
mean_d = 4

N = 100

sds = [0.1, 0.5, 1, 2]
fig,axes = plt.subplots(len(sds), 5, figsize=(25,len(sds)*5), sharex="col")


for ind_sd,sd_cd in enumerate(sds):

  c_noisy = rng.normal(mean_c, sd_cd, N)
  d_noisy = rng.normal(mean_d, sd_cd, N)

  x1 = fun1(a_real, b_real, c_noisy, d_noisy) + rng.normal(0, sd_obs[0], N)
  x2 = fun2(a_real, b_real, c_noisy, d_noisy) + rng.normal(0, sd_obs[1], N)
  x3 = fun3(a_real, b_real, c_noisy, d_noisy) + rng.normal(0, sd_obs[2], N)

  with pm.Model() as model:
      # Define priors for a and b

      a = pm.HalfNormal("a", 5)
      b = pm.HalfNormal("b", 5)
      c = pm.Normal("c", mean_c ,sd_cd, size=1)
      d = pm.Normal("d", mean_d ,sd_cd, size=1)
      sd = pm.HalfNormal("sd", [50, 25, 100], size=3)

      steps = [NormalNoiseStep(c, mean_c, sd_cd, 1),
               NormalNoiseStep(d, mean_d, sd_cd, 1)]

      Yobs1 = pm.Normal("Y_obs1", fun1(a, b, c[0], d[0]), sd[0],
                        observed = x1)
      Yobs2 = pm.Normal("Y_obs2", fun2(a, b, c[0], d[0]), sd[1],
                        observed = x2)
      Yobs3 = pm.Normal("Y_obs3", fun3(a, b, c[0], d[0]), sd[2],
                        observed = x3)

      trace = pm.sample(tune=2000, chains=6, draws=2000, step=steps,
                        random_seed=rng)

  az.plot_posterior(trace, var_names=["a","b","sd"],
                    ax=axes[ind_sd,:]).flatten()


  axes[ind_sd, 0].scatter(a_real, 0, marker='o', color="red")
  axes[ind_sd, 1].scatter(b_real, 0, marker='o', color="red")
  axes[ind_sd, 2].scatter(sd_obs[0], 0, marker='o', color="red")
  axes[ind_sd, 3].scatter(sd_obs[1], 0, marker='o', color="red")
  axes[ind_sd, 4].scatter(sd_obs[2], 0, marker='o', color="red")


fig.savefig("plot.png")

PS: I have also tried sd = 2, 5, 10 to see how the model behaves when the amount of noise completely washes over the parameters. With sd = 5 I get 1 divergence; with sd = 10 I get 100 divergences, which I think is, to be fair, reasonable. I wouldn't even call the posteriors for sd = 10 completely unreasonable!

After all, a stupid analogy would be trying to do gradient minimization where, at each step, after you compute your gradient you add some random noise to it! The system is not built to deal with such noise, but it can obviously tolerate it to some degree.

@iavicenna thank you for your solution. Even though the example I provided is a toy problem, it is supposed to be practice/investigation for an engineering problem type called "inverse uncertainty quantification", so I can't adjust sd freely. All model features have a physical interpretation: the observed data corresponds to experimental measurements, measured with an uncertainty described by sd. The only thing I can do is propagate the uncertainty from the un-updated input parameters (from the noise) to the experimental uncertainty. In another post that I made, this solution worked because the resulting uncertainty distribution was MvNormal.
Regarding point 1: since "data" corresponds to measurements, it feels problematic to include uncertainty this way. I tried it anyway, and the solution failed convergence tests; I got the same problems with the results as in my earlier attempts.
Regarding point 2: a small sd does not cause trouble when there is no noise, so I don't think it is the problem now.
I understand that random noise is a problem for gradient minimization, which is why I wanted to propagate the uncertainty introduced by the noise to the data uncertainty (then Y_obs could be MvNormal, but not in this case).
From what @jessegrabowski wrote, it seems that this error propagation only works when the uncertainty of the data can be described by a distribution that can be expressed in terms of σ and μ. Correct me if I am wrong, but there is no way to account for an uncertainty distribution of the data that is very irregular. If there were, the problem would be solved and the solution would converge nicely, as with MvNormal.

so I can't adjust sd freely. All model features have a physical interpretation: the observed data corresponds to experimental measurements,

Hi, I adjusted sd to see if the model behaves the way it should, that is, as the sd for c and d increases, the uncertainty in the parameters a and b increases. After all, if the model is correct, that is what you would expect. So this is a sanity check.

Regarding point 1: since "data" corresponds to measurements, it feels problematic to include uncertainty this way. I tried it anyway, and the solution failed convergence tests; I got the same problems with the results as in my earlier attempts.

I don't quite understand this point; have you run the model that I provided? I am not getting any convergence issues or divergences like this. And this is the conventional way observations are modelled in the Bayesian world: you have some real observations x_real, and then you measure them N times with some noise (say, normally distributed) to get x_obs = x_real + Normal(measurement noise mean, measurement noise sd, size=N). This covers a big chunk of the Bayesian models I have encountered so far, be it just estimating the mean of some measurements or doing Bayesian linear regression. But I am not an expert.

Regarding point 2: a small sd does not cause trouble when there is no noise, so I don't think it is the problem now.

Note that the sd I am talking about here is the observation noise that goes into the likelihood, not the sd on c and d. This is a bit of a drastic example, but consider:

import numpy as np
import pymc as pm
import arviz as az

means = [1000, 2000]
sds = [0.1, 0.1]
N = 100
obs0 = np.random.normal(means[0], sds[0], N)
obs1 = np.random.normal(means[1], sds[1], N)

obs = np.array([obs0, obs1])


coords = {"group":range(2),
          "obs":range(N)}

with pm.Model(coords=coords) as model:

  sds = pm.HalfNormal("sds", 0.1, dims = "group")
  means = pm.HalfNormal("means", 10, dims = "group")

  pm.Normal("likelihood", means, sds, observed=obs.T)

  trace = pm.sample(1000, tune=1000)

Had the sds been larger (say >1), this would be a relatively simple problem to solve and would take a couple of seconds. With these sds, however, the sampler not only gets stuck multiple times, it also produces divergences and effective-sample-size < 100 warnings. This is, I assume, a limitation of the sampler. When you do

Y_obs = pm.Normal("Y_obs", mu=mu, sigma=1, observed=data)

for data that is on the order of 1000s, you might be running into the same problem.

Correct me if I am wrong, but there is no way to account for an uncertainty distribution of the data that is very irregular.

I may be misunderstanding you here, but I assume you mean uncertainty in the sense of what we are trying to achieve here (adding a source of random noise which is neither a prior to be updated nor a likelihood). I don't know if there is any theoretical knowledge on this; it is certain that the theory of Bayesian modelling was not formulated with this in mind. So I guess the easiest thing is to try and see. However, if your model is facing convergence issues, it can be either because of this or because there are other aspects of the model that are problematic and are elevated even more in this non-conventional context. I have pointed out two such possibilities which, when remedied, seem to improve results (i.e., the code above).

I ran your model, and it works, but then I adjusted it for my priors, noise, and observed data, and it does not converge, probably because the noise is too high.
I think there is a misunderstanding: I interpret the "observation noise" as the uncertainty with which the measurement was made. Is this an incorrect interpretation? So if we measured something to weigh 2 kg with an uncertainty described by a normal distribution with sd = 0.01 kg, then sigma=0.01 and data=[2].
If the uncertainty were described by an irregular distribution, then it could be, for example, multimodal, and there is no way of putting a multimodal distribution into "Y_obs" and describing it in terms of σ and μ.

In the above case, the data is not [2]; 2 is the real value which you are trying to predict. The way you normally proceed is that you measure it, say, 5 times and get [2.1, 1.99, 2.02, 1.95, 1.99]. And you say

data ~ Normal(2, 0.1)

That is where your first model looks weird to start with: you seem to have replaced your observations with the mean of your simulated data.
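In PyMC terms, a minimal sketch of that conventional setup (the priors here are just illustrative) would be:

import numpy as np
import pymc as pm

# the five repeated measurements quoted above
data = np.array([2.1, 1.99, 2.02, 1.95, 1.99])

with pm.Model() as weight_model:
    true_weight = pm.Normal("true_weight", 2, 1)   # prior on the real value
    sd = pm.HalfNormal("sd", 0.5)                  # prior on the measurement noise
    pm.Normal("obs", mu=true_weight, sigma=sd, observed=data)
    idata = pm.sample()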

If the uncertainty were described by an irregular distribution, then it could be, for example, multimodal, and there is no way of putting a multimodal distribution into "Y_obs" and describing it in terms of σ and μ.

Of course, if, say, the uncertainty due to measurement cannot be parametrized by some parameters and a known distribution, then you are pretty much out of luck with classical Bayesian approaches. In classical statistics this would be the difference between parametric (MLE etc.) and non-parametric (empirical CDF etc.) approaches.

What you are trying to achieve here is also outside the boundaries of classical Bayesian approaches, but at least there is one way to attempt to get at it, whereas if you knew nothing about how this uncertainty is generated, then there is nothing you can do, I guess.

I ran your model, and it works, but then I adjusted it for my priors, noise, and observed data, and it does not converge, probably because the noise is too high.

Can you post the code that shows how you adjusted the parameters?

If the uncertainty were described by an irregular distribution, then it could be, for example, multimodal, and there is no way of putting a multimodal distribution into "Y_obs" and describing it in terms of σ and μ.

That being said, you can create multimodal sources of noise via this
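For example, a minimal sketch using pm.NormalMixture (the weights, offsets, and data below are made-up illustration values):

import pymc as pm
import pytensor.tensor as pt

with pm.Model() as mixture_noise_model:
    mu = pm.Normal("mu", 0, 10)
    Y_obs = pm.NormalMixture(
        "Y_obs",
        w=[0.5, 0.5],                       # mixture weights
        mu=pt.stack([mu - 1.0, mu + 1.0]),  # two modes around the model prediction
        sigma=[0.3, 0.3],
        observed=[0.9, -1.2, 1.1],
    )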

You can even try to "clip away" portions of your noise if you suspect there is something weird going on; see a nice topic about this:
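A sketch of what "clipping away" part of a noise distribution can look like, assuming something like pm.Truncated is used (the bound and data are illustrative):

import pymc as pm

with pm.Model() as truncated_noise_model:
    mu = pm.Normal("mu", 0, 10)
    # keep only the part of the normal above `lower`
    Y_obs = pm.Truncated(
        "Y_obs",
        pm.Normal.dist(mu=mu, sigma=1.0),
        lower=-2.0,
        observed=[0.4, 1.3, -0.6],
    )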

So, as you can see, it is still quite flexible, well beyond just a couple of known distributions for modelling your noise. I am aware these examples are not solutions to your problem; I am just emphasizing the point that it is quite flexible.


Your answer about an irregular data distribution being incompatible with classical Bayesian inference is the solution to my problem, thank you!

Regarding the example scenario with the weight: the weighing is done by a measurement device that has a built-in error specified by the producer of the device, so it always measures "2" for this particular item, but we know that the real value should be within ±3 sd (where the producer says sd = 0.01).

We could construct a simple engineering inverse uncertainty quantification problem: imagine that we have a cube and we want to find its edge length. From visual inspection we estimate that the edge is somewhere between 8 and 15 cm, and we know its density to be Ro = 1 g/cm^3. Given the mathematical model mu = a^3 * Ro, we put that into NUTS ("a" would be uniform from 8 to 15) along with data = 2 and sigma = 0.01, and we find the posterior distribution of "a". If there were uncertainty in the density (and we did not want to find a posterior for the density), then we would have to take it into account either by including this uncertainty as noise (which SMC-ABC can do), or, if we wanted to solve it with NUTS, by propagating this uncertainty to the data uncertainty represented by sigma. From what we established, this can be done if, after error propagation, the sigma distribution can be parametrised.
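A minimal sketch of that cube example (the unit convention is an assumption: density expressed in kg/cm^3 so that a^3 * Ro is directly comparable to the measured 2 kg):

import pymc as pm

Ro = 0.001       # density, kg/cm^3 (i.e. 1 g/cm^3), assumed known exactly here
data = [2.0]     # single measured mass in kg
sigma = 0.01     # manufacturer-stated measurement sd in kg

with pm.Model() as cube_model:
    a = pm.Uniform("a", lower=8, upper=15)   # edge length in cm, from visual inspection
    mu = a**3 * Ro                           # model-predicted mass
    pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=data)
    idata = pm.sample()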

I will show you the code tomorrow or on Monday

If anything, Bayesian statistics is a branch of parametric inference, so if you cannot model your likelihood (i.e. observation noise) with a combination of some known distributions and parameters, you might indeed be better off with approximate Bayesian methods. That being said, I am not completely convinced this is your case. You have a source of noise which you can parametrize, but you don't want to estimate its parameters; instead, you want to fix it. And it is also not quite straightforward how to add it to the likelihood. So instead you trick the sampler into adding some noise to your system at every step. This is probably not a problem previously considered in classical Bayesian statistics, but that does not mean it will not work. The divergences you encounter could be coming from other reasons. This "unparametrizable" source of noise is not the only thing that is unconventional about your model; the way you use the data is also quite unconventional. The following, for instance:

device that has a built-in error specified by the producer of the device, so it always measures "2" for this particular item, but we know that the real value should be within ±3 sd (where the producer says sd = 0.01)

A device that always measures 2? Then this is the wrong set of observables for defining your likelihood. Maybe if you can observe c and d, then you can define the likelihoods in terms of them. Otherwise you are trying to infer 2 parameters from 3 data points, so I would definitely change the way I am looking at the model. I am also telling you a practical fact about PyMC: if the noise sd in your likelihoods is very small compared to your data, the sampler will have convergence issues. So these issues need not be coming only from the weird source of noise, and that is why you might need to rethink your model and try to build your likelihoods around observations where there are repeats with noise.

I still find it very difficult to imagine how you would be working with just a single data point. If you are, then maybe this is not a problem suitable for statistics? Just say r0 < Ro < r1, therefore (mu/r1)^(1/3) < a < (mu/r0)^(1/3). Or if you want to be more precise, just sample Ro many times, and then the distribution for a will be (mu / Ro samples)^(1/3). If there are no observations, then there is no likelihood really, and no need for Bayesian statistics.
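A sketch of that non-statistical alternative (the density range and measured mu below are made-up illustration values):

import numpy as np

mu = 2000.0                                   # measured mass in g (assumed)
Ro = np.random.uniform(0.9, 1.1, 100_000)     # assumed density range, g/cm^3
a_samples = (mu / Ro) ** (1 / 3)              # implied edge-length distribution, cm
print(a_samples.mean(), a_samples.std())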

Statistics becomes more meaningful when you have multiple measurements. The whole of statistics is built on the premise that you have as many identically distributed, independent measurements as possible. On the other hand, if you don't have measurements for mu but you do have them for the density Ro, then you are setting up your model incorrectly. You should say Ro = mu / a^3, define some priors for mu and a, and say observed Ro ~ Normal(mu / a^3, sd).
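A sketch of that last suggestion (the repeated density measurements and priors are made-up illustration values; note that the likelihood only constrains the ratio mu / a^3):

import numpy as np
import pymc as pm

Ro_obs = np.array([0.98, 1.03, 1.01, 0.99])   # repeated density measurements (made up), g/cm^3

with pm.Model() as density_model:
    mu = pm.Normal("mu", 2000, 200)           # prior on mass, g (assumed)
    a = pm.Uniform("a", 8, 15)                # prior on edge length, cm
    sd = pm.HalfNormal("sd", 0.1)             # measurement noise on the density
    pm.Normal("Ro_obs", mu=mu / a**3, sigma=sd, observed=Ro_obs)
    idata = pm.sample()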