How to inference with interrelated observations

I have two observations: observed_a and observed_data. Both are guassian distributions, as shown in the figure. The two observations are interrelated.How to get the distribution of the parameter eps_1 given the two observations?

with pm.Model() as model:

    eps_0 = pm.Uniform('eps_0', lower=3.0, upper=10.0, shape=1)
    eps_1 = pm.Uniform('eps_1', lower=3.0, upper=10.0, shape=1)
    eps_2 = pm.Uniform('eps_2', lower=3.0, upper=10.0, shape=1)
    loss_0 = pm.Uniform('loss_0', lower=0.0001, upper=0.001, shape=1)
    loss_1 = pm.Uniform('loss_1', lower=0.001, upper=0.1, shape=1)
    loss_2 = pm.Uniform('loss_2', lower=0.001, upper=0.1, shape=1)
    d1 = pm.Uniform('d1', lower=0.1, upper=5, shape=1)

    eps = pt.as_tensor_variable([eps_0, eps_1, eps_2])
    loss = pt.as_tensor_variable([loss_0, loss_1, loss_2])
    thickness = pt.as_tensor_variable([d1, const_a*observed_a / pt.sqrt(eps_1)])

    simulated_data = pm.Deterministic('simulated_data', simulator_op(eps, loss, thickness))
    likelihood = pm.Normal('likelihood', mu=simulated_data, sigma=1, observed=observed_data)
    trace = pm.sample(10000, tune=2000, progressbar=True, return_inferencedata=True)

pm.traceplot(trace, var_names=['eps_1'])
pm.summary(trace, var_names=['eps_1'])

PyMC v5.16.2
simulator_op is my defined Op class that performs simulation with the parameters eps, loss, and thickness. const_a is a constant. You can see that observed_a is correlated with parameter eps_1. I know the code thickness = pt.as_tensor_variable([d1, const_a*observed_a / pt.sqrt(eps_1)]) is erroneous, but how to correctly infer the specific parameter eps_1

Why do you think thickness = pt.as_tensor_variable([d1, const_a*observed_a / pt.sqrt(eps_1)]) is erroneous? It seems fine to me, as long as that equation is the correct physical relationship you’re interested in.

@jessegrabowski

Thank you. I have updated the code as follows:

with pm.Model() as model:
    eps_1 = pm.Uniform('eps_1', lower=3.0, upper=10.0, shape=1)
    eps_2 = pm.Uniform('eps_2', lower=3.0, upper=10.0, shape=1)
    loss_0 = pm.Uniform('loss_0', lower=0.0001, upper=0.001, shape=1)
    loss_1 = pm.Uniform('loss_1', lower=0.001, upper=0.1, shape=1)
    loss_2 = pm.Uniform('loss_2', lower=0.001, upper=0.1, shape=1)
    d1 = pm.Uniform('d1', lower=0.1, upper=5, shape=1)
    consta = pm.Deterministic('consta', pt.as_tensor_variable(const_a))

    def model_func(a):
        eps = pm.Deterministic('eps', pt.stack([pt.constant(2.0), eps_1[0], eps_2[0]]))
        loss = pm.Deterministic('loss', pt.stack([loss_0[0], loss_1[0], loss_2[0]]))
        thickness = pm.Deterministic('thickness', pt.stack([d1[0], a*consta/pt.sqrt(eps_1[0])]))
        return simulator_op(eps, loss, thickness)

    observed_a = pt.as_tensor_variable(observed_a)
    simulated = pm.Deterministic("simulated", pytensor.scan(fn=model_func, sequences=[observed_a])[0])

    likelihood = pm.Normal("y", mu=simulated, observed=[observed_data])

    trace = pm.sample(10000, tune=2000, progressbar=True, return_inferencedata=True)

The error reads as

ValueError                                Traceback (most recent call last)
Cell In[65], line 26
     22 observeddata = pt.as_tensor_variable(observed_data)
     24 likelihood = pm.Normal("y", mu=simulated, observed=[observed_data])
---> 26 trace = pm.sample(10000, tune=2000, progressbar=True, return_inferencedata=True)

File d:\Program\condaenv\myenvs\mcmc\Lib\site-packages\pymc\sampling\mcmc.py:716, in sample(draws, tune, chains, cores, random_seed, progressbar, progressbar_theme, step, var_names, nuts_sampler, initvals, init, jitter_max_retries, n_init, trace, discard_tuned_samples, compute_convergence_checks, keep_warning_stat, return_inferencedata, idata_kwargs, nuts_sampler_kwargs, callback, mp_ctx, blas_cores, model, **kwargs)
    713         auto_nuts_init = False
    715 initial_points = None
--> 716 step = assign_step_methods(model, step, methods=pm.STEP_METHODS, step_kwargs=kwargs)
    718 if nuts_sampler != "pymc":
    719     if not isinstance(step, NUTS):

File d:\Program\condaenv\myenvs\mcmc\Lib\site-packages\pymc\sampling\mcmc.py:215, in assign_step_methods(model, step, methods, step_kwargs)
    213 methods_list: list[type[BlockedStep]] = list(methods or pm.STEP_METHODS)
    214 selected_steps: dict[type[BlockedStep], list] = {}
--> 215 model_logp = model.logp()
    217 for var in model.value_vars:
    218     if var not in assigned_vars:
    219         # determine if a gradient can be computed

File d:\Program\condaenv\myenvs\mcmc\Lib\site-packages\pymc\model\core.py:742, in Model.logp(self, vars, jacobian, sum)
    740 rv_logps: list[TensorVariable] = []
    741 if rvs:
...
    632 return logp_terms_list

ValueError: Random variables detected in the logp graph: {loss_1, d1, eps_2, eps_1, loss_2, loss_0}.
This can happen when DensityDist logp or Interval transform functions reference nonlocal variables,
or when not all rvs have a corresponding value variable.

It seems that pymc can not handle Random variables in sample

The problem is your inner scan function, model_func. You cannot new PyMC objects (like deterministics) inside here. Random variables are fine, as long as they are defined outside and passed in (which they are in this case).

@jessegrabowski
Thank you for the quick reply. However, when I defined outside random variables, there is an error I cannot solve…

with pm.Model() as model:

    eps_1 = pm.Uniform('eps_1', lower=3.0, upper=10.0, shape=1)
    eps_2 = pm.Uniform('eps_2', lower=3.0, upper=10.0, shape=1)
    loss_0 = pm.Uniform('loss_0', lower=0.0001, upper=0.001, shape=1)
    loss_1 = pm.Uniform('loss_1', lower=0.001, upper=0.1, shape=1)
    loss_2 = pm.Uniform('loss_2', lower=0.001, upper=0.1, shape=1)
    d1 = pm.Uniform('d1', lower=0.1, upper=5, shape=1)

    def model_func(a, eps_1, eps_2, loss_0, loss_1, loss_2, d1):
        eps = pt.stack([pt.constant(2.0), eps_1[0], eps_2[0]])
        loss = pt.stack([loss_0[0], loss_1[0], loss_2[0]])
        thickness = pt.stack([d1[0], a*pt.constant(const_a)/pt.sqrt(eps_1[0])])
        return simulator_op(eps, loss, thickness)

    simulated = pm.Deterministic("simulated", pytensor.scan(fn=model_func, sequences=[observed_a], non_sequences=[eps_1, eps_2, loss_0, loss_1, loss_2, d1])[0])
    likelihood = pm.Normal("y", mu=simulated, observed=[observed_data])

    trace = pm.sample(10000, tune=2000, progressbar=True, return_inferencedata=True)

ValueError: Not enough dimensions on input.
Apply node that caused the error: Composite{((-0.5 * sqr((i0 - i1))) - 0.9189385332046727)}(y{[[-5.55437 … 65181123]]}, Scan{scan_fn, while_loop=False, inplace=none}.0)
Toposort index: 18
Inputs types: [TensorType(float64, shape=(1, 10000)), TensorType(float64, shape=(None, None))]
Inputs shapes: [(1, 10000), (10000,)]
Inputs strides: [(80000, 8), (8,)]
Inputs values: [‘not shown’, ‘not shown’]
Outputs clients: [[Sum{axes=None}(sigma > 0)]]

HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag ‘optimizer=fast_compile’. If that does not work, PyTensor optimizations can be disabled with ‘optimizer=None’.
HINT: Use the PyTensor flag exception_verbosity=high for a debug print-out and storage map footprint of this Apply node

Scan is very strict about the shapes of inputs. When you compute simulated, the shape is one thing, but when PyMC is computing the log-likelihood, a different shape is being found. The two shapes are given here:

 Inputs shapes: [(1, 10000), (10000,)]

Without being to run your code I can only guess, but I guess the leading (1,) arises because you are passing observed_data in a list, which results in a batch dimension being added.

@jessegrabowski Thank you. The problem has been solved as you suggested!


However, I do not quite understand the principle of sampling in PyMC. Is the simulated = pm.Deterministic() result obtained by traversing the space of all possible unknown parameters? How are the likelihood function and sampling performed afterwards?

Additionally, I am considering whether this method of sampling based on correlated distributions is the best choice. Would it be more efficient to obtain a table of simulated values by traversing the entire parameter space and then use a mapping table to get the distribution of values that meet the requirements? The current method does not seem very efficient (it would be running too long). I would appreciate it if you could give me any suggestions.

Sometimes the grid approach is nice, yes. You can see an example of this in this Bayesian bake-off between @AllenDowney and @ricardoV94

PyMC uses MCMC to draw samples from the posterior distribution implied by your model and data. If you’re not familiar with what that is, I suggest this video.

It looks like you are drawing a sample with size 10,000, which is bigger than you probably need. For most purposes 1000 is enough, as long as the effective sample size is not too low.

@AllenDowney Thank you. But sampling with size 1000 is still time-consuming.

@jessegrabowski Thanks a lot. I will try the grid approach.

Just be aware that the number of grid cells you need to evaluate is n^k, where n is the size of the grid and k is number of parameters you’re evaluating. Since you have 6 variables, if you only use a grid size of 10, you will need one million function evaluations. If your function is expensive (which it appears to be), it will still be slow.

Since you are using a scan, I would strongly recommend you compile your model’s logp function to the JAX backend. This might make the huge number of requried functions calls tolerable.

@jessegrabowski Do you mean I should rewrite my function into pytensor function so that I can use Graph and using NUTs sampling? Sorry I am new to pymc and I am still struggling to learn the the syntax and operations :face_exhaling:

If that is possible it would be a very good idea.