How to inference with interrelated observations

Meillo_Fang · October 14, 2024, 9:09am

I have two observations: observed_a and observed_data. Both are guassian distributions, as shown in the figure. The two observations are interrelated.How to get the distribution of the parameter eps_1 given the two observations?

with pm.Model() as model:

    eps_0 = pm.Uniform('eps_0', lower=3.0, upper=10.0, shape=1)
    eps_1 = pm.Uniform('eps_1', lower=3.0, upper=10.0, shape=1)
    eps_2 = pm.Uniform('eps_2', lower=3.0, upper=10.0, shape=1)
    loss_0 = pm.Uniform('loss_0', lower=0.0001, upper=0.001, shape=1)
    loss_1 = pm.Uniform('loss_1', lower=0.001, upper=0.1, shape=1)
    loss_2 = pm.Uniform('loss_2', lower=0.001, upper=0.1, shape=1)
    d1 = pm.Uniform('d1', lower=0.1, upper=5, shape=1)

    eps = pt.as_tensor_variable([eps_0, eps_1, eps_2])
    loss = pt.as_tensor_variable([loss_0, loss_1, loss_2])
    thickness = pt.as_tensor_variable([d1, const_a*observed_a / pt.sqrt(eps_1)])

    simulated_data = pm.Deterministic('simulated_data', simulator_op(eps, loss, thickness))
    likelihood = pm.Normal('likelihood', mu=simulated_data, sigma=1, observed=observed_data)
    trace = pm.sample(10000, tune=2000, progressbar=True, return_inferencedata=True)

pm.traceplot(trace, var_names=['eps_1'])
pm.summary(trace, var_names=['eps_1'])

PyMC v5.16.2
simulator_op is my defined Op class that performs simulation with the parameters eps, loss, and thickness. const_a is a constant. You can see that observed_a is correlated with parameter eps_1. I know the code thickness = pt.as_tensor_variable([d1, const_a*observed_a / pt.sqrt(eps_1)]) is erroneous, but how to correctly infer the specific parameter eps_1

jessegrabowski · October 14, 2024, 10:44am

Why do you think thickness = pt.as_tensor_variable([d1, const_a*observed_a / pt.sqrt(eps_1)]) is erroneous? It seems fine to me, as long as that equation is the correct physical relationship you’re interested in.

Meillo_Fang · October 14, 2024, 4:27pm

@jessegrabowski

Thank you. I have updated the code as follows:

with pm.Model() as model:
    eps_1 = pm.Uniform('eps_1', lower=3.0, upper=10.0, shape=1)
    eps_2 = pm.Uniform('eps_2', lower=3.0, upper=10.0, shape=1)
    loss_0 = pm.Uniform('loss_0', lower=0.0001, upper=0.001, shape=1)
    loss_1 = pm.Uniform('loss_1', lower=0.001, upper=0.1, shape=1)
    loss_2 = pm.Uniform('loss_2', lower=0.001, upper=0.1, shape=1)
    d1 = pm.Uniform('d1', lower=0.1, upper=5, shape=1)
    consta = pm.Deterministic('consta', pt.as_tensor_variable(const_a))

    def model_func(a):
        eps = pm.Deterministic('eps', pt.stack([pt.constant(2.0), eps_1[0], eps_2[0]]))
        loss = pm.Deterministic('loss', pt.stack([loss_0[0], loss_1[0], loss_2[0]]))
        thickness = pm.Deterministic('thickness', pt.stack([d1[0], a*consta/pt.sqrt(eps_1[0])]))
        return simulator_op(eps, loss, thickness)

    observed_a = pt.as_tensor_variable(observed_a)
    simulated = pm.Deterministic("simulated", pytensor.scan(fn=model_func, sequences=[observed_a])[0])

    likelihood = pm.Normal("y", mu=simulated, observed=[observed_data])

    trace = pm.sample(10000, tune=2000, progressbar=True, return_inferencedata=True)

The error reads as

ValueError                                Traceback (most recent call last)
Cell In[65], line 26
     22 observeddata = pt.as_tensor_variable(observed_data)
     24 likelihood = pm.Normal("y", mu=simulated, observed=[observed_data])
---> 26 trace = pm.sample(10000, tune=2000, progressbar=True, return_inferencedata=True)

File d:\Program\condaenv\myenvs\mcmc\Lib\site-packages\pymc\sampling\mcmc.py:716, in sample(draws, tune, chains, cores, random_seed, progressbar, progressbar_theme, step, var_names, nuts_sampler, initvals, init, jitter_max_retries, n_init, trace, discard_tuned_samples, compute_convergence_checks, keep_warning_stat, return_inferencedata, idata_kwargs, nuts_sampler_kwargs, callback, mp_ctx, blas_cores, model, **kwargs)
    713         auto_nuts_init = False
    715 initial_points = None
--> 716 step = assign_step_methods(model, step, methods=pm.STEP_METHODS, step_kwargs=kwargs)
    718 if nuts_sampler != "pymc":
    719     if not isinstance(step, NUTS):

File d:\Program\condaenv\myenvs\mcmc\Lib\site-packages\pymc\sampling\mcmc.py:215, in assign_step_methods(model, step, methods, step_kwargs)
    213 methods_list: list[type[BlockedStep]] = list(methods or pm.STEP_METHODS)
    214 selected_steps: dict[type[BlockedStep], list] = {}
--> 215 model_logp = model.logp()
    217 for var in model.value_vars:
    218     if var not in assigned_vars:
    219         # determine if a gradient can be computed

File d:\Program\condaenv\myenvs\mcmc\Lib\site-packages\pymc\model\core.py:742, in Model.logp(self, vars, jacobian, sum)
    740 rv_logps: list[TensorVariable] = []
    741 if rvs:
...
    632 return logp_terms_list

ValueError: Random variables detected in the logp graph: {loss_1, d1, eps_2, eps_1, loss_2, loss_0}.
This can happen when DensityDist logp or Interval transform functions reference nonlocal variables,
or when not all rvs have a corresponding value variable.

It seems that pymc can not handle Random variables in sample

jessegrabowski · October 14, 2024, 4:34pm

The problem is your inner scan function, model_func. You cannot new PyMC objects (like deterministics) inside here. Random variables are fine, as long as they are defined outside and passed in (which they are in this case).

Meillo_Fang · October 14, 2024, 4:55pm

@jessegrabowski
Thank you for the quick reply. However, when I defined outside random variables, there is an error I cannot solve…

with pm.Model() as model:

    eps_1 = pm.Uniform('eps_1', lower=3.0, upper=10.0, shape=1)
    eps_2 = pm.Uniform('eps_2', lower=3.0, upper=10.0, shape=1)
    loss_0 = pm.Uniform('loss_0', lower=0.0001, upper=0.001, shape=1)
    loss_1 = pm.Uniform('loss_1', lower=0.001, upper=0.1, shape=1)
    loss_2 = pm.Uniform('loss_2', lower=0.001, upper=0.1, shape=1)
    d1 = pm.Uniform('d1', lower=0.1, upper=5, shape=1)

    def model_func(a, eps_1, eps_2, loss_0, loss_1, loss_2, d1):
        eps = pt.stack([pt.constant(2.0), eps_1[0], eps_2[0]])
        loss = pt.stack([loss_0[0], loss_1[0], loss_2[0]])
        thickness = pt.stack([d1[0], a*pt.constant(const_a)/pt.sqrt(eps_1[0])])
        return simulator_op(eps, loss, thickness)

    simulated = pm.Deterministic("simulated", pytensor.scan(fn=model_func, sequences=[observed_a], non_sequences=[eps_1, eps_2, loss_0, loss_1, loss_2, d1])[0])
    likelihood = pm.Normal("y", mu=simulated, observed=[observed_data])

    trace = pm.sample(10000, tune=2000, progressbar=True, return_inferencedata=True)

ValueError: Not enough dimensions on input.
Apply node that caused the error: Composite{((-0.5 * sqr((i0 - i1))) - 0.9189385332046727)}(y{[[-5.55437 … 65181123]]}, Scan{scan_fn, while_loop=False, inplace=none}.0)
Toposort index: 18
Inputs types: [TensorType(float64, shape=(1, 10000)), TensorType(float64, shape=(None, None))]
Inputs shapes: [(1, 10000), (10000,)]
Inputs strides: [(80000, 8), (8,)]
Inputs values: [‘not shown’, ‘not shown’]
Outputs clients: [[Sum{axes=None}(sigma > 0)]]

HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag ‘optimizer=fast_compile’. If that does not work, PyTensor optimizations can be disabled with ‘optimizer=None’.
HINT: Use the PyTensor flag exception_verbosity=high for a debug print-out and storage map footprint of this Apply node

jessegrabowski · October 15, 2024, 4:45am

Scan is very strict about the shapes of inputs. When you compute simulated, the shape is one thing, but when PyMC is computing the log-likelihood, a different shape is being found. The two shapes are given here:

 Inputs shapes: [(1, 10000), (10000,)]

Without being to run your code I can only guess, but I guess the leading (1,) arises because you are passing observed_data in a list, which results in a batch dimension being added.

Meillo_Fang · October 15, 2024, 7:19am

@jessegrabowski Thank you. The problem has been solved as you suggested!

However, I do not quite understand the principle of sampling in PyMC. Is the simulated = pm.Deterministic() result obtained by traversing the space of all possible unknown parameters? How are the likelihood function and sampling performed afterwards?

Additionally, I am considering whether this method of sampling based on correlated distributions is the best choice. Would it be more efficient to obtain a table of simulated values by traversing the entire parameter space and then use a mapping table to get the distribution of values that meet the requirements? The current method does not seem very efficient (it would be running too long). I would appreciate it if you could give me any suggestions.

jessegrabowski · October 15, 2024, 9:35am

Sometimes the grid approach is nice, yes. You can see an example of this in this Bayesian bake-off between @AllenDowney and @ricardoV94

PyMC uses MCMC to draw samples from the posterior distribution implied by your model and data. If you’re not familiar with what that is, I suggest this video.

AllenDowney · October 15, 2024, 12:53pm

It looks like you are drawing a sample with size 10,000, which is bigger than you probably need. For most purposes 1000 is enough, as long as the effective sample size is not too low.

Meillo_Fang · October 15, 2024, 1:46pm

@AllenDowney Thank you. But sampling with size 1000 is still time-consuming.

Meillo_Fang · October 15, 2024, 1:48pm

@jessegrabowski Thanks a lot. I will try the grid approach.

jessegrabowski · October 15, 2024, 1:59pm

Just be aware that the number of grid cells you need to evaluate is n^k, where n is the size of the grid and k is number of parameters you’re evaluating. Since you have 6 variables, if you only use a grid size of 10, you will need one million function evaluations. If your function is expensive (which it appears to be), it will still be slow.

Since you are using a scan, I would strongly recommend you compile your model’s logp function to the JAX backend. This might make the huge number of requried functions calls tolerable.

Meillo_Fang · October 15, 2024, 2:07pm

@jessegrabowski Do you mean I should rewrite my function into pytensor function so that I can use Graph and using NUTs sampling? Sorry I am new to pymc and I am still struggling to learn the the syntax and operations

jessegrabowski · October 15, 2024, 2:22pm

If that is possible it would be a very good idea.

Topic		Replies	Views
Including Observations in the Model, a beginner question version agnostic gaussian_process , modeling , sampling	14	119	October 20, 2024
Likelihood given by product of two distributions? Questions	6	2184	August 4, 2018
Observation with different shape	4	515	June 16, 2022
Model Fitting using datasets of different observation instruments Questions	11	2541	April 15, 2020
2N+2 parameter inference: sampling all parameters as one? v5	5	152	March 28, 2024

How to inference with interrelated observations

Related topics