Trouble using DensityDist with a subclassed Op

Hi folks!

I’ve been trying to use the DensityDist class in conjunction with a custom Op in order to represent a black-box model likelihood. (The likelihood is calculated by an external library and it would not be feasible to represent it as a symbolic function, hence the custom Op. I would like to be able to do predictive tests on the models, hence DensityDist rather than Potential.) It currently errors out with a ValueError relating to the number of parameters given to the custom Op, and I haven’t been able to figure it out. I understand the documentation still needs some clarification following the v4 refactor (Expand DensityDist docstrings · Issue #5641 · pymc-devs/pymc · GitHub), so I wanted to ask for help before searching for a bug.

Let’s suppose you had an Op instance that could be passed to Potential like so:

# var1, var2, ... = Variables
lik_op = MyOp(data)
lik = pm.Potential("lik", lik_op(var1, var2, ...))

Is the correct way to use DensityDist with a provided logp as follows?

lik = pm.DensityDist("lik", var1, var2, ..., logp=lik_op)

As a concrete example, here is the issue reproduced with an example based on the black-box likelihood case study notebook (Using a “black box” likelihood function (numpy) — PyMC documentation). Here is the condensed code:

import aesara.tensor as at
import numpy as np
import pymc as pm

print(f"Running on PyMC v{pm.__version__}")

def my_model(m, c, x):
    return m * x + c

def my_loglike(m, c, x, data, sigma):
    model = my_model(m, c, x)
    return -(0.5 / sigma**2) * np.sum((data - model) ** 2)

class LogLike(at.Op):
    # Custom Op wrapping the black-box log-likelihood
    itypes = [at.dscalar, at.dscalar]  # two scalar inputs: m, c
    otypes = [at.dscalar]  # one scalar output: the log-likelihood

    def __init__(self, loglike, data, x, sigma):
        self.likelihood = loglike
        self.data = data
        self.x = x
        self.sigma = sigma

    def perform(self, node, inputs, outputs):
        m, c = inputs
        logl = self.likelihood(m, c, self.x, self.data, self.sigma)
        outputs[0][0] = np.array(logl)

N = 10  # number of data points
sigma = 1.0  # standard deviation of noise
x = np.linspace(0.0, 9.0, N)
mtrue = 0.4  # true gradient
ctrue = 3.0  # true y-intercept
truemodel = my_model(mtrue, ctrue, x)

rng = np.random.default_rng(716743)
data = sigma * rng.normal(size=N) + truemodel

logl = LogLike(my_loglike, data, x, sigma)

with pm.Model():
    m = pm.Uniform("m", lower=-10.0, upper=10.0)
    c = pm.Uniform("c", lower=-10.0, upper=10.0)

    lik = pm.Potential("likelihood", logl(m, c))  # Works
    # lik = pm.DensityDist("likelihood", m, c, logp=logl)  # Does not work

    idata = pm.sample()

If I comment out the Potential and uncomment the DensityDist, I get the following error:

/home/cove/project-dev/testing_blackbox.ipynb Cell 1' in <module>
     42 # lik = pm.Potential("likelihood", logl(m, c))  # Works
     43 lik = pm.DensityDist("likelihood", m, c, logp=logl)  # Does not work
---> 45 idata = pm.sample()

File ~/project-dev/venv-project/lib/python3.10/site-packages/pymc/sampling.py:472, in sample(draws, step, init, n_init, initvals, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, **kwargs)
    469         auto_nuts_init = False
    471 initial_points = None
--> 472 step = assign_step_methods(model, step, methods=pm.STEP_METHODS, step_kwargs=kwargs)
    474 if isinstance(step, list):
    475     step = CompoundStep(step)

File ~/project-dev/venv-project/lib/python3.10/site-packages/pymc/sampling.py:198, in assign_step_methods(model, step, methods, step_kwargs)
    195 # Use competence classmethods to select step methods for remaining
    196 # variables
    197 selected_steps = defaultdict(list)
--> 198 model_logpt = model.logpt()
    200 for var in model.value_vars:
    201     if var not in assigned_vars:
    202         # determine if a gradient can be computed

File ~/project-dev/venv-project/lib/python3.10/site-packages/pymc/model.py:763, in Model.logpt(self, vars, jacobian, sum)
    761 rv_logps: List[TensorVariable] = []
    762 if rv_values:
--> 763     rv_logps = joint_logpt(list(rv_values.keys()), rv_values, sum=False, jacobian=jacobian)
    764     assert isinstance(rv_logps, list)
    766 # Replace random variables by their value variables in potential terms

File ~/project-dev/venv-project/lib/python3.10/site-packages/pymc/distributions/logprob.py:223, in joint_logpt(var, rv_values, jacobian, scaling, transformed, sum, **kwargs)
    220                 transform_map[value_var] = original_value_var.tag.transform
    222 transform_opt = TransformValuesOpt(transform_map)
--> 223 temp_logp_var_dict = factorized_joint_logprob(
    224     tmp_rvs_to_values, extra_rewrites=transform_opt, use_jacobian=jacobian, **kwargs
    225 )
    227 # Raise if there are unexpected RandomVariables in the logp graph
    228 # Only SimulatorRVs are allowed
    229 from pymc.distributions.simulator import SimulatorRV

File ~/project-dev/venv-project/lib/python3.10/site-packages/aeppl/joint_logprob.py:186, in factorized_joint_logprob(rv_values, warn_missing_rvs, extra_rewrites, **kwargs)
    183 q_value_vars = remapped_vars[: len(q_value_vars)]
    184 q_rv_inputs = remapped_vars[len(q_value_vars) :]
--> 186 q_logprob_vars = _logprob(
    187     node.op,
    188     q_value_vars,
    189     *q_rv_inputs,
    190     **kwargs,
    191 )
    193 if not isinstance(q_logprob_vars, (list, tuple)):
    194     q_logprob_vars = [q_logprob_vars]

File /usr/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/project-dev/venv-project/lib/python3.10/site-packages/pymc/distributions/distribution.py:813, in DensityDist.__new__.<locals>.density_dist_logp(op, value_var_list, *dist_params, **kwargs)
    811 _dist_params = dist_params[3:]
    812 value_var = value_var_list[0]
--> 813 return logp(value_var, *_dist_params)

File ~/project-dev/venv-project/lib/python3.10/site-packages/aesara/graph/op.py:294, in Op.__call__(self, *inputs, **kwargs)
    252 r"""Construct an `Apply` node using :meth:`Op.make_node` and return its outputs.
    253 
    254 This method is just a wrapper around :meth:`Op.make_node`.
   (...)
    291 
    292 """
    293 return_list = kwargs.pop("return_list", False)
--> 294 node = self.make_node(*inputs, **kwargs)
    296 if config.compute_test_value != "off":
    297     compute_test_value(node)

File ~/project-dev/venv-project/lib/python3.10/site-packages/aesara/graph/op.py:234, in Op.make_node(self, *inputs)
    228     raise NotImplementedError(
    229         "You can either define itypes and otypes,\
    230      or implement make_node"
    231     )
    233 if len(inputs) != len(self.itypes):
--> 234     raise ValueError(
    235         f"We expected {len(self.itypes)} inputs but got {len(inputs)}."
    236     )
    237 if not all(it.in_same_class(inp.type) for inp, it in zip(inputs, self.itypes)):
    238     raise TypeError(
    239         f"Invalid input types for Op {self}:\n"
    240         + "\n".join(
   (...)
    247         )
    248     )

ValueError: We expected 2 inputs but got 3.

Sorry for the long trace over a ValueError. The same behavior occurs with any number of parameters specified in the way shown above.
P.S.: This is on pymc v4.0.0b6, aesara v2.5.1.

I have just realized that the docstring is incomplete on the website because most of the content is in the __new__ method, which is private and therefore doesn’t appear there. All of that content should be moved to the class docstring. For now you have to read it from the source: pymc.distributions.distribution — PyMC dev documentation

DensityDist is for easily creating distributions from their logp function alone. It is therefore compatible with arbitrary logp functions, but it is not compatible with arbitrary call signatures. Distributions in PyMC don’t just have parameters; they also need to be evaluated at a value. If the distribution is being used to define a prior, that value is the value of that variable proposed by the sampler; if it is an observed variable, it is the array_like passed with the observed keyword. The old version of DensityDist was a bit different and too permissive, which was the source of several issues.

The “expected 2 inputs but got 3” error is because your call signature has only the parameters and is missing the value part. You have to modify the call signature, and since you want a log-likelihood term, pass the data via the observed keyword.
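
Schematically, the expected call signature looks something like this (these names are placeholders for illustration, not actual API):

# value comes first, then the parameters passed positionally to DensityDist
def logp(value, param1, param2):
    ...  # return the log-probability of `value` given param1, param2

lik = pm.DensityDist("lik", param1, param2, logp=logp, observed=data)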

Oh, we should make that visible!

Ahh, thank you very much for this clarification. I now see what the call signature of logp needs to look like and also get the rationale for separating observed data from model parameters in the call signature. I confess that I had come across the docstring, but somehow missed the meaning of the “value” argument there.

To rephrase for anyone else who comes along with a similar question: in the original question, I had tried to embed the observed data in the Op itself. It seems the recommended choice is instead to provide the data via the observed kwarg of DensityDist. This means it is also necessary to update the input TensorTypes of the logp Op, with the first input to the Op now representing the observed data.
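
Concretely, here is a sketch of how the Op from my example above could be updated (untested, but following the signature described above; the observed data becomes the first input, a vector, rather than being stored on the Op):

class LogLike(at.Op):
    # the first input is now the observed data (a vector), followed by m and c
    itypes = [at.dvector, at.dscalar, at.dscalar]
    otypes = [at.dscalar]

    def __init__(self, loglike, x, sigma):
        self.likelihood = loglike
        self.x = x
        self.sigma = sigma

    def perform(self, node, inputs, outputs):
        value, m, c = inputs  # value is the observed data supplied by PyMC
        logl = self.likelihood(m, c, self.x, value, self.sigma)
        outputs[0][0] = np.array(logl)

logl = LogLike(my_loglike, x, sigma)

with pm.Model():
    m = pm.Uniform("m", lower=-10.0, upper=10.0)
    c = pm.Uniform("c", lower=-10.0, upper=10.0)
    lik = pm.DensityDist("likelihood", m, c, logp=logl, observed=data)
    idata = pm.sample()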

If it’s alright to ask a follow-up question here: is it a poor design choice/poor statistical practice to embed the data in the likelihood function? In my current project, I’ve chosen to do so because calculating my logp requires converting the raw data (a numpy array) to a custom object, and this is somewhat costly to do. So if we can avoid having to redo this construction every time the sampler calls logp, that would be quite nice. AFAIK aesara doesn’t support tensor variables of an “object” type, so I just embedded the Python object in the logp Op or function. Obviously this ties our hands from evaluating model parameters on newly observed data, however. I’m still a relative novice at working with Bayesian models, so I don’t have a good sense of how significant this limitation would be to e.g. model assessment.

More than recommended, it is required to use this call signature. The observed data doesn’t change with every draw, so I’m not sure what you mean by the conversion thing. Why does having the observed data be the input in the observed kwarg make a difference on this?

Technically you could have a value argument that is ignored; the restriction is on it being in the call signature, not on actually using the value argument. If you then want to ignore it and embed the data there, that is on you, and if you do, I’d double-check all the results to make sure everything is behaving as expected.
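
For illustration, a rough sketch of that pattern (expensive_conversion is a hypothetical stand-in for whatever costly preprocessing your library needs):

class CachedLogLike(at.Op):
    # `value` is part of the call signature but ignored in perform()
    itypes = [at.dvector, at.dscalar, at.dscalar]
    otypes = [at.dscalar]

    def __init__(self, loglike, data, x, sigma):
        self.likelihood = loglike
        # hypothetical: do the costly conversion once, up front
        self.converted = expensive_conversion(data)
        self.x = x
        self.sigma = sigma

    def perform(self, node, inputs, outputs):
        _value, m, c = inputs  # _value is ignored; the embedded object is used instead
        logl = self.likelihood(m, c, self.x, self.converted, self.sigma)
        outputs[0][0] = np.array(logl)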

Understood, many thanks again. And apologies, just to be clear: I misspoke somewhat with the word “recommendation.” Coming from the black-box likelihood notebook left me a little confused, since Potential does not require a logp function that follows the signature required by Distributions, and the notebook gives an example of a more permissive logp function that does embed the data within it. As I’m still a student myself, the importance of the distinction between model parameters and observed data wasn’t immediately clear while working through the notebook.

It’s quite clear to me now what the API is for a Distribution’s logp function, though.

why does having the observed data be the input in the observed kwarg make a difference on this?

I don’t mean to take this thread too far out of scope, but to answer this: the primary reason I can imagine someone wanting to avoid passing data via the observed kwarg is if they have data that isn’t easily converted to a TensorVariable. As best I can tell, using observed requires that the data be numeric, and, given some Op which calculates the logp, PyMC passes the data to the Op as an ndarray.

What I meant by the “conversion” thing is that I’m using an external library which uses a custom format for storing data, essentially performing some transformations on a dataframe or ndarray. It even accepts non-numeric types in the dataframe/array to represent factor levels in the data. Hence, so long as I’m understanding the pymc and aesara APIs correctly (which I definitely could be misunderstanding!), I believe that in order to use the correct logp call signature, I would need to:

  1. Every time that logp is called, make my Op call the external library and perform those transformations to create the library-friendly data object (see the sketch after this list)
  2. Either create an Op for every relevant combination of factor levels, or convert those factors to integer indices and make my Op resolve those indices
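
For the first point, I’m picturing something like this (external_lib.from_array is a hypothetical stand-in for the real library’s conversion routine):

class ConvertingLogLike(at.Op):
    itypes = [at.dvector, at.dscalar, at.dscalar]
    otypes = [at.dscalar]

    def __init__(self, loglike, x, sigma):
        self.likelihood = loglike
        self.x = x
        self.sigma = sigma

    def perform(self, node, inputs, outputs):
        value, m, c = inputs
        # hypothetical: rebuild the library-friendly object on every call
        lib_data = external_lib.from_array(value)
        logl = self.likelihood(m, c, self.x, lib_data, self.sigma)
        outputs[0][0] = np.array(logl)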

I imagine that this is a very small edge case, and I have a lot of work to do before I’ll be able to assess how the first point could impact sampling performance, so I’m happy for now to leave this as a challenge for future me to figure out. :smiley: But that’s where I’m coming from. (If it seems like I’m still misunderstanding something significant, I’d love to hear!) Thanks again for the discussion and help.

You can try timing the two options as I mentioned above: one keeping value as an argument in the call signature but ignoring it and embedding the data in the loglik function instead, and another actually using the value argument. I don’t know enough about the internals to guess whether there will be a difference or what it would be.
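
Something along these lines could work for the comparison (a sketch; I’m assuming Model.compile_logp and Model.initial_point are available in your PyMC version):

import timeit

# `model` is whichever of the two model variants you built
logp_fn = model.compile_logp()
point = model.initial_point()

# average seconds per logp evaluation
print(timeit.timeit(lambda: logp_fn(point), number=1000) / 1000)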

The warning I wrote above about the embedded-data approach is because you might run into shape issues, but if you only sample the model on your data and don’t try to resample on different data or generate posterior predictive samples, I think everything should be fine.
