Defining a numeric (custom) likelihood function in PyMC3

(This is a question I asked over at SO but now I think it might be more suited for this forum, so I’m basically re-posting it here. Let me know if this is not advised)

After looking at several questions/answers and PyMC3’s, I’ve managed to create a minimal working example of my MCMC setup (see below).

My fitted parameters are continuous and discrete, so the priors are defined using pm.Uniform and pm.DiscreteUniform (with a re-scaling applied to the latter). My likelihood function is particularly convoluted (it involves comparing the N-dimensional histograms of my observed data and some synthetic data generated using the free parameters), so I had to write it using theano's @as_op operator.

The implementation shown here works on a toy model working on random data, but in my actual model the likelihood and parameters are very similar.

My questions are:

  1. Is this correct? Is there anything I should be doing different?
  2. The call to the likelihood function is just thrown there apparently doing nothing and connected to nothing. Is this the proper way to do this?
  3. I’m using NUTS for the continuous parameters but since my likelihood is numeric, I don’t think I should be able to do this. Since the code still runs, I’m nut sure what’s going on.

This is the first time I’ve used PyMC3 so any pointers will be really helpful.

import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import theano.tensor as tt
from theano.compile.ops import as_op


def main():
    trace = bayesMCMC()
    print(pm.summary(trace))
    pm.traceplot(trace)
    plt.show()


def bayesMCMC():
    """
    Define and process the full model.
    """
    with pm.Model() as model:
        # Define uniform priors.
        A = pm.Uniform("A", lower=0., upper=5.)
        B = pm.Uniform("B", lower=10., upper=20.)
        C = pm.Uniform("C", lower=0., upper=1.)

        # Define discrete priors.
        minD, maxD, stepD = 0.005, 0.06, 0.005
        ND = int((maxD - minD) / stepD)
        D = pm.DiscreteUniform("D", 0., ND)
        minE, maxE, stepE = 9., 10., 0.05
        NE = int((maxE - minE) / stepE)
        E = pm.DiscreteUniform("E", 0., NE)

        # Is this correct??
        logp(A, B, C, D, E)

        step1 = pm.NUTS(vars=[A, B, C])
        print("NUTS")
        step2 = pm.Metropolis(vars=[D, E])
        print("Metropolis")

        trace = pm.sample(300, [step1, step2])  # , start)

        return trace


@as_op(
    itypes=[tt.dscalar, tt.dscalar, tt.dscalar, tt.lscalar, tt.lscalar],
    otypes=[tt.dscalar])
def logp(A, B, C, D, E):
    """
    Likelihood evaluation.
    """
    # Get observed data and some extra info to re-scale the discrete parameters
    obsData, minD, stepD, minE, stepE = obsservedData()

    # Scale discrete parameters
    D, E = D * stepD + minD, E * stepE + minE

    # Generate synthetic data using the prior values
    synthData = synthetic(A, B, C, D, E)

    # Generate N-dimensional histograms for both data sets.
    obsHist, edges = np.histogramdd(obsData)
    synHist, _ = np.histogramdd(synthData, bins=edges)

    # Flatten both histograms
    obsHist_f, synHist_f = obsHist.ravel(), synHist.ravel()
    # Remove all bins where N_bin=0.
    binNzero = obsHist_f != 0
    obsHist_f, synHist_f = obsHist_f[binNzero], synHist_f[binNzero]
    # Assign small value to the 0 elements in synHist_f to avoid issues with
    # the log()
    synHist_f[synHist_f == 0] = 0.001

    # Compare the histograms of the observed and synthetic data via a Poisson
    # likelihood ratio.
    lkl = -2. * np.sum(synHist_f - obsHist_f * np.log(synHist_f))

    return lkl


def obsservedData():
    """Some 'observed' data."""
    np.random.seed(12345)
    N = 1000
    obsData = np.random.uniform(0., 10., (N, 3))

    minD, stepD = 0.005, 0.005
    minE, stepE = 9., 0.05

    return obsData, minD, stepD, minE, stepE


def synthetic(A, B, C, D, E):
    """
    Dummy function to generate synthetic data. The actual function makes use
    of the A, B, C, D, E variables (obviously).
    """
    M = np.random.randint(100, 1000)
    synthData = np.random.uniform(0., 10., (M, 3))
    return synthData


if __name__ == "__main__":
    main()

No it is not correct, as you suspect the logp is not connected to the graph, which means it is not being evaluated during sampling/inferece. Right now you are sampling from the prior. You can use potential to add the logp to the model:

with pm.Model() as model:
    ...
    pm.Potential('logp', logp(A, B, C, D, E))

It is allowed as pymc3 will sampled with a metropolis within Gibbs sampler (ie compoundstep), but you should just call sample() as Metropolis is actually not suitable for DiscreteUniform (it would give invalid proposal that goes out of bound). More information see http://docs.pymc.io/notebooks/sampling_compound_step.html, especially the last paragraph.

Lastly, I think you can rewrite your logp into theano. There is random generator in theano, and the histogram function could be rewrite into theano as well.

Thank you very much for your answer @junpenglao. I followed your advise and used pm.Potential('logp', logp(A, B, C, D, E)) and pm.sample().

I get a rather nasty error which I transcribe below:

$ python del.py
Only 300 samples in chain.
Multiprocess sampling (2 chains in 2 jobs)
CompoundStep
>CompoundStep
>>Slice: [C]
>>Slice: [B]
>>Slice: [A]
>CompoundStep
>>Metropolis: [E]
>>Metropolis: [D]
Sampling 2 chains:   0%|                                                         | 0/1600 [00:00<?, ?draws/s]
pymc3.parallel_sampling.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/theano/compile/function_module.py", line 903, in __call__
    self.fn() if output_subset is None else\
ValueError: expected an ndarray

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 73, in run
    self._start_loop()
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 113, in _start_loop
    point, stats = self._compute_point()
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 139, in _compute_point
    point, stats = self._step_method.step(self._point)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/step_methods/compound.py", line 29, in step
    point = method.step(point)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/step_methods/compound.py", line 33, in step
    point = method.step(point)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py", line 142, in step
    apoint = self.astep(bij.map(point), *inputs)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/step_methods/slicer.py", line 56, in astep
    y = logp(q) - nr.standard_exponential()
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/blocking.py", line 257, in __call__
    return self.fa(self.fb(x))
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/model.py", line 1107, in __call__
    return self.f(**state)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/theano/compile/function_module.py", line 917, in __call__
    storage_map=getattr(self.fn, 'storage_map', None))
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/theano/gof/link.py", line 325, in raise_with_op
    reraise(exc_type, exc_value, exc_trace)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/six.py", line 692, in reraise
    raise value.with_traceback(tb)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/theano/compile/function_module.py", line 903, in __call__
    self.fn() if output_subset is None else\
ValueError: expected an ndarray
Apply node that caused the error: MakeVector{dtype='float64'}(__logp_A_interval__, __logp_B_interval__, __logp_C_interval__, __logp_D, __logp_E, logp)
Toposort index: 9
Inputs types: [TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar)]
Inputs shapes: [(), (), (), (), (), ()]
Inputs strides: [(), (), (), (), (), ()]
Inputs values: [array(-1.38629436), array(-1.38629436), array(-1.38629436), array(-2.48490665), array(-3.04452244), -7993.053447180414]
Outputs clients: [[Sum{acc_dtype=float64}(MakeVector{dtype='float64'}.0)]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "del.py", line 38, in bayesMCMC
    trace = pm.sample(300)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/sampling.py", line 415, in sample
    step = assign_step_methods(model, step, step_kwargs=step_kwargs)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/sampling.py", line 151, in assign_step_methods
    return instantiate_steppers(model, steps, selected_steps, step_kwargs)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/sampling.py", line 72, in instantiate_steppers
    step = step_class(vars=vars, **args)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py", line 65, in __new__
    step.__init__([var], *args, **kwargs)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/step_methods/slicer.py", line 47, in __init__
    super(Slice, self).__init__(vars, [self.model.fastlogp], **kwargs)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/model.py", line 233, in fastlogp
    return self.model.fastfn(self.logpt)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/model.py", line 716, in logpt
    logp = tt.sum([tt.sum(factor) for factor in factors])

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
"""

The above exception was the direct cause of the following exception:

ValueError: expected an ndarray
Apply node that caused the error: MakeVector{dtype='float64'}(__logp_A_interval__, __logp_B_interval__, __logp_C_interval__, __logp_D, __logp_E, logp)
Toposort index: 9
Inputs types: [TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar), TensorType(float64, scalar)]
Inputs shapes: [(), (), (), (), (), ()]
Inputs strides: [(), (), (), (), (), ()]
Inputs values: [array(-1.38629436), array(-1.38629436), array(-1.38629436), array(-2.48490665), array(-3.04452244), -7993.053447180414]
Outputs clients: [[Sum{acc_dtype=float64}(MakeVector{dtype='float64'}.0)]]

Backtrace when the node is created(use Theano flag traceback.limit=N to make it longer):
  File "del.py", line 38, in bayesMCMC
    trace = pm.sample(300)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/sampling.py", line 415, in sample
    step = assign_step_methods(model, step, step_kwargs=step_kwargs)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/sampling.py", line 151, in assign_step_methods
    return instantiate_steppers(model, steps, selected_steps, step_kwargs)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/sampling.py", line 72, in instantiate_steppers
    step = step_class(vars=vars, **args)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py", line 65, in __new__
    step.__init__([var], *args, **kwargs)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/step_methods/slicer.py", line 47, in __init__
    super(Slice, self).__init__(vars, [self.model.fastlogp], **kwargs)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/model.py", line 233, in fastlogp
    return self.model.fastfn(self.logpt)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/model.py", line 716, in logpt
    logp = tt.sum([tt.sum(factor) for factor in factors])

HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "del.py", line 102, in <module>
    main()
  File "del.py", line 12, in main
    trace = bayesMCMC()
  File "del.py", line 38, in bayesMCMC
    trace = pm.sample(300)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/sampling.py", line 449, in sample
    trace = _mp_sample(**sample_args)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/sampling.py", line 999, in _mp_sample
    for draw in sampler:
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 305, in __iter__
    draw = ProcessAdapter.recv_draw(self._active)
  File "/home/gabriel/anaconda3/envs/asteca3/lib/python3.7/site-packages/pymc3/parallel_sampling.py", line 223, in recv_draw
    six.raise_from(RuntimeError('Chain %s failed.' % proc.chain), old)
  File "<string>", line 3, in raise_from
RuntimeError: Chain 1 failed.

I don’t think I could write the likelihood in Theano because my actual code for generating the synthetic data is quite large and complicated, not just a random generator like I use in this example.