ABC function works independently - but fails in `pm.Simulator`

Hi PyMCers,
I study study echolocating bats, and am hoping to use PyMC to estimate which bats in a group emit calls in an audio clip. Using ABC, the plan is to generate synthetic audio clips that mimic the call emissions of bats accroding to the region they may be in - and estimate which 1) ‘scenario’ and 2) call positions best fit the observed data. By ‘scenario’ I mean which bat/s emitted a call (e.g. in 3 bats, we can have [True, False, False] when the first bat emits a call, and so on. All scenarios are stored in a (Mscenarios,Nbats) np.array or pt.tensor.var.TensorConstant.

Here I’ve been encoding the scenario with a pm.Categorical variable, where the drawn value refers to a row in the scenario array/tensor. The drawn scenario then decides which call positions are simulated. In a previous post I was messing up numpy and pytensor indexing + objects, so here I’ve tried to keep everything as consistent as possible. Despite my attempts, the synthetic-data-generating function works with pt.tensor.var.TensorVariable inputs when called on its own - but fails when called within the pm.Simulator and pm.Model() contexts. Can someone please point out why this happens?

I’m running Windows 10, pymc 5.5.0 and pytensor 2.12.3.

import numpy as np 
import pymc as pm 
import pytensor.tensor as at
import os 
os.environ['PYTHONIOENCODING']='utf-8'    

nbats = 3
# all possible scenarios where >=1 individual calls (1=call, 0=silent)
allscenarios_asnp = np.array([[1., 0., 0.],
                               [0., 1., 0.],
                               [0., 0., 1.],
                               [1., 1., 0.],
                               [1., 0., 1.],
                               [0., 1., 1.],
                               [1., 1., 1.]])

allscenarios_aspt = at.as_tensor(np.bool_(allscenarios_asnp))
# set probability of all scenarios to be the same 
num_scenarios = allscenarios_asnp.shape[0]
p_categories = np.tile(1/num_scenarios, num_scenarios)
p_categories = p_categories/p_categories.sum() # 're' normalise everything to be sure it adds to 1


# x,y,z ranges of possible emission points for each individual per row
# ordered as xmin, xmax, ymin, ymax, zmin, zmax
xyz_ranges = np.ones((nbats,6))
for batnum in range(nbats):
    xyz_ranges[batnum,:] = batnum + np.linspace(0,1,6)

def do_sound_propagation(call_positions):
    ''' dummy function to mimic sound propagation. 
    Parameters
    ----------
    call_positions : (Bbats, 3) np.array
        Where Bbats >= 1.
    Returns
    -------
    synthetic_audio : (Msamples,Nchannels) np.array
    '''
    # sound propagated
    synthetic_audio =  np.random.normal(0,1,100).reshape(20,5)
    return synthetic_audio

def generate_fake_audio(rng, scenario_number, xyz_emissions, size=None):
    '''
    generates audio based on potential call positions
    
    Parameters
    ----------
    rng :  np.random.Generator
    scenario_number : int
    xyz_emissions : (Bbats,3)
        Bbats is the max number of bats
    size : int
    
    Returns
    -------
    emitted_sounds : (Msamples, Nchannels) np.array
    '''
    # which bats are calling 
    currentscenario = allscenarios_aspt[scenario_number]
    #inds = current_scenario
    # select xyz positions of calling bats only
    actual_callpositions = xyz_emissions[currentscenario[0]]
    # do audio propagation stuff  here
    emitted_sounds = do_sound_propagation(actual_callpositions)
    return emitted_sounds


if __name__ == "__main__":
    #%% This fails
    observed_audio =  np.random.normal(0,1,100).reshape(20,5)
    with pm.Model() as mo:
        # choose which scenario is being tested
        current_scenario = pm.Categorical('current_scenario',
                                              p_categories,
                                              shape=(1,))
        
        # generate the hypothesised x,y,z for the emitting bats
        x_hyp = pm.Uniform('x_hyp', lower=xyz_ranges[:,0],
                              upper=xyz_ranges[:,1], shape=(nbats,))
        y_hyp = pm.Uniform('y_hyp', lower=xyz_ranges[:,2],
                              upper=xyz_ranges[:,3],shape=(nbats,))
        z_hyp = pm.Uniform('z_hyp', lower=xyz_ranges[:,4],
                              upper=xyz_ranges[:,5],shape=(nbats,))
        xyz_stack = pm.math.stack([x_hyp, y_hyp, z_hyp], axis=1)
        
        sim = pm.Simulator('sim', generate_fake_audio, params=[current_scenario, xyz_stack],
                                                   observed=observed_audio,
                                                   )
    with mo:
        idata = pm.sample_smc(draws=100, chains=2)
        idata.extend(pm.sample_posterior_predictive(idata))

    #%% But this works
    currscenario = pm.Categorical.dist(p_categories, shape=(1,))
    xhyp = pm.Uniform.dist(lower=xyz_ranges[:,0],
                          upper=xyz_ranges[:,1], shape=(nbats,))
    yhyp = pm.Uniform.dist(lower=xyz_ranges[:,2],
                          upper=xyz_ranges[:,3],shape=(nbats,))
    zhyp = pm.Uniform.dist(lower=xyz_ranges[:,4],
                          upper=xyz_ranges[:,5],shape=(nbats,))
    xyzstack = pm.math.stack([xhyp, yhyp, zhyp], axis=1)
    
    # Let's run the function with tensor variable inputs
    generate_fake_audio(np.random.default_rng(), currscenario, xyzstack)

The error message is below:

Initializing SMC sampler...
Sampling 2 chains in 2 jobs
 |--------------------| 0.00% [0/100 00:00<?]RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pytensor\compile\function\types.py", line 970, in __call__
    self.vm()
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pytensor\graph\op.py", line 543, in rval
    r = p(n, [x[0] for x in i], o)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pytensor\tensor\random\op.py", line 378, in perform
    smpl_val = self.rng_fn(rng, *(args + [size]))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pymc\distributions\simulator.py", line 55, in rng_fn
    return cls.fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\users\theja\documents\research_repos\pydatemm\examples\abc_peakdetection\minimum2.py", line 72, in generate_fake_audio
    # do audio propagation stuff  here
                           ^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\multiprocessing\pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pymc\smc\sampling.py", line 419, in _apply_args_and_kwargs
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pymc\smc\sampling.py", line 344, in _sample_smc_int
    smc._initialize_kernel()
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pymc\smc\kernels.py", line 246, in _initialize_kernel
    likelihoods = [self.likelihood_logp_func(sample) for sample in self.tempered_posterior]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pymc\smc\kernels.py", line 246, in <listcomp>
    likelihoods = [self.likelihood_logp_func(sample) for sample in self.tempered_posterior]
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pytensor\compile\function\types.py", line 983, in __call__
    raise_with_op(
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pytensor\link\utils.py", line 535, in raise_with_op
    raise exc_value.with_traceback(exc_trace)
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pytensor\compile\function\types.py", line 970, in __call__
    self.vm()
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pytensor\graph\op.py", line 543, in rval
    r = p(n, [x[0] for x in i], o)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pytensor\tensor\random\op.py", line 378, in perform
    smpl_val = self.rng_fn(rng, *(args + [size]))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\theja\anaconda3\envs\pymcenv\Lib\site-packages\pymc\distributions\simulator.py", line 55, in rng_fn
    return cls.fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\users\theja\documents\research_repos\pydatemm\examples\abc_peakdetection\minimum2.py", line 72, in generate_fake_audio
    # do audio propagation stuff  here
                           ^^^^^^^^^^^^
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Apply node that caused the error: Simulator_sim_rv{0, (0, 0), floatX, True}(simulator_rng, [20  5], 11, Composite{cast{int64}(roundhalftoeven(i0))}.0, Join.0)
Toposort index: 17
Inputs types: [RandomGeneratorType, TensorType(int64, shape=(2,)), TensorType(int64, shape=()), TensorType(int64, shape=(1,)), TensorType(float64, shape=(3, 3))]
Inputs shapes: ['No shapes', (2,), (), (1,), (3, 3)]
Inputs strides: ['No strides', (8,), (), (8,), (24, 8)]
Inputs values: [Generator(PCG64) at 0x276DDADBD80, array([20,  5], dtype=int64), array(11, dtype=int64), array([1], dtype=int64), 'not shown']
Outputs clients: [['output'], [Composite{sqr((i0 - i1))}(sim{[[-0.56321 ... 40216605]]}, simulator_value)]]

HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.
"""


The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  File ~\anaconda3\envs\pymcenv\Lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File c:\users\theja\documents\research_repos\pydatemm\examples\abc_peakdetection\minimum2.py:99
    idata = pm.sample_smc(draws=100, chains=2)

  File ~\anaconda3\envs\pymcenv\Lib\site-packages\pymc\smc\sampling.py:213 in sample_smc
    results = run_chains_parallel(

  File ~\anaconda3\envs\pymcenv\Lib\site-packages\pymc\smc\sampling.py:388 in run_chains_parallel
    results = _starmap_with_kwargs(

  File ~\anaconda3\envs\pymcenv\Lib\site-packages\pymc\smc\sampling.py:415 in _starmap_with_kwargs
    return pool.starmap(_apply_args_and_kwargs, args_for_starmap)

  File ~\anaconda3\envs\pymcenv\Lib\multiprocessing\pool.py:375 in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()

  File ~\anaconda3\envs\pymcenv\Lib\multiprocessing\pool.py:774 in get
    raise self._value

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
Apply node that caused the error: Simulator_sim_rv{0, (0, 0), floatX, True}(simulator_rng, [20  5], 11, Composite{cast{int64}(roundhalftoeven(i0))}.0, Join.0)
Toposort index: 17
Inputs types: [RandomGeneratorType, TensorType(int64, shape=(2,)), TensorType(int64, shape=()), TensorType(int64, shape=(1,)), TensorType(float64, shape=(3, 3))]
Inputs shapes: ['No shapes', (2,), (), (1,), (3, 3)]
Inputs strides: ['No strides', (8,), (), (8,), (24, 8)]
Inputs values: [Generator(PCG64) at 0x276DDADBD80, array([20,  5], dtype=int64), array(11, dtype=int64), array([1], dtype=int64), 'not shown']
Outputs clients: [['output'], [Composite{sqr((i0 - i1))}(sim{[[-0.56321 ... 40216605]]}, simulator_value)]]

HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

Not a full solution, but it’s failing on this line:

actual_callpositions = xyz_emissions[currentscenario[0]]

I'm not sure why you are converting scenario_number into a tensor. xyz_emissions isn’t a tensor. But maybe that points you in a useful direction?

scenario_number is already a tensor variable ‘on arrival’ into the generate_fake_audio function, which is why I index the allscenarios_aspt tensor with it for compatibility. I tried sub-setting the NumPy array allscenarios_asnp with scenario_number and this also fails - hence the switch to a tensor index variable on a 2D tensor. I wonder now if there’s a way to naturally convert scenario_number from a tensor to an integer or numpy.array - or, more generally, to go from tensor to numpy formats when the values are known and not merely symbolic.
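To make the tensor-to-numpy question concrete, here is a minimal sketch of what I mean (assuming the tensor actually carries known, constant values rather than being purely symbolic):

import numpy as np
import pytensor.tensor as at

# a tensor built from known data can be evaluated back into a numpy array
const_tensor = at.as_tensor(np.arange(6).reshape(2, 3))
as_numpy = const_tensor.eval()           # plain np.ndarray with the same values
print(type(as_numpy), as_numpy.shape)    # <class 'numpy.ndarray'> (2, 3)

# a purely symbolic variable has no data attached, so .eval() only works
# when values are supplied for its inputs
sym = at.ivector('sym')
print(sym.eval({sym: np.array([1, 2, 3], dtype='int32')}))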

To check the types of xyz_emissions and currentscenario I printed type(...) from within the function - and can confirm that, when the function is called outside of the pm.Model context, xyz_emissions and currentscenario are indeed pytensor.tensor.var.TensorVariable objects.

Here’s an example simulator function given in the approximate Bayesian computation example:

# Lotka - Volterra equation
def dX_dt(X, t, a, b, c, d):
    """Return the growth rate of fox and rabbit populations."""

    return np.array([a * X[0] - b * X[0] * X[1], -c * X[1] + d * b * X[0] * X[1]])


# simulator function
def competition_model(rng, a, b, size=None):
    return odeint(dX_dt, y0=X0, t=t, rtol=0.01, args=(a, b, c, d))

You can see that these functions expect numpy inputs. I come to this conclusion because 1) of the np.array call in the dX_dt function, and 2) odeint is from scipy.integrate, which definitely doesn’t support pytensor tensors. A third piece of evidence: if you could express your model using pytensor Ops, why would you be using a simulator at all?

So while you are inside your generate_fake_audio function, everything should be numpy. The IndexError that you are being shown is actually a numpy error: it’s upset that you are indexing a numpy array (xyz_emissions) with a pytensor tensor (currentscenario[0]).
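Here is a tiny sketch of that failure mode in isolation (an illustration only, not your exact code):

import numpy as np
import pytensor.tensor as at

xyz = np.arange(9.0).reshape(3, 3)                         # plain numpy array, like xyz_emissions at runtime
scenarios = at.as_tensor(np.array([[True, False, True]]))  # pytensor tensor, like the indexed scenario table
row = scenarios[0]                                         # still a symbolic expression, like currentscenario[0]

try:
    xyz[row]        # numpy cannot interpret a symbolic pytensor index
except (IndexError, TypeError) as err:
    print(type(err).__name__, err)

print(xyz[np.array([True, False, True])])                  # a plain numpy boolean mask works fine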


As usual, I just brute-forced it and printed type(scenario_number) from within generate_fake_audio() (and then ran pm.sample_smc()) to reach my conclusion. But hopefully @jessegrabowski's comments help!


You were right though :smiley:


Hi @jessegrabowski ,
I believe the cause of the surprise here actually comes from the ‘mechanics’ implied in the examples themselves :|.
In the ABC examples, the PyMC variable (e.g. a = pm.Normal("a", mu=0, sigma=5)) is a pt.tensor.var.Variable on definition, but ‘behaves’ like a numpy/float object within the custom function of interest. However, in the cases above, the pt.tensor.var.Variable remains a tensor even on entry into the function.

pm.Categorical is different somehow?

I’ve kind of narrowed down the potential issue. When I use a pm.Categorical variable, somehow it doesn’t ‘behave’ like a numpy/int object within the function. Below is a skeletal example:

'''
The 'model' basically returns the input values called from a dictionary.

'''
import numpy as np
import pymc as pm

#%%
randomdict_cat = {i : i for i in range(10)}
uniform_steps = np.around(np.arange(0,9.1,0.1),2)
randomdict_unif = {i: i for i in uniform_steps}

#%% Variable is from a Uniform dist - this works.
def some_model_unif(rng, a,size=None):
    rounded_value = np.around(a,1)
    return randomdict_unif[rounded_value]

with pm.Model() as mo2:
    aaunif = pm.Uniform('aaunif', lower=0, upper=9)
    sim = pm.Simulator('sim', some_model_unif, params=[aaunif], observed=2.3)
    id = pm.sample_smc()

#%% Variable is from Categorical dist - this does not work.

def some_model(rng, a,size=None):
    inta = int(a)
    # we expect a to be an int64
    return randomdict_cat[inta]


obs = np.random.normal(1,0.1,100).reshape(1,100)
p10 = np.tile(1/10,10)
p10 /= p10.sum()
with pm.Model() as mo1:
    aa = pm.Categorical('aa', p=p10)
    sim = pm.Simulator('sim', some_model, params=[aa], observed=2.3)
    id = pm.sample_smc()

The second model + function generates a KeyError - even though, by this point, we shouldn’t have a tensor variable in hand. Is anyone aware of what is causing this odd behaviour?

I’m not sure what you mean, but I ran your code fine after using allscenarios_asnp in the internal function. I also needed to convert it to integers by adding dtype='int' to the np.array. All inputs to the simulator function will come in as numerical values, not as symbolic tensors. You are accessing a global symbolic tensor variable; this is where the problem is coming from. You could also pass allscenarios as an input to the inner function as a solution.
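Roughly, the inner function I have in mind looks like this (a sketch of the idea, not the exact code I ran):

import numpy as np

# keep the scenario table as a plain numpy array, with an integer dtype
allscenarios_asnp = np.array([[1, 0, 0],
                              [0, 1, 0],
                              [0, 0, 1],
                              [1, 1, 0],
                              [1, 0, 1],
                              [0, 1, 1],
                              [1, 1, 1]], dtype='int')

def generate_fake_audio(rng, scenario_number, xyz_emissions, size=None):
    # scenario_number and xyz_emissions arrive here as numpy values
    idx = int(np.atleast_1d(scenario_number)[0])
    calling_bats = allscenarios_asnp[idx].astype(bool)   # boolean mask of calling bats
    actual_callpositions = xyz_emissions[calling_bats]
    return do_sound_propagation(actual_callpositions)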

After that I ran into a problem with Categorical sampling values that break indexing; see here for a solution.


Do you think it could help to add type hints to the examples in Simulator — PyMC 5.10.0 documentation and maybe emphasize it in the docstrings?

We did it for CustomDist because some functions expect (and are expected to return) TensorVariables and some others numpy arrays: pymc.CustomDist — PyMC dev documentation

Hi @ricardoV94, @jessegrabowski

Checking pm.Categorical outputs solves the problem.

Apologies for my late answer - conference travel preparations got in the way for a bit.
However, thanks to pointers from both of you, I was able to pin the problem down to my use of pm.Categorical - which outputs invalid values that are not within the support set on initialisation. As @jessegrabowski shared in this post, when within a pm.Model it seems pm.Categorical also draws invalid values every now and then, and this messes up any kind of indexing. However, when pm.draw is used outside of a model, all output values are within bounds.
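e.g. a quick check along these lines (a sketch):

import numpy as np
import pymc as pm

cat = pm.Categorical.dist(p=np.tile(1/7, 7))
draws = pm.draw(cat, draws=1000)
print(draws.min(), draws.max())   # stays within 0 and 6 when drawn outside a pm.Model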

My solution was along the lines of the post’s suggestion to use pm.math.clip to only input values that are sensible:

with pm.Model() as model1:
    scenario_num = pm.Categorical('scenario_num', p=np.tile(1/7,7))
    scenario_checked = pm.math.clip(scenario_num, 0,6)
    ... 
    ...

Now I have a model that runs without errors - but I’m not entirely sure whether the clipping affects any of the behind-the-scenes accounting?

@ricardoV94 : It would definitely help to have some more documentation making the input types for pm.Simulator clear. Pointing out to the user that discrete distributions may occasionally provide out-of-bounds outputs within a pm.Model would also be helpful.

It won’t affect anything. Internally the sampler is allowed to propose draws outside of the support of your categorical, but their logp is fixed to -inf, so they are always rejected. This usually works fine, but introduces errors when the draws are used as indexes. Adding the clip just “pre-rejects” the impossible proposal (-1 or 7) and replaces it with the nearest feasible one.
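You can see the -inf logp directly; a quick sketch:

import numpy as np
import pymc as pm

cat = pm.Categorical.dist(p=np.tile(1/7, 7))
print(pm.logp(cat, 3).eval())    # finite, log(1/7)
print(pm.logp(cat, 7).eval())    # -inf, so such a proposal is always rejected
print(pm.logp(cat, -1).eval())   # -inf as well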


That’s not specific to Simulator though. It has to do with the generic samplers, which don’t know anything about distribution bounds.

I think the right solution is Implement transforms for discrete variables by ricardoV94 · Pull Request #6102 · pymc-devs/pymc · GitHub
