Issue with running prediction with multinomial softmax

Hi everyone,

I am trying to build a multinomial logistic regression model for modelling sports [cricket] outcomes where I wish to lear parameters for each batter corresponding to all outcomes that can occur on a ball.

To test this, I generated some data and tried to run oos prediction on after generating the trace. However I seem to run into a shape mismatch error on account of calling pm.sample_posterior_predictive after using pm.set_data to replace the training predictors with the test predictors. It appears the size parameter being passed to CategoricalRV.rng_fn in aesara is still retaining the training size.

The trace generates perfectly okay though. Any idea what I am missing?

Here is my code:

import scipy.stats as sps
import numpy as np
import pymc as pm

#Generating fake data
full_sample_size = 200
num_batters = 120
num_outcomes = 31
p_batter = sps.dirichlet.rvs(np.ones(num_batters),size=1)[0]
batter_rv = sps.multinomial(1,p_batter).rvs(full_sample_size)
alpha_rv = sps.norm(0,1).rvs(size=num_outcomes)
beta_rv = np.array([sps.norm(0,1).rvs(size=num_outcomes) 
                                      for player in range(num_batters)])
mu_rv = alpha_rv + np.dot(batter_rv,beta_rv)
bowling_outcomes_p_rv = softmax(mu_rv,axis=0) 
bowling_outcome_encoded_rv = np.array([sps.multinomial(1,bowling_outcomes_p_rv[i]).rvs(1)[0] for i in range(full_sample_size)])
bowling_outcome_rv = np.where(bowling_outcome_encoded_rv==1)[1]

#Build the model with training data
train_idx = 120

with pm.Model() as model:
  batter_id_by_ball_and_innings_data = pm.MutableData('batter_id_by_ball_and_innings_data',
                                                      batter_rv[:train_idx])
  bowling_outcomes_by_ball_and_innings_data = pm.MutableData('bowling_outcomes_by_ball_and_innings_data',
                                                             bowling_outcome_rv[:train_idx])
  alpha_bowling_outcome_rv = pm.Normal('alpha_bowling_outcome', 
                                    mu=0, 
                                    sigma=1, 
                                    shape=bowling_outcomes_index.shape[0])
  beta_for_batter_id_and_bowling_outcome_rv = pm.Normal('beta_for_batter_id_and_bowling_outcome', 
                                                     mu=0, 
                                                     sigma=1, 
                                                     shape = (num_batters,num_outcomes))
  mu_bowling_outcomes = pm.Deterministic('mu_bowling_outcomes',
                                        at.dot(batter_id_by_ball_and_innings_data,
                                               beta_for_batter_id_and_bowling_outcome_rv) + alpha_bowling_outcome_rv)
  probability_of_bowling_outcome = pm.Deterministic('probability_of_bowling_outcome',
                                                    at.nnet.softmax(mu_bowling_outcomes))
  bowling_outcomes_by_ball_and_innings_rv = pm.Categorical('bowling_outcomes_by_ball_and_innings_rv', 
                                                           p = probability_of_bowling_outcome,
                                                           observed = bowling_outcomes_by_ball_and_innings_data)


#Generate the trace
RANDOM_SEED = 3456
with model:
    inf_data = pm.sample(random_seed=RANDOM_SEED)

#Getting predictions using test data
with model:
    pm.set_data({
        'batter_id_by_ball_and_innings_data': batter_rv[train_idx:]
    })
    inf_data = pm.sample_posterior_predictive(
            inf_data,
            predictions=True,
            extend_inferencedata=True,
            random_seed=RANDOM_SEED,
    )

Stack Trace of Error:

ValueError                                Traceback (most recent call last)
File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/compile/function/types.py:971, in Function.__call__(self, *args, **kwargs)
    969 try:
    970     outputs = (
--> 971         self.vm()
    972         if output_subset is None
    973         else self.vm(output_subset=output_subset)
    974     )
    975 except Exception:

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/graph/op.py:543, in Op.make_py_thunk.<locals>.rval(p, i, o, n, params)
    539 @is_thunk_type
    540 def rval(
    541     p=p, i=node_input_storage, o=node_output_storage, n=node, params=None
    542 ):
--> 543     r = p(n, [x[0] for x in i], o)
    544     for o in node.outputs:

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/tensor/random/op.py:368, in RandomVariable.perform(self, node, inputs, outputs)
    366 rng_var_out[0] = rng
--> 368 smpl_val = self.rng_fn(rng, *(args + [size]))
    370 if (
    371     not isinstance(smpl_val, np.ndarray)
    372     or str(smpl_val.dtype) != out_var.type.dtype
    373 ):

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/tensor/random/basic.py:1643, in CategoricalRV.rng_fn(cls, rng, p, size)
   1642 unif_samples = rng.uniform(size=size)
-> 1643 samples = vsearchsorted(p.cumsum(axis=-1), unif_samples)
   1645 return samples

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/function_base.py:2328, in vectorize.__call__(self, *args, **kwargs)
   2326     vargs.extend([kwargs[_n] for _n in names])
-> 2328 return self._vectorize_call(func=func, args=vargs)

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/function_base.py:2402, in vectorize._vectorize_call(self, func, args)
   2401 if self.signature is not None:
-> 2402     res = self._vectorize_call_with_signature(func, args)
   2403 elif not args:

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/function_base.py:2430, in vectorize._vectorize_call_with_signature(self, func, args)
   2428 args = tuple(asanyarray(arg) for arg in args)
-> 2430 broadcast_shape, dim_sizes = _parse_input_dimensions(
   2431     args, input_core_dims)
   2432 input_shapes = _calculate_shapes(broadcast_shape, dim_sizes,
   2433                                  input_core_dims)

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/function_base.py:2090, in _parse_input_dimensions(args, input_core_dims)
   2089     broadcast_args.append(dummy_array)
-> 2090 broadcast_shape = np.lib.stride_tricks._broadcast_shape(*broadcast_args)
   2091 return broadcast_shape, dim_sizes

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/stride_tricks.py:422, in _broadcast_shape(*args)
    420 # use the old-iterator because np.nditer does not handle size 0 arrays
    421 # consistently
--> 422 b = np.broadcast(*args[:32])
    423 # unfortunately, it cannot handle 32 or more arguments directly

ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (80,) and arg 1 with shape (120,).

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In [116], line 5
      1 with model:
      2     pm.set_data({
      3         'batter_id_by_ball_and_innings_data': batter_rv[train_idx:]
      4     })
----> 5     inf_data = pm.sample_posterior_predictive(
      6             inf_data,
      7             predictions=True,
      8             extend_inferencedata=True,
      9             random_seed=RANDOM_SEED,
     10     )

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/pymc/sampling.py:1996, in sample_posterior_predictive(trace, samples, model, var_names, keep_size, random_seed, progressbar, return_inferencedata, extend_inferencedata, predictions, idata_kwargs, compile_kwargs)
   1991 # there's only a single chain, but the index might hit it multiple times if
   1992 # the number of indices is greater than the length of the trace.
   1993 else:
   1994     param = _trace[idx % len_trace]
-> 1996 values = sampler_fn(**param)
   1998 for k, v in zip(vars_, values):
   1999     ppc_trace_t.insert(k.name, v, idx)

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/pymc/util.py:366, in point_wrapper.<locals>.wrapped(**kwargs)
    364 def wrapped(**kwargs):
    365     input_point = {k: v for k, v in kwargs.items() if k in ins}
--> 366     return core_function(**input_point)

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/compile/function/types.py:984, in Function.__call__(self, *args, **kwargs)
    982     if hasattr(self.vm, "thunks"):
    983         thunk = self.vm.thunks[self.vm.position_of_error]
--> 984     raise_with_op(
    985         self.maker.fgraph,
    986         node=self.vm.nodes[self.vm.position_of_error],
    987         thunk=thunk,
    988         storage_map=getattr(self.vm, "storage_map", None),
    989     )
    990 else:
    991     # old-style linkers raise their own exceptions
    992     raise

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/link/utils.py:534, in raise_with_op(fgraph, node, thunk, exc_info, storage_map)
    529     warnings.warn(
    530         f"{exc_type} error does not allow us to add an extra error message"
    531     )
    532     # Some exception need extra parameter in inputs. So forget the
    533     # extra long error message in that case.
--> 534 raise exc_value.with_traceback(exc_trace)

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/compile/function/types.py:971, in Function.__call__(self, *args, **kwargs)
    968 t0_fn = time.time()
    969 try:
    970     outputs = (
--> 971         self.vm()
    972         if output_subset is None
    973         else self.vm(output_subset=output_subset)
    974     )
    975 except Exception:
    976     restore_defaults()

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/graph/op.py:543, in Op.make_py_thunk.<locals>.rval(p, i, o, n, params)
    539 @is_thunk_type
    540 def rval(
    541     p=p, i=node_input_storage, o=node_output_storage, n=node, params=None
    542 ):
--> 543     r = p(n, [x[0] for x in i], o)
    544     for o in node.outputs:
    545         compute_map[o][0] = True

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/tensor/random/op.py:368, in RandomVariable.perform(self, node, inputs, outputs)
    364     rng = copy(rng)
    366 rng_var_out[0] = rng
--> 368 smpl_val = self.rng_fn(rng, *(args + [size]))
    370 if (
    371     not isinstance(smpl_val, np.ndarray)
    372     or str(smpl_val.dtype) != out_var.type.dtype
    373 ):
    374     smpl_val = _asarray(smpl_val, dtype=out_var.type.dtype)

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/aesara/tensor/random/basic.py:1643, in CategoricalRV.rng_fn(cls, rng, p, size)
   1640             raise ValueError("`size` is incompatible with the shape of `p`")
   1642 unif_samples = rng.uniform(size=size)
-> 1643 samples = vsearchsorted(p.cumsum(axis=-1), unif_samples)
   1645 return samples

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/function_base.py:2328, in vectorize.__call__(self, *args, **kwargs)
   2325     vargs = [args[_i] for _i in inds]
   2326     vargs.extend([kwargs[_n] for _n in names])
-> 2328 return self._vectorize_call(func=func, args=vargs)

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/function_base.py:2402, in vectorize._vectorize_call(self, func, args)
   2400 """Vectorized call to `func` over positional `args`."""
   2401 if self.signature is not None:
-> 2402     res = self._vectorize_call_with_signature(func, args)
   2403 elif not args:
   2404     res = func()

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/function_base.py:2430, in vectorize._vectorize_call_with_signature(self, func, args)
   2425     raise TypeError('wrong number of positional arguments: '
   2426                     'expected %r, got %r'
   2427                     % (len(input_core_dims), len(args)))
   2428 args = tuple(asanyarray(arg) for arg in args)
-> 2430 broadcast_shape, dim_sizes = _parse_input_dimensions(
   2431     args, input_core_dims)
   2432 input_shapes = _calculate_shapes(broadcast_shape, dim_sizes,
   2433                                  input_core_dims)
   2434 args = [np.broadcast_to(arg, shape, subok=True)
   2435         for arg, shape in zip(args, input_shapes)]

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/function_base.py:2090, in _parse_input_dimensions(args, input_core_dims)
   2088     dummy_array = np.lib.stride_tricks.as_strided(0, arg.shape[:ndim])
   2089     broadcast_args.append(dummy_array)
-> 2090 broadcast_shape = np.lib.stride_tricks._broadcast_shape(*broadcast_args)
   2091 return broadcast_shape, dim_sizes

File ~/PycharmProjects/SouthridgeCorp/player-outcome-predictor/venv/lib/python3.9/site-packages/numpy/lib/stride_tricks.py:422, in _broadcast_shape(*args)
    417 """Returns the shape of the arrays that would result from broadcasting the
    418 supplied arrays against each other.
    419 """
    420 # use the old-iterator because np.nditer does not handle size 0 arrays
    421 # consistently
--> 422 b = np.broadcast(*args[:32])
    423 # unfortunately, it cannot handle 32 or more arguments directly
    424 for pos in range(32, len(args), 31):
    425     # ironically, np.broadcast does not properly handle np.broadcast
    426     # objects (it treats them as scalars)
    427     # use broadcasting to avoid allocating the full array

ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (80,) and arg 1 with shape (120,).
Apply node that caused the error: categorical_rv{0, (1,), int64, True}(RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7FC37D1024A0>), MakeVector{dtype='int64'}.0, TensorConstant{4}, probability_of_bowling_outcome)
Toposort index: 4
Inputs types: [RandomGeneratorType, TensorType(int64, (1,)), TensorType(int64, ()), TensorType(float64, (None, None))]
Inputs shapes: ['No shapes', (1,), (), (80, 31)]
Inputs strides: ['No strides', (8,), (), (248, 8)]
Inputs values: [Generator(PCG64) at 0x7FC37D1024A0, array([120]), array(4), 'not shown']
Outputs clients: [['output'], ['output']]

HINT: Re-running with most Aesara optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the Aesara flag 'optimizer=fast_compile'. If that does not work, Aesara optimizations can be disabled with 'optimizer=None'.
HINT: Use the Aesara flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

Further update:

If I set train_idx = 100 so that the shape of training and testing data is equal, then the error does not occur.

Does this mean I am misinterpreting how I should be using set_data, or could this be a bug where the size is just being passed on from the training data set?

You need to specify shape manually if you want it to automatically resize with the shape of its arguments. Look at the first example in the docs: pymc.set_data — PyMC 4.3.0+0.gea721e4a.dirty documentation

Thanks @ricardoV94 that fixes it. Changed this line:

  bowling_outcomes_by_ball_and_innings_rv = pm.Categorical('bowling_outcomes_by_ball_and_innings_rv', 
                                                           p = probability_of_bowling_outcome,
                                                           observed = bowling_outcomes_by_ball_and_innings_data,
                                                           shape = batter_id_by_ball_and_innings_data.shape[0])```