Modeling an agent whose actions depend on expected value of future actions (Updated Question)

Hi everyone!

Disclaimer: I’m completely new to PyMC.
I am trying to model an agent that takes actions a to maximize a utility function over states s. The model I’m using is from this paper:

In my current model, though, the transitions T(s,a) are deterministic for simplicity and U is a function of states only. The goal is to infer U and \alpha from an observed sequence of actions and states.

The closest example I could find is the reinforcement learning case study, so I'm trying to stick to it as closely as possible. An important difference, however, is that in that example the agent's action probabilities are a softmax over the Q-values, which can be computed deterministically from the observed actions and rewards via the update rule Q_a = Q_a + \alpha(r - Q_a). In my case, the action probabilities are instead a softmax over the expected utility of a state-action pair, EU_s[a] = U(s) + \mathbb{E}_{a'}[EU_{s'}[a']], where s' follows deterministically from s and a, and a' is drawn from the choice distribution C(s').

UPDATE: I have changed my approach a bit, so the question and code below are no longer up to date. Please see the UPDATE further down for the current version of the question.

My question is, how would I best go about modeling this? I’m particularly confused about how to get the expectation over C(s').
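To make the recursion concrete, here is a plain-NumPy sketch of the quantity I want to compute, ignoring PyMC for a moment. It re-declares the toy environment described below, assumes a softmax choice rule C(s) \propto \exp(\alpha \, EU_s[a]) (as in the code further down), and uses a hypothetical deterministic transition(state, action) helper; the expectation over C(s') is taken explicitly by backward induction over timesteps:

import numpy as np

POSITIONS, TIMESTEPS = 10, 10
TERMINAL = TIMESTEPS * POSITIONS  # state 100
ACTIONS = [0, 1]                  # 0 = "down", 1 = "up" (just an assumption for this sketch)

def transition(state, action):
    # Hypothetical deterministic T(s, a) matching the state encoding below:
    # tens digit = timestep, ones digit = position, 100 = terminal.
    if state == TERMINAL:
        return TERMINAL
    t, pos = divmod(state, 10)
    pos = max(pos - 1, 0) if action == 0 else min(pos + 1, POSITIONS - 1)
    return TERMINAL if t + 1 >= TIMESTEPS else (t + 1) * 10 + pos

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def all_expected_utilities(U, alpha):
    # EU[s, a] for every state, filled in backwards over timesteps so that EU[s']
    # is already known when s is processed. The terminal row stays 0.
    EU = np.zeros((TERMINAL + 1, len(ACTIONS)))
    for t in reversed(range(TIMESTEPS)):
        for pos in range(POSITIONS):
            s = t * 10 + pos
            for a in ACTIONS:
                s_next = transition(s, a)
                # expectation over the next choice a' ~ C(s') = softmax(alpha * EU[s'])
                next_eu = softmax(alpha * EU[s_next]) @ EU[s_next]
                EU[s, a] = U[pos] + next_eu
    return EU

# Example: choice probabilities in the initial state for a random utility function
rng = np.random.default_rng(0)
U = rng.uniform(size=POSITIONS)
EU = all_expected_utilities(U, alpha=0.5)
print(softmax(0.5 * EU[0]))  # C(s=0), a distribution over the two actions

In PyMC/PyTensor I imagine this would turn into some kind of scan over timesteps, but that is exactly the part I am unsure about.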

My code so far looks like this:

# This is just a toy example where there are ten positions and the agent
# can move "up" and "down" using the 0 and 1 actions. After ten timesteps the episode ends.
# States encode the timestep in the tens digit and the position in the ones digit,
# so e.g. state 12 means position 2 at timestep 1, state 37 means position 7 at timestep 3,
# etc. State 100 is the terminal state where the episode ends.
# Utility is uniformly distributed between 0 and 1 and only based on position, not timestep.
 
POSITIONS = 10
TIMESTEPS = 10
NUM_STATES = TIMESTEPS * POSITIONS + 1
INITIAL_STATE = 0
ACTIONS = [0,1]

def is_terminal_state(state):
  return state == NUM_STATES - 1

# WIP: I'm wondering if this could be a reasonable approach to get the expected EU?
def expected_utility(action, state, U):
  if is_terminal_state(state):
    return 0
  u = U[state % 10]  # utility depends only on position; U has one entry per position
  next_action_eu = ...  # how to get the expectation of the next action's EU?
  return u + next_action_eu

# Code below here is inspired by 
# https://www.pymc.io/projects/examples/en/latest/case_studies/reinforcement_learning.html#estimating-the-learning-parameters-via-pymc
def pytensor_llik_td(alpha, U, actions, states):
    actions = pt.as_tensor_variable(actions, dtype="int32")
    states = pt.as_tensor_variable(states, dtype="int32")

    # Compute the EU values.
    # EUs should be of shape [# of observed actions, 2];
    # use expected_utility() to compute the EU for every step once it works.
    EUs = ...  # placeholder, not implemented yet

    # Scale by alpha and apply the (log-)softmax transformation
    EUs = EUs[:-1] * alpha
    logp_actions = EUs - pt.logsumexp(EUs, axis=1, keepdims=True)

    # Calculate the log likelihood of the observed actions
    logp_actions = logp_actions[pt.arange(actions.shape[0] - 1), actions[1:]]
    return pt.sum(logp_actions)  # PyMC expects the standard log-likelihood

with pm.Model() as m:
    alpha = pm.Uniform(name="alpha", lower=0., upper=1.)
    U = pm.Uniform("U", lower=[0.]*POSITIONS, upper=[1.]*POSITIONS)

    like = pm.Potential(name="like", var=pytensor_llik_td(alpha, U, actions, states))

    tr = pm.sample(random_seed=rng)

I would be grateful for any suggestions or pointers on how to do this. Thanks!

UPDATE

I have thought about it a bit more and tried a different approach.
To make things easier, I figured I would first build the model for a fixed number of three steps (i.e. three action choices made by the agent). Because every C(s) depends on the next choice except for the last one, I can start with the choice distribution for the final state and then iteratively define its ancestors. To do so, I used a CustomDist for the choices:

# These 3 functions are used for the custom distribution of the choice function.
# U is the utility function, which is uniform between 0 and 1 for every state.
# next_choice follows the choice distribution, hence every state's choice distribution
# depends on the next state's choice, except for the terminal state.
def expected_utility(action, state, U, next_choice):
  if is_terminal_state(state):
    return 0
  u = U[state % 10]
  next_state = transition(state, next_choice)
  next_u = U[next_state % 10]
  return u + next_u 

  
def choice_logp(action, state, U, next_choice, alpha):
  eu = expected_utility(action, state, U, next_choice)
  return alpha * eu


def choice_random(state, U, next_choice, alpha, rng=None, size=None):
  eus = np.array([expected_utility(action, state, U, next_choice) for action in ACTIONS])
  # softmax over the alpha-scaled expected utilities, matching choice_logp
  probs = np.exp(alpha * eus) / np.sum(np.exp(alpha * eus))
  return rng.choice(ACTIONS, p=probs, size=size)

# The following lines are used to create the agent model:
# In this model the agent visited a total of three states, the last of which was terminal,
# and took three actions (made three choices). Hence, there are three choice variables,
# corresponding to the first, second, and last choice. The first choice depends on the 
# second and the second depends on the last.

states = [86,97,100]
actions = [1,0,0]

agent_model_fixed_length = pm.Model()

with agent_model_fixed_length:
    alpha = pm.Uniform(name="alpha", lower=0., upper=1.)
    U = pm.Uniform("U", lower=[0.]*POSITIONS, upper=[1.]*POSITIONS)

    last_c = pm.CustomDist(
        'last_c',
        states[-1],
        U,
        pt.dscalar(),  # dummy next_choice; effectively unused because states[-1] is terminal
        alpha,
        logp=choice_logp,
        random=choice_random,
        observed=[actions[-1]],
    )
    middle_c = pm.CustomDist(
        'middle_c',
        states[-2],
        U,
        last_c,
        alpha,
        logp=choice_logp,
        random=choice_random,
        observed=[actions[-2]],
    )
    first_c = pm.CustomDist(
        'first_c',
        states[0],
        U,
        middle_c,
        alpha,
        logp=choice_logp,
        random=choice_random,
        observed=[actions[0]],
    )

When I print the graph of this model it looks as expected and I can also sample from it just fine.
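(For reference, I did those checks roughly like this; the exact calls may differ slightly:)

pm.model_to_graphviz(agent_model_fixed_length)  # choice nodes chain last_c -> middle_c -> first_c as expected

with agent_model_fixed_length:
    idata = pm.sample()  # runs without errors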

Of course, ultimately I want to model agents that made a variable number of choices. For this, I took inspiration from this question on Markov Chains and tried to make my code more general using PyTensor’s scan function:

states = [86,97,100]
actions = [1,0,0]

agent_model = pm.Model()

with agent_model:
  actions = pt.as_tensor_variable([actions], dtype="int32")
  states = pt.as_tensor_variable(states, dtype="int32")

  alpha = pm.Uniform(name="alpha", lower=0., upper=1.)
  U = pm.Uniform("U", lower=[0.]*POSITIONS, upper=[1.]*POSITIONS)

  # random and logp are defined as above.
  def choice(next_choice, state, U, alpha):
      return pm.CustomDist.dist(
        state,
        U,
        next_choice,
        alpha,
        logp=choice_logp,
        random=choice_random,
        class_name="choice",
        dtype="int32"
      )

  output, updates = pytensor.scan(fn=choice, 
                            outputs_info=[pt.ones_like(states)],
                            sequences=[states],
                            non_sequences=[U, alpha],
                            n_steps=3)

  agent_model.register_rv(output, name="choice", observed=actions)

The graph of the model now looks like this:
[screenshot of the model graph]

However, when I run pm.sample(), I get the following error:

/usr/local/lib/python3.10/dist-packages/pytensor/tensor/basic.py in _as_tensor_Variable(x, name, ndim, **kwargs)
     99 def _as_tensor_Variable(x, name, ndim, **kwargs):
    100     if not isinstance(x.type, TensorType):
--> 101         raise TypeError(
    102             f"Tensor type field must be a TensorType; found {type(x.type)}."
    103         )

TypeError: Tensor type field must be a TensorType; found <class 'pytensor.tensor.random.type.RandomGeneratorType'>.

Further, from printing the inputs to choice_logp() and expected_utility(), it also looks like they were called with a <TensorType(int32, (10,))> tensor as the value of action. However, that cannot be a valid action, since action has to be either 0 or 1.

Moreover, pm.sample_prior_predictive(100) gives the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()


During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Apply node that caused the error: CustomDist_choice_rv{0, (0, 0, 0, 0), int32, False}(*2-<RandomGeneratorType>, TensorConstant{[]}, TensorConstant{3}, *1-<TensorType(int32, (3,))>, *3-<TensorType(float64, (10,))>, *0-<TensorType(int32, ())>, *4-<TensorType(float64, ())>)
Toposort index: 0
Inputs types: [RandomGeneratorType, TensorType(int64, (0,)), TensorType(int64, ()), TensorType(int32, (3,)), TensorType(float64, (10,)), TensorType(int32, ()), TensorType(float64, ())]
Inputs shapes: ['No shapes', (0,), (), (3,), (10,), (), ()]
Inputs strides: ['No strides', (8,), (), (4,), (8,), (), ()]
Inputs values: [Generator(PCG64) at 0x7F684AF83AE0, array([], dtype=int64), array(3), array([1, 1, 1], dtype=int32), 'not shown', array(86, dtype=int32), array(0.55775481)]
Outputs clients: [[], ['output']]

Backtrace when the node is created (use PyTensor flag traceback__limit=N to make it longer):
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-63-1e78187dfaf3>", line 26, in <cell line: 6>
    output, updates = pytensor.scan(fn=choice,
  File "/usr/local/lib/python3.10/dist-packages/pytensor/scan/basic.py", line 852, in scan
    raw_inner_outputs = fn(*args)
  File "<ipython-input-63-1e78187dfaf3>", line 14, in choice
    d = pm.CustomDist.dist(
  File "/usr/local/lib/python3.10/dist-packages/pymc/distributions/distribution.py", line 1034, in dist
    return _CustomDist.dist(
  File "/usr/local/lib/python3.10/dist-packages/pymc/distributions/distribution.py", line 524, in dist
    return super().dist(
  File "/usr/local/lib/python3.10/dist-packages/pymc/distributions/distribution.py", line 389, in dist
    rv_out = cls.rv_op(*dist_params, size=create_size, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pymc/distributions/distribution.py", line 579, in rv_op
    return rv_op(*dist_params, **kwargs)

HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.


During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)

<ipython-input-62-9176fb444312> in expected_utility(action, state, U, next_choice)
      1 def expected_utility(action, state, U, next_choice):
      2   print(f"calculating eu of action {action} in state {state}")
----> 3   if is_terminal_state(state):
      4     return 0
      5   u = U[state % 10]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Apply node that caused the error: CustomDist_choice_rv{0, (0, 0, 0, 0), int32, False}(*2-<RandomGeneratorType>, TensorConstant{[]}, TensorConstant{3}, *1-<TensorType(int32, (3,))>, *3-<TensorType(float64, (10,))>, *0-<TensorType(int32, ())>, *4-<TensorType(float64, ())>)
Toposort index: 0
Inputs types: [RandomGeneratorType, TensorType(int64, (0,)), TensorType(int64, ()), TensorType(int32, (3,)), TensorType(float64, (10,)), TensorType(int32, ()), TensorType(float64, ())]
Inputs shapes: ['No shapes', (0,), (), (3,), (10,), (), ()]
Inputs strides: ['No strides', (8,), (), (4,), (8,), (), ()]
Inputs values: [Generator(PCG64) at 0x7F684AF83AE0, array([], dtype=int64), array(3), array([1, 1, 1], dtype=int32), 'not shown', array(86, dtype=int32), array(0.55775481)]
Outputs clients: [[], ['output']]

Backtrace when the node is created (use PyTensor flag traceback__limit=N to make it longer):
  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-63-1e78187dfaf3>", line 26, in <cell line: 6>
    output, updates = pytensor.scan(fn=choice,
  File "/usr/local/lib/python3.10/dist-packages/pytensor/scan/basic.py", line 852, in scan
    raw_inner_outputs = fn(*args)
  File "<ipython-input-63-1e78187dfaf3>", line 14, in choice
    d = pm.CustomDist.dist(
  File "/usr/local/lib/python3.10/dist-packages/pymc/distributions/distribution.py", line 1034, in dist
    return _CustomDist.dist(
  File "/usr/local/lib/python3.10/dist-packages/pymc/distributions/distribution.py", line 524, in dist
    return super().dist(
  File "/usr/local/lib/python3.10/dist-packages/pymc/distributions/distribution.py", line 389, in dist
    rv_out = cls.rv_op(*dist_params, size=create_size, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/pymc/distributions/distribution.py", line 579, in rv_op
    return rv_op(*dist_params, **kwargs)

HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.
Apply node that caused the error: forall_inplace,cpu,scan_fn}(TensorConstant{3}, TensorConstant{[ 86  97 100]}, IncSubtensor{InplaceSet;:int64:}.0, RandomGeneratorSharedVariable(<Generator(PCG64) at 0x7F684AF83AE0>), U, alpha)
Toposort index: 4
Inputs types: [TensorType(int8, ()), TensorType(int32, (3,)), TensorType(int32, (3, 3)), RandomGeneratorType, TensorType(float64, (10,)), TensorType(float64, ())]
Inputs shapes: [(), (3,), (3, 3), 'No shapes', (10,), ()]
Inputs strides: [(), (4,), (12, 4), 'No strides', (8,), ()]
Inputs values: [array(3, dtype=int8), array([ 86,  97, 100], dtype=int32), 'not shown', Generator(PCG64) at 0x7F684AF83AE0, 'not shown', array(0.55775481)]
Outputs clients: [['output']]

HINT: Re-running with most PyTensor optimizations disabled could provide a back-trace showing when this node was created. This can be done by setting the PyTensor flag 'optimizer=fast_compile'. If that does not work, PyTensor optimizations can be disabled with 'optimizer=None'.
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

It seems that somehow choice_random() is called with [1, 1, 1] as the value for state and that this causes the “truth value of an array with more than one element is ambiguous” error when state is compared to the terminal state.
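For illustration (this is not part of the model code, just my understanding of why that comparison fails), NumPy raises exactly this error whenever a multi-element array is used as a single truth value:

import numpy as np

state = np.array([1, 1, 1])
print(state == 100)  # elementwise comparison works fine: [False False False]
# but something like `if state == 100:` tries to reduce the array to one boolean and raises
# "ValueError: The truth value of an array with more than one element is ambiguous."

That matches the line `if is_terminal_state(state):` shown in the traceback.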

In both cases, I do not know what is causing this behavior. Presumably I am doing something wrong with how I use scan, since the case with a fixed number of choices worked. Any ideas about these errors would be appreciated. Thanks!

Welcome!

I would suggest checking out @ricardoV94's notebook as a reference, as it sounds like you are doing something similar.