Hi everyone!

Disclaimer: I’m completely new to PyMC.

I am trying to model an agent that takes actions a to maximize a utility function over states s. The model I’m using is from this paper:

In my current model, though, transitions T(s,a) are deterministic for simplicity, and U is a function of states only. The goal is to infer U and \alpha from an observed sequence of actions and states.

The closest example I could find is the reinforcement learning case study, so I’m trying to stick to it as closely as possible. However, there is an important difference: in that example, the agent’s action probabilities are proportional to the Q-values, which can be computed deterministically from the observed actions and rewards via the update rule Q_a = Q_a + \alpha(r - Q_a). In my case, the action probabilities are instead proportional to the expected utility of a (state, action) pair: EU_s[a] = U(s) + \mathbb{E}_{a'}[EU_{s'}[a']], where s' follows deterministically from s and a, and a' is drawn from the agent’s action distribution C(s').
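To make the difference concrete, here is a sketch of why the case study’s approach works: given \alpha and the observed data, the Q-values are fully determined by a simple loop, with no expectation over future choices. (The actions and rewards below are made up for illustration.)

```python
import numpy as np

alpha = 0.3  # example learning rate
actions = [0, 1, 0, 0, 1]            # hypothetical observed actions
rewards = [1.0, 0.0, 1.0, 0.0, 1.0]  # hypothetical observed rewards

# Given alpha and the observed (action, reward) pairs, the Q-values
# are fully determined -- a deterministic forward pass over the data.
Q = np.zeros(2)
for a, r in zip(actions, rewards):
    Q[a] = Q[a] + alpha * (r - Q[a])
```

In my setting there is no such purely data-driven recursion, because the EU of an action depends on the distribution over future actions.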

**UPDATE:** I have changed my approach a bit, so the question and code below are no longer up to date. Please look at my reply for the up-to-date version of the question.

My question is: how would I best go about modeling this? I’m particularly confused about how to compute the expectation over C(s').
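For what it’s worth, if C(s') is the same softmax-in-\alpha policy that my log-likelihood below already assumes, then the expectation is just the softmax-weighted average of the next state’s EU values. A minimal NumPy sketch of that one step (the numbers are made up):

```python
import numpy as np

def softmax(x):
    # numerically stable softmax
    e = np.exp(x - np.max(x))
    return e / e.sum()

alpha = 2.0                      # example inverse-temperature
eu_next = np.array([0.4, 0.9])   # hypothetical EU_{s'}[a'] for the two actions

# a' ~ C(s') = softmax(alpha * EU_{s'}), so the expectation is a weighted sum
probs = softmax(alpha * eu_next)
expected_next_eu = probs @ eu_next
```

I’m unsure whether this is the right reading of the model, and how to express the full recursion in pytensor.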

My code so far looks like this:

```
# This is just a toy example where there are ten positions and the agent
# can move "up" and "down" using the 0 and 1 actions. After ten timesteps the episode ends.
# States encode the timestep in the tens digit, so e.g. state 12 means position 2 at timestep 1,
# state 37 means position 7 at timestep 3, etc. State 100 is the terminal state where the episode ends.
# Utility is uniformly distributed between 0 and 1 and depends only on position, not timestep.
import numpy as np
import pymc as pm
import pytensor.tensor as pt

rng = np.random.default_rng(123)

POSITIONS = 10
TIMESTEPS = 10
NUM_STATES = TIMESTEPS * POSITIONS + 1
INITIAL_STATE = 0
ACTIONS = [0, 1]


def is_terminal_state(state):
    return state == NUM_STATES - 1


# WIP: I'm wondering if this could be a reasonable approach to get the expected EU?
def expected_utility(action, state, U):
    if is_terminal_state(state):
        return 0
    u = U[state % POSITIONS]  # utility depends only on the position digit
    next_action_eu = ...  # how to get the expectation of the next action's EU?
    return u + next_action_eu


# Code below here is inspired by
# https://www.pymc.io/projects/examples/en/latest/case_studies/reinforcement_learning.html#estimating-the-learning-parameters-via-pymc
def pytensor_llik_td(alpha, U, actions, states):
    actions = pt.as_tensor_variable(actions, dtype="int32")
    states = pt.as_tensor_variable(states, dtype="int32")

    # Compute the EUs; EUs should be of shape [# of observed actions, 2].
    # Use expected_utility() to compute the EU for every step once it works.

    # Apply the softmax transformation
    EUs = EUs[:-1] * alpha
    logp_actions = EUs - pt.logsumexp(EUs, axis=1, keepdims=True)

    # Calculate the log likelihood of the observed actions
    logp_actions = logp_actions[pt.arange(actions.shape[0] - 1), actions[1:]]
    return pt.sum(logp_actions)  # PyMC expects the standard log-likelihood


# `actions` and `states` hold the observed trajectory (not shown here)
with pm.Model() as m:
    alpha = pm.Uniform(name="alpha", lower=0.0, upper=1.0)
    U = pm.Uniform("U", lower=[0.0] * POSITIONS, upper=[1.0] * POSITIONS)
    like = pm.Potential(name="like", var=pytensor_llik_td(alpha, U, actions, states))
    tr = pm.sample(random_seed=rng)
```
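One idea I’ve been toying with for filling in the missing EU computation: since the horizon is finite and transitions are deterministic, the whole EU table could be built by backward induction over timesteps — start at the last timestep, then fill each earlier timestep using the softmax-weighted expectation over the next one. A NumPy sketch of that logic (the transition function, `alpha`, and the sampled `U` are my own placeholder assumptions; this would still need to be ported to pytensor):

```python
import numpy as np

POSITIONS, TIMESTEPS = 10, 10
rng = np.random.default_rng(0)
U = rng.uniform(0.0, 1.0, POSITIONS)  # hypothetical utilities per position
alpha = 2.0                           # example inverse-temperature

def next_position(pos, action):
    # deterministic transition: action 1 moves up, action 0 moves down (clamped at edges)
    return max(0, min(POSITIONS - 1, pos + (1 if action == 1 else -1)))

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# EU[t, p, a] = expected utility of taking action a at position p, timestep t
EU = np.zeros((TIMESTEPS, POSITIONS, 2))
for t in reversed(range(TIMESTEPS)):
    for p in range(POSITIONS):
        for a in (0, 1):
            if t == TIMESTEPS - 1:
                future = 0.0  # episode ends: the terminal state contributes nothing
            else:
                p2 = next_position(p, a)
                # expectation over a' ~ C(s') = softmax(alpha * EU[t+1, p2])
                probs = softmax(alpha * EU[t + 1, p2])
                future = probs @ EU[t + 1, p2]
            EU[t, p, a] = U[p] + future
```

Given such a table, the log-probability of each observed action at timestep t and position p would come from softmaxing alpha * EU[t, p], as in the log-likelihood function above — but I don’t know if this is a sensible way to structure it inside a PyMC model.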

I would be grateful for any suggestions or pointers on how to do this. Thanks!