Fitting Reinforcement Learning model

frrfpp · November 4, 2023, 2:09pm

Hi,
I am trying to adapt the reinforcement learning model example (Fitting a Reinforcement Learning Model to Behavioral Data with PyMC — PyMC example gallery) to a more complex model.
I am quite new to PyMC and I am having some problems writing a function to use with scan.

The task I am modelling is a 2 armed bandit task with independent rewards and punishments for each arm.
The model is a single learning rate Rescorla-Wagner type of model with a softmax decision rule parametrised by an inverse temperature.
At each trial the model chooses one of the two arms and it receives one of 4 different outcomes (reward, punishment, reward and punishment, nothing).
The model tracks 4 different beliefs (arm A reward, arm A punishment, arm B reward, arm B punishment) but only 2 of them are updated at each trial based on which arm is chosen.

import numpy as np
import jax.numpy as jnp
import arviz as az
import matplotlib.pyplot as plt
import seaborn as sns
import pymc as pm
import pytensor.tensor as pt
from pytensor import scan, function
import scipy


def simulate_model(params, beliefs, outcomes):
    '''
    params:
     - learning rate
     - inverse temperature

    beliefs = [belief_A_rew, belief_A_pun, belief_B_rew, belief_B_pun]
    outcomes = [reward_A, punish_A, reward_B, punish_B]  and are [0/1]
    '''
    learning_rate = params[0]
    inv_temp = params[1]
    
    n_trials = outcomes.shape[0]
    choices = np.zeros(n_trials)

    for t in range(n_trials):
        probA = np.exp(inv_temp * (beliefs[t, 0] - beliefs[t,1])) / (np.exp(inv_temp * (beliefs[t,0] - beliefs[t,1])) + np.exp(inv_temp* (beliefs[t,2] - beliefs[t,3]))) 

        choices[t] = int((probA > np.random.rand()) == 0)

        # choice is A
        if choices[t] == 0:
            beliefs[t+1,0] = beliefs[t,0] + learning_rate * (outcomes[t,0] - beliefs[t,0])
            beliefs[t+1,1] = beliefs[t,1] + learning_rate * (outcomes[t,1] - beliefs[t,1])
            beliefs[t+1,2] = beliefs[t,2] # don't update when not chosen
            beliefs[t+1,3] = beliefs[t,3] # don't update when not chosen
        else:
            beliefs[t+1,0] = beliefs[t,0] # don't update when not chosen
            beliefs[t+1,1] = beliefs[t,1] # don't update when not chosen
            beliefs[t+1,2] = beliefs[t,2] + learning_rate * (outcomes[t,2] - beliefs[t,2])
            beliefs[t+1,3] = beliefs[t,3] + learning_rate * (outcomes[t,3] - beliefs[t,3])

    return choices, beliefs

I am trying to follow the tutorial and I am trying to write the function to estimate the parameters using PyMC to then compare with the maximum likelihood results.

def update_model(choice, outcome, choice_probability, belief, learning_rate, inv_temp):
    if choice == 0:
        c = 0
        lr = pt.as_tensor([learning_rate, learning_rate, 0, 0])
    else:
        c = 2
        lr = pt.as_tensor([0, 0, learning_rate, learning_rate])

    choice_probability = np.exp(inv_temp * (belief[c] - belief[c+1])) / (np.exp(inv_temp * (belief[0] - belief[1])) + np.exp(inv_temp* (belief[2] - belief[3]))) 

    new_belief = belief + lr * (outcome - belief)

    return choice_probability, new_belief

# choices is [1 x n_trials]
# outcomes is [4 x n_trials]
choices_pt = pt.as_tensor_variable(choices, dtype='int32')
outcomes_pt = pt.as_tensor_variable(outcomes, dtype='int32')

learning_rate = pt.scalar('learning_rate')
inv_temp = pt.scalar('inv_temp')

choice_probablities = pt.zeros(1, dtype='float64')
beliefs = pt.zeros((1,4), dtype='float64')

results, updates = scan(
    fn=update_1lr1t, 
    sequences=[choices_pt, outcomes_pt], 
    non_sequences=[learning_rate, inv_temp],
    outputs_info=[choice_probablities, beliefs]
)

neg_loglike = -pt.sum(results[0])

pytensor_llik_td = pt.pytensor.function(
    inputs=[learning_rate, inv_temp], outputs=neg_loglike, on_unused_input="ignore"
)
result = pytensor_llik_td(0.2, 2)

The code works ok until I try to create the pytensor.function where I get the following error

TypeError: Inconsistency in the inner graph of scan ‘scan_fn’ : an input and an output are associated with the same recurrent state and should have compatible types but have type ‘Vector(float64, shape=(1,))’ and ‘Vector(float64, shape=(?,))’ respectively.

When I print the shapes of the tensors inside update_model I get

<Scalar(int32, shape=())>             # choice
<Vector(int32, shape=(?,))>           # outcome
<Vector(float64, shape=(1,))>         # choice_probability
<Matrix(float64, shape=(1, 4))>       #  belief

What is the correct way to pass a single row of outcome to scan at each iteration?
Any help is greatly appreciated!

Topic		Replies	Views
Modeling reinforcement learning of human participant using PyMC3 Questions	8	3180	June 7, 2023
Fitting a simple Reinforcement Learning model to behavioral data with PyMC (Jupyter NB) Sharing	27	3308	April 28, 2023
Some questions about PyMC Models in online settings v5	3	442	August 29, 2023
Problems when fitting a reinforcement learning model to data Questions modeling	4	510	April 20, 2022
Model convergence/estimation problem for a simple prospect theory model v5 modeling	2	303	October 19, 2023

Fitting Reinforcement Learning model

Related topics