# Conditioning in Reinforcement Learning Setup

Hi,
I am working on a reinforcement learning problem where the model runs in a loop until time runs out.
Something like this:

```python
with pm.Model():
    timeLeft = 7  # say
    while timeLeft > 0:
        # sample from a distribution and compute an outcome based on the defined action
        # store the outcome in an array
        timeLeft = timeLeft - 1
```


After I come out of the model, I get an array of outcomes of length 7.
I am interested in conditioning this array of outcomes on the actual observed array and seeing the posterior distribution of the latent variables.

How do I write the code for the conditioning part and the inference part?

If there are any existing issues that solve my problem, please point me to them.

Thanks

Conditioning usually works as an input to a logp function. For example, if the observed data come from a Gaussian distribution, you can pass the array of outcomes as `mu` and pass the actual data as `observed`.
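As an illustration of what that likelihood evaluates, here is a sketch using `scipy.stats` rather than PyMC3; the simulated outcomes and observations are made-up numbers:

```python
import numpy as np
from scipy import stats

# Hypothetical simulated outcomes (what the model loop produced)
simulated = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
# Hypothetical observed data of the same length
obs = np.array([1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 7.2])

# A Gaussian likelihood treats each observation as Normal(mu=simulated[i], sigma)
log_likelihood = stats.norm.logpdf(obs, loc=simulated, scale=0.2).sum()
print(log_likelihood)
```

PyMC3 does the same evaluation internally when you pass the simulated array as `mu` and the data as `observed`, except that `mu` is then a symbolic function of the latent variables, so the sampler can explore their posterior.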

If I understand you correctly, in the while loop you simulate a data array and you would like to match it with the actual observations? It sounds to me like you don't have a likelihood function, in which case inference can be done using approximate Bayesian computation (not available yet; we are implementing it as one of the GSoC projects this summer). A workaround for now is to assign a Gaussian likelihood with a small sigma.

Yes you understood my problem completely.

I gave it a shot with a Gaussian likelihood, but it is throwing an error. I used this specific line:

```python
y = pm.Normal('y', mu=simulated_results, sigma=0.2, observed=obs)
```


Here `simulated_results` is an array of length 7, and `obs` is also of length 7.

Code:

```python
with pm.Model():
    timeLeft = 7
    ac = 0
    calc_reach = []
    maxi = 3400
    while timeLeft > 0:
        theta1 = pm.Beta('theta1', alpha=2, beta=3)
        theta2 = pm.Beta('theta2', alpha=4, beta=20)
        old = theta1 * ac
        new = theta2 * maxi
        res = old + new
        ac = new
        timeLeft = timeLeft - 1
        calc_reach.append(res)

    y = pm.Normal('y', mu=calc_reach, sigma=0.2, observed=obs)
```


Error: `Variable name theta1_logodds__ already exists.`

I am not sure what it means. Furthermore, if I sample theta1 and theta2 above the while loop, like this:

```python
with pm.Model():
    timeLeft = 7
    ac = 0
    calc_reach = []
    maxi = 3400
    theta1 = pm.Beta('theta1', alpha=2, beta=3)
    theta2 = pm.Beta('theta2', alpha=4, beta=20)
    while timeLeft > 0:
        old = theta1 * ac
        new = theta2 * maxi
        res = old + new
        ac = new
        timeLeft = timeLeft - 1
        calc_reach.append(res)

    y = pm.Normal('y', mu=calc_reach, tau=0.2, observed=obs)

    start = pm.find_MAP()  # find a good starting point
    step = pm.Slice()      # instantiate the MCMC sampling algorithm
    trace = pm.sample(10000, step, start=start, progressbar=False)
```

Error: `Input 0 of the graph (indices start from 0), used to compute Elemwise{neg,no_inplace}(winning_logodds__), was not provided and not given a value. Use the Theano flag exception_verbosity='high', for more information on this error.`


I think the core problem lies in the likelihood statement, though I am not sure.

You should rewrite the while loop either as a matrix operation or as a theano.scan. There are some examples of using theano.scan in the docs:
http://docs.pymc.io/notebooks/PyMC3_tips_and_heuristic.html
and in the example folder.

You can also get some inspiration from the timeseries distribution implementations, where theano.scan is used as well:
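On the matrix-operation route: in the loop above, `ac` is overwritten with `theta2 * maxi` on every iteration, so this particular recurrence can be unrolled by hand. A sketch in plain numpy (with made-up values standing in for draws of theta1 and theta2) checking that the vectorized form reproduces the loop:

```python
import numpy as np

theta1, theta2 = 0.4, 0.16   # stand-ins for draws from the Beta priors
maxi, T = 3400, 7

# Loop version, mirroring the model code
ac, loop_res = 0.0, []
for _ in range(T):
    old = theta1 * ac
    new = theta2 * maxi
    loop_res.append(old + new)
    ac = new

# Unrolled version: after the first step, ac is always theta2 * maxi,
# so every subsequent result is (theta1 + 1) * theta2 * maxi
vec_res = np.full(T, (theta1 + 1.0) * theta2 * maxi)
vec_res[0] = theta2 * maxi

print(np.allclose(loop_res, vec_res))  # True
```

The same algebra carries over to the symbolic case: expressing `calc_reach` as a single tensor expression in theta1 and theta2 avoids both the Python-level loop and the need to stack a list of scalars. For recurrences that genuinely depend on the previous state, theano.scan is the general tool.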