I’m trying to implement Bayesian Q-learning in PyMC3. I keep a Q-table in which each cell has two random variables: mu, the estimated Q-value of a given state-action pair, and sd, the “certainty” around that value.

The problem I’m having is that each time step yields only one (state, action, reward) example, so the observed variable is a very sparse 3-dimensional tensor: [time x number of actions x number of states]. The code of my dense model looks like this:

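```
# assumes: import numpy as np; import pymc3 as pm
with self.model:
    # One mean and one log-scale sd per (action, state) cell; 2 actions, self.D states
    Qmus = pm.Normal("Qmus", mu=0., sd=1., shape=[2, self.D])
    Qsds = pm.Normal("Qsds", mu=0., sd=1., shape=[2, self.D])
    # exp keeps the sd positive; full_tensor is [time x actions x states]
    pm.Normal('Qtable', mu=Qmus, sd=pm.math.exp(Qsds), observed=full_tensor)
    # Mean-field ADVI, then draw samples from the fitted approximation
    mean_field = pm.fit(n=2500, method='advi', obj_optimizer=pm.adam(learning_rate=.1))
    self.trace = mean_field.sample(5000)
```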

This of course doesn’t work well: the model treats every unvisited cell as an observed reward of 0, when in fact those state-action pairs were simply not visited at that time step.
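To make the issue concrete, here is a minimal NumPy sketch with toy data. The sizes, the `actions`/`states`/`rewards` arrays, and the `normal_logpdf` helper are made up for illustration and are not part of my actual model. It compares the log-likelihood summed over the whole dense tensor with the one summed over only the visited cells:

```python
import numpy as np

# Hypothetical toy episode: 2 actions, 3 states, 4 time steps.
n_actions, n_states = 2, 3
actions = np.array([0, 1, 1, 0])            # action taken at each step
states = np.array([2, 0, 1, 2])             # state visited at each step
rewards = np.array([1.0, 0.0, -0.5, 2.0])   # reward received at each step
T = len(rewards)

# Dense layout from the post: [time x actions x states], 0 where nothing was observed.
full_tensor = np.zeros((T, n_actions, n_states))
full_tensor[np.arange(T), actions, states] = rewards

# Boolean mask marking the single visited cell per time step.
visited = np.zeros(full_tensor.shape, dtype=bool)
visited[np.arange(T), actions, states] = True

def normal_logpdf(x, mu, sd):
    """Elementwise log-density of a Normal(mu, sd)."""
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mu) / sd) ** 2

# Point values standing in for the random variables (sd stored on the log scale).
Qmus = np.zeros((n_actions, n_states))
Qsds = np.zeros((n_actions, n_states))

# Dense likelihood: every unvisited cell contributes a spurious "reward = 0" term.
dense_ll = normal_logpdf(full_tensor, Qmus, np.exp(Qsds)).sum()

# Visited-only likelihood, written two equivalent ways:
# (a) mask the dense tensor, (b) index the Q-table by (action, state) per step.
masked_ll = normal_logpdf(full_tensor, Qmus, np.exp(Qsds))[visited].sum()
indexed_ll = normal_logpdf(
    rewards, Qmus[actions, states], np.exp(Qsds[actions, states])
).sum()
```

What I am after is the PyMC3 equivalent of `indexed_ll`: a likelihood with one term per time step rather than one per tensor cell. I imagine the same integer arrays could index the PyMC3 variables (something like `Qmus[actions, states]`), but I have not verified that this is the recommended approach.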

What would be a better way to do this? Can we somehow update one “cell” of a multidimensional normal variable at a time? The post “Using sparse matrices as observed in DensityDist” seems related to my problem, but I’m having a hard time understanding the answer.