Hello, I am using a Gaussian process to model an output (the reward) as a function of inputs (the states). According to the Bellman equation, the reward at the current step should satisfy this formula:
R_{T+1} = Q_{T} - γ Q_{T+1}.
My goal is to estimate the value functions Q_{T} and Q_{T+1} as latent variables, so at the first time step, where I have only one output, I need to estimate these two latent variables. I understand that their individual values cannot be accurate from a single observation, but shouldn't the difference of their posterior means, weighted by the discount factor γ, at least recover the output value from the first step, regardless of the noise?
This is how I am coding the mean function:
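To make the question concrete, here is a minimal NumPy sketch of the simplest linear-Gaussian version of this setup. It is not my actual model: it assumes a zero-mean Gaussian prior over [Q_{T}, Q_{T+1}] with a made-up covariance `K`, additive Gaussian noise on the reward, and made-up values for `gamma` and `r_obs`. The helper `recovered_reward` is a hypothetical name used only for this illustration.

```python
import numpy as np

gamma = 0.9   # discount factor (assumed value)
r_obs = 1.0   # the single observed reward R_{T+1} (assumed value)

# Assumed zero-mean Gaussian prior covariance over [Q_T, Q_{T+1}].
K = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# Observation model: R_{T+1} = h @ Q + noise, with h = [1, -gamma].
h = np.array([1.0, -gamma])


def recovered_reward(noise_var):
    """Return post_mean[Q_T] - gamma * post_mean[Q_{T+1}] after one observation."""
    s = h @ K @ h + noise_var        # marginal variance of the observed reward
    post_mean = K @ h / s * r_obs    # Gaussian posterior mean of [Q_T, Q_{T+1}]
    return h @ post_mean


noisy = recovered_reward(0.1)       # with observation noise
noiseless = recovered_reward(0.0)   # without observation noise
print(noisy, noiseless)
```

Under these assumptions the combination equals `r_obs` scaled by `(h @ K @ h) / (h @ K @ h + noise_var)`, so it matches the observed reward exactly only when the noise variance is zero and is otherwise pulled toward the prior mean.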
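import pytensor.tensor as pt  # assuming `pt` is the usual PyTensor alias

def custum_nonparametric_mean_function(a_v_function_params, disc_factor):
    # Entry i is Q_i - γ Q_{i+1}, i.e. the mean of the reward observed at step i+1.
    n = len(a_v_function_params)
    return pt.as_tensor_variable([a_v_function_params[i] - disc_factor * a_v_function_params[i + 1] for i in range(n - 1)])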
Here a_v_function_params is a list of the priors for the latent value-function parameters; it grows by one entry at each time step.
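For reference, this is a plain NumPy analogue of what the mean function is meant to compute, using concrete numbers instead of priors. The name `bellman_mean_numpy` and the example values are made up for illustration; a list of n value estimates gives n - 1 reward means.

```python
import numpy as np


def bellman_mean_numpy(q_values, disc_factor):
    """Return the vector with entries Q_i - disc_factor * Q_{i+1}."""
    q = np.asarray(q_values, dtype=float)
    # q[:-1] holds Q_0..Q_{n-2}; q[1:] holds Q_1..Q_{n-1}.
    return q[:-1] - disc_factor * q[1:]


# Three value estimates give two reward means.
example = bellman_mean_numpy([1.0, 2.0, 3.0], 0.5)
print(example)
```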