Custom implementation of epsilon, sum_stat in simulator?

Hello :slight_smile:

Shorter version:
I want to implement a custom epsilon in my simulator to model within-simulation variability for the same parameters.

  1. I am not sure if this is taken care of by the smc algorithm by default?
  2. I don't know the right syntax to define this epsilon if the numpy array depends on the current parameters and therefore needs to call the simulator function.
  3. What is the syntax for additional parameters in the simulator function that should not be predicted by the model?

Instead of my_simulator(rng,a,b,c,size=None), something like my_simulator(rng,a,b,c,H,E,L,P,size=None), where only a, b, and c should be predicted?

Longer version:

  1. The problem:
    I am using the smc_sampler to estimate 3 parameters, a, b and c, which shape my data.
    While a and b are predicted quite well, the third parameter is estimated quite poorly.

By working on the same problem beforehand with methods other than pymc (a self-written, non-fancy proof-of-concept MCMC that does the job but takes quite a long time),
I know that - by design of the problem - a better prediction is possible. I am, however, struggling to find the right implementation within pymc.

How I got good predictions without pymc:
In my self-written method I used the same simulator function to create my data, and in each step of the chain I compared the simulated data to the observed data.
But instead of computing a distance between one simulated instance under the currently suggested parameters and the real data, as implemented in the pymc distance functions (like gaussian and laplace),
I simulated 10 datasets for the same set of current parameters. From these I obtained confidence intervals and could check whether the observed data lies within the uncertainty
obtained for the parameters. With this I ended up with a pretty accurate estimate of c.
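The repeated-simulation idea can be sketched in plain numpy like this (toy_simulator is a hypothetical stand-in for the real simulator, which actually depends on a, b, c and some extra constants):

```python
import numpy as np

def toy_simulator(rng, a, b, c, size=100):
    # Hypothetical stand-in for the real simulator: returns a
    # 1-d array of 100 "bins", like the observed histogram data
    return rng.normal(a + 0.01 * b, 1.0 + 0.001 * c, size)

def simulate_many(rng, a, b, c, n_reps=10):
    # Run the simulator several times for the SAME parameters and
    # summarize the within-parameter variability per bin
    sims = np.stack([toy_simulator(rng, a, b, c) for _ in range(n_reps)])
    return sims.mean(axis=0), sims.std(axis=0)

rng = np.random.default_rng(42)
mean_per_bin, std_per_bin = simulate_many(rng, 0.5, 100, 200)
# A simple acceptance check would then ask whether the observed data
# lies within the simulated uncertainty band (e.g. mean +/- 2 std)
```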

It seems to me as if using this variability from simulating the data several times
for a given parameter set is not implemented in the smc_sampler method - is this correct, or am I missing something?

Because of this I assume that the pymc smc_sampler struggles to find the best c prediction.

  2. Possible solutions:

I played around with the epsilon value as a scalar, but it doesn't seem to do a good job.
For epsilon=1 I already get good predictions for a and b.
For epsilon < 0.5 and smaller, the algorithm runs for a very long time and the c prediction does not change significantly.

Now it seems like there is a way to utilize the custom options for sum_stat and epsilon to get closer to my solution:
My idea is to indirectly use the same information as I used before implementing my problem in pymc:

Give the smc_sampler the standard deviation obtained from 10 simulations under the current parameters, in the form of an epsilon array.
I read in the simulator documentation (Simulator — PyMC dev documentation)
that feeding the smc_sampler a custom 1-d numpy array epsilon is possible, in order to reflect different variation
among summary statistics. This implies that I would also need summary statistics of the same dimension.
In my case this would actually just be one representative run of the simulated data.
I could not find any example of a custom numpy array for sum_stat and epsilon within pymc.
Since I need to call my simulator function within the model, I am not sure which syntax to use, similarly to
how the simulator function (my_simulator) needs to have a very specific form to be compatible with pm.Simulator.

# Choose a set of true parameters for the observed distribution, called 'real_observation'

# Defining additional parameters
# Format: the observed data is a 1-dim numpy array of 100 bins.
# The order of the data is important because it represents histogram data of a bimodal distribution.

# Define my simulator function as described in the new version of pymc,
# with the parameters placed between rng and size.
# H, E, L, P and create_simulation_sample are defined elsewhere.
def my_simulator(rng, a, b, c, size=None):
    sim = create_simulation_sample(a, b, c, H, E, L, P)
    return sim

with pm.Model() as model1:
    a = pm.Uniform("a", lower=0.0, upper=1.0)
    b = pm.DiscreteUniform("b", lower=0, upper=1000)
    c = pm.DiscreteUniform("c", lower=0, upper=1000)

    # Intended: compute mean and std of 10 simulations under the current a, b, c,
    # returning two numpy arrays of the same dimension as real_observation.
    # epsilon_array would hold the per-bin std (this is the part I am unsure about).
    s = pm.Simulator("s", my_simulator, params=[a, b, c],
                     distance="gaussian", sum_stat=sum_stat_data,
                     epsilon=epsilon_array, observed=real_observation)

    trace_model1 = pm.sample_smc()
  3. My final question is:
    Is there a way to add my additional arguments (such as H, E, L, P) within my_simulator?
    I would like to change those parameters for different runs in the future, but it seems as if pm.Simulator can only work with this strict signature.

You should be able to pass more parameters to your simulator, even if they are constants. Try wrapping them in pytensor.tensor.as_tensor() if you are getting errors.
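If the extra arguments really are plain Python/NumPy constants, another option besides pt.as_tensor is to close over them, so the callable keeps the (rng, a, b, c, size) form that pm.Simulator expects while the constants stay swappable between runs. A sketch with hypothetical values for H, E, L, P and a stand-in for create_simulation_sample:

```python
import numpy as np

# Hypothetical constant values; in the real problem these come from elsewhere
H, E, L, P = 10.0, 0.5, 3.0, 7.0

def create_simulation_sample(a, b, c, H, E, L, P, rng, size=100):
    # Stand-in for the real simulation routine
    return rng.normal(a * H + E, 1.0 + (b + c) / (L * P), size)

def make_simulator(H, E, L, P):
    # Freeze the constants; the returned function has the signature
    # pm.Simulator expects, and the constants can be swapped per run
    def my_simulator(rng, a, b, c, size=None):
        return create_simulation_sample(a, b, c, H, E, L, P, rng)
    return my_simulator

my_simulator = make_simulator(H, E, L, P)
sample = my_simulator(np.random.default_rng(0), 0.5, 100, 200)
```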

sum_stat should be a function that summarizes the observed data and the one returned from the simulator. The distance function is then computed on the two summaries (and scaled by epsilon) to define the simulator likelihood.
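Roughly speaking, the "gaussian" distance scales each summary statistic's mismatch by epsilon, so a per-statistic epsilon array lets some summaries be matched more tightly than others. A numpy sketch of that idea (not PyMC's exact internal code):

```python
import numpy as np

def gaussian_distance(epsilon, obs_sum, sim_sum):
    # Gaussian-kernel style (log-)distance: each summary statistic's
    # mismatch is divided by its own epsilon before being penalized
    return -0.5 * ((obs_sum - sim_sum) / epsilon) ** 2

obs_summary = np.array([1.0, 2.0, 3.0])
sim_summary = np.array([1.1, 1.9, 3.5])
epsilon = np.array([0.1, 0.1, 1.0])  # per-statistic tolerance

d = gaussian_distance(epsilon, obs_summary, sim_summary)
# The third statistic, with the largest epsilon, is penalized least
```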

It may make things easier for debugging to simply return the mean/std from your simulator, and also pass the mean/std as the observations. Then you can set sum_stat = "identity", since there's no further processing to do.
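That suggestion could look roughly like this: the simulator itself returns the concatenated mean/std of several runs, the observations are summarized the same way, and sum_stat="identity" then compares them directly (toy_simulator is again a hypothetical stand-in):

```python
import numpy as np

def toy_simulator(rng, a, size=100):
    # Hypothetical stand-in for the real simulator
    return rng.normal(a, 1.0, size)

def mean_std_simulator(rng, a, size=None, n_reps=10):
    # Return mean and std of n_reps simulations, concatenated into one
    # array; the observed data must be summarized in the same format
    sims = np.stack([toy_simulator(rng, a) for _ in range(n_reps)])
    return np.concatenate([sims.mean(axis=0), sims.std(axis=0)])

rng = np.random.default_rng(1)
summary = mean_std_simulator(rng, 0.5)
# With the observations summarized identically, pm.Simulator can use
# sum_stat="identity" and compare the two summaries element-wise
```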


Dear Ricardo,

Thank you for your super quick reply! :slight_smile:
For my problem this implies that I can solve it in only two ways. Please let me know if I understood it correctly. I wonder if there is a pymc trick that would solve this in a more elegant way, because I am convinced that my solution for the problem at hand - namely creating a custom-epsilon that is updated each time with the parameters and a suitable simulated data array represented by the mean of several simulations instead of just one - could be useful to improve the prediction accuracy for many problems.

I now understand that sum_stat needs to be a function that is applied once to the observed data and once to the simulated data, computing the same summary statistics on each. Both summaries are then - together with epsilon - passed to the distance function.

In my case, I would instead like to use the raw observed data, but for the simulated data I want to perform 10 (or more) simulations, compute the mean and standard deviation per bin (resulting in two arrays with the same dimensions as the raw observed data), and use the mean as input to the distance function while I use the std as epsilon.

It seems like in the current implementation of pymc I have two options for a workaround, both using sum_stat="identity":

  1. Instead of the original simulator function, I pass a function to pm.Simulator() that calculates the mean of 10 simulations. Since epsilon can only be passed as a np.array, I would need to perform 10 additional simulations, compute the std, and then also pass the resulting epsilon.

  2. Keep the same epsilon as in 1) but use the original simulator and live with the fact that we compute the distance from a single simulated instance instead of the mean, which would probably better represent what happens for the given parameters.
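Option 1 might be sketched like this (toy_simulator is a hypothetical stand-in; since epsilon has to be a fixed array, the per-bin std is computed once up front, e.g. at a plausible reference parameter value):

```python
import numpy as np

def toy_simulator(rng, a, size=100):
    # Hypothetical stand-in for the real simulator
    return rng.normal(a, 1.0, size)

def mean_of_sims(rng, a, size=None, n_reps=10):
    # The function handed to pm.Simulator: the mean of several
    # simulations instead of a single draw
    return np.mean([toy_simulator(rng, a) for _ in range(n_reps)], axis=0)

rng = np.random.default_rng(0)
# epsilon must be a fixed numpy array, so estimate the per-bin std
# from extra simulations at a reference parameter value
ref_sims = np.stack([toy_simulator(rng, 0.5) for _ in range(10)])
epsilon = ref_sims.std(axis=0)
sim_mean = mean_of_sims(rng, 0.5)
```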

I am curious to try out both options and see if the results vary significantly in runtime or prediction accuracy.

Dear @ricardoV94 , I am sorry for bothering you.

I am struggling to continue the project because I am missing information on how to implement a function that calculates epsilon from the current parameters inside the model, where the parameters are defined as:

c = pm.DiscreteUniform("c", lower=0, upper=1000)

What I am missing is how to convert c from a pm.DiscreteUniform to an int or float…
Or what the alternative approach to that would be.

I posted the question here: Convert from pm.DiscreteUniform to int?

I am sure it is a simple command, but I cannot find it anywhere on discourse.pymc or elsewhere.

Please let me know if there is a solution. :pray: :pray: :pray: