Using ADVI (with mini-batch) with streaming data

I’m trying to figure out how we can use the current ADVI (with mini-batch) implementation with streaming data. Usually, in ADVI, we have to provide the complete dataset before training starts. However, I’m interested in using streaming datasets to perform Bayesian inference.

Is it possible to use ADVI with a streaming dataset (as an online ML algorithm)? If so, how can I proceed?

What do you mean by streaming data? I guess you can use theano.shared and call .set_value() when new data comes in.

Thanks for the suggestion. I thought of that, but then I felt it may not continuously update the posterior. Here is roughly what I tried:

advi = pm.ADVI()
for _x, _y in batches:
    shared_x.set_value(_x)
    shared_y.set_value(_y)
    minibatch_x = pm.Minibatch(_x, batch_size=100)
    minibatch_y = pm.Minibatch(_y, batch_size=100)
    apprx =, more_replacements={shared_x: minibatch_x, shared_y: minibatch_y})

So will this update the mean and std approximations of ADVI incrementally for each batch? Note that I completely discard the previous batches when a new batch comes in.

No, it will not. I see what you mean now: basically, what you want is a way to incrementally update the approximation, so that if you train batch by batch, the end result is identical to training on the whole dataset at once. Is that right?

If that’s the case, I don’t think there is an out-of-the-box solution yet. I was thinking about how to do this without recompiling a new model with a new prior, but so far I have no good answer.

I see. So is this a limitation of the way ADVI is implemented in PyMC3?

I ask because, in the paper, the way the authors present ADVI gives the impression that it can be used for streaming data as well.

I think this is a limitation of almost all current frameworks; they are not built to handle Bayesian filtering problems.

Unless I missed it somewhere, in the paper they refer to mini-batching the data input to reduce the computational demand (which is what you can already do with Minibatch in PyMC3).

I don’t think you missed anything; I didn’t see a place where they explicitly say that either, but I got the impression from the way they present mini-batching.

Could you clarify that statement again? Does it mean that mini-batch ADVI is not meant to be trained with streaming data? Is this a limitation of the ADVI framework?

No, it’s not a limitation of ADVI alone; you have the same problem when doing sampling. I see this question come up quite a bit. For example, after you do inference via posterior sampling, is it possible to update the trace according to new observations? I have not yet seen any solution that is both general and easy to implement.

As for ADVI specifically, there is a scaling that you need to define via the total_size kwarg. This means that if you have future data of unknown length, you need to properly reweight the scaling; simply replacing the minibatch value is not valid.


Thanks for the clarification @junpenglao. This is exactly what I wanted to know.

I think Edward has managed to extend ADVI for online learning. @junpenglao, do you know how this works?

That page describes the same thing as Minibatch in PyMC3.


Yup, it seems I misunderstood what they said. They propose using Bayesian filtering.

Anyway, you mentioned that we use the size of the dataset in the ADVI computations. Is it used only to compute the scaling (the total_size kwarg), which defines how much the computations on the mini-batch are scaled up?

Yep, that’s the only use right now.


I tried to implement this using PyMC3. This is my code:

import numpy as np
import pymc3 as pm
from theano import shared

seed = 7

d = 5
m = 10
coeffs = np.random.uniform(-10, 10, d)

def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def next_set(w, m=10, d=3):
    x = np.random.rand(m, d)
    y =, w)
    y = y + np.random.normal(0, 0.1, m)
    return x, y

test_x, test_y = next_set(coeffs, 50, d)

model = pm.Model()
shared_w_mu = shared(np.full(d, 0.0))
shared_w_sd = shared(np.full(d, 1.0))
shared_sigma_mu = shared(0.0)
shared_sigma_sd = shared(1.0)

x, y = next_set(coeffs, m, d)
shared_x = shared(x)
shared_y = shared(y)

with model:
    w = pm.Normal('w', mu=shared_w_mu.get_value(), sd=shared_w_sd.get_value(), shape=d)
    sigma = pm.Normal("s", shared_sigma_mu.get_value(), sd=shared_sigma_sd.get_value())

    mu =, w)
    pm.Normal("y", mu=mu, sd=sigma, observed=shared_y)

    advi = pm.ADVI(total_size=500)

for i in range(50):
    with model:
        apprx =

    x, y = next_set(coeffs, m, d)

    mu_dic = apprx.groups[0].bij.rmap(apprx.params[0].eval())
    sd_dic = apprx.groups[0].bij.rmap(apprx.params[1].eval())


    pred =, mu_dic['w']) # avoid ppc to improve the performance
    print(mean_absolute_percentage_error(test_y, pred))

Here, I try to extend ADVI to streaming ML via Bayesian filtering, assuming that we know the total_size. However, the accuracy of the estimated coefficients improves very slowly compared to the Edward implementation. What am I doing wrong?

The total_size should be specified on the observed variable instead, i.e. pm.Normal("y", mu=mu, sd=sigma, observed=shared_y, total_size=500), not in the ADVI constructor.

I changed it, but it did not improve.

apprx = is probably too little for the approximation to converge, check also the optimizer.

I increased the number of iterations and set the optimizer to the same one used in Edward (Adam). However, the PyMC3 model still does not improve.

I think the model is not updated with the new mean and std once a single batch is trained. Do we have a way of changing the mean and std of each FreeRV dynamically?

Not sure what you mean: when you move to a new batch, the mean and std of the ADVI approximation are not reinitialized; they start from the values resulting from training on the previous batch.

Try removing .get_value() in the model definition:

with model:
    w = pm.Normal('w', mu=shared_w_mu, sd=shared_w_sd, shape=d)
    sigma = pm.Normal("s", shared_sigma_mu, sd=shared_sigma_sd)