Poor Accuracy of ADVI for Linear Regression

I tried training a linear regression model using both ADVI and MCMC. ADVI is expected to have less accuracy than MCMC when both models are converged. However, I found that ADVI has unacceptable accuracy compare to the MCMC sampling.

The script that was used for evaluation :

import time

import numpy as np
import pymc3 as pm
from sklearn.datasets import load_diabetes, load_boston
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from theano import tensor as tt, shared

def adding_intecept(X):
    Adding a column for intercept
    :param X: features
    return np.c_[X, np.ones(X.shape[0])]

data = load_diabetes() # load diabetes dataset
# data = load_boston() # load boston dataset
data["data"] = adding_intecept(data["data"])
X, X_test, y, y_test = train_test_split(data["data"], data["target"], test_size=0.25, shuffle=False)

shared_X = shared(X)

# TODO : add hyper prior
# Define the model
with pm.Model() as regression_model:
    # Define priors
    sigma = pm.HalfCauchy('sigma', beta=10)

    w = pm.Normal("w", mu=0, sd=20, shape=X.shape[1])
    mu = tt.dot(w, shared_X.T)

    # Define likelihood
    likelihood = pm.Normal("y", mu=mu, sd=sigma, observed=y)

    # Inference
    t_start = time.time()
    trace = pm.sample(1000)
    time_taken_mcmc = (int(round((time.time() - t_start) * 1000)))

    t_start = time.time()
    apprx = pm.fit(10000)
    time_taken_advi = (int(round((time.time() - t_start) * 1000)))


# MCMC error
ppc = pm.sample_ppc(trace, model=regression_model, samples=50000)
pred_mcmc = ppc['y'].mean(axis=0)
mse_mcmc = mean_squared_error(y_test, pred_mcmc)
print("Inference : MCMC, wall time : %ims, mse: %f" % (time_taken_mcmc, mse_mcmc))

# ADVI error
ppc = pm.sample_ppc(apprx.sample(50000), model=regression_model, samples=50000)
pred_advi = ppc['y'].mean(axis=0)
mse_advi = mean_squared_error(y_test, pred_advi)
print("Inference : ADVI, wall time : %ims, mse: %f" % (time_taken_advi, mse_advi))

The output :

100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1500/1500 [00:03<00:00, 466.80it/s]
Average Loss = 2,413.6: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10000/10000 [00:01<00:00, 6276.26it/s]
Finished [100%]: Average Loss = 2,410.3

Inference : MCMC, wall time : 4532ms, mse: 5440.780652
Inference : ADVI, wall time : 6138ms, mse: 29495.179671

Since I was not happy with the ADVI results I extended the evaluation by evaluating the linear regression using CAVI.

The extended script :

from skbayes.linear_models import VBLinearRegression

lr = VBLinearRegression(n_iter=100)
t_start = time.time()
time_taken_vi = (int(round((time.time() - t_start) * 1000)))
pred_vi = lr.predict(X_test)
mse_vi = mean_squared_error(y_test, pred_vi)
print("Inference : VI, wall time : %ims, mse: %f" % (time_taken_vi, mse_vi))

Output :
Inference : VI, wall time : 1ms, mse: 2823.810371

As you can observe, the CAVI performs as expected (It outperform MCMC aswell. May be due to the less number of samples draws during MCMC). I’m interested in knowing why ADVI failed to perform as desired, if CAVI can achieve better accuracy.

I appreciate any help to understand the following observations,

  1. ADVI has very poor accuracy compared to MCMC for linear regression. What could be the reasons for that?
  2. Why did ADVI fail to achieve sufficient accuracy if CAVI is capable of providing better accuracy ?
  3. Why is ADVI much slower than CAVI (is it due to the automatic differentiation of expressions) ?
  4. Why is ADVI slower than MCMC? Is it because overhead due to automatic differentiation higher than gain of it stochastic VI in this particular case ?
  5. What is the complete time taken to perform ADVI? Is it the wall time measured by my script (6138ms) or the time logged by PyMC3 (1s)?

Note : used skbayes to perform linear regression with CAVI

There are a few points to keep in mind before you compare the fitting accuracy between MCMC and ADVI:

  1. Convergence. Are you sure ADVI is already converged? did you check the ELBO history? It is often observed that VI is very sensitive to training time (eg see https://arxiv.org/abs/1802.02538).

  2. Optimizer. Related to the first point, sometimes the default SGD does not perform well. Try ADAM.

  3. Stochastic nature of VI in PyMC3. PyMC3 use MC sample to approximate the objective gradients. As a consequence, the result of the fit is stochastic - you can see that in the ELBO it is not always decreasing. So when you stop the training, VI return the fitting from the last iteration, which can happen to have high ELBO. Solution is to increase the obj_n_mc - Number of monte carlo samples used for approximation of objective gradients. Another solution is get the approximation that minimized the ELBO in the last few hundred iterations, but I don’t think there is code to do it out of the box yet.

Currently these are the points that comes to mind. You can have a look of my WIP notebook with similar idea to compare VI and MCMC.

1 Like

Thanks for the suggestions @junpenglao

  1. Yes, I check the convergence. I have plotted the ELBO history. It seems that ADVI has already converged when passing 6000th iteration. Since our model is a simple linear regression model, I expected that to converge within 10000 iterations. Moreover, I tried out with several other datasets and observed poor accuracy for ADVI. To be sure, I also tried increasing the number of iterations to 50000 and the model still gives similar error.

  2. After you recommended, I tried using ADAM and still I observe an error close to SGD.

  3. I tried increasing the number of MC samples, and yet ADVI gives the same accuracy.

It seems that we are missing something here. This task is not that difficult (this is not some hierarchical model or BDL) for both ADVI and MCMC, and therefore they both should behave as expected even with the default configurations.

I would love to look at your notebook, can you please provide me with the link to that.

Oops, sorry notebook here: https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/WIP/[WIP]%20Comparing%20VI%20approximation.ipynb

Your ADVI has not converged, there are some flat gradient makes the training difficult when the ELBO gets to ~2000. Your plot is still on a crazy scale - limit it to between [-5000, 0] and you will see the ELBO is still decreasing.

I am currently doing:

with regression_model:
    apprx = pm.fit(50000, method='advi', obj_n_mc=25,

And the ELBO of the last 40000 iteration looks like this:

The saddle point of this model is just crazy.

A closer look shows some paramters are on a different scale compared to the others, and the gradient just propagates slowly there.


{'sigma_log__': array(4.25433834),
 'w': array([ 12.20237742,   3.06760075,  47.48649899,  36.56623474,
         18.55803309,  12.2358095 , -24.81552255,  27.37072583,
         46.02615981,  25.88801179, 145.41653368])}
1 Like

Thanks for helping me to understand the tunning process of ADVI @junpenglao. I’m sorry that it took some time for me to reply back. I took a break from everything for few days (it is a festival season in my country :slight_smile: ).

With this tuning I was able to get the accuracy close to MCMC, yet the ADVI took more than 40s where as MCMC converged within 2s. This is yet contradictory to our expectations since, ADVI is supposed to be faster than MCMC.

I wanted to investigate this statement and more importantly find out if the cause for the ridiculous time taken for the ADVI convergence is due to the differences in the scale of parameters. Therefore, I generated very simple dataset with single predictor variable in order to simulate a simple linear regression model to simplify the task.

These are the results,

100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3500/3500 [00:04<00:00, 749.04it/s]
Average Loss = 14.753: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10000/10000 [00:02<00:00, 3427.40it/s]

Inference : MCMC, wall time : 5882ms, mse: 0.256141
Inference : ADVI, wall time : 7087ms, mse: 0.255874
Inference : VI, wall time : 1ms, mse: 0.257132

Now, I have a problem with interpreting the results. In my 5th question, I wanted to know the different between time shown next to the ADVI progress bar, and the wall time measure by wrapping the pm.fit() with start and end time statements. I assumed that 2s next to progress bar is the actual time taken for stochastic optimization, whereas the wall time measured by me is the time taken to perform automatic differentiation + stochastic optimization + other overheads.

Yes thats correct. theano need to compile the model, compile the approximation, and perform the optimization. The progress bar only shows the last.

Then how can we access (using a script) the time taken for training for both MCMC and ADVI without other overheads?

If you call approx = pm.ADVI() first, then do approx.fit(n=10000), it should separate the set up with the actual optimization.

t1 = time.time()
advi = pm.ADVI()
t2 = time.time()
approax = advi.fit(n=10000)
t3 = time.time()

I tried that but there is only 2-4 ms time difference between t3-t1 and t3-t2.

However, when t3-t2 = 5709 ms, the time shown by PyMC3 progress bar is 1s.

Average Loss = 24.219: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 10000/10000 [00:01<00:00, 7736.26it/s]
Inference : ADVI, wall time : 5709ms, full time : 5712ms

@junpenglao are these the expected values? shouldn’t the t3-t2 should be something close to 1s?

Hmm, seems there are still additional compilation when you first call advi.fit().

Depending on which part of the speed you want to measure. Maybe if you call advi.fit(1) before just so that all loss functions are set up you can measure the raw speed of the optimization - but I am not completely sure about this.

1 Like

This worked. :smiley: Thanks for the idea @junpenglao

I can use this for my current work.

But can’t we have a more clean way of accessing the training time (including MCMC and SVGD etc), because that will be very helpful for those who want to experiment on various Bayesian inference techniques that is supported by PyMC3.
Even if one tries to log/dump the time consumed for inferences exposing such measurements to the users will be useful.

It is certainly possible. I think if we implemented the new logging system project wise (https://github.com/pymc-devs/pymc3/pull/2778), we can potentially dump this information into the log.

1 Like