BNN regression using ADVI


I’m trying to train a simple toy model with ADVI.
The ELBO seems converged, but the prediction is wrong.

The same model works with NUTS sampler, but not with ADVI.

My code is here:


#Toy model
def build_toy_dataset(N=50, noise_std=0.2):
x = np.linspace(-3, 3, num=N)
y = np.cos(x) + np.random.normal(0, noise_std, size=N)
x = x.astype(floatX).reshape((N, 1))
y = y.astype(floatX)
return x, y

N = 50 # number of data points
D = 1 # number of features

X_train, Y_train = build_toy_dataset(N)
X_test, Y_test = build_toy_dataset(N)

fig, ax = plt.subplots()
ax.set(xlabel=‘X’, ylabel=‘Y’, title=‘Toy Regression data set’);

Construct NN - 2 layers with 5 nodes each

def construct_nn_2Layers_with_b(ann_input, ann_output):
n_hidden = 5
n_features = ann_input.get_value().shape[1]

# Initialize random weights between each layer
init_1 = np.random.randn(n_features, n_hidden).astype(floatX)
init_2 = np.random.randn(n_hidden, n_hidden).astype(floatX)
init_out = np.random.randn(n_hidden).astype(floatX)

init_b_1 = np.random.randn(n_hidden).astype(floatX)
init_b_2 = np.random.randn(n_hidden).astype(floatX)
init_b_out = np.random.randn(1).astype(floatX)

with pm.Model() as neural_network:
    # Weights from input to hidden layer
    weights_in_1 = pm.Normal('w_in_1', 0, sd=1,
                             shape=(n_features, n_hidden),

    bias_1 = pm.Normal('b_1', mu=0, sd=1, shape=(n_hidden), testval=init_b_1)
    # Weights from 1st to 2nd layer
    weights_1_2 = pm.Normal('w_1_2', 0, sd=1,
                            shape=(n_hidden, n_hidden),

    bias_2 = pm.Normal('b_2', mu=0, sd=1, shape=(n_hidden), testval=init_b_2)
    # Weights from hidden layer to output
    weights_2_out = pm.Normal('w_2_out', 0, sd=1,
    bias_out = pm.Normal('b_out', mu=0, sd=1, shape=(1), testval=init_b_out)
    # Build neural-network using tanh activation function
    act_1 = pm.math.tanh(,
    act_2 = pm.math.tanh(,
    act_out = (, weights_2_out) + bias_out)

    sd = pm.HalfNormal('sd', sd=1)
    out = pm.Normal('out', mu=act_out, sd=sd, observed=ann_output, total_size=Y_train.shape[0])
return neural_network

ann_input = theano.shared(X_train)
ann_output = theano.shared(Y_train)
neural_network = construct_nn_2Layers_with_b(ann_input, ann_output)


with neural_network:
s = theano.shared(pm.floatX(1))
inference = pm.ADVI(cost_part_grad_scale=s)
tracker = pm.callbacks.Tracker(
mean=inference.approx.mean.eval, # callable that returns mean
std=inference.approx.std.eval # callable that returns std
), method=inference, callbacks=[tracker])
# It is time to set s to zero
approx =, method=inference, callbacks=[tracker] )


with neural_network:
ppc = pm.sample_posterior_predictive(trace, samples=500, progressbar=False)

pred = ppc[‘out’].mean(axis=0)


fig, ax = plt.subplots()
ax.plot(X_test,pred,‘r–’,alpha=0.9,label=‘Posterior predictive means’)
ax.set(xlabel=‘X’, ylabel=‘Y’, title=‘Test set’);

The Notebook is private, it’s asking for username and password for access. Could you host it as an open gist instead?

You might also find this previous discussion helpful: Poor Accuracy of ADVI for Linear Regression

Hi, thanks! I looked into this discussion. it helped a bit but didn’t solve my problem :slight_smile:

Hi, sorry I didn’t notice it is private. Here is the gist:

Thanks so much!

I looked into your problem and found a solution. Your model had a very wide prior set for sd. This made it more convenient for the fit to make all weights close to 0, to get the overall data mean correct, and then just increase the standard deviation to make all the observations consistent with noise.

To avoid this, I made the prior for sd narrower:

sd  = pm.HalfNormal('sd', sd=0.01)

This forces the model to actually care about a mean as a function of X, and than makes it work well. I also changed the optimizer and increased the number of steps (but maybe those were unnecessary).

You can find the gist with the results here

1 Like

Thanks so much!! I appreciate that.
I checked also without the optimizer, and with lower number of steps and it still converges. :slight_smile:

Could you please explain why did you delete the ‘s.set_value(0)’ in the inference?

and a follow-up question please - how should I choose the prior set in cases that I don’t know the model?

Thanks so much and Happy Sunday :smile:

Sorry for the confusion. Before touching the model, I did a bunch of small tests with the optimizer that didn’t work. Removing that line was one of them. I imagine the new model will work with it fine.

Do you mean the prior hyperparameter values, or the prior distributions and conditional relations?

About the values, if you have no idea of a potential range, you should try a couple of values that are orders of magnitude apart. That way you check your models sensitivity to the prior hyperparameters.

About the distributions and conditional relations, that is the entire core of modeling. It is very problem specific and depends on what you want to do

Thanks so much!

Yes, I meant this:

Such a simple toy model is sensitive to the prior hyperpatameters, so I’m concerned about applying it to complex models. What other build-in function pymc3 has in order to check and verify that? for example, I saw “pm.traceplot” is one option. Do you have other recommendations?

Also, if 2 or more Gaussians are required , for example the data will be

x = np.linspace(-6, 6, num=N)

instead of x = np.linspace(-3, 3, num=N), then we changed the ‘out’ function in the NN in order to capture this (similar to the Gaussian Mixture Mode ).

out = pm.Normal(‘out’, mu=[act_out,act_out], sd=sd, observed=ann_output, total_size=Y_train.shape[0])

but this just made the all means zero again… I’d much appreciate your help to figure out what am I doing wrong.

Once again, thanks so much for helping me getting started on pymc3!

I think that @aloctavodia or @colcarroll can give you much better advice than me on this subject.

I’m sorry but you lost me. You mean to have 2 or more peaks of the cosine spanned by x? If this is the case, the usual advice is to use a periodic activation function for the nodes in the neural network instead of tanh. If you don’t do that, you’ll need much more intermediate nodes. I don’t understand what you did with mu=[act_out,act_out]. I would change the activation function, the number of hidden nodes and the number of input nodes to match the data range.

Nevertheless, ADVI has subtleties that can make it converge to something that’s completely off. Maybe @junpenglao or @ferrine can comment on that better than me.

My suggestion is to make weight init to small values, like normal with 1e-2 std and small initial posterior sigma. Increased hidden dim should also help. The drawback with normal prior is to irreducible over parametrization, you May also try some sparsifying priors like studentT.