Bayesian Neural Network captures the mean response but not the variance in the training data

Hello all,
I am trying to use a Bayesian Neural Network (BNN) to model the stress-strain response of a group of materials. The training data consists of a Xnorm (= \lambda_1, inputs) and Ynorm (= \sigma_1, outputs). Xnorm varies from -0.5 to 0.5, and, for each Xnorm, Ynorm varies with some distribution that is dependent on the value of Xnorm. I have a plot of my training data below for more information.

I then build a BNN using PyMC3 in which I:

  • Assume the weights and biases in the network are normally distributed
  • Sample the means and biases of the neural network from uniform distributions over some (hopefully) reasonable ranges
  • Forward propagate the neural network in the usual manner using tanh activation functions
  • Sample the final output from a normal distribution whose mean is the output of the BNN and whose standard deviation I pick from a HalfNormal distribution.

When I then fit the model using ADVI and sample from the posterior of the model, I see that the model captures the mean response pretty well, but for some reason it can’t capture the variance in the training data. The variance in the training data reaches 0.14 at Xnorm = -0.5 and ~0.01 at Xnorm = 0.1, but the variance in the outputs of the model are almost constant everywhere and are very different from these.
I have been stuck for a long time now and can’t move forward no matter what I try, so I would highly appreciate any ideas, suggestions, criticisms etc. no matter how small they may be.

Here is how I build my model:

ann_input = theano.shared(Xnorm)
ann_output = theano.shared(Ynorm)

n_hidden = 3
# Initialize random weights between each layer
init_1 = np.random.randn(1, n_hidden).astype(theano.config.floatX)
init_out = np.random.randn(n_hidden, 1).astype(theano.config.floatX)

ml = -2.5 # Mean Lower bound
mu = +2.5 # Mean Upper bound

sdl = 0 # Standard deviation lower bound
sdu = 5 # Standard deviation upper bound
with pm.Model() as neural_network:
    # Means and standard deviations of the weights in the neural network
    # w1m == weight 1 mean, w1sd == weight 1 standard deviation...
    w1m  = pm.Uniform('w1m',  lower=ml,   upper=mu,  shape=(1, n_hidden))
    w1sd = pm.Uniform('w1sd', lower=sdl,  upper=sdu, shape=(1, n_hidden))
    wom  = pm.Uniform('wom',  lower=ml,   upper=mu,  shape=(n_hidden, 1))
    wosd = pm.Uniform('wosd', lower=sdl,  upper=sdu, shape=(n_hidden, 1))
    # Means and standard deviations of the biases in the neural network
    # b1m == bias 1 mean, b1sd == bias 1 standard deviation...
    b1m  = pm.Uniform('b1m',  lower=ml,   upper=mu,  shape = n_hidden)
    b1sd = pm.Uniform('b1sd', lower=sdl,  upper=sdu, shape = n_hidden)
    bom  = pm.Uniform('bom',  lower=ml,   upper=mu)
    bosd = pm.Uniform('bosd', lower=sdl,  upper=sdu)
    #Weights and biases sampled from normal distributions with means and standard deviations defined above
    weights_1 = pm.Normal('l1_w', mu=w1m, sd=w1sd, shape=(1, n_hidden),        testval=init_1)
    weights_o = pm.Normal('o_w',  mu=wom, sd=wosd, shape=(n_hidden, 1),        testval=init_out)
    bias_1 = pm.Normal('l1_b', mu=b1m, sd=b1sd, shape = n_hidden)
    bias_o = pm.Normal('o_b',  mu=bom, sd=bosd)

    # Now assemble the neural network
    layer_1 = pm.math.tanh(, weights_1) + bias_1)
    layer_2 = pm.math.tanh(, weights_2) + bias_2)
    layer_3 = pm.math.tanh(, weights_3) + bias_3)
    layer_4 = pm.math.tanh(, weights_4) + bias_4)
    layer_o =, weights_o) + bias_o
    sigma   = pm.HalfNormal('sigma',sd=0.1)
    y = pm.Normal('y', layer_o, sd=sigma, observed = ann_output)

Then I fit the model using ADVI:

with neural_network:
    n_samples = 150000
    inference = pm.ADVI()
    approx = = n_samples, method = inference)


with neural_network:
    trace = approx.sample(draws=1000)
samples = pm.sample_posterior_predictive(trace, var_names=['y','sigma'], model=neural_network, size=1000)

Here is what the neural network predictions look like given the training data shown below:

As you can see the mean response is similar, but the variance in the NN output is nowhere near that of the data used to train it.

Again, as I said, I am desperate for any and all suggestions, so please don’t hesitate to comment your opinions. Here are some things that I have tried so far:

  • Use ReLU activation functions instead of tanh
  • Use other optimization methods such as NUTS
  • Sample the weights and biases from Normal distributions with fixed means and standard deviations rather than sampling the means and SDs from Uniform distributions.
  • Use smaller and larger networks; fewer nodes per layer or fewer layers etc. Reducing the network size only makes the output coarser, but the overall behavior is similar.
  • Note that the solution does converge. Loss stops decreasing and I have tried running the fitting for much longer iterations as well as using ADVI.refine, but the results don’t improve.
  • I have also tried manually playing with the weights and biases to see if it is even possible to get this shape of results with this neural network architecture, and it seems that yes it is possible given the right weight and bias distributions.

Thank you for your time and thoughts.

Edit: Note that in the top right plot “Likelihood” refers to the value of sigma (i.e. the standard deviation of the “y” in the model) exactly, whereas I calculate the “Total” variance by taking the standard deviation of all the samples of “y” at every point.

Edit 2: I uploaded a better looking predictions plot.

1 Like

The way you defined your model, it assumes constant noise regardless of your output. If you want your model noise to vary, you will have to model that explicitly.

This expression does not depend on any of the inputs:

sigma = pm.HalfNormal('sigma',sd=0.1)

1 Like

Thank you for your response!

Could you please suggest a way to make the noise dependent on the input, or if you know of any examples here in the discourse or elsewhere on the web I would highly appreciate a link.

Here is the reasoning that I followed when writing this code. The weights and biases of the neural network are random variables. Therefore layer_o (output of the BNN) will already have some variance that depends on the input. My reasoning is that this variance due to the uncertainty in the weights and biases should account for the input dependent noise. But my training data is generated by adding a small constant Gaussian noise at every point (notice the waviness in the training data), so even if the model figured out the exact distributions of weights and biases required to generate this data, it would still need to add a small variance on top to account for the gaussian noise, that is why I am using the sigma = pm.HalfNormal('sigma',sd=0.1). I would appreciate it if you could help me understand the error in my logic.

Edit: By the way, as I expect, the variance in the NN does indeed change with the input (as seen in the top right plot), but the change is very small and is very different from the actual variance in the training data. So another follow up question would be, what is causing that change in the variance if my model is configured to have a fixed noise regardless of the input?

1 Like

I see, so you are saying that the variation in the “mean” predictions themselves should be accounting for the variation in the data. I am not sure if this follows directly from the model (ie if it’s flexible enough for this). Given that you opened this issue in the first place I would say it’s not.

I should warn you that I am not very familiar with Bayesian NNs!

Anyway to model the noise explicitly you can do it just as you did for your mean outputs, i.e., make it depend on a separate set of NN activations. You can use an exponential link to make predictions from the real domain to the positive domain.

You can probably also make this depend just on the output layer of your NN for computational savings. Or you can go even more simple and just use a vanilla GLM based on your NN outputs for the noise component.

Let me know if I sound completely off!


Thanks again for taking the time to respond.
I think your point on flexibility of the neural network is important, however I have checked and confirmed that the NN is indeed flexible enough for this. To check this I took the mean values of the weights and biases from the ADVI fit and manually tweaked them to see how the output changes. And, indeed, it turns out that the neural network is capable of capturing this behavior if it just figured out the right distributions for the weights and biases. Here is an image of what I did:

Notice how when I change b2 (bias 2) the output changes to reflect the variance at Xnorm=-0.5 . This is exactly what I expect from my fitting, but it’s just not happening.

I have thought about modeling the noise explicitly using a separate set of activations, but I think that defeats the purpose of using a Bayesian Neural Network in the first place. In that method, there is no longer a point in making the weights and biases random variables. I could just build two neural networks in a non-probabilistic package like Keras/Tensorflow and train one of them to return the mean response and the other to return the variance as a function of the input.

How do the posterior weights at different layers look like compared to the true weights? Do you see systematic biases? Too flat? Too peaked?

I checked your model more carefully and indeed a fixed noise should be fine (I didn’t realize it was data you generated). Is the posterior for the noise centered around the correct value of 1?

Also, I am not sure I understand your first plots were you demonstrate the noise issues. Could you explain a bit more what you are plotting?

I am not sure I understand some of what you are saying. I am fairly new to PyMC3 and probabilistic programming, so excuse my ignorance. For example what do you mean by ‘true weights’? Also I don’t understand why the correct value for the posterior of the noise would be 1…

But why don’t I just show you? Here is what I get with az.summary:

(Sorry Ipython truncates the output for some reason).

And here is what I get with plot_trace:

Note that in this case I am using a BNN with 2 layers only unlike in my original question where I am using 4 layers. And here is what the predictions of the NN look like in this case if you were curious (I also added this to my question because I think this version of the plots better illustrates the issue):

The BNN captures the trend in the variance but the values are very off…

Regarding your last question:
I sample 1000 samples for ‘y’ and ‘sigma’ from the posterior. Then in my “BNN Output Variance” plot (top right in my question above) I plot 2 curves. First I take the standard deviation of ‘y’ across all 1000 samples for every value of Xnorm (=\lambda_1) using SD = np.std(samples['y'],axis=0) and plot it under the name “Total”. Then I take the average of ‘sigma’ across all 1000 samples and plot that under “Likelihood” using np.mean(samples['sigma'],axis=0). Hope this explains it. On the other hand, in the top left plot I plot the mean value of ‘y’ (blue line) and surround it with a shaded area from -2*SD to +2*SD.

The true weights are the ones you used to generate the data (which your model should be able to recover). For instance in the code above you added a noise with a standard deviation of 1 to your final output, that’s why sigma should be around 1.

Or did I misunderstand that you are testing on data you generated yourself?

I generate the data myself, but not with a neural network of course. I generate it with a completely different set of equations that has 4-5 parameters only and looks nothing like the activations of a neural network. I am trying to simulate the results of those equations with a neural network here. So there is no true value of network weights and biases.

For sigma I sample it from a HalfNormal with a standard deviation of 0.1 so I would expect it to have a positive mean that is much smaller than 0.1 .

Does anyone have any opinions on why this might be happening?

I am still looking for suggestions on this problem if anyone can contribute. Thanks a lot.

Part of the issue might be in how you’re predicting y. Note that when you run sample_posterior_predictive to give you y, the resulting variance includes the noise term (sigma), which is why the mean +/- 2SD plot looks like there is little input-dependence to the uncertainty. Rather than sample y directly, I think you’ll want to set layer_o to be a pm.Deterministic named variable, then sample/plot that. That will give you (small, ~0.0005) input-dependent uncertainty.

More generally, though, I think you might be mistakenly conflating the Bayesian concept of confidence with the idea of observed variance. Look at what the likelihood term is doing. Your BNN is only trying to predict the mean of the data, so the resulting uncertainty reflects the model’s confidence in the mean, not the expected spread in observed values. You haven’t told the model that variance depends on inputs, so it doesn’t know to look for it. Keep in mind, too, that it looks like you have several traces there, each with one reading at each λ1. Note that your model doesn’t know that, it’s treating each individual reading from a given λ1 as completely independent of those at a neighboring λ1. Given that your BNN lacks any convolutional structure, and that your observations are just lumped together in the likelihood, the fact that neighboring measurements are correlated is completely lost.

You could specify a heteroskedastic model, where sigma depends on inputs. This could simply be a separate/parallel BNN. You could also try to get clever and replace every Normal prior with an MvNormal prior that outputs a vector of values, one element for each of mean and variance. You could also go with an entirely separate approach, such as a (likely Sparse) heteroskedastic Gaussian Process, which is basically doing the same thing (but arguably simpler than a BNN). The predictions of your model will give you the expected mean and expected spread in the data, just like a vanilla NN, but because it’s Bayesian you will also have an estimate of your confidence in that mean and confidence in that spread.

I suspect, though, that that’s not quite what you want. Is there any information on what distinguishes the different traces, or is the spread truly random? If you have anything, you should include that as a coordinate/covariate in your model. You could fit a (homoskedastic) 2D GP or BNN, you could use Hierarchical modeling (see here or here), you could use a coregionalized GP, etc. Using one of these approaches, when asking your model for a prediction, you would specify this extra variable and receive an expected mean with corresponding confidnce. I suspect that’s more what you’re looking for.