Poor Performance of BNN on Multiclass Classification

Hi, I am trying to implement a basic feed-forward Bayesian neural network for the MNIST data set. I followed this tutorial https://docs.pymc.io/en/v3/pymc-examples/examples/variational_inference/bayesian_neural_network_advi.html and tried to extend it to multi-class classification. To do so, I replaced the Bernoulli likelihood at the very end with a Categorical likelihood using softmax probabilities. I also removed one hidden layer to speed up training, so my current model has only a single hidden layer. The input X is the flattened MNIST data of shape (28*28,), and the labels are one-hot encoded vectors of shape (10,) representing the digits 0-9.

My problem, however, is that the model seems unable to reproduce the data. When I swap the training data for the test data and sample the posterior predictive, the posterior mean yields a roughly uniform distribution over the labels, e.g.

> ppc["out"].mean(axis=0)[:, 23]
array([0.124, 0.102, 0.103, 0.106, 0.093, 0.117, 0.09 , 0.121, 0.091,
       0.111])
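
For reference, this is roughly how I compare the predictions with the true labels (the shapes follow from the output above; ppc and Y_test are defined in the code further down):

# Compare posterior predictive means with the true digits
# (ppc["out"] has shape (n_ppc_samples, 10, n_test) here)
mean_pred = ppc["out"].mean(axis=0)      # shape (10, n_test)
pred_labels = mean_pred.argmax(axis=0)   # most likely digit per test image
true_labels = Y_test.argmax(axis=1)      # undo the one-hot encoding
print("test accuracy:", (pred_labels == true_labels).mean())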

I have had a look at this thread https://discourse.pymc.io/t/poor-accuracy-of-bnn-for-mnist/1978, but that model uses Lasagne for its layers, whereas I wanted to write my model in “pure” PyMC.

Do you have any advice or can you maybe spot a mistake in my model? Any help would be greatly appreciated!

Below is the code of my current model:

from warnings import filterwarnings

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pymc3 as pm
import seaborn as sns
import theano
import theano.tensor as T
from tensorflow import keras
floatX = theano.config.floatX

# Load and process MNIST data
(X_train, Y_train), (X_test, Y_test) = keras.datasets.mnist.load_data()
X_train = X_train[:100].reshape(-1, 28**2).astype(floatX) / 255.
X_test = X_test[:100].reshape(-1, 28**2).astype(floatX) / 255.
Y_train = keras.utils.to_categorical(Y_train[:100], 10).astype(floatX)
Y_test = keras.utils.to_categorical(Y_test[:100], 10).astype(floatX)

# construct NN
def construct_nn(X_train, Y_train):
    n_hidden = 128
    # Initialize random weights between each layer
    init_1 = np.random.randn(X_train.shape[1], n_hidden).astype(floatX)
    init_out = np.random.randn(n_hidden, 10).astype(floatX)

    with pm.Model() as neural_network:
        ann_input = pm.Data("ann_input", X_train)
        ann_output = pm.Data("ann_output", Y_train)

        # Weights from input to hidden layer
        weights_in_1 = pm.Normal("w_in_1", 0, sigma=1, shape=(X_train.shape[1], n_hidden), testval=init_1)

        # Weights from hidden layer to output
        weights_2_out = pm.Normal("w_2_out", 0, sigma=1, shape=(n_hidden, 10), testval=init_out)

        # Build neural-network using tanh activation function
        act_1 = pm.math.tanh(pm.math.dot(ann_input, weights_in_1))
        act_out = pm.Deterministic("softmax", T.nnet.softmax(pm.math.dot(act_1, weights_2_out)))
        
        out = pm.Categorical("out", p=act_out, observed=ann_output.T, total_size=Y_train.shape[0])
    return neural_network


neural_network = construct_nn(X_train, Y_train)

# sampling
with neural_network:
    trace = pm.sample(1000, chains=1, progressbar=True, init="advi", n_init=50000)

# predict test data
pm.set_data(new_data={"ann_input": X_test, "ann_output": Y_test}, model=neural_network)
ppc = pm.sample_posterior_predictive(
    trace, samples=1000, progressbar=True, model=neural_network
)

Did you manage to solve this issue? I am having the same problem.

I did actually; it turned out the issues were

  1. One-hot encoding was not a good choice; instead, use the integer labels and convert them to floats.
  2. ADVI got stuck in a local minimum, which I could resolve by using minibatch ADVI instead (a rough sketch of both changes is below).
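
Roughly, the changed parts look like this. This is only a minimal sketch, not my exact notebook: the batch size and the number of ADVI iterations are illustrative values, and floatX, X_train and Y_train are as in the code above.

import numpy as np
import pymc3 as pm
import theano.tensor as T

n_hidden = 128
init_1 = np.random.randn(X_train.shape[1], n_hidden).astype(floatX)
init_out = np.random.randn(n_hidden, 10).astype(floatX)

# 1. Integer class labels (0-9) converted to floats instead of one-hot vectors
Y_train_int = Y_train.argmax(axis=1).astype(floatX)

# 2. Minibatches over the training data for minibatch ADVI
batch_size = 50  # illustrative value
X_mb = pm.Minibatch(X_train, batch_size=batch_size)
Y_mb = pm.Minibatch(Y_train_int, batch_size=batch_size)

with pm.Model() as neural_network:
    weights_in_1 = pm.Normal("w_in_1", 0, sigma=1,
                             shape=(X_train.shape[1], n_hidden), testval=init_1)
    weights_2_out = pm.Normal("w_2_out", 0, sigma=1,
                              shape=(n_hidden, 10), testval=init_out)

    act_1 = pm.math.tanh(pm.math.dot(X_mb, weights_in_1))
    act_out = T.nnet.softmax(pm.math.dot(act_1, weights_2_out))

    # Categorical likelihood over the 10 digits, observed as integer labels;
    # total_size rescales the minibatch log-likelihood to the full data set
    out = pm.Categorical("out", p=act_out, observed=Y_mb,
                         total_size=Y_train.shape[0])

    # Minibatch ADVI instead of initializing NUTS with full-data ADVI
    approx = pm.fit(n=50000, method=pm.ADVI())
    trace = approx.sample(1000)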

I uploaded my notebook here if you want to check it out 🙂


Thanks a lot!

If it can be of any help, I have an active toy project where I tried to build a “Keras-like” interface for constructing BNNs. It allows “flexible” prior and likelihood specification and provides some pre-implemented architectures.

I still haven’t managed to make Embedding and vanilla RNN layers work, but I have a couple of notebooks on MLPs (both for discrete and continuous targets), autoencoders, and embedding visualization.