Binary neural network training

I’m training a binary neural network (weights constrained to be either 0 or 1). My model is

y_i ~ Bernoulli(p_i),   logit(p_i) = z_{i,2} - z_{i,1},   z_i = W x_i,   W_{jk} ~ Bernoulli(0.5)

In JAGS it looks like

model {
  for (i in 1:length(y)) {
    y[i] ~ dbern(p[i])
    logit(p[i]) <- z[i,2] - z[i,1]
    z[i,1:2] <- w %*% x[i,]
  }
  for (i_output in 1:2) {
    for (j_input in 1:25) {
      w[i_output, j_input] ~ dbern(0.5)
    }
  }
}

I started simple: a 25-dimensional binary vector (a flattened 5x5 MNIST image) is fed into a network with binary weights and binary outputs. The label 0 or 1 is taken as the argmax of the softmax of the activations W * x, like so:
y_predicted = argmax{ softmax(W * x) }
Once I’m done with this toy example, I’ll move on to full 28x28 MNIST with a true softmax (not a logit).
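
(Aside: softmax is monotone, so argmax(softmax(z)) equals argmax(z); for two classes that reduces to the sign of z_2 - z_1, which is exactly what the logit line in the JAGS model exploits. A quick numpy sketch with made-up data:)

import numpy as np

W = np.random.randint(0, 2, size=(2, 25))   # binary weight matrix
x = np.random.randint(0, 2, size=25)        # flattened 5x5 binary image
z = W @ x                                   # activations
softmax = np.exp(z) / np.exp(z).sum()
y_predicted = np.argmax(softmax)            # same as np.argmax(z), i.e. int(z[1] - z[0] > 0)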

I’d like to port this toy model to PyMC3. I’m struggling with the matrix multiplication W * x. How do I implement such a model?
What I expect to see in PyMC3 is something like this:

import pandas as pd
import pymc3 as pm

data = pd.read_csv("https://www.dropbox.com/s/l7uppxi1wvfj45z/MNIST56_train.csv?dl=1")
x = data.iloc[:, :25]  # first 25 columns: flattened 5x5 MNIST image
y = data.iloc[:, 25]   # last column: label 0 or 1

basic_model = pm.Model()  # am I right here?

with basic_model:
    # Priors for unknown model parameters
    w = pm.Bernoulli('w', p=0.5)
    logit_p = w %*% x  # R syntax - what is the PyMC3 equivalent?
    y_obs = pm.Bernoulli('y_obs', logit_p=logit_p, observed=y)

    step = pm.Slice()  # what sampler to use??
    trace = pm.sample(1000, step=step, njobs=1, chains=3, n_init=1000)

You need to make sure the shapes are correct here: since x has shape (n, 25), the weights should have shape (25, 1). Something like the below should work:

import theano.tensor as tt

with basic_model:
    w = pm.Bernoulli('w', p=0.5, shape=(25, 1))  # one binary weight per input pixel
    logit_p = tt.dot(x, w)                       # matrix multiplication in Theano
    ...
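
Putting the pieces together, a minimal end-to-end sketch might look like this (assuming the CSV layout above; PyMC3 should assign BinaryGibbsMetropolis to the discrete w automatically, so no step method needs to be specified):

import pandas as pd
import pymc3 as pm
import theano.tensor as tt

data = pd.read_csv("https://www.dropbox.com/s/l7uppxi1wvfj45z/MNIST56_train.csv?dl=1")
x = data.iloc[:, :25].values
y = data.iloc[:, 25].values

with pm.Model() as basic_model:
    w = pm.Bernoulli('w', p=0.5, shape=(25, 1))
    logit_p = tt.dot(x, w).flatten()  # (n, 1) -> (n,) to match y
    y_obs = pm.Bernoulli('y_obs', logit_p=logit_p, observed=y)
    trace = pm.sample(1000, chains=3)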

@junpenglao
Thanks, that does the job for a binary classifier. The complete and ready-to-run source code is here.

Now I want to switch to a multi-class problem, where the output of my network is a softmax that gives the probability of belonging to each of 10 classes. How do I model such a layer? Something like a multivariate Bernoulli distribution constrained to have only one bit active among the 10… I know this is not strictly a PyMC-related question, though.

You can extend the dimensions of w so that the dot product returns an (n, k) matrix where k=10:

with basic_model:
    w = pm.Bernoulli('w', p=0.5, shape=(25, 10))
    p = tt.nnet.softmax(tt.dot(x, w))

There is a similar example here: Multinomial with Random Effects on panel data (shape issues)
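
For completeness, a sketch of how the likelihood might be attached to the softmax output (pm.Categorical is equivalent to a Multinomial with n=1; the shapes above are assumed):

with pm.Model() as multiclass_model:
    w = pm.Bernoulli('w', p=0.5, shape=(25, 10))
    p = tt.nnet.softmax(tt.dot(x, w))                 # (n, 10) row-wise class probabilities
    y_obs = pm.Categorical('y_obs', p=p, observed=y)  # y holds integer labels 0-9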

Yes, a Multinomial with n=1 seems to capture what I need. Yet sampling even 10 draws of a 784x10 binary matrix on the full 60000-image MNIST dataset would take days. The idea was to take advantage of Bayesian techniques to optimize neural networks in the case where the parameter space is small (constrained to be binary). This example shows that the direct approach is hopeless in terms of training time compared to traditional gradient-based optimization.

Anyway, for those who are interested, the source is here.

Yeah, sampling a large binary matrix is really slow in PyMC3 - you could instead relax the requirement and sample a matrix of continuous values between 0 and 1.
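
A minimal sketch of that relaxation, assuming a Uniform(0, 1) prior in place of the Bernoulli (a Beta prior would also work); with all variables continuous, PyMC3 can use NUTS:

with pm.Model() as relaxed_model:
    w = pm.Uniform('w', lower=0., upper=1., shape=(25, 10))  # continuous stand-in for binary weights
    p = tt.nnet.softmax(tt.dot(x, w))
    y_obs = pm.Categorical('y_obs', p=p, observed=y)
    trace = pm.sample(1000)  # NUTS is assigned automatically for continuous models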