Hierarchical Bayesian Neural Networks with Informative Priors by @twiecki

New blog post by @twiecki


Very cool stuff, as per usual :slight_smile:

One thing I can’t help but notice is that the number of weights (n_hidden) is very small here, just 5 neurons per layer. When I played around with twiecki’s previous Bayesian NN examples and ran my own experiments, I hit really severe non-identifiability and multimodality in the posterior as the number of neurons per layer increased.

ADVI fails badly as the number of weights increases, for obvious reasons (both mode-seeking and mode-covering behaviour result in poor approximations of a multimodal posterior), and NUTS takes forever and also struggles with the model structure. To me this is the biggest challenge facing the application of Bayesian NNs, and I don’t know whether it’s been satisfactorily solved yet.

Unless I’m mistaken, due to NumPy’s constraints on stacking, this requires an equal number of samples in each group, right?

In reality, this is rarely the case. I’ve been thinking of getting around this problem with masked arrays, but before I try that I was wondering whether anyone has any intuition about how PyMC3/Theano will handle masked arrays as input (I’ve only ever seen examples of masked arrays being used as an observed variable).

Alternatively, I’d appreciate any other suggestions to get around this problem. Thanks!
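
One workaround that avoids stacking altogether is to keep the data in “long” format with a per-sample group index, which is the usual way to express hierarchical models in PyMC3. This is only a minimal sketch of that idea with a single linear layer; the data, shapes, and variable names are made up for illustration:

```python
import numpy as np
import pymc3 as pm

# Hypothetical ragged data in "long" format: one flat design matrix plus
# a group index per row, so the groups never have to be stacked.
X = np.random.randn(13, 3)               # all samples from all groups
y = np.random.randn(13)
group_idx = np.array([0] * 8 + [1] * 5)  # group membership per row
n_groups, n_features = 2, X.shape[1]

with pm.Model():
    # Hierarchical per-group weights (one linear layer for brevity).
    mu_w = pm.Normal("mu_w", 0.0, 1.0, shape=n_features)
    sigma_w = pm.HalfNormal("sigma_w", 1.0)
    w = pm.Normal("w", mu_w, sigma_w, shape=(n_groups, n_features))

    # Pick each row's group-specific weights, then take a rowwise dot product.
    mu = (X * w[group_idx]).sum(axis=-1)

    sigma = pm.HalfNormal("sigma", 1.0)
    pm.Normal("obs", mu, sigma, observed=y)
```

The indexing trick generalizes to deeper layers, though for a full batched NN forward pass the stacked-tensor layout from the blog post is faster, which is what makes the unequal-group-size question awkward in the first place.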

[quote=“bglick13, post:3, topic:1718”]
In reality, this is rarely the case. I’ve been thinking of getting around this problem using masked arrays, but before I try that I was wondering if anyone had any intuition of how PyMC3/Theano will handle masked arrays as the input (I’ve only ever seen examples of masked arrays being used as an observed variable).[/quote]

Thanks @twiecki for the post!

I am also trying to get around this issue. A masked array won’t work here, because some values of the observed variable (e.g. for a group with a smaller size) would also need to be discarded. Is there a way to not take those values into account?

Does anyone have an idea? Thanks
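
One pattern that might work (not from the original post; the names and shapes below are hypothetical) is to pad every group to the size of the largest one and then zero out the log-likelihood contribution of the padded rows with a pm.Potential, so the padded values never affect the posterior. A minimal sketch with a single linear layer:

```python
import numpy as np
import pymc3 as pm
import theano.tensor as tt

# Hypothetical ragged data: two groups of different sizes.
X_groups = [np.random.randn(8, 3), np.random.randn(5, 3)]
Y_groups = [np.random.randn(8), np.random.randn(5)]

n_groups = len(X_groups)
n_features = X_groups[0].shape[1]
max_n = max(len(y) for y in Y_groups)

# Pad to a common length and remember which rows are real.
Xs = np.zeros((n_groups, max_n, n_features))
Ys = np.zeros((n_groups, max_n))
mask = np.zeros((n_groups, max_n))
for i, (x, y) in enumerate(zip(X_groups, Y_groups)):
    Xs[i, : len(y)] = x
    Ys[i, : len(y)] = y
    mask[i, : len(y)] = 1.0

with pm.Model():
    w = pm.Normal("w", 0.0, 1.0, shape=(n_groups, n_features))
    sigma = pm.HalfNormal("sigma", 1.0)
    # batched_dot keeps the stacked layout from the blog post.
    mu = tt.batched_dot(Xs, w[:, :, None])[:, :, 0]
    # Evaluate the pointwise log-likelihood, then mask out padded rows.
    logp = pm.Normal.dist(mu=mu, sigma=sigma).logp(Ys)
    pm.Potential("obs", (mask * logp).sum())
```

The downside is that you trade pm.Normal(..., observed=...) for a manual likelihood, so things like posterior predictive sampling need extra work.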

Can’t you just let Xs and Ys be lists and handle the activations and outputs as lists instead of tensors? i.e. change

act_1 = pm.math.tanh(tt.batched_dot(Xs, weights_in_1))

to

act_1 = [pm.math.tanh(tt.dot(Xs[i], weights_in_1[i, :, :])) for i in range(n_groups)]

(etc.)?

Any update on this, i.e. how to handle the dot product in hierarchical networks when the groups do not contain equal numbers of samples and therefore can’t be stacked?

Hi, sorry to come back to this after such a long time, but I was wondering if you’d be able to elaborate. E.g., what is i, and what does your “etc” entail?
Thanks
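
Not the original poster, but here is my reading of that suggestion as a minimal sketch. It assumes Xs and Ys are Python lists of per-group arrays (so the groups can have different lengths), i indexes the groups, and the weight variables follow the shapes from the blog post; the “etc” would be the remaining layers plus a per-group observed variable:

```python
import pymc3 as pm
import theano.tensor as tt

# Assumed setup: Xs and Ys are lists of per-group numpy arrays, and the
# hierarchical weights are defined as in the blog post, with shapes
# (n_groups, n_features, n_hidden), (n_groups, n_hidden, n_hidden),
# and (n_groups, n_hidden) respectively.
n_groups = len(Xs)

# Per-group forward pass: an ordinary tt.dot per group replaces the
# single tt.batched_dot over a stacked tensor.
act_1 = [pm.math.tanh(tt.dot(Xs[i], weights_in_1[i])) for i in range(n_groups)]
act_2 = [pm.math.tanh(tt.dot(act_1[i], weights_1_2[i])) for i in range(n_groups)]
act_out = [pm.math.sigmoid(tt.dot(act_2[i], weights_2_out[i])) for i in range(n_groups)]

# One observed variable per group, so unequal group sizes are fine.
for i in range(n_groups):
    pm.Bernoulli("out_%d" % i, p=act_out[i], observed=Ys[i])
```

This has to live inside the same with pm.Model(): block that defines the weights. The trade-off is a larger computation graph (one set of nodes per group), which can slow down compilation and sampling when there are many groups, whereas batched_dot keeps the graph size constant.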