Numeric instability in model?

Hi folks :wave:

I built this model to try to classify the MNIST dataset with a PyMC model, very similar to Bayesian Deep Learning Part II: Bridging PyMC3 and Lasagne to build a Hierarchical Neural Network — While My MCMC Gently Samples. It’s more of an exercise, so I’m not trying to maximize the model’s score. I’m using the mnist PyPI package to load the images and labels. I do a bit of work to reshape the model input into a (train_set_size, 28*28) tensor, but besides that I don’t do much data transformation.
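For reference, the data prep is roughly this (a sketch from memory rather than my exact script; the main point is just flattening each 28x28 image into a 784-long vector):

import mnist  # the mnist PyPI package
import numpy as np

# rough sketch of the data prep: take a subset of the training images/labels
# and flatten each 28x28 image into a single row of 784 pixel values
xtr = mnist.train_images()[:1000]              # (1000, 28, 28)
ytr = mnist.train_labels()[:1000]              # (1000,)
xtr_collapsed = xtr.reshape(xtr.shape[0], -1)  # (1000, 784)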

My model is the following:

import operator

import pymc as pm
import aesara.tensor as tt

with pm.Model() as model:
    input_layer = pm.MutableData("Input data", xtr_collapsed)
    # half normal is dumb here, testing if it fixes the bad sampling.
    sigma = 10
    w1 = pm.HalfNormal(name="weights_L1", sigma=sigma, shape=(operator.mul(*xtr.shape[1:]), 10))
    b1 = pm.HalfNormal(name="bias_L1", sigma=sigma)
    w2 = pm.HalfNormal(name="weights_L2", sigma=sigma, shape=(10, 10))
    b2 = pm.HalfNormal(name="bials_L2", sigma=sigma)

    d1 = pm.math.dot(input_layer, w1) + b1
    d1_act = tt.nnet.relu(d1)
    d2 = pm.math.dot(d1_act, w2) + b2
    d2_act = pm.Deterministic("prob_vector", tt.nnet.softmax(d2))

    out = pm.Categorical("output_image", p=d2_act, shape=(10,), observed=ytr)
    trace = pm.sample(1000)

However, this reliably fails: the model can’t be evaluated at its starting point.

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'weights_L1_log__': array([[1.37482088, 1.73278893, 2.99339127, ..., 3.19726469, 2.48872794,
        1.61221258],
       [2.26450808, 1.61254952, 2.66221272, ..., 2.48100945, 1.88426663,
        3.19736027],
       [2.7489012 , 1.76085018, 2.90702417, ..., 2.53959523, 2.67068885,
        2.38506311],
       ...,
       [3.05376546, 3.25595545, 2.90500401, ..., 1.76518558, 1.97759702,
        2.79381552],
       [2.11998757, 2.60491563, 1.62591751, ..., 2.97126484, 1.37402504,
        1.42363439],
       [3.05235059, 2.14056238, 1.56912392, ..., 1.82148841, 2.6710015 ,
        2.95111808]]), 'bias_L1_log__': array(3.09719118), 'weights_L2_log__': array([[2.24883723, 1.63096953, 2.1467081 , 3.05354696, 1.42033742,
        1.68394757, 2.8135108 , 1.93633338, 1.45380023, 1.86627278],
       [2.60919593, 3.12426194, 2.78506874, 2.98095268, 2.71895375,
        2.83157437, 2.25913873, 1.81027324, 2.01794709, 1.87127218],
       [3.05540505, 1.37352505, 2.91633462, 2.22470959, 3.0643761 ,
        1.66476466, 2.51197866, 1.81694154, 2.63089558, 2.21638313],
       [1.34984191, 2.490901  , 1.5030485 , 3.00968011, 1.91147255,
        2.20350087, 1.98295273, 3.25073676, 2.09514283, 2.17073368],
       [2.50043393, 2.14080139, 1.63344651, 1.30433951, 1.65032382,
        2.6382024 , 1.60630552, 2.10396354, 2.34055179, 3.07468549],
       [2.71416466, 1.91966414, 1.44767031, 3.02037052, 1.62546005,
        2.14546952, 1.57661002, 3.11709261, 2.83912894, 3.12258633],
       [1.37028274, 3.14157269, 2.33021313, 2.00895707, 2.71446165,
        1.88357063, 1.3826265 , 2.47217179, 2.82311552, 1.88598325],
       [2.24698109, 3.07370228, 2.72725184, 1.97487037, 1.48434121,
        3.11197024, 1.52367014, 2.74767257, 2.53206642, 2.83946842],
       [1.99620709, 2.87543325, 2.48868871, 1.97008951, 2.02698368,
        1.79207962, 2.49672451, 1.92625075, 2.23352972, 2.46698243],
       [2.79665175, 2.23985067, 2.66792614, 2.49033473, 1.98972399,
        2.41311927, 1.69043013, 2.09696587, 2.01460189, 2.97104148]]), 'bials_L2_log__': array(2.40026694)}

Initial evaluation results:
{'weights_L1': -8859.84, 'bias_L1': -1.88, 'weights_L2': -105.45, 'bials_L2': -0.74, 'output_image': -inf}

I’m reasonably confident the issue is in the output_image RV, and specifically in the probability vector produced by the softmax call in aesara. I see some others talking about numeric instability with theano’s softmax, but I’m not sure that’s what’s going on here.
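For concreteness, here’s a toy numpy illustration of the kind of overflow I mean (not my actual model):

import numpy as np

# with activations this large, exp() overflows; the normalized probabilities
# come out as 0/nan, and a Categorical logp for a class with probability 0
# is log(0) = -inf
logits = np.array([0.0, 1000.0])
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)             # [ 0. nan]
print(np.log(probs[0]))  # -inf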

I instead tried the log-sum-exp trick by defining a new log-softmax function

def log_softmax(p_before_softmax, T=tt):
    # log-softmax via the identity log(softmax(x)) = x - log(sum(exp(x)))
    softmax = p_before_softmax - T.log(T.sum(T.exp(p_before_softmax), axis=1, keepdims=True))
    return softmax

which I can then pass as logit_p to the Categorical variable, but I still see the problem. Is this really some bizarre numeric instability, or is my model poorly defined?
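(Side note: the fully stable form I’ve seen elsewhere also subtracts the row-wise max before exponentiating. A sketch, assuming aesara.tensor is imported as tt like above:)

def log_softmax_stable(x, axis=-1):
    # shift by the max so exp() cannot overflow; mathematically this is
    # still x - log(sum(exp(x)))
    x_shifted = x - tt.max(x, axis=axis, keepdims=True)
    return x_shifted - tt.log(tt.sum(tt.exp(x_shifted), axis=axis, keepdims=True))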

Does it work if you forgo the final activation and pass d2 to the logit_p argument of pm.Categorical (instead of applying the softmax and using the p argument)?
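Something like this, roughly (just a sketch, assuming a PyMC version whose Categorical accepts logit_p):

d2 = pm.math.dot(d1_act, w2) + b2
# hand the raw activations to Categorical via logit_p instead of
# applying the softmax ourselves and passing p
out = pm.Categorical("output_image", logit_p=d2, observed=ytr)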

Happy holidays!

No, it doesn’t seem to, unfortunately.

(venv) ➜  mnist_pymc3 python nonotebook.py
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Traceback (most recent call last):
  File "/Users/ischweer/dev/mnist_pymc3/nonotebook.py", line 54, in <module>
    trace = pm.sample(1000)
  File "/Users/ischweer/dev/mnist_pymc3/venv/lib/python3.9/site-packages/pymc/sampling/mcmc.py", line 481, in sample
    model.check_start_vals(ip)
  File "/Users/ischweer/dev/mnist_pymc3/venv/lib/python3.9/site-packages/pymc/model.py", line 1735, in check_start_vals
    raise SamplingError(
pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'weights_L1_log__': array([[2.0489868 , 2.29282028, 1.62479404, ..., 1.77906665, 3.12434933,
        2.5982067 ],
       [2.50507699, 2.8722204 , 2.47869092, ..., 1.53915589, 2.36268589,
        2.33305066],
       [2.22272731, 2.11543629, 2.66759047, ..., 1.70295995, 1.90047655,
        2.81893555],
       ...,
       [3.21871667, 1.82691797, 2.3540698 , ..., 2.56889394, 1.31456825,
        3.13029525],
       [2.07278691, 2.60046735, 2.78101664, ..., 1.3830041 , 2.23511333,
        3.11576379],
       [2.23634112, 3.0639098 , 1.80898628, ..., 3.21843128, 2.92637867,
        2.94519339]]), 'bias_L1_log__': array(2.76700306), 'weights_L2_log__': array([[2.00262782, 2.05867521, 2.61962345, 2.4207734 , 2.29723549,
        2.76937656, 2.25062155, 2.22674004, 2.42367538, 1.70406751],
       [1.65884359, 1.56057319, 2.1418022 , 2.91019823, 3.13579391,
        1.70351207, 1.50608573, 3.2868618 , 1.81836728, 2.13352568],
       [1.67942477, 2.81756431, 3.08895532, 2.03716696, 2.90191013,
        2.12077548, 3.30015637, 2.34415218, 2.49743927, 2.32026869],
       [2.92590058, 2.22784299, 3.11016981, 1.63095765, 2.66670251,
        2.94009566, 2.126239  , 2.51133983, 1.40885567, 2.18278176],
       [1.97944919, 2.93071893, 1.81210398, 2.28716214, 2.81114298,
        1.82795826, 3.22492277, 2.45893184, 2.13650817, 2.97507377],
       [2.36527763, 1.51739704, 3.22580218, 2.32563154, 2.3347282 ,
        3.10205879, 2.98607366, 2.53586637, 2.63115598, 1.4620414 ],
       [2.83583981, 2.50755664, 2.6958747 , 2.07311375, 1.93379108,
        1.46731138, 2.6502225 , 2.33321623, 1.31035125, 1.34227451],
       [2.31935981, 1.78741482, 1.76724034, 2.49260163, 1.68024617,
        2.12868402, 2.90296069, 3.06259197, 2.41879445, 2.18154351],
       [2.9783877 , 1.68506507, 1.66709071, 1.79545396, 1.88851668,
        2.99001033, 2.28954519, 2.12944325, 1.44489446, 2.19257303],
       [1.95916708, 2.09635011, 2.39680305, 3.18580203, 3.26291856,
        2.21683949, 3.26032369, 3.07621556, 1.71212483, 2.76736796]]), 'bials_L2_log__': array(3.18213543)}

Initial evaluation results:
{'weights_L1': -8887.72, 'bias_L1': -1.03, 'weights_L2': -110.74, 'bials_L2': -2.25, 'output_image': -inf}
(venv) ➜  mnist_pymc3

So softmax may not be the problem LOL

Are you triply sure that the values of ytr are exactly within the set of expected categories for pm.Categorical? If you aren’t sure, you could draw some prior samples for out by removing the observed kwarg and then looking at the values of the draws of out after sampling.


I’m fairly confident it does, yeah.

with pm.Model() as dummy:
    pm.Categorical("testing", p=[0.1]*10, shape=(10,), observed=ytr)
    x = pm.sample_prior_predictive()
x.prior_predictive.testing.values

array([[[1, 2, 9, ..., 8, 3, 3],
        [9, 2, 7, ..., 5, 9, 4],
        [2, 7, 1, ..., 9, 1, 3],
        ...,
        [5, 9, 3, ..., 8, 0, 5],
        [0, 8, 6, ..., 7, 6, 3],
        [4, 7, 6, ..., 1, 7, 4]]])
with pm.Model() as dummy:
    pm.Categorical("testing", p=[0.1]*10, shape=(10,))
    x = pm.sample_prior_predictive()
x.prior.testing.values

array([[[4, 2, 5, ..., 0, 6, 3],
        [8, 3, 9, ..., 3, 0, 9],
        [4, 1, 8, ..., 2, 9, 6],
        ...,
        [1, 1, 9, ..., 6, 3, 4],
        [1, 2, 3, ..., 5, 1, 7],
        [1, 5, 9, ..., 6, 6, 5]]])

I’m fairly confident it’s because I’m feeding garbage into my model, but I haven’t been able to work out what the garbage is. Here is a really simple example:

with pm.Model() as dummy:
    w = pm.Normal("weights", mu=0, sigma=1, shape=(3,))
    w2 = pm.Deterministic("probabilities", pm.math.dot(w, np.array([[-1, 0.1, 1]]*3)))
    cats = pm.Categorical("testing", p=w2, shape=(3))
    pp = pm.sample_prior_predictive()
    trace = pm.sample(100)

This isn’t the source of your problem, but this shorter example is missing a sigmoid function (or something similar) to map the real-valued activations from w2 into [0, 1]. I see you had the relu activation in your longer example.
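E.g. something along these lines (an untested sketch with the normalization written out by hand):

with pm.Model() as dummy:
    w = pm.Normal("weights", mu=0, sigma=1, shape=(3,))
    logits = pm.math.dot(w, np.array([[-1, 0.1, 1]] * 3))
    # exp(x - logsumexp(x)) is just a softmax: it maps the real-valued
    # activations onto a probability vector that sums to 1
    probs = pm.Deterministic("probabilities", pm.math.exp(logits - pm.math.logsumexp(logits)))
    cats = pm.Categorical("testing", p=probs, shape=(3,))
    trace = pm.sample(100)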

To help rule out overflow, could you try using the first 2 images, then the first 50, first 1000, and so on? It looks like your example might be using only a single image, but I couldn’t tell for certain.

This took me pretty long to solve; it was a shape problem.

xtr_collapsed.shape, ytr.shape
(784, 1000), (1000,)
with pm.Model() as model:
    input_layer = pm.MutableData("Input data", xtr_collapsed)
    sigma = 10
    w1 = pm.Normal(name="weights_L1", mu=0, sigma=sigma, shape=(10, 784))
    b1 = pm.Normal(name="bias_L1", mu=0, sigma=sigma, shape=(10, 1))
    w2 = pm.Normal(name="weights_L2", mu=0, sigma=sigma, shape=(10, 10))
    b2 = pm.Normal(name="bials_L2", mu=0, sigma=sigma, shape=(10, 1))
    
    d1 = pm.math.dot(w1, input_layer) + b1
    d1_act = tt.nnet.relu(d1)
    d2 = pm.math.dot(w2, d1_act) + b2
    d2_act = pm.Deterministic("prob_vector", tt.nnet.softmax(d2, axis=1))
    
    out = pm.Categorical("output_image", p=d2_act, shape=(10,))
    trace = pm.sample(1000)

The first dot was computing an inner product rather than the matrix product I wanted (we should get 10 values, each a linear combination of 784 pixel values and 784 weights). This led to a less obvious issue.
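A quick check with plain numpy arrays of the same shapes makes the orientation easy to see:

import numpy as np

x = np.random.rand(784, 1000)   # xtr_collapsed: one column per image
w1 = np.random.rand(10, 784)
b1 = np.random.rand(10, 1)

d1 = w1 @ x + b1
print(d1.shape)                 # (10, 1000): 10 activations for each image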
