Numeric instability in model?

Hi folks :wave:

I built this model to try to classify the MNIST dataset with a PyMC model, very similar to Bayesian Deep Learning Part II: Bridging PyMC3 and Lasagne to build a Hierarchical Neural Network — While My MCMC Gently Samples. It’s more of an exercise, so I’m not trying to maximize the model’s score. I’m using the mnist PyPI package to load the images and labels. I do a bit of work to reshape the model input into a (train_set_size, 28*28) tensor, but besides that I don’t do much data transformation.
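For reference, the data prep is roughly this (a sketch from memory rather than my exact script; the main point is just flattening each 28x28 image into a 784-long vector):

import mnist  # the mnist PyPI package
import numpy as np

# rough sketch of the data prep: take a subset of the training images/labels
# and flatten each 28x28 image into a single row of 784 pixel values
xtr = mnist.train_images()[:1000]              # (1000, 28, 28)
ytr = mnist.train_labels()[:1000]              # (1000,)
xtr_collapsed = xtr.reshape(xtr.shape[0], -1)  # (1000, 784)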

My model is the following:

import operator

import pymc as pm
import aesara.tensor as tt

with pm.Model() as model:
    input_layer = pm.MutableData("Input data", xtr_collapsed)
    # half normal is dumb here, testing if it fixes the bad sampling.
    sigma = 10
    w1 = pm.HalfNormal(name="weights_L1", sigma=sigma, shape=(operator.mul(*xtr.shape[1:]), 10))
    b1 = pm.HalfNormal(name="bias_L1", sigma=sigma)
    w2 = pm.HalfNormal(name="weights_L2", sigma=sigma, shape=(10, 10))
    b2 = pm.HalfNormal(name="bials_L2", sigma=sigma)

    d1 = pm.math.dot(input_layer, w1) + b1
    d1_act = tt.nnet.relu(d1)
    d2 = pm.math.dot(d1_act, w2) + b2
    d2_act = pm.Deterministic("prob_vector", tt.nnet.softmax(d2))

    out = pm.Categorical("output_image", p=d2_act, shape=(10,), observed=ytr)
    trace = pm.sample(1000)

However, this reliably fails: the model can’t be evaluated at its starting point.

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'weights_L1_log__': array([[1.37482088, 1.73278893, 2.99339127, ..., 3.19726469, 2.48872794,
        1.61221258],
       [2.26450808, 1.61254952, 2.66221272, ..., 2.48100945, 1.88426663,
        3.19736027],
       [2.7489012 , 1.76085018, 2.90702417, ..., 2.53959523, 2.67068885,
        2.38506311],
       ...,
       [3.05376546, 3.25595545, 2.90500401, ..., 1.76518558, 1.97759702,
        2.79381552],
       [2.11998757, 2.60491563, 1.62591751, ..., 2.97126484, 1.37402504,
        1.42363439],
       [3.05235059, 2.14056238, 1.56912392, ..., 1.82148841, 2.6710015 ,
        2.95111808]]), 'bias_L1_log__': array(3.09719118), 'weights_L2_log__': array([[2.24883723, 1.63096953, 2.1467081 , 3.05354696, 1.42033742,
        1.68394757, 2.8135108 , 1.93633338, 1.45380023, 1.86627278],
       [2.60919593, 3.12426194, 2.78506874, 2.98095268, 2.71895375,
        2.83157437, 2.25913873, 1.81027324, 2.01794709, 1.87127218],
       [3.05540505, 1.37352505, 2.91633462, 2.22470959, 3.0643761 ,
        1.66476466, 2.51197866, 1.81694154, 2.63089558, 2.21638313],
       [1.34984191, 2.490901  , 1.5030485 , 3.00968011, 1.91147255,
        2.20350087, 1.98295273, 3.25073676, 2.09514283, 2.17073368],
       [2.50043393, 2.14080139, 1.63344651, 1.30433951, 1.65032382,
        2.6382024 , 1.60630552, 2.10396354, 2.34055179, 3.07468549],
       [2.71416466, 1.91966414, 1.44767031, 3.02037052, 1.62546005,
        2.14546952, 1.57661002, 3.11709261, 2.83912894, 3.12258633],
       [1.37028274, 3.14157269, 2.33021313, 2.00895707, 2.71446165,
        1.88357063, 1.3826265 , 2.47217179, 2.82311552, 1.88598325],
       [2.24698109, 3.07370228, 2.72725184, 1.97487037, 1.48434121,
        3.11197024, 1.52367014, 2.74767257, 2.53206642, 2.83946842],
       [1.99620709, 2.87543325, 2.48868871, 1.97008951, 2.02698368,
        1.79207962, 2.49672451, 1.92625075, 2.23352972, 2.46698243],
       [2.79665175, 2.23985067, 2.66792614, 2.49033473, 1.98972399,
        2.41311927, 1.69043013, 2.09696587, 2.01460189, 2.97104148]]), 'bials_L2_log__': array(2.40026694)}

Initial evaluation results:
{'weights_L1': -8859.84, 'bias_L1': -1.88, 'weights_L2': -105.45, 'bials_L2': -0.74, 'output_image': -inf}

I’m reasonably confident the issue is in the output_image RV, and specifically in the probability vector produced by the softmax call in aesara. I see some others talking about numeric instability with theano’s softmax, but I’m not sure that’s what’s going on here.
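For concreteness, here’s a toy numpy illustration of the kind of overflow I mean (not my actual model):

import numpy as np

# with activations this large, exp() overflows; the normalized probabilities
# come out as 0/nan, and a Categorical logp for a class with probability 0
# is log(0) = -inf
logits = np.array([0.0, 1000.0])
probs = np.exp(logits) / np.exp(logits).sum()
print(probs)             # [ 0. nan]
print(np.log(probs[0]))  # -inf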

I instead tried the log-sum-exp trick by defining a new log-softmax function

def log_softmax(p_before_softmax, T=tt):
    # log-softmax via the identity log(softmax(x)) = x - log(sum(exp(x)))
    softmax = p_before_softmax - T.log(T.sum(T.exp(p_before_softmax), axis=1, keepdims=True))
    return softmax

which I can then pass as logit_p to the Categorical variable, but I still see the problem. Is this really some bizarre numeric instability, or is my model poorly defined?
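(Side note: the fully stable form I’ve seen elsewhere also subtracts the row-wise max before exponentiating. A sketch, assuming aesara.tensor is imported as tt like above:)

def log_softmax_stable(x, axis=-1):
    # shift by the max so exp() cannot overflow; mathematically this is
    # still x - log(sum(exp(x)))
    x_shifted = x - tt.max(x, axis=axis, keepdims=True)
    return x_shifted - tt.log(tt.sum(tt.exp(x_shifted), axis=axis, keepdims=True))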

Does it work if you forgo the final activation and pass d2 to the logit_p argument of pm.Categorical (instead of applying the softmax and using the p argument)?
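Something like this, roughly (just a sketch, assuming a PyMC version whose Categorical accepts logit_p):

d2 = pm.math.dot(d1_act, w2) + b2
# hand the raw activations to Categorical via logit_p instead of
# applying the softmax ourselves and passing p
out = pm.Categorical("output_image", logit_p=d2, observed=ytr)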

Happy holidays!

No, it doesn’t seem to, unfortunately.

(venv) ➜  mnist_pymc3 python nonotebook.py
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Traceback (most recent call last):
  File "/Users/ischweer/dev/mnist_pymc3/nonotebook.py", line 54, in <module>
    trace = pm.sample(1000)
  File "/Users/ischweer/dev/mnist_pymc3/venv/lib/python3.9/site-packages/pymc/sampling/mcmc.py", line 481, in sample
    model.check_start_vals(ip)
  File "/Users/ischweer/dev/mnist_pymc3/venv/lib/python3.9/site-packages/pymc/model.py", line 1735, in check_start_vals
    raise SamplingError(
pymc.exceptions.SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'weights_L1_log__': array([[2.0489868 , 2.29282028, 1.62479404, ..., 1.77906665, 3.12434933,
        2.5982067 ],
       [2.50507699, 2.8722204 , 2.47869092, ..., 1.53915589, 2.36268589,
        2.33305066],
       [2.22272731, 2.11543629, 2.66759047, ..., 1.70295995, 1.90047655,
        2.81893555],
       ...,
       [3.21871667, 1.82691797, 2.3540698 , ..., 2.56889394, 1.31456825,
        3.13029525],
       [2.07278691, 2.60046735, 2.78101664, ..., 1.3830041 , 2.23511333,
        3.11576379],
       [2.23634112, 3.0639098 , 1.80898628, ..., 3.21843128, 2.92637867,
        2.94519339]]), 'bias_L1_log__': array(2.76700306), 'weights_L2_log__': array([[2.00262782, 2.05867521, 2.61962345, 2.4207734 , 2.29723549,
        2.76937656, 2.25062155, 2.22674004, 2.42367538, 1.70406751],
       [1.65884359, 1.56057319, 2.1418022 , 2.91019823, 3.13579391,
        1.70351207, 1.50608573, 3.2868618 , 1.81836728, 2.13352568],
       [1.67942477, 2.81756431, 3.08895532, 2.03716696, 2.90191013,
        2.12077548, 3.30015637, 2.34415218, 2.49743927, 2.32026869],
       [2.92590058, 2.22784299, 3.11016981, 1.63095765, 2.66670251,
        2.94009566, 2.126239  , 2.51133983, 1.40885567, 2.18278176],
       [1.97944919, 2.93071893, 1.81210398, 2.28716214, 2.81114298,
        1.82795826, 3.22492277, 2.45893184, 2.13650817, 2.97507377],
       [2.36527763, 1.51739704, 3.22580218, 2.32563154, 2.3347282 ,
        3.10205879, 2.98607366, 2.53586637, 2.63115598, 1.4620414 ],
       [2.83583981, 2.50755664, 2.6958747 , 2.07311375, 1.93379108,
        1.46731138, 2.6502225 , 2.33321623, 1.31035125, 1.34227451],
       [2.31935981, 1.78741482, 1.76724034, 2.49260163, 1.68024617,
        2.12868402, 2.90296069, 3.06259197, 2.41879445, 2.18154351],
       [2.9783877 , 1.68506507, 1.66709071, 1.79545396, 1.88851668,
        2.99001033, 2.28954519, 2.12944325, 1.44489446, 2.19257303],
       [1.95916708, 2.09635011, 2.39680305, 3.18580203, 3.26291856,
        2.21683949, 3.26032369, 3.07621556, 1.71212483, 2.76736796]]), 'bials_L2_log__': array(3.18213543)}

Initial evaluation results:
{'weights_L1': -8887.72, 'bias_L1': -1.03, 'weights_L2': -110.74, 'bials_L2': -2.25, 'output_image': -inf}
(venv) ➜  mnist_pymc3

So softmax may not be the problem LOL

Are you triply sure that the values of ytr are exactly within the set of expected categories for pm.Categorical? If you aren’t sure, you could draw some prior samples for out by removing the observed kwarg and then looking at the values of the draws of out after sampling.


I’m fairly confident it does, yeah.

with pm.Model() as dummy:
    pm.Categorical("testing", p=[0.1]*10, shape=(10,), observed=ytr)
    x = pm.sample_prior_predictive()
x.prior_predictive.testing.values

array([[[1, 2, 9, ..., 8, 3, 3],
        [9, 2, 7, ..., 5, 9, 4],
        [2, 7, 1, ..., 9, 1, 3],
        ...,
        [5, 9, 3, ..., 8, 0, 5],
        [0, 8, 6, ..., 7, 6, 3],
        [4, 7, 6, ..., 1, 7, 4]]])
with pm.Model() as dummy:
    pm.Categorical("testing", p=[0.1]*10, shape=(10,))
    x = pm.sample_prior_predictive()
x.prior.testing.values

array([[[4, 2, 5, ..., 0, 6, 3],
        [8, 3, 9, ..., 3, 0, 9],
        [4, 1, 8, ..., 2, 9, 6],
        ...,
        [1, 1, 9, ..., 6, 3, 4],
        [1, 2, 3, ..., 5, 1, 7],
        [1, 5, 9, ..., 6, 6, 5]]])

I’m fairly confident it’s because I’m feeding garbage into my model, but I haven’t been able to work out what the garbage is. Here is a really simple example:

with pm.Model() as dummy:
    w = pm.Normal("weights", mu=0, sigma=1, shape=(3,))
    w2 = pm.Deterministic("probabilities", pm.math.dot(w, np.array([[-1, 0.1, 1]]*3)))
    cats = pm.Categorical("testing", p=w2, shape=(3))
    pp = pm.sample_prior_predictive()
    trace = pm.sample(100)

This isn’t the source of your problem, but this shorter example is missing a sigmoid function (or something similar) to map the real-valued activations from w2 into [0, 1]. I see you had the relu activation in your longer example.
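E.g. something along these lines (an untested sketch with the normalization written out by hand):

with pm.Model() as dummy:
    w = pm.Normal("weights", mu=0, sigma=1, shape=(3,))
    logits = pm.math.dot(w, np.array([[-1, 0.1, 1]] * 3))
    # exp(x - logsumexp(x)) is just a softmax: it maps the real-valued
    # activations onto a probability vector that sums to 1
    probs = pm.Deterministic("probabilities", pm.math.exp(logits - pm.math.logsumexp(logits)))
    cats = pm.Categorical("testing", p=probs, shape=(3,))
    trace = pm.sample(100)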

To help rule out overflow, could you try using the first 2 images, then the first 50, first 1000, and so on? It looks like your example might be using only a single image, but I couldn’t tell for certain.

This took me pretty long to solve; it was a shape problem.

xtr_collapsed.shape, ytr.shape
(784, 1000), (1000,)
with pm.Model() as model:
    input_layer = pm.MutableData("Input data", xtr_collapsed)
    sigma = 10
    w1 = pm.Normal(name="weights_L1", mu=0, sigma=sigma, shape=(10, 784))
    b1 = pm.Normal(name="bias_L1", mu=0, sigma=sigma, shape=(10, 1))
    w2 = pm.Normal(name="weights_L2", mu=0, sigma=sigma, shape=(10, 10))
    b2 = pm.Normal(name="bials_L2", mu=0, sigma=sigma, shape=(10, 1))
    
    d1 = pm.math.dot(w1, input_layer) + b1
    d1_act = tt.nnet.relu(d1)
    d2 = pm.math.dot(w2, d1_act) + b2
    d2_act = pm.Deterministic("prob_vector", tt.nnet.softmax(d2, axis=1))
    
    out = pm.Categorical("output_image", p=d2_act, shape=(10,))
    trace = pm.sample(1000)

The first dot was computing an inner product rather than the matrix product I wanted (we should get 10 values, each a linear combination of 784 pixel values and 784 weights). This led to a less obvious issue.
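A quick check with plain numpy arrays of the same shapes makes the orientation easy to see:

import numpy as np

x = np.random.rand(784, 1000)   # xtr_collapsed: one column per image
w1 = np.random.rand(10, 784)
b1 = np.random.rand(10, 1)

d1 = w1 @ x + b1
print(d1.shape)                 # (10, 1000): 10 activations for each image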
