tt.shared variable making sampling very slow


#1

I just modified my code to use tt.shared variables so that I can test it on unseen data, but this makes sampling 10x (if not more) slower. Is this normal?

Thanks,


#2

I have never observed a similar slowdown. What is the setup of your model?


#3

Here is the code. If I replace the shared variables by the numpy arrays, it is much much faster:

import pymc3 as pm
from theano import shared

scores_shared = shared(audits.Score.values)
n_shared = shared(audits.ScoreMaximum.values)
Shift1Score_shared = shared(audits['Shift1Score'].values)

# scores_shared = audits.Score.values
# n_shared = audits.ScoreMaximum.values
# Shift1Score_shared = audits['Shift1Score'].values

with pm.Model() as binomial_model:
    # Hyperpriors
    mu_a = pm.Uniform('mu_alpha', lower=0, upper=1)
    sigma_a = pm.HalfNormal('sigma_alpha', sd=0.1)
    mu_b = pm.Normal('mu_beta', mu=0, sd=1)
    sigma_b = pm.HalfCauchy('sigma_beta', beta=1)
    
    # Intercept for each store, distributed around group mean mu_a
    a = pm.Normal('intercept', mu=mu_a, sd=sigma_a, shape=len(uniqueStores))
    # Slope for each store, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=len(uniqueStores))
    
    # Model error
    #eps = pm.HalfNormal('eps', sd=0.1)
    
    # Expected value
    p = pm.math.sigmoid(a[storeIDX] + (b[storeIDX] * Shift1Score_shared))
    
    # Data likelihood    
    y_like = pm.Binomial('y_like', n=n_shared, p=p, observed=scores_shared)

    trace_binomial = pm.sample(progressbar=True, njobs=4)

#4

It’s not immediately clear to me why it would be slower. You can try following the profiling method described here to check where the slowdown comes from.


#5

Here is the output from the profiling method:

Function profiling
==================
  Message: C:\ProgramData\Anaconda3\lib\site-packages\pymc3\model.py:853
  Time in 1000 calls to Function.__call__: 3.342232e+01s
  Time in Function.fn.__call__: 3.336384e+01s (99.825%)
  Time in thunks: 3.332985e+01s (99.723%)
  Total compile time: 4.253020e+00s
    Number of Apply nodes: 37
    Theano Optimizer time: 3.294990e-01s
       Theano validate time: 4.003763e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 3.885020e+00s
       Import time 6.899691e-02s
       Node make_thunk time 3.883020e+00s
           Node Elemwise{Composite{Switch(Identity((GE(i0, i1) * LE(i0, i2) * GE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1) * LE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i6))), (((i7 - i8) - i9) + Switch(EQ(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1), i10, (i11 * i0 * scalar_softplus((-Composite{(i0 + (i1 * i2))}(i3, i4, i5))))) + Switch(EQ((i12 - scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5))), i1), i10, (i13 * (i2 - i0) * scalar_softplus(Composite{(i0 + (i1 * i2))}(i3, i4, i5))))), i10)}}[(0, 3)](Elemwise{Cast{int64}}.0, TensorConstant{(1,) of 0}, <TensorType(int32, vector)>, AdvancedSubtensor1.0, AdvancedSubtensor1.0, <TensorType(float64, vector)>, TensorConstant{(1,) of 1}, gammaln.0, gammaln.0, gammaln.0, TensorConstant{(1,) of -inf}, TensorConstant{(1,) of -1.0}, TensorConstant{(1,) of 1.0}, TensorConstant{(1,) of -1.0}) time 1.011997e+00s
           Node Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}(Elemwise{Composite{Identity(GT(i0, i1))}}.0, TensorConstant{(1,) of 0.5}, Elemwise{Composite{inv(sqr(i0))}}.0, b, InplaceDimShuffle{x}.0, Elemwise{Composite{log((i0 * i1))}}.0, TensorConstant{(1,) of -inf}) time 8.304996e-01s
           Node Elemwise{Composite{(Switch(Identity((GE(i0, i1) * LE(i0, i2))), i1, i3) - ((i4 * scalar_softplus((-i5))) + i5))}}[(0, 0)](sigmoid.0, TensorConstant{0.0}, TensorConstant{1.0}, TensorConstant{-inf}, TensorConstant{2.0}, mu_alpha_interval__) time 7.100189e-01s
           Node Elemwise{Composite{(Switch(Identity(GE(i0, i1)), (i2 - log1p(sqr(i0))), i3) + i4)}}[(0, 0)](Elemwise{exp,no_inplace}.0, TensorConstant{0}, TensorConstant{-0.45158270338480055}, TensorConstant{-inf}, sigma_beta_log__) time 6.914997e-01s
           Node Elemwise{Composite{inv(sqr(i0))}}(InplaceDimShuffle{x}.0) time 5.865004e-01s

Time in all call to theano.grad() 0.000000e+00s
Time since theano import 84.273s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  99.9%    99.9%      33.292s       1.45e-03s     C Py   23000      23   theano.tensor.elemwise.Elemwise
   0.1%   100.0%       0.025s       1.25e-05s     C     2000       2   theano.tensor.subtensor.AdvancedSubtensor1
   0.0%   100.0%       0.011s       2.75e-06s     C     4000       4   theano.tensor.elemwise.Sum
   0.0%   100.0%       0.001s       1.50e-06s     C     1000       1   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.000s       1.66e-07s     C     3000       3   theano.compile.ops.ViewOp
   0.0%   100.0%       0.000s       0.00e+00s     C     4000       4   theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  95.2%    95.2%      31.731s       1.06e-02s     Py    3000        3   gammaln
   4.6%    99.8%       1.524s       1.52e-03s     C     1000        1   Elemwise{Composite{Switch(Identity((GE(i0, i1) * LE(i0, i2) * GE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1) * LE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i6))), (((i7 - i8) - i9) + Switch(EQ(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1), i10, (i11 * i0 * scalar_softplus((-Composite{(i0 + (i1 * i2))}(i3, i4, i5))))) + Switch(EQ((i12 - scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5))),
   0.1%    99.8%       0.025s       1.25e-05s     C     2000        2   AdvancedSubtensor1
   0.0%    99.9%       0.013s       1.30e-05s     C     1000        1   Elemwise{Composite{((i0 + i1) - i2)}}
   0.0%    99.9%       0.011s       2.75e-06s     C     4000        4   Sum{acc_dtype=float64}
   0.0%   100.0%       0.010s       5.00e-06s     C     2000        2   Elemwise{add,no_inplace}
   0.0%   100.0%       0.006s       3.25e-06s     C     2000        2   Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}
   0.0%   100.0%       0.003s       3.00e-06s     C     1000        1   Elemwise{Cast{int64}}
   0.0%   100.0%       0.001s       1.50e-06s     C     1000        1   MakeVector{dtype='float64'}
   0.0%   100.0%       0.001s       5.00e-07s     C     2000        2   Elemwise{Composite{Identity(GT(i0, i1))}}
   0.0%   100.0%       0.001s       2.50e-07s     C     2000        2   Elemwise{Composite{log((i0 * i1))}}
   0.0%   100.0%       0.001s       5.01e-07s     C     1000        1   Elemwise{Composite{(Switch(Identity(GE(i0, i1)), (i2 + (i3 * sqr(i0))), i4) + i5)}}[(0, 0)]
   0.0%   100.0%       0.001s       5.00e-07s     C     1000        1   Elemwise{Composite{(i0 * (i1 + (-sqr(i2))))}}
   0.0%   100.0%       0.000s       2.50e-07s     C     2000        2   Elemwise{exp,no_inplace}
   0.0%   100.0%       0.000s       5.00e-07s     C     1000        1   sigmoid
   0.0%   100.0%       0.000s       5.00e-07s     C     1000        1   Elemwise{Composite{(Switch(Identity((GE(i0, i1) * LE(i0, i2))), i1, i3) - ((i4 * scalar_softplus((-i5))) + i5))}}[(0, 0)]
   0.0%   100.0%       0.000s       1.66e-07s     C     3000        3   ViewOp
   0.0%   100.0%       0.000s       4.99e-07s     C     1000        1   Elemwise{Composite{(Switch(Identity(GE(i0, i1)), (i2 - log1p(sqr(i0))), i3) + i4)}}[(0, 0)]
   0.0%   100.0%       0.000s       0.00e+00s     C     4000        4   InplaceDimShuffle{x}
   0.0%   100.0%       0.000s       0.00e+00s     C     2000        2   Elemwise{Composite{inv(sqr(i0))}}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  31.8%    31.8%      10.607s       1.06e-02s   1000    11   gammaln(Elemwise{add,no_inplace}.0)
  31.8%    63.6%      10.589s       1.06e-02s   1000    16   gammaln(Elemwise{add,no_inplace}.0)
  31.6%    95.2%      10.535s       1.05e-02s   1000    15   gammaln(Elemwise{Composite{((i0 + i1) - i2)}}.0)
   4.6%    99.8%       1.524s       1.52e-03s   1000    20   Elemwise{Composite{Switch(Identity((GE(i0, i1) * LE(i0, i2) * GE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1) * LE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i6))), (((i7 - i8) - i9) + Switch(EQ(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1), i10, (i11 * i0 * scalar_softplus((-Composite{(i0 + (i1 * i2))}(i3, i4, i5))))) + Switch(EQ((i12 - scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5))), i1), i10, 
   0.0%    99.8%       0.013s       1.35e-05s   1000     3   AdvancedSubtensor1(intercept, TensorConstant{[  0   0  ..9 619 619]})
   0.0%    99.9%       0.013s       1.30e-05s   1000     9   Elemwise{Composite{((i0 + i1) - i2)}}(TensorConstant{(1,) of 1}, <TensorType(int32, vector)>, Elemwise{Cast{int64}}.0)
   0.0%    99.9%       0.012s       1.15e-05s   1000     2   AdvancedSubtensor1(b, TensorConstant{[  0   0  ..9 619 619]})
   0.0%    99.9%       0.009s       9.49e-06s   1000    27   Sum{acc_dtype=float64}(Elemwise{Composite{Switch(Identity((GE(i0, i1) * LE(i0, i2) * GE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1) * LE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i6))), (((i7 - i8) - i9) + Switch(EQ(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1), i10, (i11 * i0 * scalar_softplus((-Composite{(i0 + (i1 * i2))}(i3, i4, i5))))) + Switch(EQ((i12 - scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3
   0.0%    99.9%       0.006s       6.00e-06s   1000    31   Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}(Elemwise{Composite{Identity(GT(i0, i1))}}.0, TensorConstant{(1,) of 0.5}, Elemwise{Composite{inv(sqr(i0))}}.0, intercept, InplaceDimShuffle{x}.0, Elemwise{Composite{log((i0 * i1))}}.0, TensorConstant{(1,) of -inf})
   0.0%   100.0%       0.006s       6.00e-06s   1000    10   Elemwise{add,no_inplace}(TensorConstant{(1,) of 1}, Elemwise{Cast{int64}}.0)
   0.0%   100.0%       0.004s       4.00e-06s   1000     1   Elemwise{add,no_inplace}(TensorConstant{(1,) of 1}, <TensorType(int32, vector)>)
   0.0%   100.0%       0.003s       3.00e-06s   1000     0   Elemwise{Cast{int64}}(<TensorType(int32, vector)>)
   0.0%   100.0%       0.001s       1.50e-06s   1000    35   MakeVector{dtype='float64'}(__logp_mu_alpha_interval__, __logp_sigma_alpha_log__, __logp_mu_beta, __logp_sigma_beta_log__, __logp_intercept, __logp_b, __logp_y_like)
   0.0%   100.0%       0.001s       1.00e-06s   1000    33   Sum{acc_dtype=float64}(Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}.0)
   0.0%   100.0%       0.001s       5.01e-07s   1000    29   Elemwise{Composite{log((i0 * i1))}}(TensorConstant{(1,) of 0...9154943092}, Elemwise{Composite{inv(sqr(i0))}}.0)
   0.0%   100.0%       0.001s       5.01e-07s   1000    26   Elemwise{Composite{(Switch(Identity(GE(i0, i1)), (i2 + (i3 * sqr(i0))), i4) + i5)}}[(0, 0)](Elemwise{exp,no_inplace}.0, TensorConstant{0}, TensorConstant{2.076793740349318}, TensorConstant{-49.99999999999999}, TensorConstant{-inf}, sigma_alpha_log__)
   0.0%   100.0%       0.001s       5.01e-07s   1000    22   Elemwise{Composite{Identity(GT(i0, i1))}}(InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0})
   0.0%   100.0%       0.001s       5.00e-07s   1000     8   Elemwise{Composite{(i0 * (i1 + (-sqr(i2))))}}(TensorConstant{0.5}, TensorConstant{-1.8378770664093453}, mu_beta)
   0.0%   100.0%       0.000s       5.00e-07s   1000    34   Sum{acc_dtype=float64}(Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}.0)
   0.0%   100.0%       0.000s       5.00e-07s   1000    32   Elemwise{Composite{(Switch(Identity((GE(i0, i1) * LE(i0, i2))), i1, i3) - ((i4 * scalar_softplus((-i5))) + i5))}}[(0, 0)](sigmoid.0, TensorConstant{0.0}, TensorConstant{1.0}, TensorConstant{-inf}, TensorConstant{2.0}, mu_alpha_interval__)
   ... (remaining 17 Apply instances account for 0.01%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  - Try the Theano flag floatX=float32
We don't know if amdlibm will accelerate this scalar op. scalar_gammaln
We don't know if amdlibm will accelerate this scalar op. scalar_gammaln
We don't know if amdlibm will accelerate this scalar op. scalar_gammaln
  - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.
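
A reading of this profile, offered as an interpretation rather than a confirmed diagnosis: the three gammaln nodes account for 95.2% of the time and are the only ops marked Py. Their inputs are `1 + <int32 vector>`, `1 + Cast{int64}(...)` and `(1 + <int32 vector>) - Cast{int64}(...)`, which matches the log binomial coefficient gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1). That term depends only on the data, not on the parameters. With plain numpy arrays Theano can presumably fold it into a constant, whereas with shared variables it is recomputed on every logp call. The pure-Python sketch below (hypothetical data and function names, not PyMC3 internals) shows that the term can be computed once and reused:

```python
import math

def log_binom_coeff(k, n):
    """Log binomial coefficient; depends only on the data (k, n)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def binomial_logp(k, n, p):
    """Binomial log-probability, recomputing the coefficient every call."""
    return log_binom_coeff(k, n) + k * math.log(p) + (n - k) * math.log1p(-p)

def binomial_logp_cached(k, n, p, coeff):
    """Same log-probability, reusing a coefficient computed ahead of time."""
    return coeff + k * math.log(p) + (n - k) * math.log1p(-p)

# Hypothetical stand-ins for Score, ScoreMaximum and the per-row p
scores = [3, 7, 10]
maxima = [10, 10, 12]
probs = [0.3, 0.6, 0.8]

# Computed once, before sampling would start
coeffs = [log_binom_coeff(k, n) for k, n in zip(scores, maxima)]

# Evaluated at every sampler step: the cached version skips all lgamma calls
full = sum(binomial_logp(k, n, p) for k, n, p in zip(scores, maxima, probs))
fast = sum(binomial_logp_cached(k, n, p, c)
           for k, n, p, c in zip(scores, maxima, probs, coeffs))
```

Both sums agree; only the second avoids the repeated gamma-function evaluations.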

#6

I also replaced this code with a dot product to try to speed things up with the shared variable.

Replaced this code:
p = pm.math.sigmoid(a[storeIDX] + (b[storeIDX] * Shift1Score_shared))

With this:
_, L = p.dmatrices('Score ~ -1+RelatedStoreID', data=audits, return_type='matrix')
L = np.asarray(L)
w = tt.dot(L,b)
# Expected value
p = pm.math.sigmoid(tt.dot(L,a) + (tt.dot(Shift1Score_shared, w)))

This didn't help. The sampling is initially super fast and gets to around 50/1000, but then becomes extremely slow.

I also tried setting theano.config.floatX = 'float32', but this didn’t change anything.


#7

If it is still during tuning, you do at times see slowdowns. As long as the trace is fine and there is no warning, your result should be fine.


#8

Alright, it appears that some of my math prevented the sampler from exploring the space: it was getting stuck and never returned. I fixed my dot products and it works well now.
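
The corrected code was not posted. One plausible reading of the problem, sketched here in numpy with made-up values: a dot product between the covariate vector and the per-observation slopes collapses to a single scalar, while the original indexed expression needs an elementwise product. The names mirror the thread, but the arrays are hypothetical and the indicator matrix is built by hand rather than with dmatrices.

```python
import numpy as np

# Hypothetical stand-ins for storeIDX, Shift1Score and the model parameters
store_idx = np.array([0, 0, 1, 2, 2])
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
a = np.array([0.1, 0.2, 0.3])    # per-store intercepts
b = np.array([1.0, -1.0, 0.5])   # per-store slopes

# One-hot indicator matrix, shape (n_obs, n_stores)
L = np.eye(len(a))[store_idx]

# Reference: the original indexed form of the linear predictor
eta_index = a[store_idx] + b[store_idx] * x

# Per-observation slopes via the indicator matrix, same as b[store_idx]
w = L.dot(b)

# Problematic form: np.dot of two vectors is a scalar, so every
# observation receives the same slope contribution
eta_scalar = L.dot(a) + np.dot(x, w)

# Elementwise product keeps one slope term per observation
eta_fixed = L.dot(a) + x * w
```

Here eta_fixed matches eta_index, while eta_scalar does not.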