I just modified my code to take advantage of tt.shared variables to be able to test my code with unseen data but it makes sampling 10x (if not more) slower. Is this normal?
Thanks,
I have never observed a similar slowdown. What is the setup of your model?
Here is the code. If I replace the shared variables with plain NumPy arrays, it is much faster:
import pymc3 as pm
from theano import shared

scores_shared = shared(audits.Score.values)
n_shared = shared(audits.ScoreMaximum.values)
Shift1Score_shared = shared(audits['Shift1Score'].values)

# scores_shared = audits.Score.values
# n_shared = audits.ScoreMaximum.values
# Shift1Score_shared = audits['Shift1Score'].values

with pm.Model() as binomial_model:
    # Hyperpriors
    mu_a = pm.Uniform('mu_alpha', lower=0, upper=1)
    sigma_a = pm.HalfNormal('sigma_alpha', sd=0.1)
    mu_b = pm.Normal('mu_beta', mu=0, sd=1)
    sigma_b = pm.HalfCauchy('sigma_beta', beta=1)

    # Intercept for each store, distributed around group mean mu_a
    a = pm.Normal('intercept', mu=mu_a, sd=sigma_a, shape=len(uniqueStores))
    # Slope for each store, distributed around group mean mu_b
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=len(uniqueStores))

    # Model error
    # eps = pm.HalfNormal('eps', sd=0.1)

    # Expected value
    p = pm.math.sigmoid(a[storeIDX] + (b[storeIDX] * Shift1Score_shared))

    # Data likelihood
    y_like = pm.Binomial('y_like', n=n_shared, p=p, observed=scores_shared)

    trace_binomial = pm.sample(progressbar=True, njobs=4)
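For readers following along: the expected-value line builds a per-store logistic regression, where each observation picks up the intercept and slope of its own store via `storeIDX`. A minimal pure-Python sketch of what `p` computes (hypothetical helper, stdlib only; the real computation runs inside Theano):

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def expected_p(a, b, store_idx, shift1_score):
    # Per-observation success probability: observation i uses the
    # intercept a[s] and slope b[s] of its store s = store_idx[i].
    return [sigmoid(a[s] + b[s] * x) for s, x in zip(store_idx, shift1_score)]
```

With a zero intercept and slope the probability is exactly 0.5, and a positive slope times a positive score pushes it above 0.5, which is all the link function does here.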
It's not immediately clear to me why it would be slower. You can try the profiling method described here to check where the slowdown comes from.
Here is the output from the profiling method:
Function profiling
==================
Message: C:\ProgramData\Anaconda3\lib\site-packages\pymc3\model.py:853
Time in 1000 calls to Function.__call__: 3.342232e+01s
Time in Function.fn.__call__: 3.336384e+01s (99.825%)
Time in thunks: 3.332985e+01s (99.723%)
Total compile time: 4.253020e+00s
Number of Apply nodes: 37
Theano Optimizer time: 3.294990e-01s
Theano validate time: 4.003763e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 3.885020e+00s
Import time 6.899691e-02s
Node make_thunk time 3.883020e+00s
Node Elemwise{Composite{Switch(Identity((GE(i0, i1) * LE(i0, i2) * GE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1) * LE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i6))), (((i7 - i8) - i9) + Switch(EQ(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1), i10, (i11 * i0 * scalar_softplus((-Composite{(i0 + (i1 * i2))}(i3, i4, i5))))) + Switch(EQ((i12 - scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5))), i1), i10, (i13 * (i2 - i0) * scalar_softplus(Composite{(i0 + (i1 * i2))}(i3, i4, i5))))), i10)}}[(0, 3)](Elemwise{Cast{int64}}.0, TensorConstant{(1,) of 0}, <TensorType(int32, vector)>, AdvancedSubtensor1.0, AdvancedSubtensor1.0, <TensorType(float64, vector)>, TensorConstant{(1,) of 1}, gammaln.0, gammaln.0, gammaln.0, TensorConstant{(1,) of -inf}, TensorConstant{(1,) of -1.0}, TensorConstant{(1,) of 1.0}, TensorConstant{(1,) of -1.0}) time 1.011997e+00s
Node Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}(Elemwise{Composite{Identity(GT(i0, i1))}}.0, TensorConstant{(1,) of 0.5}, Elemwise{Composite{inv(sqr(i0))}}.0, b, InplaceDimShuffle{x}.0, Elemwise{Composite{log((i0 * i1))}}.0, TensorConstant{(1,) of -inf}) time 8.304996e-01s
Node Elemwise{Composite{(Switch(Identity((GE(i0, i1) * LE(i0, i2))), i1, i3) - ((i4 * scalar_softplus((-i5))) + i5))}}[(0, 0)](sigmoid.0, TensorConstant{0.0}, TensorConstant{1.0}, TensorConstant{-inf}, TensorConstant{2.0}, mu_alpha_interval__) time 7.100189e-01s
Node Elemwise{Composite{(Switch(Identity(GE(i0, i1)), (i2 - log1p(sqr(i0))), i3) + i4)}}[(0, 0)](Elemwise{exp,no_inplace}.0, TensorConstant{0}, TensorConstant{-0.45158270338480055}, TensorConstant{-inf}, sigma_beta_log__) time 6.914997e-01s
Node Elemwise{Composite{inv(sqr(i0))}}(InplaceDimShuffle{x}.0) time 5.865004e-01s
Time in all call to theano.grad() 0.000000e+00s
Time since theano import 84.273s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
99.9% 99.9% 33.292s 1.45e-03s C Py 23000 23 theano.tensor.elemwise.Elemwise
0.1% 100.0% 0.025s 1.25e-05s C 2000 2 theano.tensor.subtensor.AdvancedSubtensor1
0.0% 100.0% 0.011s 2.75e-06s C 4000 4 theano.tensor.elemwise.Sum
0.0% 100.0% 0.001s 1.50e-06s C 1000 1 theano.tensor.opt.MakeVector
0.0% 100.0% 0.000s 1.66e-07s C 3000 3 theano.compile.ops.ViewOp
0.0% 100.0% 0.000s 0.00e+00s C 4000 4 theano.tensor.elemwise.DimShuffle
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
95.2% 95.2% 31.731s 1.06e-02s Py 3000 3 gammaln
4.6% 99.8% 1.524s 1.52e-03s C 1000 1 Elemwise{Composite{Switch(Identity((GE(i0, i1) * LE(i0, i2) * GE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1) * LE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i6))), (((i7 - i8) - i9) + Switch(EQ(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1), i10, (i11 * i0 * scalar_softplus((-Composite{(i0 + (i1 * i2))}(i3, i4, i5))))) + Switch(EQ((i12 - scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5))),
0.1% 99.8% 0.025s 1.25e-05s C 2000 2 AdvancedSubtensor1
0.0% 99.9% 0.013s 1.30e-05s C 1000 1 Elemwise{Composite{((i0 + i1) - i2)}}
0.0% 99.9% 0.011s 2.75e-06s C 4000 4 Sum{acc_dtype=float64}
0.0% 100.0% 0.010s 5.00e-06s C 2000 2 Elemwise{add,no_inplace}
0.0% 100.0% 0.006s 3.25e-06s C 2000 2 Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}
0.0% 100.0% 0.003s 3.00e-06s C 1000 1 Elemwise{Cast{int64}}
0.0% 100.0% 0.001s 1.50e-06s C 1000 1 MakeVector{dtype='float64'}
0.0% 100.0% 0.001s 5.00e-07s C 2000 2 Elemwise{Composite{Identity(GT(i0, i1))}}
0.0% 100.0% 0.001s 2.50e-07s C 2000 2 Elemwise{Composite{log((i0 * i1))}}
0.0% 100.0% 0.001s 5.01e-07s C 1000 1 Elemwise{Composite{(Switch(Identity(GE(i0, i1)), (i2 + (i3 * sqr(i0))), i4) + i5)}}[(0, 0)]
0.0% 100.0% 0.001s 5.00e-07s C 1000 1 Elemwise{Composite{(i0 * (i1 + (-sqr(i2))))}}
0.0% 100.0% 0.000s 2.50e-07s C 2000 2 Elemwise{exp,no_inplace}
0.0% 100.0% 0.000s 5.00e-07s C 1000 1 sigmoid
0.0% 100.0% 0.000s 5.00e-07s C 1000 1 Elemwise{Composite{(Switch(Identity((GE(i0, i1) * LE(i0, i2))), i1, i3) - ((i4 * scalar_softplus((-i5))) + i5))}}[(0, 0)]
0.0% 100.0% 0.000s 1.66e-07s C 3000 3 ViewOp
0.0% 100.0% 0.000s 4.99e-07s C 1000 1 Elemwise{Composite{(Switch(Identity(GE(i0, i1)), (i2 - log1p(sqr(i0))), i3) + i4)}}[(0, 0)]
0.0% 100.0% 0.000s 0.00e+00s C 4000 4 InplaceDimShuffle{x}
0.0% 100.0% 0.000s 0.00e+00s C 2000 2 Elemwise{Composite{inv(sqr(i0))}}
... (remaining 0 Ops account for 0.00%(0.00s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
31.8% 31.8% 10.607s 1.06e-02s 1000 11 gammaln(Elemwise{add,no_inplace}.0)
31.8% 63.6% 10.589s 1.06e-02s 1000 16 gammaln(Elemwise{add,no_inplace}.0)
31.6% 95.2% 10.535s 1.05e-02s 1000 15 gammaln(Elemwise{Composite{((i0 + i1) - i2)}}.0)
4.6% 99.8% 1.524s 1.52e-03s 1000 20 Elemwise{Composite{Switch(Identity((GE(i0, i1) * LE(i0, i2) * GE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1) * LE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i6))), (((i7 - i8) - i9) + Switch(EQ(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1), i10, (i11 * i0 * scalar_softplus((-Composite{(i0 + (i1 * i2))}(i3, i4, i5))))) + Switch(EQ((i12 - scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5))), i1), i10,
0.0% 99.8% 0.013s 1.35e-05s 1000 3 AdvancedSubtensor1(intercept, TensorConstant{[ 0 0 ..9 619 619]})
0.0% 99.9% 0.013s 1.30e-05s 1000 9 Elemwise{Composite{((i0 + i1) - i2)}}(TensorConstant{(1,) of 1}, <TensorType(int32, vector)>, Elemwise{Cast{int64}}.0)
0.0% 99.9% 0.012s 1.15e-05s 1000 2 AdvancedSubtensor1(b, TensorConstant{[ 0 0 ..9 619 619]})
0.0% 99.9% 0.009s 9.49e-06s 1000 27 Sum{acc_dtype=float64}(Elemwise{Composite{Switch(Identity((GE(i0, i1) * LE(i0, i2) * GE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1) * LE(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i6))), (((i7 - i8) - i9) + Switch(EQ(scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3, i4, i5)), i1), i10, (i11 * i0 * scalar_softplus((-Composite{(i0 + (i1 * i2))}(i3, i4, i5))))) + Switch(EQ((i12 - scalar_sigmoid(Composite{(i0 + (i1 * i2))}(i3
0.0% 99.9% 0.006s 6.00e-06s 1000 31 Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}(Elemwise{Composite{Identity(GT(i0, i1))}}.0, TensorConstant{(1,) of 0.5}, Elemwise{Composite{inv(sqr(i0))}}.0, intercept, InplaceDimShuffle{x}.0, Elemwise{Composite{log((i0 * i1))}}.0, TensorConstant{(1,) of -inf})
0.0% 100.0% 0.006s 6.00e-06s 1000 10 Elemwise{add,no_inplace}(TensorConstant{(1,) of 1}, Elemwise{Cast{int64}}.0)
0.0% 100.0% 0.004s 4.00e-06s 1000 1 Elemwise{add,no_inplace}(TensorConstant{(1,) of 1}, <TensorType(int32, vector)>)
0.0% 100.0% 0.003s 3.00e-06s 1000 0 Elemwise{Cast{int64}}(<TensorType(int32, vector)>)
0.0% 100.0% 0.001s 1.50e-06s 1000 35 MakeVector{dtype='float64'}(__logp_mu_alpha_interval__, __logp_sigma_alpha_log__, __logp_mu_beta, __logp_sigma_beta_log__, __logp_intercept, __logp_b, __logp_y_like)
0.0% 100.0% 0.001s 1.00e-06s 1000 33 Sum{acc_dtype=float64}(Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}.0)
0.0% 100.0% 0.001s 5.01e-07s 1000 29 Elemwise{Composite{log((i0 * i1))}}(TensorConstant{(1,) of 0...9154943092}, Elemwise{Composite{inv(sqr(i0))}}.0)
0.0% 100.0% 0.001s 5.01e-07s 1000 26 Elemwise{Composite{(Switch(Identity(GE(i0, i1)), (i2 + (i3 * sqr(i0))), i4) + i5)}}[(0, 0)](Elemwise{exp,no_inplace}.0, TensorConstant{0}, TensorConstant{2.076793740349318}, TensorConstant{-49.99999999999999}, TensorConstant{-inf}, sigma_alpha_log__)
0.0% 100.0% 0.001s 5.01e-07s 1000 22 Elemwise{Composite{Identity(GT(i0, i1))}}(InplaceDimShuffle{x}.0, TensorConstant{(1,) of 0})
0.0% 100.0% 0.001s 5.00e-07s 1000 8 Elemwise{Composite{(i0 * (i1 + (-sqr(i2))))}}(TensorConstant{0.5}, TensorConstant{-1.8378770664093453}, mu_beta)
0.0% 100.0% 0.000s 5.00e-07s 1000 34 Sum{acc_dtype=float64}(Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}.0)
0.0% 100.0% 0.000s 5.00e-07s 1000 32 Elemwise{Composite{(Switch(Identity((GE(i0, i1) * LE(i0, i2))), i1, i3) - ((i4 * scalar_softplus((-i5))) + i5))}}[(0, 0)](sigmoid.0, TensorConstant{0.0}, TensorConstant{1.0}, TensorConstant{-inf}, TensorConstant{2.0}, mu_alpha_interval__)
... (remaining 17 Apply instances account for 0.01%(0.00s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
- Try the Theano flag floatX=float32
We don't know if amdlibm will accelerate this scalar op. scalar_gammaln
We don't know if amdlibm will accelerate this scalar op. scalar_gammaln
We don't know if amdlibm will accelerate this scalar op. scalar_gammaln
- Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.
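One thing stands out in this profile: gammaln accounts for 95% of the time and is listed as a `Py` op in the Ops table, while everything else runs as compiled `C`. The three gammaln Apply nodes are the three log-gamma terms of the binomial coefficient inside the Binomial log-likelihood. A pure-Python sketch of that likelihood, using the stdlib `math.lgamma` as a stand-in for Theano's gammaln (illustrative only, not PyMC3's actual implementation):

```python
from math import lgamma, log

def binomial_logp(k, n, p):
    # log C(n, k) needs three log-gamma evaluations -- these match the
    # three gammaln Apply nodes that dominate the profile above.
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    # Remaining terms: k*log(p) + (n-k)*log(1-p)
    return log_choose + k * log(p) + (n - k) * log(1 - p)
```

Since this term is evaluated for every observation at every leapfrog step, a gammaln that falls back to Python instead of C gets multiplied across the whole run.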
I also replaced the indexing with a dot product to try to speed things up with the shared variables, but it didn't help.
Replaced this code:
p = pm.math.sigmoid(a[storeIDX] + (b[storeIDX] * Shift1Score_shared))
With this:
_, L = p.dmatrices('Score ~ -1+RelatedStoreID', data=audits, return_type='matrix')
L = np.asarray(L)
w = tt.dot(L, b)
# Expected value
p = pm.math.sigmoid(tt.dot(L, a) + tt.dot(Shift1Score_shared, w))
This didn't help either. Sampling starts out fast and reaches around 50/1000, but then slows to a crawl.
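For reference, the dot-product and indexing formulations agree when `L` is the one-hot store design matrix patsy builds from `'Score ~ -1+RelatedStoreID'`: multiplying a one-hot row by a parameter vector just selects that store's parameter, so `tt.dot(L, a)` reproduces `a[storeIDX]`. A pure-Python sketch:

```python
def one_hot_dot(L, v):
    # Matrix-vector product; each one-hot row selects one entry of v,
    # so this reproduces v[storeIDX] row by row.
    return [sum(l_ij * v_j for l_ij, v_j in zip(row, v)) for row in L]
```

Note, though, that `tt.dot(Shift1Score_shared, w)` between two vectors is an inner product that collapses to a single scalar, rather than the elementwise `w * Shift1Score` term the indexing version computes, which may be the dot-product bug mentioned in the final reply.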
I also tried setting theano.config.floatX = 'float32', but this didn't change anything.
If it is still during tuning, you do at times see slowdowns. As long as the trace is fine and there are no warnings, your result should be fine.
Alright, it turns out some of my math prevented the sampler from exploring the space: it was getting stuck and never returned. I fixed my dot products and it works well now.