Hi,
I’m using PyMC3 for a model that does the folling:
- The full dataset (timeseries), is a combination of multiple individual datasets that have a common trend
- Each dataset has its own offset to subtract and its own white noise term that I add in quadrature with the initial errors
- Then, inside the model, I want to take the (weighted) rolling median or rolling mean of the full residuals after subtracting the offsets and white noise terms:
- The rolling window is based on time, not on a number of points, so I use a boolean mask to keep only weights and values in a certain time window around a given point.
- For the rolling median, I need to use
tt.argsort
to sort values and calculate the weighted median (or at least I have not found another way so far), then I usepymc3.{theanof,aesaraf}.take_along_axis
to sort the 2d arrays along the desired axes.
Currently, sampling the model with a rolling mean in it takes about 20 minutes to run on 4 CPUs for 500 tuning steps and 500 draws while the rolling median just keeps adding more time to the estimated total (after 40 hours it still shows > 80 hours). I first used a for loop to iterate over all the points in my timeseries, then tried with tt.scan
, and then I tried to use matrices and perform the calculation along and axis.
My questions are:
- Is there a known better way to implement such models with PyMC3 ?
- According to the profiling output below, if I understand it correctly, ArgSort is the main bottleneck, so I was wondering if there are alternative mechanism in PyMC3 to avoid using argsort (or use it better).
I’m very much a beginner with PyMC3, sorry if this already has an answer somewhere. Let me know if there is other useful information I could share to help answer the question .
Thank you !
logpt profile output
[nav] In [225]: rolling_model_pm.profile(rolling_model_pm.logpt).summary()
Function profiling
==================
Message: /home/vandal/.pyenv/versions/v0cal/lib/python3.9/site-packages/pymc3/model.py:1254
Time in 1000 calls to Function.__call__: 4.207738e+01s
Time in Function.fn.__call__: 41.98405909538269s (99.778%)
Time in thunks: 41.927605390548706s (99.644%)
Total compile time: 4.491658e-01s
Number of Apply nodes: 52
Theano Optimizer time: 3.716309e-01s
Theano validate time: 3.323317e-03s
Theano Linker time (includes C, CUDA code generation/compiling): 0.036318063735961914s
Import time 0.000000e+00s
Node make_thunk time 3.473973e-02s
Node Elemwise{mul,no_inplace}(InplaceDimShuffle{x,0}.0, TensorConstant{[[1. 1. 0...1. 1. 1.]]}) time 9.595394e-03s
Node Elemwise{Composite{((((((((((i0 - (i1 * i2)) - (i3 * i4)) - (i5 * i6)) - (i7 * i8)) - (i9 * i10)) - (i11 * i12)) - (i13 * i14)) - (i15 * i16)) - (i17 * i18)) - (i19 * i20))}}(TensorConstant{[-5.323109...58023765]},
InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0. 1.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0. 0.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0. 0.]}, InplaceDimShuffle{x}.0, TensorConstant{
[0. 1. 0. .. 0. 1. 0.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0. 0.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0. 0.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0. 0.]}, InplaceDimShuf
fle{x}.0, TensorConstant{[1. 0. 1. .. 0. 0. 0.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0. 0.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 1. 0. 0.]}) time 1.569986e-03s
Node Elemwise{Composite{(i0 * (i1 + (i2 * sqr((i3 + i4)))))}}(TensorConstant{0.5}, TensorConstant{-9.496770537132663}, TensorConstant{-0.0004718..2148866378}, TensorConstant{0.07975531206725212}, gamma1) time 1.4977
46e-03s
Node InplaceDimShuffle{x}(gamma2) time 9.064674e-04s
Node Elemwise{Composite{((i0 * i1 * sqr((i2 - (i3 + (((i4 - i5) * (i6 - i3)) / (i7 - i5)))))) + i8)}}[(0, 2)](TensorConstant{(1,) of -1.0}, TensorConstant{[0.8051108...49118222]}, Elemwise{Composite{((((((((((i0 - (i1 *
i2)) - (i3 * i4)) - (i5 * i6)) - (i7 * i8)) - (i9 * i10)) - (i11 * i12)) - (i13 * i14)) - (i15 * i16)) - (i17 * i18)) - (i19 * i20))}}.0, Subtensor{int64}.0, TensorConstant{(1,) of 0.5}, Subtensor{int64}.0, Subtensor{int64}.0, Subt
ensor{int64}.0, TensorConstant{[-2.054652...54881717]}) time 7.774830e-04s
Time in all call to theano.grad() 2.246730e+01s
Time since theano import 36252.758s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
34.7% 34.7% 14.537s 1.45e-02s Py 1000 1 theano.tensor.sort.ArgSortOp
29.2% 63.8% 12.224s 3.06e-03s Py 4000 4 theano.tensor.subtensor.AdvancedSubtensor
16.3% 80.1% 6.827s 3.59e-04s C 19000 19 theano.tensor.elemwise.Elemwise
9.2% 89.3% 3.866s 3.87e-03s C 1000 1 theano.tensor.extra_ops.CumOp
4.2% 93.6% 1.773s 1.77e-03s Py 1000 1 theano.tensor.extra_ops.DiffOp
4.1% 97.7% 1.738s 5.79e-04s C 3000 3 theano.tensor.elemwise.Sum
2.2% 99.9% 0.929s 9.29e-04s C 1000 1 theano.tensor.basic.Argmax
0.0% 100.0% 0.016s 1.01e-06s C 16000 16 theano.tensor.elemwise.DimShuffle
0.0% 100.0% 0.013s 1.33e-05s C 1000 1 theano.tensor.basic.Join
0.0% 100.0% 0.003s 8.15e-07s C 4000 4 theano.tensor.subtensor.Subtensor
0.0% 100.0% 0.002s 1.66e-06s C 1000 1 theano.tensor.opt.MakeVector
... (remaining 0 Classes account for 0.00%(0.00s) of the runtime)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
34.7% 34.7% 14.537s 1.45e-02s Py 1000 1 ArgSortOp{quicksort, None}
29.2% 63.8% 12.224s 3.06e-03s Py 4000 4 AdvancedSubtensor
9.2% 73.0% 3.866s 3.87e-03s C 1000 1 CumOp{1, add}
4.2% 77.3% 1.773s 1.77e-03s Py 1000 1 DiffOp{n=1, axis=1}
4.2% 81.4% 1.742s 1.74e-03s C 1000 1 Elemwise{true_div,no_inplace}
4.1% 85.6% 1.736s 1.74e-03s C 1000 1 Sum{axis=[1], acc_dtype=float64}
3.9% 89.5% 1.655s 1.66e-03s C 1000 1 Elemwise{sub,no_inplace}
3.1% 92.6% 1.308s 1.31e-03s C 1000 1 Elemwise{Composite{sgn(((i0 + i1) - i2))}}[(0, 1)]
2.5% 95.2% 1.067s 1.07e-03s C 1000 1 Elemwise{mul,no_inplace}
2.4% 97.6% 1.026s 1.03e-03s C 1000 1 Elemwise{Composite{((i0 * i1) / i2)}}[(0, 1)]
2.2% 99.8% 0.929s 9.29e-04s C 1000 1 Argmax{axis=(1,)}
0.0% 99.9% 0.013s 1.33e-05s C 1000 1 Join
0.0% 99.9% 0.012s 1.19e-05s C 1000 1 Elemwise{Composite{((((((((((i0 - (i1 * i2)) - (i3 * i4)) - (i5 * i6)) - (i7 * i8)) - (i9 * i10)) - (i11 * i12)) - (i13 * i14)) - (i15 * i16)) - (i17 * i18)) -
(i19 * i20))}}
0.0% 99.9% 0.008s 7.99e-07s C 10000 10 InplaceDimShuffle{x}
0.0% 99.9% 0.008s 7.86e-07s C 10000 10 Elemwise{Composite{(i0 * (i1 + (i2 * sqr((i3 + i4)))))}}
0.0% 100.0% 0.005s 4.81e-06s C 1000 1 Elemwise{Composite{((i0 * i1 * sqr((i2 - (i3 + (((i4 - i5) * (i6 - i3)) / (i7 - i5)))))) + i8)}}[(0, 2)]
0.0% 100.0% 0.003s 3.49e-06s C 1000 1 Elemwise{add,no_inplace}
0.0% 100.0% 0.003s 8.15e-07s C 4000 4 Subtensor{int64}
0.0% 100.0% 0.003s 1.02e-06s C 3000 3 InplaceDimShuffle{1,0}
0.0% 100.0% 0.003s 1.49e-06s C 2000 2 InplaceDimShuffle{x,0}
... (remaining 4 Ops account for 0.02%(0.01s) of the runtime)
Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
34.7% 34.7% 14.537s 1.45e-02s 1000 23 ArgSortOp{quicksort, None}(Elemwise{mul,no_inplace}.0, TensorConstant{1})
14.6% 49.3% 6.140s 6.14e-03s 1000 25 AdvancedSubtensor(Elemwise{mul,no_inplace}.0, TensorConstant{[[ 0]
[..]
[1206]]}, ArgSortOp{quicksort, None}.0)
14.3% 63.6% 5.988s 5.99e-03s 1000 24 AdvancedSubtensor(TensorConstant{[[0.805110..49118222]]}, TensorConstant{[[ 0]
[..]
[1206]]}, ArgSortOp{quicksort, None}.0)
9.2% 72.8% 3.866s 3.87e-03s 1000 30 CumOp{1, add}(Elemwise{true_div,no_inplace}.0)
4.2% 77.0% 1.773s 1.77e-03s 1000 33 DiffOp{n=1, axis=1}(Elemwise{Composite{sgn(((i0 + i1) - i2))}}[(0, 1)].0)
4.2% 81.2% 1.742s 1.74e-03s 1000 28 Elemwise{true_div,no_inplace}(AdvancedSubtensor.0, InplaceDimShuffle{0,x}.0)
4.1% 85.3% 1.736s 1.74e-03s 1000 26 Sum{axis=[1], acc_dtype=float64}(AdvancedSubtensor.0)
3.9% 89.3% 1.655s 1.66e-03s 1000 31 Elemwise{sub,no_inplace}(CumOp{1, add}.0, Elemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0)
3.1% 92.4% 1.308s 1.31e-03s 1000 32 Elemwise{Composite{sgn(((i0 + i1) - i2))}}[(0, 1)](TensorConstant{(1, 1) of -0.5}, CumOp{1, add}.0, Elemwise{Composite{((i0 * i1) / i2)}}[(0, 1)].0)
2.5% 95.0% 1.067s 1.07e-03s 1000 22 Elemwise{mul,no_inplace}(InplaceDimShuffle{x,0}.0, TensorConstant{[[1. 1. 0...1. 1. 1.]]})
2.4% 97.4% 1.026s 1.03e-03s 1000 29 Elemwise{Composite{((i0 * i1) / i2)}}[(0, 1)](TensorConstant{(1, 1) of 0.5}, AdvancedSubtensor.0, InplaceDimShuffle{0,x}.0)
2.2% 99.6% 0.929s 9.29e-04s 1000 34 Argmax{axis=(1,)}(DiffOp{n=1, axis=1}.0)
0.1% 99.8% 0.061s 6.08e-05s 1000 40 AdvancedSubtensor(AdvancedSubtensor.0, TensorConstant{[[ 0]
[..]
[1206]]}, InplaceDimShuffle{1,0}.0)
0.1% 99.8% 0.035s 3.45e-05s 1000 39 AdvancedSubtensor(Elemwise{sub,no_inplace}.0, TensorConstant{[[ 0]
[..]
[1206]]}, InplaceDimShuffle{1,0}.0)
0.0% 99.9% 0.013s 1.33e-05s 1000 37 Join(TensorConstant{0}, InplaceDimShuffle{x,0}.0, Elemwise{add,no_inplace}.0)
0.0% 99.9% 0.012s 1.19e-05s 1000 20 Elemwise{Composite{((((((((((i0 - (i1 * i2)) - (i3 * i4)) - (i5 * i6)) - (i7 * i8)) - (i9 * i10)) - (i11 * i12)) - (i13 * i14)) - (i15 * i16)) - (i17 * i18)) - (i19 * i20
))}}(TensorConstant{[-5.323109...58023765]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0. 1.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0. 0.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 0. 0. .. 0. 0.
0.]}, InplaceDimShuffle{x}.0, TensorConstant{[0. 1.
0.0% 99.9% 0.005s 4.81e-06s 1000 47 Elemwise{Composite{((i0 * i1 * sqr((i2 - (i3 + (((i4 - i5) * (i6 - i3)) / (i7 - i5)))))) + i8)}}[(0, 2)](TensorConstant{(1,) of -1.0}, TensorConstant{[0.8051108...4911822
2]}, Elemwise{Composite{((((((((((i0 - (i1 * i2)) - (i3 * i4)) - (i5 * i6)) - (i7 * i8)) - (i9 * i10)) - (i11 * i12)) - (i13 * i14)) - (i15 * i16)) - (i17 * i18)) - (i19 * i20))}}.0, Subtensor{int64}.0, TensorConstant{(1,) of 0.5},
Subtensor{int64}.0, Subtensor{int64}.0, Subtensor{i
0.0% 99.9% 0.003s 3.49e-06s 1000 36 Elemwise{add,no_inplace}(TensorConstant{(1, 1) of 1}, InplaceDimShuffle{x,0}.0)
0.0% 99.9% 0.002s 2.15e-06s 1000 27 InplaceDimShuffle{0,x}(Sum{axis=[1], acc_dtype=float64}.0)
0.0% 99.9% 0.002s 2.10e-06s 1000 48 Sum{acc_dtype=float64}(Elemwise{Composite{((i0 * i1 * sqr((i2 - (i3 + (((i4 - i5) * (i6 - i3)) / (i7 - i5)))))) + i8)}}[(0, 2)].0)
... (remaining 32 Apply instances account for 0.07%(0.03s) of the runtime)
Here are tips to potentially make your code run faster
(if you think of new ones, suggest them on the mailing list).
Test them first, as they are not guaranteed to always provide a speedup.
- Try the Theano flag floatX=float32