Thanks Junpeng! So it seems that about 80% of the op time comes from these (results of `m_slope.profile(m_slope.logpt).summary()`):
```
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  32.1%   32.1%       0.603s       1.00e-04s     Py    6000       6   Solve{A_structure='lower_triangular', lower=False, overwrite_A=False, overwrite_b=False}
  20.2%   52.3%       0.380s       3.17e-05s     Py   12000      12   AdvancedSubtensor
  11.1%   63.4%       0.209s       3.49e-05s     Py    6000       6   AdvancedIncSubtensor{inplace=False, set_instead_of_inc=True}
   3.9%   67.3%       0.073s       1.22e-05s     Py    6000       6   ExtractDiag{offset=0, axis1=0, axis2=1, view=False}
   3.2%   70.5%       0.060s       1.20e-05s     C     5000       5   IncSubtensor{Inc;int64}
   3.0%   73.5%       0.056s       9.41e-06s     C     6000       6   IncSubtensor{InplaceInc;int64::}
   2.9%   76.4%       0.055s       9.13e-06s     C     6000       6   AdvancedIncSubtensor1{no_inplace,set}
   2.9%   79.3%       0.054s       9.01e-06s     C     6000       6   CumOp{None, add}
```
It looks like the `Solve`, `AdvancedSubtensor`, and `AdvancedIncSubtensor` ops are taking the most time, but I'm not sure how to improve that…
I see these two tips at the end of the output:
- Try the Theano flag floatX=float32
- Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.
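In case it's useful, here is how I understand those flags would be set (the script name `run_model.py` is just a placeholder for however the model is launched):

```shell
# Try running with float32 instead of the default float64
THEANO_FLAGS='floatX=float32' python run_model.py

# Or, if amdlibm is installed, enable it for Elemwise ops
THEANO_FLAGS='lib.amdlibm=True' python run_model.py
```

The same settings could also go in `~/.theanorc` under `[global]` and `[lib]` so they persist across runs.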
Do these help in your experience, or would they in this case?
A huge thank you again for your time 