How to implement Zero-Inflated Multinomial?

Thanks Junpeng! So it seems that about 80% of the op time comes from these (results of m_slope.profile(m_slope.logpt).summary()):

```
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  32.1%    32.1%       0.603s       1.00e-04s     Py    6000        6   Solve{A_structure='lower_triangular', lower=False, overwrite_A=False, overwrite_b=False}
  20.2%    52.3%       0.380s       3.17e-05s     Py    12000       12   AdvancedSubtensor
  11.1%    63.4%       0.209s       3.49e-05s     Py    6000        6   AdvancedIncSubtensor{inplace=False,  set_instead_of_inc=True}
   3.9%    67.3%       0.073s       1.22e-05s     Py    6000        6   ExtractDiag{offset=0, axis1=0, axis2=1, view=False}
   3.2%    70.5%       0.060s       1.20e-05s     C     5000        5   IncSubtensor{Inc;int64}
   3.0%    73.5%       0.056s       9.41e-06s     C     6000        6   IncSubtensor{InplaceInc;int64::}
   2.9%    76.4%       0.055s       9.13e-06s     C     6000        6   AdvancedIncSubtensor1{no_inplace,set}
   2.9%    79.3%       0.054s       9.01e-06s     C     6000        6   CumOp{None, add}
```

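For reference, here is roughly how I got that summary (just a sketch; the toy model below is a stand-in, the real m_slope is the zero-inflated multinomial model discussed above):

```python
import numpy as np
import pymc3 as pm

# Toy stand-in model only to make the snippet self-contained;
# the actual m_slope is the zero-inflated multinomial model from this thread
with pm.Model() as m_slope:
    mu = pm.Normal("mu", 0.0, 1.0)
    pm.Normal("y", mu, 1.0, observed=np.random.randn(100))

# Compile the log-posterior graph with profiling enabled, evaluate it
# 1000 times (the default) at the test point, then print the per-Op table
prof = m_slope.profile(m_slope.logpt)
prof.summary()
```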
It looks like the Solve (linear algebra), AdvancedSubtensor and AdvancedIncSubtensor op classes are taking most of the time, but I’m not sure how to improve that…
I see these two tips at the end of the output:

  - Try the Theano flag floatX=float32
  - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.
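If those are worth a shot, I assume I would set them through THEANO_FLAGS before Theano is imported, something like this (a sketch; it assumes amdlibm is actually installed, and I suppose the data would also need to be cast to float32 for the first flag to matter):

```python
import os

# Theano reads THEANO_FLAGS once, at import time, so this must run
# before `import theano` / `import pymc3`
os.environ["THEANO_FLAGS"] = "floatX=float32,lib.amdlibm=True"

import theano
import pymc3 as pm

print(theano.config.floatX)  # should now report 'float32'
```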

Have these helped in your experience, or would they help in this case?
A huge thank you again for your time :champagne: