How to speed up AdvancedIncSubtensor

I’m profiling a large model to try to speed up sampling.

func = model.logp_dlogp_function(profile=True)
func.set_extra_values({})
x = np.random.randn(func.size)
for _ in range(1000):
    func(x)
print(func.profile.summary())

A segment of the output is below:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  72.9%    72.9%      14.346s       4.78e-03s     Py    3000       3   theano.tensor.subtensor.AdvancedIncSubtensor
   8.1%    81.1%       1.602s       1.60e-03s     C     1000       1   theano.tensor.nnet.nnet.Softmax
   7.0%    88.1%       1.383s       3.74e-05s     C    37000      37   theano.tensor.elemwise.Elemwise
   4.7%    92.8%       0.919s       9.19e-04s     C     1000       1   theano.tensor.subtensor.AdvancedIncSubtensor1
   4.0%    96.8%       0.790s       2.63e-04s     Py    3000       3   theano.tensor.subtensor.AdvancedSubtensor

Notice that super yucky 14.3-second entry (theano.tensor.subtensor.AdvancedIncSubtensor) taking up 73% of the computation time. It’s also a Py op.

It comes from a single line in my model; a pared-down version is below:

import numpy as np
import pymc3 as pm
from theano import shared

idx1 = shared(np.random.randint(0, 4, 20000))
idx2 = shared(np.random.randint(0, 4, 20000))
with pm.Model() as mod:
    a_multi_dim_param = pm.Normal('a_multi_dim_param', 0, 1, shape=(4, 4, 4))
    a_multi_dim_param[idx1, idx2]  # this indexing produces the AdvancedIncSubtensor
    # ....

Questions:
Why is this a Py op? Is there anything that can be done to speed it up? A reshape, maybe?


This is potentially promising: they flatten a multi-dimensional matrix and then reshape it back.
https://groups.google.com/forum/#!searchin/theano-users/Advanced$20index$20on$20GPU/theano-users/pukE6_tGnd8/-mnM2Q1NZ5cJ

Although I’m still not sure why AdvancedIncSubtensor is a py Op.

My guess would be that this is a slow fallback for indexing with more than one index array at the same time. If so, then reshaping and indexing with only one should help.
Something I’ve also done a few times to speed up indexing is to replace it with a sparse matrix multiplication. In some cases that seems to speed up the gradient quite a bit.
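For illustration, here is a minimal sketch of that trick (not from the thread; the shapes and names are made up): build a sparse one-hot selection matrix so that a structured dot product reproduces the fancy indexing.

import numpy as np
import scipy.sparse as sp
import theano
import theano.sparse
import theano.tensor as tt

# Hypothetical sizes: a (n_rows, n_cols) parameter indexed by an integer array idx.
n_rows, n_cols, n_obs = 4, 3, 6
idx = np.array([2, 0, 3, 3, 1, 0])

# One-hot selection matrix: row i of `select` picks row idx[i] of the parameter,
# so select.dot(a) gives the same result as a[idx].
select = sp.csr_matrix(
    (np.ones(n_obs, dtype=theano.config.floatX), (np.arange(n_obs), idx)),
    shape=(n_obs, n_rows),
)
select_shared = theano.shared(select)  # sparse shared variable

a = tt.matrix('a')
gathered = theano.sparse.structured_dot(select_shared, a)  # stands in for a[idx]

f = theano.function([a], gathered)
a_val = np.random.randn(n_rows, n_cols).astype(theano.config.floatX)
assert np.allclose(f(a_val), a_val[idx])

Whether this beats plain indexing depends on the model; the gain mentioned above was mainly in the gradient.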

1 Like

You’re correct that indexing with 2 arrays at once was the cause of the slowdown. I reshaped the first two dimensions into one, and it sped things up significantly.

Here is a helper function if anyone hits something similar:

import theano.tensor as tt

def partial_ravel_index(a, idx1, idx2):
    # Flatten the first two dimensions of `a`, then index with a single
    # ravelled index array so Theano can use the faster single-index ops.
    a_shape = tt.shape(a)
    a_reshaped = tt.reshape(a, (a_shape[0] * a_shape[1], a_shape[2]))
    return a_reshaped[idx1 * a_shape[1] + idx2]

Using this to index compiles down to an AdvancedIncSubtensor1, which is a C op (with GPU support) and much faster than AdvancedIncSubtensor.

Here’s a test of the function, for a complete example:

import numpy as np
import theano.tensor as tt

a = np.array(
    [[[1, 11],
      [2, 22],
      [3, 33],
      [32,332]],
     [[4, 44],
      [5, 55],
      [6, 66],
      [62, 662]],
     [[7, 77],
      [8, 88],
      [9, 99],
      [92, 992]]
     ])
a_shape = a.shape
idx1 = np.array([0, 0, 0, 1, 1, 1])
idx2 = np.array([2, 2, 2, 2, 2, 2])

assert np.all(a[idx1, idx2] == partial_ravel_index(tt.as_tensor_variable(a),
                                                   tt.as_tensor_variable(idx1),
                                                   tt.as_tensor_variable(idx2)).eval())

Nice, thanks for the code.
Just found this function in theano:
http://deeplearning.net/software/theano/library/tensor/extra_ops.html#theano.tensor.extra_ops.ravel_multi_index
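For the integer part of the index, that function computes the same flat offset the helper above builds by hand. A quick NumPy check, reusing the example index arrays (the leading shape (3, 4) comes from the test array above):

import numpy as np

idx1 = np.array([0, 0, 0, 1, 1, 1])
idx2 = np.array([2, 2, 2, 2, 2, 2])
lead_shape = (3, 4)  # shape of the first two axes of `a` in the test above

flat = np.ravel_multi_index((idx1, idx2), lead_shape)
assert np.array_equal(flat, idx1 * lead_shape[1] + idx2)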

Ah yeah, I tried that, but it threw an error when trying to use a slice(None) at the 3rd dimension (at least the NumPy version did).

I’m thinking of trying to get my model sampling on the GPU, but I still have some Py ops:

  32.5%    32.5%       3.062s       1.02e-03s     C     3000       3   theano.tensor.subtensor.AdvancedIncSubtensor1
  19.5%    52.0%       1.839s       5.11e-05s     C    36000      36   theano.tensor.elemwise.Elemwise
  19.3%    71.3%       1.819s       1.82e-03s     C     1000       1   theano.tensor.nnet.nnet.Softmax
  15.4%    86.7%       1.450s       1.45e-03s     Py    1000       1   theano.tensor.subtensor.AdvancedIncSubtensor
   3.8%    90.6%       0.362s       1.21e-04s     C     3000       3   theano.tensor.subtensor.AdvancedSubtensor1
   2.4%    92.9%       0.222s       2.22e-04s     C     1000       1   theano.tensor.nnet.nnet.SoftmaxGrad
   2.2%    95.2%       0.212s       1.93e-05s     C    11000      11   theano.tensor.elemwise.Sum
   1.7%    96.8%       0.158s       1.58e-04s     Py    1000       1   theano.tensor.subtensor.AdvancedSubtensor
   1.5%    98.3%       0.140s       7.01e-05s     C     2000       2   theano.tensor.elemwise.All
   0.7%    99.0%       0.064s       1.61e-05s     C     4000       4   theano.tensor.basic.Alloc
   0.4%    99.4%       0.034s       3.45e-05s     Py    1000       1   theano.tensor.basic.ARange
   0.3%    99.7%       0.031s       1.25e-06s     C    25000      25   theano.tensor.subtensor.IncSubtensor
   0.1%    99.8%       0.010s       8.61e-07s     C    12000      12   theano.tensor.elemwise.DimShuffle
   0.1%    99.9%       0.009s       6.47e-07s     C    14000      14   theano.tensor.basic.Reshape
   0.0%   100.0%       0.003s       5.72e-07s     C     6000       6   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.003s       5.03e-07s     C     5000       5   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.002s       7.59e-07s     C     2000       2   theano.compile.ops.Shape_i

Does this mean tensor data will have to be copied from the GPU to the CPU every time a Py op is encountered? Is it worth my time trying to remove these Py ops (some of which may be inside PyMC3 functions) to speed up GPU calculations?

I wouldn’t spend the time hunting down Py ops on the CPU, but rather hunting down CPU ops on the GPU. It is entirely possible to have a case where there is a GPU op but no C op, or a C op but no GPU op.
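One way to do that hunting (a sketch, not from the post; it assumes you can get at the underlying compiled theano.function) is to list the nodes whose op class name lacks the Gpu prefix:

def report_cpu_ops(theano_fn):
    """Print the ops in a compiled theano.function that did not end up on the GPU."""
    for node in theano_fn.maker.fgraph.toposort():
        name = type(node.op).__name__
        if not name.startswith('Gpu'):
            print(name)

Theano’s assert_no_cpu_op config flag (e.g. set to 'warn') should also report such ops at compilation time.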


Just reporting back on GPU vs CPU.
With my model (categorical regression + transforms) and 300,000 data points, I saw a small speedup on the GPU, but this was offset by the loss of multi-core sampling (Python 3 + GPU can only do a single core at present). CPU wins again.

Interesting, thanks for reporting the results. 🙂