Big drop in sampling performance using long data table format

I’m observing a big drop in sampling performance (roughly 10×) when using long form data.

Original code (wide form data): ~150 it/s

import pymc3 as pm
import numpy as np

# A study on avatar perceived age. Participants have to rate the age of each avatar twice

# 16 avatars to rate
# 100 participants
# Each participant rates each avatar twice (exposure_time = 2)
num_avatar = 16
num_participant = 100
exposure_time = 2


age_obs_data = np.random.randint(0, 100, size=(num_avatar, num_participant, exposure_time)) # fake data

with pm.Model() as model1:
    # Age
    ## priors
    μ_avatars_age = pm.Uniform('μ_avatars_age', 0, 100, shape=(num_avatar, 1, 1))
    σ_avatars_age = pm.HalfCauchy('σ_avatars_age', 25, shape=(num_avatar, 1, 1))
    
    
    μ_participant_bias = pm.Normal('μ_participant_bias', 0, 10, shape=(1, num_participant, 1))
    
    
    obs_age = pm.Normal('obs_age', 
                        mu=μ_avatars_age + μ_participant_bias,
                        sd=σ_avatars_age,
                        shape=(num_avatar, num_participant, exposure_time),
                        observed=age_obs_data,
                       )
    

    trace = pm.sample(draws=2000, cores=1)
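For reference, the wide-form model leans on numpy-style broadcasting to combine the per-avatar and per-participant parameters; a quick numpy sketch of the shapes involved (the variable names are illustrative, not part of the model):

```python
import numpy as np

# Shapes mirror the wide-form model: per-avatar means are (16, 1, 1),
# per-participant biases are (1, 100, 1), observations are (16, 100, 2).
mu_avatars = np.zeros((16, 1, 1))
participant_bias = np.zeros((1, 100, 1))
obs = np.zeros((16, 100, 2))

mu = mu_avatars + participant_bias  # broadcasts to (16, 100, 1)
print(mu.shape)                     # (16, 100, 1)

# The likelihood then broadcasts mu against the (16, 100, 2) observations,
# so both exposures of a (avatar, participant) pair share one mean:
print(np.broadcast(mu, obs).shape)  # (16, 100, 2)
```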

Long form code: ~20 it/s

import pymc3 as pm
import numpy as np
import pandas as pd
import patsy

# A study on avatar perceived age. Participants have to rate the age of each avatar twice

# 16 avatars to rate
# 100 participants
# Each participant rates each avatar twice (exposure_time = 2)
num_avatar = 16
num_participant = 100
exposure_time = 2


participant = np.repeat(np.arange(num_participant), num_avatar*exposure_time)
# np.tile (not np.repeat) on the inner factor, so every participant is paired with every avatar
avatar = np.tile(np.repeat(np.arange(num_avatar), exposure_time), num_participant)
data = np.random.randint(0, 100, size=num_avatar*num_participant*exposure_time)  # fake data


df = pd.DataFrame({'participant': participant, 'avatar': avatar, 'data': data})
df.participant = df.participant.astype('category')
df.avatar = df.avatar.astype('category')

participant_dm = np.asarray(patsy.dmatrix('0 + participant', data=df))
avatar_dm = np.asarray(patsy.dmatrix('0 + avatar', data=df))

with pm.Model() as model2:
    # Age
    ## priors
    μ_avatars_age = pm.math.dot(avatar_dm, pm.Uniform('μ_avatars_age', 0, 100, shape=num_avatar))
    σ_avatars_age = pm.math.dot(avatar_dm, pm.HalfCauchy('σ_avatars_age', 25, shape=num_avatar))
    
    μ_participant = pm.math.dot(participant_dm, pm.Normal('μ_participant', mu=0, sd=5, shape=num_participant))
        
    obs_age = pm.Normal('obs_age',
                        mu=μ_avatars_age + μ_participant,
                        sd=σ_avatars_age,
                        observed=df.data.values,
                        shape=len(df.data.values)
                       )
    
    trace = pm.sample(draws=2000, cores=1)
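A side note on building the index arrays: a fully crossed design needs `np.tile` (not `np.repeat`) on the inner factor; with `np.repeat` for both, each participant is paired with only a block of avatars. A small self-contained check of the crossing:

```python
import numpy as np

num_avatar, num_participant, exposure = 16, 100, 2

participant = np.repeat(np.arange(num_participant), num_avatar * exposure)
avatar = np.tile(np.repeat(np.arange(num_avatar), exposure), num_participant)

# Encode each (participant, avatar) pair as a single integer and count them:
pair_id = participant * num_avatar + avatar
counts = np.bincount(pair_id, minlength=num_participant * num_avatar)

# Every participant rates every avatar exactly `exposure` times.
assert (counts == exposure).all()
```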

Is it normal or am I doing something wrong?

Interesting, it might be worth profiling the model to find out why: http://docs.pymc.io/notebooks/profiling.html

I have trouble interpreting the profiling information. What should I be looking for?

Fast model (Wide form data):

Function profiling
==================
  Message: C:\Users\alfoc.ULAVAL\AppData\Local\Continuum\miniconda3\envs\data-science\lib\site-packages\pymc3\model.py:909
  Time in 1000 calls to Function.__call__: 4.900503e-02s
  Time in Function.fn.__call__: 2.300286e-02s (46.940%)
  Time in thunks: 1.700163e-02s (34.694%)
  Total compile time: 5.300529e-01s
    Number of Apply nodes: 17
    Theano Optimizer time: 1.930192e-01s
       Theano validate time: 0.000000e+00s
    Theano Linker time (includes C, CUDA code generation/compiling): 3.600359e-02s
       Import time 1.200104e-02s
       Node make_thunk time 3.500342e-02s
           Node Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}(Elemwise{Composite{Cast{int8}(GT(i0, i1))}}.0, TensorConstant{(1, 1, 1) of 0.5}, Elemwise{Composite{inv(sqr(i0))}}[(0, 0)].0, TensorConstant{[[[11.  0... 0. 89.]]]}, Elemwise{add,no_inplace}.0, Elemwise{Composite{log((i0 * i1))}}.0, TensorConstant{(1, 1, 1) of -inf}) time 2.000213e-02s
           Node Elemwise{Composite{(Switch(Cast{int8}(GE(i0, i1)), (i2 - log1p(sqr((i3 * i0)))), i4) + i5)}}(σ_avatars_age, TensorConstant{(1, 1, 1) of 0}, TensorConstant{(1, 1, 1) ..8588394566}, TensorConstant{(1, 1, 1) of 0.04}, TensorConstant{(1, 1, 1) of -inf}, σ_avatars_age_log__) time 4.000425e-03s
           Node Elemwise{Composite{((i0 + Switch(Cast{int8}((GE(i1, i2) * LE(i1, i3))), i4, i5)) - ((i6 * scalar_softplus((-i7))) + i7))}}[(0, 1)](TensorConstant{(1, 1, 1) ..0185988092}, μ_avatars_age, TensorConstant{(1, 1, 1) of 0.0}, TensorConstant{(1, 1, 1) of 100.0}, TensorConstant{(1, 1, 1) ..0185988092}, TensorConstant{(1, 1, 1) of -inf}, TensorConstant{(1, 1, 1) of 2.0}, μ_avatars_age_interval__) time 4.000187e-03s
           Node Elemwise{Composite{inv(sqr(i0))}}[(0, 0)](σ_avatars_age) time 2.000093e-03s
           Node Elemwise{Mul}[(0, 1)](TensorConstant{0.5}, Sum{acc_dtype=float64}.0) time 1.000166e-03s

Time in all call to theano.grad() 1.612999e+00s
Time since theano import 1342.864s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  88.2%    88.2%       0.015s       1.36e-06s     C    11000      11   theano.tensor.elemwise.Elemwise
  11.8%   100.0%       0.002s       4.00e-07s     C     5000       5   theano.tensor.elemwise.Sum
   0.0%   100.0%       0.000s       0.00e+00s     C     1000       1   theano.tensor.opt.MakeVector
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  58.8%    58.8%       0.010s       1.00e-05s     C     1000        1   Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}
  11.8%    70.6%       0.002s       4.00e-07s     C     5000        5   Sum{acc_dtype=float64}
   5.9%    76.5%       0.001s       1.00e-06s     C     1000        1   Elemwise{exp,no_inplace}
   5.9%    82.4%       0.001s       1.00e-06s     C     1000        1   Elemwise{Composite{(Switch(Cast{int8}(GE(i0, i1)), (i2 - log1p(sqr((i3 * i0)))), i4) + i5)}}
   5.9%    88.2%       0.001s       1.00e-06s     C     1000        1   Elemwise{add,no_inplace}
   5.9%    94.1%       0.001s       1.00e-06s     C     1000        1   Elemwise{Composite{inv(sqr(i0))}}[(0, 0)]
   5.9%   100.0%       0.001s       1.00e-06s     C     1000        1   Elemwise{Composite{((i0 + Switch(Cast{int8}((GE(i1, i2) * LE(i1, i3))), i4, i5)) - ((i6 * scalar_softplus((-i7))) + i7))}}[(0, 1)]
   0.0%   100.0%       0.000s       0.00e+00s     C     1000        1   Elemwise{Composite{(i0 * scalar_sigmoid(i1))}}
   0.0%   100.0%       0.000s       0.00e+00s     C     1000        1   Elemwise{Composite{(i0 + (i1 * sqr(i2)))}}
   0.0%   100.0%       0.000s       0.00e+00s     C     1000        1   Elemwise{Composite{Cast{int8}(GT(i0, i1))}}
   0.0%   100.0%       0.000s       0.00e+00s     C     1000        1   Elemwise{Mul}[(0, 1)]
   0.0%   100.0%       0.000s       0.00e+00s     C     1000        1   Elemwise{Composite{log((i0 * i1))}}
   0.0%   100.0%       0.000s       0.00e+00s     C     1000        1   MakeVector{dtype='float64'}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  58.8%    58.8%       0.010s       1.00e-05s   1000    13   Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}(Elemwise{Composite{Cast{int8}(GT(i0, i1))}}.0, TensorConstant{(1, 1, 1) of 0.5}, Elemwise{Composite{inv(sqr(i0))}}[(0, 0)].0, TensorConstant{[[[11.  0... 0. 89.]]]}, Elemwise{add,no_inplace}.0, Elemwise{Composite{log((i0 * i1))}}.0, TensorConstant{(1, 1, 1) of -inf})
  11.8%    70.6%       0.002s       2.00e-06s   1000    14   Sum{acc_dtype=float64}(Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * sqr((i3 - i4)))) + i5)), i6)}}.0)
   5.9%    76.5%       0.001s       1.00e-06s   1000     8   Elemwise{Composite{((i0 + Switch(Cast{int8}((GE(i1, i2) * LE(i1, i3))), i4, i5)) - ((i6 * scalar_softplus((-i7))) + i7))}}[(0, 1)](TensorConstant{(1, 1, 1) ..0185988092}, μ_avatars_age, TensorConstant{(1, 1, 1) of 0.0}, TensorConstant{(1, 1, 1) of 100.0}, TensorConstant{(1, 1, 1) ..0185988092}, TensorConstant{(1, 1, 1) of -inf}, TensorConstant{(1, 1, 1) of 2.0}, μ_avatars_age_interval__)
   5.9%    82.4%       0.001s       1.00e-06s   1000     7   Elemwise{Composite{inv(sqr(i0))}}[(0, 0)](σ_avatars_age)
   5.9%    88.2%       0.001s       1.00e-06s   1000     5   Elemwise{add,no_inplace}(μ_avatars_age, μ_participant_bias)
   5.9%    94.1%       0.001s       1.00e-06s   1000     3   Elemwise{Composite{(Switch(Cast{int8}(GE(i0, i1)), (i2 - log1p(sqr((i3 * i0)))), i4) + i5)}}(σ_avatars_age, TensorConstant{(1, 1, 1) of 0}, TensorConstant{(1, 1, 1) ..8588394566}, TensorConstant{(1, 1, 1) of 0.04}, TensorConstant{(1, 1, 1) of -inf}, σ_avatars_age_log__)
   5.9%   100.0%       0.001s       1.00e-06s   1000     0   Elemwise{exp,no_inplace}(σ_avatars_age_log__)
   0.0%   100.0%       0.000s       0.00e+00s   1000    16   Sum{acc_dtype=float64}(MakeVector{dtype='float64'}.0)
   0.0%   100.0%       0.000s       0.00e+00s   1000    15   MakeVector{dtype='float64'}(__logp_μ_avatars_age_interval__, __logp_σ_avatars_age_log__, __logp_μ_participant_bias, __logp_obs_age)
   0.0%   100.0%       0.000s       0.00e+00s   1000    12   Sum{acc_dtype=float64}(Elemwise{Composite{((i0 + Switch(Cast{int8}((GE(i1, i2) * LE(i1, i3))), i4, i5)) - ((i6 * scalar_softplus((-i7))) + i7))}}[(0, 1)].0)
   0.0%   100.0%       0.000s       0.00e+00s   1000    11   Elemwise{Composite{log((i0 * i1))}}(TensorConstant{(1, 1, 1) ..4309189535}, Elemwise{Composite{inv(sqr(i0))}}[(0, 0)].0)
   0.0%   100.0%       0.000s       0.00e+00s   1000    10   Elemwise{Mul}[(0, 1)](TensorConstant{0.5}, Sum{acc_dtype=float64}.0)
   0.0%   100.0%       0.000s       0.00e+00s   1000     9   Sum{acc_dtype=float64}(Elemwise{Composite{(Switch(Cast{int8}(GE(i0, i1)), (i2 - log1p(sqr((i3 * i0)))), i4) + i5)}}.0)
   0.0%   100.0%       0.000s       0.00e+00s   1000     6   Sum{acc_dtype=float64}(Elemwise{Composite{(i0 + (i1 * sqr(i2)))}}.0)
   0.0%   100.0%       0.000s       0.00e+00s   1000     4   Elemwise{Composite{Cast{int8}(GT(i0, i1))}}(σ_avatars_age, TensorConstant{(1, 1, 1) of 0})
   0.0%   100.0%       0.000s       0.00e+00s   1000     2   Elemwise{Composite{(i0 + (i1 * sqr(i2)))}}(TensorConstant{(1, 1, 1) ..2891277546}, TensorConstant{(1, 1, 1) of -0.04}, μ_participant_bias)
   0.0%   100.0%       0.000s       0.00e+00s   1000     1   Elemwise{Composite{(i0 * scalar_sigmoid(i1))}}(TensorConstant{(1, 1, 1) of 100.0}, μ_avatars_age_interval__)
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  - Try the Theano flag floatX=float32
  - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.

Slow model (Long form data):

Function profiling
==================
  Message: C:\Users\alfoc.ULAVAL\AppData\Local\Continuum\miniconda3\envs\data-science\lib\site-packages\pymc3\model.py:909
  Time in 1000 calls to Function.__call__: 1.150117e-01s
  Time in Function.fn.__call__: 8.600879e-02s (74.783%)
  Time in thunks: 8.000898e-02s (69.566%)
  Total compile time: 4.070408e-01s
    Number of Apply nodes: 17
    Theano Optimizer time: 3.190322e-01s
       Theano validate time: 2.000093e-03s
    Theano Linker time (includes C, CUDA code generation/compiling): 4.100394e-02s
       Import time 6.000996e-03s
       Node make_thunk time 3.900385e-02s
           Node CGemv{inplace}(CGemv{no_inplace}.0, TensorConstant{-1.0}, TensorConstant{[[1. 0. 0...0. 0. 1.]]}, μ_participant, TensorConstant{1.0}) time 1.300120e-02s
           Node CGemv{no_inplace}(TensorConstant{[ 9. 11.  ... 87. 89.]}, TensorConstant{-1.0}, TensorConstant{[[1. 0. 0...0. 0. 1.]]}, μ_avatars_age, TensorConstant{1.0}) time 1.100111e-02s
           Node Elemwise{Composite{((i0 + Switch(Cast{int8}((GE(i1, i2) * LE(i1, i3))), i4, i5)) - ((i6 * scalar_softplus((-i7))) + i7))}}[(0, 1)](TensorConstant{(1,) of 4...0185988092}, μ_avatars_age, TensorConstant{(1,) of 0.0}, TensorConstant{(1,) of 100.0}, TensorConstant{(1,) of -4..0185988092}, TensorConstant{(1,) of -inf}, TensorConstant{(1,) of 2.0}, μ_avatars_age_interval__) time 4.000187e-03s
           Node Elemwise{Composite{Switch(Cast{int8}(GT(i0, i1)), (i2 * ((-(Composite{inv(sqr(i0))}(i0) * sqr(i3))) + log((i4 * Composite{inv(sqr(i0))}(i0))))), i5)}}[(0, 0)](CGemv{inplace}.0, TensorConstant{(1,) of 0}, TensorConstant{(1,) of 0.5}, CGemv{inplace}.0, TensorConstant{(1,) of 0...4309189535}, TensorConstant{(1,) of -inf}) time 3.000259e-03s
           Node Elemwise{Composite{(Switch(Cast{int8}(GE(i0, i1)), (i2 - log1p(sqr((i3 * i0)))), i4) + i5)}}[(0, 0)](σ_avatars_age, TensorConstant{(1,) of 0}, TensorConstant{(1,) of -3..8588394566}, TensorConstant{(1,) of 0.04}, TensorConstant{(1,) of -inf}, σ_avatars_age_log__) time 2.000332e-03s

Time in all call to theano.grad() 1.612999e+00s
Time since theano import 1361.139s
Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  72.5%    72.5%       0.058s       8.29e-06s     C     7000       7   theano.tensor.elemwise.Elemwise
  26.2%    98.8%       0.021s       7.00e-06s     C     3000       3   theano.tensor.blas_c.CGemv
   1.2%   100.0%       0.001s       2.00e-07s     C     5000       5   theano.tensor.elemwise.Sum
   0.0%   100.0%       0.000s       0.00e+00s     C     1000       1   theano.tensor.basic.AllocEmpty
   0.0%   100.0%       0.000s       0.00e+00s     C     1000       1   theano.tensor.opt.MakeVector
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  65.0%    65.0%       0.052s       5.20e-05s     C     1000        1   Elemwise{Composite{Switch(Cast{int8}(GT(i0, i1)), (i2 * ((-(Composite{inv(sqr(i0))}(i0) * sqr(i3))) + log((i4 * Composite{inv(sqr(i0))}(i0))))), i5)}}[(0, 0)]
  18.7%    83.7%       0.015s       7.50e-06s     C     2000        2   CGemv{inplace}
   7.5%    91.2%       0.006s       6.00e-06s     C     1000        1   CGemv{no_inplace}
   2.5%    93.7%       0.002s       2.00e-06s     C     1000        1   Elemwise{exp,no_inplace}
   1.3%    95.0%       0.001s       1.00e-06s     C     1000        1   Elemwise{Composite{(i0 * scalar_sigmoid(i1))}}
   1.3%    96.3%       0.001s       1.00e-06s     C     1000        1   Elemwise{Composite{(i0 + (i1 * sqr(i2)))}}
   1.3%    97.5%       0.001s       1.00e-06s     C     1000        1   Elemwise{Composite{((i0 + Switch(Cast{int8}((GE(i1, i2) * LE(i1, i3))), i4, i5)) - ((i6 * scalar_softplus((-i7))) + i7))}}[(0, 1)]
   1.3%    98.8%       0.001s       1.00e-06s     C     1000        1   Elemwise{Composite{(Switch(Cast{int8}(GE(i0, i1)), (i2 - log1p(sqr((i3 * i0)))), i4) + i5)}}[(0, 0)]
   1.2%   100.0%       0.001s       2.00e-07s     C     5000        5   Sum{acc_dtype=float64}
   0.0%   100.0%       0.000s       0.00e+00s     C     1000        1   AllocEmpty{dtype='float64'}
   0.0%   100.0%       0.000s       0.00e+00s     C     1000        1   Elemwise{Mul}[(0, 1)]
   0.0%   100.0%       0.000s       0.00e+00s     C     1000        1   MakeVector{dtype='float64'}
   ... (remaining 0 Ops account for   0.00%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  65.0%    65.0%       0.052s       5.20e-05s   1000    13   Elemwise{Composite{Switch(Cast{int8}(GT(i0, i1)), (i2 * ((-(Composite{inv(sqr(i0))}(i0) * sqr(i3))) + log((i4 * Composite{inv(sqr(i0))}(i0))))), i5)}}[(0, 0)](CGemv{inplace}.0, TensorConstant{(1,) of 0}, TensorConstant{(1,) of 0.5}, CGemv{inplace}.0, TensorConstant{(1,) of 0...4309189535}, TensorConstant{(1,) of -inf})
  12.5%    77.5%       0.010s       1.00e-05s   1000     9   CGemv{inplace}(CGemv{no_inplace}.0, TensorConstant{-1.0}, TensorConstant{[[1. 0. 0...0. 0. 1.]]}, μ_participant, TensorConstant{1.0})
   7.5%    85.0%       0.006s       6.00e-06s   1000     4   CGemv{no_inplace}(TensorConstant{[ 9. 11.  ... 87. 89.]}, TensorConstant{-1.0}, TensorConstant{[[1. 0. 0...0. 0. 1.]]}, μ_avatars_age, TensorConstant{1.0})
   6.2%    91.2%       0.005s       5.00e-06s   1000     5   CGemv{inplace}(AllocEmpty{dtype='float64'}.0, TensorConstant{1.0}, TensorConstant{[[1. 0. 0...0. 0. 1.]]}, σ_avatars_age, TensorConstant{0.0})
   2.5%    93.7%       0.002s       2.00e-06s   1000     1   Elemwise{exp,no_inplace}(σ_avatars_age_log__)
   1.3%    95.0%       0.001s       1.00e-06s   1000     0   Elemwise{Composite{(i0 * scalar_sigmoid(i1))}}(TensorConstant{(1,) of 100.0}, μ_avatars_age_interval__)
   1.3%    96.3%       0.001s       1.00e-06s   1000     8   Elemwise{Composite{(Switch(Cast{int8}(GE(i0, i1)), (i2 - log1p(sqr((i3 * i0)))), i4) + i5)}}[(0, 0)](σ_avatars_age, TensorConstant{(1,) of 0}, TensorConstant{(1,) of -3..8588394566}, TensorConstant{(1,) of 0.04}, TensorConstant{(1,) of -inf}, σ_avatars_age_log__)
   1.3%    97.5%       0.001s       1.00e-06s   1000     7   Elemwise{Composite{((i0 + Switch(Cast{int8}((GE(i1, i2) * LE(i1, i3))), i4, i5)) - ((i6 * scalar_softplus((-i7))) + i7))}}[(0, 1)](TensorConstant{(1,) of 4...0185988092}, μ_avatars_age, TensorConstant{(1,) of 0.0}, TensorConstant{(1,) of 100.0}, TensorConstant{(1,) of -4..0185988092}, TensorConstant{(1,) of -inf}, TensorConstant{(1,) of 2.0}, μ_avatars_age_interval__)
   1.3%    98.8%       0.001s       1.00e-06s   1000     3   Elemwise{Composite{(i0 + (i1 * sqr(i2)))}}(TensorConstant{(1,) of -5..2891277546}, TensorConstant{(1,) of -0.04}, μ_participant)
   1.2%   100.0%       0.001s       1.00e-06s   1000    14   Sum{acc_dtype=float64}(Elemwise{Composite{Switch(Cast{int8}(GT(i0, i1)), (i2 * ((-(Composite{inv(sqr(i0))}(i0) * sqr(i3))) + log((i4 * Composite{inv(sqr(i0))}(i0))))), i5)}}[(0, 0)].0)
   0.0%   100.0%       0.000s       0.00e+00s   1000    16   Sum{acc_dtype=float64}(MakeVector{dtype='float64'}.0)
   0.0%   100.0%       0.000s       0.00e+00s   1000    15   MakeVector{dtype='float64'}(__logp_μ_avatars_age_interval__, __logp_σ_avatars_age_log__, __logp_μ_participant, __logp_obs_age)
   0.0%   100.0%       0.000s       0.00e+00s   1000    12   Sum{acc_dtype=float64}(Elemwise{Composite{(Switch(Cast{int8}(GE(i0, i1)), (i2 - log1p(sqr((i3 * i0)))), i4) + i5)}}[(0, 0)].0)
   0.0%   100.0%       0.000s       0.00e+00s   1000    11   Sum{acc_dtype=float64}(Elemwise{Composite{((i0 + Switch(Cast{int8}((GE(i1, i2) * LE(i1, i3))), i4, i5)) - ((i6 * scalar_softplus((-i7))) + i7))}}[(0, 1)].0)
   0.0%   100.0%       0.000s       0.00e+00s   1000    10   Elemwise{Mul}[(0, 1)](TensorConstant{0.5}, Sum{acc_dtype=float64}.0)
   0.0%   100.0%       0.000s       0.00e+00s   1000     6   Sum{acc_dtype=float64}(Elemwise{Composite{(i0 + (i1 * sqr(i2)))}}.0)
   0.0%   100.0%       0.000s       0.00e+00s   1000     2   AllocEmpty{dtype='float64'}(TensorConstant{1600})
   ... (remaining 0 Apply instances account for 0.00%(0.00s) of the runtime)

Here are tips to potentially make your code run faster
                 (if you think of new ones, suggest them on the mailing list).
                 Test them first, as they are not guaranteed to always provide a speedup.
  - Try the Theano flag floatX=float32
  - Try installing amdlibm and set the Theano flag lib.amdlibm=True. This speeds up only some Elemwise operation.

I don’t understand your confusion. You added three dot products to the graph, and those simply require additional time. Or am I missing something here?

Even without the dot products it is still slower: I also tried using indexing instead.

My point is that both models represent the same problem: one uses long form data with a design matrix, the other wide form data as a 3D array via the shape parameter. As I understand it, working with long form data, i.e. one observation per line, is generally preferable, but if that incurs such a high cost in compute time, I find it a bit problematic.

I’m afraid a performance difference of roughly this size is to be expected (I’m seeing 200 it/s vs 50 it/s; this may depend a bit on the BLAS implementation and the CPU). If we look at the profiling information, we can see that about 70% of the time is spent doing matrix-vector multiplications:

# I'm not using `model2.profile` because that looks at the
# time it takes to compute the logp, but ignores the gradient.
# The `logp_dlogp_function` is what is actually used in NUTS.
func = model2.logp_dlogp_function(profile=True)
func.set_extra_values({})
x = np.random.randn(func.size)
for _ in range(100):
    func(x)

func.profile.summary_ops(N=10)
Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  43.4%    43.4%       0.013s       2.63e-05s     C      500        5   CGemv{inplace}
  25.3%    68.7%       0.008s       7.67e-05s     C      100        1   CGemv{no_inplace}
  10.1%    78.9%       0.003s       3.06e-05s     C      100        1   Elemwise{Composite{Switch(i0, (i1 * ((-(i2 * i3)) + log((i4 * i2)))), i5)}}[(0, 2)]
   3.6%    82.5%       0.001s       3.64e-06s     C      300        3   IncSubtensor{InplaceInc;int64:int64:}
   3.6%    86.0%       0.001s       1.08e-05s     C      100        1   Elemwise{Composite{Switch(i0, (i1 * (((i2 * Composite{inv(Composite{(sqr(i0) * i0)}(i0))}(i3)) / i4) - (i2 * i5 * Composite{inv(Composite{(sqr(i0) * i0)}(i0))}(i3)))), i6)}}[(0, 3)]
   2.5%    88.6%       0.001s       7.69e-06s     C      100        1   Elemwise{Composite{Switch(i0, (i1 * (i2 - (i3 + i4))), i5)}}
   2.1%    90.7%       0.001s       1.28e-06s     C      500        5   Sum{acc_dtype=float64}
   1.3%    91.9%       0.000s       7.67e-07s     C      500        5   AllocEmpty{dtype='float64'}
   1.2%    93.1%       0.000s       3.63e-06s     C      100        1   Elemwise{Composite{inv(sqr(i0))}}
   0.9%    94.0%       0.000s       2.70e-06s     C      100        1   Alloc
   ... (remaining 17 Ops account for   5.96%(0.00s) of the runtime)

(each row corresponds to an operation in the graph of the logp and logp gradient computation.)

If I can, I usually don’t use the long form in my models. But if you have to, you can use indexing, which should be at least a bit faster than the matrix multiplication. I’ve also played a bit with a sparse design matrix (Theano has some basic support for those), and it seems a small to medium speedup should be possible with that as well.
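To make those two suggestions concrete: with a one-hot design matrix, the dot product just selects one parameter per row, so fancy indexing (e.g. writing `μ_avatars_age[avatar_idx]` in the model instead of `pm.math.dot(avatar_dm, μ_avatars_age)`) computes the same values, as does a sparse matrix-vector product. A numpy/scipy sketch of the equivalence (variable names are illustrative):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_obs, n_avatar = 3200, 16
avatar_idx = rng.integers(0, n_avatar, size=n_obs)

# Dense one-hot design matrix, like patsy's '0 + avatar' coding:
dm = np.eye(n_avatar)[avatar_idx]
params = rng.normal(size=n_avatar)

dense = dm @ params                  # what model2 computes: a GEMV per evaluation
indexed = params[avatar_idx]         # fancy indexing: same values, no GEMV
sp = sparse.csr_matrix(dm) @ params  # sparse matvec: also the same values

assert np.allclose(dense, indexed)
assert np.allclose(dense, sp)
```

In the pymc3 model this would mean replacing each `pm.math.dot(design_matrix, θ)` with `θ[idx]`, where `idx` holds the integer category codes (e.g. `df.avatar.cat.codes.values`); Theano should then compile an indexing op instead of a GEMV per logp/gradient evaluation.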

Thanks for the insight

@aseyboldt Can you explain how to do this? Having the same problem (also solved by aggregating) but would like to know how to squeeze out performance if I need to.