Making regressor array c-contiguous seems to dramatically improve ADVI minibatch speed

Hi,

I’m currently working on a simple logistic regression model, and I’m seeing some behavior that I find strange and would like to understand. I’m working with a fairly large dataset, ~600,000 datapoints with ~50 regressors. Given the size of the data, I’m playing around with ADVI and minibatches, both of which I’m very much a novice with.

When fitting on the full dataset, I was seeing fairly slow fitting speeds (~4 iter/s). However, while experimenting, I found that simply making my NumPy regressor array C-contiguous improved my fitting speed by ~100x, to ~400 iter/s. The resulting model looks correct on quick inspection. Have other people observed this behavior before? I wasn’t able to find anything about it in some quick searches. Is it possible there’s something strange about my computing environment or model?
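
For concreteness, the fix was essentially the one-liner below (a minimal sketch; X is my regressor array):

import numpy as np

# No-op if X is already C-contiguous; otherwise copies X
# into C (row-major) order
X = np.ascontiguousarray(X)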

I can provide code if this is not a trivial observation.

Thanks,
-Zach


Interesting. I have never thought about that.
Maybe @aseyboldt or @ferrine has some insight?

Just wow, could you please provide some code to play with?

A speedup of 100x is more than I would have expected. I’d love to play with an example as well 🙂
It is possible that Theano falls back to a Python implementation of an op for non-contiguous memory, or that it copies the dataset to make it contiguous. If the fast path uses BLAS on top of that, that might explain quite a large speedup.
You can have a look at the profiling info from Theano to see which op is responsible:

import numpy as np

# Compile the joint logp/dlogp function with Theano profiling enabled
func = model.logp_dlogp_function(profile=True)
func.set_extra_values({})

# Call it repeatedly at a random point so the profiler collects timings
x = np.random.randn(func.size)
for _ in range(100):
    func(x)
func.profile.summary()

This profiles the function used for NUTS, not ADVI, though, so the results might be a bit different.
@ferrine Can we do this for ADVI as well?

I think there is a way to get the score function for VI as well; then we can profile it similarly.

Yes, sure, here is some example code: https://github.com/pymc-devs/pymc3/blob/master/pymc3/tests/test_variational_inference.py#L689
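
For reference, the pattern in that test boils down to something like the following (a sketch, assuming model is your pm.Model; I believe run_profiling compiles the ADVI step function with Theano's profile=True and runs it repeatedly):

with model:
    inference = pm.ADVI()
    # Run the compiled step function n times and collect the Theano profile
    profile = inference.run_profiling(n=100)
profile.summary()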

Here’s some basic code that illustrates what I’m seeing. A couple of notes: scipy.stats rvs calls create C-contiguous arrays by default, but I was (unbeknownst to me) accidentally getting F-contiguous arrays from pandas.read_csv('filename.csv').as_matrix() (which I know is also deprecated); see the quick check below. Also, it looks like the degree of slowdown from F-contiguous arrays depends on the size of the data.
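
A quick way to check what pandas is handing back (filename.csv stands in for my actual data file):

import pandas as pd

X = pd.read_csv('filename.csv').as_matrix()  # or .values
# In my case this printed: False True
print(X.flags['C_CONTIGUOUS'], X.flags['F_CONTIGUOUS'])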

As I mentioned, I’m very much a novice at all this stuff, so this could be an artifact of a mistake I’m making elsewhere.

import numpy as np
import pymc3 as pm
import scipy.stats as stats

# Simulated data: rvs returns C-contiguous arrays by default
Y = stats.bernoulli.rvs(0.5, size=100000)
X = stats.norm.rvs(0., 1., size=(100000, 50))

mini_X = pm.Minibatch(X, batch_size=1000)
mini_Y = pm.Minibatch(Y.reshape(-1), batch_size=1000)

with pm.Model() as model:
    intercept = pm.Normal('intercept', mu=0, sd=5.)
    beta = pm.Normal('coeff', mu=0, sd=5., shape=X.shape[1])
    p = pm.math.sigmoid(pm.math.dot(mini_X, beta) + intercept)
    likelihood = pm.Bernoulli('likelihood', p, observed=mini_Y,
                              total_size=X.shape[0])

with model:
    advi_approx = pm.fit(method='advi', n=1000)

Which produces output:

Average Loss = 1,976.1: 100%|██████████| 1000/1000 [00:00<00:00, 1106.92it/s]
Finished [100%]: Average Loss = 1,974.7

Compared with:

# Same data, but Fortran-ordered
X_f = np.asfortranarray(X)

mini_X = pm.Minibatch(X_f, batch_size=1000)
mini_Y = pm.Minibatch(Y.reshape(-1), batch_size=1000)

with pm.Model() as fortran_model:
    intercept = pm.Normal('intercept', mu=0, sd=5.)
    beta = pm.Normal('coeff', mu=0, sd=5., shape=X.shape[1])
    p = pm.math.sigmoid(pm.math.dot(mini_X, beta) + intercept)
    likelihood = pm.Bernoulli('likelihood', p, observed=mini_Y,
                              total_size=X.shape[0])

with fortran_model:
    advi_approx = pm.fit(method='advi', n=1000)

with output:

Average Loss = 1,970: 100%|██████████| 1000/1000 [00:15<00:00, 63.00it/s] 
Finished [100%]: Average Loss = 1,968.4

When I profile the functions, it looks like the Fortran-ordered model spends most of its time in theano.tensor.subtensor.AdvancedSubtensor1.

c_contiguous:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  36.9%    36.9%       0.811s       5.00e-05s     C    16222       2   theano.tensor.subtensor.AdvancedSubtensor1
  33.0%    69.9%       0.726s       5.26e-06s     C   137887      17   theano.tensor.elemwise.Elemwise
  13.7%    83.6%       0.302s       1.86e-05s     C    16222       2   theano.sandbox.rng_mrg.mrg_uniform
   8.1%    91.7%       0.179s       1.10e-05s     C    16222       2   theano.tensor.blas_c.CGemv
   3.8%    95.6%       0.085s       5.22e-06s     C    16222       2   theano.tensor.subtensor.IncSubtensor
   1.4%    96.9%       0.030s       7.36e-07s     C    40555       5   theano.tensor.elemwise.DimShuffle
   1.0%    98.0%       0.023s       6.98e-07s     C    32444       4   theano.tensor.elemwise.Sum
   0.5%    98.5%       0.011s       2.72e-07s     C    40555       5   theano.compile.ops.Shape_i
   0.4%    98.9%       0.010s       1.18e-06s     C     8111       1   theano.tensor.basic.Alloc
   0.4%    99.3%       0.008s       4.92e-07s     C    16222       2   theano.tensor.subtensor.Subtensor
   0.3%    99.5%       0.006s       3.90e-07s     C    16222       2   theano.tensor.basic.Reshape
   0.3%    99.8%       0.006s       3.46e-07s     C    16222       2   theano.tensor.opt.MakeVector
   0.2%   100.0%       0.005s       5.58e-07s     C     8111       1   theano.tensor.basic.AllocEmpty

f_contiguous:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  98.4%    98.4%      11.819s       7.29e-03s     C     1622       2   theano.tensor.subtensor.AdvancedSubtensor1
   0.6%    99.1%       0.078s       5.64e-06s     C    13787      17   theano.tensor.elemwise.Elemwise
   0.3%    99.4%       0.041s       2.52e-05s     C     1622       2   theano.tensor.blas_c.CGemv
   0.3%    99.7%       0.031s       1.93e-05s     C     1622       2   theano.sandbox.rng_mrg.mrg_uniform
   0.1%    99.8%       0.017s       1.03e-05s     C     1622       2   theano.tensor.subtensor.IncSubtensor
   0.1%    99.9%       0.009s       2.17e-06s     C     4055       5   theano.tensor.elemwise.DimShuffle
   0.0%    99.9%       0.003s       8.48e-07s     C     3244       4   theano.tensor.elemwise.Sum
   0.0%    99.9%       0.002s       2.34e-06s     C      811       1   theano.tensor.basic.Alloc
   0.0%   100.0%       0.002s       4.63e-07s     C     4055       5   theano.compile.ops.Shape_i
   0.0%   100.0%       0.001s       7.76e-07s     C     1622       2   theano.tensor.basic.Reshape
   0.0%   100.0%       0.001s       7.68e-07s     C     1622       2   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.001s       6.79e-07s     C     1622       2   theano.tensor.subtensor.Subtensor
   0.0%   100.0%       0.001s       9.12e-07s     C      811       1   theano.tensor.basic.AllocEmpty
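
For what it’s worth, the effect reproduces with plain NumPy fancy indexing, which I believe is essentially what AdvancedSubtensor1 is doing when Minibatch gathers rows (a minimal sketch; exact timings will vary by machine):

import time
import numpy as np

X_c = np.random.randn(100000, 50)  # C-contiguous by default
X_f = np.asfortranarray(X_c)       # same values, Fortran order
idx = np.random.randint(0, 100000, size=1000)  # a "minibatch" of row indices

for name, arr in [('C order', X_c), ('F order', X_f)]:
    t0 = time.perf_counter()
    for _ in range(1000):
        arr[idx]  # row gather, analogous to AdvancedSubtensor1
    print(name, time.perf_counter() - t0)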

I’m seeing this behavior both on my mid-2015 MacBook Pro and on an old server with a 2011 Xeon processor that I have access to.

pip freeze of the current environment:

appnope==0.1.0
backcall==0.1.0
bleach==2.1.3
certifi==2018.4.16
cycler==0.10.0
decorator==4.3.0
entrypoints==0.2.3
graphviz==0.8.4
h5py==2.8.0
html5lib==1.0.1
ipykernel==4.8.2
ipython==6.4.0
ipython-genutils==0.2.0
ipywidgets==7.2.1
jedi==0.12.0
Jinja2==2.10
joblib==0.11
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==5.2.0
jupyter-core==4.4.0
kiwisolver==1.0.1
MarkupSafe==1.0
matplotlib==2.2.2
mistune==0.8.3
mkl-fft==1.0.0
mkl-random==1.0.1
nbconvert==5.3.1
nbformat==4.4.0
notebook==5.5.0
numpy==1.14.3
pandas==0.23.0
pandocfilters==1.4.2
parso==0.2.1
patsy==0.5.0
pexpect==4.6.0
pickleshare==0.7.4
prompt-toolkit==1.0.15
ptyprocess==0.5.2
Pygments==2.2.0
pymc3==3.5rc1
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.4
pyzmq==17.0.0
qtconsole==4.3.1
scikit-learn==0.19.1
scipy==1.1.0
seaborn==0.8.1
Send2Trash==1.5.0
simplegeneric==0.8.1
six==1.11.0
statsmodels==0.9.0
terminado==0.8.1
testpath==0.3.1
Theano==1.0.1+2.gcd195ed
tornado==5.0.2
tqdm==4.23.4
traitlets==4.3.2
wcwidth==0.1.7
webencodings==0.5.1
widgetsnbextension==3.2.1

(As an aside, the way I found this was by playing around with down-sampling my data just to see how it affected compute time. I became very confused about why the operation X[np.random.choice(range(X.shape[0]), X.shape[0], replace=False), :] drastically affected how fast my code ran.)
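
In hindsight, that shuffle explains itself: NumPy fancy indexing returns a fresh copy in C order regardless of the input layout (a minimal demonstration):

import numpy as np

X_f = np.asfortranarray(np.random.randn(1000, 50))
shuffled = X_f[np.random.permutation(X_f.shape[0]), :]

print(X_f.flags['C_CONTIGUOUS'])       # False
print(shuffled.flags['C_CONTIGUOUS'])  # True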