Here’s some basic code that illustrates what I’m seeing. Couple notes: so scipy.stats.rvs creates c_contiguous arrays by default. However, I was (unbeknownst to be) accidentally getting f_contiguous arrays when I was doing pandas.read_csv(‘filename.csv’).as_matrix() (which I know is also deprecated). Also, looks like the degree of slowdown from f_contiguous is dependent on the size of the data.

As I mentioned, very much a novice to all this stuff, so this could be an artifact of a mistake I’m making elsewhere.

```
import pymc3 as pm
import numpy as np
import scipy.stats as stats
Y = stats.bernoulli.rvs(0.5,size=100000)
X = stats.norm.rvs(0.,1.,size=(100000,50))
mini_X = pm.Minibatch(X,batch_size=1000)
mini_Y = pm.Minibatch(Y.reshape(-1),batch_size=1000)
with pm.Model() as model:
intercept = pm.Normal('intercept',mu=0,sd=5.)
beta = pm.Normal('coeff', mu=0, sd=5.,shape=X.shape[1])
p = pm.math.sigmoid(pm.math.dot(mini_X, beta) + intercept)
likelihood = pm.Bernoulli('likelihood', p, observed=mini_Y,total_size=X.shape[0])
with model:
advi_approx = pm.fit(method='advi',n=1000)
```

Which produces output:

```
Average Loss = 1,976.1: 100%|██████████| 1000/1000 [00:00<00:00, 1106.92it/s]
Finished [100%]: Average Loss = 1,974.7
```

Compared with:

```
X_f = np.asfortranarray(X)
mini_X = pm.Minibatch(X_f,batch_size=1000)
mini_Y = pm.Minibatch(Y.reshape(-1),batch_size=1000)
with pm.Model() as fortran_model:
intercept = pm.Normal('intercept',mu=0,sd=5.)
beta = pm.Normal('coeff', mu=0, sd=5.,shape=X.shape[1])
p = pm.math.sigmoid(pm.math.dot(mini_X, beta) + intercept)
likelihood = pm.Bernoulli('likelihood', p, observed=mini_Y,total_size=X.shape[0])
with fortran_model:
advi_approx = pm.fit(method='advi',n=1000)
```

with output:

```
Average Loss = 1,970: 100%|██████████| 1000/1000 [00:15<00:00, 63.00it/s]
Finished [100%]: Average Loss = 1,968.4
```

When I profile the functions, looks like the fortran array spends most of its time in `theano.tensor.subtensor.AdvancedSubtensor1`

.

c_contiguous:

```
36.9% 36.9% 0.811s 5.00e-05s C 16222 2 theano.tensor.subtensor.AdvancedSubtensor1
33.0% 69.9% 0.726s 5.26e-06s C 137887 17 theano.tensor.elemwise.Elemwise
13.7% 83.6% 0.302s 1.86e-05s C 16222 2 theano.sandbox.rng_mrg.mrg_uniform
8.1% 91.7% 0.179s 1.10e-05s C 16222 2 theano.tensor.blas_c.CGemv
3.8% 95.6% 0.085s 5.22e-06s C 16222 2 theano.tensor.subtensor.IncSubtensor
1.4% 96.9% 0.030s 7.36e-07s C 40555 5 theano.tensor.elemwise.DimShuffle
1.0% 98.0% 0.023s 6.98e-07s C 32444 4 theano.tensor.elemwise.Sum
0.5% 98.5% 0.011s 2.72e-07s C 40555 5 theano.compile.ops.Shape_i
0.4% 98.9% 0.010s 1.18e-06s C 8111 1 theano.tensor.basic.Alloc
0.4% 99.3% 0.008s 4.92e-07s C 16222 2 theano.tensor.subtensor.Subtensor
0.3% 99.5% 0.006s 3.90e-07s C 16222 2 theano.tensor.basic.Reshape
0.3% 99.8% 0.006s 3.46e-07s C 16222 2 theano.tensor.opt.MakeVector
0.2% 100.0% 0.005s 5.58e-07s C 8111 1 theano.tensor.basic.AllocEmpty
```

f_contiguous:

```
98.4% 98.4% 11.819s 7.29e-03s C 1622 2 theano.tensor.subtensor.AdvancedSubtensor1
0.6% 99.1% 0.078s 5.64e-06s C 13787 17 theano.tensor.elemwise.Elemwise
0.3% 99.4% 0.041s 2.52e-05s C 1622 2 theano.tensor.blas_c.CGemv
0.3% 99.7% 0.031s 1.93e-05s C 1622 2 theano.sandbox.rng_mrg.mrg_uniform
0.1% 99.8% 0.017s 1.03e-05s C 1622 2 theano.tensor.subtensor.IncSubtensor
0.1% 99.9% 0.009s 2.17e-06s C 4055 5 theano.tensor.elemwise.DimShuffle
0.0% 99.9% 0.003s 8.48e-07s C 3244 4 theano.tensor.elemwise.Sum
0.0% 99.9% 0.002s 2.34e-06s C 811 1 theano.tensor.basic.Alloc
0.0% 100.0% 0.002s 4.63e-07s C 4055 5 theano.compile.ops.Shape_i
0.0% 100.0% 0.001s 7.76e-07s C 1622 2 theano.tensor.basic.Reshape
0.0% 100.0% 0.001s 7.68e-07s C 1622 2 theano.tensor.opt.MakeVector
0.0% 100.0% 0.001s 6.79e-07s C 1622 2 theano.tensor.subtensor.Subtensor
0.0% 100.0% 0.001s 9.12e-07s C 811 1 theano.tensor.basic.AllocEmpty
```

I’m seeing this behavior both on my mid-2015 Macbook Pro, and on an old server with a 2011 Xeon processor that I have access to.

pip freeze of the current environment:

```
appnope==0.1.0
backcall==0.1.0
bleach==2.1.3
certifi==2018.4.16
cycler==0.10.0
decorator==4.3.0
entrypoints==0.2.3
graphviz==0.8.4
h5py==2.8.0
html5lib==1.0.1
ipykernel==4.8.2
ipython==6.4.0
ipython-genutils==0.2.0
ipywidgets==7.2.1
jedi==0.12.0
Jinja2==2.10
joblib==0.11
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==5.2.0
jupyter-core==4.4.0
kiwisolver==1.0.1
MarkupSafe==1.0
matplotlib==2.2.2
mistune==0.8.3
mkl-fft==1.0.0
mkl-random==1.0.1
nbconvert==5.3.1
nbformat==4.4.0
notebook==5.5.0
numpy==1.14.3
pandas==0.23.0
pandocfilters==1.4.2
parso==0.2.1
patsy==0.5.0
pexpect==4.6.0
pickleshare==0.7.4
prompt-toolkit==1.0.15
ptyprocess==0.5.2
Pygments==2.2.0
pymc3==3.5rc1
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.4
pyzmq==17.0.0
qtconsole==4.3.1
scikit-learn==0.19.1
scipy==1.1.0
seaborn==0.8.1
Send2Trash==1.5.0
simplegeneric==0.8.1
six==1.11.0
statsmodels==0.9.0
terminado==0.8.1
testpath==0.3.1
Theano==1.0.1+2.gcd195ed
tornado==5.0.2
tqdm==4.23.4
traitlets==4.3.2
wcwidth==0.1.7
webencodings==0.5.1
widgetsnbextension==3.2.1
```

(As an aside, the way I found this was by playing around with down-sampling my data just to see how it affected compute time. I became very confused why the operation X[np.random.choice(range(X.shape[0]),X.shape[0],replace=False),:]) drastically affected how fast my code ran.)