# Sparse array glm fit in pymc3 very long with NUTS

I’m trying to sample from a one-hot encoded matrix that’s got a shape of roughly 225,000 x 1,000. Using the GLM function within pymc3 my goal is to extract the distributions for each of the coefficients from the matrix, while also incorporating gaussain priors for each of them as well. I’ve included a self-contained example below, with a much smaller dataset (20 x 20). I’ve included my PC’s stats below as well…

Is this simply a matter of needed more CPU and RAM (AWS Sagemaker GPU would be my next attempt) or is there something in the model instantiation that I should look into. Thanks.

PC Stats:
Dell Latitude 5280
Intel i7 Core 2.90 CPU quad core
16.0 GB RAM

Standalone example (takes ~2 min to run on my local):

``````    #PYMC3 STANDALONE EXAMPLE
import pymc3 as pm
from pymc3 import *
from sklearn import linear_model
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt

start = time.time()

### CREATE SAMPLE SPARSE MATRIX ###
X = np.zeros((20,20))

#add in flags on six random columns every row:
for i in range(len(X)):
for j in range(6):
ix = np.random.randint(0,20)
X[i,ix] = 1

y = np.random.randint(0,4,20)

# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, y)
print('--- SKLEARN IMPLEMENTATION ---')
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

#WE CAN SEE THE LINREG COEFS ABOVE, SO I'M GOING TO SET SOME RANDOM PRIORS OF MY OWN
my_priors = {}
for i in range(0,20):
ix = str(i)
val = np.random.rand()*3
my_priors[ix] = pm.Normal.dist(val,0.25)

my_labels = [str(x) for x in range(0,20)]

#with pymc3
py_lr = pm.glm.linear.GLM(X, y, intercept=True, labels=my_labels, priors=my_priors, vars=None, family='normal', name='', model=None, offset=0.0)

#RUN THE SAMPLING
with py_lr as model:
trace = pm.sample(500,cores=2) # draw 500 posterior samples using NUTS sampling

print('--- PYMC3 IMPLEMENTATION ---')
plt.figure(figsize=(7, 7))
traceplot(trace)
plt.tight_layout();

end = time.time()
print("total run took {} minutes to run".format((end-start)/60))
``````

You can try the sparse matrix from theano in this case - might be able to get some speed up:

``````from theano import sparse
X_sparse = sparse.csc_from_dense(X)

with pm.Model() as py_lr2:
intercept = pm.Normal('c', 0, 100)
betas = pm.Normal('betas', 0, 10, shape=(20, 1))
sd = pm.HalfNormal('sigma', 5.)
obs = pm.Normal('y', mu=sparse.basic.dot(X_sparse, betas) + intercept, sigma=sd, observed=y)
``````
1 Like

thank you junpenglao. I am incorporating this today will report back.

so one thing I should have actually clarified, I am running this through a scipy csr_matrix in my full implenetation of it (I realise that I used np in the dummy example). your suggestion worked great on the simple example, but on my full matrix (1000 x 200000) it throws a `MemoryError: None` almost immediately. I assume this is a scenario where I’d need to go to AWS etc?

I think it is also worth trying to split the dataset, or ADVI with minibatch - we dont have a lot of example dealing with large data set so I would love to learn about your experience and approach

so this is where I’ve gotten with it. please note that `matrix` is a scipy csc_matrix

``````### TRYING A SPARSE MATRIX IMPLEMENTATION
print('[LOGGER]: STARTING PYMC3')
start = time.time()

#shape of betas
n_betas = matrix.get_shape()
target_array = np.array(target_list)

with pm.Model() as py_lr2:
intercept = pm.Normal('c', 0, 100)
betas = pm.Normal('betas', 0, 10, shape=(n_betas, 1))
sd = pm.HalfNormal('sigma', 5.)
mu = theano.sparse.basic.dot(matrix, betas) + intercept
obs = pm.Normal('y', mu=mu, sigma=sd, observed=target_array)
trace = sample(500, cores=2,progressbar=True) # draw 500 posterior samples using NUTS

plt.figure(figsize=(7, 7))
traceplot(trace)
plt.tight_layout()
plt.show()

end = time.time()
print("[LOGGER]: FINISHED PYMC3...total run took {} minutes to run".format((end-start)/60))
``````

This throws the following traceback …

``````ERROR retrieving error_storage.Was the error set in the c code? [<class 'MemoryError'>, ((222713, 222713), dtype('float64')), None]
Traceback (most recent call last):
File "rpm_modeling.py", line 177, in <module>
obs = pm.Normal('y', mu=mu, sigma=sd, observed=target_array)
File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\distributions\distribution.py", line 83, in __new__
return model.Var(name, dist, data, total_size, dims=dims)
File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\model.py", line 1112, in Var
var = ObservedRV(
File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\model.py", line 1740, in __init__
self.logp_elemwiset = distribution.logp(data)
File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\distributions\continuous.py", line 537, in logp
return bound((-tau * (value - mu)**2 + tt.log(tau / np.pi / 2.)) / 2.,
File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\tensor\var.py", line 150, in __sub__
return theano.tensor.basic.sub(self, other)
File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\gof\op.py", line 674, in __call__
required = thunk()
File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\gof\op.py", line 862, in rval
thunk()
File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\gof\cc.py", line 1731, in __call__
exc_value = exc_type(_exc_value)
TypeError: __init__() missing 1 required positional argument: 'dtype'
``````

What I can’t tell is if this is memory related (ERROR retrieving error_storage) or something in my syntax related to the `dtype` argument. Any thoughts would be great, thanks.

Could you try casting `matrix` into theano? Seems there is a memory error when theano is trying to do that internally.

is there a decent way to do that without taking the csc matrix to a np dense array first? In looking here: http://deeplearning.net/software/theano/tutorial/sparse.html … I haven’t seen anything except just building it from the np dense array. Thanks!

I’m doing the following but am getting a dimension mismatch error now…

``````### TRYING A SPARSE MATRIX IMPLEMENTATION
#https://discourse.pymc.io/t/sparse-array-glm-fit-in-pymc3-very-long-with-nuts/6063/6
print('[LOGGER]: STARTING PYMC3')
start = time.time()

#shape of betas
n_betas = matrix.get_shape()
target_array = np.array(target_list)
target_array = target_array.astype(float)
print(target_array)

np_mtx = scipy.sparse.csc_matrix.toarray(matrix)
theano_mtx = theano.sparse.csc_from_dense(np_mtx)
print(theano_mtx)
del np_mtx

print('[LOGGER]: SHAPE OF BETAS:{}'.format(n_betas))
print('[LOGGER]: SHAPE OF MATRIX:{}'.format(theano.tensor.shape(theano_mtx)))
print('[LOGGER]: SHAPE OF TARGET:{}'.format(len(target_array)))

theano_mtx.tag.test_value = scipy.sparse.csc_matrix(target_array)

with pm.Model() as py_lr2:
intercept = pm.Normal('c', 0, 100)
betas = pm.Normal('betas', 0, 10, shape=(n_betas, 1))   #EVENTUALLY THIS WILL TAKE THE FORM OF THE PRIORS FROM THE META MODEL
sd = pm.HalfNormal('sigma', 5.)
mu = theano.sparse.basic.dot(theano_mtx, betas) + intercept
obs = pm.Normal('y', mu=mu, sigma=sd, observed=target_array)
trace = sample(500, cores=2,progressbar=True) # draw 500 posterior samples using NUTS sampling, initialize with advi

plt.figure(figsize=(7, 7))
traceplot(trace)
plt.tight_layout()
plt.show()

end = time.time()
print("[LOGGER]: FINISHED PYMC3...total run took {} minutes to run".format((end-start)/60))``````

I would give this a go on AWS. It seems to me like you’ve hit a limitation on your computer. Let me know how it goes.