Sparse array glm fit in pymc3 very long with NUTS

I’m trying to sample from a one-hot encoded matrix that’s got a shape of roughly 225,000 x 1,000. Using the GLM function within pymc3 my goal is to extract the distributions for each of the coefficients from the matrix, while also incorporating gaussain priors for each of them as well. I’ve included a self-contained example below, with a much smaller dataset (20 x 20). I’ve included my PC’s stats below as well…

Is this simply a matter of needed more CPU and RAM (AWS Sagemaker GPU would be my next attempt) or is there something in the model instantiation that I should look into. Thanks.

PC Stats:
Dell Latitude 5280
Intel i7 Core 2.90 CPU quad core
16.0 GB RAM

Standalone example (takes ~2 min to run on my local):

    #PYMC3 STANDALONE EXAMPLE
    import pymc3 as pm
    from pymc3 import *
    from sklearn import linear_model
    import pandas as pd
    import numpy as np
    import time
    from matplotlib import pyplot as plt

    start = time.time()

    ### CREATE SAMPLE SPARSE MATRIX ###
    X = np.zeros((20,20))

    #add in flags on six random columns every row:
    for i in range(len(X)):
    for j in range(6):
        ix = np.random.randint(0,20)
        X[i,ix] = 1
        
    y = np.random.randint(0,4,20)

    # with sklearn
    regr = linear_model.LinearRegression()
    regr.fit(X, y)
    print('--- SKLEARN IMPLEMENTATION ---')
    print('Intercept: \n', regr.intercept_)
    print('Coefficients: \n', regr.coef_)



    #WE CAN SEE THE LINREG COEFS ABOVE, SO I'M GOING TO SET SOME RANDOM PRIORS OF MY OWN
    my_priors = {}
    for i in range(0,20):
    ix = str(i)
    val = np.random.rand()*3
    my_priors[ix] = pm.Normal.dist(val,0.25)

    my_labels = [str(x) for x in range(0,20)]

    #with pymc3
    py_lr = pm.glm.linear.GLM(X, y, intercept=True, labels=my_labels, priors=my_priors, vars=None, family='normal', name='', model=None, offset=0.0)

    #RUN THE SAMPLING
    with py_lr as model:
    trace = pm.sample(500,cores=2) # draw 500 posterior samples using NUTS sampling


    print('--- PYMC3 IMPLEMENTATION ---')
    plt.figure(figsize=(7, 7))
    traceplot(trace)
    plt.tight_layout();

    end = time.time()
    print("total run took {} minutes to run".format((end-start)/60))

You can try the sparse matrix from theano in this case - might be able to get some speed up:

from theano import sparse
X_sparse = sparse.csc_from_dense(X)

with pm.Model() as py_lr2:
    intercept = pm.Normal('c', 0, 100)
    betas = pm.Normal('betas', 0, 10, shape=(20, 1))
    sd = pm.HalfNormal('sigma', 5.)
    obs = pm.Normal('y', mu=sparse.basic.dot(X_sparse, betas) + intercept, sigma=sd, observed=y)
1 Like

thank you junpenglao. I am incorporating this today will report back.

so one thing I should have actually clarified, I am running this through a scipy csr_matrix in my full implenetation of it (I realise that I used np in the dummy example). your suggestion worked great on the simple example, but on my full matrix (1000 x 200000) it throws a MemoryError: None almost immediately. I assume this is a scenario where I’d need to go to AWS etc?

I think it is also worth trying to split the dataset, or ADVI with minibatch - we dont have a lot of example dealing with large data set so I would love to learn about your experience and approach

so this is where I’ve gotten with it. please note that matrix is a scipy csc_matrix

### TRYING A SPARSE MATRIX IMPLEMENTATION
print('[LOGGER]: STARTING PYMC3')
start = time.time()

#shape of betas
n_betas = matrix.get_shape()[1]
target_array = np.array(target_list)

with pm.Model() as py_lr2:
    intercept = pm.Normal('c', 0, 100)
    betas = pm.Normal('betas', 0, 10, shape=(n_betas, 1))
    sd = pm.HalfNormal('sigma', 5.)
    mu = theano.sparse.basic.dot(matrix, betas) + intercept
    obs = pm.Normal('y', mu=mu, sigma=sd, observed=target_array)
    trace = sample(500, cores=2,progressbar=True) # draw 500 posterior samples using NUTS

plt.figure(figsize=(7, 7))
traceplot(trace)
plt.tight_layout()
plt.show()

end = time.time()
print("[LOGGER]: FINISHED PYMC3...total run took {} minutes to run".format((end-start)/60))

This throws the following traceback …

ERROR retrieving error_storage.Was the error set in the c code? [<class 'MemoryError'>, ((222713, 222713), dtype('float64')), None]
Traceback (most recent call last):
  File "rpm_modeling.py", line 177, in <module>
    obs = pm.Normal('y', mu=mu, sigma=sd, observed=target_array)
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\distributions\distribution.py", line 83, in __new__
    return model.Var(name, dist, data, total_size, dims=dims)
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\model.py", line 1112, in Var
    var = ObservedRV(
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\model.py", line 1740, in __init__
    self.logp_elemwiset = distribution.logp(data)
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\distributions\continuous.py", line 537, in logp
    return bound((-tau * (value - mu)**2 + tt.log(tau / np.pi / 2.)) / 2.,
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\tensor\var.py", line 150, in __sub__
    return theano.tensor.basic.sub(self, other)
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\gof\op.py", line 674, in __call__
    required = thunk()
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\gof\op.py", line 862, in rval
    thunk()
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\gof\cc.py", line 1731, in __call__
    exc_value = exc_type(_exc_value)
TypeError: __init__() missing 1 required positional argument: 'dtype'

What I can’t tell is if this is memory related (ERROR retrieving error_storage) or something in my syntax related to the dtype argument. Any thoughts would be great, thanks.

Could you try casting matrix into theano? Seems there is a memory error when theano is trying to do that internally.

is there a decent way to do that without taking the csc matrix to a np dense array first? In looking here: http://deeplearning.net/software/theano/tutorial/sparse.html … I haven’t seen anything except just building it from the np dense array. Thanks!

I’m doing the following but am getting a dimension mismatch error now…

### TRYING A SPARSE MATRIX IMPLEMENTATION
#https://discourse.pymc.io/t/sparse-array-glm-fit-in-pymc3-very-long-with-nuts/6063/6
print('[LOGGER]: STARTING PYMC3')
start = time.time()

#shape of betas
n_betas = matrix.get_shape()[1]
target_array = np.array(target_list)
target_array = target_array.astype(float)
print(target_array)

np_mtx = scipy.sparse.csc_matrix.toarray(matrix)
theano_mtx = theano.sparse.csc_from_dense(np_mtx)
print(theano_mtx)
del np_mtx

print('[LOGGER]: SHAPE OF BETAS:{}'.format(n_betas))
print('[LOGGER]: SHAPE OF MATRIX:{}'.format(theano.tensor.shape(theano_mtx)))
print('[LOGGER]: SHAPE OF TARGET:{}'.format(len(target_array)))

theano_mtx.tag.test_value = scipy.sparse.csc_matrix(target_array)

with pm.Model() as py_lr2:
    intercept = pm.Normal('c', 0, 100)
    betas = pm.Normal('betas', 0, 10, shape=(n_betas, 1))   #EVENTUALLY THIS WILL TAKE THE FORM OF THE PRIORS FROM THE META MODEL
    sd = pm.HalfNormal('sigma', 5.)
    mu = theano.sparse.basic.dot(theano_mtx, betas) + intercept
    obs = pm.Normal('y', mu=mu, sigma=sd, observed=target_array)
    trace = sample(500, cores=2,progressbar=True) # draw 500 posterior samples using NUTS sampling, initialize with advi

plt.figure(figsize=(7, 7))
traceplot(trace)
plt.tight_layout()
plt.show()

end = time.time()
print("[LOGGER]: FINISHED PYMC3...total run took {} minutes to run".format((end-start)/60))

I would give this a go on AWS. It seems to me like you’ve hit a limitation on your computer. Let me know how it goes.