Sparse array glm fit in pymc3 very long with NUTS

jezlax · October 21, 2020, 10:29pm

I’m trying to sample from a one-hot encoded matrix that’s got a shape of roughly 225,000 x 1,000. Using the GLM function within pymc3 my goal is to extract the distributions for each of the coefficients from the matrix, while also incorporating gaussain priors for each of them as well. I’ve included a self-contained example below, with a much smaller dataset (20 x 20). I’ve included my PC’s stats below as well…

Is this simply a matter of needed more CPU and RAM (AWS Sagemaker GPU would be my next attempt) or is there something in the model instantiation that I should look into. Thanks.

PC Stats:
Dell Latitude 5280
Intel i7 Core 2.90 CPU quad core
16.0 GB RAM

Standalone example (takes ~2 min to run on my local):

    #PYMC3 STANDALONE EXAMPLE
    import pymc3 as pm
    from pymc3 import *
    from sklearn import linear_model
    import pandas as pd
    import numpy as np
    import time
    from matplotlib import pyplot as plt

    start = time.time()

    ### CREATE SAMPLE SPARSE MATRIX ###
    X = np.zeros((20,20))

    #add in flags on six random columns every row:
    for i in range(len(X)):
    for j in range(6):
        ix = np.random.randint(0,20)
        X[i,ix] = 1
        
    y = np.random.randint(0,4,20)

    # with sklearn
    regr = linear_model.LinearRegression()
    regr.fit(X, y)
    print('--- SKLEARN IMPLEMENTATION ---')
    print('Intercept: \n', regr.intercept_)
    print('Coefficients: \n', regr.coef_)



    #WE CAN SEE THE LINREG COEFS ABOVE, SO I'M GOING TO SET SOME RANDOM PRIORS OF MY OWN
    my_priors = {}
    for i in range(0,20):
    ix = str(i)
    val = np.random.rand()*3
    my_priors[ix] = pm.Normal.dist(val,0.25)

    my_labels = [str(x) for x in range(0,20)]

    #with pymc3
    py_lr = pm.glm.linear.GLM(X, y, intercept=True, labels=my_labels, priors=my_priors, vars=None, family='normal', name='', model=None, offset=0.0)

    #RUN THE SAMPLING
    with py_lr as model:
    trace = pm.sample(500,cores=2) # draw 500 posterior samples using NUTS sampling


    print('--- PYMC3 IMPLEMENTATION ---')
    plt.figure(figsize=(7, 7))
    traceplot(trace)
    plt.tight_layout();

    end = time.time()
    print("total run took {} minutes to run".format((end-start)/60))

junpenglao · October 25, 2020, 8:48am

You can try the sparse matrix from theano in this case - might be able to get some speed up:

from theano import sparse
X_sparse = sparse.csc_from_dense(X)

with pm.Model() as py_lr2:
    intercept = pm.Normal('c', 0, 100)
    betas = pm.Normal('betas', 0, 10, shape=(20, 1))
    sd = pm.HalfNormal('sigma', 5.)
    obs = pm.Normal('y', mu=sparse.basic.dot(X_sparse, betas) + intercept, sigma=sd, observed=y)

jezlax · October 25, 2020, 8:45pm

thank you junpenglao. I am incorporating this today will report back.

jezlax · October 25, 2020, 10:56pm

so one thing I should have actually clarified, I am running this through a scipy csr_matrix in my full implenetation of it (I realise that I used np in the dummy example). your suggestion worked great on the simple example, but on my full matrix (1000 x 200000) it throws a MemoryError: None almost immediately. I assume this is a scenario where I’d need to go to AWS etc?

junpenglao · October 26, 2020, 7:06am

I think it is also worth trying to split the dataset, or ADVI with minibatch - we dont have a lot of example dealing with large data set so I would love to learn about your experience and approach

jezlax · October 27, 2020, 6:33am

so this is where I’ve gotten with it. please note that matrix is a scipy csc_matrix

### TRYING A SPARSE MATRIX IMPLEMENTATION
print('[LOGGER]: STARTING PYMC3')
start = time.time()

#shape of betas
n_betas = matrix.get_shape()[1]
target_array = np.array(target_list)

with pm.Model() as py_lr2:
    intercept = pm.Normal('c', 0, 100)
    betas = pm.Normal('betas', 0, 10, shape=(n_betas, 1))
    sd = pm.HalfNormal('sigma', 5.)
    mu = theano.sparse.basic.dot(matrix, betas) + intercept
    obs = pm.Normal('y', mu=mu, sigma=sd, observed=target_array)
    trace = sample(500, cores=2,progressbar=True) # draw 500 posterior samples using NUTS

plt.figure(figsize=(7, 7))
traceplot(trace)
plt.tight_layout()
plt.show()

end = time.time()
print("[LOGGER]: FINISHED PYMC3...total run took {} minutes to run".format((end-start)/60))

This throws the following traceback …

ERROR retrieving error_storage.Was the error set in the c code? [<class 'MemoryError'>, ((222713, 222713), dtype('float64')), None]
Traceback (most recent call last):
  File "rpm_modeling.py", line 177, in <module>
    obs = pm.Normal('y', mu=mu, sigma=sd, observed=target_array)
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\distributions\distribution.py", line 83, in __new__
    return model.Var(name, dist, data, total_size, dims=dims)
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\model.py", line 1112, in Var
    var = ObservedRV(
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\model.py", line 1740, in __init__
    self.logp_elemwiset = distribution.logp(data)
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\pymc3\distributions\continuous.py", line 537, in logp
    return bound((-tau * (value - mu)**2 + tt.log(tau / np.pi / 2.)) / 2.,
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\tensor\var.py", line 150, in __sub__
    return theano.tensor.basic.sub(self, other)
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\gof\op.py", line 674, in __call__
    required = thunk()
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\gof\op.py", line 862, in rval
    thunk()
  File "C:\Users\cory.jez\AppData\Local\Continuum\anaconda3\envs\rpm-env\lib\site-packages\theano\gof\cc.py", line 1731, in __call__
    exc_value = exc_type(_exc_value)
TypeError: __init__() missing 1 required positional argument: 'dtype'

What I can’t tell is if this is memory related (ERROR retrieving error_storage) or something in my syntax related to the dtype argument. Any thoughts would be great, thanks.

junpenglao · October 27, 2020, 7:46am

Could you try casting matrix into theano? Seems there is a memory error when theano is trying to do that internally.

jezlax · October 27, 2020, 4:49pm

is there a decent way to do that without taking the csc matrix to a np dense array first? In looking here: http://deeplearning.net/software/theano/tutorial/sparse.html … I haven’t seen anything except just building it from the np dense array. Thanks!

jezlax · October 27, 2020, 7:00pm

I’m doing the following but am getting a dimension mismatch error now…

### TRYING A SPARSE MATRIX IMPLEMENTATION
#https://discourse.pymc.io/t/sparse-array-glm-fit-in-pymc3-very-long-with-nuts/6063/6
print('[LOGGER]: STARTING PYMC3')
start = time.time()

#shape of betas
n_betas = matrix.get_shape()[1]
target_array = np.array(target_list)
target_array = target_array.astype(float)
print(target_array)

np_mtx = scipy.sparse.csc_matrix.toarray(matrix)
theano_mtx = theano.sparse.csc_from_dense(np_mtx)
print(theano_mtx)
del np_mtx

print('[LOGGER]: SHAPE OF BETAS:{}'.format(n_betas))
print('[LOGGER]: SHAPE OF MATRIX:{}'.format(theano.tensor.shape(theano_mtx)))
print('[LOGGER]: SHAPE OF TARGET:{}'.format(len(target_array)))

theano_mtx.tag.test_value = scipy.sparse.csc_matrix(target_array)

with pm.Model() as py_lr2:
    intercept = pm.Normal('c', 0, 100)
    betas = pm.Normal('betas', 0, 10, shape=(n_betas, 1))   #EVENTUALLY THIS WILL TAKE THE FORM OF THE PRIORS FROM THE META MODEL
    sd = pm.HalfNormal('sigma', 5.)
    mu = theano.sparse.basic.dot(theano_mtx, betas) + intercept
    obs = pm.Normal('y', mu=mu, sigma=sd, observed=target_array)
    trace = sample(500, cores=2,progressbar=True) # draw 500 posterior samples using NUTS sampling, initialize with advi

plt.figure(figsize=(7, 7))
traceplot(trace)
plt.tight_layout()
plt.show()

end = time.time()
print("[LOGGER]: FINISHED PYMC3...total run took {} minutes to run".format((end-start)/60))

springcoil · November 5, 2020, 9:28am

I would give this a go on AWS. It seems to me like you’ve hit a limitation on your computer. Let me know how it goes.

Topic		Replies	Views
Poor estimates from pymc examples on marginal and latent GP v5 gaussian_process , modeling	2	468	May 26, 2023
Sparse GP with linear covariate? Questions	2	597	August 3, 2021
Multivariate linear regression with different prior distributions for coefficients v3 modeling	0	532	May 7, 2022
The pymc3 way: Linear regression and inferring given posterior/trace Questions	4	1761	May 9, 2018
Trouble putting together a regression model in pymc3 Questions	4	683	June 10, 2019

Sparse array glm fit in pymc3 very long with NUTS

Related topics