I’m trying to sample from a one-hot encoded matrix with a shape of roughly 225,000 x 1,000. Using the GLM function in pymc3, my goal is to extract the posterior distribution of each coefficient while also incorporating a Gaussian prior on each of them. I’ve included a self-contained example below with a much smaller dataset (20 x 20), and my PC’s stats as well.
Is this simply a matter of needing more CPU and RAM (an AWS SageMaker GPU instance would be my next attempt), or is there something in the model instantiation that I should look into? Thanks.
PC Stats:
Dell Latitude 5280
Intel Core i7, 2.90 GHz, quad core
16.0 GB RAM
Standalone example (takes ~2 min to run on my local machine):
#PYMC3 STANDALONE EXAMPLE
import time

import numpy as np
import pymc3 as pm
from matplotlib import pyplot as plt
from sklearn import linear_model

start = time.time()

### CREATE SAMPLE SPARSE MATRIX ###
X = np.zeros((20, 20))
#add in flags on six random columns every row:
for i in range(len(X)):
    for j in range(6):
        ix = np.random.randint(0, 20)
        X[i, ix] = 1
y = np.random.randint(0, 4, 20)

# with sklearn
regr = linear_model.LinearRegression()
regr.fit(X, y)
print('--- SKLEARN IMPLEMENTATION ---')
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

#WE CAN SEE THE LINREG COEFS ABOVE, SO I'M GOING TO SET SOME RANDOM PRIORS OF MY OWN
my_priors = {}
for i in range(20):
    ix = str(i)
    val = np.random.rand() * 3
    my_priors[ix] = pm.Normal.dist(val, 0.25)
my_labels = [str(x) for x in range(20)]

#with pymc3
py_lr = pm.glm.linear.GLM(X, y, intercept=True, labels=my_labels,
                          priors=my_priors, family='normal')

#RUN THE SAMPLING
with py_lr as model:
    trace = pm.sample(500, cores=2)  # draw 500 posterior samples using NUTS sampling

print('--- PYMC3 IMPLEMENTATION ---')
plt.figure(figsize=(7, 7))
pm.traceplot(trace)
plt.tight_layout()

end = time.time()
print("total run took {} minutes to run".format((end - start) / 60))