Memory issues with creating simple regression model

Hi,

I’ve just started using PyMC3 and I’m trying to build a simple multivariate regression model, but my Jupyter kernel keeps dying or raising a memory error. I’m sure it’s something in the way I’ve designed the model, but I can’t work it out. My design matrix is large-ish, at (300000, 17), but not huge, and sklearn’s ridge regression fits it very quickly.

Here is my code:

import pymc3 as pm

basic_model = pm.Model()

with basic_model:

    alpha = pm.Normal('alpha', mu=0, sd=1)
    beta = pm.Normal('beta', mu=0, sd=1, shape=(len(data_X.columns), 1))
    sigma = pm.HalfNormal('sigma', sd=10)
    
    mu = alpha + pm.math.dot(data_X.values, beta)

    # Likelihood (sampling distribution) of observations
    Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=data_y.values)

If I run the above model, I immediately get a memory exception inside a Theano thunk call (see the stack trace at the end). If I limit the number of rows in my design matrix to ~10k, it doesn’t crash, but it quickly uses over 100 GB of RAM (and if I then call pm.sample it can barely manage 1-2 it/sec).

I am running on a large AWS box. I’m aware of the previous issue with Amazon boxes discussed here; however, I can run the random_walk_deep_net notebook on my AWS box with no problems and no leaks. Also, if I run the model on my local Windows box, I get a related memory error in the Theano code.

Any ideas super appreciated…

Thank you,

William

Environment:

  • PyMC3 Version: 3.6
  • Theano Version: 1.0.3
  • Python Version: 3.6.8
  • Operating system: AWS Linux

Stack trace below:


MemoryError Traceback (most recent call last)
in ()
15
16 # Likelihood (sampling distribution) of observations
---> 17 Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=data_y.values)

~/.conda/envs/my_root/lib/python3.6/site-packages/pymc3/distributions/distribution.py in __new__(cls, name, *args, **kwargs)
40 total_size = kwargs.pop('total_size', None)
41 dist = cls.dist(*args, **kwargs)
---> 42 return model.Var(name, dist, data, total_size)
43 else:
44 raise TypeError("Name needs to be a string but got: {}".format(name))

~/.conda/envs/my_root/lib/python3.6/site-packages/pymc3/model.py in Var(self, name, dist, data, total_size)
837 var = ObservedRV(name=name, data=data,
838 distribution=dist,
--> 839 total_size=total_size, model=self)
840 self.observed_RVs.append(var)
841 if var.missing_values:

~/.conda/envs/my_root/lib/python3.6/site-packages/pymc3/model.py in __init__(self, type, owner, index, name, data, distribution, total_size, model)
1322
1323 self.missing_values = data.missing_values
-> 1324 self.logp_elemwiset = distribution.logp(data)
1325 # The logp might need scaling in minibatches.
1326 # This is done in Factor.

~/.conda/envs/my_root/lib/python3.6/site-packages/pymc3/distributions/continuous.py in logp(self, value)
478 mu = self.mu
479
--> 480 return bound((-tau * (value - mu)**2 + tt.log(tau / np.pi / 2.)) / 2.,
481 sd > 0)
482

~/.conda/envs/my_root/lib/python3.6/site-packages/theano/tensor/var.py in __sub__(self, other)
145 # and the return value in that case
146 try:
--> 147 return theano.tensor.basic.sub(self, other)
148 except (NotImplementedError, AsTensorError):
149 return NotImplemented

~/.conda/envs/my_root/lib/python3.6/site-packages/theano/gof/op.py in __call__(self, *inputs, **kwargs)
672 thunk.outputs = [storage_map[v] for v in node.outputs]
673
--> 674 required = thunk()
675 assert not required # We provided all inputs
676

~/.conda/envs/my_root/lib/python3.6/site-packages/theano/gof/op.py in rval()
860
861 def rval():
--> 862 thunk()
863 for o in node.outputs:
864 compute_map[o][0] = True

~/.conda/envs/my_root/lib/python3.6/site-packages/theano/gof/cc.py in __call__(self)
1733 print(self.error_storage, file=sys.stderr)
1734 raise
-> 1735 reraise(exc_type, exc_value, exc_trace)
1736
1737

~/.conda/envs/my_root/lib/python3.6/site-packages/six.py in reraise(tp, value, tb)
691 if value.__traceback__ is not tb:
692 raise value.with_traceback(tb)
--> 693 raise value
694 finally:
695 value = None

I solved this. It was because my y variable was a 1-D NumPy array instead of a 2-D column vector.

Can you go into a little detail on this? You’re saying that reshaping y fixed the memory problem?

I’m still a PyMC3 newbie, so I can’t explain exactly why this fixed it, but yes, that’s what I’m saying. When data_y.values.shape was (300000,), this line would crash with a memory error:

Y_obs = pm.Normal('Y_obs', mu=mu, sd=sigma, observed=data_y.values)

and when I changed the shape to (300000, 1), it stopped crashing.
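
For anyone landing here later, this is a rough sketch of the shape fix using plain NumPy stand-ins rather than the real model. The array names (X, beta_col, y_flat and so on) and the small row count are made up for illustration; only the shapes mirror the model above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_cols = 1000, 17                 # small stand-in for the (300000, 17) design matrix

X = rng.normal(size=(n_rows, n_cols))
beta_col = rng.normal(size=(n_cols, 1))   # same shape as `beta` in the model
y_flat = rng.normal(size=n_rows)          # shape (n_rows,), like data_y.values

mu_col = X @ beta_col                     # shape (n_rows, 1)

# Option 1: turn y into a column vector so it matches mu.
y_col = y_flat.reshape(-1, 1)             # shape (n_rows, 1)
resid_col = y_col - mu_col                # shape (n_rows, 1)

# Option 2: keep y 1-D and make mu 1-D too (a 1-D beta gives a 1-D product).
beta_flat = beta_col.ravel()              # shape (n_cols,)
mu_flat = X @ beta_flat                   # shape (n_rows,)
resid_flat = y_flat - mu_flat             # shape (n_rows,)

print(resid_col.shape, resid_flat.shape)  # (1000, 1) (1000,)
```

In the model itself, option 1 would presumably be observed=data_y.values.reshape(-1, 1), and option 2 would be declaring beta with shape=len(data_X.columns) so that mu comes out 1-D. I have not run either against the original data.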


I think this is a broadcasting error: mu has shape (300000, 1), so when data_y.values.shape is (300000,), the difference between the observed values and mu inside the Y_obs log-probability is broadcast to (300000, 300000), which causes the memory error.
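
A quick way to see the broadcasting effect, sketched with NumPy on a tiny array (NumPy and Theano follow the same broadcasting rule here, but this is an illustration, not the actual Theano graph). The names n, y_flat and mu_col are placeholders.

```python
import numpy as np

n = 4                                     # tiny stand-in for 300000 rows
y_flat = np.arange(n, dtype=np.float64)   # shape (n,)
mu_col = np.zeros((n, 1))                 # shape (n, 1)

# (n,) against (n, 1) broadcasts to (n, n) instead of raising an error.
diff = y_flat - mu_col
print(diff.shape)                         # (4, 4)

# Size of one such float64 intermediate at full scale.
n_full = 300_000
bytes_needed = n_full * n_full * 8
print(bytes_needed / 1e9, "GB")           # 720.0 GB
```

So a single (300000, 300000) float64 intermediate already needs about 720 GB, which would explain the immediate MemoryError.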