Hi there
At work we are fitting relatively simple GLMs to datasets of approximately 1 to 10 million observations.
The models themselves aren't overly complex: either simple linear regression or GLMs with a design matrix of roughly 1 to 5 columns.
I have included a reproducible example below. It uses the linear regression example from the PyMC3 Getting Started guide, but increases the data size to 2 million observations, which reproduces the behaviour we see with our own models.
When fitting the models with the default MCMC sampler, memory usage remains stable at about 10 GB during sampling (roughly an hour to sample 2,000 draws including warm-up).
But when sampling finishes there is a massive memory spike: usage jumps to 32 GB (the maximum on my local machine) for a couple of minutes, and sometimes the machine runs out of memory entirely.
We are even seeing AWS EC2 containers with 64 GB of RAM crash from lack of memory with only about 1 million observations.
My questions:

- What operation happens at the end of sampling that requires such a large memory spike for such a short period of time (e.g. collating the arrays from the multiple chains)? A rough size estimate is given after this list.
- Is there any way we can potentially avoid it? If not, it seems we will have to provision a memory-intensive machine whose maximum memory is only used for roughly 1% of the total compute time.
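For scale, here is the back-of-the-envelope arithmetic behind the first question, using the numbers from the example below (2 chains, 1,000 retained draws each, 2 million observations). The idea that some per-observation array might be built at the end is purely a guess on our part, not something we have confirmed PyMC3 does; the snippet just shows what kind of object could account for 32 GB:

# Rough size estimates, assuming float64 (8 bytes per value).
chains, draws, params, n_obs = 2, 1000, 3, 2_000_000

# Collated parameter arrays across all chains: tiny.
param_bytes = chains * draws * params * 8
print(param_bytes / 1e6, "MB")   # ~0.05 MB

# Any array with one value per draw *per observation* (e.g. a pointwise
# log-likelihood) would be huge, and roughly matches the 32 GB spike we see.
per_obs_bytes = chains * draws * n_obs * 8
print(per_obs_bytes / 1e9, "GB")  # ~32 GB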
Any advice much appreciated!
Reproducible example just in case (run time is about one hour though):
import numpy as np
import pymc3 as pm

print('Running on PyMC3 v{}'.format(pm.__version__))  # v3.9.2

# Simulated data: a simple linear relationship with Gaussian noise,
# scaled up to 2 million observations.
size = 2_000_000
true_intercept = 1
true_slope = 2

x = np.linspace(0, 1, size)
true_regression_line = true_intercept + true_slope * x
y = true_regression_line + np.random.normal(scale=0.5, size=size)
data = dict(x=x, y=y)

with pm.Model() as model:  # model specifications in PyMC3 are wrapped in a with-statement
    # Priors
    sigma = pm.HalfCauchy('sigma', beta=10, testval=1.0)
    intercept = pm.Normal('Intercept', 0, sigma=20)
    x_coeff = pm.Normal('x', 0, sigma=20)

    # Likelihood
    likelihood = pm.Normal('y', mu=intercept + x_coeff * x, sigma=sigma, observed=y)

    # 1,000 draws per chain (plus 1,000 tuning draws by default), 2 chains
    trace = pm.sample(1000, cores=2)
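In case it helps frame the second question, the kind of change we were thinking of trying is sketched below. The keyword arguments are just our guesses at knobs that might skip some post-sampling processing (both exist in pm.sample() as of v3.9.2, but we have not confirmed they are related to the spike), so please correct us if this is the wrong direction:

    # Sketch only, inside the same model context as above -- NOT a confirmed fix.
    trace = pm.sample(
        1000,
        cores=2,
        return_inferencedata=False,        # return a plain MultiTrace, skip ArviZ conversion
        compute_convergence_checks=False,  # skip r-hat / effective sample size checks
    )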