Larger dataset linear regression GPU memory problems with NUTS for larger N and D (batch training?)

Hi all :slightly_smiling_face:

I’ve recently begun to dabble with pymc3 for some modelling problems and have implemented a basic linear regression model. Having experimented on my laptop and trying to scale up the datasets (N=20,000, D=2000), i’m encountering an OOM using NUTS sampling over the weights. I’m not so familiar with the implementation nor sampling methods in general and was wondering if there was some kind of batch equivalent (like with reparameterized VI) for the sampling methods (i.e. metropolis, NUTS). The error is as follows, and makes sense why it’s happening (my GPU has only 2GB…)

It seems like it’s trying to accomodate the entire dataset which is clearly infeasible. However, the ADVI initialisation stage appears to complete without any memory errors, but fails immediately after:

Finished [100%]: Average Loss = 23,867
ERROR (theano.gof.opt): Optimization failure due to: constant_folding
ERROR (theano.gof.opt): node: GpuFromHost(TensorConstant{[[ 0.02855…01668083]]})
ERROR (theano.gof.opt): TRACEBACK:

The final traceback looks like:

pygpu.gpuarray.GpuArrayException: b’cuMemAlloc: CUDA_ERROR_OUT_OF_MEMORY: out of memory’
Apply node that caused the error: GpuFromHost(TensorConstant{[[ 0.02855…01668083]]})
Toposort index: 2
Inputs types: [TensorType(float64, matrix)]
Inputs shapes: [(20000, 2000)]
Inputs strides: [(8, 160000)]
Inputs values: [‘not shown’]
Outputs clients: [[GpuGemv{inplace=True}(GpuAllocEmpty{dtype=‘float64’, context_name=None}.0, TensorConstant{1.0}, GpuFromHost.0, GpuSubtensor{int64:int64:}.0, TensorConstant{0.0}), GpuGemv{inplace=True}(GpuAllocEmpty{dtype=‘float64’, context_name=None}.0, TensorConstant{1.0},
GpuFromHost.0, GpuSubtensor{int64:int64:}.0, TensorConstant{0.0})]]

Could anyone suggest ways to mitigate this without having to buy bigger hardware :stuck_out_tongue: ? Is batch training applicable to NUTS or metropolis? The model is simple bayesian linear regression with learned weight and noise variances:

model = pm.Model()
with model:
    # Hyper-priors
    mu_w = pm.Normal('mu_w', mu=0., sd=100 ** 2)
    sigma_w = pm.HalfCauchy('sigma_w', 100)
    sigma_y = pm.HalfCauchy('sigma_y', beta=sigmaEps, testval=sigmaInit)
    # Priors
    w = pm.Normal('w', mu=mu_w, sd=sigma_w, shape=D)
    # Likelihood model
    y =, X_train.T)
    y_obs = pm.Normal('y_obs', mu=y, sd=sigma_y, observed=y_train.
    trace = pm.sample(draws=1000, init="auto", n_init=10000, njobs=n_jobs)

As an aside, running fullrank_advi appears incredibly slow compared to NUTS but would be wonderful if i could use it as an alternative with batch training (I’m primarily interested in acquiring a usable predictive posterior so advi isn’t really an option). Perhaps grouped VI ( or normalizing flows ( That said, I’ve probably only scratched the surface!

Hardware is a laptop i7 6700, 16GB ram, GTX 960M with 2GB ram

thanks all!

Data partitioning in MCMC needs some care to do it correctly. Currently out of the box in PyMC3 you can try SGFS (it is an experimental sampler so use with caution)

Alternatively, you can partition your data by hand and fit multiple smaller model with NUTS and then combine them post hoc. The official treatment is Expectation propagation (e.g.,, which you can find some STAN codes here: GitHub - gelman/ep-stan

A model like that should run just fine on a cpu with nuts, if it is slow I’d look into the parametrization a bit. Your prior for sigma_w is a bit strange by the way, are you sure want to set beta to 100? If so, maybe normalize the data and set it so something like 2.5 instead.
You can also try the development version (or the soon to be released rc). We made some changes to the initialisation of nuts which might help here.