Bayesian regression: crashes and performance issues with large datasets and many parameters with different priors?

Hi all,

I am currently implementing a Bayesian regression model to identify 57 unknown parameters.
My datasets consists of over 500000 observations.

I order to identify the parameters using Bayesian regression with PyMC3, I created a design matrix A with over 500000 rows and 57 columns and an observation vector d with over 500000 rows.

At a first try, I implemented this using Matlab using conjugate priors and the results look promising (very similiar to the conventional least square solution, in fact).
Nonetheless, some of the identfied parameters are physically not feasible, as the conjugate assumtions rely simply on normal priors. Thus, I am starting with PyMC3 for this now.

As I have varying knowledge on the shape of some of the parameters, I placed different forms of priors on each of them, e.g. normal priors, negative bounded normal priors or inverse gamma priors.

I have the following issues:

  1. I hope that the model is correct, as I stacked the different priors using tt.stack.
    As the model is at least running, I hope that this is correct.

  2. As I have so many possible observations (taken from time series), I run in memory issues. Thus, I took only a very limited subset for now ( about 1000-10000 observation rows). Nonetheless, Matlab is able to process the whole dataset in only very few seconds with a ‘lasso’ model where the posterior is not analytically traceable, so I am wondering if PyMC3 can even handle such big datasets with so many parameters?

  3. The model repeatedly crashes after 305 or 405 samples, even when I change the number of stochastic parameters or the number of observations. Any idea on why it crashes often at the same number of points?

  4. When it crashes, the following error appears (:

ValueError: Mass matrix contains zeros on the diagonal. 
The derivative of RV `p1`.ravel()[0] is zero.
The derivative of RV `p2`.ravel()[0] is zero.
The derivative of RV `p3_log__`.ravel()[0] is zero.
The derivative of RV `p4_log__`.ravel()[0] is zero.

I also tried using the MAP as start and also tried the initialization with advi+adapt_diag.

  1. The sampling gets VERY slow after a few samples, even with only 1000 observations (about 1 sample in 1-5 sec). Is this normal with so many parameters? Or do you say that PyMC3 should be able to process so many parameters with so many observations on a regular laptop in reasonable time?

Unfortunately, I didn’t find any solution in any of the other posts yet to solve this.
In the following, I attached my (cleaned) code.

I’d be very grateful for any support on where my issue is! :pray:

data = sio.loadmat('measurements.mat')
A = data['A']
d= data['d']

with pm.Model() as model:

    NegNormal = pm.Bound(pm.Normal, lower=-np.inf, upper=0)
    p1 =  pm.Normal('', mu=1.0, sigma=1.0)
    p2 = NegNormal('', mu=-10.0, sigma=1.0)
    p3 = NegNormal('', mu=-10.0, sigma=1.0)

    p4= pm.Normal('', mu=1.0, sigma=1.0)
    p5= pm.Normal('', mu=1.0, sigma=1.0)
    p6= pm.InverseGamma('', alpha=3, beta=1)
    p7= pm.InverseGamma('', alpha=3, beta=1)
    p8= pm.InverseGamma('', alpha=3, beta=1)
    p9= pm.InverseGamma('', alpha=3, beta=1)
    p10= pm.InverseGamma('', alpha=3, beta=1)
    p11= NegNormal('', mu=-10.0, sigma=1.0)
    p12= NegNormal('', mu=-10.0, sigma=1.0)
    p13= pm.Normal('', mu=1.0, sigma=1.0)
    p14= pm.Normal('', mu=1.0, sigma=1.0)
    p15= pm.Normal('', mu=1.0, sigma=1.0)
    p16= pm.InverseGamma('', alpha=3, beta=1)
    p17= pm.InverseGamma('', alpha=3, beta=1)
    p18= pm.InverseGamma('', alpha=3, beta=1)
    p19= pm.InverseGamma('', alpha=3, beta=1)
    p20= pm.InverseGamma('', alpha=3, beta=1)
    p21= NegNormal('', mu=-10.0, sigma=1.0)
    p22= NegNormal('', mu=-10.0, sigma=1.0)

    p23= pm.Normal('', mu=1.0, sigma=1.0)
    p24= pm.Normal('', mu=1.0, sigma=1.0)
    p25= pm.Normal('', mu=1.0, sigma=1.0)
    p26= pm.InverseGamma('', alpha=3, beta=1)
    p27= pm.InverseGamma('', alpha=3, beta=1)
    p28= pm.InverseGamma('', alpha=3, beta=1)
    p29= pm.InverseGamma('', alpha=3, beta=1)
    p30= pm.InverseGamma('', alpha=3, beta=1)
    p31= pm.InverseGamma('', alpha=3, beta=1)
    p32= NegNormal('', mu=-10.0, sigma=1.0)
    p33= NegNormal('', mu=-10.0, sigma=1.0)

    p34= pm.Normal('', mu=1.0, sigma=1.0)
    p35= pm.Normal('', mu=1.0, sigma=1.0)
    p36= pm.Normal('', mu=1.0, sigma=1.0)
    p37= pm.InverseGamma('', alpha=3, beta=1)
    p38= pm.InverseGamma('', alpha=3, beta=1)
    p39= pm.InverseGamma('', alpha=3, beta=1)
    p40= pm.InverseGamma('', alpha=3, beta=1)
    p41= pm.InverseGamma('', alpha=3, beta=1)
    p42= pm.InverseGamma('', alpha=3, beta=1)
    p43= NegNormal('', mu=-10.0, sigma=1.0)
    p44= NegNormal('', mu=-10.0, sigma=1.0)

    p45= pm.Normal('', mu=1.0, sigma=1.0)
    p46= pm.Normal('', mu=1.0, sigma=1.0)
    p47= pm.Normal('', mu=1.0, sigma=1.0)
    p48= pm.InverseGamma('', alpha=3, beta=1)
    p49= pm.InverseGamma('', alpha=3, beta=1)
    p50= pm.InverseGamma('', alpha=3, beta=1)
    p51= pm.InverseGamma('', alpha=3, beta=1)
    p52= pm.InverseGamma('', alpha=3, beta=1)
    p53= pm.InverseGamma('', alpha=3, beta=1)
    p54= NegNormal('', mu=-10.0, sigma=1.0)
    p55= NegNormal('', mu=-10.0, sigma=1.0)
    p56= pm.Normal('', mu=1.0, sigma=1.0)
    p57= pm.Normal('', mu=1.0, sigma=1.0)

    beta = tt.stack([p1, p2, p3, \
                    p4, p5, p6, p7, p8, p9, p10, p11, p12, \
                    .... \
                   p56, p57 ])
    e = pm.InverseGamma('e', alpha=3, beta=1)
    d_est =, beta)
    likelihood = pm.Normal('d', mu=d_est, sigma=e, observed=d)
    trace = pm.sample(1000, init = 'adapt_diag', progressbar=True, cores=1, chains=1)

so I am wondering if PyMC3 can even handle such big datasets with so many parameters?

It appears you really only have 57 parameters here which isn’t that many. It’s not terribly uncommon to see models using many thousands of parameters to work well in PyMC3. The slowness could be coming from 1) the large number of data points, or 2) difficult posterior geometry resulting from data issues or from the priors you have. My recommendation, to start, is to replace your 57 distinct priors with a single vector normal prior with a vector of means and a vector of SDs that more or less capture the same prior beliefs. For example, you can replace a zero-mean normal bounded to the right of zero with an unbounded normal that places most of its probability mass in some interval between positive numbers. If the sampling is fast enough for you under this model, then it is likely that the issue arises from your prior specifications. If it is still too slow, then perhaps you should look into methods which are a little faster like ADVI.

1 Like

Thanks for the quick and very helpful reply! :slight_smile:
Indeed, using a single normal prior with shape=(57,1) increases the speed drastically. Nonetheless, I am still facing some issues:

  1. The speedup is not possible, when I use 57 distinct normal priors an stack them as I did above. I cannot see a reason why this is very different from the prior with shape=(57,1)?

  2. Depending on the choice of the initialization, the regression still fails with ValueError: Mass matrix contains zeros on the diagonal., independent on how many observations I use. When using the ‘advi+adapt_diag’ initialization, the regression works, but the code runs slower, but that loss of speed is acceptable. I’ll see if the results match the expectations an compare it to the ;atlab solution.

Do you have any recommendation how I can make use of more informative priors (IG or bounded normals)?

Thanks again for the valuable tipps! :slight_smile:

1 Like

You are right in the sense that they are mathematically equivalent, but they will have differing implementation when compiled by Theano.

The underlying Theano computational graph generally does better in handling small numbers of variables, even if the variables have very large dimensions. By using 57 distinct Theano variables (since each PyMC3 random variable is really just a Theano variable with some extra info / methods), there is extra overhead incurred. I’m not sure of the precise low-level reason for this.

This is interesting and maybe suggests some more nuanced issues with your model (e.g. nearly collinear predictors, data on a less-than-ideal measurement scale) which are harder to diagnose without more context. If you are able to post even part of your data in a gist we could take a look. Alternately, I would recommend running the model to completion using whatever initialization method works and checking the number of effective samples using pm.summary. The variables which have the lowest effective sample size often correspond to those variables that cause any underlying issues.

1 Like

I ran a model with fake data and encountered numerous issues too. With Normal likelihood the sampling also failed with error of zero derivative. Changing to a StudentT worked but took over 5 hours to run and didn’t converge.

This should be an ideal model since there is only 1 Normal variable for the coefficients with shape 57.

N, K = 500_000, 57
A = np.random.normal(0, 1, size=(N, K))
beta_real = np.random.normal(0, 1, K)
y =, beta_real)
y += np.random.normal(0, 1, size=N)

with pm.Model() as model:
    beta = pm.Normal('beta', 0, 2, shape=K)
    scale = pm.Exponential('scale', 1)
    loc =, beta)
    obs = pm.Normal('obs', mu=loc, sigma=scale, observed=y)
    trace = pm.sample(return_inferencedata=True)

with pm.Model() as model2:
    beta = pm.Normal('beta', 0, 2, shape=K)
    scale = pm.Exponential('scale', 1)
    nu = pm.Gamma('nu', 2.5, 0.1)
    loc =, beta)
    obs = pm.StudentT('obs', nu=nu, mu=loc, sigma=scale, observed=y)
    trace = pm.sample(return_inferencedata=True)

I’m not sure if it matters but it’s a good idea to give a non-blank name to each of the parameters.

1 Like

In addition, with 500,000 observations and only 57 coefficients any Bayesian approach should more or less collapse down to the MLE estimator. Using a simpler, faster package might be easier?

from scipy import optimize
def f(b):
    y_hat =, b)
    return np.power(y - y_hat, 2).sum()
res_scipy = optimize.minimize(f, x0=np.random.rand(57))

res_np, _, _, _ = np.linalg.lstsq(A, y, rcond=None)

res_scipy and res_np both match beta_real

You could ensure strictly positive or negative coefficients by using a np.exp within the minimised function

If you want to add some hierarchy later on then Bayesian would be the best way forward indeed.

1 Like

If you can use these vector normal priors plus an inverse gamma for sigma you can actually calculate the posterior analytically, see for some guidance and derivation for example. It will always be faster than any sampler. This approach won’t work however if you really need to have inversegamma or negative normal priors on the coefficients.

Also, I want to point out that with 5000000 observations, you may want to use idata_kwargs={"log_likelihood": False} in pm.sample to avoid computing and storing the pointwise log likelihood values, which you won’t need if you are not planning to perform model comparison using loo/waic.

Hi guys,

thank you so much for the quick and very detailed responses! :pray: :slight_smile:

Thanks so much for testing! (5 hours, wow, thanks a lot!)

This is my mistake, I deleted the variable names, but forgot to replace them :man_facepalming:

Yes, you might be right (I gotta dive deeper into the theory). The 500,000 observations are timesteps from a time series measurement. If possible, I’d like to reduce the time (and hereby the number of observations points as much as possible).

That’s a really helpful hint! :slight_smile:

To the best of my knowledge, the inverse gamma sigma with normal priors are implemented in the Matlab Bayesian Regression functions (in the ‘conjugate’ case). That’s probably why Matlab is super fast in this case. Nonetheless, also the non-analytically-posteriors are quite fast… But as PyMC3 gives me much more flexibility in the future, I’d like to stick with that :wink:

This is also very helpful, thank you a lot! I’ll take a look!

Thank you so much guys for taking the time to support me here!
As I am pretty much a Bayesian beginner, this makes the entry to the Bayesian world a lot easier.