Hello, all. After spending some time on this forum, I wanted to get my feet wet, so I tried a more advanced example that I found online (my data is attached below). But no matter what I've done, my NUTS sampler either never finishes or runs into some error, even on a fast machine.
Here is my code attempting a basic linear regression:

```python
import numpy as np
import pymc3 as pm

with pm.Model() as model:
    # Priors
    beta = pm.Uniform('beta', shape=(135, 1))
    intercept = pm.Normal('intercept', mu=0, sd=3)
    std = pm.HalfNormal('std', sd=3)

    # Likelihood
    price = intercept + pm.math.dot(train_norm_retained, beta)
    y_lik = pm.Normal('y_lik', mu=price, sd=std, observed=np.log(SalePrice.values))

    trace = pm.sample(cores=4)
```
My data originally consisted of 270 variables, and at first the model simply refused to run at all; the kernel just died. To make NUTS's job easier, I standardized all of my features, log-transformed my outcome variable, and used feature selection to keep the most important 50% of the variables, so now there are only 135. Still, the model runs into plenty of errors, either `Mass matrix contains zeros on the diagonal` or some other seemingly random error.
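To be concrete about the preprocessing, here is a minimal self-contained sketch on synthetic data (all names here are made up, and the correlation-based ranking just stands in for the feature-selection step; my actual pipeline differs in the details):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10)) * rng.uniform(0.5, 5.0, size=10)  # raw features on different scales
y = np.exp(0.3 * X[:, 0] + rng.normal(size=100))                 # positive, skewed outcome

# Standardize each feature column to mean 0, std 1
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Log-transform the outcome so the likelihood sees something closer to normal
log_y = np.log(y)

# Keep the 50% of features most (absolutely) correlated with the log outcome;
# corrcoef with rowvar=False appends log_y as the last variable, so the last
# row holds its correlation with every feature
corr = np.abs(np.corrcoef(X_std, log_y, rowvar=False)[-1, :-1])
keep = np.argsort(corr)[-X.shape[1] // 2:]
X_retained = X_std[:, keep]  # analogous to my train_norm_retained
```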
I also tried exoplanet's PyMC3 modifications, and that version gets a bit further with my model, but then fails with a LAPACK error saying the `__'th leading minor of the array is not positive definite`.
Is there anything else I could do to get this model to work, or is my data just that unwieldy? The latter would be fine by me, but I would at least like confirmation that that is indeed the case.
train_norm_retained.csv (1.3 MB)
SalePrice.csv (9.9 KB)