Hello. I just started diving into PyMC3 after working with machine learning and wanting more flexible Bayesian models. To start, I wanted to run a basic regression on a housing dataset from Kaggle that I cleaned, but my model keeps giving me a dimension mismatch error no matter what I do.
This is my code:
    import pymc3 as pm

    with pm.Model() as model:
        beta = pm.Normal('beta', mu=0, sd=10000, shape=list(train_norm.columns))
        intercept = pm.Normal('intercept', mu=0, sd=10000)
        std = pm.HalfNormal('std', sd=100)
        price = intercept + beta*train_norm
        y_lik = pm.Normal('y_lik', mu=price, sd=std, observed=SalePrice)
        trace = pm.sample()
The resulting error is: “Input dimension mis-match. (input.shape = 1460, input.shape = 270)”.
I uploaded my input and output data, which can be loaded back in easily with
    import pandas as pd

    SalePrice = pd.read_csv('SalePrice.csv')
    train_norm = pd.read_csv('train_norm.csv')
    # Minor editing: drop the 'Unnamed: 0' column, then skip the first row
    train_norm.drop(axis=1, labels='Unnamed: 0', inplace=True)
    train_norm = train_norm.iloc[1:,:]
Could anyone please tell me what I am doing wrong?
train_norm.csv (1.9 MB)
SalePrice.csv (15.9 KB)
You need a dot product here:
So, I tried out your suggestion.
    price = intercept + np.dot(beta,train_norm)
This ran endlessly without any messages (I have a fairly decent CPU, so something else must have gone wrong).
Next, inspired by the suggestion and after reading the API quickstart more closely, I tried
    price = intercept + beta.dot(train_norm)
This actually ran (after first displaying a message about Theano), but ended up giving me a new error:
    shapes (270,) and (1460,270) not aligned: 270 (dim 0) != 1460 (dim 0)
I tried to fix that by setting the shape manually to shape=(1460,270) (I have 1460 rows and 270 variable columns in the input dataframe), which gave me this error instead:
    shapes (1460,270) and (1460,270) not aligned: 270 (dim 1) != 1460 (dim 0)
I am rather confused. I would greatly appreciate any further help.
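For anyone hitting the same alignment errors, here is a minimal NumPy sketch of the shape rule behind them. It uses random stand-in arrays with the shapes from this thread, not the real data, and the names X, beta_vec and beta_col are made up for illustration. I am assuming Theano's dot applies the same alignment rule as NumPy's, which is consistent with the error messages quoted above.

```python
import numpy as np

# Stand-in data with the shapes from this thread (values are random placeholders).
rng = np.random.default_rng(0)
X = rng.normal(size=(1460, 270))      # 1460 rows, 270 predictor columns
beta_vec = rng.normal(size=270)       # coefficients, shape (270,)
beta_col = beta_vec.reshape(270, 1)   # same coefficients, shape (270, 1)

# Coefficients first: dot pairs the 270 entries of beta with the
# 1460 rows of X, so the shapes are not aligned.
try:
    np.dot(beta_vec, X)
    coefficients_first_ok = True
except ValueError as err:
    coefficients_first_ok = False
    print("beta . X fails:", err)

# Data first: the 270 columns of X line up with the 270 coefficients.
mu_vec = np.dot(X, beta_vec)   # shape (1460,)
mu_col = np.dot(X, beta_col)   # shape (1460, 1)
print(mu_vec.shape, mu_col.shape)
```

The order of the arguments is what matters: the data matrix goes first and the coefficients second. Note also that the column version returns shape (1460, 1), so in NumPy an observed vector of shape (1460,) would broadcast against it rather than line up element by element.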
    beta = pm.Normal('beta', mu=0, sd=10000, shape=(270, 1))
    price = intercept + pm.math.dot(train_norm, beta)
This finally worked, thank you very much for your help! What had been eluding me was the two-number shape option and the need to use PyMC3's own math operations.
Now Python keeps crashing when I try to run the model. I figured it was because of the large number of variables in my data, and I have managed to run the model with 40 of the 270 variables (that being the limit). Do you have any advice, or resources you could point me toward, on running PyMC3 with large data? I have heard that NUTS can supposedly handle hundreds of variables.
Coincidentally, after further sleuthing, it was your help to another user with the same problem that gave me the solution.
So, unless you know how to fix the memory issues that are now preventing multi-core use, I'll be looking into how to fix them on OS X El Capitan.
I would not expect a memory error with an input matrix of this size. Could you try casting all pandas tables into NumPy arrays? Something like
    train_norm = train_norm.values
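A minimal sketch of that cast, assuming small made-up tables in place of the uploaded CSVs (the column names and values here are hypothetical, and X and y are just illustrative names):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature tables standing in for train_norm and SalePrice.
train_norm = pd.DataFrame({"area": [0.1, -0.3, 1.2], "rooms": [1.0, 0.0, -1.0]})
SalePrice = pd.DataFrame({"SalePrice": [208500, 181500, 223500]})

# Cast both tables to plain float arrays before handing them to the model.
X = train_norm.values.astype(float)   # shape (n_rows, n_columns)
y = SalePrice.values.astype(float)    # shape (n_rows, 1)

print(type(X).__name__, X.shape, y.shape)
```

Checking the shapes after the cast is a cheap way to confirm that the predictors and the observed values have the same number of rows before sampling.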