Bayesian hierarchical linear regression (partial pooling) model using PyMC

Yes for

idata.prior_predictive["Y_obs"]

This gives your model's predictions using the given x and the priors alone (without determining the posteriors). I am not sure what the best representation is for comparing it to your observed data, though. I was really just imagining flattening idata.prior_predictive["Y_obs"], plotting its histogram next to a histogram of cleaned_data['Data_Value'], and comparing the two. The point of a prior predictive check is really just to see whether your prior-based predictions in idata.prior_predictive["Y_obs"] fall within a reasonable range; you don't need to compare them one by one to your observations. If all your observed data lie in, say, the range [-10, 10] and with your chosen priors the model quite often produces values like 10000, then you may, for instance, need to change the scales of the priors. The prior predictive does not need to reproduce your observed distribution faithfully at all: it may be biased, it may have large variance, etc. It just needs to be "reasonable" (and not blow up in more complicated problems).
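To make the "reasonable range" idea concrete, here is a rough sketch in plain NumPy. Everything in it is an assumption for illustration: the data are synthetic stand-ins for cleaned_data['Data_Value'] and your predictors, the priors are made up, and the draws are generated by hand rather than taken from idata.prior_predictive["Y_obs"]. With your real model you would flatten the PyMC draws and run the same kind of comparison.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the real data (not the thread's dataset).
n_obs, n_pred = 200, 4
X = rng.normal(size=(n_obs, n_pred))
y_obs = rng.normal(0.0, 3.0, size=n_obs)


def prior_predictive(X, prior_scale, n_draws=500):
    """Draw y from assumed priors only: y ~ Normal(a + X @ b, sigma)."""
    a = rng.normal(0.0, prior_scale, size=n_draws)                # intercepts
    b = rng.normal(0.0, prior_scale, size=(n_draws, X.shape[1]))  # slopes
    sigma = np.abs(rng.normal(0.0, prior_scale, size=n_draws))    # half-normal noise scale
    mu = a[:, None] + b @ X.T                                     # shape (n_draws, n_obs)
    return rng.normal(mu, sigma[:, None])


def fraction_far_outside(y_prior, y_obs, widen=10.0):
    """Share of prior draws lying more than `widen` observed ranges beyond the data."""
    span = y_obs.max() - y_obs.min()
    lower = y_obs.min() - widen * span
    upper = y_obs.max() + widen * span
    return float(np.mean((y_prior < lower) | (y_prior > upper)))


y_tight = prior_predictive(X, prior_scale=1.0)     # priors on the scale of the data
y_wide = prior_predictive(X, prior_scale=1000.0)   # priors that are far too wide

frac_tight = fraction_far_outside(y_tight, y_obs)
frac_wide = fraction_far_outside(y_wide, y_obs)
print(f"far outside the data range: tight={frac_tight:.3f}, wide={frac_wide:.3f}")

# Histogram counts on shared bins, i.e. the numbers behind two overlaid histograms.
bins = np.linspace(y_obs.min(), y_obs.max(), 21)
obs_counts, _ = np.histogram(y_obs, bins=bins)
prior_counts, _ = np.histogram(y_tight.ravel(), bins=bins)
```

If a large share of the draws lands far outside the observed range, as with the wide priors here, that is the signal to rescale the priors. The two count arrays are not expected to match; they only need to live on a comparable scale.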

If I now understand your model correctly from the equation, you have multiple locations (each data point coming from one location) and you are basically fitting a separate linear regression with several predictors for each. If that is the case, then you could do a separate prior predictive check for each location (using coordinates and having a working knowledge of xarray would really help here). You can look at idata.prior_predictive["Y_obs"] for each location separately and plot a histogram of the sampled values against the observed values. Or you can extract the a and b samples for each location, as you did, and plot collections of lines where, for each line, you fix a randomly selected three of the four predictors and vary the remaining one. Doing this separately for each location may help in the unlikely but possible case that data from different locations are given on different scales (though I suppose normalizing your data per location before sampling would be more sensible in that case).
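A rough sketch of both per-location checks, again in plain NumPy with invented data and invented partial-pooling priors (the names loc_idx, a, b and sigma are placeholders, not your model's actual variables). With the real idata you would select the rows of each location through xarray coordinates instead of the boolean masks used here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical layout: 3 locations, 4 predictors, one row per observation.
n_loc, n_pred, n_obs, n_draws = 3, 4, 150, 300
loc_idx = rng.integers(0, n_loc, size=n_obs)   # location of each row
X = rng.normal(size=(n_obs, n_pred))
y_obs = rng.normal(size=n_obs)

# Assumed partial-pooling priors: location intercepts share hyperparameters.
mu_a = rng.normal(0.0, 1.0, size=n_draws)
sigma_a = np.abs(rng.normal(0.0, 1.0, size=n_draws))
a = rng.normal(mu_a[:, None], sigma_a[:, None], size=(n_draws, n_loc))
b = rng.normal(0.0, 1.0, size=(n_draws, n_loc, n_pred))
sigma = np.abs(rng.normal(0.0, 1.0, size=n_draws))

# Prior predictive draws for every row, shape (n_draws, n_obs).
mu = a[:, loc_idx] + np.einsum("dop,op->do", b[:, loc_idx, :], X)
y_prior = rng.normal(mu, sigma[:, None])

# Check 1: prior predictive interval versus observed range, per location.
for j in range(n_loc):
    mask = loc_idx == j
    lo, hi = np.percentile(y_prior[:, mask], [2.5, 97.5])
    print(f"location {j}: prior 95% interval [{lo:.1f}, {hi:.1f}], "
          f"observed [{y_obs[mask].min():.1f}, {y_obs[mask].max():.1f}]")

# Check 2: prior regression lines for one location, varying predictor k
# while the other three stay fixed at a randomly selected observed row.
j, k = 0, 0
rows = np.flatnonzero(loc_idx == j)
x_fixed = X[rng.choice(rows)].copy()
grid = np.linspace(X[:, k].min(), X[:, k].max(), 50)
X_line = np.tile(x_fixed, (grid.size, 1))
X_line[:, k] = grid
lines = a[:, j, None] + b[:, j, :] @ X_line.T   # shape (n_draws, 50), one line per draw
```

Each row of lines can be plotted against grid, with the observed points for that location overlaid, to see whether the prior lines cover a plausible region.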

I mean, at the end of the day this is a relatively straightforward linear regression model, and I think the most important thing to check is whether, with your prior distributions, the model produces predictions in a reasonable range. I am not an expert, though, so more experienced people may come up with alternative suggestions.
