Following up on this question, Combination of bayesian models in pymc3, I have some more questions. Thanks in advance for your time.
I have defined a linear model with multiple predictors, so that all nodes are observed,
HRmax ~ Age + BMI + HRrest:
with pm.Model() as model:
    # shared data containers, so inputs can be swapped at prediction time
    df_hrmax = pm.Data('df_hrmax', df[['hrrest', 'age', 'bmi']])
    resp_hrmax = pm.Data('resp_hrmax', df['hrmax'])

    # predictors, all defined as observed nodes
    pred_hrrest = pm.Normal('pred_hrrest', mu=60, sigma=20, observed=df_hrmax[:, 0])
    pred_age = pm.TruncatedNormal('pred_age', mu=60, sigma=10, lower=20, upper=100,
                                  observed=df_hrmax[:, 1])
    pred_bmi = pm.Normal('pred_bmi', mu=30, sigma=5, observed=df_hrmax[:, 2])

    # regression parameters
    intercept_hrmax = pm.Normal('intercept_hrmax', mu=0, sigma=100)
    error_std_hrmax = pm.HalfNormal('error_std_hrmax', sigma=5)
    c_hrmax_age = pm.Normal('c_hrmax_age', mu=0, sigma=1)
    c_hrmax_bmi = pm.Normal('c_hrmax_bmi', mu=0, sigma=1)
    c_hrmax_hrrest = pm.Normal('c_hrmax_hrrest', mu=0, sigma=1)

    # linear predictor and likelihood
    mu_hrmax = (intercept_hrmax + pred_age * c_hrmax_age
                + pred_bmi * c_hrmax_bmi + pred_hrrest * c_hrmax_hrrest)
    hrmax = pm.Normal('hrmax', mu=mu_hrmax, sigma=error_std_hrmax, observed=resp_hrmax)
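The trace used for prediction further down comes from standard sampling, roughly like this (shown for completeness; the exact sampler settings are placeholders):

with model:
    # draw posterior samples for the regression parameters
    trace = pm.sample(2000, tune=1000)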
-
Is it OK to define all nodes as observed? What are the downsides of this?
-
Now, when I want to test the model on unseen data and predict hrmax, I do:

with model:
    # keep the same column order as when the model was defined
    pm.set_data({'df_hrmax': X_test[['hrrest', 'age', 'bmi']]})
    post_pred = pm.sample_posterior_predictive(trace, samples=nsamples)
where X_test can have some missing values, like:

X_test = pd.DataFrame({'age': np.nan, 'bmi': 25, 'hrrest': 60}, index=[0])
and then 'age' is derived from the model and comes out very close to its prior mean mu=60.
But if I know for sure that 'age' is 70 (as well as bmi = 25 and hrrest = 60):

X_test = pd.DataFrame({'age': 70, 'bmi': 25, 'hrrest': 60}, index=[0])

I expect the prediction to use those TRUE values. Instead, I get the same distribution no matter which values I pass in X_test.
What am I doing wrong here?
-
If I know the response hrmax, how do I visualize the distributions of all the other nodes, to see what the most likely age, bmi and hrrest are?
-
If I get more data (predictors and response), how do I update my model? Is there a way to continue training, rather than doing it from the beginning with an extended df_hrmax?
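By "from the beginning" I mean roughly the following (a sketch; df_new is a hypothetical dataframe holding the additional rows):

df_all = pd.concat([df, df_new], ignore_index=True)
with model:
    # swap in the extended dataset and re-fit from scratch
    pm.set_data({'df_hrmax': df_all[['hrrest', 'age', 'bmi']],
                 'resp_hrmax': df_all['hrmax']})
    trace = pm.sample(2000, tune=1000)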
-
To compare the model to the data, I overlay the datapoints (red) with posterior samples (blue) of two predictors (x-axis) and the response (y-axis). NB: the variables here are different from age, bmi and hrrest, so the values differ, but they are also modelled as Normals.
I know that it should look like this
but my model generates different plots
The datapoints are the same for the top and bottom plots, as are the x and y ranges, but in my case the model-generated data doesn't match the actual data: the regression lines actually go in opposite directions.
How should I interpret this mismatch in the posterior, and what should I adjust in the model priors?
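For reference, the overlay plots are produced roughly like this (a minimal sketch using the hrmax model's variable names as placeholders; it assumes post_pred was sampled over the training data and contains draws for the observed predictor nodes as well as for hrmax, which should be the default of sample_posterior_predictive):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, pred, col in zip(axes, ['pred_age', 'pred_bmi'], ['age', 'bmi']):
    # posterior predictive draws of (predictor, response) pairs in blue
    ax.scatter(post_pred[pred].ravel(), post_pred['hrmax'].ravel(),
               color='blue', alpha=0.05, label='posterior samples')
    # actual datapoints in red
    ax.scatter(df[col], df['hrmax'], color='red', label='data')
    ax.set_xlabel(col)
axes[0].set_ylabel('hrmax')
axes[0].legend()
plt.show()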