Following up on this question, Combination of bayesian models in pymc3, I have some more questions. Thanks in advance for your time.
I have defined a linear model with multiple predictors, so that all nodes are observed,
HRmax ~ Age + BMI + HRrest:
with pm.Model() as model:
    # shared data containers, so inputs can be swapped at prediction time
    df_hrmax = pm.Data('df_hrmax', df[['hrrest', 'age', 'bmi']])
    resp_hrmax = pm.Data('resp_hrmax', df['hrmax'])

    # predictors, all defined as observed nodes
    pred_hrrest = pm.Normal('pred_hrrest', mu=60, sigma=20, observed=df_hrmax[:, 0])
    pred_age = pm.TruncatedNormal('pred_age', mu=60, sigma=10, lower=20, upper=100,
                                  observed=df_hrmax[:, 1])
    pred_bmi = pm.Normal('pred_bmi', mu=30, sigma=5, observed=df_hrmax[:, 2])

    # regression parameters
    intercept_hrmax = pm.Normal('intercept_hrmax', mu=0, sigma=100)
    error_std_hrmax = pm.HalfNormal('error_std_hrmax', sigma=5)
    c_hrmax_age = pm.Normal('c_hrmax_age', mu=0, sigma=1)
    c_hrmax_bmi = pm.Normal('c_hrmax_bmi', mu=0, sigma=1)
    c_hrmax_hrrest = pm.Normal('c_hrmax_hrrest', mu=0, sigma=1)

    # linear predictor and likelihood
    mu_hrmax = (intercept_hrmax + pred_age * c_hrmax_age
                + pred_bmi * c_hrmax_bmi + pred_hrrest * c_hrmax_hrrest)
    hrmax = pm.Normal('hrmax', mu=mu_hrmax, sigma=error_std_hrmax, observed=resp_hrmax)
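The trace used for prediction further down comes from standard sampling, roughly like this (shown for completeness; the exact sampler settings are placeholders):

with model:
    # draw posterior samples for the regression parameters
    trace = pm.sample(2000, tune=1000)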
-
Is it OK to define all nodes as observed? What are the downsides of this?
-
Now, when I want to test the model on unseen data and predict hrmax, I do:

with model:
    # keep the same column order as when the model was defined
    pm.set_data({'df_hrmax': X_test[['hrrest', 'age', 'bmi']]})
    post_pred = pm.sample_posterior_predictive(trace, samples=nsamples)
where X_test can have some missing values, like:

X_test = pd.DataFrame({'age': np.nan, 'bmi': 25, 'hrrest': 60}, index=[0])
and then 'age' is derived from the model and comes out very close to its prior mean mu=60.
But if I know for sure that 'age' is 70 (as well as bmi = 25 and hrrest = 60):

X_test = pd.DataFrame({'age': 70, 'bmi': 25, 'hrrest': 60}, index=[0])

I expect the prediction to use those TRUE values. Instead, I get the same distribution no matter which values I pass in X_test.
What am I doing wrong here?
-
If I know the response hrmax, how do I visualize the distributions of all the other nodes, to see what the most likely age, bmi and hrrest are?
-
If I get more data (predictors and response), how do I update my model? Is there a way to continue training, rather than doing it from the beginning with an extended df_hrmax?
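By "from the beginning" I mean roughly the following (a sketch; df_new is a hypothetical dataframe holding the additional rows):

df_all = pd.concat([df, df_new], ignore_index=True)
with model:
    # swap in the extended dataset and re-fit from scratch
    pm.set_data({'df_hrmax': df_all[['hrrest', 'age', 'bmi']],
                 'resp_hrmax': df_all['hrmax']})
    trace = pm.sample(2000, tune=1000)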
-
To compare the model to the data, I overlay the datapoints (red) with posterior samples (blue) of two predictors (x-axis) and the response (y-axis). NB: the variables here are different from age, bmi and hrrest, so the values differ, but they are also modelled as Normals.
I know that it should look like this
but my model generates different plots
The datapoints are the same for the top and bottom plots, as are the x and y ranges, but in my case the model-generated data doesn't match the actual data: the regression lines actually go in opposite directions.
How should I interpret this mismatch in the posterior, and what should I adjust in the model priors?
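For reference, the overlay plots are produced roughly like this (a minimal sketch using the hrmax model's variable names as placeholders; it assumes post_pred was sampled over the training data and contains draws for the observed predictor nodes as well as for hrmax, which should be the default of sample_posterior_predictive):

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, pred, col in zip(axes, ['pred_age', 'pred_bmi'], ['age', 'bmi']):
    # posterior predictive draws of (predictor, response) pairs in blue
    ax.scatter(post_pred[pred].ravel(), post_pred['hrmax'].ravel(),
               color='blue', alpha=0.05, label='posterior samples')
    # actual datapoints in red
    ax.scatter(df[col], df['hrmax'], color='red', label='data')
    ax.set_xlabel(col)
axes[0].set_ylabel('hrmax')
axes[0].legend()
plt.show()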