Dataset:
The dataset covers US states and records the mortality rate due to cardiovascular disease, together with features it may depend on, e.g. AQI, obesity, etc.
Goal:
I want to predict the mortality rate (Data_Value) using the given features.
Model:
Using a correlation matrix, I found that the mortality rate is most strongly correlated with AQI, obesity, temperature, and the previous year's data values.
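A correlation check of this kind can be sketched as follows. This is a minimal illustration on made-up data, not the actual dataset: the values and the `noise_feature` column are hypothetical, and only the column names `AQI`, `obesity_Prevalence` and `Data_Value` are borrowed from the model code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Made-up columns standing in for the real dataset (values are synthetic).
df = pd.DataFrame({
    "AQI": rng.normal(50, 10, n),
    "obesity_Prevalence": rng.normal(30, 4, n),
    "noise_feature": rng.normal(0, 1, n),  # hypothetical, unrelated to the target
})
df["Data_Value"] = (
    2.0 * df["obesity_Prevalence"] + 0.5 * df["AQI"] + rng.normal(0, 5, n)
)

# Pearson correlation of every feature with the target, strongest first.
corr_with_target = (
    df.corr(numeric_only=True)["Data_Value"]
    .drop("Data_Value")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_target)
```

Sorting by absolute value keeps strongly negative correlations near the top as well, which a plain descending sort would push to the bottom.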
So I defined the model as given below:
import pymc as pm
import numpy as np

# Preparing the data: one integer code per state
cleaned_data['LocationDesc_encoded'] = cleaned_data['LocationDesc'].astype('category').cat.codes
state_idx = cleaned_data['LocationDesc_encoded'].values
n_states = len(np.unique(state_idx))

# Define the model
with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0, sigma=1)
    sigma_a = pm.HalfCauchy('sigma_a', beta=1)

    # Priors for individual intercepts (one per state)
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=n_states)

    # Hyperpriors for group slopes
    mu_b = pm.Normal('mu_b', mu=0, sigma=1)
    sigma_b = pm.HalfCauchy('sigma_b', beta=1)

    # Priors for individual slopes (one row per state, one column per feature)
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=(n_states, 4))

    # Model error
    sigma = pm.HalfCauchy('sigma', beta=1)

    # Expected value
    mu = (
        a[state_idx]
        + b[state_idx, 0] * cleaned_data['obesity_Prevalence'].values
        + b[state_idx, 1] * cleaned_data['data_value_py'].values
        + b[state_idx, 2] * cleaned_data['Avg Temp(°F)'].values
        + b[state_idx, 3] * cleaned_data['AQI'].values
    )

    # Likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=cleaned_data['Data_Value'].values)

    # Sampling from the posterior distribution (runs inside the model context)
    trace = pm.sample(5000, tune=1000, return_inferencedata=True)
I am new to this topic, so I have several questions. Could someone help me resolve them?

Is my model defined correctly? If there are any issues, please point them out and suggest corrections.

As I understand it, accuracy should be calculated on test data. How can I predict data values (mortality rate) for the test data so that I can calculate the RMSE? A code snippet would be very helpful.

Various resources calculate RMSE on the same data they used to find the posterior distribution. Is this a correct way to evaluate a model?
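For concreteness, the two quantities being compared here (RMSE on the fitting data versus RMSE on held-out data) can be computed as in the small sketch below. It uses synthetic data and a deliberately flexible polynomial fit with NumPy instead of the PyMC model, purely to keep it short. The `rmse` helper is a hypothetical name introduced for illustration.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


rng = np.random.default_rng(0)

# Synthetic linear relationship with noise (values are made up).
x = rng.uniform(-1, 1, 60)
y = 1.5 * x + rng.normal(scale=0.5, size=x.size)

# First 15 rows are used for fitting, the remaining 45 are held out.
x_train, y_train = x[:15], y[:15]
x_test, y_test = x[15:], y[15:]

# Deliberately flexible model (degree-8 polynomial), fitted on training rows only.
coefs = np.polyfit(x_train, y_train, deg=8)

rmse_train = rmse(y_train, np.polyval(coefs, x_train))
rmse_test = rmse(y_test, np.polyval(coefs, x_test))
print(f"in-sample RMSE: {rmse_train:.3f}, held-out RMSE: {rmse_test:.3f}")
```

The in-sample figure measures how well the model reproduces data it has already seen, so for a flexible model it tends to look better than the held-out figure. The held-out figure is the one that speaks to predictive accuracy.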
Thank you.