Dataset:-
The dataset covers US states and gives the mortality rate due to cardiovascular disease, together with features it may depend on, e.g. AQI, obesity prevalence, etc.
Goal:-
I want to predict the mortality rate (Data_Value) from the given features.
Model:-
Using a correlation matrix, I found that the mortality rate is most strongly correlated with AQI, obesity, temperature, and the previous year's data values.
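As a minimal sketch of that screening step, the snippet below ranks candidate features by their absolute Pearson correlation with the target. The toy frame and all of its values are synthetic and made up purely for illustration (only the column names `Data_Value`, `AQI`, and `obesity_Prevalence` come from my data, and `unrelated` is a hypothetical noise column), so it shows the idea rather than my actual output.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for cleaned_data (values are invented for illustration)
rng = np.random.default_rng(0)
n = 200
aqi = rng.normal(50, 10, n)
obesity = rng.normal(30, 5, n)
toy = pd.DataFrame({
    "AQI": aqi,
    "obesity_Prevalence": obesity,
    "unrelated": rng.normal(0, 1, n),  # hypothetical noise feature
    "Data_Value": 2.0 * aqi + 3.0 * obesity + rng.normal(0, 5, n),
})

# Absolute Pearson correlation of every feature with the target,
# sorted from strongest to weakest
corr_with_target = (
    toy.corr()["Data_Value"]
    .drop("Data_Value")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Correlation only captures pairwise linear association, so I treat this ranking as a rough filter, not as proof that these are the right predictors.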
So I defined the model as given below:
import pymc as pm
import numpy as np

# Preparing the data: encode each state as an integer code
cleaned_data['LocationDesc_encoded'] = cleaned_data['LocationDesc'].astype('category').cat.codes
state_idx = cleaned_data['LocationDesc_encoded'].values
n_states = cleaned_data['LocationDesc_encoded'].nunique()

# Define the model
with pm.Model() as hierarchical_model:
    # Hyperpriors for group nodes
    mu_a = pm.Normal('mu_a', mu=0, sigma=1)
    sigma_a = pm.HalfCauchy('sigma_a', beta=1)
    # Priors for individual intercepts (one per state)
    a = pm.Normal('a', mu=mu_a, sigma=sigma_a, shape=n_states)
    # Hyperpriors for group slopes
    mu_b = pm.Normal('mu_b', mu=0, sigma=1)
    sigma_b = pm.HalfCauchy('sigma_b', beta=1)
    # Priors for individual slopes (one per state and feature)
    b = pm.Normal('b', mu=mu_b, sigma=sigma_b, shape=(n_states, 4))
    # Model error
    sigma = pm.HalfCauchy('sigma', beta=1)
    # Expected value
    mu = (
        a[state_idx]
        + b[state_idx, 0] * cleaned_data['obesity_Prevalence']
        + b[state_idx, 1] * cleaned_data['data_value_py']
        + b[state_idx, 2] * cleaned_data['Avg Temp(Ā°F)']
        + b[state_idx, 3] * cleaned_data['AQI']
    )
    # Likelihood
    Y_obs = pm.Normal('Y_obs', mu=mu, sigma=sigma, observed=cleaned_data['Data_Value'])
    # Sampling from the posterior distribution
    trace = pm.sample(5000, tune=1000, return_inferencedata=True)
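One way I can imagine getting predictions for rows the model has not seen is to re-evaluate the expected-value formula above with the posterior draws of `a` and `b`. The sketch below does this with plain NumPy. The function name `predict_mean` is hypothetical, and the random arrays are stand-ins for real posterior draws; with a real trace I assume they could be obtained with something like `trace.posterior["a"].values.reshape(-1, n_states)`. It also assumes every test row belongs to a state that was present in the training data and that the feature columns are in the same order as in `b`.

```python
import numpy as np

def predict_mean(a_draws, b_draws, state_idx, X):
    """Posterior-mean prediction of mu for each row (hypothetical helper).

    a_draws:   (n_draws, n_states) draws of the state intercepts
    b_draws:   (n_draws, n_states, n_features) draws of the state slopes
    state_idx: (n_rows,) integer state code of each row
    X:         (n_rows, n_features) feature values, same column order as b
    """
    intercepts = a_draws[:, state_idx]      # (n_draws, n_rows)
    slopes = b_draws[:, state_idx, :]       # (n_draws, n_rows, n_features)
    # mu for every posterior draw and every row
    mu_draws = intercepts + np.sum(slopes * X[None, :, :], axis=-1)
    # Average over draws to get one point prediction per row
    return mu_draws.mean(axis=0)

# Stand-in "posterior draws" (random numbers, NOT a fitted model)
rng = np.random.default_rng(1)
n_draws, n_states, n_features = 500, 3, 4
a_draws = rng.normal(1.0, 0.1, size=(n_draws, n_states))
b_draws = rng.normal(0.5, 0.05, size=(n_draws, n_states, n_features))

# Four hypothetical test rows
state_idx_test = np.array([0, 2, 1, 0])
X_test = rng.normal(size=(4, n_features))

y_pred = predict_mean(a_draws, b_draws, state_idx_test, X_test)
print(y_pred.shape)
```

This only gives the posterior mean of `mu`. A full posterior predictive would also add the observation noise `sigma`, which I have left out here to keep the sketch short.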
I am new to this topic, so I have several doubts. Can someone help me resolve them?
- Is my model correctly defined? If there is any issue, please point it out and suggest a correction.
- As I understand it, accuracy should be calculated on test data. How can I predict data values (mortality rate) for test data so that I can calculate RMSE? A code snippet would be very helpful.
- Various resources calculate RMSE on the same data they used to fit the posterior distribution. Is this a correct way to evaluate the model?
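To make the last two questions concrete, this is the kind of evaluation I have in mind: hold out part of the rows, fit on the rest, and compute RMSE on the held-out rows only. The sketch uses an ordinary least-squares fit on synthetic data as a stand-in for the Bayesian model, and the helper names `rmse` and `train_test_indices` are hypothetical. My understanding is that RMSE on the training rows tends to look more optimistic than RMSE on unseen rows, which is why I doubt the in-sample number.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def train_test_indices(n_rows, test_fraction=0.2, seed=0):
    """Random row split; returns (train_idx, test_idx)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_rows)
    n_test = int(round(test_fraction * n_rows))
    return perm[n_test:], perm[:n_test]

# Synthetic data standing in for features and Data_Value
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + rng.normal(scale=1.0, size=100)

train_idx, test_idx = train_test_indices(len(y))

# Least-squares fit on the training rows only (stand-in for the PyMC model)
X_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
X_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
coef, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)

print("train RMSE:", rmse(y[train_idx], X_train @ coef))
print("test RMSE:", rmse(y[test_idx], X_test @ coef))
```

Because my model uses the previous year's value as a feature, a split by year (train on earlier years, test on the latest) might be more appropriate than a random split, but I am not sure.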
Thank you