I’ve got a synthetic dataset and I’m trying to predict (yet-unrevealed) values of y using x1 and x2. I’m told the irreducible RMSE is 1. I’m just trying to make sure my basic workflow makes sense, and if anyone has modeling tips to get an even better answer I’m all ears. The prompt says: “Assume even minor improvements in accuracy can provide a significant advantage. Hint: the irreducible error measured by RMSE is known to be 1.” So I’m also a little unsure of what a “good” RMSE would be.

## Step 1: Data Exploration

```python
import matplotlib.pyplot as plt

# Plot x1 in red
plt.scatter(model_data_raw['x1'], model_data_raw['y'], color='red', label='x1', alpha=0.4)

# Plot x2 in blue
plt.scatter(model_data_raw['x2'], model_data_raw['y'], color='blue', label='x2', alpha=0.4)

# Set labels and title
plt.xlabel('x')
plt.ylabel('y')
plt.title('Plot of x1 and x2')

# Show a legend
plt.legend()

# Display the plot
plt.show()

print('Data General Summary:')
display(model_data_raw.describe())
print('Data Correlation:')
display(model_data_raw.corr())
print('Data Histogram:')
display(model_data_raw.hist(bins=20))
```

Based on the shape of x2’s histogram, it looks roughly exponentially distributed, so to make it easier to work with I apply a log transformation.
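(The transform step itself is roughly the following — a sketch, assuming x2 is the only column I log and that it’s strictly positive; `model_data_raw` is the loaded DataFrame:)

```python
import numpy as np
import pandas as pd

def make_log_data(raw: pd.DataFrame) -> pd.DataFrame:
    """Log-transform x2 only, leaving x1 and y on their original scale.

    Assumes x2 > 0 everywhere (true for an exponential-looking feature).
    """
    log_data = raw.copy()
    log_data['x2'] = np.log(log_data['x2'])
    return log_data

# Tiny example frame standing in for model_data_raw:
example = pd.DataFrame({'x1': [0.5, 1.0], 'x2': [1.0, np.e], 'y': [1.2, 2.3]})
print(make_log_data(example)['x2'].round(6).tolist())
```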

## Step 2: Model build

That looks better, so from there I go on to fit this model:

```python
import pymc as pm

x1 = log_data['x1']
x2 = log_data['x2']
y = log_data['y']

coords = {
    'y': y,
    'x1': x1,
    'x2': x2
}

def lin_regress(coords, x1, x2, y):
    with pm.Model(coords=coords) as model:
        # Mutable data
        x1s = pm.MutableData('x1s', x1)
        x2s = pm.MutableData('x2s', x2)

        # Hyperpriors
        # mu_hyper = pm.Normal('mu_hyper', mu=0, sigma=10)
        # sig_hyper = pm.HalfNormal('sig_hyper', sigma=2)

        # Priors
        sigma = pm.HalfNormal('sigma', sigma=5)
        alpha = pm.Normal('alpha', mu=0, sigma=4)
        beta1 = pm.Normal('beta1', mu=0, sigma=4)
        beta2 = pm.Normal('beta2', mu=0, sigma=4)

        # Regression
        lin_reg = pm.Deterministic('lin_reg',
                                   alpha
                                   + x1s * beta1
                                   + x2s * beta2)

        # Likelihood
        likelihood = pm.Normal('y', mu=lin_reg, sigma=sigma, observed=y)

        trace = pm.sample(draws=5000, tune=300)

        ppc = pm.sample_posterior_predictive(trace)

    return model, trace, ppc
```

At first I used a hierarchical model, but it really didn’t improve the RMSE at all and there were a number of divergences, so I decided it wasn’t worth it.
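(For context on those divergences: I’ve since read they often come from a centered parameterization, and that the non-centered form — sample a standard normal, then shift and scale — is the usual fix. A quick numpy illustration of the reparameterization itself, not my actual model code:)

```python
import numpy as np

rng = np.random.default_rng(0)
mu_hyper, sig_hyper = 0.0, 2.0

# Centered: draw theta ~ Normal(mu_hyper, sig_hyper) directly.
theta_centered = rng.normal(mu_hyper, sig_hyper, size=10_000)

# Non-centered: draw z ~ Normal(0, 1), then shift and scale.
# Same distribution, but in an MCMC sampler this decouples theta's
# geometry from sig_hyper, which is what usually removes divergences.
z = rng.normal(0.0, 1.0, size=10_000)
theta_noncentered = mu_hyper + sig_hyper * z

print(round(float(theta_noncentered.std()), 1))
```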

## Step 3: Posterior Checks

Finally, I run a posterior predictive check against the actual data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

model, trace, ppc = lin_regress(coords, x1, x2, y)

# Average over the chain and draw dimensions: one prediction per row
y_pred_mean = ppc.posterior_predictive['y'].mean(dim=['chain', 'draw'])

y_true = log_data['y']

# Calculate the mean squared error
mse = mean_squared_error(y_true, y_pred_mean)

# Take the square root
rmse = np.sqrt(mse)
print("RMSE: ", rmse)

# No-Hierarchy RMSE: 1.284515001380565
# Hierarchy    RMSE: 1.2846179535127267 (with 256 divergences)
```

To recap: does this workflow make sense? Any advice on other improvements? There are 5000 rows of data, so is an RMSE of 1.28 any good if I know the irreducible error is 1?
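My own back-of-envelope read, assuming the remaining model error is independent of the irreducible noise so the two RMSEs add in quadrature (an assumption about this dataset, not a theorem):

```python
import numpy as np

# If total error = irreducible noise + independent model error, then
# rmse_total**2 ~= rmse_irreducible**2 + rmse_model**2, so the model
# error still left on the table is roughly:
rmse_total = 1.2845
rmse_irreducible = 1.0
rmse_model = np.sqrt(rmse_total**2 - rmse_irreducible**2)
print(round(float(rmse_model), 3))  # → 0.806
```

So if that independence assumption is even roughly right, there’s still a fair amount of structure my linear model isn’t capturing.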