Help model structure

I believe the problem stems from the model structure. I will try to give a quick concise explanation of my situation, so the readers have a better understanding.
I want to prediction the variable “5Y_performance”, which correspond to the performance of a company after it had a share buyback announcement. The empirical data range is [-1:175]. The prior of the model is the aggregated empirical data of performance of companies that had a share buyback announcement. I update the prior by using the information from the column “marketcap_percentage”, corresponding the the proportion of the share buyback.

The distribution of the untransformed data looks like this:

While the log-normal distribution looks like this (with a shift of 3.01 to have the data positive):

The structure of the Bayesian model looks like the following:

#since my data is in the range of [-1.00:175.00], where -1.00 represent -100% and 1.5 represent 150%, I
#shifted it to have it positive and higher than 1.00 so I can apply log-transformation
shift = 3.01
prior_belief_df = bb_belief_xs.groupby('company_ciqid')['5Y_performance'].mean().reset_index()
prior_belief_df.dropna(subset=['5Y_performance'], inplace=True)
prior_belief_df['5Y_performance_shifted'] = prior_belief_df['5Y_performance'] + shift

intervals = np.arange(0.00, 0.16, 0.01)

X_aggregated = []
Y_aggregated = []

for lower_bound in intervals:
    filtered_df = bb_belief_xs[bb_belief_xs['marketcap_percentage'] >= lower_bound].copy()
    filtered_df.dropna(subset=['5Y_performance'], inplace=True)

    filtered_df['5Y_performance_shifted'] = filtered_df['5Y_performance'] + shift

    log_grouped_means = np.log(filtered_df.groupby('company_ciqid')['5Y_performance_shifted'].mean()).reset_index()

    X_aggregated.extend([lower_bound] * len(log_grouped_means))
    Y_aggregated.extend(log_grouped_means['5Y_performance_shifted'].values)


X = np.array(X_aggregated)
Y = np.array(Y_aggregated)


log_5Y_performance = np.log(prior_belief_df['5Y_performance_shifted'])
global_log_median_performance = np.median(log_5Y_performance)
global_log_std_performance = np.std(log_5Y_performance)

with pm.Model() as model:

    X_data = pm.MutableData("X_data", X)

    alpha = pm.Lognormal('alpha', mu=global_log_median_performance, sigma=global_log_std_performance) 
    beta = pm.Normal('beta', mu=0.01, sigma=0.05)
    sigma = pm.HalfNormal('sigma', sigma=0.05) 

    log_mu = alpha + beta * X_data

    #likelihood (sampling distribution) of observations
    Y_obs = pm.Lognormal('Y_obs', mu=log_mu, sigma=sigma, observed=Y) 

    trace = pm.sample(1000, return_inferencedata=True)

with model:
    #generate posterior predictive samples
    posterior_predictive = pm.sample_posterior_predictive(trace)

The following code gives gives me the distribution of the observed data vs predicted data:

predicted_data = posterior_predictive.posterior_predictive["Y_obs"].values.flatten()

sample_size = 100000

predicted_data_sample = np.random.choice(predicted_data, size=sample_size, replace=False)

observed_data = Y

plt.figure(figsize=(10, 5))

sns.kdeplot(observed_data, color="blue", label="Observed Data")

sns.kdeplot(predicted_data, color="red", alpha=0.5, label="Predicted Data")

plt.title('Observed vs Predicted Data KDE')

plt.legend()

plt.show()


image

The residuals distribution looks like the following:

As we can see, the distribution of the predicted data seems to not be to different from the observed data, or I might be wrong since I am not experienced on the matter. However, the prediction that I obtain has a median always about 50% higher than the one from my empirical data, while the means changes just slightly when using a different “marketcap_percentage”, while my empirical data the means change substantially when playing with this parameter.

Here is the code used to make the prediction. It might not be correct:

#marketcap_percentage input to predict
new_X_values = np.array([0.15])

with model:
    pm.set_data({"X_data": new_X_values})

    #generate posterior predictive samples with updated X
    posterior_predictive = pm.sample_posterior_predictive(trace)

#access the posterior_predictive group directly
predicted_samples = posterior_predictive.posterior_predictive["Y_obs"]
predicted_samples_flat = predicted_samples.values.flatten()

reversed_exp = np.exp(predicted_samples_flat)
reversed_original = reversed_exp - shift

predicted_mean_original = np.mean(reversed_original)
predicted_median_original = np.median(reversed_original)

print("Mean of Predicted Y_obs in Original Scale:", predicted_mean_original)
print("Median of Predicted Y_obs in Original Scale:", predicted_median_original)

I would like to have the opinion of somebody experienced to understand what I am doing wrong.