I believe the problem stems from the model structure. I will try to give a quick concise explanation of my situation, so the readers have a better understanding.
I want to prediction the variable “5Y_performance”, which correspond to the performance of a company after it had a share buyback announcement. The empirical data range is [-1:175]. The prior of the model is the aggregated empirical data of performance of companies that had a share buyback announcement. I update the prior by using the information from the column “marketcap_percentage”, corresponding the the proportion of the share buyback.
The distribution of the untransformed data looks like this:
While the log-normal distribution looks like this (with a shift of 3.01 to have the data positive):
The structure of the Bayesian model looks like the following:
#since my data is in the range of [-1.00:175.00], where -1.00 represent -100% and 1.5 represent 150%, I
#shifted it to have it positive and higher than 1.00 so I can apply log-transformation
shift = 3.01
prior_belief_df = bb_belief_xs.groupby('company_ciqid')['5Y_performance'].mean().reset_index()
prior_belief_df.dropna(subset=['5Y_performance'], inplace=True)
prior_belief_df['5Y_performance_shifted'] = prior_belief_df['5Y_performance'] + shift
intervals = np.arange(0.00, 0.16, 0.01)
X_aggregated = []
Y_aggregated = []
for lower_bound in intervals:
filtered_df = bb_belief_xs[bb_belief_xs['marketcap_percentage'] >= lower_bound].copy()
filtered_df.dropna(subset=['5Y_performance'], inplace=True)
filtered_df['5Y_performance_shifted'] = filtered_df['5Y_performance'] + shift
log_grouped_means = np.log(filtered_df.groupby('company_ciqid')['5Y_performance_shifted'].mean()).reset_index()
X_aggregated.extend([lower_bound] * len(log_grouped_means))
Y_aggregated.extend(log_grouped_means['5Y_performance_shifted'].values)
X = np.array(X_aggregated)
Y = np.array(Y_aggregated)
log_5Y_performance = np.log(prior_belief_df['5Y_performance_shifted'])
global_log_median_performance = np.median(log_5Y_performance)
global_log_std_performance = np.std(log_5Y_performance)
with pm.Model() as model:
X_data = pm.MutableData("X_data", X)
alpha = pm.Lognormal('alpha', mu=global_log_median_performance, sigma=global_log_std_performance)
beta = pm.Normal('beta', mu=0.01, sigma=0.05)
sigma = pm.HalfNormal('sigma', sigma=0.05)
log_mu = alpha + beta * X_data
#likelihood (sampling distribution) of observations
Y_obs = pm.Lognormal('Y_obs', mu=log_mu, sigma=sigma, observed=Y)
trace = pm.sample(1000, return_inferencedata=True)
with model:
#generate posterior predictive samples
posterior_predictive = pm.sample_posterior_predictive(trace)
The following code gives gives me the distribution of the observed data vs predicted data:
predicted_data = posterior_predictive.posterior_predictive["Y_obs"].values.flatten()
sample_size = 100000
predicted_data_sample = np.random.choice(predicted_data, size=sample_size, replace=False)
observed_data = Y
plt.figure(figsize=(10, 5))
sns.kdeplot(observed_data, color="blue", label="Observed Data")
sns.kdeplot(predicted_data, color="red", alpha=0.5, label="Predicted Data")
plt.title('Observed vs Predicted Data KDE')
plt.legend()
plt.show()
The residuals distribution looks like the following:
As we can see, the distribution of the predicted data seems to not be to different from the observed data, or I might be wrong since I am not experienced on the matter. However, the prediction that I obtain has a median always about 50% higher than the one from my empirical data, while the means changes just slightly when using a different “marketcap_percentage”, while my empirical data the means change substantially when playing with this parameter.
Here is the code used to make the prediction. It might not be correct:
#marketcap_percentage input to predict
new_X_values = np.array([0.15])
with model:
pm.set_data({"X_data": new_X_values})
#generate posterior predictive samples with updated X
posterior_predictive = pm.sample_posterior_predictive(trace)
#access the posterior_predictive group directly
predicted_samples = posterior_predictive.posterior_predictive["Y_obs"]
predicted_samples_flat = predicted_samples.values.flatten()
reversed_exp = np.exp(predicted_samples_flat)
reversed_original = reversed_exp - shift
predicted_mean_original = np.mean(reversed_original)
predicted_median_original = np.median(reversed_original)
print("Mean of Predicted Y_obs in Original Scale:", predicted_mean_original)
print("Median of Predicted Y_obs in Original Scale:", predicted_median_original)
I would like to have the opinion of somebody experienced to understand what I am doing wrong.



