Hello PyMC community !
For my thesis, I am trying to predict the performance of a company 5 years after it had a share buyback announcement. I am using as a prior to my Bayesian model the empirical performance of companies that had a share buyback, and I update the prior based on the proportion of the market capitalization that the share buyback represent, using the empirical data of company that had similar share buyback proportion.
However, the prediction obtained by my model does not make sense based on my empirical data, and I have been searching for the problem since few days now without finding the reason. Therefore, I was hoping to get some help here.
Here is my code:
#since my data is in the range of [-1.00:75.00], where -1.00 represent -100% and 1.5 represent 150%, I
#shifted it to have it positive and higher than 1.00 so I can apply log-transformation
shift = 2.01
prior_belief_df = bb_df.groupby('company_ciqid')['5Y_performance'].mean().reset_index()
prior_belief_df.dropna(subset=['5Y_performance'], inplace=True)
prior_belief_df['5Y_performance_shifted'] = prior_belief_df['5Y_performance'] + shift
intervals = [0.00, 0.01, 0.03, 0.05, 0.15]#threshold considered for "marketcap_percentage"
X_aggregated = []
Y_aggregated = []
for lower_bound in intervals:
filtered_df = bb_df[bb_df['marketcap_percentage'] >= lower_bound].copy()
filtered_df['5Y_performance_shifted'] = filtered_df['5Y_performance'] + shift
log_grouped_means = np.log(filtered_df.groupby('company_ciqid')['5Y_performance_shifted'].mean()).reset_index()
X_aggregated.extend([lower_bound] * len(log_grouped_means))
Y_aggregated.extend(log_grouped_means['5Y_performance_shifted'].values)
X = np.array(X_aggregated)
X_squared = X**2
Y = np.array(Y_aggregated)
log_5Y_performance = np.log(prior_belief_df['5Y_performance_shifted'])
global_log_median_performance = np.median(log_5Y_performance)
global_log_std_performance = np.std(log_5Y_performance)
with pm.Model() as model:
#priors parameters using log-normal distribution
alpha = pm.Lognormal('alpha', mu=global_log_median_performance, sigma=global_log_std_performance)
beta = pm.Normal('beta', mu=0.01, sigma=0.005)
gamma = pm.Normal('gamma', mu=-0.005, sigma=0.0025)
sigma = pm.HalfNormal('sigma', sigma=1)
#expected value of outcome, including the quadratic term
log_mu = alpha + beta * X + gamma * X_squared
#likelihood (sampling distribution) of observations
Y_obs = pm.Lognormal('Y_obs', mu=log_mu, sigma=sigma, observed=Y)
#model fitting
trace = pm.sample(1000)
def predict_performance(marketcap_percentage, trace, return_ci=True, ci_level=0.95):
"""
Parameters:
- marketcap_percentage: float, the marketcap percentage for which to predict performance.
- trace: the trace object obtained after MCMC sampling.
- return_ci: bool, whether to return credible intervals for the predictions.
- ci_level: float, the confidence level for the credible interval (default is 95%).
Returns:
- A dictionary containing the predicted performance and, optionally, credible intervals.
"""
#prepare new input data
X_new = np.array([marketcap_percentage])
X_squared_new = X_new ** 2
#extract the posterior samples for the model parameters
alpha_samples = trace.posterior["alpha"].values.flatten()
beta_samples = trace.posterior["beta"].values.flatten()
gamma_samples = trace.posterior["gamma"].values.flatten()
#compute the expected log performance for the new input
log_mu_new = np.mean(alpha_samples) + np.mean(beta_samples) * X_new + np.mean(gamma_samples) * X_squared_new
#transform predictions back to original scale
performance_pred = np.exp(log_mu_new) - shift
if return_ci:
performance_samples = np.exp(alpha_samples[:, None] + beta_samples[:, None] * X_new + gamma_samples[:, None] * X_squared_new) - shift
ci_lower, ci_upper = np.percentile(performance_samples, [(1 - ci_level) / 2 * 100, (1 + ci_level) / 2 * 100], axis=0)
else:
ci_lower, ci_upper = None, None
result = {
'predicted_mean_performance': performance_pred[0], # Unpack the single value
'credible_interval': (ci_lower[0], ci_upper[0]) if return_ci else None
}
return result
prediction = predict_performance(marketcap_percentage = 0.10, trace = trace)
print(prediction)
I obtain as result “{‘predicted_mean_performance’: -1.0041700801651028, ‘credible_interval’: (-1.0063240928821156, -1.0015744422770478)}”.
Base on my empirical data, the median performance of company with “marketcap_percentage” of 0.10 or more is 0.17, while the mean is 1.32. Therefore, non only the result is far from the empirical data, but it is also nonsense since a company can’t perform lower than -100%.
I would be more than happy if somebody tried to help me in finding what is wrong. Let me know if you need more information