Combining/summing multiple hyperpriors in a hierarchical model: any issues?

Hello,

I am currently working on a model to estimate clicks based on a combination of ads and audiences. The dependent variable is the number of clicks per ad and audience, while the independent variable is the associated cost per ad and audience. Given the large number of ads and audiences, and limited data points, I am concerned about data sparsity and would like to pool the estimates to shrink them toward the mean. Specifically, I want to have one hyperprior for each audience and one for each ad. These hyperpriors should then be combined in some way to form the actual prior for the ad-audience pair (at least that is my idea).

In the context I am modeling, the coefficient b_{ad, audience} should be strictly positive, and the exponent sat_{ad, audience} should be bounded between 0 and 1, which leaves several choices for how to combine the hyperpriors. Currently, I multiply each hyperprior by 0.5 and sum them (i.e., average them), though I could also multiply them, or use a Dirichlet hyperprior to weight the individual priors. As for the likelihood, I have not yet settled on one. The output is strictly positive and discrete (unless I scale it), so I am considering a Poisson likelihood.
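To get a feel for what these combination schemes imply, here is a minimal numpy sketch comparing the three options on draws from the two half-normal coefficient hyperpriors. The Dirichlet(1, 1) weights are an illustrative choice, not something from the model above:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Draws from the two N+(0, 1) "coefficient" hyperpriors
hp_ad = np.abs(rng.normal(0.0, 1.0, n))
hp_aud = np.abs(rng.normal(0.0, 1.0, n))

# Scheme 1: multiply each by 0.5 and sum (i.e., average)
avg = 0.5 * hp_ad + 0.5 * hp_aud

# Scheme 2: multiply them
prod = hp_ad * hp_aud

# Scheme 3: Dirichlet(1, 1) weights instead of fixed 0.5/0.5
w = rng.dirichlet([1.0, 1.0], size=n)
dir_mix = w[:, 0] * hp_ad + w[:, 1] * hp_aud

for name, x in [("average", avg), ("product", prod), ("dirichlet", dir_mix)]:
    print(f"{name:9s} mean={x.mean():.3f} sd={x.std():.3f}")
```

Averaging keeps the half-normal mean (sqrt(2/pi) ≈ 0.80) while tightening the spread; multiplying pushes the mass toward zero (mean 2/pi ≈ 0.64) with a heavier right tail, so the two schemes encode quite different prior beliefs about b.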

I am looking for feedback from more experienced Bayesian modelers about any potential issues with this approach. While I plan to check for R-hats, divergences, and energy plots, I’m curious whether there are any structural challenges with this setup that might present problems for an HMC sampler, or any inherent pathologies I might be overlooking. Any intuition on this would be greatly appreciated.

The model is as follows:

y_{ad, audience} = b_{ad, audience} \cdot \text{cost}^{sat_{ad, audience}}_{ad, audience}

Where:

  • y_{ad, audience} is the number of outcomes for a given ad and audience
  • \text{cost}_{ad, audience} is the associated cost for that ad and audience
  • b_{ad, audience} is a parameter for the relationship between the outcomes and cost
  • sat_{ad, audience} is the saturation parameter for each ad and audience

The priors I am using are:

\text{hyperprior}_{coef-ad}, \text{hyperprior}_{coef-audience} \sim \mathcal{N^+}(0, 1)
b_{ad, audience} \sim \mathcal{N^+}\left(\frac{\text{hyperprior}_{coef-ad} + \text{hyperprior}_{coef-audience}}{2}, \sigma \right)
\text{hyperprior}_{sat-ad}, \text{hyperprior}_{sat-audience} \sim \text{Gamma}(2, 1)
\text{hyperprior}_{sat-ad-beta}, \text{hyperprior}_{sat-audience-beta} \sim \text{Gamma}(2, 1)
sat_{ad, audience} \sim \text{Beta}\left( \frac{\text{hyperprior}_{sat-ad} + \text{hyperprior}_{sat-audience}}{2}, \frac{\text{hyperprior}_{sat-ad-beta} + \text{hyperprior}_{sat-audience-beta}}{2} \right)

where \mathcal{N^+} represents a truncated normal.
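To sanity check these priors, here is a quick numpy forward simulation (a prior predictive sketch) of the model above. The sigma for b is set to 0.3 as an assumed value, and abs() of a normal is used as a crude stand-in for the truncated normal:

```python
import numpy as np

rng = np.random.default_rng(1)
n_ads, n_auds = 10, 8

# Hyperprior draws as stated above
hp_coef_ad = np.abs(rng.normal(0, 1, n_ads))
hp_coef_aud = np.abs(rng.normal(0, 1, n_auds))
hp_a_ad, hp_a_aud = rng.gamma(2, 1, n_ads), rng.gamma(2, 1, n_auds)
hp_b_ad, hp_b_aud = rng.gamma(2, 1, n_ads), rng.gamma(2, 1, n_auds)

# Pair-level priors: average the ad and audience hyperpriors
mu_b = 0.5 * (hp_coef_ad[:, None] + hp_coef_aud[None, :])
b = np.abs(rng.normal(mu_b, 0.3))  # folded normal as a rough truncated-normal stand-in
alpha = 0.5 * (hp_a_ad[:, None] + hp_a_aud[None, :])
beta = 0.5 * (hp_b_ad[:, None] + hp_b_aud[None, :])
sat = rng.beta(alpha, beta)

# Forward-simulate clicks: lambda = b * cost^sat, y ~ Poisson(lambda)
cost = rng.gamma(5.0, 10.0, size=(n_ads, n_auds))  # assumed cost scale
lam = b * cost ** sat
clicks = rng.poisson(lam)
print(clicks.shape, clicks.min(), round(lam.mean(), 2))
```

Plotting the simulated clicks against cost is a cheap way to see whether the implied prior predictive counts look remotely plausible before fitting anything.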

If I understand correctly, you are trying to use a linear model (in your example you’re lacking an intercept) with a Poisson sampling distribution to infer the probability of an audience member clicking on an ad given the cost of that ad. It’s easy to simulate data for this; to save time, I had DeepSeek do it for me this time. You can then fit a simple Poisson model to those data and check whether the parameters are properly recovered:

import numpy as np
import pandas as pd
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

# Set seed
np.random.seed(42)

# Simulation parameters
N_AUDIENCES = 100  # Number of unique audience members
OBS_PER_AUDIENCE = 20  # Observations per audience member

# True hyperparameters
hyper_mu_intercept = 2.0    # Population-level intercept
hyper_sigma_intercept = 0.5 # Between-audience intercept variability
hyper_mu_slope = -0.03      # Population-level cost slope 
hyper_sigma_slope = 0.01    # Between-audience slope variability

# Simulate audience-specific parameters
audience_ids = np.repeat(np.arange(N_AUDIENCES), OBS_PER_AUDIENCE)
true_intercepts = np.random.normal(hyper_mu_intercept, hyper_sigma_intercept, N_AUDIENCES)
true_slopes = np.random.normal(hyper_mu_slope, hyper_sigma_slope, N_AUDIENCES)

# Simulate cost and clicks
cost = np.random.gamma(shape=5, scale=1, size=N_AUDIENCES*OBS_PER_AUDIENCE)
eta = true_intercepts[audience_ids] + true_slopes[audience_ids] * cost
clicks = np.random.poisson(lam=np.exp(eta))

# Create DataFrame
data = pd.DataFrame({
    "audience_id": audience_ids,
    "cost": cost,
    "clicks": clicks
})

# Visualize relationships for 6 random audiences
sample_audiences = np.random.choice(N_AUDIENCES, 6, replace=False)
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, aud_id in zip(axes.flat, sample_audiences):
    aud_data = data[data.audience_id == aud_id]
    ax.scatter(aud_data["cost"], aud_data["clicks"], alpha=0.7)
    ax.set_title(f"Audience {aud_id}\nTrue β0={true_intercepts[aud_id]:.2f}, β1={true_slopes[aud_id]:.2f}")
    ax.set_xlabel("Cost")
    ax.set_ylabel("Clicks")
plt.tight_layout()
plt.show()


with pm.Model() as hierarchical_model:
    # Hyperpriors for population-level parameters
    mu_intercept = pm.Normal("mu_intercept", mu=0, sigma=2)
    sigma_intercept = pm.HalfNormal("sigma_intercept", sigma=0.5)
    
    mu_slope = pm.Normal("mu_slope", mu=0, sigma=0.1)
    sigma_slope = pm.HalfNormal("sigma_slope", sigma=0.05)
    
    # Non-centered parameterization for varying effects
    alpha_z = pm.Normal("alpha_z", 0, 1, shape=N_AUDIENCES)
    alpha = pm.Deterministic("alpha", mu_intercept + alpha_z * sigma_intercept) #varying intercept
    
    beta_z = pm.Normal("beta_z", 0, 1, shape=N_AUDIENCES)
    beta = pm.Deterministic("beta", mu_slope + beta_z * sigma_slope) #varying slope for clicks
    
    # Linear predictor
    eta = alpha[audience_ids] + beta[audience_ids]*cost
    mu = pm.math.exp(eta)
    
    # Likelihood (observed clicks)
    y = pm.Poisson("y", mu=mu, observed=clicks)
    
    # Sampling
    trace = pm.sample(2000, tune=2000, target_accept=0.95)

# Check hyperparameter recovery
az.plot_forest(trace, var_names=["mu_intercept", "mu_slope"], 
               combined=True, hdi_prob=0.9,
               figsize=(8, 3), 
               r_hat=True, ess=True)
plt.show()

Not a perfect model, but it’s something basic to start with. If you’re concerned about over-dispersion, you could try a Negative Binomial likelihood instead of a Poisson. You can also check the classic PyMC tutorial on hierarchical models, “A Primer on Bayesian Methods for Multilevel Modeling” in the PyMC example gallery, or the more sophisticated approaches in pymc-marketing, e.g. the “MMM Example Notebook”. Hope this helps. Good luck with the modelling.
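To see why the Negative Binomial matters here, a quick numpy sketch: the Negative Binomial arises as a Gamma-Poisson mixture, so its variance exceeds its mean, whereas the Poisson forces variance = mean. The mu = 5 and alpha = 2 below are illustrative values, not fitted quantities:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
mu, alpha = 5.0, 2.0  # mean and dispersion (illustrative values)

# Negative Binomial as a Gamma-Poisson mixture:
# rate ~ Gamma(alpha, scale=mu/alpha), then y ~ Poisson(rate)
lam = rng.gamma(shape=alpha, scale=mu / alpha, size=n)
y_nb = rng.poisson(lam)
y_pois = rng.poisson(mu, size=n)

print(f"Poisson : mean={y_pois.mean():.2f}, var={y_pois.var():.2f}")
print(f"NegBin  : mean={y_nb.mean():.2f}, var={y_nb.var():.2f}")
```

Both samples share the same mean, but the mixture's variance is mu + mu^2/alpha (17.5 here) instead of mu (5), which is exactly the extra slack an over-dispersed click count needs.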