Hi all,
Please excuse the long post.
I am currently working on a multi-level linear model for marketing mix prediction with four levels, each containing between 2 and 7 groups. For example, the State level has 7 groups (South Australia, Western Australia, …), the Brand level has 2 groups (Brand X, Brand Y), and so on.
I have modelled the design by extending both the example notebook here and the example notebook here.
All of my variables use a non-centered parameterisation, as is suggested best practice.
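(By non-centered I mean the usual reparameterisation, sketched generically below with a Normal for illustration; in my model the z terms are HalfNormal so that the contributions stay positive.)

```python
import pymc as pm

with pm.Model() as demo:
    mu_g = pm.Normal("mu_g", 0, 1)          # population-level mean
    sigma_g = pm.Exponential("sigma_g", 1)  # between-group spread

    # Centered version (prone to funnel geometry):
    # b_g = pm.Normal("b_g", mu=mu_g, sigma=sigma_g, shape=7)

    # Non-centered: draw a standard variate, then shift and scale it
    z_g = pm.Normal("z_g", 0, 1, shape=7)
    b_g = pm.Deterministic("b_g", mu_g + z_g * sigma_g)
```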
The X data is scaled by the maximum value within each cross-section, and the target values are shifted by the minimum and scaled by the range of each cross-section (i.e. min-max scaled per cross-section).
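Concretely, the scaling is along these lines (a sketch assuming the raw data sits in a pandas DataFrame `df`; `x_cols` and `"y"` are illustrative column names):

```python
import pandas as pd

group_cols = ["state", "age", "brand", "cohort"]  # one group per cross-section

# Scale each X column by its maximum within the cross-section
x_scaled = df.groupby(group_cols)[x_cols].transform(lambda s: s / s.max())

# Shift the target by the minimum and scale by the range within the cross-section
g = df.groupby(group_cols)["y"]
y_scaled = (df["y"] - g.transform("min")) / (g.transform("max") - g.transform("min"))
```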
The coordinates of the model are:
```python
coords = {
    "date": dates,
    "channel": channels,
    "pos_control": pos_controls,
    "neg_control": neg_controls,
    "lag": lags,
    "state": mn_state,
    "age": mn_age,
    "brand": mn_brand,
    "cohort": mn_cohort,
}
```
where the dimension lengths are:

- date: 156
- channel: 23
- pos_control: 19
- neg_control: 4
- lag: 1
- state: 7
- age: 6
- brand: 2
- cohort: 3
The model is essentially:
```python
# Model error
sd_y = pm.Exponential("sd_y", 1)

y_hat = pm.Deterministic(
    "y_hat",
    var=alpha[np.newaxis, :, :, :, :]
    + channel_contribution.sum(axis=1)
    + pos_control_contribution.sum(axis=1)
    + neg_control_contribution.sum(axis=1)
    + lag_contribution.sum(axis=1),
    dims=("date", "state", "age", "brand", "cohort"),
)

# Data likelihood
y_like = pm.Normal(
    "y_like",
    mu=y_hat,
    sigma=sd_y[np.newaxis, ...],
    observed=target_value,
    dims=("date", "state", "age", "brand", "cohort"),
)
```
I will define `channel_contribution` below; the other components of `mu` are analogous.
```python
# Slopes
mu_b = pm.HalfNormal("mu_b", sigma=1, dims="channel")

# Non-centered random slopes - state level
z_b_state = pm.HalfNormal("z_b_state", sigma=1, dims=("channel", "state"))
sigma_b_state = pm.Exponential("sigma_b_state", 1)

# Non-centered random slopes - age level
z_b_age = pm.HalfNormal("z_b_age", sigma=1, dims=("channel", "age"))
sigma_b_age = pm.Exponential("sigma_b_age", 1)

# Non-centered random slopes - brand level
z_b_brand = pm.HalfNormal("z_b_brand", sigma=1, dims=("channel", "brand"))
sigma_b_brand = pm.Exponential("sigma_b_brand", 1)

# Non-centered random slopes - cohort level
z_b_cohort = pm.HalfNormal("z_b_cohort", sigma=1, dims=("channel", "cohort"))
sigma_b_cohort = pm.Exponential("sigma_b_cohort", 1)

betas = pm.Deterministic(
    "betas",
    mu_b[:, np.newaxis, np.newaxis, np.newaxis, np.newaxis]
    + z_b_state[:, :, np.newaxis, np.newaxis, np.newaxis] * sigma_b_state
    + z_b_age[:, np.newaxis, :, np.newaxis, np.newaxis] * sigma_b_age
    + z_b_brand[:, np.newaxis, np.newaxis, :, np.newaxis] * sigma_b_brand
    + z_b_cohort[:, np.newaxis, np.newaxis, np.newaxis, :] * sigma_b_cohort,
    dims=("channel", "state", "age", "brand", "cohort"),
)

channel_contribution = pm.Deterministic(
    "channel_contribution",
    var=betas * _saturated,
    dims=("date", "channel", "state", "age", "brand", "cohort"),
)
```
Clearly, `channel_contribution` is non-negative (as it needs to be for media contributions). Further, `_saturated` is the media data with an adstock and saturation function (used directly from pymc-marketing) applied to it. `alpha`, `pos_control_contribution`, and `lag_contribution` are all defined in a similar way (all weakly positive), and, unsurprisingly, `neg_control_contribution` is weakly negative.
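For context, `_saturated` is built roughly as follows (a sketch assuming the geometric adstock and logistic saturation transformers from pymc-marketing, applied along the date axis; the priors and `x_media` are illustrative, and in my model the result carries the extra hierarchy dims):

```python
from pymc_marketing.mmm.transformers import geometric_adstock, logistic_saturation

# Illustrative per-channel transform parameters
adstock_alpha = pm.Beta("adstock_alpha", alpha=1, beta=3, dims="channel")
saturation_lam = pm.Gamma("saturation_lam", alpha=3, beta=1, dims="channel")

# x_media: raw media data with dims ("date", "channel"); the convolution
# runs along the date axis (axis 0 by default)
adstocked = geometric_adstock(x_media, alpha=adstock_alpha, l_max=8, normalize=True)
_saturated = logistic_saturation(adstocked, lam=saturation_lam)
```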
So now I get to the crux of it.
When I run this model on various specific cross-sections of the data at a target accept of 0.8, I usually get 1-2 divergences, all of my r_hat values are 1.0, and sampling takes about a minute. This seems good.
Then I increase the cross-sections included, expanding across the levels (for example, including all ages, then all brands, then all cohorts, then all states, in no particular order). This seems to work OK up to a point (I can't say exactly what this critical mass is), after which r_hat issues start to appear, though still with very few divergences. The model samples very slowly (36 hours for the full model), and the r_hat values become very, very poor, with 90% of the estimated variables showing r_hat > 1.01.
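For reference, I am judging this with the standard ArviZ summaries (a sketch; `model` is the model context from above):

```python
import arviz as az
import pymc as pm

with model:
    idata = pm.sample(target_accept=0.8)

# Divergence count across all chains
print("divergences:", int(idata.sample_stats["diverging"].sum()))

# Share of estimated quantities with r_hat above 1.01
summary = az.summary(idata)
print(f"{(summary['r_hat'] > 1.01).mean():.0%} of variables have r_hat > 1.01")
```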
So my thinking here was that the sampler must be exploring redundant regions far from reality before getting to where it needs to go, and that is why it isn't confident in the estimates. To check this, I run `pm.sample_prior_predictive` on a specific cross-section, and lo and behold, I get a plot that looks like this:
Clearly the prior predictions lie above the target value. I think this must be because of the positivity constraints on my variables, combined with priors that are too wide, pushing everything upwards? Interestingly, the prior predictive samples actually match the data pretty well if I scale the target variable up. Despite this, as I said before, when I run the model on this specific cross-section I get no divergences and good r_hats, and I am happy with the predictions as a first pass.
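For completeness, the check itself is just the standard call, with the draws overlaid on the scaled target (a sketch for a single cross-section; the plotting details are illustrative):

```python
import matplotlib.pyplot as plt

with model:
    prior_idata = pm.sample_prior_predictive(500)

# For one cross-section the hierarchy dims have length 1, so squeeze to (draw, date)
prior_y = prior_idata.prior_predictive["y_like"].squeeze()

plt.plot(prior_y.values.T, color="C0", alpha=0.05)
plt.plot(target_value.squeeze(), color="C1", label="scaled target")
plt.legend()
plt.show()
```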
So I try to tune the priors down a bit, so that sampling can start in a more reasonable region, and I do this by reducing the sigma values of the half-normal distributions in the model to 0.01 (I admit this is quite lazy), along the lines sketched below.
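For example (same structure as before, just with the sigmas shrunk; I tightened every half-normal in the model the same way):

```python
# Tightened priors: every HalfNormal sigma shrunk from 1 to 0.01
mu_b = pm.HalfNormal("mu_b", sigma=0.01, dims="channel")
z_b_state = pm.HalfNormal("z_b_state", sigma=0.01, dims=("channel", "state"))
# ... and likewise for the other z_b_* terms and the control/intercept blocks
```

With these tightened priors, the prior predictive samples cover the target value quite well: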
But then actually sampling even the single cross-section model produces r_hat issues (3% of variables have r_hat > 1.01), and expanding across a hierarchy (state, for example) produces poorer r_hat values than the wider priors did.
(I am unable to run `pm.sample_prior_predictive` on the full model without getting `ValueError: conflicting sizes for dimension 'state': length 1 on the data but length 7 on coordinate 'state'`, which seems similar to the issue faced in this post, but that isn't the point of my question.)
So I would like to know:
- Is this behaviour expected?
- What should I try next? (Scaling the y data up doesn't seem to help.)
- Are there gaps in my understanding of what drives these r_hat issues?
- Are there any suggestions for speeding up the sampler?
Thank you for reading all of that. I would be super grateful for any pointers.