Calculated Values not matching Posterior Predictive

Hi All,

Throwing this out there with the potential for embarrassing myself with a lack of understanding the sampling process, but I have a strange bug I can’t seem to solve. It’s a long one.

I’m building a Media Mix Model similar to many others. For those unfamiliar, it is big linear regression model where the explanatory variables are first adjusted by two functions before being multiplied by the coefficient.

By sampling, I am hoping to find the most likely values for each parameter of the functions in order to recreate the simple regression model and predict on new data.

The ultimate Problem: When I use the posterior means as parameters to recreate one of the curves, the outcome is strangely lower than the posterior prediction of the outcome (target). The parameters for the other curve and the main regression coefficients work perfectly and calculate to the same value as the posterior prediction.

The code

Versions:

PyMC = 4.0.0b5 // Aesara = 2.5.1 // Arviz = 0.12.0

The data is Marketing spend (or impressions) on a weekly basis for each channel (tv, radio, print). These values are transformed by two functions before being added to the regression

The first function is an Adstock decay curve. This allows the model to realize impact of marketing slightly after the exposure happened:

def weibull_adstock_at(x_t, alpha:float, theta:int=0, L=12, normalize=True):
    """
    Builds the Weibull adstock decay curve and applies it across the array
    """
    w = (alpha**((at.ones(L).cumsum()-1)-theta)**2)

    xx = at.stack([at.concatenate([at.zeros(i), x_t[:x_t.shape[0] - i]]) for i in range(L)])

    if not normalize:
        y = at.dot(w, xx)
    else:
        y = at.dot(w / at.sum(w), xx)

    return y

The next is a saturation curve (spoiler alert: this is my problem curve). I have used the Hill function from this widely cited paper.

def saturation_hill_pymc(x, alpha, gamma):

    x_s_hill = x**alpha / (x**alpha + gamma**alpha)

    return x_s_hill

I build a model as such:


response_mean = []
with pm.Model(rng_seeder=42) as model_2:
    for channel_name in delay_channels:

        x = data_transformed[channel_name].values

        ads_alpha = pm.Beta(f"{channel_name}_ads_alpha", 3, 3)
    
        ads_theta = pm.Gamma(f"{channel_name}_ads_theta", 2, 1) 

        sat_gamma = pm.Beta(f"{channel_name}_sat_gamma", 3, 3)

        sat_alpha = pm.Gamma(f"{channel_name}_sat_alpha", 2, 1)


        x_new = weibull_adstock_at(x,  alpha=ads_alpha,  theta=ads_theta)

        saturation_tensor = hill_saturation_build(x_new, alpha=sat_alpha, gamma=sat_gamma)

        channel_b = pm.HalfNormal(f"{channel_name}_media_coef", sigma=1)

        response_mean.append(saturation_tensor * channel_b)

    for control_var in control_variables:

        x = data_transformed[control_var].values

        control_beta = pm.Normal(f"{control_var}_control_coef")

        control_x = control_beta * x

        response_mean.append(control_x)

    intercept = pm.HalfNormal("intercept", 0.05)

    sigma = pm.HalfNormal("sigma", 0.05)


    likelihood = pm.Normal("outcome",
                           mu=intercept + sum(response_mean),
                           sigma=sigma,
                           observed=data_transformed[target].values)

     idata = pm.sample(2000, tune=2000, return_inferencedata=True) 

Then sample a posterior predictive

with model_2:
    idata.extend(pm.sample_posterior_predictive(idata)) 

Things are going OK so far.

The next step the the project is to attempt to pull out the parameters for the model and recreate the basic equation.

Initially I started with create a summary in arviz like so:

trace_summary = az.summary(idata)

This gave me a mean for each parameter that I could use to recreate the equations like so:

    adstock_theta = trace_summary['mean'][f"{channel}_ads_theta"]
    adstock_alpha = trace_summary['mean'][f"{channel}_ads_alpha"]
    saturation_alpha = trace_summary['mean'][f"{channel}_sat_alpha"]
    saturation_gamma = trace_summary['mean'][f"{channel}_sat_gamma"]
    coef = trace_summary['mean'][f"{channel}_coef"]

    data_transformed_decomposed[delay_channel] = (hill_saturation_build(
                       weibull_adstock_at(
                                         data_transformed_[delay_channel].values, 
                                         alpha=adstock_alpha, 
                                         theta=adstock_theta, 
                                          ), 
                        alpha=saturation_alpha,
                         gamma=saturation_gamma,
               ) * coef).eval()

Sum up the outcomes of all the channels, add the intercept and we’re having an old school statistics party.

However…

When I feed all of the previous data into this equation and compare the results to the Posterior Predictive outcome, the numbers are suspiciously consistently lower.

Debug mode

  1. At first I considered that I might have left out a variable in the follow up calculation. After a triple check it seemed correct,

  2. Next, I thought perhaps the values in idata might be different ever so slightly from what I was initially using in trace summary. So, I went back to recheck different methods of finding the mean of each distribution across chains and draws. I replaced the parameters to the non-rounded values. No change.

  3. Then I thought perhaps just one feature in the data was acting up in response to the functions applied. So, I added pm.Deterministic() to the model to store the resulting contributions of each channel for each week. I could then look at these compared to the calculated values each week before summing up to a final regression predictions. The pm.Deterministic() numbers were a mostly higher each week (row) across most channels (columns). This wasn’t some outlier but every row and column being slightly off. Although some channels were higher than others. HOWEVER!!, The control_channels which did no have any of the functions applied were spot on with the pm.Deterministic numbers

channel_contrib = pm.Deterministic(
            f'contribution_{channel_name}', 
            hill_saturation_build(
                 weibull_adstock_at(x, 
                     alpha=ads_alpha, 
                     theta=ads_theta), 
                alpha=sat_alpha, 
                gamma=sat_gamma
                ) * channel_b)
  1. So, I now know only the channels with functions applied are calculating differently from the posterior_predictive. Next step was to isolate each function to see how it performed. The first step I eliminated the hill_saturation_build function from the model and the calculation. Sampling the data only using the weibull_adstock_decay function and the coefficents. Jackpot. The calculated numbers were nearly exactly the same as the posterior sample. Then I ran the model again using only the hill_saturation_build function and the calculated numbers were once again consistently off. It’s the saturation function doing the damage.

  2. I tried feeding the hill_saturation_build function small samples data of both numpy arrays and aesara tensors to see if it produced different results. No news. Same results with either array or tensor.

  3. I built a visual to help see where the calculated numbers differed from the pm.Deterministic() numbers by channel. Comparing the same weeks prediticted contribution to the overall against the calculated contribution using the same function. I then added a line with the difference…

Finally getting somewhere and to the point of this post!

I can see from the chart that when the value of the individual feature increases the difference between the calculated feild and the posterior decreases. Looking at the visuals I can clearly see they are related.

So, my hill function is working differently inside the sampling process than it works outside in a calculation?

  1. The Hill function used a half-life parameter that assumed all values were between 0-1 (I min-max scaled all features), But perhaps after an adstock decay some features were above 1? So, borrowing from another sample code, I added a line of code to determine the range of values to transform first, then adujsted my half-life parameter to be the same inside that new scale.
def hill_saturation_build(x: np.array, alpha: float, gamma: float) -> np.array:
    """
    Saturation function. Does not need to be scaled to 0-1.
    """
    gammatransformed = (at.max(x) - at.min(x)) * gamma

    sat_curve = x**alpha / (x**alpha + gammatransformed**alpha)

    return sat_curve 

This helped a touch, but still suspiciously off.

I tested the Saturation Hill function again on a range of very small values and large ones plotting the results and found the curve to remain the same across all values.

So, you made it this far… you’re a hero. Is there something in the sampling that would make this Hill Function work differently inside the model? Am I completely missing the point on how sampling works? Is there some obviously easy solution am missing?

Huge thanks in advance to everyone in the pymc community.

1 Like

I skimmed through (briefly). Am I correct in thinking that you pulling in the mean parameter values and using them to generate a single predicted outcome and comparing this to the mean outcome when generating individual outcomes per parameter value? If so, there’s no particular reason to expect these to line up (they should probably be close, but unlikely to be identical). To reproduce the entire posterior predictive process, you would need to take the first sample, extract each parameter’s value, plug these values into you model, generate the expected outcome, repeat this for each sample, and then average the predicted outcomes over all those steps.

1 Like

I’d be less optimistic and say they might be close but there is no guarantee of that (unless you are only doing linear transformations). My latest blog post might be useful

Extra side note. Don’t use az.summary to compute means only and use their values. I think computing the mean explicitly is more clear and much more efficient as you avoid computing all the unused columns.

means = idata.posterior.mean(("chain", "draw"))
adstock_theta = means[f"{channel}_ads_theta"]
...

This page in ArviZ documentation might be useful too for common operations with sampling results: Working with InferenceData — ArviZ dev documentation

1 Like

Fair. This is a classic example of an ecological fallacy.

1 Like

I didn’t know about this was a “named” fallacy, thanks for sharing!

1 Like

Thanks @cluhmann and @OriolAbril!

So, If I understand correctly, @cluhmann, are you saying I shouldn’t use the posterior means as the variables in the regression model? Or just that you wouldn’t expect those variables to perfectly match the posterior outcome mean? If it’s the latter, I assumed as much, but was surprised at how far off the differences were and that they were almost exclusively lower. Then I saw the pattern of high/low and how it accelerated the difference was at the ends and made me thing something else was causing this. Additionally, when resampling with just the adstock function and the coefficents, the posterior and my calculated prediction were close, but off in the ranges I expected.

@OriolAbril Thanks for the note and pointing to the idata.posterior.mean instead of the trace summary. It was 2nd thing I found in debugging and mentioned it above in the post. I started using the more exact idata.posterior.mean(("chain", "draw")) a while ago. However, it didn’t help much and when comparing the new parameters to the trace summary they were all really close. Taking a deep dive through the blog post now. Thanks for putting this together.

Meanwhile I see if I can find a different function that does act the same inside and outside of the model. Still a mystery.

1 Like

What you should do is entirely up to you and obviously depends on what you are trying to accomplish. But if you are trying to sanity-check the posterior predictive results you already generated (and it sounds like you are?), then using the mean parameter values isn’t the right way to to that for the reasons explained above.

But just to be clear, if you were using Pymc for a typical y = mx + b regression, how would you determine the most likely m and b for the data?

1 Like

The idea of “the most likely” parameter values sort of cuts against the grain of a conventional Bayesian analysis. Traditionally, all values are credible to some extent and the posterior tells to to what extent each potential value is credible. Taking a single value and ignoring the other, also credible values, suggests that your results are more certain than they actually are.

So when I run regression-style models I typically avoid discussing “the most likely” value of m or b. However, in such situations, I would be very happy to discuss the probability that m is positive or the probability that b is between 50 and 100. Or if I had 2 predictor variables and thus 2 coefficients, beta1 and beta2, I might provide information about the probability that beta1 was greater than beta2. And so on.

Hope that helps.