Under-prediction and large sigma in Synthetic Control Model

Hello all!

I’m new to Bayesian modeling and have just completed the Intuitive Bayes course. I have started integrating the A/B testing methods into my current work, and it is going very well. However, I am running into an issue when trying to use the synthetic control method on a geo-lift test that was run earlier in the year.

I have weekly sales and ad spend (for a single channel) for each geo. Some geos were exposed to ad spend, some went dark with $0 spend, and some were excluded (I did not design the experiment; I’m just trying to get some value out of it). In my initial model I summed all the control geos into a single “control_response” series and did the same with the treatment geos. I then fit a Synthetic Control model with a reasonably high R², but it is clearly under-predicting and overstating the causal impact. Below are some charts to describe this.
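Roughly, the pooling step looked like this (a simplified sketch; df_weekly and the column/group names here are stand-ins, not my actual code):

import pandas as pd

# Assumed layout: one row per (week, geo) with a 'response' value and a
# 'group' label of 'treatment', 'control', or 'excluded'.
df_response_pooled = (
    df_weekly.pivot_table(
        index="week", columns="group", values="response", aggfunc="sum"
    )
    .rename(columns={
        "treatment": "response_treatment",
        "control": "response_control",
        "excluded": "response_excluded",
    })
)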

Here is my code to create the model. It’s simple.

import causalpy as cp

# filter data to June 2023 and after
df_response_filtered = df_response_pooled[df_response_pooled.index >= '2023-06-01']

# Model the aggregate geo as a linear combination of the untreated units,
# with no intercept parameter.
formula = "response_treatment ~ 0 + response_control"

# Run the analysis (treatment_time is the date the campaign started)
result = cp.SyntheticControl(
    df_response_filtered,
    treatment_time,
    formula=formula,
    model=cp.pymc_models.WeightedSumFitter(
        sample_kwargs={"target_accept": 0.95, "random_seed": 42}
    ),
)


Can you please provide any guidance on getting the model to fit better? I’m new at this and trying to produce a reasonable estimate of ROAS for this media channel.

Here is another model I created, this time using the excluded markets as well, since they seem to correlate pretty highly with the treatment group and they had virtually $0 in spend during the test period.

# Model the aggregate geo as a linear combination of the untreated units
# with no intercept parameter.
formula = """response_treatment ~ 0 + response_control + response_excluded"""

# Run the analysis
result = cp.SyntheticControl(
    df_response_filtered,
    treatment_time,
    formula=formula,
    model=cp.pymc_models.WeightedSumFitter(
        sample_kwargs={"target_accept": 0.95, "random_seed": 42}
    ),
)

As you can see, this one has more normal-looking trace plots, but sigma still dominates the model. This model doesn’t suffer from the same under-prediction as the initial one, so I tend to trust its causal estimates more.

CC @drbenvincent

Hi @Trevor_Smith. I think I know what’s going on here. The synthetic control model is basically a weighted sum of the predictors, but where the weightings sum to 1. Why that is, is a story for another time.

But from what I can see you only have one predictor (response_control) in your formula response_treatment ~ 0 + response_control. Because the weights are constrained to sum to 1, a single predictor means its weight is forced to be exactly 1, so the model is essentially response_treatment = response_control plus noise, which isn’t really a great thing.
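To make that concrete, here is a minimal PyMC sketch of the weighted-sum idea (hedged: the actual WeightedSumFitter has more to it, but something like a Dirichlet prior is what enforces the sum-to-1 constraint):

import numpy as np
import pymc as pm

# Sketch only: a Dirichlet prior keeps the weights non-negative and
# summing to 1.
n_predictors = 1  # only response_control in the first formula
with pm.Model():
    beta = pm.Dirichlet("beta", a=np.ones(n_predictors))
    # With n_predictors == 1, beta is deterministically [1.0], so the
    # mean prediction is response_control itself and sigma has to absorb
    # the entire gap between the treatment and control series.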

You might also be interested in a (relatively) recent docs page I put together: Multi-cell geolift analysis — CausalPy 0.4.0 documentation

So you could try both of the approaches outlined in that docs page.

Feel very free to tag me in a reply - let me know how it goes.

@drbenvincent

Thank you so much for looking at this. Your point about the predictor weightings summing to 1 helped me resolve it.

I used the notebook you linked to get a feel for the solution. My main issue was scaling. My pooled treatment response was in the millions each week, but my individual control geos were in the thousands or tens of thousands. Weekly response varied widely by geo, as you can imagine, because some geos are like NYC and others are very small. This scale mismatch caused problems during modeling and led to very small betas and really large sigma values.

To resolve this I still pooled my treatment variable, but then MinMax scaled everything between 0 and 1, including the pooled treatment and each control geo. Once everything was on a similar scale the model fit was decent. From there I noticed most of my geos (there are over 100 of them) still had very small betas, so I chose my top 5 geos and used them as predictors. This produced a great model. I now have a realistic estimate of the marketing campaign’s cumulative causal impact, thank you so much.
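In case it helps anyone else, the rescaling step was roughly this (simplified; df_response_filtered is the pooled-and-filtered frame from above):

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Rescale the pooled treatment and every control geo to [0, 1] so all
# series are on a comparable scale before fitting.
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_response_filtered),
    columns=df_response_filtered.columns,
    index=df_response_filtered.index,
)

After refitting on df_scaled, I ranked the geos by their posterior mean beta (e.g. with az.summary on the fitted model’s InferenceData) and kept the top 5 as predictors.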


Great stuff!

The only thing I’d potentially look at further would be the scaling. Depending on the exact situation and what the predictors are, if a zero value has real meaning then you might want to do MaxAbs scaling (e.g. MaxAbsScaler — scikit-learn 1.5.2 documentation), where you simply divide by the maximum absolute value. But like I say, it may not be better in your exact situation.
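Something like this sketch (same stand-in frame as above):

import pandas as pd
from sklearn.preprocessing import MaxAbsScaler

# MaxAbs scaling divides each column by its maximum absolute value, so a
# raw value of 0 stays 0 after scaling (MinMax instead maps the column
# minimum to 0, whatever its raw value is).
scaler = MaxAbsScaler()
df_scaled = pd.DataFrame(
    scaler.fit_transform(df_response_filtered),
    columns=df_response_filtered.columns,
    index=df_response_filtered.index,
)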

Oh good catch on the scaler! I’ll try the other scaler you recommended. Thanks!
