Constraint or prior on product of pm.MutableData and Random Variable

I am using PyMC for an MMM (marketing mix modeling) problem and have recently run some uplift tests, which provide true performance data for one channel. I want to incorporate this information into my model by setting a constraint such that the product of the channel spend and the channel’s random variable equals the uplift test results.
My model:

# input data
channel_A_spend = [1, 2, 3, 4, 5, 6, 7, 8, 9]
channel_B_spend = [2, 4, 56, 78, 9, 765, 4, 3, 0]
channel_C_spend = [2, 3, 30, 78, 9, 10, 5, 8, 0]
conversions = [1, 20, 3, 10, 5, 15, 7, 18, 9]

# priors data
channel_C_min_conversions = 3
channel_C_max_conversions = 10

# uplift results
channel_A_uplift = [0, 5, 2, 4, 1, 5, 0, 0, 2]

channel_C_cp = pm.find_constrained_prior(
    pm.HalfNormal,
    lower=channel_C_min_conversions,
    upper=channel_C_max_conversions,
    mass=0.95,
    init_guess=dict(sigma=1),
)

with pm.Model() as base_model:
    channel_spend_A_data = pm.MutableData(
        name="channel_spend_A_data",
        value=channel_A_spend,
    )
    channel_spend_B_data = pm.MutableData(
        name="channel_spend_B_data",
        value=channel_B_spend,
    )
    channel_spend_C_data = pm.MutableData(
        name="channel_spend_C_data",
        value=channel_C_spend,
    )

    intercept = pm.Normal("intercept", mu=0, sigma=1)
    channel_A_coef = pm.HalfNormal("channel_A_coef", sigma=1)
    channel_B_coef = pm.HalfNormal("channel_B_coef", sigma=1)
    channel_C_coef = pm.HalfNormal("channel_C_coef", **channel_C_cp)
    # constraint: pm.math.dot(channel_A_coef, channel_spend_A_data) should equal channel_A_uplift
    constraint = pm.Potential(
        "channel_A_constraint",
        -100
        * pm.math.abs(
            pm.math.dot(
                channel_spend_A_data,
                channel_A_coef,
            )
            - (channel_A_uplift)
        ),
    )

    mu = (
        intercept
        + pm.math.dot(channel_A_coef, channel_spend_A_data)
        + pm.math.dot(channel_B_coef, channel_spend_B_data)
        + pm.math.dot(channel_C_coef, channel_spend_C_data)
    )
    sigma = pm.HalfNormal("sigma", sigma=1)

    pm.Normal(name="response", sigma=sigma, mu=mu, observed=conversions)

    base_model_prior_predictive = pm.sample_prior_predictive()

Questions:

  1. I have implemented the constraint as shown above, but I am not sure this is an appropriate way to do it, especially because model performance degrades, and in the examples I have seen, pm.Potential is used only with random variables, not with the product of data and a random variable.
  2. Also, as mentioned, if I remove the constrained prior from channel_C, model performance gets a bit better. How can that be related?

Thanks in advance for any help.

Perhaps I’m misunderstanding, but it seems there isn’t any randomness in this setup. That is, given that you know the equation y = X\beta, where y is channel_A_uplift, X is channel_spend_A_data, and \beta is channel_A_coef, you can deterministically solve the linear system for \beta = X^{-1}y. In code:

import pytensor.tensor as pt
channel_A_coef = pt.linalg.inv(channel_A_spend) @ channel_A_uplift

That would remove the constraint entirely. You could actually just pre-compute this number using numpy outside of your model to save your CPU from doing the matrix inversion on each logp call, since it’s totally deterministic.
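
For a genuinely square, invertible system, that pre-computation is just a couple of lines of numpy. A minimal sketch, with X and y made up purely for illustration (your actual spend data is not square, as discussed next):

```python
import numpy as np

# Hypothetical square system, purely for illustration
X = np.array([[1.0, 2.0],
              [3.0, 5.0]])    # spend "design matrix", assumed invertible
y = np.array([0.5, 1.25])     # corresponding observed uplift

beta = np.linalg.solve(X, y)  # same result as inv(X) @ y, but more stable
print(beta)
```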

Given the shapes of your dummy data though, I guess you can’t actually invert channel_A_spend (it’s just a column vector?). So then you could obtain channel_A_coef via minimization and use that:

import numpy as np
from scipy.optimize import minimize_scalar

def objective(beta):
    # squared sum of the residuals between implied and observed uplift
    return (beta * np.array(channel_A_spend) - np.array(channel_A_uplift)).sum() ** 2

res = minimize_scalar(objective, bounds=[1e-4, 100], method="bounded")

You could then plug res.x into the model as a fixed value and estimate everything else.
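
A minimal sketch of what that could look like, assuming res from the optimizer above and omitting channels B and C for brevity (fixed_coef_model is just a placeholder name):

```python
import pymc as pm

channel_A_coef_value = float(res.x)  # pre-computed outside the model

with pm.Model() as fixed_coef_model:
    channel_spend_A_data = pm.MutableData("channel_spend_A_data", channel_A_spend)

    intercept = pm.Normal("intercept", mu=0, sigma=1)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # channel_A_coef is now a plain constant, not a random variable
    mu = intercept + channel_A_coef_value * channel_spend_A_data
    pm.Normal("response", mu=mu, sigma=sigma, observed=conversions)
```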

Thinking more about this, you might still want some uncertainty in the estimate of \beta, in which case you could embed a second “mini-regression” into your model:

# ...priors, data, etc.
    # here's the optimization problem -- could add a separate intercept here too
    uplift_mu = channel_A_coef * channel_spend_A_data
    channel_A_uplift_sigma = pm.HalfNormal("channel_A_uplift_sigma", sigma=1)
    obs_uplift = pm.Normal(
        "channel_A_uplift",
        mu=uplift_mu,
        sigma=channel_A_uplift_sigma,
        observed=channel_A_uplift,
    )
    # rest of the model exactly as you had it before...

The key here is that channel_A_coef will appear in two likelihood functions: that of the observed uplift, and the total MMM. It will thus be adapted to fit in both, incorporating the observed uplift.

I like this method much more than what I suggested above, because it is a more “principled” approach from the Bayesian perspective. The optimization approach I suggested before discards uncertainty about how the observed uplift data came to be.
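
To make the “two likelihoods” point concrete, here is a minimal sketch under those assumptions (channel A only, with placeholder names and HalfNormal priors; channels B and C would be added back exactly as in the original model):

```python
import pymc as pm

with pm.Model() as uplift_informed_model:
    channel_spend_A_data = pm.MutableData("channel_spend_A_data", channel_A_spend)

    intercept = pm.Normal("intercept", mu=0, sigma=1)
    channel_A_coef = pm.HalfNormal("channel_A_coef", sigma=1)

    # Likelihood 1: the uplift "mini-regression"
    uplift_sigma = pm.HalfNormal("uplift_sigma", sigma=1)
    pm.Normal(
        "channel_A_uplift_obs",
        mu=channel_A_coef * channel_spend_A_data,
        sigma=uplift_sigma,
        observed=channel_A_uplift,
    )

    # Likelihood 2: the overall MMM response (channels B and C omitted here)
    response_sigma = pm.HalfNormal("response_sigma", sigma=1)
    mu = intercept + channel_A_coef * channel_spend_A_data
    pm.Normal("response", mu=mu, sigma=response_sigma, observed=conversions)
```

channel_A_coef appears in both observed Normals, so the posterior has to reconcile the uplift data with the overall conversions.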

Hi @jessegrabowski,
First of all, thanks for your replies. Let’s extend this conversation for a while.
So you have proposed two options, and I would like to add a third:

  1. Your initial version: pre-compute the coefficient with an optimizer and plug it in.
  2. The approach from your second message, the embedded “mini-regression”, which you also explain is better than 1).
  3. My own alternative, using a Potential:
conversions = pm.MutableData(f"{regressor}_conv_data", pivot_table[regressor].values)

eq_determ = pm.Deterministic(
    f"{regressor}_eq",
    pm.math.sqr(
        pm.math.sum(
            pm.math.dot(channel_A_coef, channel_A_spend) - conversions
        )
    ),
)

eq_constraint = pm.Potential(
    "channel_A_constraint",
    pt.switch(
        pm.logp(pm.Normal.dist(mu=0, sigma=0.0001), eq_determ), 0, -np.inf
    ),
)

Questions:

  1. Can you please share your opinion on which of options 2 and 3 you see as best? I would also be glad to know why 3 is still worse than 2 (if it is).
  2. I am planning to add pm.GaussianRandomWalk to get time-varying coefficients. Does option 2 remain the best choice in that scenario?
  3. Is there any way to make the potential in option 3 narrower? I also tried pm.Uniform instead of pm.Normal, and model performance dropped substantially.

Also, a somewhat separate question: in my initial model I use pm.find_constrained_prior for a channel. If I add a GaussianRandomWalk to the model, should I switch to pm.Bound instead, or should pm.find_constrained_prior somehow be optimized per GaussianRandomWalk step? The overall idea is this: while the model’s performance metrics are good (MAPE/SMAPE of 0.15/0.15), the coefficients are very unrealistic. Adding find_constrained_prior made them more realistic, which is something I would really like to keep in the model.

Thanks in advance

My question for this approach is why? You are setting the sigma of a normal to be extremely low; in this case just deterministically plug in the result of the optimizer. Actually the value of sigma doesn’t matter because the return value of the True branch of the switch is zero (the False branch will never occur, because the normal is defined everywhere on R). It has zero gradient everywhere, so I’m not surprised it makes your model complain.

Algebraically though, the whole thing is a round-about implementation of a normal logp. Look again at the logp formula for a normal:

\log f(\mu, \sigma |y) \propto -\frac{1}{\sigma^2}\left ( \sum_{i=0}^N {y_i - \mu} \right)^2

Define \mu_i = \beta \cdot x_i:

\log f(\sigma, \beta |y, X) \propto -\frac{1}{\sigma^2}\left ( \sum_{i=0}^N {y_i - \beta \cdot x_i} \right)^2

As you can see, your deterministic eq_determ term appears directly in the formula. Using pm.Normal(mu=channel_A_coef * channel_A_spend, sigma=0.0001, observed=channel_A_uplift) is equivalent to what you’ve written (except that your switch then zeroes the whole term out).

Basically, you need to decide if you want uncertainty in the value of channel_A_coef or not. If you do (and you should), use a second likelihood function to add information about the observed uplift process to the model. If you don’t, run the optimizer and plug in the number deterministically; no need to get fancy with potentials.

I’ve never used find_constrained_prior, so I can’t comment on it. For keeping parameters “realistic”, I’d tinker with priors before I reached for hard boundaries.
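
As a rough illustration of what “tinkering with priors” can mean in practice, and assuming (purely as a placeholder) that coefficients between 0.1 and 2.0 are the “realistic” range, you can check how much prior mass different HalfNormal scales put there and pick one that matches your domain knowledge:

```python
import numpy as np
import pymc as pm

# How much prior mass does each candidate scale put in the assumed realistic range?
for s in (0.5, 1.0, 5.0):
    draws = pm.draw(pm.HalfNormal.dist(sigma=s), draws=10_000, random_seed=1)
    in_range = np.mean((draws > 0.1) & (draws < 2.0))
    print(f"sigma={s}: prior mass in (0.1, 2.0) ≈ {in_range:.2f}")
```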

Adding GRW priors doesn’t change anything in this analysis.
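
For example, a sketch only (the GRW sigma, init_dist, and names below are placeholder choices): the coefficient becomes a vector with one value per period, and that same vector still enters both the uplift likelihood and the MMM likelihood.

```python
import pymc as pm

n_obs = len(channel_A_spend)

with pm.Model() as tv_model:
    channel_spend_A_data = pm.MutableData("channel_spend_A_data", channel_A_spend)

    # One coefficient per period, evolving as a random walk (steps + 1 = n_obs values)
    channel_A_coef = pm.GaussianRandomWalk(
        "channel_A_coef",
        sigma=0.1,
        init_dist=pm.Normal.dist(0, 1),
        steps=n_obs - 1,
    )

    uplift_sigma = pm.HalfNormal("uplift_sigma", sigma=1)
    pm.Normal(
        "channel_A_uplift_obs",
        mu=channel_A_coef * channel_spend_A_data,  # elementwise product per period
        sigma=uplift_sigma,
        observed=channel_A_uplift,
    )
    # ...the MMM response likelihood is unchanged, using the same
    # channel_A_coef * channel_spend_A_data term in mu.
```

If the coefficient needs to stay positive, one option is to put the random walk on the log scale and exponentiate it before multiplying by spend.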