A Bayesian Marginal Structural Model (IPW) in a single model

Hi everyone, I’m a bit new to applied Bayesian statistics. I took Richard McElreath’s course, where he discusses “Full Luxury Bayes” in which you code multiple submodels in a single model.
I come from causal inference where a very common model is the Marginal Structural Model (aka IPW). Briefly, it’s a two-step model: where you first estimate the probabilities to get a treatment, take their inverse, and then fit a univariable regression of the outcome on the treatment weighted by the inverse-probability weights.
I thought coding the two regressions in a single pm.Model() would be a nice use-case for a “Full Luxury” model.

Unfortunately, I was not able to recover the true parameter (treatment effect) even in the simplest case.
However, I don’t think it’s a bug, and I think there might be something more fundamental in work here (or maybe just how MCMC works?) - because when I do a simple weighted average (a single treatment model) and when I do a weighted regression with pre-computed fixed weights (i.e. a non-Bayesian logistic regression with a weighted Bayesian outcome regression) I get the correct answer.
Something fails when I try to jointly learn the treatment and outcome models.

I wonder if someone has more insight about Bayesian IPW here.
I tried my luck at Cross Validated (and the Bayesian inference Discord, but Thomas Wiecki suggested I’ll also try here) but to no avail. Hopefully more Bayesian experts are concentrated here :upside_down_face:.
Additional resources:

  1. The original Cross Validated question
  2. A self-contained Notebook with data and PyMC models

Many thanks

I think you’re thinking too hard. I went through econometrics training, which taught infinity frequentist estimators with millions of hoops to jump through to get to an estimate. As a result, I’m often in the same situation, trying to figure out how to code fGLS in pytensor or whatever. I’ve found that it’s always the wrong approach.

In the spirit of EcElreath, let’s get back to the basics.

You have a data generating process:

def generate_data(seed=0, N=500, effect=0, D=1):
    rng = np.random.default_rng(seed)
    X = rng.normal(0, 1, size=(N, D))
    beta_xa = rng.normal(2, 1, size=D)
    a_logit = 1 + X@beta_xa + rng.normal(0, 0, size=N)
    a_propensity = 1 / (1 + np.exp(-a_logit))
    a = rng.binomial(1, a_propensity)
    beta_xy = rng.normal(-2, 1, size=D)
    y = 1 + X@beta_xy + a*effect + rng.normal(0, 1, size=N)
    data = {
        "X": pd.DataFrame(X).rename(columns=lambda s: f"x_{s}"),
        "a": pd.Series(a).rename("a"),
        "y": pd.Series(y).rename("y"),
        "beta_xa": beta_xa,
        "beta_xy": beta_xy,
    return data

So y is a function of X and a, but a is also a function of X. It implies this DAG:

So you need to close the biasing path from X → a → y by conditioning your estimate of y given X on a. The strategy:

  1. Model a given X, call it a_hat
  2. Use this modeled relationship to estimate y given X and a_hat, since a_hat is a conditioned on values of X.

In PyMC code:

with pm.Model() as msm_model:  # model specifications in PyMC are wrapped in a with-statement
    # Treatment model:
    intercept_a = pm.Normal("intercept_a", mu=0, sigma=2)
    beta_Xa = pm.Normal("beta_Xa", mu=0, sigma=5, shape=data["X"].shape[1])

    mu_lin_a = pm.Deterministic("mu_lin_a", intercept_a + pm.math.dot(data["X"].values, beta_Xa))
    a_obs = pm.Bernoulli("a_obs", logit_p=mu_lin_a, observed=data["a"].values)

    # Outcome model (MSM):
    intercept_y = pm.Normal("intercept_y", mu=0, sigma=10)
    beta_ay = pm.Normal("beta_ay", mu=0, sigma=10)
    beta_Xy = pm.Normal('beta_Xy', mu=0, sigma=10)
    sigma_y = pm.HalfNormal("sigma_y", sigma=10)

    # y given X and (a given X)
    mu_lin_y = pm.Deterministic("mu_lin_y", intercept_y + beta_Xy * data["X"].values.ravel() + beta_ay * a_obs)

    y_obs = pm.Normal('y_obs', mu=mu_lin_y, 
    idata = pm.sample(init='jitter+adapt_diag_grad')

Which gives the following estimates:

az.plot_posterior(idata, var_names=['beta_Xa', 'beta_ay', 'beta_Xy'], ref_val=np.r_[data['beta_xa'], [0.0], data['beta_xy']].tolist())

Moral of the story: if you can, go back to first principles first and see if you can directly write down what you want. The fact that you were able to write the data generating process in numpy told me you were working too hard. A PyMC model should look very close to how you would write the DGP in numpy!

Many thanks for taking a look, Jesse. Truly appreciated.
I indeed successfully coded a similar solution here in cells 33-35 (using the observed treatment, rather than the predicted treatment).
However, my point is not to obtain the average treatment effect in any way (I know that’s possible, especially in this toy example), but in a rather common and specific way - MSMs are very common, and there’s potential value in having a Bayesian version of them. But since I failed, I’m left wondering what could be the cause.
Thanks again for your input.

I still think you are wrong to value the the estimator over the estimand. If MSMs are common, it’s because people want to de-confound graphs like the one that I showed, and using do-calculus to identify and close baising paths in the model’s causal structure will always be the most general possible estimator. This is all WLS is doing anyway, so I’d argue that I did implement an MSM.

That said, if your heart is 100% set on this one specific estimator, I think reason you failed is that you have to whiten the inputs. In principle, what you want to have is something like:

import pytensor.tensor as pt

x_data = np.c_[np.ones_like(data['X']), data['X']]
A_data = np.c_[np.ones_like(data['a']), data['a']]
y_data = data['y'].values

with pm.Model() as msm_model:  # model specifications in PyMC are wrapped in a with-statement

    X = pm.MutableData('X', x_data)
    A = pm.MutableData('A', A_data)
    y = pm.MutableData('y', y_data)
    a = pm.MutableData('a', data["a"].values)
    # Treatment model:
    betas_a = pm.Normal("betas_a", mu=0, sigma=[1,2])
    mu_lin_a = pm.Deterministic("mu_lin_a", X @ betas_a)
    a_obs = pm.Bernoulli("a_obs", logit_p=mu_lin_a, observed=a)
    p_a1 = pm.Deterministic("p_a1", pm.math.sigmoid(mu_lin_a))

    p_a0 = pm.Deterministic("p_a0", 1-p_a1)
    p_a = pm.Deterministic("p_a", a * p_a1 + (1 - a) * p_a0)
    ipa = pm.Deterministic("ipa", 1/p_a)

    w_A = pm.Deterministic('w_A', A * pt.sqrt(ipa)[:, None])
    w_y = pm.Deterministic('w_y',  y * pt.sqrt(ipa))

    beta_y = pm.Normal('beta_y', sigma=10, shape=(2,))    
    mu_y = pm.Deterministic('mu_y', w_A @ beta_y)
    sigma_y = pm.Exponential('sigma_y', 1)

    obs = pm.Normal('obs', mu=mu_y, sigma=sigma_y, observed=w_y)

That corresponds to what WLS is actually estimating. But this code will raise an error, because PyMC doesn’t allow “computed” values like w_y as observed quantities. You would have to directly write down the logp of the WLS model and make either a pm.CustomDist or a pm.Potential. Looking at the statsmodel code, it’s something close to:

    # Priors/data as above
    def logp_wls(y, mu, sigma, weights):
        w_y = y * pt.sqrt(weights)
        logp = pm.logp(pm.Normal.dist(mu=mu, sigma=sigma), w_y)
        logp += 0.5 * pt.log(weights)

        return logp

    w_A = pm.Deterministic('w_A', A * pt.sqrt(ipa)[:, None])

    beta_y = pm.Normal('beta_y', sigma=10, shape=(2,))    
    mu_y = pm.Deterministic('mu_y', w_A @ beta_y)
    sigma_y = pm.Exponential('sigma_y', 1)

    obs = pm.CustomDist('obs', mu_y, sigma_y, ipa, logp=logp_wls, observed=y)

I spent some time working on this but it’s still not right (although it does shift the results towards the correct answer). I am thinking that this is because it is missing a term, but I don’t know what that is. Right now it returns p(w_y | ipa, mu, sigma) but it needs to return p(y | ipa, mu, sigma). Maybe that is a problem you can reason through?

Anyway, I end by once again stressing that the estimand, not the estimator, is the goal, so I see it as a lot of work for very little marginal value.

Jesse, first of all, and again, many thanks for digging into that.

Second, I second every word about focusing on estimands rather than estimators.
My stubbornness around the MSM was originally due to research not being in vacuum and wanting to gap familiar notions (IPW) with new ones (Bayesian inference). However, I’ll admit that now it is also fueled by just wanting to figure out why it does not work.

Which leads me to the third point, I followed your implementation. It seems the first main difference is the square-rooting of the weights (which I thought was an OLS thing, and not necessary for MCMC), and the second is using pm.CustomDist instead of pm.Potential (I followed Junpeng’s answer here, which does work when using fixed pre-comupted weights).
However, your code does not improve on the estimation I previously got (I’m not seeing a shift towards the correct answer - 0).

A follow-up thought, since using pre-computed fixed weights does seem to work, I wonder if the computation could be split into two steps. First, fitting a Bayesian logistic regression, and calculating N_chains-times-N_iterations inverse probability weights. Second, fit a Bayesian linear regression, with each MCMC iteration weighted by the corresponding chain+iteration.
Does that make since in PyMC (or at all)? Is there a way to access every iteration? I tried to naively use the fixed non-bayesian weights with your CustomDist logp_wls function (passing fixed weights instead of ipa) and it works. So I wonder if there’s a way maybe to tell what iteration the sampler is in or alternatively, build a generator yielding the next weights and plugging it into logp_wls.
[and yes, I see what you mean by “a lot of work for very little marginal values” :innocent:]

While not in pymc, I think you may be interested in this paper and its corresponding implementation in STAN: GitHub - ajnafa/Latent-Bayesian-MSM: Taking Uncertainty Seriously: Bayesian Marginal Structural Models for Causal Inference in Political Science

1 Like

Ok I see what you mean by not getting any improvement; just weighting the logp terms by ipa give the same -1 estimate.

The square roots come from this derivation, my idea was to try to implement equation (6) as directly as possible.

Since pre-computed weights work, the problem is definitely a missing term in the logp associated with the probably of the weights. This is the kind of thing PyMC’s logp inference is supposed to help with, but we don’t get access to that when we’re working with Potential or CustomDist. @junpenglao @aloctavodia @RavinKumar are real statisticians, they might have insight into the missing terms.

One other interesting data point: If you drop model 2 and just directly compute the WLS betas in closed from, you will get the right answer:

    V_inv = pt.diag(ipa)
    true_beta = pm.Deterministic('true_beta', pt.linalg.inv(A.T @ V_inv @ A) @ A.T @ V_inv @ y)

From this, my thinking was that if V^{-1} = \text{Diag}(\frac{1}{p(a)}), then V = \text{Diag}(p(a)). From there it should be possible to just directly use equation (1) from that proof I linked. That would imply a model like this:

    beta_y = pm.Normal('beta_y', sigma=10, shape=(2,))    
    mu_y = pm.Deterministic('mu_y', A @ beta_y)
    sigma_y = pm.Exponential('sigma_y', 1)
    obs = pm.Normal('obs', mu=mu_y, sigma=p_a * sigma_y, observed=y)

But that gives the same answer as the CustomDist and the weighted logp Potential, -1, so I’m really stumped. This last answer uses all the logp inference, so nothing should be missing.

For hacking MCMC, it might be possible to write a custom step function that does something like what you’re thinking? You would have to do some research into how that API works, I’m not super familiar with it.

Thanks Jesse, deriving the weighted regression estimator “manually” successfully matches my experience that using the Horvitz-Thompson estimator (i.e., simple weighted averages) also works (cells 22-24) - so weight-based deterministic calculations works - but things start to fail only when explicitly estimating the model jointly.

So stumped, indeed :face_with_diagonal_mouth:

Thanks for the pointer, Aurimas!
I don’t read Stan that well (I barely read PyMC tbh :upside_down_face:), but together with their manuscript, I think I can get the gist of it:

First, it seems their implementation follows a complete separation of the models. They first fit a treatment model and generate IP-weights, and then fit an outcome models where the weights are “fixed”.
Their Stan model treats IPW weights as data, meaning these weights are precomputed in a separate Bayesian model, and they describe it in their manuscript (page 14).

Second, they also seem to acknowledge the fact one could pass the precalculated posterior weights to every iteration of the MCMC (page 14).
Which is what I suggested in the last reply, so it’s comforting to know I’m not completely crazy.

Third, they go further and suggest (page 15) taking the mean/median and standard deviation of the posterior weights, and then scaling them by a Beta distribution to get the variability back, which I don’t fully understand from my quick skimming.

Following the first and second points, I have two questions I wonder if anyone can chime in:

  1. More theoretically: I wonder why the separation into two separate models is required. It seems to align with my current experience that they needn’t be fitted jointly and fixed precalculated weights should be passed (though I was quite confident MSMs will be a good use-case for this multiple sub-models thing).
  2. More practically: I’ll be happy to try separate two-step approach. If I could figure out how to sneak in posterior IP-weights into each MCMC iteration. If anyone with in-depth knowledge of PyMC has any idea, I’ll be happy to hear.

Many thanks

Ok, I gave a shot at answering (2) and hack a way to “sneak in” posterior IP-weights into each MCMC iteration through the logp calculation. It only partially works. I will lay out how here, for the sake of completeness.

  1. Fit a treatment-only model (no outcome model) and compute IP-weights.
    # Given a treatment-only pm.Model with sampled inference data
    # tx_model_idata
  2. Create a Python generator yielding a vector of weights (one for each observation) each time.
    def gen_ipweights():
    ipw = tx_model_idata.posterior.ipa.stack(it=("chain", "draw"))
    j = 0
    while j < ipw.shape[1]:
        yield ipw[:, j].values
        j += 1
  3. Define an outcome model with a CustomDist and a weighted log probability function, yielding a different set of weight each time.
    gen = CountingIterator(gen_ipweights())
    with pm.Model() as outcome_model:
        def logp_wls(y, mu, sigma):
            w = next(gen)
            logp = w * pm.logp(pm.Normal.dist(mu=mu, sigma=sigma), y)
            return logp
    intercept_y = pm.Normal('intercept_y', sigma=10)    
    beta_y = pm.Normal('beta_y', sigma=10)    
    mu_y = pm.Deterministic('mu_y',  intercept_y + data["a"].values*beta_y)
    sigma_y = pm.Exponential('sigma_y', 1)
    obs = pm.CustomDist(
        'obs', mu_y, sigma_y, 
    o_model_idata = pm.sample()

This somewhat works. I get only a slightly biased estimation. But unfortunately it does not fully work as intended: I thought the logp function is called every MCMC iteration, but for some reason (after testing) it is only called 10 times. Namely, only 10 posterior IP-weights are drawn from the generator, not ~8000 (2000 x 4) as I expected.

Therefore, in the meantime, my two questions remain:

  1. Why the joint fitting of the models does not work
  2. How can I “inject” variables (a generator) into every MCMC iteration.