Hi, I have been using the new do operator as described in the PyMC Labs blog post Causal analysis with PyMC: Answering "What If?" with the new do operator, and have adapted it to my needs. However, sample_posterior_predictive does not seem to use the data inserted via the do operator correctly when the model contains a pm.Deterministic.
The following recreates the workflow and the resulting problem, with as few variables as possible to avoid unnecessary complexity.
import arviz as az
import numpy as np
import pandas as pd
import pymc as pm

# Generative model: campaign selects which intercept drives y.
with pm.Model(coords_mutable={"i": [0], "campaign_dim": np.arange(2)}) as model_generative:
    campaign = pm.Categorical("campaign", p=[0.5, 0.5], dims="i")
    alpha_y_campaign = pm.Normal("alpha_y_campaign", mu=[0, 10], sigma=1, dims="campaign_dim")
    alpha_campaign = pm.Deterministic("alpha_campaign", alpha_y_campaign[campaign])
    y = pm.Normal("y", mu=alpha_campaign, dims="i")

N = 100
with model_generative:
    simulate = pm.sample_prior_predictive(samples=N)

# Each prior draw contributes one simulated observation.
observed = {
    "campaign": simulate.prior.campaign.values.flatten(),
    "y": simulate.prior.y.values.flatten(),
}
df = pd.DataFrame(observed)
This creates a dataframe in which y is centered around 10 when campaign = 1 and around 0 when campaign = 0.
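As a sanity check on that claim, here is a minimal plain-Python sketch of the same generative structure, with no PyMC involved. The function name `simulate_generative` and the number of draws are illustrative choices, not part of the original workflow; it assumes the default sigma of 1 for y and draws fresh intercepts for every sample, as a prior-predictive draw would.

```python
import random
import statistics


def simulate_generative(n_draws, seed=0):
    """Draw (campaign, y) pairs with the same structure as the generative model.

    Each draw gets its own pair of intercepts, mirroring one prior-predictive sample.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_draws):
        campaign = rng.randrange(2)                  # Categorical(p=[0.5, 0.5])
        alpha = [rng.gauss(0, 1), rng.gauss(10, 1)]  # Normal(mu=[0, 10], sigma=1)
        y = rng.gauss(alpha[campaign], 1)            # Normal(mu=alpha[campaign], sigma=1)
        rows.append((campaign, y))
    return rows


rows = simulate_generative(n_draws=10_000)

# Average y within each campaign group.
group_means = {
    c: statistics.fmean(y for camp, y in rows if camp == c) for c in (0, 1)
}
print(group_means)
```

The group means should land close to 0 and 10, while individual y values scatter around them because both the intercepts and y itself are noisy.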
Next, I reset the mu parameter of alpha_y_campaign to 0 for both campaign indices and let the model try to recover the parameters that were used to generate the df above, so that I can estimate the causal effect.
with pm.Model(coords_mutable={"i": [0], "campaign_dim": np.arange(2)}) as model_generative:
    campaign = pm.Categorical("campaign", p=[0.5, 0.5], dims="i")
    alpha_y_campaign = pm.Normal("alpha_y_campaign", mu=0, sigma=1, dims="campaign_dim")
    alpha_campaign = pm.Deterministic("alpha_campaign", alpha_y_campaign[campaign])
    y = pm.Normal("y", mu=alpha_campaign, dims="i")

# Condition the model on the simulated data.
model_inference = pm.observe(
    model_generative,
    {"campaign": df["campaign"].values, "y": df["y"].values},
)
model_inference.set_dim("i", N, coord_values=np.arange(N))

with model_inference:
    idata = pm.sample(random_seed=1)
So far so good. Now I create two intervened models: one with all campaign values set to 0 and one with all campaign values set to 1.
model_z0 = pm.do(model_inference, {"campaign": np.zeros(N, dtype="int32")}, prune_vars=True)
model_z1 = pm.do(model_inference, {"campaign": np.ones(N, dtype="int32")}, prune_vars=True)
Now comes the interesting part. When I sample from the posterior predictive and pass only var_names=["y"], I don't get a causal effect:
idata_z0 = pm.sample_posterior_predictive(
    idata,
    model=model_z0,
    predictions=True,
    var_names=["y"],
)
idata_z1 = pm.sample_posterior_predictive(
    idata,
    model=model_z1,
    predictions=True,
    var_names=["y"],
)
# Per-draw difference of the mean of y over i between the two interventions.
ate = idata_z1.predictions.y.mean(dim="i") - idata_z0.predictions.y.mean(dim="i")
az.plot_posterior(ate)
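To make explicit what is being plotted: for each posterior draw, y is averaged over the "i" dimension under each intervention and the two averages are subtracted. The sketch below does the same with nested lists in plain Python. The helper `ate_per_draw` and the toy numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics


def ate_per_draw(pred_z1, pred_z0):
    """Per-draw effect: mean over i of y under do(campaign=1) minus do(campaign=0).

    Both arguments are lists of draws, each draw being a list of y values over i.
    """
    return [
        statistics.fmean(y1) - statistics.fmean(y0)
        for y1, y0 in zip(pred_z1, pred_z0)
    ]


# Two toy draws with two observations each (made-up values).
pred_z1 = [[10.0, 12.0], [9.0, 11.0]]
pred_z0 = [[0.0, 2.0], [1.0, -1.0]]

effects = ate_per_draw(pred_z1, pred_z0)
print(effects)  # [10.0, 10.0]
```

With the data simulated earlier, a posterior for this quantity centered near 10 would indicate that the intervention was picked up, and one centered near 0 would indicate that it was not.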
However, when I also include "campaign" in var_names, I get the correct causal effect:
idata_z0 = pm.sample_posterior_predictive(
    idata,
    model=model_z0,
    predictions=True,
    var_names=["campaign", "y"],
)
idata_z1 = pm.sample_posterior_predictive(
    idata,
    model=model_z1,
    predictions=True,
    var_names=["campaign", "y"],
)
# Per-draw difference of the mean of y over i between the two interventions.
ate = idata_z1.predictions.y.mean(dim="i") - idata_z0.predictions.y.mean(dim="i")
az.plot_posterior(ate)
The problem seems to arise only when I use pm.Deterministic inside the model. Can somebody explain why that makes a difference? I don't have the issue when I leave out pm.Deterministic, i.e. when I define the model like this:
with pm.Model(coords_mutable={"i": [0], "campaign_dim": np.arange(2)}) as model_generative:
    campaign = pm.Categorical("campaign", p=[0.5, 0.5], dims="i")
    alpha_y_campaign = pm.Normal("alpha_y_campaign", mu=[0, 10], sigma=1, dims="campaign_dim")
    y = pm.Normal("y", mu=alpha_y_campaign[campaign], dims="i")
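One hypothesis that would fit both observations, stated here as an assumption and not as confirmed PyMC behaviour: the alpha_campaign values saved in idata.posterior were computed with the observed campaign, and if they are reused as-is when only "y" is requested, the intervention never reaches y. Without the Deterministic, mu is rebuilt from alpha_y_campaign and the intervened campaign, so the effect shows up. The plain-Python sketch below mimics the two situations with made-up posterior draws. All names and numbers are hypothetical, and it compares the mean of mu only, ignoring the observation noise in y.

```python
import random
import statistics

rng = random.Random(1)
n_obs, n_draws = 100, 500

# Hypothetical stand-ins: observed campaign labels and posterior draws of the intercepts.
campaign_obs = [rng.randrange(2) for _ in range(n_obs)]
alpha_draws = [(rng.gauss(0, 0.2), rng.gauss(10, 0.2)) for _ in range(n_draws)]

do_0 = [0] * n_obs  # intervention campaign = 0 for every observation
do_1 = [1] * n_obs  # intervention campaign = 1 for every observation


def mean_mu(alpha, campaign, reuse_stored):
    """Mean of mu over observations for one posterior draw.

    reuse_stored=True mimics reusing alpha_campaign as saved during inference
    (indexed by the observed campaign); False recomputes it from the intervention.
    """
    index = campaign_obs if reuse_stored else campaign
    return statistics.fmean(alpha[c] for c in index)


def ate(reuse_stored):
    """Average over draws of mean mu under do_1 minus mean mu under do_0."""
    diffs = [
        mean_mu(a, do_1, reuse_stored) - mean_mu(a, do_0, reuse_stored)
        for a in alpha_draws
    ]
    return statistics.fmean(diffs)


ate_stored = ate(reuse_stored=True)       # intervention is ignored
ate_recomputed = ate(reuse_stored=False)  # intervention is applied
print(ate_stored, ate_recomputed)
```

If this is what happens, adding "alpha_campaign" to var_names might be another way to force it to be recomputed under the intervention. That is a guess worth testing, not a verified fix.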