Linear regression with measurement errors, am I doing it right?

Hi Everyone,

I’m working on a difference-in-differences analysis where each observation comes with a reported standard error, and I want to use this uncertainty in my model. The diff-in-diff model itself is pretty standard (although I have multiple treatments), except for the error part.

I tried to follow McElreath’s book example here: statistical-rethinking-2023/Lecture 17 - Measurement and Misclassification.ipynb at main · dustinstansbury/statistical-rethinking-2023 · GitHub, with the errors formulated as an additional Normal. In my model that looks like this:

    mu = pm.Deterministic(
        "mu",
        control_intercept + <diff_in_diff_stuff>,
        dims=['obs__']
    )

    scale = pm.Exponential('scale', lam=10)

    # latent "true" values, one per observation
    obs = pm.Normal("obs", mu=mu, sigma=scale, dims=['obs__'])

    # measurement model: observed values with their reported standard errors
    obs_star = pm.Normal('obs*', mu=obs, sigma=values_se, observed=values, dims=['obs__'])

And this does not converge, even with target_accept=0.999, throwing divergences near scale = 0.

To improve convergence, I reformulated the problem in a non-centered way:

    
    mu = pm.Deterministic(
        "mu",
        control_intercept + <diff_in_diff_stuff>,
        dims=['obs__']
    )

    # one standard-normal offset per observation for the non-centered parameterization
    error = pm.Normal('error', 0, 1, dims=['obs__'])
    scale = pm.Exponential('scale', lam=10)

    obs = pm.Deterministic("obs", mu + scale*error, dims=['obs__'])
    # obs = pm.Normal("obs", mu=mu, sigma=scale, dims=['obs__'])

    obs_star = pm.Normal('obs*', mu=obs, sigma=values_se, observed=values, dims=['obs__'])

The sampling improved drastically, although I still need pretty high values of target_accept to get rid of occasional divergences.

However, it seems that the scale is not learned from my dataset at all.

Am I doing everything correctly here?

I don’t know about the diff-in-diff, but the immediate problem is that your model is compatible with a scale of 0.

It does, yes. I think the idea is that the model can differentiate between the regression errors and the measurement errors. The hope is that data points which are off the trend but have large measurement errors would not be influential.

The diff-in-diff example is very similar to this one: Difference in differences — PyMC example gallery

I don’t see where the “reported standard error” comes in. Am I missing something?

Thanks, @ricardoV94. After thinking about it, indeed there are enough parameters in my model to fit all data points exactly, hence the issue.

@cluhmann the standard error is reported with the data; think of it as a measurement error for the reported values. They are just given.

Here is a self-contained example:

import numpy as np
import pandas as pd

df_data = (
    pd.DataFrame(
        {
            "t": [0] * 4 + [1] * 4,
            "treatment_1": [1, 0, 0, 0] * 2,
            "treatment_2": [0, 1, 0, 0] * 2,
            "treatment_3": [0, 0, 1, 0] * 2,
            "control": [0, 0, 0, 1] * 2,
        }
    )
    .assign(
        value=lambda df: np.asarray([-0.05, 0.07, 0.21, 0.03] * 2)
        + 1.2 * df.t
        + 0.5 * df.treatment_1 * df.t
        - 0.5 * df.treatment_2 * df.t
        + 0.0 * df.treatment_3 * df.t
    )
    .assign(value_se=0.25)
)

Here is my revised model:

import pymc as pm

coords = {
    "treatments": ["treatment_1", "treatment_2", "treatment_3"],
    "obs__": df_data.index,
}

with pm.Model(coords=coords) as m1:

    buckets_inds_data = pm.Data(
        "treatments_inds",
        df_data[["treatment_1", "treatment_2", "treatment_3"]],
        dims=["obs__", "treatments"],
    )

    t = pm.Data("t", df_data.t, dims=["obs__"])
    value = pm.Data("value", df_data.value, dims=["obs__"])
    value_se = pm.Data("value_se", df_data.value_se, dims=["obs__"])

    control_intercept = pm.Normal("control_intercept", 0, 1)
    trend = pm.Normal("trend", 0, 2.5)

    pre_bias = pm.Normal("pre_bias", 0, 0.25, dims="treatments")
    effect = pm.Normal("effect", 0, 1.5, dims="treatments")

    mu = pm.Deterministic(
        "mu",
        control_intercept
        + buckets_inds_data @ pre_bias
        + trend * t
        + (buckets_inds_data @ effect) * t,
        dims=["obs__"],
    )

    obs = pm.Normal("obs", mu=mu, sigma=value_se, observed=value, dims=["obs__"])

This seems to work; at least the effect is identified correctly.

It also scales nicely with the reported SE: if I reduce the SE, the credible intervals shrink as well, for example with SE=0.025 instead of the original 0.25.
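For anyone reproducing this, here is a minimal sketch of sampling and inspecting the model above (the sampler settings are arbitrary, not tuned values):

    import arviz as az

    with m1:
        idata = pm.sample(target_accept=0.9, random_seed=42)

    # The simulated effects were +0.5, -0.5 and 0.0, so the posterior for
    # "effect" should cover those values given the reported SE of 0.25.
    print(az.summary(idata, var_names=["effect", "trend", "control_intercept"]))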

Overall it seems a bit of an overkill to use PyMC here; I could probably just calculate the classical diff-in-diff and its uncertainty as var(diff-in-diff) = se(control_pre)^2 + se(control_post)^2 + se(treatment_pre)^2 + se(treatment_post)^2. But here I can put nice priors on things, since I know, for example, that the trend should have a greater effect than the treatment.
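For reference, a minimal sketch of that classical calculation for treatment_1 vs. the control group, using df_data from the example above (the cell helper is just for illustration):

    import numpy as np

    # pick out the single observation for a given group and period
    def cell(df, group, t):
        return df[(df[group] == 1) & (df.t == t)].iloc[0]

    t1_pre, t1_post = cell(df_data, "treatment_1", 0), cell(df_data, "treatment_1", 1)
    c_pre, c_post = cell(df_data, "control", 0), cell(df_data, "control", 1)

    did = (t1_post.value - t1_pre.value) - (c_post.value - c_pre.value)
    # independent measurements, so the four variances add up
    did_se = np.sqrt(
        t1_pre.value_se**2 + t1_post.value_se**2 + c_pre.value_se**2 + c_post.value_se**2
    )
    print(f"diff-in-diff = {did:.2f} +/- {did_se:.2f}")  # roughly 0.5 +/- 0.5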

This looks like this discussion. I would suggest adding a parameter that scales the value_se rather than using it as-is.

Like this?

    scale_se = pm.Gamma('scale_se', alpha=10, beta=10)
    obs = pm.Normal("obs", mu=mu, sigma=value_se*scale_se, observed=value, dims=["obs__"])

I don’t see that the model actually learns anything for scale_se, and to @ricardoV94’s point, there are as many parameters as data points, so there are no pure regression residuals, only measurement residuals. Or am I misunderstanding something?

I tried the above: the model either diverges if the prior is not tight enough, or, if it is tight around one, it just expands the uncertainty in the estimates.

Usually in a covariate measurement error model, you have both the measured/observed covariate value $x^\text{obs}_n$ and a measurement error scale $\sigma^\text{obs}_n$ on the measurement, and you define $x_n$ as a parameter (unknown) and give it the prior distribution

$$x_n \sim \textrm{normal}(x^\text{obs}_n, \sigma^\text{obs}_n).$$

If you don’t know the measurement error scale, you can fit it, or replace it with a more elaborate measurement error model.

After that, it’s just an ordinary regression stated in terms of the imputed “true” values $x_n$ rather than the observed values $x^\text{obs}_n$.
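To make that concrete, here is a minimal PyMC sketch of this setup; the data (x_obs, x_se, y) and the priors are made up for illustration:

    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(1)
    x_true = rng.normal(0, 1, size=50)
    x_se = np.full(50, 0.3)
    x_obs = rng.normal(x_true, x_se)         # noisy measurements of the covariate
    y = rng.normal(1.0 + 2.0 * x_true, 0.5)  # outcome depends on the true covariate

    with pm.Model() as measurement_error_model:
        # latent "true" covariate, with the measurement as an informative prior
        x = pm.Normal("x", mu=x_obs, sigma=x_se, shape=len(x_obs))

        alpha = pm.Normal("alpha", 0, 2)
        beta = pm.Normal("beta", 0, 2)
        sigma = pm.Exponential("sigma", 1)

        # ordinary regression stated in terms of the imputed true values
        pm.Normal("y", mu=alpha + beta * x, sigma=sigma, observed=y)

        idata = pm.sample()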

P.S. What is observed=values doing in the obs_star statement? Isn’t that for observations of obs_star, which we don’t have?
