Linear regression with measurement errors, am I doing it right?

Hi Everyone,

I’m working on a difference-in-differences analysis where each observation comes with a reported standard error, and I want to use this uncertainty in my model. The diff-in-diff model itself is pretty standard (although I have multiple treatments), except for the error part.

I tried to follow McElreath’s book example here: statistical-rethinking-2023/Lecture 17 - Measurement and Misclassification.ipynb at main · dustinstansbury/statistical-rethinking-2023 · GitHub, with the errors formulated as an additional Normal. In my model that looks like this:

    mu = pm.Deterministic(
        "mu",
        control_intercept + <diff_in_diff_stuff>,
        dims=['obs__']
    )

    scale = pm.Exponential('scale', lam=10)

    # latent "true" values, one per observation
    obs = pm.Normal("obs", mu=mu, sigma=scale, dims=['obs__'])

    # measurement model: observed values with their reported standard errors
    obs_star = pm.Normal('obs*', mu=obs, sigma=values_se, observed=values, dims=['obs__'])

And this does not converge, even with target_accept=0.999, throwing divergences near scale = 0.

To improve convergence, I reformulated the problem in a non-centered way:

    
    mu = pm.Deterministic(
        "mu",
        control_intercept + <diff_in_diff_stuff>,
        dims=['obs__']
    )

    # one standard-normal offset per observation for the non-centered parameterization
    error = pm.Normal('error', 0, 1, dims=['obs__'])
    scale = pm.Exponential('scale', lam=10)

    obs = pm.Deterministic("obs", mu + scale*error, dims=['obs__'])
    # obs = pm.Normal("obs", mu=mu, sigma=scale, dims=['obs__'])

    obs_star = pm.Normal('obs*', mu=obs, sigma=values_se, observed=values, dims=['obs__'])

The sampling improved drastically, although I still need pretty high values of target_accept to get rid of occasional divergences.

However, it seems that the scale is not learned from my dataset at all.

Am I doing everything correctly here?

I don’t know about the diff-in-diff, but the immediate problem is that your model is compatible with a scale of 0.

It does, yes. I think the idea is that the model can differentiate between the regression errors and the measurement errors. The hope is that data points which are off the trend but have large measurement errors would not be influential.

The diff-in-diff example is very similar to this one: Difference in differences — PyMC example gallery

I don’t see where the “reported standard error” comes in. Am I missing something?

Thanks, @ricardoV94. After thinking about it, indeed there are enough parameters in my model to fit all data points exactly, hence the issue.

@cluhmann the standard error is reported with the data; think of it as a measurement error for the reported values. They are just given.

Here is a self-contained example:

import numpy as np
import pandas as pd

df_data = (
    pd.DataFrame(
        {
            "t": [0] * 4 + [1] * 4,
            "treatment_1": [1, 0, 0, 0] * 2,
            "treatment_2": [0, 1, 0, 0] * 2,
            "treatment_3": [0, 0, 1, 0] * 2,
            "control": [0, 0, 0, 1] * 2,
        }
    )
    .assign(
        value=lambda df: np.asarray([-0.05, 0.07, 0.21, 0.03] * 2)
        + 1.2 * df.t
        + 0.5 * df.treatment_1 * df.t
        - 0.5 * df.treatment_2 * df.t
        + 0.0 * df.treatment_3 * df.t
    )
    .assign(value_se=0.25)
)

Here is my revised model:

import pymc as pm

coords = {
    "treatments": ["treatment_1", "treatment_2", "treatment_3"],
    "obs__": df_data.index,
}

with pm.Model(coords=coords) as m1:

    buckets_inds_data = pm.Data(
        "treatments_inds",
        df_data[["treatment_1", "treatment_2", "treatment_3"]],
        dims=["obs__", "treatments"],
    )

    t = pm.Data("t", df_data.t, dims=["obs__"])
    value = pm.Data("value", df_data.value, dims=["obs__"])
    value_se = pm.Data("value_se", df_data.value_se, dims=["obs__"])

    control_intercept = pm.Normal("control_intercept", 0, 1)
    trend = pm.Normal("trend", 0, 2.5)

    pre_bias = pm.Normal("pre_bias", 0, 0.25, dims="treatments")
    effect = pm.Normal("effect", 0, 1.5, dims="treatments")

    mu = pm.Deterministic(
        "mu",
        control_intercept
        + buckets_inds_data @ pre_bias
        + trend * t
        + (buckets_inds_data @ effect) * t,
        dims=["obs__"],
    )

    obs = pm.Normal("obs", mu=mu, sigma=value_se, observed=value, dims=["obs__"])

This seems to work; at least the effect is identified correctly.

It also scales nicely with the reported SE: if I reduce the SE, the credible intervals shrink as well, for example with SE=0.025 instead of the original 0.25.
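For anyone reproducing this, here is a minimal sketch of sampling and inspecting the model above (the sampler settings are arbitrary, not tuned values):

    import arviz as az

    with m1:
        idata = pm.sample(target_accept=0.9, random_seed=42)

    # The simulated effects were +0.5, -0.5 and 0.0, so the posterior for
    # "effect" should cover those values given the reported SE of 0.25.
    print(az.summary(idata, var_names=["effect", "trend", "control_intercept"]))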

Overall it seems a bit of an overkill to use PyMC here; I could probably just calculate the classical diff-in-diff and its uncertainty as var(diff-in-diff) = se(control_pre)^2 + se(control_post)^2 + se(treatment_pre)^2 + se(treatment_post)^2. But here I can put nice priors on things, since I know, for example, that the trend should have a greater effect than the treatment.
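For reference, a minimal sketch of that classical calculation for treatment_1 vs. the control group, using df_data from the example above (the cell helper is just for illustration):

    import numpy as np

    # pick out the single observation for a given group and period
    def cell(df, group, t):
        return df[(df[group] == 1) & (df.t == t)].iloc[0]

    t1_pre, t1_post = cell(df_data, "treatment_1", 0), cell(df_data, "treatment_1", 1)
    c_pre, c_post = cell(df_data, "control", 0), cell(df_data, "control", 1)

    did = (t1_post.value - t1_pre.value) - (c_post.value - c_pre.value)
    # independent measurements, so the four variances add up
    did_se = np.sqrt(
        t1_pre.value_se**2 + t1_post.value_se**2 + c_pre.value_se**2 + c_post.value_se**2
    )
    print(f"diff-in-diff = {did:.2f} +/- {did_se:.2f}")  # roughly 0.5 +/- 0.5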

This looks like this discussion. I would suggest adding a parameter that scales the value_se rather than using it as-is.

Like this?

    scale_se = pm.Gamma('scale_se', alpha=10, beta=10)
    obs = pm.Normal("obs", mu=mu, sigma=value_se*scale_se, observed=value, dims=["obs__"])

I don’t see that the model actually learns anything for scale_se, and to @ricardoV94’s point, there are as many parameters as data points, so there are no pure regression residuals, only measurement residuals. Or am I misunderstanding something?

I tried the above: the model either diverges if the prior is not tight enough, or, if it is tight around one, it just expands the uncertainty in the estimates.

Usually in a covariate measurement error model, you have both the measured/observed covariate value $x^\text{obs}_n$ and a measurement error scale $\sigma^\text{obs}_n$ on the measurement, and you define $x_n$ as a parameter (unknown) and give it the prior distribution

$$x_n \sim \textrm{normal}(x^\text{obs}_n, \sigma^\text{obs}_n).$$

If you don’t know the measurement error scale, you can fit it, or replace it with a more elaborate measurement error model.

After that, it’s just an ordinary regression stated in terms of the imputed “true” values $x_n$ rather than the observed values $x^\text{obs}_n$.
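To make that concrete, here is a minimal PyMC sketch of this setup; the data (x_obs, x_se, y) and the priors are made up for illustration:

    import numpy as np
    import pymc as pm

    rng = np.random.default_rng(1)
    x_true = rng.normal(0, 1, size=50)
    x_se = np.full(50, 0.3)
    x_obs = rng.normal(x_true, x_se)         # noisy measurements of the covariate
    y = rng.normal(1.0 + 2.0 * x_true, 0.5)  # outcome depends on the true covariate

    with pm.Model() as measurement_error_model:
        # latent "true" covariate, with the measurement as an informative prior
        x = pm.Normal("x", mu=x_obs, sigma=x_se, shape=len(x_obs))

        alpha = pm.Normal("alpha", 0, 2)
        beta = pm.Normal("beta", 0, 2)
        sigma = pm.Exponential("sigma", 1)

        # ordinary regression stated in terms of the imputed true values
        pm.Normal("y", mu=alpha + beta * x, sigma=sigma, observed=y)

        idata = pm.sample()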

P.S. What is observed=values doing in the obs_star statement? Isn’t that for observations of obs_star, which we don’t have?
