Observational error in logistic regression

Hi!

I am trying to correctly simulate the confidence bounds in a logistic regression were x is not perfectly known. Lest assume I have a pair of vectors of observations (cond, data), where cond is discrete and data is continuous and normally distributed with an observational error sigma. So,

with pm.Model() as model:
    beta = pm.Normal("beta", mu=0, tau=0.01)
    alpha = pm.Normal("alpha", mu=0, tau=0.01)
    data_noise = pm.Normal("data_noise", mu=data, sigma=sigma)
    p = pm.Deterministic("p", 1.0/(1. + np.exp(beta*data_noise + alpha)))

with model:
    observed = pm.Bernoulli("observed", p, observed=cond)
    trace = pm.sampling_jax.sample_numpyro_nuts(tune=1000, draws=10000, chains=4, random_seed=307)

The code samples, but the distribution of alpha and beta seems to be insensitive to the value of sigma (and so the overall logistic inference). I would expected broader distributions for (alpha, beta) as sigma increases. However, the distributions of data_noise do become wider as sigma increases.

Am I doing something wrong? And more generally: is this the standard way to introduce uncertainties in observations in the PyMC framework?

Your variable data_obs is not defined. Did you mean to use data_noise? Like this:

p = pm.Deterministic("p", 1.0/(1. + np.exp(beta*data_noise + alpha)))

Indeed! I will change the original post

I guess my question was whether that was the reason you weren’t seeing any connection between sigma and the posteriors of alpha and beta.

Sadly is not. I was just changing variable names to clarify the example and I missed that one.

Things seem to work ok here. When I set sigma=0.1, I get this:

In [22]: az.hdi(trace, var_names=["alpha", "beta"])
<xarray.Dataset>
Dimensions:  (hdi: 2)
Coordinates:
  * hdi      (hdi) <U6 'lower' 'higher'
Data variables:
    alpha    (hdi) float64 -3.263 0.9269
    beta     (hdi) float64 0.01892 0.3808

When I set sigma=10, I get this:

In [24]: az.hdi(trace, var_names=["alpha", "beta"])
<xarray.Dataset>
Dimensions:  (hdi: 2)
Coordinates:
  * hdi      (hdi) <U6 'lower' 'higher'
Data variables:
    alpha    (hdi) float64 -18.85 16.02
    beta     (hdi) float64 0.215 18.57

Here is my code:

import pymc as pm
import numpy as np
import scipy
import arviz as az
import matplotlib.pyplot as plt

rng = np.random.default_rng(12345)

n = 10
data_true = rng.integers(-10, high=10, size=n)
data = data_true + rng.normal(scale=10, size=n)
ps = scipy.special.expit(-10 + (3.14 * data_true))
cond = rng.binomial(1, ps)

with pm.Model() as model:
    beta = pm.Normal("beta", mu=0, tau=0.01)
    alpha = pm.Normal("alpha", mu=0, tau=0.01)
    data_noise = pm.Normal("data_noise", mu=data, sigma=10)
    observed = pm.Bernoulli("observed", pm.math.invlogit(beta*data_noise + alpha), observed=cond)
    trace = pm.sample()

print(az.hdi(trace, var_names=["alpha", "beta"]))
1 Like

Thank you for your feedback. I was testing your implementation against mine, since they are essentially the same. Except, that you used a very small number of observations, and I have >8000. In this context, it seems that error in the observations of almost any magnitude is of practically no consequence to the goodness of fit.

Thank you again!

1 Like