Wrong estimated parameters in a multilevel logistic problem

Hi all, I ran into an issue with wrong parameter estimates. Something is off, but I am really not sure which part went wrong. I started with a simulation function that creates data with multilevel parameters.

import numpy as np
import pandas as pd


def simulate_data(sigmoid, sample_n=2000, feat_n=2, group_level_n=3, seed=1):
    np.random.seed(seed)
    x = np.random.normal(size=(sample_n, feat_n))                    # standard-normal features
    group_level = np.random.choice(a=group_level_n, size=sample_n)   # group membership per sample
    b_base = np.random.choice(a=5, size=(group_level_n, feat_n))     # integer coefficients, one row per group
    y = (x * b_base[group_level]).sum(1)                             # per-sample linear predictor
    if sigmoid:
        y = 1 / (1 + np.exp(-y))                                     # squash to (0, 1)
    X = pd.DataFrame(x, columns=[f'X_{i}' for i in range(feat_n)])
    X['y'] = y
    X['g'] = group_level
    return X, b_base


A quick look at the simulated data and parameter values:

X_, betas_ = simulate_data(sigmoid=True)

print(X_.head())
> 
           X_0       X_1         y  g
0     1.624345 -0.611756  0.690331  1
1    -0.528172 -1.072969  0.370943  2
2     0.865408 -2.301539  0.761349  0
3     1.744812 -0.761207  0.998011  0
4     0.319039 -0.249370  0.411104  1


Here is the model setup for the sigmoid output y:

X_, XB_ = simulate_data(sigmoid=True)

y = X_['y'].values
X = X_.drop(['y', 'g'], axis=1).values

group_level = X_['g'].values
group_level_n = len(set(group_level))

import pymc3 as pm
import arviz as az

with pm.Model() as model_sigmoid:
    betas = pm.Normal('betas', 0, 2, shape=(group_level_n, X.shape[1]))  # one coefficient vector per group
    z = (X * betas[group_level]).sum(axis=1)                             # per-sample linear predictor
    p = pm.math.sigmoid(z)
    pm.Bernoulli('y_hat', p, observed=y)
    trace_sigmoid = pm.sample()


We can see that the estimated parameters are quite off.

print(az.summary(trace_sigmoid).iloc[:, :2])
> 
              mean     sd
betas[0, 0] -0.212  0.083
betas[0, 1] -0.030  0.079
betas[1, 0]  0.050  0.079
betas[1, 1] -0.006  0.075
betas[2, 0]  0.055  0.075
betas[2, 1] -0.018  0.081


print(betas_)
> array([[4, 1],
         [2, 4],
         [1, 0]])


On the other hand, if I generate the output without the sigmoid function and use pm.Normal as the likelihood, it works just fine.

X_, betas_ = simulate_data(sigmoid=False)

y = X_['y'].values
X = X_.drop(['y', 'g'], axis=1).values

group_level = X_['g'].values
group_level_n = len(set(group_level))

with pm.Model() as model_regression:
    betas = pm.Normal('betas', 0, 2, shape=(group_level_n, X.shape[1]))    
    z = (X * betas[group_level]).sum(axis=1)
    pm.Normal('y_hat', mu=z, sigma=1, observed=y)
    trace_regression = pm.sample()

print(az.summary(trace_regression).iloc[:, :2])
> 
              mean     sd
betas[0, 0]  3.998  0.041
betas[0, 1]  1.000  0.040
betas[1, 0]  1.999  0.038
betas[1, 1]  3.999  0.038
betas[2, 0]  0.999  0.038
betas[2, 1] -0.001  0.041     


Initially, I thought z was not being handled correctly, but it is consistent with the simulation, and the non-sigmoid output can be modeled just fine. Here is my PyMC version in case that helps.

print(pm.__version__)
> 3.11.5

Welcome!

Typically, logistic regression is used when you have dichotomous data. Right now it looks like you have an observed variable that is continuous (though bounded to [0, 1]), but you are modeling it as if the underlying p were only observed in a dichotomous/Bernoulli manner. If you generate synthetic data that conforms to your modeling assumptions, it works fine for me (running v4 here):

rng = np.random.default_rng()

def simulate_data(sigmoid, sample_n=1000, feat_n=2, group_level_n=3, seed=1):
    ...
    X['y'] = np.random.binomial(1, y)
    ...
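
For reference, a complete version of that modification might look like the sketch below. It is just the original simulate_data from the first post with the continuous y column replaced by Bernoulli draws; everything outside that one line is assumed to stay as in the original function.

import numpy as np
import pandas as pd

def simulate_data(sigmoid, sample_n=1000, feat_n=2, group_level_n=3, seed=1):
    np.random.seed(seed)
    x = np.random.normal(size=(sample_n, feat_n))
    group_level = np.random.choice(a=group_level_n, size=sample_n)
    b_base = np.random.choice(a=5, size=(group_level_n, feat_n))
    y = (x * b_base[group_level]).sum(1)
    if sigmoid:
        y = 1 / (1 + np.exp(-y))           # true success probability p
    X = pd.DataFrame(x, columns=[f'X_{i}' for i in range(feat_n)])
    X['y'] = np.random.binomial(1, y)      # 0/1 outcomes, matching the Bernoulli likelihood
    X['g'] = group_level                   # (the draw assumes sigmoid=True, so y is a valid probability)
    return X, b_base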

Thanks! That explains it. Two follow-up questions:

  1. It makes sense that the Bernoulli distribution assumes 0/1 outcomes. Just curious, what exactly happens behind the scenes when continuous values between 0 and 1 are given to the model to fit? I wonder if that could explain why the posteriors of p were all around 0.5.

  2. I found that the parameters can be off again if I hard-code y this way. Is there any reason behind that?

def simulate_data(sigmoid, sample_n=1000, feat_n=2, group_level_n=3, seed=1):
    ...
    X['y'] = np.array(y>=0.5, dtype=int)
    ...

print(az.summary(trace_sigmoid).iloc[:, :2])
> 
               mean     sd
betas[0, 0]  11.585  1.125
betas[0, 1]   2.972  0.341
betas[1, 0]   5.870  0.577
betas[1, 1]  11.257  1.074
betas[2, 0]  11.918  1.158
betas[2, 1]   0.106  0.228
  1. Not actually sure about that. The likelihood of something like y=0.7 seems like it should be zero/undefined, but there weren’t any sampling problems, so it’s clearly coping somehow.
  2. Thresholding your p (or y in the simulation block) again creates mismatches between your data and your model. Imagine a simpler case, without all the predictors. If I have a bunch of coins, each with its own p, I can simulate flips by assuming that all coins for which p>0.5 always give me heads and all coins for which p<0.5 always give me tails. A coin with a value of p=0.50000000001 now gives me all heads, but inferences based on a Bernoulli likelihood are going to suggest a value of p near 1.0, because I never observed any tails (see the sketch after this list).
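
A tiny NumPy sketch of that coin analogy (hypothetical numbers, not the thread’s model), showing how deterministic thresholding fools a Bernoulli fit:

import numpy as np

np.random.seed(0)
p_true = 0.50000000001                            # a coin that is essentially fair

flips = np.random.binomial(1, p_true, 1000)       # honest simulation: roughly half heads
all_heads = np.full(1000, int(p_true >= 0.5))     # thresholded "simulation": all heads

print(flips.mean())      # ~0.5, so a Bernoulli fit recovers p near 0.5
print(all_heads.mean())  # 1.0, so a Bernoulli fit will push p toward 1.0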

If you give continuous values to a discrete likelihood, it simply truncates everything down to the lowest integer (so 0 in your case), if I recall correctly.

Edit: Ah, you are still using PyMC3. Then I don’t know; it might just pass those values directly to the Bernoulli likelihood, which is a function that does not care whether it gets integer or float inputs. In v4 they would definitely be truncated to 0 before being passed in.
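
If the float observations really were truncated on the way in, the Bernoulli likelihood in the first model would effectively see an all-zero response, which would also be consistent with the betas collapsing toward 0 (i.e. p around 0.5 everywhere, since the predictors are symmetric around zero). A plain NumPy sketch of what such a cast does to the simulated y values (this only illustrates the cast itself, not PyMC internals):

import numpy as np

y_cont = np.array([0.690331, 0.370943, 0.761349, 0.998011, 0.411104])  # y values from X_.head() above
print(y_cont.astype(int))  # [0 0 0 0 0] -- every value in (0, 1) truncates to 0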
