How to include indicator variables

Hi,

I have the following dataset:

x: Ordinal time
y: Measurement
phs1,phs2: Indicator variables for different phases (they can in principle overlap)

So far I have simply fit two models, one for each phase separately. This works, but I know it is not ideal, so I am trying to get both phases into one model like this:

model = bmb.Model("y ~ x + x:phs1 + x:phs2", fitdf, priors=priors) 

trace = model.fit(
    draws=10000,
    tune=1000,
    discard_tuned_samples=True,
    chains=4,
    cores=4,
    progressbar=True,
    target_accept=0.9,
)

However, this somehow does not work and I am wondering what is wrong:

Any suggestions?


CC @tcapretto @zweli who will likely know the solution

Can you say more about what you mean by this? What are you expecting to see?

Thank you for your response! As you can see, x:phs1 and x:phs2 are essentially zero. Since it is simulated data, I know that the slopes for the two phases are different, and when I fit two separate models, the posterior is centered around the true slope. I would expect to see something similar in the combined model.

Presumably the bare (un-interacted) x is absorbing variance that your “separate models” approach did not. What happens when you run "y ~ x:phs1 + x:phs2"?

Same:

Here is the code to reproduce the issue:

import pandas as pd
import bambi as bmb

fitdf = pd.DataFrame({'x': {0: 738886,
  1: 738926,
  2: 738950,
  3: 739014,
  4: 739065,
  5: 739123,
  6: 739145,
  7: 739192,
  8: 739218,
  9: 739239},
 'y': {0: 48, 1: 47, 2: 46, 3: 45, 4: 43, 5: 42, 6: 42, 7: 42, 8: 41, 9: 41},
 'phs1': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
 'phs2': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}})

model = bmb.Model("y ~ x:phs1 + x:phs2", fitdf)


trace = model.fit(
    draws=10000,
    tune=1000,
    discard_tuned_samples=True,
    chains=4,
    cores=4,
    progressbar=True,
    target_accept=0.9,
)

Two questions. First, what are the true/expected estimates of your two slopes? Second, are you sure that you aren’t just seeing very small marginal means rather than an actual problem?

This is what I am seeing:

In [19]: (trace.posterior["x:phs1"] > trace.posterior["x:phs2"]).mean()
Out[19]: 
<xarray.DataArray ()>
array(0.999)

The true slopes are -0.8 points / month during phase 1 and -0.3 points / month during phase 2. But the model estimates points / day, so the slopes need to be divided by 30.4.

However, I would expect that

import numpy as np

posterior = trace.posterior
np.mean(posterior['x:phs1'][0]*30.4)

gives a result close to -0.8?
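As a sketch, both slopes can be summarized together on the points-per-month scale like this (assuming arviz is installed alongside bambi and trace is the combined-model fit from above):

import arviz as az

# multiply the per-day slopes by 30.4 to get points / month, then summarize
az.summary(trace.posterior[["x:phs1", "x:phs2"]] * 30.4)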

This is the posterior when I use two separate models:

Here is the code for the separate models:

model = bmb.Model("y ~ x", fitdf[fitdf['phs1'] == 1])


trace = model.fit(
    draws=10000,
    tune=1000,
    discard_tuned_samples=True,
    chains=4,
    cores=4,
    progressbar=True,
    target_accept=0.9,
)
posterior = trace.posterior

s = posterior['x'][0].to_numpy()*30.4
print(np.mean(s))

This prints -0.8. If you do the same with phs2 it prints -0.28.

I assume it has to do with the data-informed priors. Here is what you get with your code:


Reasonable, but probably not what you want.

Here is the same posterior predictive after adding this line as a data pre-processing step:

fitdf["x"] = fitdf["x"] - fitdf["x"].min()

So I would ask @tcapretto what he thinks.


@cluhmann is right. Have a look at the following:

import bambi as bmb
import pandas as pd

data = {
    "x": [738886, 738926, 738950, 739014, 739065, 739123, 739145, 739192, 739218, 739239],
    "y": [48, 47, 46, 45, 43, 42, 42, 42, 41, 41],
    "phs1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "phs2": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
}

df = pd.DataFrame(data)

model = bmb.Model("y ~ x + x:phs1 + x:phs2", df)
model
       Formula: y ~ x + x:phs1 + x:phs2
        Family: gaussian
          Link: mu = identity
  Observations: 10
        Priors: 
    target = mu
        Common-level effects
            Intercept ~ Normal(mu: 43.7, sigma: 37539.2986)
            x ~ Normal(mu: 0.0, sigma: 0.0508)
            x:phs1 ~ Normal(mu: 0.0, sigma: 0.0)
            x:phs2 ~ Normal(mu: 0.0, sigma: 0.0)
        
        Auxiliary parameters
            sigma ~ HalfStudentT(nu: 4.0, sigma: 2.4515)

Notice first the huge standard deviation on the intercept and the “zero” sigma for the slope deflections.

If you then inspect the internal priors (to circumvent the rounding in the summary)

print(model.components["mu"].common_terms["x"].prior.args)
print(model.components["mu"].common_terms["x:phs1"].prior.args)
print(model.components["mu"].common_terms["x:phs2"].prior.args)
{'mu': array(0.), 'sigma': array(0.05079222)}
{'mu': array(0.), 'sigma': array(1.65875211e-05)}
{'mu': array(0.), 'sigma': array(1.6582692e-05)}

You can see that the sigmas for the deflections are much smaller than the one for the slope.

This problem still happens when you do y ~ 1 + x:phs1 + x:phs2 (which, unlike the previous formula, gives identified slopes for your data).

When you have numerical values of such magnitude, it makes sense to apply some transformation. Subtracting the minimum seems to work well in this case, and you don’t need to transform the slopes. You could also just apply z-score standardization.
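For example, a minimal sketch of the z-score approach, using the df from the snippet above (the column name x_z and the sampler settings are just for illustration):

import bambi as bmb

# standardize x so the predictor is on a unit scale
x_mean, x_sd = df["x"].mean(), df["x"].std()
df["x_z"] = (df["x"] - x_mean) / x_sd

model = bmb.Model("y ~ x_z:phs1 + x_z:phs2", df)
trace = model.fit(draws=2000, tune=1000, chains=4, target_accept=0.9)

# dividing a slope on the standardized scale by x_sd gives points / day again
slope_phs1_per_day = trace.posterior["x_z:phs1"] / x_sd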

Out of curiosity, what kind of measurement is y? I’m asking to double-check whether the normal likelihood makes sense.


Thanks for your explanation!

Why subtract the minimum value rather than the mean to center it?

The data is a health score for a degenerative disease that decreases over time. It is based on a questionnaire. You almost never see an increase over time. The score is always an integer and has a maximum value of 48.

Subtracting the mean and using the following model works:

However, when using this model

I see a lot of divergences. Why does that happen?