How to include indicator variables

Hi,

I have the following dataset:

x: Ordinal time
y: Measurement
phs1,phs2: Indicator variables for different phases (they can in principle overlap)

So far I have simply fit two models, one for each phase separately. This works, but I know it is not ideal, so I am trying to get both phases into one model like this:

model = bmb.Model("y ~ x + x:phs1 + x:phs2", fitdf, priors=priors) 

trace = model.fit(
    draws=10000,
    tune=1000,
    discard_tuned_samples=True,
    chains=4,
    cores=4,
    progressbar=True,
    target_accept=0.9,
)

However, this somehow does not work and I am wondering what is wrong:

Any suggestions?


CC @tcapretto @zweli who will likely know the solution

Can you say more about what you mean by this? What are you expecting to see?

Thank you for your response! As you can see, x:phs1 and x:phs2 are essentially zero. Since it is simulated data, I know that the slopes for the two phases are different, and when I fit two separate models, the posterior is centered around the true slope. I would expect to see something similar in the combined model.

Presumably the bare (un-interacted) x is absorbing variance that your “separate models” approach did not. What happens when you run "y ~ x:phs1 + x:phs2"?

Same:

Here is the code to reproduce the issue:

import pandas as pd
import bambi as bmb

fitdf = pd.DataFrame({'x': {0: 738886,
  1: 738926,
  2: 738950,
  3: 739014,
  4: 739065,
  5: 739123,
  6: 739145,
  7: 739192,
  8: 739218,
  9: 739239},
 'y': {0: 48, 1: 47, 2: 46, 3: 45, 4: 43, 5: 42, 6: 42, 7: 42, 8: 41, 9: 41},
 'phs1': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
 'phs2': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1}})

model = bmb.Model("y ~ x:phs1 + x:phs2", fitdf)


trace = model.fit(
    draws=10000,
    tune=1000,
    discard_tuned_samples=True,
    chains=4,
    cores=4,
    progressbar=True,
    target_accept=0.9,
)

Two questions. First, what are the true/expected estimates of your two slopes? Second, are you sure that you aren’t just seeing very small marginal means rather than an actual problem?

This is what I am seeing:

In [19]: (trace.posterior["x:phs1"] > trace.posterior["x:phs2"]).mean()
Out[19]: 
<xarray.DataArray ()>
array(0.999)

The true slopes are -0.8 points / month during phase 1 and -0.3 points / month during phase 2. But the model estimates points / day, so the slopes need to be divided by 30.4.

However, I would expect that

import numpy as np

posterior = trace.posterior
np.mean(posterior['x:phs1'][0]*30.4)

gives a result close to -0.8?
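As a sketch, both slopes can be summarized together on the points-per-month scale like this (assuming arviz is installed alongside bambi and trace is the combined-model fit from above):

import arviz as az

# multiply the per-day slopes by 30.4 to get points / month, then summarize
az.summary(trace.posterior[["x:phs1", "x:phs2"]] * 30.4)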

This is the posterior when I use two separate models:

Here is the code for the separate models:

model = bmb.Model("y ~ x", fitdf[fitdf['phs1'] == 1])


trace = model.fit(
    draws=10000,
    tune=1000,
    discard_tuned_samples=True,
    chains=4,
    cores=4,
    progressbar=True,
    target_accept=0.9,
)
posterior = trace.posterior

s = posterior['x'][0].to_numpy()*30.4
print(np.mean(s))

This prints -0.8. If you do the same with phs2 it prints -0.28.

I assume it has to do with the data-informed priors. Here is what you get with your code:


Reasonable, but probably not what you want.

Here is the same posterior predictive after adding this line as a data pre-processing step:

fitdf["x"] = fitdf["x"] - fitdf["x"].min()

So I would ask @tcapretto what he thinks.


@cluhmann is right. Have a look at the following:

import bambi as bmb
import pandas as pd

data = {
    "x": [738886, 738926, 738950, 739014, 739065, 739123, 739145, 739192, 739218, 739239],
    "y": [48, 47, 46, 45, 43, 42, 42, 42, 41, 41],
    "phs1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "phs2": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
}

df = pd.DataFrame(data)

model = bmb.Model("y ~ x + x:phs1 + x:phs2", df)
model
       Formula: y ~ x + x:phs1 + x:phs2
        Family: gaussian
          Link: mu = identity
  Observations: 10
        Priors: 
    target = mu
        Common-level effects
            Intercept ~ Normal(mu: 43.7, sigma: 37539.2986)
            x ~ Normal(mu: 0.0, sigma: 0.0508)
            x:phs1 ~ Normal(mu: 0.0, sigma: 0.0)
            x:phs2 ~ Normal(mu: 0.0, sigma: 0.0)
        
        Auxiliary parameters
            sigma ~ HalfStudentT(nu: 4.0, sigma: 2.4515)

Notice first the huge standard deviation on the intercept and the “zero” sigma for the slope deflections.

If you then inspect the internal priors (to circumvent the rounding in the summary)

print(model.components["mu"].common_terms["x"].prior.args)
print(model.components["mu"].common_terms["x:phs1"].prior.args)
print(model.components["mu"].common_terms["x:phs2"].prior.args)
{'mu': array(0.), 'sigma': array(0.05079222)}
{'mu': array(0.), 'sigma': array(1.65875211e-05)}
{'mu': array(0.), 'sigma': array(1.6582692e-05)}

You can see that the sigmas for the deflections are much smaller than the one for the slope.

This problem still happens when you do y ~ 1 + x:phs1 + x:phs2 (which, unlike the previous formula, gives identified slopes for your data).

When you have numerical values of such magnitude, it makes sense to apply some transformation. Subtracting the minimum seems to work well in this case, and you don’t need to transform the slopes. You could also just apply z-score standardization.
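For example, a minimal sketch of the z-score approach, using the df from the snippet above (the column name x_z and the sampler settings are just for illustration):

import bambi as bmb

# standardize x so the predictor is on a unit scale
x_mean, x_sd = df["x"].mean(), df["x"].std()
df["x_z"] = (df["x"] - x_mean) / x_sd

model = bmb.Model("y ~ x_z:phs1 + x_z:phs2", df)
trace = model.fit(draws=2000, tune=1000, chains=4, target_accept=0.9)

# dividing a slope on the standardized scale by x_sd gives points / day again
slope_phs1_per_day = trace.posterior["x_z:phs1"] / x_sd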

Out of curiosity, what kind of measurement is y? I’m asking to double-check whether the normal likelihood makes sense.


Thanks for your explanation!

Why subtract the minimum value rather than the mean to center it?

The data is a health score for a degenerative disease that decreases over time. It is based on a questionnaire. You almost never see an increase over time. The score is always an integer and has a maximum value of 48.

Subtracting the mean and using the following model works:

However, when using this model

I see a lot of divergences. Why does that happen?