Simpson's paradox and mixed models - large number of groups

I am playing around with the great example in Simpson’s paradox and mixed models — PyMC example gallery
and I noticed the following:

i) Increasing the number of groups from 5 to 50 makes it necessary to use pm.sample(target_accept=0.999, tune=2000, draws=10000)
in order not to get any warnings about divergences, too large r-hat, etc.
Is there an upper limit on the number of groups that one can handle with PyMC in general, in the sense that keeping sampling from "getting stuck" necessitates larger and larger acceptance probabilities and draws?

ii) My posterior expectation for the population-level slope (the slope_mu hyperprior in the hierarchical model from the example) does not seem to converge to the true value of -0.5, but instead seems to systematically overestimate it: the posterior is distributed around -0.45. If I increase the number of observations per group (e.g. by a factor of ten), then the posterior peaks around -0.5 as expected. Intuitively I would also have expected a similar effect when increasing the number of groups (by a factor of ten), as the total number of observations that enters the population-level slope estimate is the same. Is there an intuitive explanation why increasing the samples per group and increasing the number of groups have different effects on the posterior of the population-level parameters? Is this caused by the choice of the hyperprior parameters (sigma=1)?


Welcome!

No. The sampler will simply be unhappy with some combinations of model and data. "Number of groups" isn't a general attribute of models that makes sampling more or less difficult. FYI, if you ever have to crank target_accept up beyond 0.9 or so, your model probably needs to be reparameterized (or you have made a mistake).

@cluhmann thanks a lot for the quick reply. I'd be very glad for any pointers w.r.t. a possible reparametrisation of the model. Please note that I am using the hierarchical model in non-centered form from the example directly, and I need to increase the acceptance probability to e.g. 0.999 when I go from 5 to 50 groups.

Also, the example already sets the acceptance probability to 0.99 for 5 groups. This doesn't seem to match your expectation, i.e. this specific example (using the recommended non-centered parametrisation) already warns about divergences at the default acceptance threshold of 0.9.

I am probably overlooking something here. I am just getting started…

The original model is probably already bad

As @ricardoV94 said, having to crank up the target acceptance is an unambiguous signal that the sampler is struggling. The fact that a notebook in the PyMC documentation presents a model (plus data) which requires a target acceptance value of 0.99 is not great and does not reflect the best practices we would recommend.


You could for instance rewrite the regression like this:

coords = {"group": group_list}

with pm.Model(coords=coords) as hierarchical:
    x = pm.MutableData("x", data.x, dims="obs_id")
    group_idx = pm.MutableData("group_idx", data.group_idx, dims="obs_id")

    intercept = pm.Normal("intercept", sigma=1)

    group_sigma = pm.HalfNormal("group_sigma", sigma=2)
    group_offset = pm.Normal("group_offset", dims="group")
    group_effect = pm.Deterministic("group_effect", group_sigma * group_offset, dims="group")

    x_effect = pm.Normal("x_effect")

    x_group_sigma = pm.HalfNormal("x_group_sigma", sigma=2)
    x_group_offset = pm.Normal("x_group_offset", dims="group")
    x_group_effect = pm.Deterministic("x_group_effect", x_group_sigma * x_group_offset, dims="group")

    mu = (
        intercept
        + group_effect[group_idx]
        + x_effect * x
        + x_group_effect[group_idx] * x
    )
    sigma = pm.HalfNormal("sigma", sigma=2)
    pm.Normal("y", mu=mu, sigma=sigma, observed=data.y, dims="obs_id")

import numpy as np
import pandas as pd

rng = np.random.default_rng()


def generate():
    # <--- I changed the number of groups
    group_list = [f"group_{i}" for i in range(500)]
    trials_per_group = 20
    group_intercepts = rng.normal(0, 1, len(group_list))
    # <--- I changed the group slopes to not be all exactly identical
    group_slopes = rng.normal(-0.5, 0.1, size=len(group_list))
    group_mx = group_intercepts * 2
    group = np.repeat(group_list, trials_per_group)
    subject = np.concatenate(
        [np.ones(trials_per_group) * i for i in np.arange(len(group_list))]
    ).astype(int)
    intercept = np.repeat(group_intercepts, trials_per_group)
    slope = np.repeat(group_slopes, trials_per_group)
    mx = np.repeat(group_mx, trials_per_group)
    x = rng.normal(mx, 1)
    y = rng.normal(intercept + (x - mx) * slope, 1)
    data = pd.DataFrame({"group": group, "group_idx": subject, "x": x, "y": y})
    return data, group_list

I changed the data generation slightly so that not all groups have the exact same slope, and so that we have 500 groups. The model is for the most part just a reparametrization of the original one; I did remove the different sigma values per group, though. That wasn't in the data generation anyway, and it sounds a bit strange to me.
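Written out, the mapping to the original parametrization is

$$\beta_0[g] = \text{intercept} + \text{group\_effect}[g], \qquad \beta_1[g] = \text{x\_effect} + \text{x\_group\_effect}[g],$$

so that

$$\mu = \beta_0[g] + \beta_1[g]\, x = \text{intercept} + \text{group\_effect}[g] + \text{x\_effect}\, x + \text{x\_group\_effect}[g]\, x.$$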

The default sampler still struggles a little bit with this (somewhat low ESS), but nutpie for instance seems to be perfectly fine and samples in ~5s. The lowest ESS is ~400:

import arviz
import nutpie
compiled = nutpie.compile_pymc_model(hierarchical)
tr = nutpie.sample(compiled)
ess = arviz.ess(tr)
ess.min()

Some more things that could further improve things:

  • Use a ZeroSumNormal where appropriate
  • Normalize the predictor x

Depending on the dataset it is also possible that a centered parametrization improves things (when data sizes get bigger, it is not unusual for the centered version to work better).
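For example, a centered version of the group intercepts would replace the group_offset/group_effect pair inside the model block with a direct draw (just a sketch of the pattern, same prior as above):

# Centered alternative: sample the group effects directly instead of
# scaling a standard-normal offset by group_sigma.
group_effect = pm.Normal("group_effect", mu=0, sigma=group_sigma, dims="group")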

Edit: I'm also a bit confused by the bias right now. I'll have another look tomorrow.


@aseyboldt thank you very much for your quick reply and the very detailed suggestions. I have tried this, and I am also getting a lowest ESS larger than 400.

With the original 5 groups, however, I am getting the following (much lower) minimum ESS values:

<xarray.Dataset> Size: 96B
Dimensions:              ()
Data variables:
    intercept            float64 8B 30.51
    group_sigma_log__    float64 8B 177.3
    group_offset         float64 8B 36.71
    x_effect             float64 8B 29.39
    x_group_sigma_log__  float64 8B 27.33
    x_group_offset       float64 8B 198.2
    sigma_log__          float64 8B 5.215e+03
    group_sigma          float64 8B 177.3
    x_group_sigma        float64 8B 27.33
    sigma                float64 8B 5.215e+03
    group_effect         float64 8B 29.7
    x_group_effect       float64 8B 27.33

I am assuming (as a total beginner) that ESS stands for effective sample size, meaning that larger is better?

So with those numbers the effective sample size for e.g. the (most interesting) x_effect seems to be too low: given that the number of draws is 1000, an ESS of ~30 indicates an autocorrelation time of ~30 steps. Is that reasoning valid at all? My MCMC theory is a bit rusty, so I might be totally off here. So for the original number of groups sampling is still difficult - is that understanding correct?
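In code, I think the check I have in mind amounts to roughly the following (just a sketch; idata stands for whichever InferenceData object the run produced):

import arviz as az

# Relative ESS is ESS per draw; its reciprocal is a rough estimate of the
# integrated autocorrelation time (total draws / ESS).
rel_ess = az.ess(idata, relative=True)
print(1.0 / rel_ess["x_effect"])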

As someone who is new to Bayesian hierarchical modelling, it is quite difficult to understand why this specific reparametrisation, i.e. replacing
$\mu = \beta_0[g] + \beta_1[g] \cdot x$ in the original (linear) model with $\mu = \text{intercept} + \text{group\_effect}[g] + \text{x\_effect} \cdot x + \text{x\_group\_effect}[g] \cdot x$, and/or adding noise to the slope in the data-generating process improves sampling.

Thanks for looking into the “bias” on x_effect, I’d be very glad to get any insights on that.

By the way, I think the notebook is an excellent example for getting started with hierarchical modelling, and I am very glad for the excellent PyMC documentation and examples, as well as the community's willingness to answer questions and help out. Thanks a lot to everyone!

Edit: Surprisingly, the "bias" goes away if I reduce the factor 2 in group_mx to 0.2. If I increase it to 20, the ESS drops significantly, so there seems to be some undesired interaction between the "distance" between the different groups and the sampling and convergence properties. I don't understand why yet, but if I do I will follow up here.

The arviz documentation is pretty good about explaining the various diagnostics. Here is the page on effective sample size.


I think the issue is the usual panel data unobserved confounding due to the mx term in the slope. The causal graph is something like this:

[causal graph image]

McElreath addresses graphs like this in this video, bonus section. Basically we want to model x and y simultaneously in order to obtain estimates for mx:

coords = {"group": group_list}

with pm.Model(coords=coords) as hierarchical:
    x = pm.MutableData("x", data.x, dims="obs_id")
    group_idx = pm.MutableData("group_idx", data.group_idx, dims="obs_id")
    
    x_intercept = pm.Normal('x_intercept')
    x_beta = pm.Normal('x_beta')
    mx_hat = pm.Normal('mx', dims='group')
    x_mu = x_intercept + x_beta * mx_hat[group_idx]

    x_sigma = pm.HalfNormal('x_sigma')
    x_hat = pm.Normal('x_hat', mu=x_mu, sigma=x_sigma, observed=x, dims='obs_id')
    
    mu_intercept = pm.Normal('mu_intercept')
    sigma_intercept = pm.Gamma("sigma_intercept", alpha=2, beta=1)
    offset_intercept = pm.ZeroSumNormal("offset_intercept", dims="group")
    intercept = pm.Deterministic("intercept", mu_intercept + sigma_intercept * offset_intercept, dims="group")

    mu_slope = pm.Normal('mu_slope')
    sigma_slope = pm.Gamma("sigma_slope", alpha=2, beta=1)
    offset_slope = pm.ZeroSumNormal("offset_slope", dims="group")
    slope = pm.Deterministic("slope", mu_slope + sigma_slope * offset_slope, dims="group")
    
    mx_slope = pm.Normal('mx_slope')

    mu = (intercept[group_idx]
            + slope[group_idx] * x
            + mx_slope * mx_hat[group_idx]
    )
    sigma = pm.HalfNormal("sigma", sigma=2)
    
    pm.Normal("y", mu=mu, sigma=sigma, observed=data.y, dims="obs_id")
    idata = pm.sample(nuts_sampler='nutpie')

This scheme obtains unbiased estimates of the population slope (posterior plot not reproduced here).
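Roughly, the check behind that plot compares the mu_slope posterior to the value used in the data generation (a sketch):

import arviz as az

# Compare the mu_slope posterior against the population slope of -0.5
# used in the modified generate() above.
az.plot_posterior(idata, var_names=["mu_slope"], ref_val=-0.5)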

This doesn’t solve the ESS problems, though – that’s still very bad.


@jessegrabowski Thanks a lot for sharing the video and suggesting another re-parametrisation that explicitly models the group mean along the x-coordinate. One thing that I noticed is that ‘x_hat’ is never used. Not sure whether this is intended?

I have tried it out and can confirm that it yields the "unbiased" posterior estimate of the effect of interest (mu_slope in the model above), similar to the plot above. I can also confirm the low ESS, e.g. for the group mean of the x-coordinate (mx) the ESS is below 10, cf. the output below.

Dimensions:                     ()
Data variables: (12/20)
    x_intercept                 float64 8B 202.3
    x_beta                      float64 8B 9.959
    mx                          float64 8B 9.838
    x_sigma_log__               float64 8B 1.33e+04
    mu_intercept                float64 8B 203.2
    sigma_intercept_log__       float64 8B 1.959e+03
    ...                          ...
    offset_intercept            float64 8B 7.995e+03
    sigma_slope                 float64 8B 2.036e+03
    offset_slope                float64 8B 7.175e+03
    sigma                       float64 8B 1.216e+04
    intercept                   float64 8B 250.8
    slope                       float64 8B 4.277e+03

One thing that is not really clear to me in the above re-parametrisation is the usage of the ZeroSumNormal - what's the benefit of using it here?

The low ESS values for x_beta and mx seem to be independent of the group separation in x: changing
group_mx = group_intercepts * 2.0 to group_mx = group_intercepts * 0.2 or to group_mx = group_intercepts * 20.0 neither improves nor worsens these numbers. However, for group_mx = group_intercepts * 20.0 the ESS values for e.g. mu_intercept are significantly reduced, so the "sampling efficiency" still seems to depend on this parameter of the data-generating process. I guess the definition of mx_hat is not compatible with those scaling changes - but I am wildly guessing here.

I will dig deeper here. Thanks for all the great help and suggestions so far. If there are ideas on how to make the example "work out of the box", I'd be super glad to hear them. I am learning a lot thanks to all the excellent help and explanations.

Edit: The x_beta posterior looks to have two disconnected, symmetric modes (probably due to insufficient mixing, but again I am wildly guessing here), cf. the figure below (omitted here).

It is used because it is directly connected to the observation of x (thus the observed=x). That is, it forms part of the likelihood.


@cluhmann

I am not sure - I guess I am missing something. As far as I can tell, x_hat does not occur anywhere after its definition. I just noticed this because my editor (PyCharm) flagged it as unused earlier, but again I might just be overlooking something.

Thanks a lot for helping out here, @cluhmann.

I guess there's a more subtle point here. The PyMC parameter x_hat is used in (part of) the likelihood. The Python variable x_hat is defined, but never used thereafter. Even in simple models like the one you defined, it's pretty customary to assign the likelihood to a variable:

mu = pm.Normal("mu")
    sigma = pm.HalfNormal("sigma")
    
    likelihood  = pm.Normal("likelihood", mu=mu, sigma=sigma, observed=my_data)

@cluhmann thanks for clarifying. I think I roughly see what you're getting at, but I definitely need to sit down and write it out in order to develop a thorough understanding.

Your explanations are incredibly helpful, thanks a lot for your time.

If there are any suggestions on how to improve the low ESS I am observing then I’d be glad to look into that.

I am learning a lot here and the available pymc examples are terrific.


I don't think the two modes of x_beta are due to mixing. If that were the case, you'd see the chains jumping between the modes. What you see instead is that the sampler has converged to two equally probable modes, depending on its starting position.
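One way to check this directly (just a sketch, using the idata from the model above):

import arviz as az

# If the modes were due to slow mixing, individual chains would jump between
# them; if each chain sits in one mode, the per-chain means of x_beta split
# into two clusters.
az.plot_trace(idata, var_names=["x_beta"], compact=False)
print(idata.posterior["x_beta"].mean(dim="draw"))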

I noticed it as well, but since the x model is just a bunch of nuisance parameters from the perspective of inference on mu_slope, I ignored it. Ideally you'd want to marginalize it all away mathematically, but that's way beyond my ability. You could try removing the slope component from x_hat and see if it does any better/worse.
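Concretely, that would mean something like this for the x_mu line (just a sketch of what I have in mind):

# Drop the x_beta slope so the group means mx_hat enter the x likelihood directly.
x_mu = x_intercept + mx_hat[group_idx]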

Maybe nothing? It’s meant to take into account the fact that you only have n-1 degrees of freedom when computing slope offsets. It’s related to the dummy variable trap, if you ever saw that in an OLS class. In my experience you can get away with using normals just fine, but ZeroSumNormal is new and hip. There are also some extremely subtle differences when it comes to model interpretation (check the very last paragraph of that post), but I’m not sure they’re relevant here.
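To make the sum-to-zero constraint concrete, a quick standalone check (just a sketch):

import pymc as pm

# Draws from a ZeroSumNormal sum to (numerically) zero across the constrained
# dimension, so only n-1 of the n offsets are free.
with pm.Model(coords={"group": range(5)}):
    z = pm.ZeroSumNormal("z", dims="group")
    draws = pm.draw(z, draws=3)

print(draws.sum(axis=-1))  # ~0 for each draw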

About the bias: I totally missed the mx term in the data generation; I guess that explains it. 🙂

So the model and the actual data-generating process did not match up exactly, which means there was no real reason why a parameter in the data-generating process should come out as exactly the same number as a different parameter in the model that happens to have the same name and a similar interpretation.

As Jesse showed we can change the model to fit the actual data generation process. If we do, the results match up. In that case the model isn’t really a simple linear regression anymore, but something a little different.

I'm actually a bit confused about the generate function and the causal model behind it - I don't have any idea of a setting where that exact causal model would make sense.

Maybe we can use this to improve the example a bit. I think it would help a great deal if we could use a more concrete application where something like this might turn up (still using synthetic data, but an application that could guide us toward a more realistic model).


So perhaps something along those lines:

We generate data using the following causal model (but hide the details from the reader): we measure student grades (or some related quantity that indicates how well students are doing in absolute terms). For each student we also know how much money was spent on their education. Each student comes from a school (that's our group). Each school has a latent quantity, "education quality" or so. Schools with higher quality tend to have better student outcomes and also higher spending per student. But within each school, more money is spent on students who have worse grades.

That way we should get our Simpson's paradox: within each group we have a negative correlation between spending and grades, but globally we have a positive correlation.

We then start the example by just investigating the dataset from a single school, and fit a linear model, observing a negative correlation between spending and grades.

We then extend the dataset to multiple schools, and observe a positive correlation.

Then we fit a mixed model as in the current notebook.

We can then use a posterior predictive check to show that the clear positive global correlation isn't picked up by the model if we resample schools, which could then lead us to include a latent variable and model the whole generative process.
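A rough sketch of the data-generating process I have in mind (the names and coefficients are just made up to illustrate the structure):

import numpy as np
import pandas as pd

rng = np.random.default_rng()


def generate_schools(n_schools=20, students_per_school=30):
    # Latent "education quality" per school.
    quality = rng.normal(0, 1, n_schools)
    school = np.repeat(np.arange(n_schools), students_per_school)
    q = quality[school]
    # Better schools produce better grades on average.
    grades = rng.normal(q, 1)
    # Better schools also spend more per student, but within a school
    # more money goes to students with worse grades.
    spending = rng.normal(2 * q - 0.5 * (grades - q), 0.5)
    return pd.DataFrame({"school": school, "spending": spending, "grades": grades})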

Do you think that makes more sense than the current example? Or any other ideas about how we could improve it?


Thanks @jessegrabowski for your suggestions. I will try removing the slope component from x_hat and will see whether I can marginalize the nuisance parameters as suggested.

Thanks for motivating the zero-sum normal - that was definitely very helpful.

@aseyboldt actually I found the existing example very intuitive and accessible, so I wouldn’t necessarily change it.

But the existence of an apparent bias (due to the confound) and the fact that the sampler has problems, especially as the number of groups increases, perfectly set the stage for discussing confounds, reparametrisation, etc. So I'd extend the existing example with sections discussing those things, or link to a "second part" that develops the example further and demonstrates how successful pooling with a large number of groups can be facilitated, rather than fitting the paradox into a more concrete application.


@aseyboldt what's the rationale behind the "normalize the predictor x" suggestion?

I am currently experimenting with a variant of the example in which x is sampled from a common (univariate standard normal) distribution for all groups, and posterior sampling improves (no divergences, larger ESS) when I demean the predictor x within each group.
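Concretely, the demeaning I do amounts to something like this (assuming the data DataFrame from the example, with columns "group" and "x"):

# Demean the predictor within each group before passing it to the model.
data["x_centered"] = data.x - data.groupby("group").x.transform("mean")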

I have seen the “normalize the predictor” suggestion also elsewhere, but I haven’t found a convincing argument yet…

Btw: modeling the group mean of the predictors explicitly, on the other hand, does not help at all - in fact sampling efficiency is severely reduced, and the estimate of the population-level intercept (alpha) also seems to be biased. This is something I am currently struggling to understand.