About the bias: I completely missed the `mx` term in the data generation; I guess that explains it.
So the model and the actual data-generating process did not match up exactly, and there was no reason to expect a parameter of the data-generating process to take the exact same value as a different parameter in the model that merely happens to share its name and a similar interpretation.
As Jesse showed, we can change the model to match the actual data-generating process, and if we do, the results line up. In that case the model isn't really a simple linear regression anymore, but something a little different.
I'm actually a bit confused about the `generate` function and the causal model behind it. I can't think of a setting where that exact causal model would make sense.
Maybe we can use this to improve the example a bit. I think it would help a great deal to use a more concrete application where something like this might turn up (still synthetic data, but an application that could guide us toward a more realistic model).
So perhaps something along these lines:
We generate data using the following causal model (but hide the details from the reader): we measure student grades (or some related quantity that indicates how well students are doing in absolute terms). For each student we also know how much money was spent on their education. Each student comes from a school (that's our group). Each school has a latent quantity, something like "education quality". Schools with higher quality tend to have better student outcomes and also higher spending per student. But within each school, more money is spent on students with worse grades.
That way we should get our Simpson's paradox: within each group there is a negative correlation between spending and grades, but globally the correlation is positive.
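To make this concrete, here is a minimal sketch of that data-generating process. All coefficients, noise scales, and variable names here are placeholders I made up, just to check that the sign flip actually comes out:

```python
import numpy as np

rng = np.random.default_rng(42)

n_schools = 8
n_students = 50  # per school

# Latent school-level "education quality".
quality = rng.normal(0.0, 1.0, size=n_schools)

# School index for each student.
school = np.repeat(np.arange(n_schools), n_students)

# Grades improve with school quality.
grades = quality[school] + rng.normal(0.0, 0.5, size=school.size)

# Spending rises with school quality, but within a school more money
# goes to students with worse grades.
spending = 2.0 * quality[school] - grades + rng.normal(0.0, 0.3, size=school.size)
```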
We then start the example by investigating the dataset from a single school and fitting a linear model, observing a negative correlation between spending and grades.
We then extend the dataset to multiple schools and observe a positive correlation.
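A quick sanity check along those lines, reusing the arrays from the sketch above (`np.polyfit` here is just a stand-in for whatever fitting approach the notebook actually uses):

```python
# Within-school slopes should come out negative, the pooled slope positive.
for s in range(n_schools):
    m = school == s
    print(f"school {s}: slope {np.polyfit(spending[m], grades[m], 1)[0]:+.2f}")

print(f"pooled:   slope {np.polyfit(spending, grades, 1)[0]:+.2f}")
```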
Then we fit a mixed model as in the current notebook.
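For that step, I'm imagining something like a varying-intercept model. This is just my own sketch in PyMC, not the notebook's actual code, and all priors and names are placeholders:

```python
import pymc as pm

with pm.Model(coords={"school": np.arange(n_schools)}) as mixed_model:
    # Population distribution of school intercepts.
    mu_a = pm.Normal("mu_a", 0.0, 2.0)
    sigma_a = pm.HalfNormal("sigma_a", 2.0)
    a = pm.Normal("a", mu_a, sigma_a, dims="school")

    # Common within-school spending effect.
    b = pm.Normal("b", 0.0, 2.0)
    sigma = pm.HalfNormal("sigma", 1.0)

    pm.Normal("obs", a[school] + b * spending, sigma, observed=grades)
    idata = pm.sample()
```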
We can then use a posterior predictive check to show that the clear positive global correlation isn't picked up by the model if we resample schools, which could then lead us to include a latent variable and model the whole generative process.
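The "resample schools" part could look something like this: draw fresh intercepts from the learned population distribution instead of reusing the fitted per-school ones, and compare the global correlation in the replicated data against the observed one (again just a sketch, assuming the hypothetical model above):

```python
post = idata.posterior.stack(sample=("chain", "draw"))
idx = rng.choice(post.sizes["sample"], size=500, replace=False)

corrs = []
for i in idx:
    # New, unseen schools: fresh intercepts from the population distribution.
    a_new = rng.normal(float(post["mu_a"][i]), float(post["sigma_a"][i]), n_schools)
    grades_rep = (
        a_new[school]
        + float(post["b"][i]) * spending
        + rng.normal(0.0, float(post["sigma"][i]), school.size)
    )
    corrs.append(np.corrcoef(spending, grades_rep)[0, 1])

print("observed global corr:    ", np.corrcoef(spending, grades)[0, 1])
print("replicated (new schools):", np.mean(corrs))
```

Because the replicated intercepts are no longer tied to the spending values, the strong global correlation should vanish, which is exactly the mismatch we want the check to expose.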
Do you think that makes more sense than the current example? Or do you have any other ideas for how we could improve it?