About the bias: I completely missed the `mx` term in the data generation; I guess that explains it.
So the model and the actual data-generating process did not match up exactly, and there was no reason to expect a parameter of the data-generating process to take the exact same value as a different parameter in the model that merely happens to share its name and a similar interpretation.
As Jesse showed, we can change the model to match the actual data-generating process, and if we do, the results line up. In that case the model isn't really a simple linear regression anymore, but something a little different.
I'm actually a bit confused about the `generate` function and the causal model behind it. I can't think of a setting where that exact causal model would make sense.
Maybe we can use this to improve the example a bit. I think it would help a great deal to use a more concrete application where something like this might turn up (still synthetic data, but an application that could guide us toward a more realistic model).
So perhaps something along these lines:
We generate data using the following causal model (but hide the details from the reader): we measure student grades (or some related quantity that indicates how well students are doing in absolute terms). For each student we also know how much money was spent on their education. Each student comes from a school (that's our group). Each school has a latent quantity, something like "education quality". Schools with higher quality tend to have better student outcomes and also higher spending per student. But within each school, more money is spent on students with worse grades.
That way we should get our Simpson's paradox: within each group there is a negative correlation between spending and grades, but globally the correlation is positive.
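To make this concrete, here is a minimal sketch of that data-generating process. All coefficients, noise scales, and variable names here are placeholders I made up, just to check that the sign flip actually comes out:

```python
import numpy as np

rng = np.random.default_rng(42)

n_schools = 8
n_students = 50  # per school

# Latent school-level "education quality".
quality = rng.normal(0.0, 1.0, size=n_schools)

# School index for each student.
school = np.repeat(np.arange(n_schools), n_students)

# Grades improve with school quality.
grades = quality[school] + rng.normal(0.0, 0.5, size=school.size)

# Spending rises with school quality, but within a school more money
# goes to students with worse grades.
spending = 2.0 * quality[school] - grades + rng.normal(0.0, 0.3, size=school.size)
```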
We then start the example by investigating the dataset from a single school and fitting a linear model, observing a negative correlation between spending and grades.
We then extend the dataset to multiple schools and observe a positive correlation.
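A quick sanity check along those lines, reusing the arrays from the sketch above (`np.polyfit` here is just a stand-in for whatever fitting approach the notebook actually uses):

```python
# Within-school slopes should come out negative, the pooled slope positive.
for s in range(n_schools):
    m = school == s
    print(f"school {s}: slope {np.polyfit(spending[m], grades[m], 1)[0]:+.2f}")

print(f"pooled:   slope {np.polyfit(spending, grades, 1)[0]:+.2f}")
```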
Then we fit a mixed model as in the current notebook.
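For that step, I'm imagining something like a varying-intercept model. This is just my own sketch in PyMC, not the notebook's actual code, and all priors and names are placeholders:

```python
import pymc as pm

with pm.Model(coords={"school": np.arange(n_schools)}) as mixed_model:
    # Population distribution of school intercepts.
    mu_a = pm.Normal("mu_a", 0.0, 2.0)
    sigma_a = pm.HalfNormal("sigma_a", 2.0)
    a = pm.Normal("a", mu_a, sigma_a, dims="school")

    # Common within-school spending effect.
    b = pm.Normal("b", 0.0, 2.0)
    sigma = pm.HalfNormal("sigma", 1.0)

    pm.Normal("obs", a[school] + b * spending, sigma, observed=grades)
    idata = pm.sample()
```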
We can then use a posterior predictive check to show that the clear positive global correlation isn't picked up by the model if we resample schools, which could then lead us to include a latent variable and model the whole generative process.
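The "resample schools" part could look something like this: draw fresh intercepts from the learned population distribution instead of reusing the fitted per-school ones, and compare the global correlation in the replicated data against the observed one (again just a sketch, assuming the hypothetical model above):

```python
post = idata.posterior.stack(sample=("chain", "draw"))
idx = rng.choice(post.sizes["sample"], size=500, replace=False)

corrs = []
for i in idx:
    # New, unseen schools: fresh intercepts from the population distribution.
    a_new = rng.normal(float(post["mu_a"][i]), float(post["sigma_a"][i]), n_schools)
    grades_rep = (
        a_new[school]
        + float(post["b"][i]) * spending
        + rng.normal(0.0, float(post["sigma"][i]), school.size)
    )
    corrs.append(np.corrcoef(spending, grades_rep)[0, 1])

print("observed global corr:    ", np.corrcoef(spending, grades)[0, 1])
print("replicated (new schools):", np.mean(corrs))
```

Because the replicated intercepts are no longer tied to the spending values, the strong global correlation should vanish, which is exactly the mismatch we want the check to expose.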
Do you think that makes more sense than the current example? Or do you have any other ideas for how we could improve it?