Hello,
I am building a Poisson regression model and I am confused about what I think is a really basic principle of generalized linear modeling. Specifically, a Poisson model for count data allows for heteroscedastic variances in the counts due to the fact that the variance is equal to the mean, which is the rate parameter. In the GLM context you are modeling the log of the rate parameter as a normal distribution. E.g.
y_i \sim Poisson(\lambda_i)
\lambda_i = \exp(\mu_i)
\mu_i \sim \mathcal{N}(x_i\beta, \sigma^2)
\sigma \sim InvGamma(\alpha, \beta)
etc…
Given the fact that by using the Poisson model you are already assuming heteroscedastic errors, would it be wrong to model the error associated with the \mu parameter (\sigma^2) individually for each observation? E.g.
y_i \sim Poisson(\lambda_i)
\lambda_i = \exp(\mu_i)
With subscripts on \sigma^2:
\mu_i \sim \mathcal{N}(x_i\beta, \sigma_i^2)
\sigma_i \sim InvGamma(\alpha, \beta)
In OLS I would assume that this would be pretty bad, but in the count model I’m working on I have found that it is much more robust during fitting (no divergences) and accurate (posterior predictive checks are closer to the actual counts) when modeling the observations with their own \sigma's as opposed to finding a single \sigma for all of the \mu's.
Any insight into this would be greatly appreciated. This is also related to another post I made about the same model, however I think this question is more general than the one I previously posted.
Hi Andrew,
Yeah it makes sense to me that the second model works better: I think there is a lot of over-dispersion in your data, which is very common in Poisson models – counts are very often over-dispersed relative to Poisson expectations.
What your second model does is basically allowing each observation to have its own Poisson rate, which enables to get more of the variation in the data. This is the idea behind gamma-Poisson (aka negative-binomial) models – look at paragraphs 11H1 and 11H2 of this NB for a detailed explanation.
Even better would be to use a hierarchical structure: each cluster would have its own expected Poisson rate, and this rate would both inform and be informed by the higher-level parameters of the population (i.e the parameters across clusters).
Hope this helps 
1 Like
HI @AlexAndorra, thanks again for your suggestions on this! I did try fitting the model with a negative binomial but the issues I was facing in my previous post were not resolved (predictions weren’t very good and there were many divergences unless I included a variance parameter for each observation in the linear term as described in this post).
I’m not sure I follow on the hierarchical structure comment. How would you define the clusters? Would you just use the intersection of all the factor variables to define the clusters and find separate intercepts and coefficients for the covariates within each cluster? How would this be different than the dummy variable approach?
Please excuse my ignorance on this. I have definitely looked at some of the tutorials on hierarchical regression and partial pooling on the pymc3 website, but I’m still not clear on the differences between the hierarchical approach and using dummy variables with an overall intercept.
Yeah I’m not a fan of this kind of mixture models – they are quite hard to fit and they don’t have the same flexibility as hierarchical models.
I don’t know your use-case and data, and it’s a big topic so I can’t expand much on it here, but the clusters can basically be any non-ordered entity (individual, county, state, year, hospital…).
I’m not sure what you call the “the dummy variable approach”, but I think it’s close to the hierarchical approach indeed – except that in the latter you use index variables (much more flexible), and you also have hyper priors that modelize the overall population (ie. across clusters), which pools information across clusters and results in shrinkage of within-cluster parameters towards the overall mean.
If this could be useful to you, I really recommend chapter 12 of McElreath’s Rethinking (1st edition). It’s a very thorough and pedagogical explanation – here is the port of this chapter to PyMC3.
I also updated the radon-level notebook on PyMC’s website, but it’s not online yet.