Hi there,

I’d like to make sure I’m understanding hierarchical models properly, so I’ve got a toy problem. Imagine we’ve got data for a city, and we want to predict the probability of there being a crime on any one day. The model uses a bunch of features, like rainfall, unemployment, and outside temperature. The data is collected over the course of 5 years, so we have 60 months of data.

I could simply train a logistic regression model on all the data; however, this would miss the correlation within months. I believe this is called a pooled model. Explicitly it would be p(crime|X) ~ X + eps, where eps is the error term. The problem here is that our error term is correlated with X.

What we really want to find is p(crime_i|X_i) ~ X_i + eps_i + month_j, where j is the month that observation i falls in. I think this is usually written in pymc3 by introducing new variables for each month that are centered around the variables of interest, and training with groups given by the month index. Something like crime = a[month_idx] + b[month_idx]*X.
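To make the indexing in `a[month_idx] + b[month_idx]*X` concrete, here is a minimal NumPy sketch of the generative structure only. It simulates data and does not fit anything, and it is not pymc3 code. The single feature `x`, the 30 days per month, the logistic link, and all hyperparameter values are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_months = 60          # 5 years of monthly groups, as in the question
days_per_month = 30    # assumption: equal-length months for simplicity

# Group index: which month each daily observation belongs to
month_idx = np.repeat(np.arange(n_months), days_per_month)

# Hypothetical single standardised feature (e.g. rainfall)
x = rng.normal(size=month_idx.size)

# Population-level parameters; the values are made up
mu_a, sigma_a = -1.0, 0.5
mu_b, sigma_b = 0.8, 0.3

# One intercept and one slope per month, drawn around the shared means
a = rng.normal(mu_a, sigma_a, size=n_months)
b = rng.normal(mu_b, sigma_b, size=n_months)

# Each observation picks up the parameters of its own month
logit_p = a[month_idx] + b[month_idx] * x
p = 1.0 / (1.0 + np.exp(-logit_p))   # logistic link (assumed)

# Simulated binary outcome: crime (1) or no crime (0) on each day
crime = rng.binomial(1, p)
```

In a hierarchical model, `mu_a`, `sigma_a`, `mu_b`, and `sigma_b` would be given priors and inferred together with `a` and `b`, rather than fixed as they are here.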

I’ve played around with a few months for this with simple linear models, but it’s already quite slow. I can imagine training a neural network model would be virtually intractable. Is there a way around this?

I was thinking another solution could be to train a model that uses the month information at training time:
crime_hat = p(crime|X,month)

but then whenever we come to make a prediction, and we don’t have the months info (or we want to see the behaviour without it) we just integrate over all months, e.g.
crime_hat = average_over_months { p(crime|X, month) }.
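The averaging step can be sketched as follows, assuming a month-aware model has already been fitted and reduces to per-month intercepts `a` and slopes `b` on the logit scale. Both arrays are hypothetical here and filled with random placeholders, and weighting every month equally is also an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n_months = 60

# Hypothetical fitted per-month parameters (random placeholders, not real estimates)
a = rng.normal(-1.0, 0.5, size=n_months)
b = rng.normal(0.8, 0.3, size=n_months)

def p_crime_given_month(x, a, b):
    """p(crime | x, month) for every month; result has shape (n_months, n_points)."""
    x = np.atleast_1d(x)
    logit = a[:, None] + b[:, None] * x[None, :]
    return 1.0 / (1.0 + np.exp(-logit))

def p_crime_marginal(x, a, b, weights=None):
    """Average p(crime | x, month) over months (uniform weights by default)."""
    return np.average(p_crime_given_month(x, a, b), axis=0, weights=weights)

x_new = np.array([-1.0, 0.0, 1.0])          # hypothetical feature values
p_by_month = p_crime_given_month(x_new, a, b)
p_avg = p_crime_marginal(x_new, a, b)
```

Note that the average is taken over probabilities, which is generally not the same as plugging the average of `a` and `b` into the link, because the logistic function is nonlinear. The spread of `p_by_month` across its first axis gives a rough picture of how much predictions vary from month to month.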

Is this a valid approach? I’ve not seen it anywhere. I’m also not sure how I’d quantify my errors if I do this. Thanks.

This reminds me of a recent blog post, Latent GP and Binomial Likelihood.

I think a similar approach could apply here. The fundamental idea is that you want to capture region- and time-specific effects (random effects, in classical mixed-effects model terminology) as well as the general effect of your features (e.g., rainfall, unemployment, and outside temperature in your case, or fixed effects). You can potentially have a very flexible random effect structure (e.g., spatial correlation among different cities, temporal correlation among different months), which comes down to constructing a large covariance matrix that captures these correlations.
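One way to picture the covariance-matrix idea for the temporal part is the NumPy sketch below. It builds a covariance over the 60 month indices and draws one set of correlated month effects from it. The squared-exponential kernel and the values of `eta` and `ell` are assumptions for illustration, not something taken from the blog post.

```python
import numpy as np

rng = np.random.default_rng(2)

n_months = 60
t = np.arange(n_months, dtype=float)   # month index used as the time coordinate

eta = 0.5   # assumed marginal standard deviation of the month effect
ell = 3.0   # assumed length scale, in months

# Squared-exponential covariance: nearby months are strongly correlated
diff = t[:, None] - t[None, :]
K = eta**2 * np.exp(-0.5 * (diff / ell) ** 2)

# Small diagonal jitter keeps the Cholesky factorisation numerically stable
L = np.linalg.cholesky(K + 1e-6 * np.eye(n_months))

# One draw of correlated month-level random effects, shape (n_months,)
month_effect = L @ rng.normal(size=n_months)
```

A draw like `month_effect` would take the place of independent per-month intercepts, so that adjacent months share information.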

For spatial correlation in the random effect, you can also have a look at the CAR model in pymc3.
