Hi there,
I’d like to make sure I’m understanding hierarchical models properly, so I’ve got a toy problem. Imagine we’ve got data for a city that we want to use to predict the probability of a crime occurring on any given day. The features are things like rainfall, unemployment, and outside temperature. The data is collected over the course of 5 years, so we have 60 months of data.
I could simply train a logistic regression model on all the data, but this would miss the correlation within months. I believe this is called a pooled model. Explicitly it would be p(crime|X) ~ X + eps, where eps is the error term. The problem here is that our error term ends up correlated with X.
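For concreteness, here is a minimal sketch of what I mean by the pooled model in pymc3, assuming X is an (n_days, n_features) feature matrix and y is a binary crime/no-crime array (the names are just illustrative):

```python
import pymc3 as pm

# Assumed inputs:
# X: (n_days, n_features) feature matrix (rainfall, unemployment, temperature, ...)
# y: (n_days,) binary array, 1 if there was a crime that day
with pm.Model() as pooled_model:
    a = pm.Normal("a", mu=0.0, sigma=10.0)                    # single shared intercept
    b = pm.Normal("b", mu=0.0, sigma=10.0, shape=X.shape[1])  # single shared slope per feature
    p = pm.math.sigmoid(a + pm.math.dot(X, b))                # p(crime | X)
    pm.Bernoulli("obs", p=p, observed=y)
    trace = pm.sample(1000, tune=1000)
```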
What we really want to find is p(crime_i|X_i) ~ X_i + eps_i + month_j, i.e. a hierarchical model. I think this is usually written in pymc3 by introducing separate parameters for each month that are centered on shared group-level priors, and training with the groups given by month_idx. Something like crime = a[month_idx] + b[month_idx]*X.
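For example, I think the hierarchical version in pymc3 would look roughly like this (again just a sketch: month_idx is an integer array mapping each day to its month group, n_months is the number of groups, e.g. the 60 observed months or the 12 calendar months, and the per-month a and b are drawn from shared hyperpriors):

```python
import pymc3 as pm

# Assumed inputs:
# X: (n_days, n_features), y: (n_days,) binary
# month_idx: (n_days,) integer array giving the month group of each day
# n_months: number of month groups
with pm.Model() as hierarchical_model:
    # group-level hyperpriors that the per-month parameters are centered on
    mu_a = pm.Normal("mu_a", mu=0.0, sigma=10.0)
    sigma_a = pm.HalfNormal("sigma_a", sigma=1.0)
    mu_b = pm.Normal("mu_b", mu=0.0, sigma=10.0, shape=X.shape[1])
    sigma_b = pm.HalfNormal("sigma_b", sigma=1.0, shape=X.shape[1])

    # per-month intercepts and slopes, partially pooled towards the hyperpriors
    a = pm.Normal("a", mu=mu_a, sigma=sigma_a, shape=n_months)
    b = pm.Normal("b", mu=mu_b, sigma=sigma_b, shape=(n_months, X.shape[1]))

    logit_p = a[month_idx] + (b[month_idx] * X).sum(axis=1)
    pm.Bernoulli("obs", p=pm.math.sigmoid(logit_p), observed=y)
    trace = pm.sample(1000, tune=1000, target_accept=0.9)
```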
I’ve played around with this on just a few months of data with simple linear models, but it’s already quite slow. I can imagine training a neural network model this way would be virtually intractable. Is there a way around this?
I was thinking another solution could be to train a model that uses the month information at training time:
crime_hat = p(crime|X,month)
but then whenever we come to make a prediction and we don’t have the month info (or we want to see the behaviour without it), we just integrate over all months, e.g.
crime_hat = average_over_months { p(crime|X, month) }.
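As a sketch of what I have in mind (using sklearn’s LogisticRegression purely for illustration, with month one-hot encoded; the names and the uniform 1/12 weighting over months are my assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: X is (n_days, n_features), y is binary, and
# month is an (n_days,) integer array with values 0..11 (calendar month).
n_months = 12
month_onehot = np.eye(n_months)[month]          # one-hot encode the month

# p(crime | X, month): month is just another (one-hot) feature at training time
model = LogisticRegression(max_iter=1000)
model.fit(np.hstack([X, month_onehot]), y)

def predict_without_month(x_new, month_weights=None):
    """Average p(crime | x_new, month) over the months.

    month_weights defaults to uniform (1/12 each); the empirical month
    frequencies could be used instead if months are not equally likely.
    """
    if month_weights is None:
        month_weights = np.full(n_months, 1.0 / n_months)
    probs = np.array([
        model.predict_proba(
            np.hstack([x_new, np.eye(n_months)[m]]).reshape(1, -1)
        )[0, 1]
        for m in range(n_months)
    ])
    return float(np.dot(month_weights, probs))
```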
Is this a valid approach? I’ve not seen it anywhere. I’m also not sure how I’d quantify my errors if I do this. Thanks.