Hi @Riccardo_Zhu!
Looking at your model there are a few things that spring to mind:
- It's not clear what you are trying to predict here. The `linear` term in your model specification isn't in the original dataset and isn't created in your code. I am going to take it as a given that you want to predict the number of claims, because the page linked to the data says '…For this data, the number of claims has a Poisson distribution'. You could aim to predict `Rate` instead, but a Poisson likelihood would probably not be the best choice for that. If that's what you want, we can lean on the `Binomial` and discuss further.
- Do you have to use `pm.Metropolis` as the step? The default NUTS sampler is state-of-the-art.
- Part of the reason the model sampled so slowly to begin with is the variable specification. Reading the linked document, it looks like all variables in the equation apart from `Bonus` are categorical, while the original approach assumed they were continuous. You can have `formulae` force them to categories as I do below, or use `pd.Categorical` to assign categorical versions (see the sketch right after this list). The model will then have 'dummy' coefficients contrasting each category with the reference.
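
For the `pd.Categorical` route, a minimal sketch (assuming the column names from the linked CSV) would be:

```python
import pandas as pd

df = pd.read_csv('SwedishMotorInsurance.csv')

# Convert the predictors to a categorical dtype in pandas itself,
# so a plain formula (no C() wrappers) will dummy-code them
for col in ['Kilometres', 'Zone', 'Make']:
    df[col] = pd.Categorical(df[col])
```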
My version of your model is below; it samples in about 30-40 seconds (I am using PyMC 5.5, and suggest you upgrade if you can!). I've kept Metropolis as the step and still get some convergence warnings, but none with NUTS:
```python
import pandas as pd
import pymc as pm
from formulae import design_matrices

df = pd.read_csv('SwedishMotorInsurance.csv')

# Create the design matrix using formulae
# Wrapping a variable in 'C()' enforces categorical coding, with the first level as the reference
dm = design_matrices('Claims ~ C(Kilometres) + C(Zone) + Bonus + C(Make)', data=df, na_action='error')

# Expand out into X (exogenous) and y (endogenous)
X = dm.common.as_dataframe()
y = dm.response.as_dataframe().squeeze()  # Ensures y is 1-dimensional; 2D observed values cause
                                          # issues for sampling, as the model assumes it needs to
                                          # predict multiple observed responses

# Create labelled coords which help the model keep track of everything
c = {'predictors': X.columns,  # Model is now aware of the number and labels of predictors
     'N': range(len(y))}       # and the number of observations

# Set up the model
with pm.Model(coords=c) as poisson_reg:

    # Set the data (if you want to), and convert to NumPy arrays
    Xdata = pm.MutableData('Xdata', X.to_numpy())
    Ydata = pm.MutableData('Ydata', y.to_numpy())

    # You use the same prior for every coefficient, so this can be specified in one line;
    # if you want different priors, specify them separately as in your original parameterisation
    beta = pm.Normal('beta', mu=0, sigma=10, dims='predictors')

    # Take the linear combination of Xdata with beta and exponentiate it; @ is matrix
    # multiplication, and Xdata already contains the Intercept column so this is no problem
    theta = pm.math.exp(Xdata @ beta)

    # Plug into the Poisson likelihood
    pm.Poisson('y', mu=theta, observed=Ydata)

    # Sample according to your specifications, but tune for longer and take fewer draws
    trace = pm.sample(draws=1000, tune=3000, cores=4, random_seed=45, step=pm.Metropolis())
```
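
To try NUTS instead, just drop the `step` argument and PyMC will select it by default. Something like the following (the tuning settings here are just a guess; NUTS typically needs far fewer tuning steps than Metropolis):

```python
# Re-use the same model context; omitting `step` lets PyMC default to NUTS
with poisson_reg:
    trace_nuts = pm.sample(draws=1000, tune=1000, cores=4, random_seed=45)
```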
I do get some bad r-hat values from this, which is problematic, but they go away with NUTS. The faster sampling comes at a cost: we now have to interpret the categorical coefficients for variables like Kilometres, which represent differences in the (log) rate of the observed counts relative to the reference category. For example, in my output the coefficient of Zone 5 is -1.358, and exp(-1.358) ≈ 0.25, so changing from Zone 1 to Zone 5 changes the rate of claim counts by a factor of 0.25, assuming all other values in the regression are held constant (beware that this implies zero bonus, make == 1, and kilometres == 1). Hope that helps!
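
If it's useful, `az.summary` reports r-hat alongside the posterior means, and exponentiating the means gives you the multiplicative rate factors directly; a quick sketch:

```python
import arviz as az
import numpy as np

# Posterior summary; the r_hat column flags convergence problems
summ = az.summary(trace, var_names=['beta'])

# Exponentiate the posterior means to read each coefficient as a multiplicative
# effect on the claim rate, relative to the reference category of its variable
print(np.exp(summ['mean']))
```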
@GrantDaly does the categorical predictor approach help your model parameterisation too?