Poisson regression model for beginner

Hi @Riccardo_Zhu!

Looking at your model there are a few things that spring to mind:

  1. It’s not clear what you are trying to predict here. The linear term in your model specification isn’t in the original dataset and isn’t created in your code. I am going to take it as a given that you want to predict the number of claims, because the linked page for the data says ‘…For this data, the number of claims has a Poisson distribution’. You could instead aim to predict Rate, but a Poisson likelihood would probably not be the best choice for that. If that’s what you want, we can lean on the Binomial and discuss further.

  2. Do you have to use pm.Metropolis as the step? The default NUTS is state-of-the-art.

  3. Part of the reason the model sampled so slowly to begin with is the variable specification. Reading the linked document, it looks like all variables in the equation apart from Bonus are categorical, while your original approach treated them as continuous. You can have formulae force them to categorical as I do below, or use pd.Categorical to create categorical versions. The model will then have ‘dummy’ coefficients contrasting each category with the reference.
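As a quick illustration of the dummy coding in point 3, here is a minimal sketch using a toy stand-in for a categorical predictor like Zone (the values are made up, not from the insurance data):

```python
import pandas as pd

# Toy stand-in for a categorical predictor like Zone (hypothetical values)
df = pd.DataFrame({'Zone': [1, 2, 2, 3], 'Claims': [5, 3, 8, 2]})

# pd.Categorical marks the column as categorical; get_dummies with
# drop_first=True then drops the reference level (Zone 1 here),
# so each remaining column contrasts that zone against Zone 1
df['Zone'] = pd.Categorical(df['Zone'])
dummies = pd.get_dummies(df['Zone'], prefix='Zone', drop_first=True)
print(list(dummies.columns))  # ['Zone_2', 'Zone_3']
```

This is the same treatment coding that wrapping a variable in C() inside a formulae formula produces.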

My version of your model is below, sampling in about 30-40 seconds (I am using PyMC 5.5, and suggest you upgrade if you can!). I’ve kept Metropolis as the step, which gives me convergence warnings; with NUTS there are none:

import pandas as pd
import pymc as pm
from formulae import design_matrices

df = pd.read_csv('SwedishMotorInsurance.csv')

# Create the design matrix using formulae
# Wrapping the variable in 'C()' enforces categorical values, with the first entry as the reference
dm = design_matrices('Claims ~ C(Kilometres) + C(Zone) + Bonus + C(Make)', data=df, na_action='error')

# Expand out into X (ex) and y (en)
X = dm.common.as_dataframe()
y = dm.response.as_dataframe().squeeze() # This ensures y is 1-dimensional. 2D observed values cause issues for sampling as the model assumes it needs to predict for multiple observed responses.

# Create labelled coords which help the model keep track of everything
c = {'predictors': X.columns, # Model now aware of number of and labels of predictors
     'N': range(len(y))} # and number of observations

# Set up the model
with pm.Model(coords=c) as poisson_reg:
    
    # Set the data (if you want to), and convert to NumPy arrays
    Xdata = pm.MutableData('Xdata', X.to_numpy())
    Ydata = pm.MutableData('Ydata', y.to_numpy())
    
    # You use the same priors for each coefficient in the model, so this can be specified like this
    # If you wish to use different priors you must do so like in your original parameterisation
    beta = pm.Normal('beta', mu=0, sigma=10, dims='predictors')
    
    # Can now take the linear combination of Xdata with beta, and exponentiate it - Xdata contains the Intercept column so this is no problem
    theta = pm.math.exp(Xdata @ beta) # @ operator is for matrix multiplication 
    
    # Plug into the Poisson likelihood
    pm.Poisson('y', mu=theta, observed=Ydata)
    
    # Sample according to your specifications, but tune for longer and take fewer draws
    trace = pm.sample(draws=1000, tune=3000, cores=4, random_seed=45, step=pm.Metropolis())

I do get some bad r-hat values from this, which is problematic, but they go away with NUTS. The faster sampling does come at a cost: we now have to interpret the categorical coefficients for variables like Kilometres, which represent differences in the rate of the observed counts relative to the reference category. For example, in my output the coefficient of Zone 5 is -1.358, and exp(-1.358) ≈ 0.26, so changing from Zone 1 to Zone 5 changes the rate of claim counts by a factor of about 0.26, assuming all other values in the regression are held constant (beware that this implies Bonus == 0, Make == 1 and Kilometres == 1). Hope that helps!
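To make that interpretation concrete, the rate ratio can be computed directly from the coefficient (the -1.358 value is taken from the example output above):

```python
import numpy as np

# Coefficient for Zone 5 from the example output above
beta_zone5 = -1.358

# Exponentiating a Poisson regression coefficient gives a rate ratio
# relative to the reference category (Zone 1)
rate_ratio = np.exp(beta_zone5)
print(round(rate_ratio, 3))  # 0.257
```

The same exp() transformation applies to any coefficient in the model, including the continuous Bonus term (there it is the multiplicative change in rate per one-unit increase in Bonus).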

@GrantDaly does the categorical predictor approach help your model parameterisation too?