Insurance Loss Reserving Model

Hi Forum,

I am pretty new to Bayesian statistics and pymc3, so apologies if I come across as really bad at it… and thanks for your help anyway!
I am trying to use a hierarchical model to determine the distribution of losses incurred month after month by an insurance company. I have found some documentation with models written in Stan (which I have never used, whereas I quite like Python!), and essentially the idea is:

  • for each cohort, determine a cohort-specific distribution of what can be called an ‘ultimate loss’; by cohort, I mean a group of insurance policies with somewhat similar characteristics. Practically speaking, a cohort groups policies issued during the same month, quarter or year. In my case, a cohort is for instance 2019_1, which means ‘Jan 2019’. There are therefore as many Ultimate_Loss distributions as cohorts. Let’s say the dimension is nb_cohorts,
  • determine an omega and a theta distribution that apply to all cohorts, so that the losses we observe as time passes follow the equation Avg_Loss_Cohort_i_Period_k = Ultimate_Loss_Cohort_i * (1 - exp(-(k/theta)**omega)) (see the sketch after this list),
  • define a sigma distribution so that sigma_i_k = sigma * sqrt(Avg_Loss_Cohort_i_Period_k), to account for the lower variability of losses for early observations, where sigma follows a half-normal distribution,
  • calibrate losses so that losses_i_k follow a Gaussian distribution with mu = Avg_Loss_Cohort_i_Period_k and sigma = sigma_i_k.
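
To make the mean and noise equations concrete, here is a minimal NumPy sketch of the curve described above (the function name and parameter values are mine, purely for illustration):

import numpy as np

def weibull_growth(t, theta, omega):
    # fraction of the ultimate loss expected to have emerged by period t
    return 1.0 - np.exp(-((t / theta) ** omega))

# hypothetical values, for illustration only
ultimate_loss_i = 60.0
theta, omega, sigma = 8.0, 1.3, 2.0

t = np.arange(1, 21)                                  # development periods 1..20
avg_loss_i = ultimate_loss_i * weibull_growth(t, theta, omega)
sigma_i = sigma * np.sqrt(avg_loss_i)                 # noise scale grows with the expected loss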

In practice, the observations for ‘old cohorts’ range from 1 to 20, as we already have 20 months of observations, while for a cohort that is x months old, we only have x months of observations. In other words, for the most recent cohort, we only have one observation. So, whilst time ranges from 1 to 20 for an old cohort, it only takes one value for the most recent one: 1.
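
To make the triangular shape concrete, a tiny made-up example of the long-format data (the column names are my assumption) would look like this, with recent cohorts simply having fewer rows:

import pandas as pd

# hypothetical long-format loss data: one row per (cohort, period) observed so far
data = pd.DataFrame({
    'Cohort': ['2019_1', '2019_1', '2019_1', '2019_2', '2019_2', '2019_3'],
    'Period': [1, 2, 3, 1, 2, 1],
    'Losses': [10.2, 18.5, 24.1, 9.8, 17.9, 11.0],
})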

I am struggling to set up the data correctly to account for the triangular shape of the data. I have it in a vectorized (long) form, one row per cohort/period observation.

Here is a bit of code, but obviously, even if it runs, it does not produce anything relevant.

import numpy as np
import pandas as pd
import pymc3 as pm


def Cohorts_Losses_Calibration(inputs):

    cohorts = pd.DataFrame(inputs['Cohort'])
    Periods = pd.DataFrame(inputs['Period']).to_numpy()
    Losses = pd.DataFrame(inputs['Losses']).to_numpy()  # loss column (name assumed)

    uniquecohorts = np.unique(cohorts)
    nbcohorts = len(uniquecohorts)
    uniquePeriods = np.unique(Periods)
    nbPeriods = len(uniquePeriods)

    Periods = Periods[:, None]  # add an axis to Periods (not used below)

    with pm.Model() as model:

        omega = pm.HalfNormal("omega", sigma=3)
        theta = pm.HalfNormal("theta", sigma=5)
        ultimate_loss = pm.TruncatedNormal("ultimate_loss", mu=60, sigma=10,
                                           lower=10, upper=110,
                                           shape=(nbcohorts, 1))
        sigma = pm.HalfNormal("sigma", sigma=20)

        # broadcast (nbcohorts, 1) against (nbPeriods,) -> (nbcohorts, nbPeriods)
        mu_i_t = pm.Deterministic("mu_i_t", ultimate_loss * (1 - pm.math.exp(-((uniquePeriods / theta) ** omega))))
        sigma_i_t = pm.Deterministic("sigma_i_t", sigma * mu_i_t ** 0.5)

        Losses_i_t = pm.Normal("Losses_i_t",
                               mu=mu_i_t,
                               sigma=sigma_i_t,
                               observed=Losses,
                               shape=(nbcohorts, nbPeriods))

        results = pm.sample(draws=50, tune=50, cores=8)

    return results
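
For completeness, calling it would look roughly like this (the file name and column names are assumptions based on the description above):

# hypothetical spreadsheet with Cohort / Period / Losses columns
inputs = pd.read_excel('losses.xlsx')
trace = Cohorts_Losses_Calibration(inputs)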

Hey, I was wondering if you ever got your model working? I’ve also attempted to port something like the Stan insurance model found here: Modelling Loss Curves in Insurance with RStan

I have a model which samples and seems to estimate the loss ratios correctly, but I haven’t been able to successfully emulate the estimation of the growth factors:

The data is set up like in the R model:

## Set up as Stan example
n_data = len(usetable)
n_time = len(usetable['dev_lag'].unique())
n_cohort = len(usetable['acc_year'].unique())
cohort_id, cohort = pd.factorize(usetable['acc_year'])
cohort_maxtime = usetable.groupby('acc_year')['dev_year'].max().to_list()
t_values = list(np.sort(usetable['dev_lag'].unique()).astype(int))
t_idx, _ = pd.factorize(usetable['dev_lag'])
premium = usetable.groupby('acc_year')['premium'].mean().to_list()
loss_real = usetable['cum_loss']
coords = {"t_idx": t_idx, "cohort": cohort, "t_values": t_values, 'obs': range(n_data),'cohort_id':cohort_id}
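
For anyone following along, usetable here is assumed to be a long-format DataFrame with one row per accident year / development lag pair; a made-up two-row example (the values are not real data):

usetable = pd.DataFrame({
    'acc_year': [2010, 2010],
    'dev_year': [2010, 2011],
    'dev_lag':  [1, 2],
    'premium':  [1000.0, 1000.0],
    'cum_loss': [150.0, 320.0],
})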

### Model

with pm.Model(coords=coords) as basic_model:

    # Priors for unknown model parameters
    mu_LR = pm.Normal('mu_LR', 0, 0.5)
    sd_LR = pm.Lognormal('sd_LR', 0, 0.5)

    LR = pm.Lognormal('LR', mu_LR, sd_LR, dims='cohort')

    loss_sd = pm.Lognormal('loss_sd', 0, 0.7)

    ## Parameters for the growth factor
    omega = pm.Lognormal('omega', 0, 0.5)
    theta = pm.Lognormal('theta', 0, 0.5)
    t = pm.Data("t_values", t_values)
    gf = pm.Deterministic('growth_factor', 1 - pm.math.exp(-(t / theta) ** omega))

    ## Premium
    prem = pm.Data("premium", premium)

    ## Expected cumulative loss per observation
    lm = pm.Deterministic('lm', LR[cohort_id] * prem[cohort_id] * gf[t_idx])

    # Likelihood (sampling distribution) of observations
    loss = pm.Normal('loss', lm, loss_sd * prem[cohort_id], observed=loss_real)

    prior_checks = pm.sample_prior_predictive(samples=50, random_seed=100)

    trace = pm.sample(return_inferencedata=True, tune=1000)
    ppc = pm.sample_posterior_predictive(
        trace, var_names=["loss", "LR", "lm"], random_seed=100)
        

This gives a model with the expected structure (model graph image omitted).

But I must have done something wrong, since it doesn’t get the parameter estimates for the growth factors right. Would love any help people can give, either on this particular model or on porting Stan models to pymc3 more generally.
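
For reference, the growth-factor estimates can be compared against the Stan fit with a posterior summary pulled straight from the trace, e.g. (a minimal sketch; arviz is installed alongside pymc3):

import arviz as az

# posterior means and intervals for the growth-curve parameters and the noise scale
print(az.summary(trace, var_names=['omega', 'theta', 'loss_sd']))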

Thanks in advance!

Hey Nat,

Yes, it works, thanks for asking! Just be careful with some dependencies between packages. You need pymc3 and theano for sure, and probably numpy and pandas to manage the data.
Packages I have used: theano 1.0.5 / pymc3 3.9 / pandas 1.1.3

So, I have created a class ‘calibrator’ which reads the inputs and arranges the data a bit as a first step. The data comes from an Excel spreadsheet with 3 columns (Cohort, Period, Data) and is stored in a pandas DataFrame.

import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt


class calibrator():
    # Virtual class

    def __init__(self, *args, summary=False):

        self.data = args[0]      # long-format DataFrame: cohort / period / data
        self.tunes = args[1]
        self.samples = args[2]

        self.summary = summary

        # lists of parameter names used by each model
        self.specificparams = []
        self.commonparams = []

        self.cohorts_idx, self.cohorts = pd.factorize(self.data['cohort'])
        self.periods_idx, self.periods = pd.factorize(self.data['period'])

        self.coords = {
            "cohort": self.cohorts,
            "period": self.periods,
            "collections": np.arange(len(self.cohorts_idx))
            }
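
For reference, pd.factorize returns one integer code per row plus the unique labels, e.g. (made-up cohort labels):

codes, uniques = pd.factorize(pd.Series(['2019_1', '2019_1', '2019_2']))
# codes   -> array([0, 0, 1])             one integer index per row
# uniques -> Index(['2019_1', '2019_2'])  the unique cohorts, in order of appearance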

As I have more than one model to run, each of them inherits from calibrator (which also has a “calibrate” method that is overridden by each subclass). The “calibrate” function does the job, then simulates the posterior predictive checks (“ppc”). It saves everything in a DataFrame that I store in a database after each run, which allows me to retrieve older curves, determine highest probability densities, etc., without needing to rerun.

class weibullcalibrator(calibrator):

    def __init__(self, *args, summary=False):
        calibrator.__init__(self, *args, summary=summary)

        self.specificparams = ['ult']  # the only cohort-specific param we need to retrieve afterwards
        self.commonparams = ['omega', 'theta']

    def calibrate(self):

        """
        Model description:

        CumDraws[i] = UltimateDraws * (1 - exp(-(T/Theta)**Omega))
        """

        with pm.Model(coords=self.coords) as model:

            # common parameters
            omega = pm.HalfNormal("omega", sigma=4)
            theta = pm.HalfNormal('theta', sigma=4)
            sigma = pm.HalfNormal("sigma", sigma=10)
            # cohort-specific parameter
            u = pm.HalfNormal('ult', sigma=50, dims='cohort')

            # map the unique period values back onto each observation row
            mu_it = u[self.cohorts_idx] * (1 - tt.exp(-((self.periods[self.periods_idx].to_numpy() / theta) ** omega)))
            sigma_it = sigma * mu_it ** 0.5
            # non-center the model for better convergence?

            _ = pm.Normal("Collections_i_t",
                          mu=mu_it,
                          sigma=sigma_it,
                          observed=self.data['data'],
                          dims="collections")

            results = pm.sample(draws=self.samples, tune=self.tunes, cores=8)

            if self.summary:
                return pm.summary(results)
            else:
                ppc = pm.sample_posterior_predictive(results, var_names=['Collections_i_t', 'theta', 'omega', 'ult', 'sigma'])

                # convert the draws to a DataFrame
                results = pd.DataFrame(ppc['theta'])
                results.columns = ['theta']
                results['omega'] = ppc['omega'][:]

                for i in range(0, len(self.cohorts)):
                    results['ult' + str(self.cohorts[i]).lower()] = ppc['ult'][:, i]

                return results
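
Usage then looks roughly like this (the file name and the 'ult2019_1' column are hypothetical; the constructor arguments are the data followed by the tune and draw counts, with the lowercase cohort / period / data column names used above):

# hypothetical spreadsheet with cohort / period / data columns
data = pd.read_excel('collections.xlsx')

cal = weibullcalibrator(data, 1000, 1000)  # data, tunes, samples
draws = cal.calibrate()                    # DataFrame with theta, omega and one 'ult...' column per cohort

# e.g. an equal-tailed 90% interval for one cohort's ultimate loss, from the stored draws
lo, hi = np.percentile(draws['ult2019_1'], [5, 95])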

Thanks so much for replying; it’s really good to see the kind of workflow you’ve built around the model. I’ll be experimenting a bit more with output and reporting now.

Thankfully, I figured out how to make my model (above) run this morning; the issue I was having last night was a simple error on my part in feeding the data through the model: I was passing the date column ‘dev_year’ instead of the numeric ‘dev_lag’ column (now corrected above). :grimacing:

Great that you sorted it!

I have observed some issues dealing with flat curves (I am not an actuary, but the approach works well in many applications). I am not sure why; I tried a few hacks like introducing some noise, removing some values,… but in the end, I removed the ‘cohorts’ that created the issue and it runs just fine.

Good luck! :slight_smile:


Thank you!