Modeling with highly-skewed data

Hey everyone,

I’m interested in defining threshold values for the repayment behavior of clients who use financial services. Specifically, I want to use the mean of the HPD interval to establish that threshold. For this, I want to use this simple logic:

if n(late_days) > threshold: reject, else approve.
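For concreteness, here is the decision rule as a plain Python sketch (the function name and labels are my own, not from any library):

```python
def decide(late_days: float, threshold: float) -> str:
    """Reject an application when observed late days exceed the threshold."""
    return "reject" if late_days > threshold else "approve"

print(decide(45, threshold=30))  # reject
print(decide(10, threshold=30))  # approve
```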

After plotting the distribution of late payments, I assume the data-generating process follows a Student’s t-distribution (based on the large right tail, i.e. right-skewed) and have defined the corresponding prior to be uniform (a weak prior, because I am not sure which prior distribution to use).
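For illustration, this is roughly how the “large right tail” reading can be sanity-checked numerically; the synthetic array below is only a stand-in for the real late_days column:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for data['late_days']: non-negative, heavy right tail
rng = np.random.default_rng(7)
late_days = np.abs(rng.standard_t(df=5, size=10_000)) * 4

print(stats.skew(late_days))                     # positive => right-skewed
print(np.percentile(late_days, [50, 95, 99.9]))  # median sits far below the tail
```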

The shape of my data is (3_900_000,12).

Model:

lower_bound = 0
upper_bound = np.max(data['late_days'])

with pm.Model() as threshold_model:
    late_days = pm.Uniform(
        'payment_behavior',
        lower_bound = lower_bound,
        higher_boundhigher_bound
    )

    degress_freedom = pm.Exponential('degrees_freedom', lam = 1)
    
    thresholds = pm.StudentT(
        'thresholds',
        nu = degrees,
        mu = late_days,
        sd = 1,
        observed = data['late_days'].values
    )

    trace = pm.sample(20_000, tune = 35_000)

Error:

    RuntimeWarning: overflow encountered in _beta_ppf 
    return _boost._beta_ppf(q, a, b)

Does anyone have ideas on what I am doing wrong here?

Many thanks in advance!

Welcome!

A couple of things. First, what version of PyMC are you using? It looks to be an older version based on some of the argument names you are using. Second, you seem to have a variety of typos and other odd mismatches in your code. For example, you pass degrees as the nu argument, which I assume is supposed to be degress_freedom. And higher_boundhigher_bound isn’t syntactically valid. Finally, if you find yourself needing 20,000 samples and/or 35,000 tuning samples, something is almost certainly wrong with your model.

To your specific question, can you provide the full traceback? I ran a corrected version of your model on some toy data and everything samples just fine.

You’re getting a warning, not an error, and I think it was solved in a more recent version of the library.

1 Like

Many thanks, @cluhmann and @ricardoV94!

Here’s the corrected code (which I can’t copy-paste directly into this editor from my notebook - long story):

lower_bound = 0
upper_bound = np.max(data['late_days'])

with pm.Model() as threshold_model:
    late_days = pm.Uniform(
        'payment_behavior',
        lower = lower_bound,
        upper = upper_bound
    )

    degrees_freedom = pm.Exponential('degrees_freedom', lam = 1)

    thresholds = pm.StudentT(
        'thresholds',
        nu = degrees_freedom,
        mu = late_days,
        sd = 1,
        observed = data['late_days'].values
    )

    trace = pm.sample(5_000, tune = 5_000)  # based on your input re samples

Regarding the other points:

  • I’m running the project on PyMC3 3.11.4 (macOS Ventura)
  • the complete warning was:
/opt/homebrew/anaconda3/envs/pymc_framework/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf 
    return _boost._beta_ppf(q, a, b)
Sampling 4 chains for 1

Then the kernel broke and the sampling process was interrupted.

However, I have now rerun the code successfully, and I conclude that my mistake was asking for too many draws and tuning steps.

As a follow-up question: what criteria could be taken into account when choosing values for the draws and tune parameters?

Thank you again!

1 Like

If you cannot run pm.sample() with all the default values and have your diagnostics come back clean (no divergences, etc.), I would be mildly suspicious. Maybe you crank up target_accept to 0.9 (maybe). Maybe you increase the number of tuning steps to 2,000 (the default is 1,000). But much beyond that (and really, even then), I would suspect some underlying issue (e.g., model-data incompatibility). I know that may not be particularly helpful, but perhaps the other way to think about it is that you want to generate some bad diagnostics, because they provide clues about where the model/sampler is running into problems.
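To make “bad diagnostics” concrete: the R-hat statistic compares within-chain and between-chain variance, and values much above 1.01 flag chains that disagree. In practice you would just read it off az.summary(trace) (ArviZ); the hand-rolled numpy version below is only a sketch of what the number measures, not ArviZ’s implementation:

```python
import numpy as np

def split_rhat(chains: np.ndarray) -> float:
    """Split R-hat for an array of shape (n_chains, n_draws).

    Each chain is split in half so that slow drift within a chain
    also shows up as disagreement between "chains".
    """
    half = chains.shape[1] // 2
    halves = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    n = halves.shape[1]
    within = halves.var(axis=1, ddof=1).mean()     # W: mean within-chain variance
    between = n * halves.mean(axis=1).var(ddof=1)  # B: scaled variance of chain means
    var_est = (n - 1) / n * within + between / n   # pooled variance estimate
    return float(np.sqrt(var_est / within))

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 1000))       # four well-mixed chains
bad = good + np.arange(4)[:, None]      # chains stuck at different locations
print(split_rhat(good))  # close to 1.0
print(split_rhat(bad))   # well above 1.01
```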

This is very helpful, @cluhmann. Many thanks, really!
Kind regards

1 Like