Modeling with highly-skewed data

Hey everyone,

I’m interested in defining threshold values for the repayment behavior of clients who use financial services. Specifically, I want to use the mean of the HPD interval to establish that threshold. For this, I want to use this simple logic:

if n(late_days) > threshold: reject, else approve.
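For concreteness, here is the decision rule as a plain Python sketch (the function name and labels are my own, not from any library):

```python
def decide(late_days: float, threshold: float) -> str:
    """Reject an application when observed late days exceed the threshold."""
    return "reject" if late_days > threshold else "approve"

print(decide(45, threshold=30))  # reject
print(decide(10, threshold=30))  # approve
```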

After plotting the distribution of late payments, I assume the data-generating process follows a Student’s t-distribution (based on the large right tail, i.e. right-skewed) and have defined the corresponding prior to be uniform (a weak prior, because I am not sure which prior distribution to use).
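For illustration, this is roughly how the “large right tail” reading can be sanity-checked numerically; the synthetic array below is only a stand-in for the real late_days column:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for data['late_days']: non-negative, heavy right tail
rng = np.random.default_rng(7)
late_days = np.abs(rng.standard_t(df=5, size=10_000)) * 4

print(stats.skew(late_days))                     # positive => right-skewed
print(np.percentile(late_days, [50, 95, 99.9]))  # median sits far below the tail
```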

The shape of my data is (3_900_000,12).

Model:

lower_bound = 0
upper_bound = np.max(data['late_days'])

with pm.Model() as threshold_model:
    late_days = pm.Uniform(
        'payment_behavior',
        lower_bound = lower_bound,
        higher_boundhigher_bound
    )

    degress_freedom = pm.Exponential('degrees_freedom', lam = 1)
    
    thresholds = pm.StudentT(
        'thresholds',
        nu = degrees,
        mu = late_days,
        sd = 1,
        observed = data['late_days'].values
    )

    trace = pm.sample(20_000, tune = 35_000)

Error:

    RuntimeWarning: overflow encountered in _beta_ppf 
    return _boost._beta_ppf(q, a, b)

Does anyone have ideas on what I am doing wrong here?

Many thanks in advance!

Welcome!

A couple of things. First, what version of PyMC are you using? It looks to be an older version based on some of the argument names you are using. Second, you seem to have a variety of typos and other odd mismatches in your code. For example, you pass degrees as the nu argument, which I assume is supposed to be degress_freedom. And higher_boundhigher_bound isn’t syntactically valid. Finally, if you find yourself needing 20,000 samples and/or 35,000 tuning samples, something is almost certainly wrong with your model.

To your specific question, can you provide the full traceback? I ran a corrected version of your model on some toy data and everything samples just fine.

You’re getting a warning, not an error, and I think it was solved in a more recent version of the library.

1 Like

Many thanks, @cluhmann and @ricardoV94!

Here’s the corrected code (which I can’t copy-paste directly into this editor from my notebook - long story):

lower_bound = 0
upper_bound = np.max(data['late_days'])

with pm.Model() as threshold_model:
    late_days = pm.Uniform(
        'payment_behavior',
        lower = lower_bound,
        upper = upper_bound
    )

    degrees_freedom = pm.Exponential('degrees_freedom', lam = 1)

    thresholds = pm.StudentT(
        'thresholds',
        nu = degrees_freedom,
        mu = late_days,
        sd = 1,
        observed = data['late_days'].values
    )

    trace = pm.sample(5_000, tune = 5_000)  # based on your input re samples

Regarding the other points:

  • I’m running the project on PyMC3 3.11.4 (macOS Ventura)
  • the complete warning was:
/opt/homebrew/anaconda3/envs/pymc_framework/lib/python3.9/site-packages/scipy/stats/_continuous_distns.py:624: RuntimeWarning: overflow encountered in _beta_ppf 
    return _boost._beta_ppf(q, a, b)
Sampling 4 chains for 1

Then the kernel broke and the sampling process was interrupted.

However, I have now rerun the code successfully, and I conclude that my mistake was asking for too many draws and tuning steps.

As a follow-up question: what criteria could be taken into account when choosing values for the draws and tune parameters?

Thank you again!

1 Like

If you cannot run pm.sample() with all the default values and have your diagnostics come back clean (no divergences, etc.), I would be mildly suspicious. Maybe you crank up target_accept to 0.9 (maybe). Maybe you increase the number of tuning steps to 2,000 (the default is 1,000). But much beyond that (and really, even then), I would suspect some underlying issue (e.g., model-data incompatibility). I know that may not be particularly helpful, but perhaps the other way to think about it is that you want to generate some bad diagnostics, because they provide clues about where the model/sampler is running into problems.
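To make “bad diagnostics” concrete: the R-hat statistic compares within-chain and between-chain variance, and values much above 1.01 flag chains that disagree. In practice you would just read it off az.summary(trace) (ArviZ); the hand-rolled numpy version below is only a sketch of what the number measures, not ArviZ’s implementation:

```python
import numpy as np

def split_rhat(chains: np.ndarray) -> float:
    """Split R-hat for an array of shape (n_chains, n_draws).

    Each chain is split in half so that slow drift within a chain
    also shows up as disagreement between "chains".
    """
    half = chains.shape[1] // 2
    halves = np.vstack([chains[:, :half], chains[:, half:2 * half]])
    n = halves.shape[1]
    within = halves.var(axis=1, ddof=1).mean()     # W: mean within-chain variance
    between = n * halves.mean(axis=1).var(ddof=1)  # B: scaled variance of chain means
    var_est = (n - 1) / n * within + between / n   # pooled variance estimate
    return float(np.sqrt(var_est / within))

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 1000))       # four well-mixed chains
bad = good + np.arange(4)[:, None]      # chains stuck at different locations
print(split_rhat(good))  # close to 1.0
print(split_rhat(bad))   # well above 1.01
```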

This is very helpful, @cluhmann. Many thanks, really!
Kind regards

1 Like