How to model observed percentages (bounded from 0 to 1)

EtienneT · December 27, 2017, 1:38am

I am trying to model a percentage response variable. Looked at bounded distributions but you can’t specify observed variables. I also tried using a Beta distribution to constraint my observed data from 0 to 1 but I had some exceptions with PyMC3. I would prefer using Normal distributions. Maybe there’s something I could do with the logistic function?

I am doing a multi-level model:

# Intercept for each store, distributed around group mean mu_a
a = pm.Normal('alpha', mu=mu_a, sd=sigma_a, shape=len(uniqueStores))
# Intercept for each store, distributed around group mean mu_a
b = pm.Normal('Shift1Score', mu=mu_b, sd=sigma_b, shape=len(uniqueStores))

score_est = pm.math.sigmoid(a[storeIDX] + (b[storeIDX] * audits.Shift1Score.values))
    
# Data likelihood
y_like = pm.Normal('y_like', mu=score_est, sd=eps, observed=audits.FinalScore)

Even with the sigmoid here I still get a posterior where the HPC contains values greater than 1.

posterior

Help would be greatly appreciated! Thanks!

junpenglao · December 27, 2017, 5:41pm

You are right that you need to model it with a Bounded Normal. However, in PyMC3 you can not have bounded observed variable (yet).
In general, there are two ways to deal with it:
1, transformed the data so that the observer is now (-∞, ∞), an inverse logistic transform is one of them.
2, model it with a Beta distribution:

sd = pm.Uniform('sd', 0, tt.sqrt(score_est * (1 - score_est)))
# Parameterized the Beta with mu and sd
y_like = pm.Beta('y_like', mu=score_est, sd=eps, observed=audits.FinalScore)

Also, if you know where the percentage comes from (the n and the k), then ideally you should use a Binomial.

fonnesbeck · December 27, 2017, 9:39pm

The normal distribution is not appropriate for variables constrained to the unit interval. What sort of expections were you getting with the beta? Another strategy, if you have an unusual likelihood, is to use a Potential (factor potential) which allows you to specify arbitrary log-probability terms, and ends up being treated like a likelihood.

EtienneT · December 28, 2017, 3:46pm

Thank you for your replies! I tried transforming my FinalScore with a logit function and then fit the Normal to that. It worked but then it makes interpreting the traceplot graphs difficult. Would it be possible to transform the trace to be able to interpret it more easily?

Here is my code when I try to fit it to a Beta distribution and I will also include the exception stack trace as an attachment because it is huge…

with pm.Model() as hierarchical_model:
    # Hyperpriors
    mu_a = pm.Normal('mu_alpha', mu=0, sd=1)
    sigma_a = pm.HalfCauchy('sigma_alpha', beta=1)
    mu_b = pm.Normal('mu_beta', mu=0, sd=1)
    sigma_b = pm.HalfCauchy('sigma_beta', beta=1)
    
    # Intercept for each store, distributed around group mean mu_a
    a = pm.Normal('intercept', mu=mu_a, sd=sigma_a, shape=len(uniqueStores))
    # Intercept for each store, distributed around group mean mu_a
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=len(uniqueStores))
    
    # Model error
    eps = pm.HalfCauchy('eps', beta=1)
    
    # Expected value
    score_est = a[storeIDX] + (b[storeIDX] * audits['Shift1Score'].values)
    
    # Data likelihood
    sd = tt.sqrt(score_est * (1 - score_est))
    
    y_like = pm.Beta('y_like', mu=score_est, sd=sd, observed=audits.FinalScore)

    trace = pm.sample(progressbar=True, njobs=4)

exception.txt (62.8 KB)

EtienneT · December 29, 2017, 5:32am

Alright I managed to make the Beta posterior work. I am a software engineer and still learning bayesian stats for fun so… I saw this post and transformed my code like his and used a pretty small standard deviation. Made sure my observed variable did not contain values exactly 0 or 1.

# Intercept for each store, distributed around group mean mu_a
a = pm.Normal('intercept', mu=mu_a, sd=sigma_a, shape=len(uniqueStores))
# Intercept for each store, distributed around group mean mu_a
b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=len(uniqueStores))

# Model error
eps = pm.HalfNormal('eps', sd=0.1)

# Expected value
u = pm.math.invlogit(a[storeIDX] + (b[storeIDX] * audits['Shift1Score'].values))

# Data likelihood    
y_like = pm.Beta('y_like', mu=u, sd=eps, observed=audits.AdjustedScore)

Thank you!

junpenglao · December 29, 2017, 6:56am

Yes, you can extract the mu from the trace: mu = trace['mu'], do the logistic transform: p = logistic(mu), and add it back to the trace trace.add_values(p), then you can visualize it using traceplot etc.

You Beta model is not working initially because you fix the sd to sd = tt.sqrt(score_est * (1 - score_est)), but for the mu and sd parametrization to work, sd must within the bound of (0, sqrt(mu*(1-mu)))

EtienneT · January 3, 2018, 3:27am

I could not find any examples for binomial. I have access to the n and the k because each row in my dataset is actually the result of a test. Each test has a total # of points and the actual points the person got. We divide those to have a percentage. So I guess I could use the total # of points as my number of trial and the actual scored points as the number of successes. But how do I specify those observed data to the Binomial distribution? The distribution needs a total # of trials n and we want to find p. The total # of points can change for each row in my dataset also.

Thanks,

junpenglao · January 3, 2018, 6:54am

In Binomial, the n could be a vector of trial Ns, so in you case, after
score_est = pm.math.sigmoid(a[storeIDX] + (b[storeIDX] * audits.Shift1Score.values))
you can do something like:

    #likelihood
    y_like = pm.Binomial('observed', n=audits.trial_n, p=score_est, observed=audits.observed_n)

EtienneT · January 3, 2018, 4:08pm

Worked perfectly thank you!

Topic		Replies	Views
Beta Regression in pyMC3 Questions	7	5650	September 18, 2017
LogitNormal vs. Beta vs. Logistic Questions	1	1110	August 15, 2018
Modeling observed data that can take values between 0 &1 Questions	3	403	June 12, 2021
Fitting a Beta Regression Questions	3	1806	September 19, 2018
[quick conceptual question] shouldn't the lognormal distribution be used as the likelihood more often? Questions	5	1217	August 6, 2019

How to model observed percentages (bounded from 0 to 1)

Related topics