How to model observed percentages (bounded from 0 to 1)


#1

I am trying to model a percentage response variable. Looked at bounded distributions but you can’t specify observed variables. I also tried using a Beta distribution to constraint my observed data from 0 to 1 but I had some exceptions with PyMC3. I would prefer using Normal distributions. Maybe there’s something I could do with the logistic function?

I am doing a multi-level model:

# Intercept for each store, distributed around group mean mu_a
a = pm.Normal('alpha', mu=mu_a, sd=sigma_a, shape=len(uniqueStores))
# Intercept for each store, distributed around group mean mu_a
b = pm.Normal('Shift1Score', mu=mu_b, sd=sigma_b, shape=len(uniqueStores))

score_est = pm.math.sigmoid(a[storeIDX] + (b[storeIDX] * audits.Shift1Score.values))
    
# Data likelihood
y_like = pm.Normal('y_like', mu=score_est, sd=eps, observed=audits.FinalScore)

Even with the sigmoid here I still get a posterior where the HPC contains values greater than 1.

posterior

Help would be greatly appreciated! Thanks!


#2

You are right that you need to model it with a Bounded Normal. However, in PyMC3 you can not have bounded observed variable (yet).
In general, there are two ways to deal with it:
1, transformed the data so that the observer is now (-∞, ∞), an inverse logistic transform is one of them.
2, model it with a Beta distribution:

sd = pm.Uniform('sd', 0, tt.sqrt(score_est * (1 - score_est)))
# Parameterized the Beta with mu and sd
y_like = pm.Beta('y_like', mu=score_est, sd=eps, observed=audits.FinalScore)

Also, if you know where the percentage comes from (the n and the k), then ideally you should use a Binomial.


#3

The normal distribution is not appropriate for variables constrained to the unit interval. What sort of expections were you getting with the beta? Another strategy, if you have an unusual likelihood, is to use a Potential (factor potential) which allows you to specify arbitrary log-probability terms, and ends up being treated like a likelihood.


#4

Thank you for your replies! I tried transforming my FinalScore with a logit function and then fit the Normal to that. It worked but then it makes interpreting the traceplot graphs difficult. Would it be possible to transform the trace to be able to interpret it more easily?

Here is my code when I try to fit it to a Beta distribution and I will also include the exception stack trace as an attachment because it is huge…

with pm.Model() as hierarchical_model:
    # Hyperpriors
    mu_a = pm.Normal('mu_alpha', mu=0, sd=1)
    sigma_a = pm.HalfCauchy('sigma_alpha', beta=1)
    mu_b = pm.Normal('mu_beta', mu=0, sd=1)
    sigma_b = pm.HalfCauchy('sigma_beta', beta=1)
    
    # Intercept for each store, distributed around group mean mu_a
    a = pm.Normal('intercept', mu=mu_a, sd=sigma_a, shape=len(uniqueStores))
    # Intercept for each store, distributed around group mean mu_a
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=len(uniqueStores))
    
    # Model error
    eps = pm.HalfCauchy('eps', beta=1)
    
    # Expected value
    score_est = a[storeIDX] + (b[storeIDX] * audits['Shift1Score'].values)
    
    # Data likelihood
    sd = tt.sqrt(score_est * (1 - score_est))
    
    y_like = pm.Beta('y_like', mu=score_est, sd=sd, observed=audits.FinalScore)

    trace = pm.sample(progressbar=True, njobs=4)

exception.txt (62.8 KB)


#5

Alright I managed to make the Beta posterior work. I am a software engineer and still learning bayesian stats for fun so… I saw this post and transformed my code like his and used a pretty small standard deviation. Made sure my observed variable did not contain values exactly 0 or 1.

# Intercept for each store, distributed around group mean mu_a
a = pm.Normal('intercept', mu=mu_a, sd=sigma_a, shape=len(uniqueStores))
# Intercept for each store, distributed around group mean mu_a
b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=len(uniqueStores))

# Model error
eps = pm.HalfNormal('eps', sd=0.1)

# Expected value
u = pm.math.invlogit(a[storeIDX] + (b[storeIDX] * audits['Shift1Score'].values))

# Data likelihood    
y_like = pm.Beta('y_like', mu=u, sd=eps, observed=audits.AdjustedScore)

Thank you!


#6

Yes, you can extract the mu from the trace: mu = trace['mu'], do the logistic transform: p = logistic(mu), and add it back to the trace trace.add_values(p), then you can visualize it using traceplot etc.

You Beta model is not working initially because you fix the sd to sd = tt.sqrt(score_est * (1 - score_est)), but for the mu and sd parametrization to work, sd must within the bound of (0, sqrt(mu*(1-mu)))


#7

I could not find any examples for binomial. I have access to the n and the k because each row in my dataset is actually the result of a test. Each test has a total # of points and the actual points the person got. We divide those to have a percentage. So I guess I could use the total # of points as my number of trial and the actual scored points as the number of successes. But how do I specify those observed data to the Binomial distribution? The distribution needs a total # of trials n and we want to find p. The total # of points can change for each row in my dataset also.

Thanks,


#8

In Binomial, the n could be a vector of trial Ns, so in you case, after
score_est = pm.math.sigmoid(a[storeIDX] + (b[storeIDX] * audits.Shift1Score.values))
you can do something like:

    #likelihood
    y_like = pm.Binomial('observed', n=audits.trial_n, p=score_est, observed=audits.observed_n)

#9

Worked perfectly thank you!