Zero One Inflated Beta Regression

Hello everyone!

I am trying to implement a Zero One Inflated Beta Regression model in PyMC3. Here’s what I have done so far,

 with pm.Model():
      # Priors for unknown model parameters
      theta1 = pm.Normal("theta1", mu=theta1_mu, sigma=theta1_sd)
      theta2 = pm.Normal("theta2", mu=theta2_mu, sigma=theta2_sd)
      theta3 = pm.Normal("theta3", mu=theta3_mu, sigma=theta3_sd)
      theta4 = pm.Normal("theta4", mu=theta4_mu, sigma=theta4_sd)
      theta5 = pm.Normal("theta5", mu=theta5_mu, sigma=theta5_sd)
      theta6 = pm.Normal("theta6", mu=theta6_mu, sigma=theta6_sd)
      precision = pm.Uniform("precision", 0, 100) # based on the original code

      # Expected value of outcome
      PI_0 = pm.math.invlogit(theta3_mu + theta4_mu * v)
      PI_1 = pm.math.invlogit(theta5_mu + theta6_mu * v)
      mu = pm.math.invprobit((1/theta2_mu) * pm.math.log(v/theta1_mu))

      # Likelihood (sampling distribution) of observations
      y_zero = pm.Bernoulli("y_zero", p=PI_0,  observed=y_zero_obs)
      y_one = pm.Bernoulli("y_one", p=PI_1, observed=y_one_obs)
      y_beta = pm.Beta("y_beta", alpha=mu * precision, beta=(1 - mu) * precision, observed=y_beta_obs)

      trace = pm.sample(1500, tune=1000, chains=3, cores=1) # tuned samples are being discarded

Since the probabilities PI_0, PI_1 are distributions, the model described here cannot be used as per my understanding.

The problem is that I have to mix these distributions (y_zero, y_one, y_beta) to have the following. Note that this is the likelihood.


However, with the code I have this is not happening. Secondly, I believe that the posteriors are not updating as per the likelihood, instead they are just recreating the priors. Could someone help me on this?

Thanks in advance!

@ricardoV94 how would you mix deterministics like PI_0 and PI_1?

What’s the purpose of all the theta parameters? You don’t use them anywhere in the model; you use theta_mu everywhere instead.

As per the paper I am currently looking at, I need to use thetas to get the PI_0, PI_1 values. The equations look like this,


Ideally, I think I should be using the complete distributions but I could not work with that so I decided to use the mu values.

But if the mus are fixed values, then there’s no flow of information between pi and theta. If you take your model and use pm.model_to_graphviz to look at the DAG:

You can see that none of the parameters are being used to inform any of the observed nodes.

If you rewrite the model differently, for example:

 with pm.Model() as zero_one_beta:
    # Priors for unknown model parameters
    thetas = pm.Normal("theta", mu=0, sigma=10, size=6)
    precision = pm.Uniform("precision", 0, 100) # based on the original code
    # I set v = 1, but that could be a random variable as well
#     # Expected value of outcome
    PI_0 = pm.Deterministic('pi_0', pm.math.invlogit(thetas[2] + thetas[3] * v))
    PI_1 = pm.Deterministic('pi_1', pm.math.invlogit(thetas[4] + thetas[5] * v))
    mu = pm.Deterministic('mu', pm.math.invprobit(pm.math.log(v / thetas[0] ** 2) / thetas[1]))

#     # Likelihood (sampling distribution) of observations
    y_zero = pm.Bernoulli("y_zero", p=PI_0,  observed=y_zeros)
    y_one = pm.Bernoulli("y_one", p=PI_1, observed=y_ones)
    y_beta = pm.Beta("y_beta", alpha=mu * precision, beta=(1 - mu) * precision, observed=y_between)

#     trace = pm.sample(init='jitter+adapt_diag_grad', target_accept=0.99) # tuned samples are being discarded

The DAG becomes:

Which I think is closer to what you are looking for. Notice, though, that the values \pi_0 and \pi_1 are not flowing into \mu_y, as indicated in your formula. That might be a hint that there’s still an error in the model somewhere.


Do you have a reference for that likelihood? It doesn’t seem to be a Mixture nor a Hurdle model, but some sequential process where to observe y == 1, you first have to “fail” to see y == 0, and to observe 0 > y > 1 you have to fail to see both y == 0 and y == 1. Your original model did not encode this in the probabilities of the y_one nor y_beta

I am using the model described in this paper
Using rapid damage observations for Bayesian updating of hurricane vulnerability functions: A case study of Hurricane Dorian using social media - ScienceDirect

I was under the impression that it is a mixture model because they straight away mention that the likelihood is simply a zero inflated beta regression model but I might have miscommunicated it.

I think the likelihood is where the problem is as @jessegrabowski pointed out rightly. But I am not sure on how to correct it precisely. For now I will try including the theta distributions.

My best guess for a direct implementation of what the paper suggests would be to use a CustomDist with switches, like this:

import pytensor.tensor as pt

    def zero_one_beta_logp(y, pi0, pi1, mu, precision):
        alpha = mu * precision
        beta = (1 - mu) * precision
        return pt.switch(pt.eq(y, 0),
                         pt.switch(pt.eq(y, 1),
                                   pt.log((1 - pi0) * pi1),
                                   pt.log((1 - pi0) * (1 - pi1)) + pm.logp(pm.Beta.dist(alpha=alpha, beta=beta), y)))
    y_obs = pm.CustomDist('y_obs', PI_0, PI_1, mu, precision, logp=zero_one_beta_logp, observed=y)

But in general I agree that it’s not really a true mixture, because of the sequential nature. But you’re not allow to mix (har har) continuous and discrete variables in a pm.Mixture, so I don’t have any other good suggestions.

It couldn’t really be a mixture, only a hurdle model, because if you see a 0 or 1 there’s no ambiguitiy as to where they came from.

Incidentally, that’s why Mixture doesn’t allow mixing discrete and continuous.

Yes, that CustomDist is the correct implementation of the logp mentioned above.


I have implemented this model and plotted the DAG which looks more sensible now. Thanks for that!

However, I looked at the trace plots and saw that the distributions are not converging with respect to the data I fed them. I copied the code into a colab notebook for everyone’s reference, since copying and pasting multiple images here may not fully describe the problem.

Here’s the link

I see that r_hat is stuck at 1.66 which is not good. The plots also do not converge as I mentioned before.

It doesn’t look to me like the sampler doesn’t converge, it looks to me like the convergence is dependent on the initial conditions of the sampler – that is, it finds two equi-probable models. If you look at the trace plot, 3 chains converge to one posterior, while the 4th finds a different, but also stationary, posterior. The variables implicated are theta 1, theta 2, and precision. These are all the pieces of the Beta-GLM regression component of the model. The form of the expected value for that regression, \Phi^{-1}\left(\frac{\log(v_i) - \log(\theta_1))}{\theta_2}\right), is frankly baffling to me. Why not just use a normal Beta GLM here, with a canonical link function?

If you insist on following the paper, you are going to need to use more informative priors on \theta_1 and \theta_2 to pin down one of the two (statistically) equivalent models.

One other comment: You are taking the log of the probabilities in your likelihood function, so it might be nice to skip the pm.invlogit everywhere and just pass the log-odds directly to your likelihood function.

Thanks for keeping the thread ongoing!

As for the model, I will not be able to change it at all since I am doing it as a part of a course where I have to recreate this paper. Also, I believe that as per the authors, this makes the most sense considering the problem statement and the current research surrounding it.

The reason why I chose these exact priors is because the authors have done the exact same but in r using JAGS. Here’s the JAGS model which they wrote down,

      # priors
      # dnorm is specified in terms of mean and precision. It gives the entire pdf of the normal distribution (discrete)
      zero0 ~ dnorm(priors.zero_mu0, 1 / (priors.zero_sd0 ^ 2))
      zero1 ~ dnorm(priors.zero_mu1, 1 / (priors.zero_sd1 ^ 2))
      one0 ~ dnorm(priors.one_mu0, 1 / (priors.one_sd0 ^ 2))
      one1 ~ dnorm(priors.one_mu1, 1 / (priors.one_sd1 ^ 2))
      b0 ~ dnorm(priors.b0_mu, 1 / (priors.b0_sigma ^ 2))  # sigma to tau (tau = 1 / sigma**2) - sigma here represents precision - specified this way because of BUGS syntax
      b1 ~ dnorm(priors.b1_mu, 1 / (priors.b1_sigma ^ 2))  # sigma to tau (tau = 1 / sigma**2)
      phi ~ dunif(0, 100)

      # probabilities for zero and one 
      for (i in 1:n) {
        logit(alpha_zero[i]) <- zero0 + zero1 * x[i]
        y.iszero[i] ~ dbern(alpha_zero[i]) # pi 0
        logit(alpha_one[i]) <- one0 + one1 * x[i]
        y.isone[i] ~ dbern(alpha_one[i]) # pi 1

      # arrays for zeros and ones
      for (i in {[i] ~ dbern(0) # the probability of success p here is estimated in BUGS, based on the value of
      for (i in {[i] ~ dbern(1) # the probability of success here is estimated based on the value of
      # likelihood for mu and phi (precision) - probit link function
      for (i in 1:n.cont) {
        y.c[i] ~ dbeta(p[i], q[i]) 
        p[i] <- mu2[i] * phi 
        q[i] <- (1 - mu2[i]) * phi 
        probit(mu2[i]) <- (1 / b0) * log(x.c[i] / b1) 

With this model, they were able to achieve trace plots that looked ideal and that is what is mentioned in the paper I believe.

I also tried changing the distribution of the precision from Uniform to a HalfNormal with a large variance as per the suggestion given here about reparameterizing the model but the trace plots remained the same.

I tried to directly plug in the logodds as well but the result was an inf which could have been an oversight on my end or a genuine calculation mistake but for now I am keeping the same invlogit calculations.

Finally to make theta1 and theta2 more informative I tried reducing the standard deviation and then used a normal distribution on the precision. This gave a single model at convergence but now it does not match with what the authors have.

Could you refer to the JAGS model again and suggest if any improvements can be made to the model we have done so far?


I looked the paper and functions provide above. I wonder why you add “log” to covert the value of pi0 and others? I see the pi0 and others are calculated from logitinv function. Should thta be all accourding to the equations. I cannot figure out why pt.log is needed. Tha

thank you very much for your answer

Hi Josh,

That function computes the log probability of the distribution, so I took the log of the probabilities computed by the inverse logit function. This is common in statistical computing, because it is more numerically stable, allowing us to work on the entire (negative) number line, rather than the 0-1 interval. See this lecture for a detailed discussion.

Hello Jess,

Thank you for your quick reply.
I have watched the video but haven’t watched the one you mentioned yet.
once agian, sincerely appreciate your info.