Predictions on holdout dataset with beta binomial regression

I am trying to predict the rate of success from a dataset containing trials and successes.

At the moment the model is parametrised as follows:

with pm.Model() as betabinomial:
    
    predictor = pm.Data('predictor', df.predictor)
    trials = pm.Data('trials', df.trials)
    
    intercept = pm.Normal('intercept', mu=np.log(0.1), sd=0.001)
    beta = pm.Normal('beta', mu=np.log(0.8), sd=0.001)
    
    p_sigma = pm.Exponential('p_sigma', 100)

    # Expected value
    p_mu = pm.math.invlogit(intercept + beta * predictor)
    p_kappa = ((p_mu * (1-p_mu))/p_sigma**2)-1 
    p_alpha = p_mu * p_kappa
    p_beta = (1-p_mu) * p_kappa
    
    # Outcome definition
    Y = pm.BetaBinomial('successes', n=trials, alpha=p_alpha, beta=p_beta, observed=df.successes)
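As a quick sanity check of the mu/sigma-to-alpha/beta conversion above (plain NumPy, with illustrative values for mu and sigma rather than fitted parameters), the implied Beta distribution does recover the intended mean and variance:

```python
import numpy as np

# Sanity check of the (mu, sigma) -> (alpha, beta) reparametrisation above,
# using illustrative values (not fitted parameters).
mu, sigma = 0.12, 0.01
kappa = (mu * (1 - mu)) / sigma**2 - 1
alpha = mu * kappa
beta = (1 - mu) * kappa

# Mean and variance of a Beta(alpha, beta) distribution
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))

print(np.isclose(mean, mu), np.isclose(var, sigma**2))  # True True
```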

I would like to estimate the parameter p from the Beta distribution on a holdout dataset but I cannot figure out how to do it. I am happy to also separate the beta binomial into a beta prior + a binomial likelihood if it makes it any easier.

Below was my attempt, but I was not able to make predictions: the shape of the Beta is fixed to the size of the train dataset, and sample_posterior_predictive fails if the test dataset is smaller.

with pm.Model() as beta_binomial:
    
    predictor = pm.Data('predictor', df.predictor)
    trials = pm.Data('trials', df.trials)
    
    intercept = pm.Normal('intercept', mu=np.log(0.1), sd=0.001)
    beta = pm.Normal('beta', mu=np.log(0.8), sd=0.001)
    
    p_sigma = pm.Exponential('p_sigma', 100)

    # Expected value
    p_mu = pm.math.invlogit(intercept + beta * predictor)
    p_kappa = ((p_mu * (1-p_mu))/p_sigma**2)-1 
    p_alpha = p_mu * p_kappa
    p_beta = (1-p_mu) * p_kappa

    # Data likelihood
    p = pm.Beta('p', alpha=p_alpha, beta=p_beta, shape=df.shape[0])
    
    # Outcome definition
    Y = pm.Binomial('successes', n=trials, p=p, observed=df.successes)

Any suggestion would be greatly appreciated; I couldn't find anything on the Web.


Hi,
Did you try changing the data / shapes with pm.set_data, as here for instance?

I tried, and it works fine if all parameters have shape 1 (or no shape specified). But in the beta_binomial case the shape of p is defined to be the size of the train dataset, so I cannot make it change dynamically to the size of the test dataset.

For the betabinomial case I can swap datasets with no problems but I am not sure how to sample p after I changed the dataset.

Can you maybe share the code you use to do that, as well as the error it raises?

This is how I am trying to make predictions on successes:

with beta_binomial:
    pm.set_data({'predictor': new_predictor,
                 'trials': new_trials})
    posterior = pm.sample_posterior_predictive(trace, samples=10000)

When I try to run this I get the following error:

Cannot broadcast provided shapes (88,), (1053,), () given size: ()

where 88 is the number of rows in the test dataset and 1053 is the number of rows in the train dataset.

I guess the problem is due to the fact that in the beta_binomial model the shape of p is defined as the size of the train dataset. I don’t know how to fix this.
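For what it's worth, the error itself is plain broadcasting: a length-1053 p cannot be aligned with length-88 data. A minimal NumPy reproduction of the shape clash:

```python
import numpy as np

# Minimal reproduction of the shape clash: the stored p has one entry per
# train row (1053), while the swapped-in data has one entry per test row (88).
try:
    np.broadcast_shapes((88,), (1053,))
except ValueError as err:
    print('cannot broadcast:', err)
```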

Try using the beta_binomial, and after you sample, do:

with beta_binomial:
    pm.set_data({'predictor': new_predictor, 'trials': new_trials})
    p_mu = pm.math.invlogit(intercept + beta * predictor)
    # re-propagate the value so that the shape is correct
    p_kappa = ((p_mu * (1-p_mu))/p_sigma**2)-1 
    p_alpha = p_mu * p_kappa
    p_beta = (1-p_mu) * p_kappa
    # create new RV for prediction
    p = pm.Beta('p', alpha=p_alpha, beta=p_beta, shape=new_predictor.shape[0])
    posterior = pm.sample_posterior_predictive(trace, var_names=['p'])
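Once this runs, posterior['p'] should come back with shape (n_samples, n_test_rows). A NumPy sketch of the bookkeeping, with fake Beta draws standing in for the real posterior samples:

```python
import numpy as np

# Hypothetical sketch: posterior['p'] from sample_posterior_predictive has
# shape (n_samples, n_test_rows); fake Beta draws stand in for it here.
rng = np.random.default_rng(0)
n_samples, n_test = 10000, 88
p_draws = rng.beta(2.0, 18.0, size=(n_samples, n_test))

p_mean = p_draws.mean(axis=0)                            # per-row point estimate
p_interval = np.quantile(p_draws, [0.03, 0.97], axis=0)  # rough 94% interval

print(p_mean.shape, p_interval.shape)  # (88,) (2, 88)
```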

Thanks, this works fine

How can I modify your suggestion if instead of using just one global intercept now I want to use random intercepts?
p_mu = pm.math.invlogit(random_intercept[tt.cast(test_site, 'int64')] + beta_position * test_position)

I guess I can reuse the code you sent me when predicting for an existing site, but I am not sure how to modify it when I need to predict on a new site. I was trying to adapt the code from How do we predict on new unseen groups in a hierarchical model in PyMC3?, but I am not sure how to do it in this case, since I also need access to the parameter p and not just the number of successes.

Thanks a lot in advance and sorry for the trivial questions, I am just starting out.

In that case you will need to draw a new random_intercept:

random_intercept_new = pm.{the distribution for the random_intercept}

with the same parameters (so that we can substitute them with the trace, i.e. the posterior samples), and then do:

p_mu = pm.math.invlogit(random_intercept_new + beta_position * test_position)

So if my random_intercept was defined as:

mu_random_intercept = pm.Normal('mu_random_intercept', mu=np.log(0.2), sd=0.01)
sigma_random_intercept = pm.HalfNormal('sigma_random_intercept', 0.1)
random_intercept_offset = pm.Normal('random_intercept_offset', mu=0, sd=1, shape=num_unique_sites_train)
random_intercept = pm.Deterministic('random_intercept', mu_random_intercept + random_intercept_offset * sigma_random_intercept)

I would define

random_intercept_new_offset = pm.Normal('random_intercept_new_offset', mu=0, sd=1, shape=num_unique_sites_test)
random_intercept_new = pm.Deterministic('random_intercept_new', mu_random_intercept + random_intercept_new_offset * sigma_random_intercept)

Is that correct?

yes, alternatively, you can do:

random_intercept_offset = pm.Normal('random_intercept_offset', mu=0, sd=1, shape=num_unique_sites_train_and_test)
random_intercept = pm.Deterministic('random_intercept', mu_random_intercept + random_intercept_offset[train_index] * sigma_random_intercept)

and during prediction, do:

random_intercept_test = pm.Deterministic('random_intercept_test', mu_random_intercept + random_intercept_offset[test_index] * sigma_random_intercept)

However(!!), since you don't have a hyperprior on random_intercept_offset, the random_intercept_offset for your test set will just be drawn from a Normal(0, 1).
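The shared-offset indexing scheme above, sketched in plain NumPy (the site counts and indices are made up for illustration):

```python
import numpy as np

# Toy illustration of sharing one offset vector across train and test sites.
rng = np.random.default_rng(1)
num_sites = 5                              # num_unique_sites_train_and_test
offsets = rng.normal(0, 1, num_sites)      # stand-in for random_intercept_offset
mu_ri, sigma_ri = np.log(0.2), 0.1         # stand-ins for the hyperparameters

train_index = np.array([0, 1, 2, 1, 0])    # site id of each train row
test_index = np.array([3, 4])              # sites first seen at test time

# Same offset vector, different index arrays -> shapes follow the data
random_intercept = mu_ri + offsets[train_index] * sigma_ri
random_intercept_test = mu_ri + offsets[test_index] * sigma_ri

print(random_intercept.shape, random_intercept_test.shape)  # (5,) (2,)
```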


Perfect. This really clarifies things. Thanks a lot!
Would you recommend having a hyperprior for the offset as well?

Yes! Usually I recommend something like:

random_sd = pm.HalfNormal('random_sd', 2.5)
random_intercept_offset = pm.Normal('random_intercept_offset', mu=0, sd=random_sd, shape=num_unique_sites_train_and_test)