Beta distribution failing for missing value imputation?

In Statistical Rethinking, 2nd ed, there is problem 15M2:

Reconsider the primate milk missing data example from the chapter. This time, assign B a distribution that is properly bounded between zero and 1. A beta distribution, for example, is a good choice."

I have been able to replicate the code from the original problem (where B used a normal distribution). However, it fails when I use a beta distribution in pymc, even though a colleague (@MarcoGorelli) was able to do this in numpyro.

df_primates["neocortex_prop"] = df_primates["neocortex.perc"] / 100
df_primates["k_std"] = standardize(df_primates["kcal.per.g"])
df_primates["logmass"] = np.log(df_primates["mass"])
df_primates["logmass_std"] = standardize(df_primates["logmass"])

with pm.Model() as m15_5b:

    # priors
    a = pm.Normal("a", 0, 0.5)
    bB = pm.Normal("bB", 0, 1)
    bM = pm.Normal("bM", 0, 0.5)
    sigma = pm.Exponential("sigma", 1)

    # obs/imputed B with beta distribution (avoid standardizing)
    Bi = pm.Beta("Bi", alpha=2, beta=2, observed=df_primates["neocortex_prop"])

    # linear model
    mu = a + bB * (Bi - 0.67) + bM * df_primates["logmass_std"]

    # likelihood
    K = pm.Normal("K", mu, sigma, observed=df_primates["k_std"])

    # sample
    trace_m15_5b = pm.sample(draws=1000, random_seed=19, return_inferencedata=True)

Here is the error message I get:

SamplingError                             Traceback (most recent call last)
<ipython-input-27-6bcd320dd5ad> in <module>
     24     # sample
---> 25     trace_m15_5b = pm.sample(draws=1000, random_seed=19, return_inferencedata=True)

~/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/pymc3/ in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, pickle_backend, **kwargs)
    504             if start is None:
    505                 start = start_
--> 506                 check_start_vals(start, model)
    507         except (AttributeError, NotImplementedError, tg.NullTypeGradError):
    508             # gradient computation failed

~/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/pymc3/ in check_start_vals(start, model)
    234         if not np.all(np.isfinite(initial_eval)):
--> 235             raise SamplingError(
    236                 "Initial evaluation of model at starting point failed!\n"
    237                 "Starting values:\n{}\n\n"

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'a': array(-0.80057862), 'bB': array(0.31003719), 'bM': array(0.06270603), 'sigma_log__': array(-1.31484483), 'Bi_missing': array([ 1.42181326,  1.29176305,  0.89386098,  1.39275087,  0.14458384,
       -0.39778831,  0.79556126, -0.09505109,  0.84795039,  0.27948074,
        0.02310535,  0.13706759])}

Initial evaluation results:
a               -1.51
bB              -0.97
bM              -0.23
sigma_log__     -1.58
Bi_missing       0.00
Bi               -inf
K             -342.50
Name: Log-probability of test_point, dtype: float64

Any ideas on what I can be doing wrong?

Just a hunch: does neocortex_prop contain values of zero or 1? The Beta(2, 2) rules out those values, so that would be an issue. Beta(1, 1) (uniform) would work. As I said though, just a hunch, let me know!

Not a direct answer to you questions, but have you checked out the Rethinking (2ed) notebooks? If not, they may be a useful resource for this and other examples from the book.

Thank you for the suggestion @Martin_Ingram. Values of 0 or 1 are not present and a Beta(1,1) gives the same error.

1 Like

Thank you @cluhmann. I was able to replicate the original problem in Chapter 15 with the repo’s help. But the end of chapter problems in the repo only go up to Chapter 9 (although your suggestion reminds me that I can contribute what I have for Chapters 10-15).

1 Like

Ah, makes sense. And yes, I am sure the contribution would be appreciated!

1 Like

Bi_missing has two negative intial values as you can see in the error message. The problem is that in the current PyMC version imputed variables are not transformed as default variables are. That means imputation is not very stable for constrained variables.

This is fixed in the upcoming version of PyMC. In the meantime I would suggest you split the observed and missing variables into two separate Betas


Awesome, thank you!

I was able to get this working, but it’s a bit hacky. I concatenated the missing and observed variables from the two beta distributions, but then that changed the order of the indexes. To accommodate, I changed the order of predictor and outcome variables to match the Bi order. But is there a cleaner way to do this? Specifically, I’m wondering if there’s a way to let Bi concatenate based on index of missing (idx_miss) and observed (idx_obs).
@roesta07, @MarcoGorelli

with pm.Model() as m15_5b:

    # priors
    a = pm.Normal("a", 0, 0.5)
    bB = pm.Normal("bB", 0, 1)
    bM = pm.Normal("bM", 0, 0.5)
    sigma = pm.Exponential("sigma", 1)

    # paramaterization for beta, split into observed/missing
    idx_miss = df_primates.loc[df_primates["neocortex_prop"].isna(), "neocortex_prop"].index
    idx_obs = df_primates.loc[df_primates["neocortex_prop"].notna(), "neocortex_prop"].index
    Bi_miss = pm.Beta("Bi_miss", alpha=2, beta=2, shape = len(idx_miss))
    Bi_obs = pm.Beta("Bi_obs", alpha=1, beta=1, observed = df_primates.loc[idx_obs, "neocortex_prop"])
   # Is there a way to concatenate based on index?
    Bi = pm.math.concatenate([Bi_miss, Bi_obs], axis=0) 

    # linear model, changed the order of logmass_std to match Bi index
    mu = a + bB * (Bi - 0.67) + bM * df_primates.loc[idx_miss.tolist() + idx_obs.tolist(), "logmass_std"]
    # likelihood, changed the order of k_std to match Bi index
    K = pm.Normal("K", mu, sigma, observed=df_primates.loc[idx_miss.tolist() + idx_obs.tolist(), "k_std"])

    # sample
    trace_m15_5b = pm.sample(draws=1000, random_seed=19, return_inferencedata=True)