# Beta distribution failing for missing value imputation?

In Statistical Rethinking, 2nd ed, there is problem 15M2:

> "Reconsider the primate milk missing data example from the chapter. This time, assign B a distribution that is properly bounded between zero and 1. A beta distribution, for example, is a good choice."

I have been able to replicate the code from the original problem (where B used a normal distribution). However, it fails when I use a beta distribution in `pymc`, even though a colleague (@MarcoGorelli) was able to do this in `numpyro`.

```python
df_primates["neocortex_prop"] = df_primates["neocortex.perc"] / 100
df_primates["k_std"] = standardize(df_primates["kcal.per.g"])
df_primates["logmass"] = np.log(df_primates["mass"])
df_primates["logmass_std"] = standardize(df_primates["logmass"])

with pm.Model() as m15_5b:

    # priors
    a = pm.Normal("a", 0, 0.5)
    bB = pm.Normal("bB", 0, 1)
    bM = pm.Normal("bM", 0, 0.5)
    sigma = pm.Exponential("sigma", 1)

    # obs/imputed B with beta distribution (avoid standardizing)
    Bi = pm.Beta("Bi", alpha=2, beta=2, observed=df_primates["neocortex_prop"])

    # linear model
    mu = a + bB * (Bi - 0.67) + bM * df_primates["logmass_std"]

    # likelihood
    K = pm.Normal("K", mu, sigma, observed=df_primates["k_std"])

    # sample
    trace_m15_5b = pm.sample(draws=1000, random_seed=19, return_inferencedata=True)
```

Here is the error message I get:

```
---------------------------------------------------------------------------
SamplingError                             Traceback (most recent call last)
23
24     # sample
---> 25     trace_m15_5b = pm.sample(draws=1000, random_seed=19, return_inferencedata=True)

~/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, idata_kwargs, mp_ctx, pickle_backend, **kwargs)
504             if start is None:
505                 start = start_
--> 506                 check_start_vals(start, model)

~/opt/anaconda3/envs/stats_rethinking/lib/python3.8/site-packages/pymc3/util.py in check_start_vals(start, model)
233
234         if not np.all(np.isfinite(initial_eval)):
--> 235             raise SamplingError(
236                 "Initial evaluation of model at starting point failed!\n"
237                 "Starting values:\n{}\n\n"

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'a': array(-0.80057862), 'bB': array(0.31003719), 'bM': array(0.06270603), 'sigma_log__': array(-1.31484483), 'Bi_missing': array([ 1.42181326,  1.29176305,  0.89386098,  1.39275087,  0.14458384,
-0.39778831,  0.79556126, -0.09505109,  0.84795039,  0.27948074,
0.02310535,  0.13706759])}

Initial evaluation results:
a               -1.51
bB              -0.97
bM              -0.23
sigma_log__     -1.58
Bi_missing       0.00
Bi               -inf
K             -342.50
Name: Log-probability of test_point, dtype: float64
```

Any ideas on what I might be doing wrong?

Just a hunch: does `neocortex_prop` contain values of exactly 0 or 1? Beta(2, 2) assigns zero density to those endpoint values, so that would be an issue. Beta(1, 1) (uniform) would work. As I said though, just a hunch, let me know!
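For reference, here is a quick way to check for exact endpoint values; this is a sketch with a hypothetical stand-in series, not the actual primate milk data:

```python
import pandas as pd

# hypothetical values standing in for df_primates["neocortex_prop"]
neocortex_prop = pd.Series([0.55, 0.76, None, 0.67, 0.64])

# Beta(a, b) with a, b > 1 has zero density at 0 and 1, so any exact
# endpoint value makes the log-probability of the data -inf
has_boundary = neocortex_prop.isin([0, 1]).any()
print(has_boundary)
```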

Not a direct answer to your question, but have you checked out the Rethinking (2ed) notebooks? If not, they may be a useful resource for this and other examples from the book.

Thank you for the suggestion @Martin_Ingram. Values of 0 or 1 are not present and a Beta(1,1) gives the same error.


Thank you @cluhmann. I was able to replicate the original problem in Chapter 15 with the repo’s help. But the end of chapter problems in the repo only go up to Chapter 9 (although your suggestion reminds me that I can contribute what I have for Chapters 10-15).


Ah, makes sense. And yes, I am sure the contribution would be appreciated!


`Bi_missing` has two negative initial values, as you can see in the error message. The problem is that in the current PyMC version, imputed variables are not transformed the way ordinary variables are, which means imputation is not very stable for constrained variables.

This is fixed in the upcoming version of PyMC. In the meantime, I would suggest you split the observed and missing variables into two separate Betas.
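The transform mentioned here can be illustrated without PyMC at all: for an ordinary bounded parameter, PyMC lets the sampler work on an unconstrained scale and maps back into the support with a logit-style transform, so any starting value is valid. A minimal numpy sketch of that mapping (my own illustration, not the PyMC API):

```python
import numpy as np

def expit(x):
    # logistic function: maps any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# starting values like those in the traceback fall outside (0, 1),
# e.g. 1.42 and -0.40, where a Beta log-density is -inf
raw_start = np.array([1.42181326, -0.39778831])

# under an interval transform the sampler would propose on the raw scale
# and these would map back strictly inside (0, 1)
transformed = expit(raw_start)
print(transformed)
```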


Awesome, thank you!

I was able to get this working, but it’s a bit hacky. I concatenated the missing and observed variables from the two beta distributions, but that changed the order of the indices. To accommodate, I reordered the predictor and outcome variables to match the `Bi` order. Is there a cleaner way to do this? Specifically, I’m wondering if there’s a way to have `Bi` concatenate based on the indices of the missing (`idx_miss`) and observed (`idx_obs`) values.
@roesta07, @MarcoGorelli

```python
with pm.Model() as m15_5b:

    # priors
    a = pm.Normal("a", 0, 0.5)
    bB = pm.Normal("bB", 0, 1)
    bM = pm.Normal("bM", 0, 0.5)
    sigma = pm.Exponential("sigma", 1)

    # parameterization for beta, split into observed/missing
    idx_miss = df_primates.loc[df_primates["neocortex_prop"].isna(), "neocortex_prop"].index
    idx_obs = df_primates.loc[df_primates["neocortex_prop"].notna(), "neocortex_prop"].index
    Bi_miss = pm.Beta("Bi_miss", alpha=2, beta=2, shape=len(idx_miss))
    Bi_obs = pm.Beta("Bi_obs", alpha=1, beta=1, observed=df_primates.loc[idx_obs, "neocortex_prop"])
    # Is there a way to concatenate based on index?
    Bi = pm.math.concatenate([Bi_miss, Bi_obs], axis=0)

    # linear model, changed the order of logmass_std to match Bi index
    mu = a + bB * (Bi - 0.67) + bM * df_primates.loc[idx_miss.tolist() + idx_obs.tolist(), "logmass_std"]

    # likelihood, changed the order of k_std to match Bi index
    K = pm.Normal("K", mu, sigma, observed=df_primates.loc[idx_miss.tolist() + idx_obs.tolist(), "k_std"])

    # sample
    trace_m15_5b = pm.sample(draws=1000, random_seed=19, return_inferencedata=True)
```
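One possible cleanup (a sketch, untested against this dataset): rather than reordering the predictor and outcome columns, compute the permutation that maps the concatenated `[missing, observed]` order back to the original row order, then index the concatenated tensor with it and leave the dataframe columns untouched. The permutation logic itself is plain numpy:

```python
import numpy as np
import pandas as pd

# toy stand-in for df_primates["neocortex_prop"] with two missing rows
s = pd.Series([0.55, None, 0.76, None, 0.67])

idx_miss = s.index[s.isna()].to_numpy()
idx_obs = s.index[s.notna()].to_numpy()

# Bi is built as concatenate([Bi_miss, Bi_obs]), i.e. rows appear in this order
concat_order = np.concatenate([idx_miss, idx_obs])

# inverse permutation: for each original row, its position in the concatenation
inv = np.argsort(concat_order)

# indexing the concatenation with inv recovers the original row order,
# so inside the model Bi[inv] could be used with the unreordered columns
print(concat_order[inv])
```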