Testing difference between two Negative Binomial distributions

Hi, I am completely new to Bayesian statistics and pymc3 and have run into trouble.
I am trying to measure the change in the number of deaths in hospitals before and after the intervention. I decided to mostly follow BEST procedure (Bayesian Estimation Supersedes the T-Test — PyMC3 3.10.0 documentation) to evaluate the effect. However, I am not sure if my model is correct.

My questions are:

  • is my model adequate to the situation?
  • can we use the difference of means in this case? (the resulting posterior distributions don’t look normal to me)
  • what are other ways to evaluate the pre and post difference?

Here is my code:

import pymc3 as pm

pre_mu = pre.mean()
pre_sd = pre.std() * 2
post_mu = post.mean()
post_sd = post.std() * 2
σ_low = 0
σ_high = 10

with pm.Model() as model:
    
    pre_mean = pm.Normal("pre_mean", mu=pre_mu, sd=pre_sd)
    post_mean = pm.Normal("post_mean", mu=post_mu, sd=post_sd)
    
    pre_std = pm.Uniform("pre_std", lower=σ_low, upper=σ_high)
    post_std = pm.Uniform("post_std", lower=σ_low, upper=σ_high)
    
    diff_of_means = pm.Deterministic("difference of means", pre_mean - post_mean)
    diff_of_stds = pm.Deterministic("difference of stds", pre_std - post_std)
    effect_size = pm.Deterministic(
        "effect size", diff_of_means / np.sqrt((pre_std ** 2 + post_std ** 2) / 2)
    )
    

    group1 = pm.NegativeBinomial("pre", mu=pre_mu, alpha=pre_sd, observed=pre_nzk)
    group2 = pm.NegativeBinomial("post", mu=post_mu, alpha=post_std, observed=post_nzk)
    
    trace = pm.sample(5000)

I think most of your questions would be best answered knowing what form you data is in. Does your data consist of pre-post pairs? Or do you have a pile of pre observations and a separate pile of post observations? You mention the empirical distribution not looking well-behaved. What does it actually look like?

1 Like

Thanks for your response!

  1. No, they are not pairs, but two observational periods (pre and post), where the number of lethal cases/day was recorded as a single number.
  2. Attaching trace plot.

Then I would say what you have is at least reasonable. Two quick comments, however. First, uniform priors are often not a great idea because the discontinuities at each end of the interval give the sampler a hard time. Sampling is often easier if you use something like a gamma prior to “encourage” the sampler to stay >=0. Second, you have a large number of divergences (the tick marks along the x-axes) that should discourage you from interpreting the results you are currently getting. I suspect that these two points are related, but the divergences could arise from other sources as well.

1 Like

I reparametrized using Lognormal instead of Binomial, as I converted raw lethal cases to lethal cases per hospitalized patients. However now I receive SamplingError and I can’t figure out what’s wrong

import pymc3 as pm

pre_mu = pre_nzk.mean()
pre_sd = pre_nzk.std() * 2
post_mu = post_nzk.mean()
post_sd = post_nzk.std() * 2


alpha = 1
beta = 5

with pm.Model() as model:
    
    pre_rrt_mean = pm.Normal("pre_mean", mu=pre_mu, sigma=pre_sd)
    post_rrt_mean = pm.Normal("post_mean", mu=post_mu, sigma=post_sd)
    
    pre_std = pm.Gamma("pre_std", alpha=alpha, beta=beta)
    post_std = pm.Gamma("post_std", alpha=alpha, beta=beta)
    
    diff_of_means = pm.Deterministic("difference of means", pre_mean - post_mean)
    diff_of_stds = pm.Deterministic("difference of stds", pre_std - post_std)
    effect_size = pm.Deterministic(
        "effect size", diff_of_means / np.sqrt((pre_std ** 2 + post_std ** 2) / 2)
    )
    

    group1 = pm.Lognormal("pre", mu=pre_mean, sigma=pre_std, observed=pre_nzk)
    group2 = pm.Lognormal("post", mu=post_mean, sigma=post_std, observed=post_nzk)
    
    trace = pm.sample(5000, target_accept=0.99)

And the error is:

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'pre_mean': array(3.67540805), 'post_mean': array(2.31661707), 'pre_std_log__': array(-1.60943791), 'post_std_log__': array(-1.60943791)}

Initial evaluation results:
pre_mean           -2.58
post_mean          -1.99
pre_std_log__      -1.00
post_std_log__     -1.00
pre                  NaN
post             -144.00
Name: Log-probability of test_point, dtype: float64

I had to make some guesses about what your data looks like so that I could run your model. But it seems to sample ok for me. So maybe an awkward match of priors and your data?