 # Sum of two Binomial Distributions

I’m relatively new to PyMC, so I might be making a very simple error. A little bit about the problem I’m trying to model. There is food that is potentially contaminated which is being tested. We know the false positive and false negative rates of the testing procedure. I’m trying to get some bounds on the actual number of contaminated samples.
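In symbols, the generative story I have in mind is roughly (my own notation, with `c` the unknown number of contaminated samples out of the `n` tested):

```latex
y \;=\; \underbrace{\mathrm{Binomial}(c,\; 1 - fn)}_{\text{true positives}}
\;+\; \underbrace{\mathrm{Binomial}(n - c,\; fp)}_{\text{false positives}}
```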

```python
import pymc3 as pm

n = 100
y = 50
fn = .2
fp = .1

with pm.Model() as food_model:
    positives = pm.DiscreteUniform("positives", lower=0, upper=n)
    true_positives = pm.Binomial("true_positives", n=positives, p=(1 - fn))
    false_positives = pm.Binomial("false_positives", n=y - positives, p=fp)
    model = pm.Deterministic('model', true_positives + false_positives, observed=y)

with food_model:
    # Draw samples
    trace = pm.sample(1000, njobs=2)
    # Plot parameter
    pm.forestplot(trace, varnames=['positives'])
```

The error is:

```
TypeError                                 Traceback (most recent call last)
in ()
      7 true_positives = pm.Binomial("true_positives", n = positives, p = (1 - fn))
      8 false_positives = pm.Binomial("false_positives", n = y - positives, p = fp)
----> 9 model = pm.Deterministic('model', true_positives + false_positives, observed = y)

TypeError: Deterministic() got an unexpected keyword argument 'observed'
```

For reference, I used the advice given here: https://github.com/pymc-devs/pymc3/issues/1108


@ggluck `Deterministic` is not a random variable; it represents a deterministic variable (a combination of several RVs). We use deterministic variables to capture transformations of the random variables and store them in the trace.

However, if you fix the deterministic with `observed`, then there is no point in storing it in the trace.

Please see this GitHub issue for more details: https://github.com/pymc-devs/pymc3/issues/1971

Why do you want to use a deterministic at the end? And how are you going to determine the bounds from a single observed value? I think you should provide the actual number of contaminated samples from multiple experiments.


Thanks for the prompt response. I was using one data point just to test if the code was working.
Here are some simulated data which should make the probabilistic model clearer.

```python
import numpy as np
import pymc3 as pm

fn = .2
fp = .05
n = 100  # total number of samples tested
contaminated = np.random.binomial(100, .4)
not_contaminated = 100 - contaminated
true_positive = np.random.binomial(contaminated, (1 - fn))
false_positive = np.random.binomial(not_contaminated, fp)
y = true_positive + false_positive

with pm.Model() as food_model:
    positives = pm.DiscreteUniform("positives", lower=0, upper=n)
    proportion_positives = pm.Deterministic("proportion_positives", positives / 100.0)
    true_positives = pm.Binomial("true_positives", n=positives, p=(1 - fn))
    false_positives = pm.Binomial("false_positives", n=y - positives, p=fp)
    model = pm.Deterministic('model', true_positives + false_positives, observed=y)

with food_model:
    # Draw samples
    trace = pm.sample(1000, njobs=2)
    # Plot parameter
    pm.forestplot(trace, varnames=['proportion_positives'])

print(pm.stats.hpd(trace['proportion_positives'], .05))
```

Of course things should work with one data point (running without an error does not mean that you'll get the right answer). The problem is that you can't use a deterministic as the likelihood. I don't have sufficient knowledge of your exact use case to decide on a suitable likelihood for it.

Assume that a suitable likelihood is the Normal distribution.

```python
import pymc3 as pm

n = 100
y = 50
fn = .2
fp = .1

with pm.Model() as food_model:
    positives = pm.DiscreteUniform("positives", lower=0, upper=n)
    true_positives = pm.Binomial("true_positives", n=positives, p=(1 - fn))
    false_positives = pm.Binomial("false_positives", n=y - positives, p=fp)
    model = pm.Normal('model', true_positives + false_positives, observed=y)

with food_model:
    # Draw samples
    trace = pm.sample(1000, njobs=2)
    # Plot parameter
    pm.forestplot(trace, varnames=['positives'])
```

Now this will work, because a deterministic cannot be the likelihood of a probabilistic model.

When I asked for the probabilistic model, I meant the equations (or conditional relationships) between the random variables and how they should be mapped to a likelihood.

Let's see if someone else has a better insight on this problem.

@Nadheesh is right: the problem here is that you cannot assign `observed` to `pm.Deterministic`, and in this case it is also difficult to get a closed-form likelihood for the observed positives.
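As an aside (not from the thread): for a fixed latent count of contaminated samples, the observed total is the sum of two independent binomials, so its exact PMF can be computed by convolving the two binomial PMFs. A sketch with NumPy/SciPy (the function name `observed_pmf` is my own):

```python
import numpy as np
from scipy.stats import binom

def observed_pmf(positives, n, fn, fp):
    """Exact PMF of (true positives + false positives) for a fixed latent count."""
    # PMF of true positives: Binomial(positives, 1 - fn), support 0..positives
    tp = binom.pmf(np.arange(positives + 1), positives, 1 - fn)
    # PMF of false positives: Binomial(n - positives, fp), support 0..n - positives
    fpv = binom.pmf(np.arange(n - positives + 1), n - positives, fp)
    # Convolution gives the PMF of the sum, support 0..n
    return np.convolve(tp, fpv)

pmf = observed_pmf(40, 100, 0.2, 0.1)
```

This gives a proper likelihood for each candidate value of `positives`, which could in principle be used for exact (grid-based) posterior computation in a problem this small.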

We are working on implementing a likelihood-free algorithm (Approximate Bayesian Computation) for cases like this. Currently, this is an ongoing project by @agustinaarroyuelo, @aloctavodia, and me; hopefully it will be selected as one of the GSoC 2018 projects.

In this relatively simple case, you can try rejection sampling implemented in NumPy:

```python
import numpy as np
import matplotlib.pyplot as plt

n = 100
y = 50
fn = .2
fp = .1

# rejection sampling
Nsample = 1000
yhat = 0
TrueP = []
for i in range(Nsample):
    # only accept if the simulated data is the same as the observed
    while yhat != y:
        # simulate the latent true-positive count; can be replaced with an informative prior
        trueP = np.random.randint(0, n)
        # simulate observation
        yhat = np.random.binomial(trueP, 1 - fn) + \
               np.random.binomial(n - trueP, fp)
    TrueP.append(trueP)
    yhat = 0
TrueP = np.asarray(TrueP)

# display result
plt.hist(TrueP, np.arange(n));
```
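The one-at-a-time loop can also be vectorized in NumPy (a sketch with my own variable names, not from the thread): propose latent counts in batches, simulate an observation for each, and keep only the proposals whose simulated data match the observed `y`.

```python
import numpy as np

n, y, fn, fp = 100, 50, 0.2, 0.1
Nsample = 1000
rng = np.random.default_rng(0)

accepted = np.empty(0, dtype=int)
while accepted.size < Nsample:
    # propose a batch of latent true-positive counts from the uniform prior
    trueP = rng.integers(0, n, size=20 * Nsample)
    # simulate an observation for each proposal
    yhat = rng.binomial(trueP, 1 - fn) + rng.binomial(n - trueP, fp)
    # keep only proposals whose simulated observation matches the data
    accepted = np.concatenate([accepted, trueP[yhat == y]])
accepted = accepted[:Nsample]
```

Since the acceptance rate is roughly the prior predictive probability of observing `y`, batching the proposals is much faster than rejecting one draw at a time.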

Thanks for all your help. I appreciate the response.

@ggluck: how did you end up modeling your problem?