# Inference with bayesian network in PyMC3

Hello everyone! I’m setting up an experiment in which raters have to evaluate whether something is ‘1’ or ‘0’, and am trying to infer the proportion p of ‘1’ in the population of things that are evaluated. The problem is that raters have a certain probability of making type I and II errors, so those have to be estimated as well.

I was considering estimating p and the error probabilities of raters using a bayesian network, and have written the following code:

``````@as_op(itypes=[tt.dscalar, tt.dscalar, tt.lscalar], otypes=[tt.dscalar])
def rater(alpha, beta, rel):
if (rel == 1):
return (1 - beta)
else:
return (alpha)

with pm.Model() as generate_net:

#we know from earlier research that the p is probably rather low
p = pm.Beta('p',2,5)

#uniform priors for the errors
#assume for now that alpha and beta are the same for our two raters
alpha =pm.Beta('alpha',1,1)
beta =pm.Beta('beta',1,1)

relevance = pm.Bernoulli('relevance', p = p)

rater1 = pm.Bernoulli('rater1', p = rater(alpha, beta, relevance), observed = rater1_data)
rater2 = pm.Bernoulli('rater2', p = rater(alpha, beta, relevance), observed = rater2_data)
``````

rater1_data and rater2_data are just arrays of 1’s and 0’s. Unfortunately, when I try sampling, I get a `SamplingError: Bad initial energy` . Looking for a solution, I encountered this thread on bayes nets in PyMC3, and now have two questions:

1. Is it possible to run this kind of model in PyMC3, and what might I be doing wrong?
2. Can anyone recommend other python tools I might use. I know there’s various options around (Pomegranate, Pgmpy, Edward, Pyro) but have experience with neither and find it hard to choose.

thanks!

Hi @SCon.
So I think that the thread you link to (and Pomegranate) are a distraction. I believe everything you want to do here can be done in PyMC.

It really sounds like you are describing Signal Detection Theory, so you are basically doing psychophysics here

Note that in signal detection theory they use terms like

• “hit rate” which is the same as a “correct detection” (1-alpha) and
• “miss rate” which is the same as a Type II error / false negative (beta)

See Type I and type II errors - Wikipedia for more on this if it’s not clear.

Feel very free to shout if this doesn’t solve your problem.

1 Like

Also see this paper for all the gory details.

Lee, M. D. (2008). BayesSDT: Software for Bayesian inference with signal detection theory. Behavior Research Methods , 40 (2), 450–456. BayesSDT: Software for Bayesian inference with signal detection theory | SpringerLink

That is implemented in JAGS, but it would be easy to port to PyMC.

And in fact a chapter on SDT has already been ported to PyMC, pymc-resources/SignalDetectionTheory.ipynb at main · pymc-devs/pymc-resources · GitHub

Thanks for the quick reply and pointers, @drbenvincent! I must confess I had never even heard of this field of research.

I won’t have time until monday to look at it in more detail, but perhaps a quick question already: I get the impression that signal detection theory is for cases where we know which observations are hits, misses and false alarms. I think my case is different (and, apologies in case my description was unclear!). In my case, all I have are 1’s and 0’s, and the information that there is a true rate p, and a false positive and false negative rate for each rater (the code I posted has 2 raters with the same rates, but there could be more and with different rates). That’s why I was looking at various classifiers, such as a Bayesian network.

Sure. So if you don’t know the ground truth, then I think SDT and type I and type II errors is not the right approach. If you don’t know the ground truth then I’m not sure if it’s possible to distinguish these error types.

What is the goal of what you are trying to infer? The consistency of two independent raters? How responses change as a function of some predictor variable(s)? From the initial description it sounds like you just want to estimate a proportion for rater 1 and a proportion for rater 2?

1 Like

Thank you! I’m mostly interested in the true proportion, but also in the rates of Type I and II errors. I hoped that by having multiple raters rating the same thing, that would be possible.

If you want to know Type I and Type II errors then you need to know the ground truth. But seeing as this is a rating task, and from what you say, it sounds like you don’t have that.

What you can do is to work out consistency between raters. How many raters do you have? If you have 2 then you can imagine how this forms a 2x2 table of response 0 or 1 for each rater. So that would form a 2x2 table of 0 and 1 ratings and you could model response consistency. So it would bear a passing resemblance to Type I Type II error table, but it’s not quite the same as you are comparing one person to another, not to the ground truth.

This also has a slight hint of Item Response Theory to it - if the emphasis was more on the things being assessed by multiple raters for example. But there’s not enough detail in the question to know if that’s relevant.

I was hoping that with a few raters (and we’re planning on having at least 30), it would be possible to estimate type I and II errors. For example, if 29 raters rate ‘1’, and one rates ‘0’, than there is a higher probability that the ‘0’ was a false negative (unless they are all biased in the same way, but we have reason to believe that raters are reasonably good at evaluating this).

Originally, I tried the following:

Assume T is the true proportion, and O is the observed proportion. Then for each rater:

``````P(O=0|T=1) = false_neg
P(O=1|T=0) = false_pos
P(O=0|T=O) = 1 - false_neg
P(O=1|T=1) = 1 - false_pos
``````

From this, I derived P(O = 1), and tried modeling it with r raters:

``````with pm.Model() as model_error:

p = pm.Beta('p',1,1)

for i in range(r):
er1 = pm.Beta(f'er1{i}',1,1)
er2 = pm.Beta(f'er2{i}',1,1)
y = pm.Bernoulli(f'y{i}', p = er1 * (1 - p) + (1 - er2) * p, observed = df.iloc[:,i])
``````

However, this gives only very broad estimates. The problem is that it doesn’t take the amount of agreement between raters into account for each individual observation – I might as well have used a binomial. So that’s what I tried to resolve with the bayesian network in my original post.

But I will give modeling response consistency a try. I was intending to to use some common measure of interrater reliability as well, but this might be more interesting.

Once again, thanks for the help, it is much appreciated!

A few ideas

1. If you could defend it, could could assume you know ground truth, based on high rater consistency, then treat it as SDT.

2. Or you could stay ‘fully Bayesian’ and treat it like SDT but where you are inferring ground truth from response consistency. I can’t guess at how precise the estimates would be - it would depend on the data.

In both cases you’d benefit from some hierarchical structure… the individual false negatives and false positives are going to be drawn from group-level distributions. And you’d certainly want some informed priors on the error rates that accurately reflect what you believe.

1 Like

Thanks for the ideas, I prefer to go fully Bayesian, as it’ll be a bit arbitrary to decide what counts as high consistency. So I’ll try a multi-level model and see how it goes.