Making a query for a simple BN made in PyMC3

I successfully made the student BN example with PyMC3, and it seems to give the correct answers in general.

Is there an easy way to make a query for an observation like {L=L0, S=S1}?
The code and CPDs are as follows:

import numpy as np
import pymc3 as pm
import theano

D = np.array([0.6, 0.4])
I = np.array([0.7, 0.3])
G = np.array([[[0.3, 0.4, 0.3],
               [0.05, 0.25, 0.7]],
              [[0.9, 0.08, 0.02],
               [0.5, 0.3, 0.2]]])
S = np.array([[0.95, 0.05], 
              [0.2, 0.8]])
L = np.array([[0.1, 0.9], 
              [0.4, 0.6], 
              [0.99, 0.01]])

with pm.Model() as model:
    I_ = pm.Categorical('I', p=I)      # Intelligence
    D_ = pm.Categorical('D', p=D)      # Difficulty
    G_prob = theano.shared(G)          # CPD of Grade given I and D
    G_0 = G_prob[I_, D_]
    G_ = pm.Categorical('G', p=G_0)
    S_prob = theano.shared(S)          # CPD of SAT given I
    S_0 = S_prob[I_]
    S_ = pm.Categorical('S', p=S_0)
    L_prob = theano.shared(L)          # CPD of Letter given G
    L_0 = L_prob[G_]
    L_ = pm.Categorical('L', p=L_0)

with model:
    trace = pm.sample(20000)

Thanks

You can't query the trace directly with ease. trace can either be given a variable's name, returning all values associated with it, or be indexed to return the values at a certain point in the trace. To query easily, you can convert the trace to a pandas.DataFrame with pymc3.trace_to_dataframe. The DataFrame will have the variable names as columns and each row is a point from the trace. Then you can do queries just like in pandas:

df = pm.trace_to_dataframe(trace)
print(df.query("L == 0 and S == 1"))
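
To turn such a query into an estimated conditional probability, you can count the rows matching the evidence and normalize. A minimal sketch, assuming the variable names D, L and S from the model above:

df = pm.trace_to_dataframe(trace)
evidence = df.query("L == 0 and S == 1")    # joint samples consistent with the evidence
p_d1 = (evidence["D"] == 1).mean()          # estimate of P(D=1 | L=0, S=1)
print(p_d1)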

See also: Intercausal Reasoning in Bayesian Networks

Thank you, it worked.

Ramin

I assume you mean to sample from the posterior.
Can you explain more? And would you kindly send me a script showing how you would set L==0 and S==1 and sample from the posterior again?

Thanks
Ramin

You mean you would like to set the evidence L = 0 and S = 1, and calculate the posterior probabilities of the other variables in the network, e.g. P(D=1 | L=0, S=1)?
It could be done like this:

with pm.Model() as model:
    I_ = pm.Categorical('I', p=I)
    D_ = pm.Categorical('D', p=D)
    G_prob = theano.shared(G)
    G_0 = G_prob[I_, D_]
    G_ = pm.Categorical('G', p=G_0)
    S_prob = theano.shared(S)
    S_0 = S_prob[I_]
    S_ = pm.Categorical('S', p=S_0, observed=1)   # evidence: S = 1
    L_prob = theano.shared(L)
    L_0 = L_prob[G_]
    L_ = pm.Categorical('L', p=L_0, observed=0)   # evidence: L = 0

So now you can do the inference like:

with model:
    trace = pm.sample(20000)

You've now got your posteriors of D, I and G, and if you'd like to make a query, for example
P(D=1 | L=0, S=1):

If I'm not wrong, all you have to do is count how many times D was equal to 1 across all the samples…

Somebody correct me if I'm wrong (lucianopaz, junpenglao)

Or, that would just be calculating the posterior distribution; and if we wanted to evaluate the query:

Since you already have samples from the joint distribution given the priors (or posteriors),
you can evaluate it as:

df = pm.trace_to_dataframe(trace)
# P(D=1 | L=0, S=1) = P(D=1, L=0, S=1) / P(L=0, S=1)
p = len(df.query("L == 0 and S == 1 and D == 1")) / len(df.query("L == 0 and S == 1"))

As @tomkov said, you can perform conditional sampling by setting some variables as observed. These variables will be fixed to 0 or 1 throughout the entire trace, so you cannot query values that are different from those. To get the probability of D=1 conditional on the observed L and S, you can simply do np.mean(trace['D'] == 1).
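
A minimal sketch of that calculation, using the model with S and L observed and its trace from the posts above:

with model:
    trace = pm.sample(20000)

p_d1 = np.mean(trace['D'] == 1)   # estimate of P(D=1 | L=0, S=1)
print(p_d1)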

To be honest, I'm not getting the right results with that method.
I've found this example, done with OpenBUGS, somewhere on the Internet.
This was their network:

I've defined the CPTs as:

D = np.array([0.6, 0.4])
I = np.array([0.7, 0.3])
G = np.array([[[0.3, 0.7],
               [0.05, 0.95]],
              [[0.9, 0.1],
               [0.5, 0.5]]])
S = np.array([[0.95, 0.05], 
              [0.2, 0.8]])
L = np.array([[0.1, 0.9], 
              [0.4, 0.6]])

On the other hand, with the code:

and using

the result is: 0.53
:unamused::unamused:

Thank you very much.
Shouldn't it give me the same answer as the query method @lucianopaz suggested?

The example is taken from Koller's book (Probabilistic Graphical Models, Chapter 3).
The difference between your setup and her example is that G has 3 states and the CPD of L is a little bit different.

I tried the posterior sampling using trace['D'] vs. the query method; I am getting the same answer up to an error in the third digit (0.601612 vs 0.603644).
Do you guys think there is an advantage to doing it through sampling with observed values vs. using the query method directly?

Thanks, I truly appreciate your help; it saved me tons of time.

Ramin

I've tried both ways also, and compared them with Chapter 3 of Koller's PGM book.

The thing is, when you use the query method you've already made the assumption that the prior probabilities in your network are correct, and therefore you can draw samples from the joint distribution.

On the other hand, if you're wrong about your priors (e.g., change the book probabilities to P(D=1)=0.01 and P(S=1)=0.001), then you'll get better (and more correct) values if you've observed some evidence.

In your very particular case, I would favor what you call the query method. What truly underlies said method is that you draw single samples from the prior generative process and then you try to infer the probability of certain conditional outcomes. The query method works for this problem for three very important reasons:

  1. The sample space of this problem is very small.
  2. The inference is based on a single observed outcome, S=1 and L=0, and not on a list of many iid observations from the generative process (which would be the case if the observed values of S and L were longer arrays).
  3. The event S=1 and L=0 is not highly unlikely, so drawing some scalar samples from the prior yields enough data to get a more or less precise answer.

When one of the previous three points does not hold, the query method stops being practical and one can only attempt to do MCMC.

On the other hand, when doing MCMC on discrete variables, we are forced to use Metropolis Monte Carlo, and we must take more care that the chains converge, mix well, don't get stuck, and have low autocorrelation. If we don't take these precautions, we end up with biased estimates. Usually NUTS is so efficient that, if we don't see divergences and the traces seem fine, we can work with all the points in the trace, but Metropolis is much less efficient. It will likely be necessary to thin the chains to reduce the autocorrelation.
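
A minimal sketch of what such burn-in and thinning could look like (the burn-in of 1000 draws and the thinning factor of 10 are purely illustrative choices, not recommendations from this thread):

with model:
    trace = pm.sample(20000)

# Drop an initial burn-in and keep every 10th draw to reduce autocorrelation
burned_and_thinned = trace[1000::10]
p_d1 = np.mean(burned_and_thinned['D'] == 1)   # estimate of P(D=1 | L=0, S=1)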

Thank you very much for your kind help.

I agree. If I change the prior, the PJD I've obtained with the trace is not valid anymore and I have to resample the whole process again.

On the same note, I wonder how I can better estimate the prior (and the other CPDs). Ideally, I would guess the priors should come from a Dirichlet distribution. I followed this example for the syntax:

Chris: Problem with pm.Categorical

import theano.tensor as tt

D = np.array([0.6, 0.4])
I = np.array([0.7, 0.3])
G = np.array([[[0.3, 0.4, 0.3],
               [0.05, 0.25, 0.7]],
              [[0.9, 0.08, 0.02],
               [0.5, 0.3, 0.2]]])
S = np.array([[0.95, 0.05],
              [0.2, 0.8]])
L = np.array([[0.1, 0.9],
              [0.4, 0.6],
              [0.99, 0.01]])

II = data.I.values  # data = pm.trace_to_dataframe(prev_trace) from known priors
DD = data.D.values
GG = data.G.values

with pm.Model() as cpd_estimate:
    Ie = pm.Dirichlet('Ie', a=np.ones(2))
    I_ = pm.Categorical('I', p=Ie, observed=II)
    De = pm.Dirichlet('De', a=np.ones(2))
    D_ = pm.Categorical('D', p=De, observed=DD)
    G_prob = pm.Dirichlet('Ge', a=np.ones(G.shape), shape=G.shape)
    G_0 = G_prob[I_, D_]
    G_ = pm.Categorical('G', p=G_0, observed=GG)
    trace = pm.sample(1000)

It is very slow (~2 draws/s)! Any suggestions? I can always use VI, but I was wondering if there is something I could do to make it faster. It becomes really slow when I add G!

PS: it seems there is something wrong with the Dirichlet; the sampling log output looks like this whenever I use a Dirichlet in my model:

Sorry for these long messages. I really appreciate your help.

Ramin

It makes total sense.

Thanks for the thorough explanation.

Ramin

Have you ever figured out why it is that slow?

Hi,

Actually no. I believe a better definition of the prior will lead to faster sampling. For now I use a very simple approach to approximate the priors better, which works only for discrete variables.
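
That simple approach isn't shown in the thread; one possibility for purely discrete variables would be to estimate the CPTs from category frequencies in previously generated (or observed) data and plug those in as the new priors. A hypothetical sketch along those lines (the names D_hat, I_hat, G_hat and the DataFrame data are assumptions, not code from this thread):

# Frequency-count estimates of the marginal CPDs of D and I (hypothetical, not the author's code)
D_hat = np.bincount(data.D.values, minlength=2) / len(data)
I_hat = np.bincount(data.I.values, minlength=2) / len(data)

# CPT of G given I and D: normalized counts within each (I, D) configuration
G_hat = np.zeros((2, 2, 3))
for i in range(2):
    for d in range(2):
        subset = data.query("I == @i and D == @d").G.values
        if len(subset) > 0:
            G_hat[i, d] = np.bincount(subset, minlength=3) / len(subset)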