Semantics / description of Kruschke diagrams (model_to_graphviz)

Hi,

I would like to know if there is somewhere a nice and handy description of the exact semantics of a Kruschke diagram (like the ones produced by model_to_graphviz).

What is the exact meaning of

  • the shapes (ellipse/circle, rectangle with rounded corners, rectangle)
  • the background (white vs. grey)
  • the arrows
  • anything else?

I would also like to know if there is any usual way to put the arrow into words when it points from some A to B. The most usual ways to express this meaning always seem to come from B (like “B is dependent on A” or “B conditional on A”). But as the arrow points from A to B, it would be more natural to say something like “A parameterises B”. How do you express this?

Is there some nice documentation or tutorial explaining all this?

Best regards
Matthias

So at a high level, the graph is showing how data comes about. The term that gets thrown around is “ancestral sampling”, but I like to think about it as the “flow of information” in the model.

Nodes are values. If a value is drawn as a box, it’s data (provided exogenously to the model). If it’s drawn as a circle, it’s a random variable.

The coloring is also related to data. If a shape is gray, it means that it is associated with some non-random data. If it’s white, it means there’s no associated data.

So in your graph specifically, sigma is just a value, so it’s constant data. It’s gray because some values from outside the model are informing it. It’s square because it’s just holding that data; there’s no random variable associated with it.

Obs, on the other hand, is round and gray. This means that, on the one hand, it’s a random variable, but on the other hand it is conditioned on observations. People use the term “likelihood” for variables like obs (this comes from the well-known formula for Bayes’ law, posterior = likelihood * prior), but I have beef with that. That’s neither here nor there. Maybe the best way to think about it is latent vs. observed. When a variable is shaded, we observe and measure it. Otherwise it’s a latent (or hidden) variable.

mu is white and round, which means it’s a random variable with no associated data.

The graph is read top to bottom. The arrows show dependencies in the model. mu is a root node. It doesn’t depend on anything else, so you can start the generative story of this model by taking unconditional samples from mu. obs depends on both mu and sigma . The generative story of obs, then, is that you take a random draw from mu, then a fixed sigma, and use them to parameterize a normal distribution. 1000 independent draws are taken from this single mu, sigma pair. I know this because of the box around obs. This is called a plate, and says there are iid copies of whatever is inside the plate.

pm.Potential terms appear on the graph as funny polygons. They don’t really fit into the generative story, but they will appear if you have them.

The arrows point from the mu to obs because that’s the order that samples are generated. You sample mu, then pass that sample into obs. The graph is telling you how to generate data, from start to finish. All of the terms you propose (B depends on A, B is conditioned on A, A parameterizes B) are valid imo.


The diagrams produced by model_to_graphviz() are not actually Kruschke diagrams. They are usually called Bayesian networks or graphical models. According to ChatGPT, they have their roots in a book by Judea Pearl called Probabilistic Reasoning in Intelligent Systems. The WinBUGS/OpenBUGS sampling program that is the great-grandfather of PyMC had a tool allowing you to draw them and could even do its sampling directly from the diagram. They were very important for those early samplers because they used Gibbs sampling, which broke down the problem by conditional variances, and so did a lot of symbolic work on the diagrams or on the programmatic representation of them created from the modeling code.

Kruschke introduced what are now known as Kruschke diagrams in his book Doing Bayesian Data Analysis. Actually, I’m not sure about that. He may have published them in an earlier journal article. You can get a short introduction here. They are graphical models or Bayesian networks made pretty and it would be very nice if PyMC could generate them.

Opher


Thanks for your comments, which are very helpful!

I find it interesting that this special graphical notation, which is so established in the Bayesian community and especially in the PyMC community, seems to be a mix of different diagram types (Bayesian network, Kruschke diagram, plate notation), yet apparently has neither a name nor complete documentation. It seems like Jesse’s description is the most complete documentation for this so far.

I probably perceive it like this because I do not have a long background in Bayesian modeling and am only now starting this journey in the PyMC context. Most of you probably already have experience with other probabilistic languages and have read more of the literature, so it just looks natural to you.

May the force be with you
Matthias


These are usually called “Bayesian networks” or “directed acyclic graphical models” (DAGs) and provide a way to define a joint distribution with density p(\theta, y) over parameters \theta and data y. Ancestral sampling (aka forward sampling) is what you do to generate random draws of \theta, y from the joint model over data and parameters, which you can always do with a directed acyclic graphical model.
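A plain-Python sketch of ancestral sampling for the two-level model discussed earlier in the thread (the names mu, sigma, obs are borrowed from that example):

```python
import random

# Ancestral (forward) sampling: draw each node only after its parents.
# Toy model: mu ~ Normal(0, 1), obs_i ~ Normal(mu, sigma) for i = 1..n.
def forward_sample(sigma=1.0, n=1000):
    mu = random.gauss(0.0, 1.0)                        # root node: unconditional draw
    obs = [random.gauss(mu, sigma) for _ in range(n)]  # children: conditional on mu
    return mu, obs
```

Because the graph is acyclic, there is always at least one ordering (a topological order) in which every node can be sampled after its parents.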

The observed/unobserved and random/constant distinctions are orthogonal. Usually you have boxes for constants and circles for random variables, with circles optionally shaded if they’re observed. I think that’s the convention here.

For example, in a linear regression y_n \sim \textrm{normal}(\alpha + \beta \cdot x_n, \sigma), you typically treat y_n, \alpha, \beta, \sigma as random variables in that they are assigned probability distributions, with x_n being the only constant or non-random variable (other than, say, the size of y). Then we treat y_n, x_n as observed, but \alpha, \beta, \sigma as unobserved. Gelman and Hill use the term modeled data for the observed outcomes y_n in this example and unmodeled data for the observed covariates x_n (assuming they’re not given a distribution, for example to deal with missing data). In ML, when there is unmodeled data other than constant sizes, such as the covariates x in a regression, they tend to call it a “discriminative” model.

You’ll also note that there’s no way to tell which order the arguments for Normal are displayed—here it’s leaning on our intuitions that mu is the location and sigma is the scale parameter, but that’s not part of the notation! The graphical renderings are massively underspecified in most cases.

The bigger box with the number in it is a plate.

Very often, the graphical models you find in papers are incomplete:

  • they don’t give the requisite indexing (e.g., the classic LDA graphical models);
  • they don’t give the distributions (again, the classic LDA diagrams);
  • they don’t tell you the argument order. In the example here, the notation doesn’t tell you that mu is the location and sigma is the scale of the normal; that’s just a naming convention, not part of the graphical model notation. Usually people try to keep the natural order of the distribution’s parameters, so the example here is extra confusing.


It’s not at all common among Bayesian statisticians. There’s not really a single complete example in Gelman et al.’s textbooks, for example, though he does draw some partial ones in chapter 5 of Bayesian Data Analysis for hierarchical models. Statisticians will write down models using sampling notation, which is precise, but they’ll rarely write out a sketchy diagram that only partly describes the model. Computer scientists, in contrast, will insist on the sketchy diagrams, as I have learned from reviewers.

Here, “conditional variances” is a typo and should have been “conditional distributions”.

For Gibbs sampling, the graphical operations calculate the Markov blankets of a node in the network to find all the variables that are involved in its conditional distribution. This notion was introduced by Pearl in the 1980s. This is what both BUGS and JAGS did and what NIMBLE still does. Since PyMC is largely graphical modeling based and has discrete sampling, I assume it calculates Markov blankets somewhere internally.
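For illustration only (this is a toy sketch, not how BUGS or PyMC implement it): the Markov blanket of a node is its parents, its children, and its children’s other parents, and it can be read straight off the parent structure of the graph.

```python
# Toy DAG from the thread's example, stored as child -> set of parents.
parents = {"mu": set(), "sigma": set(), "obs": {"mu", "sigma"}}

def markov_blanket(parents, node):
    """Parents, children, and co-parents (children's other parents) of `node`."""
    children = {c for c, ps in parents.items() if node in ps}
    coparents = {p for c in children for p in parents[c]} - {node}
    return parents[node] | children | coparents

markov_blanket(parents, "mu")  # {'obs', 'sigma'}
```

Conditioned on its Markov blanket, a node is independent of the rest of the graph, which is exactly why Gibbs samplers only need these variables when updating a node.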

There was complete documentation with BUGS in the early 1990s. Presumably Pearl had precise definitions. You can find everything you need in Koller and Friedman’s book on graphical models from the 00s.


I’m not sure that’s exactly right. I’m sure there are people who understand this better than me, but I think that PyMC gets (in the model specification) a list of the parameters needed to calculate the joint likelihood contributed by each node. It then runs over the list of nodes, passes each one the appropriate parameters, and adds the log likelihood into the total. That is, it isn’t calculating a Markov blanket internally (if it even makes sense to talk about a Markov blanket here, since it is being told the dependencies of each node) and it doesn’t use any such representation.
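A toy sketch of what I mean, with a single latent mu and a normal likelihood (all names and values are made up): each node contributes a log-density term that only sees its own parameters, and the terms are simply added up.

```python
import math

def normal_logp(value, mu, sigma):
    """Log density of Normal(mu, sigma) at `value`."""
    return -0.5 * ((value - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def total_logp(mu, data, sigma=1.0):
    lp = normal_logp(mu, 0.0, 1.0)                       # prior term for mu
    lp += sum(normal_logp(v, mu, sigma) for v in data)   # one likelihood term per datum
    return lp

total_logp(0.0, [0.0])  # = -log(2*pi) ~= -1.8379, two standard-normal terms at zero
```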

I’m curious if this is right or not.
Opher

@opherdonchin PyMC can compute that, but you are right that most step samplers don’t exploit this directly. When you assign a variable to a step sampler like NUTS or Slice, it computes a model logp function that has all the terms, even those that won’t change when the variable changes:

(NUTS will also compute the dlogp, and that is only for the variables that are being updated)

In contrast, some step samplers, like Metropolis, ask for a delta_logp function, which is logp(x1) - logp(x2). In this case the PyTensor backend will automatically remove terms that are identical between the two sides, keeping only terms that do depend on the changing variable:

The only place I am aware of where we explicitly exploit conditional dependencies is in MarginalModel, where we compute the logp of a marginalized discrete variable by only considering variables in its Markov blanket, for efficiency: pymc-experimental/pymc_experimental/model/marginal/marginal_model.py at 4deeec6490c755ff77ae6f79d20d88f233e508e1 · pymc-devs/pymc-experimental · GitHub

We could probably exploit this in more places, but historically we didn’t because 1) most of the time we just use a single sampler to update all variables, and 2) we didn’t have a good internal representation in previous versions of PyMC (< 4.0).

Here is a minimal example of the delta_logp:

import pymc as pm
from pytensor.compile.mode import get_mode

with pm.Model() as m:
    x = pm.Normal("x")
    y = pm.Normal("y")
    
logp1 = m.logp(vars=[x, y])
logp2 = m.logp(vars=[x])

mode = get_mode("FAST_RUN").excluding("fusion")  # for readability
m.compile_fn(logp1, mode=mode).f.dprint()
# Add [id A] '__logp' 4
#  ├─ -1.8378770664093453 [id B]
#  ├─ Mul [id C] 3
#  │  ├─ -0.5 [id D]
#  │  └─ Sqr [id E] 2
#  │     └─ x [id F]
#  └─ Mul [id G] 1
#     ├─ -0.5 [id D]
#     └─ Sqr [id H] 0
#        └─ y [id I]

m.compile_fn(logp1 - logp2, mode=mode).f.dprint()   
# Sub [id A] 'sigma > 0' 2
#  ├─ Mul [id B] 1
#  │  ├─ -0.5 [id C]
#  │  └─ Sqr [id D] 0
#  │     └─ y [id E]
#  └─ 0.9189385332046727 [id F]

Notice that x is not part of the second function.

Opened an issue to exploit this stuff: Compute only dependent logp in step samplers · Issue #7591 · pymc-devs/pymc · GitHub

Thanks, @ricardoV94, that’s super helpful. We went through a long period of thinking about similar options for Stan, but never managed to implement anything practical.