Possible Performance Issue Using Sparse Matrices in a Bipartite Noisy-OR Model

GitHub issue #2558:

Hello,

I’m trying to implement a certain model using pymc3, and have been running into what I believe are some severe performance issues.

My system:
2015 Macbook Pro, 3.1GHz i7, 16GB RAM
Python 3.5
pymc3 version 3.1
theano version 0.9.0

Some background:
The model I’m trying to implement is a bipartite, binary noisy-OR network. I have visible binary variables (call them ‘effects’), each of which depends on one or more hidden binary variables (call them ‘causes’). There’s a many-to-many relationship: each effect has multiple causes, and each cause influences multiple effects. The CPD for a given visible variable is a noisy-OR, where an effect is “on” if at least one of its causes is “on”. The “noisy” part is that each cause can fail to activate a given effect with a certain probability (i.e. the cause is “on” but the effect stays “off”), and each effect can also be “on” with a certain probability even when all of its causes are “off” (the so-called “leak” node).

Perhaps the best succinct explanation for this type of model is this page on the QMR-DT network (the canonical application of this kind of model, where the presence of diseases (hidden layer) is inferred based on the presence of various symptoms (visible layer)).
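In symbols, the CPD described above (using the same parameterization as the model code further down, where theta_j is the probability that the link from cause j to a given effect fails, and theta_leak_i is the probability that effect i stays “off” when all of its causes are “off”, i.e. one minus the leak probability described above) is

P(v_i = 1 | h) = 1 - theta_leak_i * \prod_{j \in parents(v_i)} theta_j^{h_j}
               = 1 - \exp( \log(theta_leak_i) + \sum_{j \in parents(v_i)} h_j * \log(theta_j) )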

As for some reproducible code, here’s code to simulate some data:

import numpy as np
import random
import scipy as sp
from scipy import sparse
from scipy import optimize
from scipy import stats
import pandas as pd
import pymc3 as pm
import theano
import theano.sparse
import theano.tensor as tt
import theano.printing
import matplotlib.pyplot as plt
%matplotlib inline
# Simulate some data. n hidden nodes, 
# all of which have probability 0.05 of being
# "on", except for node h02, which has probability
# 0.90. We'd like our inference to suggest that the
# parameter theta_h[2] has a likely value much higher
# than that of the others.

np.random.seed(42)
random.seed(42)  # the parent sets below are drawn with the stdlib random module

m,n = 10, 4
H = {"h"+str(i).zfill(2): 0.05 for i in range(1,n+1)}
H["h02"] = 0.90
H_keys = sorted(list(H.keys()))
V = {"v"+str(i): random.sample(H.keys(),random.randint(1,n)) for i in range(1,m+1)}

n_samples_per_v = 1 # in case we want more visible datapoints.

data = []
for v,pa_t in V.items():
    for _ in range(n_samples_per_v):
        if any(np.random.binomial(1,H[h]) == 1 for h in pa_t):
            v_value = 1
        else:
            v_value = 0
        data.append((v_value,pa_t))
m,n = len(data), len(H)

# Making the sparse matrix. This could
# definitely be made more efficient, but
# it's a one-time operation, so it doesn't really
# matter at this point.
h_lookup = dict(zip(sorted(H_keys),range(len(H))))
observed, parents = zip(*data)
A_dense = np.zeros((m,n),dtype=np.int)
for i, parents_i in enumerate(parents):
    for parent in parents_i:
        A_dense[i,h_lookup[parent]] = 1
A = sp.sparse.csc_matrix(A_dense)

The data variable I made ends up looking like this:

[(1, ['h03', 'h04', 'h01', 'h02']),
 (1, ['h03', 'h02']),
 (1, ['h02', 'h03', 'h01']),
 (1, ['h02', 'h04', 'h03', 'h01']),
 (1, ['h03', 'h02']),
 (0, ['h04', 'h03']),
 (0, ['h04']),
 (1, ['h04', 'h03']),
 (1, ['h01', 'h02', 'h04']),
 (1, ['h03', 'h01', 'h02', 'h04'])]

and the resulting observed data array and dense mask matrix A_dense look like

(1, 1, 1, 1, 1, 0, 0, 1, 1, 1)

and

array([[1, 1, 1, 1],
       [0, 1, 1, 0],
       [1, 1, 1, 0],
       [1, 1, 1, 1],
       [0, 1, 1, 0],
       [0, 0, 1, 1],
       [0, 0, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 1],
       [1, 1, 1, 1]])

respectively.

Here’s my attempt at a pymc3 model:

with pm.Model() as model:
    # Hidden nodes: Nodes are bernoulli distributed,
    # with Uniform priors on theta_h.
    theta_h = pm.Uniform("theta_h",lower=0,upper=1,shape=n,testval = 0.5)
    h = pm.Bernoulli("h",p=theta_h, shape=n, testval = 1)
    
    # Noisy-OR and leak priors:
    # theta_i is the probability that the link between hidden node h[i]
    # and a visible node dependent on it "fails".
    # We're stipulating here that we expect the relationship between
    # hidden and visible nodes to be not too noisy, by using a
    # Beta prior concentrated near zero, i.e. we expect the probability of
    # link failure to be low. Same with the leak probability.
    theta_leak = pm.Beta("theta_leak",alpha=2,beta=20,shape=m,testval=0.05)
    theta_i = pm.Beta("theta_i",alpha=2,beta=20,shape=n, testval = 0.05)
    
    # Log transformed
    wi = pm.math.log(theta_i).dimshuffle((0,'x')) # see below for why we dimshuffled
    w0 = pm.math.log(theta_leak)
    
    # Constructing the likelihood: 
    # P(v[i] = 1| h) = 1 - \exp{w0[i] + \sum_{h[j] \in parents(v[i])}{h[j]*wi[j]}}
    # Equivalent to the exponential formulation at the bottom of
    # http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume10/jaakkola99a-html/node2.html
    # We can express this using the following matrix equation: A*H*wi, where
    # - A is our sparse (m*n) mask of child-to-parent mappings.
    # - H is an (n*n) diagonal matrix with the vector of RV's h down the main diagonal
    # - wi is our log(theta_i), transformed from an (n,) array into an (n*1) column vector
    # NOTE: I would get cryptic errors about dropping non-broadcastable dimensions
    # if I used anything but structured_dot when constructing AHwi. Not sure what the deal
    # is there...
    H = theano.sparse.square_diagonal(h)
    AH = theano.sparse.basic.cast(theano.sparse.true_dot(A,H),"float64")
    AHwi = theano.sparse.structured_dot(AH,wi).ravel()
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi) # voila
    #theano.printing.Print()(p_v_given_h) # for debugging
        
    # Observed data consists of m binary observations, 
    # representing the states of the m visible nodes (1 = 'on').
    v = pm.Bernoulli("v",p=p_v_given_h,shape=m,observed=observed)
    
    trace = pm.sample()
    pm.forestplot(trace,varnames=["theta_h"])

My general difficulty is in implementing the many-to-many relationship between hidden nodes and visible nodes in an efficient manner in pymc3/theano.
As you can see, I’ve tried to express the likelihood in terms of matrix operations, where the parent-child relationships are encoded in the binary sparse matrix A, which I multiply by a diagonal matrix containing the hidden-variable RVs, and then by the column vector of the noise parameters.

The model works just fine when the data are small (both m and n below, say, 1000). However, the actual data I want to run this model on consist of around 2,500 hidden nodes and nearly 750,000 visible nodes (observations). When I try that, sampling is extremely slow (on the order of 400 s/iteration), even though my sparse matrix is only around 20 MB. I would have thought that pymc3 (and theano underneath it) could handle sparse structures of that size with relative ease.
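In matrix form, the quantity the model computes for all m visible nodes at once is

p = 1 - \exp( w0 + A * diag(h) * wi )    (exp taken elementwise)

with A the binary (m x n) mask, diag(h) the (n x n) diagonal matrix of hidden RVs, wi = log(theta_i) as an (n x 1) column vector, and w0 = log(theta_leak).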

I’m not sure if this is an issue with pymc3’s handling of sparse data structures, or the way I’ve implemented my model (or both).

If anybody has any insight into why this runs so slowly, and how I might go about modifying my implementation, it would be much appreciated. Thanks in advance!

Exchange so far on GitHub:

Junpenglao:
Have you tried running it with ADVI?

PriceHardman:
Just tried it. It told me ADVI doesn’t work with discrete RVs. I read a bit of this part of the documentation, which mentions that one can sidestep this by marginalizing out the discrete variables. I don’t know how to do that yet, so I’ll likely have to read some more before ADVI becomes a possible path forward. Thanks for the suggestion!

Junpenglao:
Oh I see, your hidden nodes are Bernoulli distributed; yes, in that case marginalizing out the discrete variables is the way to go. Just off the top of my head: is it possible to skip the h and just do H = theano.sparse.square_diagonal(theta_h)?

Junpenglao:
For example, something like:

with pm.Model() as model:
    theta_h = pm.Uniform("theta_h",lower=0,upper=1,shape=n)
    theta_leak = pm.Beta("theta_leak",alpha=2,beta=20,shape=m,testval=0.05)
    theta_i = pm.Beta("theta_i",alpha=2,beta=20,shape=n, testval = 0.05)

    # Log transformed
    wi = pm.math.log(theta_i)
    w0 = pm.math.log(theta_leak)
    AHwi = tt.dot(tt.dot(A_dense, tt.diag(theta_h)), wi)
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi) # voila
    v = pm.Bernoulli("v",p=p_v_given_h,shape=m,observed=observed)
    trace = pm.sample()

PriceHardman:
Interesting idea. It speeds things up a bit, but also changes the results.
Made these changes:

    H = theano.sparse.square_diagonal(theta_h)
    AH = theano.sparse.structured_dot(A,H)
    AHwi = theano.sparse.structured_dot(AH,wi).ravel()
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi)

The forestplot of the posteriors of theta_h is quite a bit different from before.
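A possible alternative to plugging theta_h straight into the exponent is to average the noisy-OR over the Bernoulli h one effect at a time, using the per-link expectation E[theta_i[j]**h[j]] = 1 - theta_h[j]*(1 - theta_i[j]). A minimal sketch of that version (it still treats the effects as independent given the parameters, which ignores the coupling through shared causes):

    # per-link expectation of theta_i[j] ** h[j] under h[j] ~ Bernoulli(theta_h[j])
    link_factor = 1 - theta_h * (1 - theta_i)          # shape (n,)
    AHwi = tt.dot(A_dense, pm.math.log(link_factor))   # sum of log-factors over each effect's parents
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi)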

Junpeng’s modification improves things:

with pm.Model() as model:
    theta_h = pm.Uniform("theta_h",lower=0,upper=1,shape=n)
    theta_leak = pm.Beta("theta_leak",alpha=2,beta=20,shape=m,testval=0.05)
    theta_i = pm.Beta("theta_i",alpha=2,beta=20,shape=n, testval = 0.05)

    # Log transformed
    wi = pm.math.log(theta_i)
    w0 = pm.math.log(theta_leak)
    AHwi = tt.dot(tt.dot(A_dense, tt.diag(theta_h)), wi)
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi)
    v = pm.Bernoulli("v",p=p_v_given_h,shape=m,observed=observed)
    trace = pm.sample()

With 1000 visible nodes and 10 hidden nodes, I’m getting about 22 it/s.
At 10000 visible and 100 hidden, sampling seems to be about 1.5 it/s.

Given that this really isn’t all that much data, I have to imagine there are ways of formulating this problem that will result in faster sampling speeds.

I think in this case ADVI might help; you can try:

with model:
    trace = pm.sample(3000, njobs=4, init='advi+adapt_diag')

The first thing I do in cases like this is usually to try to find out why it is slow. There are two basic reasons for slowness: either each gradient evaluation takes a lot of time, or we need a lot of gradient evaluations for each sample. If it is the second, the problem is the shape of the posterior, so you need to look into reparametrizations that reduce funnels or correlations. If it is the first, you can use theano’s profiling tools to figure out where the time is going.

You can access the profile using something along those lines:

func = model.logp_dlogp_function(profile=True)  # compiled joint logp + gradient, with theano profiling on
func.set_extra_values({})                       # no extra (non-free) variables to fix
x = np.random.randn(func.size)                  # a random point in the transformed free-parameter space
%timeit func(x)

func.profile.print_summary()                    # per-op timing breakdown from theano

You can access the number of gradient evaluations per sample using trace.tree_size. There are also a couple of other diagnostics you could look at; there is a list in the docs for pymc3.NUTS.
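For example, after sampling with NUTS you can pull the per-draw stats out of the trace, e.g.:

tree_size = trace.get_sampler_stats('tree_size')  # leapfrog steps per draw
print(tree_size.mean())  # consistently large values point to difficult posterior geometry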


Thanks, I’ll look into both!

Nice. I managed to get decent speed on medium-sized data by switching to ADVI and doing the following:

with pm.Model() as model:
    theta_h = pm.Uniform("theta_h",lower=0,upper=1,shape=n)
    theta_leak = pm.Beta("theta_leak",alpha=2,beta=20,shape=m,testval=0.05)
    theta_i = pm.Beta("theta_i",alpha=2,beta=20,shape=n, testval = 0.05)

    # Log transformed
    wi = pm.math.log(theta_i)
    w0 = pm.math.log(theta_leak)
    AHwi = tt.dot(tt.dot(A_dense, tt.diag(theta_h)), wi)
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi) # voila
    v = pm.Bernoulli("v",p=p_v_given_h,shape=m,observed=observed)
    approx = pm.fit(n=30000, method=pm.ADVI())
    trace = approx.sample(draws=5000)
    pm.forestplot(trace,varnames=["theta_h"],figsize=(20,10))

Notice, however, that we’re using the dense matrix A_dense to specify the parent-child relationships. As the data get large, this matrix becomes very sparse: the real-world data I’m hoping to apply this model to have ~750,000 observations and ~2,500 hidden nodes, and only 0.13% of the elements are nonzero.

Would using sparse matrices be useful in this situation? Is PyMC3 able to leverage sparse matrices, or would I be better off continuing to use dense matrices and trying to run the inference on a more powerful machine?
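For reference, a sparse counterpart of the AHwi line, reusing the scipy CSC matrix A and the theano.sparse ops from the first model above (just a sketch, and whether it actually speeds up ADVI is exactly the open question):

    H_sp = theano.sparse.square_diagonal(theta_h)                            # sparse (n, n) diagonal
    AH = theano.sparse.structured_dot(A, H_sp)                               # A is the scipy CSC mask
    AHwi = theano.sparse.structured_dot(AH, wi.dimshuffle(0, 'x')).ravel()   # wi as an (n, 1) column
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi)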

If ADVI works, NUTS should work well too. In my experience NUTS is much more accurate than ADVI, as the mean-field ADVI approximation consistently underestimates the variance of the marginal posterior, and sometimes gives outright wrong estimates.

There is probably some speedup to be had from using sparse matrices. However, I don’t have much experience with them. Please let us know how it goes! :wink:

Sounds good, I’ll report back my findings. The next thing I want to look into is minibatch ADVI, which I think would let me sidestep the scale issues I’ve been having.

Yep, the minibatch solution would fit your problem well. I am always curious how it performs compared to NUTS on large problems, but I’ve never had a big enough dataset.

You could try using the minibatch estimate to initialize NUTS. First rewrite your model using minibatches (replace A_dense etc. with minibatch tensors, and be careful with the total_size kwarg). Then extract the mu and rho from the ADVI fit and use them to initialize NUTS in the original, non-minibatch model.
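A rough sketch of what that could look like with pm.Minibatch (batch_size here is arbitrary; the mask rows and the observations are stacked into one matrix so the same random slice applies to both, and theta_leak is simplified to a single scalar so nothing per-row besides the data needs to be minibatched):

batch_size = 500
data_mat = np.column_stack([A_dense, observed]).astype("float64")
mb = pm.Minibatch(data_mat, batch_size=batch_size)   # random rows of [mask | observation]
A_mb, obs_mb = mb[:, :n], mb[:, n]

with pm.Model() as mb_model:
    theta_h = pm.Uniform("theta_h", lower=0, upper=1, shape=n)
    theta_leak = pm.Beta("theta_leak", alpha=2, beta=20, testval=0.05)  # scalar leak (simplification)
    theta_i = pm.Beta("theta_i", alpha=2, beta=20, shape=n, testval=0.05)

    wi = pm.math.log(theta_i)
    w0 = pm.math.log(theta_leak)
    AHwi = tt.dot(tt.dot(A_mb, tt.diag(theta_h)), wi)
    p_v_given_h = 1 - pm.math.exp(w0 + AHwi)

    # total_size rescales the minibatch log-likelihood back up to the full m observations
    v = pm.Bernoulli("v", p=p_v_given_h, observed=obs_mb, total_size=m)

    approx = pm.fit(n=30000, method=pm.ADVI())

From there, the ADVI means and standard deviations could be used to initialize NUTS on the full, non-minibatch model, as suggested above.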


Update: I was able to get the runtime using ADVI down to about 3.5 hours (not great, but much better than the 86 hours that pymc3 was originally estimating). Unfortunately, the model output was not anywhere close to what I was expecting, which makes me think I need to go back to the drawing board in terms of modeling my problem. Big thanks to Junpeng for all your help!