Latent Space Network Model is very slow, does not converge

I’ve built out the latent space model from my earlier post, but I’m still having sampling issues. The sampling is very, very slow and gets slower over time: the first couple hundred draws are about 1s each, the next thousand are ~30s/draw, and now I’m up to 90s/draw. With 4 chains, that works out to roughly 6 minutes per draw in total, on a system where each chain appears to be using 18 cores.

I’ve already reparameterized the hierarchical component, and I’ve tightened the prior variance so that draws are constrained to the right order of magnitude to produce a graph of similar density. Those are the two things I’ve seen recommended, and I’m not sure what to try next.

Also, a version of the model with no hierarchical component or covariates at all, just estimating the latent space, takes about 30s/draw, which is still very slow but less so, especially since it’s running on a less powerful server (~2 cores/chain). It doesn’t converge either, but it does have higher acceptance probabilities and does a little better.

I’ve pasted code for a simulated version of the model below. It samples faster than the real data (settling around 10s/draw), but it does not converge.

I usually like to ask a more specific question, but at this point I’m at a loss for what to try next, especially since it’s hard to tell when poor convergence is just due to having too few samples (since even a test run with only a couple hundred samples takes around 6 hours). Can anyone help me troubleshoot the model, and maybe narrow down solutions?

I’ve run a handful of short tests and finished one longer run with the real data (though the longer run uses a less optimized version of the model, since I started it 72 hours ago). I get the following warnings and errors:

    There were 433 divergences after tuning. Increase `target_accept` or reparameterize.
    The acceptance probability does not match the target. It is 0.06801044279154617, but should be close to 0.8. Try to increase the number of tuning steps.
    There were 999 divergences after tuning. Increase `target_accept` or reparameterize.
    The acceptance probability does not match the target. It is 0.09502623861687749, but should be close to 0.8. Try to increase the number of tuning steps.
    The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
    The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
    The estimated number of effective samples is smaller than 200 for some parameters.
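
For reference, acting on those warnings would look roughly like the sketch below, using the simulated model from later in this post as the stand-in. In newer PyMC3 releases target_accept is passed directly to pm.sample, while older ones take nuts_kwargs=dict(target_accept=0.9); the numbers here are just placeholders.

    with test_net:
        # Higher target_accept forces smaller step sizes, and more tuning steps give
        # the sampler more time to adapt; both can reduce divergences, at the cost
        # of even slower sampling
        trace = pm.sample(200, tune=1000, chains=2, target_accept=0.9)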

Also, whenever I try to do a traceplot, the kernel crashes. This may be a coincidence, since I usually start a second model as soon as one finishes, so system resources are already tied up before I try to plot the trace.
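
In case the crash is just memory pressure from plotting the full trace, a sketch of what I could try first: convert the trace and plot only the scalar coefficients rather than all (n, K) latent positions (arviz, with the model and variable names from the code below).

    import arviz as az

    with test_net:
        idata = az.from_pymc3(trace)
    # Plot only the scalar coefficients; the (300, 3) latent-position matrix z is
    # by far the largest thing in the trace and the likeliest memory hog
    az.plot_trace(idata, var_names=['Sender v1', 'Sender v2', 'Receiver v1', 'Receiver v2'])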

Any suggestions for next steps would be greatly appreciated.

    import numpy as np
    import pandas as pd
    import theano.tensor as TT
    import pymc3 as pm

    # Create data with two individual-level variables, two dyad-level variables, and a latent space.
    K = 3
    n = 300
    # True latent positions, and the matrix of pairwise Euclidean distances between them
    z = TT._shared(np.random.normal(0,.5,(n,K)))
    xp = TT.sum(TT.pow(z,2),axis=1,keepdims=True)
    xp = TT.tile(xp,[1,n])
    xp = xp+TT.transpose(xp) - 2* TT.dot(z,TT.transpose(z))
    # xp = 1.0/ TT.sqrt(xp + TT.eye(n,n)*100000)
    # small jitter on the diagonal keeps the sqrt away from zero
    xp = TT.sqrt(xp+ TT.eye(n,n)*.00001)
    dist = xp.eval()

    b_1_s = -.08
    b_2_s = .008
    b_1_r = .09
    b_2_r = .3
    b_d1 = -.5
    b_d2 = .4

    v1 = np.random.poisson(10,(n,1)) 
    # v1 = np.log(v1)
    v2 = np.random.binomial(1, .5, (n,1))

    # Node-level covariates collected in a DataFrame; the model below indexes x['v1'] and x['v2']
    x = pd.DataFrame({'v1': v1.ravel(), 'v2': v2.ravel()})

    sender = (b_1_s * v1) + (b_2_s * v2)
    # rows index the sender (sender_mat[i,j] = sender[i]), matching the model below
    sender_mat = np.tile(sender,[1,n])

    receiver = (b_1_r * v1) + (b_2_r * v2)
    receiver_mat = np.tile(receiver.T,[n,1])

    #dyadic variable, distance between two attributes
    d1 = np.abs(np.random.normal(0,1,(n,1)))
    s = np.tile(d1,[1,n])
    r = np.tile(d1.T,[n,1])
    adj_d1 = np.abs(s-r)
    np.fill_diagonal(adj_d1,0)

    #dyadic variable, shared attribute
    adj_d2 = np.random.binomial(1, .2, (n,n)) 
    np.fill_diagonal(adj_d2,0)
    adj_d2[np.triu_indices(n)]=0
    adj_d2 = np.maximum(adj_d2,adj_d2.T)

    dyad = b_d1 * adj_d1 + b_d2 * adj_d2

    eta =  dyad - dist + sender_mat + receiver_mat
    eta = TT._shared(eta)
    eta = (1-TT.eye(n,n))*eta   # zero out the diagonal (no self-ties)
    p = TT.nnet.sigmoid(eta).eval()

    adj = np.random.binomial(1,p)
    np.fill_diagonal(adj,0)

And the model:

    with pm.Model() as test_net:
        # Latent Space Positions
        z = pm.Normal('z',mu=0,sd=10,shape=(n,K),testval=np.random.rand(n, K))
        # Matrix of latent distances
        latent = TT.sum(TT.pow(z,2),axis=1,keepdims=True)
        latent = TT.tile(latent,[1,n])
        latent = latent+TT.transpose(latent) - 2* TT.dot(z,TT.transpose(z))
    #     latent = 1.0/ TT.sqrt(latent + TT.eye(n,n)*10000)
        latent = TT.sqrt(latent+ TT.eye(n,n)*.00001)
        # Sender effects
        b_1_s = pm.Normal('Sender v1', mu=0,sd=100)
        b_2_s =  pm.Normal('Sender v2', mu=0,sd=100)
        sig_send = pm.HalfCauchy('Sender_sigma',beta=2.5)
        send_raw = pm.Normal('Sender_raw', mu=0,sd=1, shape=n)
        sender = pm.Deterministic('Sender',
                                  b_1_s*x['v1'].values + 
                                  b_2_s*x['v2'].values+
                                  sig_send*send_raw)
        sender_mat = TT.transpose(TT.tile(sender,[n,1]))
        # Receiver effects
        b_1_r = pm.Normal('Receiver v1', mu=0,sd=10)
        b_2_r =  pm.Normal('Receiver v2', mu=0,sd=10)
        sig_rec = pm.HalfCauchy('Receiver_sigma',beta=2.5)
        rec_raw = pm.Normal('Receiver_raw', mu=0,sd=1, shape=n)
        receiver = pm.Deterministic('Receiver',
                                  b_1_r*x['v1'].values + 
                                  b_2_r*x['v2'].values+
                                  sig_rec*rec_raw)
        receiver_mat = TT.tile(receiver,[n,1])
        #Dyad effects
        b_d1 = pm.Normal('Attribute Distance', mu=0,sd=10)
        b_d2 = pm.Normal('Shared attribute', mu=0,sd=10)
        dyad = pm.Deterministic('dyad',
                               b_d1*adj_d1+
                               b_d2 * adj_d2)
        # Put it all together
        eta =  dyad - latent + sender_mat + receiver_mat
        eta = eta - (TT.eye(n,n)*100)   # push the diagonal (self-ties) toward probability 0
        p=TT.nnet.sigmoid(eta)
        
        y = pm.Bernoulli('x',p,observed=adj)
    
        trace = pm.sample(200,tune=100,chains=2)

Another model finished and I had a rather obvious realization: there’s no reason for the estimated latent space to have the same orientation across chains. The coefficients all seem to have converged well, but the z parameter has not at all, which in retrospect makes a lot of sense.
So, new question: How might I assess whether the model has converged, if I wouldn’t expect the chains to be the same for the latent variables?

The first thing that comes to mind is to make latent (the variable containing the distances) a deterministic variable and look for convergence there, since I think the distances should still be consistent across chains.
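
Concretely, that would just be one extra line in the model block, right after latent is computed; the name 'latent_dist' is only a placeholder, and rhat for those distances could then be checked with the usual summary tools.

    # added inside the model block, right after latent = TT.sqrt(...):
    # the distances are invariant to the rotations, reflections, and translations
    # that make z itself non-identified, so rhat across chains is meaningful here
    latent_dist = pm.Deterministic('latent_dist', latent)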

Not sure I understand, but in general you look at the rhat and effective sample size. Ideally, the rhat of logp should also be around 1, but that’s usually a very high bar to pass. Alternatively, make sure the parameters you are interested in show good mixing and rhat, and a reasonable effective sample size.
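
For example, something along these lines, using the trace and a few of the coefficient names from the post above (the exact column names in the summary table vary a little across arviz versions):

    import arviz as az

    with test_net:
        idata = az.from_pymc3(trace)
    # The summary table includes rhat and effective-sample-size columns
    az.summary(idata, var_names=['Sender v1', 'Receiver v1', 'Attribute Distance', 'Shared attribute'])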

Thanks for the reply, I understand how to check for convergence generally, but not in this particular case. I’ll clarify a bit.
My understanding of rhat is that it measures the similarity of the sampled posterior across chains, which requires that, say, z[0,0] mean the same thing in each chain. But there are many latent configurations that are equivalent to one another: the distance between nodes i and j in the network is unchanged if the first and second latent dimensions are switched, or if a constant c is added to z. So z[:,0] in chain 1 may correspond to z[:,1], or to z[:,2]+5, in another chain. I can check the trace plots for good sampling within a chain, but I think r-hat is no longer appropriate, and I am still trying to figure out whether there is a good way to compare the chains. I re-ran with the distance as a deterministic, so I can at least check the r-hat scores there, since I think the distances should still be consistent across chains. Not sure if there are good alternatives for latent spaces.

Oh I see, so you are having label switching… Well, with so many latent labels it is pretty difficult to deal with. I would suggest sampling with 1 chain only and computing split-rhat (it is available in arviz).

Split r-hat still requires multiple chains, because you compare variance across chains. I’m still thinking through whether it’s okay to cut up a long chain; you won’t really know if you got stuck in a local mode and never made it out.

But I think I have an answer for myself: all I can do is use convergence statistics that depend only on a single chain. I will look at autocorrelation and variance over time, and calculate Geweke scores. I will also check rhat between the first and last portions of the chain (though I’m still not sure whether this really checks what rhat is supposed to check without infinite samples, or how long is long enough). And in addition, I’ll compare distances across chains, since those should be more stable than the latent variables themselves.
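
Roughly what I have in mind, as a sketch: single-chain diagnostics on one scalar parameter, assuming a PyMC3 version that still exposes pm.geweke and using the 'Sender v1' coefficient as the example.

    import numpy as np
    import arviz as az

    # Pull a single chain of one scalar parameter out of the MultiTrace
    samples = trace.get_values('Sender v1', chains=[0])

    # Geweke z-scores: compare means of early segments of the chain against the
    # last 50%; scores within roughly +/-2 suggest the chain has stabilized
    z_scores = pm.geweke(samples, first=0.1, last=0.5)

    # Autocorrelation within the single chain
    az.plot_autocorr(samples)

    # The "first vs last portion" idea: treat the two halves of the one chain as
    # if they were separate chains and compute rhat on that
    half = len(samples) // 2
    first_last_rhat = az.rhat(np.stack([samples[:half], samples[-half:]]))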

Sampling is still very slow, but I’ve managed to get it down to 8-15 seconds per draw (depending on the specification), and it does appear to be converging in the end.
When it’s all done I’ll share a more complete example of a latent space network model, but for the time being the one in the question does appear to work. I’d of course still love any further suggestions on speeding it up.

Edit: fixed a typo. I wish I’d gotten it that fast: 10 seconds per draw, not ten draws per second.
