Why would scipy/numpy wrapped in @as_op cause much faster sampling than using pytensor operations?

I’m a newcomer to PyMC and am trying to fit the means and covariance matrix of a 3D Gaussian. I’m assuming a Uniform prior on each of the 3 means, and an LKJCholeskyCov prior on the covariance matrix with n=3 and eta=2.

My forward model involves the following sequence of steps:

  1. Evaluate the probabilities of drawing each of 1 million pairs (x,y,z) given the current values of the 3D Gaussian’s mean vector and covariance matrix. The 1 million pairs are pre-defined.
  2. Do a tensordot of those resulting 1 million probabilities with 1 million 2D (20x20) maps corresponding to each of those 1 million (x,y,z) pairs. This would result in a single weighted 2D (20x20) map.
  3. Use a Poisson likelihood to compare the expected counts from that weighted 2D map with an actual observed 2D map that I have.

The following code works well (though it takes several hours) – here I have off-loaded the above sequence of steps into a separate function decorated with pytensor.compile.ops.as_op which makes use of scipy/numpy functions:

@as_op(itypes=[pt.dvector,pt.dmatrix], otypes=[pt.dmatrix]) 
def pytensor_forward_model(theta_mu, theta_cov): 
    mvn = scipy.stats.multivariate_normal(mean=theta_mu,cov=theta_cov,allow_singular=True)
    probs = mvn.pdf(points) 
    weighted_pdf_2d = np.tensordot(pdfs_2d,probs,axes=1).reshape((20,20))
    weighted_pdf_2d = weighted_pdf_2d / (np.sum(weighted_pdf_2d) * bin_area) 
    return weighted_pdf_2d

with pm.Model() as model:

    mu_x = pm.Uniform('mu_x', lower=0.01,upper=0.99) 
    mu_y = pm.Uniform('mu_y', lower=0.01,upper=0.99) 
    mu_z = pm.Uniform('mu_z', lower=-1,upper=1) 
    mu_vector = pt.stack([mu_x,mu_y,mu_z])
    chol, corr, stds = pm.LKJCholeskyCov('chol', n=3, eta=2, sd_dist=pm.Exponential.dist(1.0,shape=3))
    cov_matrix = pm.Deterministic('cov', chol.dot(chol.T))

    mu_pdf = pytensor_forward_model(mu_vector, cov_matrix)    

    z_obs = pm.Poisson('z_obs', mu=mu_pdf*bin_area*poisson_hist2d.sum(), observed=poisson_hist2d)        

When I do pm.sample(), PyMC automatically assigns a Slice sampler for each of the 3 mean RVs, and Metropolis for chol, and the code takes ~10 hours to run, which is I guess not bad given the gigantic tensordots involved. Using %timeit shows that pytensor_forward_model() takes ~50 ms per call.

In an effort to try to make it run faster, I rewrote the above to use purely pymc+pytensor operations hoping that maybe the pre-compilation + NUTS sampler would decrease the runtime. But the progress bar says ~100 hours… It’s also very slow if I specify the auto Slice+Metropolis sampler combo.

Here is my pure pymc+pytensor version of pytensor_forward_model() – these lines directly replace the mu_pdf = pytensor_forward_model() line inside the with context (negating the need for the separate @as_op decorated function):

mvn = pm.MvNormal.dist(mu=mu_vector,chol=chol)    
probs = pm.Deterministic('probs', pt.exp(pm.logp(mvn,points_tensor)))
weighted_pdf_2d = pm.Deterministic('weighted_pdf_2d',pt.tensordot(pdfs_2d,probs,axes=1).reshape((20,20)))
mu_pdf = pm.Deterministic('mu_pdf',weighted_pdf_2d / (pt.sum(weighted_pdf_2d) * bin_area))

where I created points_tensor = pt.as_tensor_variable(points) right before/above the with context, otherwise pm.logp(mvn,points) complains that points isn’t a pytensor variable.

Did I do something wrong in the pure pytensor approach? Is it just slow because of my priors? Are there obvious ways to speed it up that I’m missing?

Do you get any blas warnings when you import pymc?

You may want to temove those Deterministic calls. They are only need if you want the records in the final trace, but they can slow down sampling considerably.

Thanks @ricardoV94 But don’t I need the Deterministic calls so that pymc doesn’t think probs, weighted_pdf_2d and mu_pdf are random variables that need to be sampled? Or are you saying to do

mvn = pm.MvNormal.dist(mu=mu_vector,chol=chol)    
probs = pt.exp(pm.logp(mvn,points_tensor))
weighted_pdf_2d = pt.tensordot(pdfs_2d,probs,axes=1).reshape((20,20))
mu_pdf = weighted_pdf_2d / (pt.sum(weighted_pdf_2d) * bin_area)

I don’t need any of those in my final trace – but why would Deterministic slow down sampling a lot?

I’m not getting blas warnings anymore (I was at some point getting WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions. right after doing import pymc as pm but I reinstalled and that warning doesn’t show up anymore).

Deterministic just stores the computations in the trace, the name is perhaps unfortunate. More background on why it can slowdown things here: Parallelizing chains with custom likelihood on multiple cores - #18 by aseyboldt

Thanks again @ricardoV94

Removing all the Deterministic calls does not seem to have made a big improvement but it’s a good thing to do regardless.

3 other questions since I’m still puzzled why NUTS says ~15 hours time remaining after running for ~10 min vs. the Slice+Metropolis version saying ~4-6 hours time remaining after running for ~10 min:

  1. does pdfs_2d need to be a pytensor variable? It’s currently a fixed numpy array of shape ((400,1000000)) but I could wrap it ahead of the with context via pdfs_2d = pt.as_tensor_variable(pdfs_2d). Would that speed up pt.tensordot?

  2. Similarly, in the z_obs = pm.Poisson(...) likelihood call, should poisson_hist2d also be a pytensor variable? It’s currently a 20x20 2D numpy array. Would that help speed up sampling / calculations?

  3. Even though NUTS initially says many more hours remaining compared to Slice+Metropolis, do we expect NUTS to speed up later on as it navigates the landscape faster via gradient descent?

The gradients may be very costly to compute, but you shouldn’t compare nuts vs metropolis+slice based on runtime but on effective sample size. It doesn’t matter if Metropolis is 3 times as fast if it gives you 6x less effective samples.

About your questions, no need to wrap in tensor variables, that’s down automatically by PyMC already.

Thanks a lot @ricardoV94 this is extremely helpful and I really appreciate it.

I have a few more quick questions if that’s okay.

  1. I was looking for a way to benchmark the worst case timing for NUTS, and I found an old suggestion from you to use %timeit after calling model.compile_logp(), model.compile_dlogp() and model.initial_point(). For the above pure pymc+pytensor version of my MvNormal model, it takes ~50 ms to evaluate logp and ~100 ms to evaluate dlogp. Am I right to assume that in a single step NUTS may evaluate logp and dlogp up to ~1000 times? Then in that worst case scenario, assuming I’ve requested 500 tune + 500 draws = 1000 steps, would the total runtime could be 1000 steps * 1000 evaluations * 150ms/(1000*ms/s * 3600 s/hour) ~ 41 hours? In practice my NUTS run just finished in < 3 hours so NUTS must be doing <<1000 evaluations per step.

  2. Despite using a Uniform prior with lower=0.01 for mu_x and mu_y above, model.initial_point() has mu_x=0 and mu_y=0 – why? Doesn’t this violate my imposed prior? Shouldn’t the initial point be drawn from the prior?

  3. For LKJCholeskyCov, if I change sd_dist from pm.Exponential.dist(1.0,shape=3) to pm.Uniform.dist(lower=0.01,upper=1.0,shape=3), NUTS fails during the initialization with

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'mu_x_interval__': array(-0.42898456), 'mu_y_interval__': array(-0.57982898), 'mu_z_interval__': array(-0.13813203), 'chol_cholesky-cov-packed__': array([ 0.80203512,  0.69006191,  0.78059895, -0.52865383,  0.8978877 ,
       -0.1333528 ])}
Initial evaluation results:
{'mu_x': -1.43, 'mu_y': -1.47, 'mu_z': -1.39, 'chol': -inf, 'z_obs': -1162.73}

You see that chol=inf, and indeed logp_fn(ip) = inf for any initial point ip that doesn’t have an all-zero array for chol. Any idea why sd_dist=Uniform.dist(lower=0.01,upper=1.0) is not a good idea for LKJCholeskyCov (note that this still guarantees positive, reasonable magnitude sigma’s)? What WOULD be a well-behaved relatively uninformed prior for LKJ’s sd_dist?

  1. Finally I noticed that with my small tune=500 and draws=500 over 4 chains, NUTS finished and the results on a mock truth test look pretty good but I got 4 divergences (out of 4000 steps) and pm.sample complained at the end that my “effective sample size for some chains is less than 100”. Would this just be fixed by increasing tune, or could there be something deeper?

Hallo :slight_smile:

The 1000 dlogp evaluations in NUTS is an upper bound. This is controlled by the max_treedepth argument, the max number of evals is 2^max_treedepth, and that has a default of 10. Usually, the number is much smaller. And if it does get that large that means that the posterior geometry is really bad (so for instance if you have a funnel or if you have very strong correlations in the posterior).

I’m afraid a uniform doesn’t work as expected as the sd_dist argument of LKJCholeskyCov. This is a pretty unusual special case. Usually we completely avoid any parameter points by mapping constrained distributions to an unconstrained space, but in the case of LKJCholeskyCov we sidestep that machinery unfortunately (I guess it would actually be good to raise a warning here…).

Why do you want to have a uniform distribution here? There might be other ways of achieving what you want that avoid that problem…
I’d typically use a HalfNormal or possibly a Gamma distribution for those parameters. (But this is hard to make general statements about that).

500 tune and 500 draws is a bit on the low side. Usually the default of 1000 each is pretty decent.

Hi @aseyboldt Thank you so much for the clarifications!

Yeah I’m confident the number of evals is << 2^max_treedepth in my case since the programs are finishing in ~4-10 hours compared to the maximum time of ~40+ hours if all 2^maxtree_depth evals were being taken.

I don’t necessarily want a Uniform prior for the standard deviations but I was curious why I got that error since the docs don’t seem to impose any restrictions for sd_dist. I agree that maybe Gamma is the way to go even moreso than my current Exponential(1) so that I could allow for a small-scale cutoff on the standard deviations (to prevent numerical instabilities in chol/cov) whereas Exponential can in principle go to very small numbers. Ideally I would pick alpha and beta for Gamma such that the standard deviations can’t go below 0.01 and are likely to stay <1.

I assume a TruncatedNormal would work for sd_dist? (Again to prevent very small standard deviations)

I’ve upped my tune to 2000 and draws to 1000 with 2-4 chains and am still getting anywhere from 5-20 divergences, but the posteriors look good (I am doing mock truth tests with known model parameters). I tried increasing tune further without increasing draws and it gives fewer divergences. So I guess I don’t need to worry?

The LKJCholskyCov implicitly assumes the sd_dist is a distribution on the positive real line. It applies a LogTransform that is only really valid/useful for these, when doing the relevant transformation / computing the jacobian (my details here are hazy, but @aseyboldt can probably clarify).

A TruncatedNormal would suffer the same fate (divergences if the sampler gets close to the boundaries) because that would require an IntervalTransform to make it properly unbounded, but the LKJ implies/hardcodes a LogTransform. The LogTransform may actually make things worse because of the backward exponentiation when evaluating the logp.

I did try to make this explicit when I wrote sd_dist: A positive scalar distribution..., but should have been explicit about being unbounded on the right.

Would be good to update the docs to reflect this. Would you be interested in doing it?

@aseyboldt forLKJCholeskyCov’s sd_dist, I’ve tried pm.Gamma.dist(alpha=1.5,beta=3.0,shape=3) and pm.InverseGamma.dist(alpha=2.0,beta=0.5,shape=3) and they seem to give similar answers as when I was using pm.Exponential(1,shape=3). I don’t think the NUTS sampler is significantly faster for any of these.

Just to be clear, are you saying that I also shouldn’t use my pm.Uniform('mu_x',lower=0.01,upper=0.99) priors for the 3 means of pm.MvNormal because that might be slowing down NUTS (if it hits into the sharp boundaries of the Uniform prior)? What would a uniform-like well-behaved prior be for the means of MvNormal then? My mu_x and mu_y have no physical significance outside of 0-1, and mu_z has no meaning outside of -1 to 1. Would a pm.Beta(alpha=1,beta=1) be better behaved vs. pm.Uniform for the first 2 means, and keep pm.Uniform(lower=0,upper=1) for mu_z?

@ricardoV94 would InverseGamma(alpha=2,beta=0.5) also lead to divergences because its probability goes to 0 very fast at x<<0.1? If so, I thought InverseGamma was the usual (conjugate) prior to use for the standard deviations of a normal rather than Gamma or Exponential.

I am actually leaning towards keeping my original sd_dist=pm.Exponential(1,shape=3) because it also has the advantage that it pushes the standard deviations towards low values unless the data prefer otherwise, which reduces the variance of the model around the mean and maybe makes it easier for the model to draw covariance matrices from the LKJ prior and converge faster? Thoughts?

I would be happy to help update the docs after I understand this better.