Why would scipy/numpy wrapped in @as_op cause much faster sampling than using pytensor operations?

Thanks a lot @ricardoV94 this is extremely helpful and I really appreciate it.

I have a few more quick questions if that’s okay.

  1. I was looking for a way to benchmark the worst-case timing for NUTS, and I found an old suggestion from you to use %timeit after calling model.compile_logp(), model.compile_dlogp() and model.initial_point(). For the pure pymc+pytensor version of my MvNormal model above, logp takes ~50 ms to evaluate and dlogp takes ~100 ms (timing setup sketched below). Am I right to assume that in a single step NUTS may evaluate logp and dlogp up to ~1000 times? In that worst-case scenario, assuming I've requested 500 tune + 500 draws = 1000 steps, the total runtime could be 1000 steps × 1000 evaluations × 150 ms ≈ 150,000 s ≈ 41.7 hours. In practice my NUTS run finished in under 3 hours, so NUTS must be doing far fewer than 1000 evaluations per step.
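For reference, here is roughly how I'm doing the timing. The model below is just a toy stand-in for my real MvNormal model, so only the recipe, not the numbers, carries over:

```python
import numpy as np
import pymc as pm

# toy stand-in for the real MvNormal model, just to show the timing recipe
with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0, shape=3)
    pm.Normal("obs", mu=mu, sigma=1.0, observed=np.zeros((100, 3)))

logp_fn = model.compile_logp()
dlogp_fn = model.compile_dlogp()
ip = model.initial_point()

# in IPython/Jupyter:
# %timeit logp_fn(ip)   # ~50 ms for my real model
# %timeit dlogp_fn(ip)  # ~100 ms for my real model

# back-of-the-envelope worst case: 1000 steps x ~1000 evaluations x 150 ms each
print(1000 * 1000 * 0.150 / 3600, "hours")  # ~41.7
```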

  2. Despite using a Uniform prior with lower=0.01 for mu_x and mu_y above, model.initial_point() reports mu_x = 0 and mu_y = 0 (see the check below). Why? Doesn't that violate the prior I imposed? Shouldn't the initial point be drawn from the prior?
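Concretely, this is the check I'm running (the upper bound below is a placeholder; only lower=0.01 matches my real priors):

```python
import pymc as pm

# placeholder bounds: my real priors share lower=0.01, the upper value differs
with pm.Model() as model:
    mu_x = pm.Uniform("mu_x", lower=0.01, upper=1.0)
    mu_y = pm.Uniform("mu_y", lower=0.01, upper=1.0)

print(model.initial_point())
# prints something like
# {'mu_x_interval__': array(0.), 'mu_y_interval__': array(0.)}
```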

  3. For LKJCholeskyCov, if I change sd_dist from pm.Exponential.dist(1.0, shape=3) to pm.Uniform.dist(lower=0.01, upper=1.0, shape=3), NUTS fails during initialization with:

```
SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'mu_x_interval__': array(-0.42898456), 'mu_y_interval__': array(-0.57982898), 'mu_z_interval__': array(-0.13813203), 'chol_cholesky-cov-packed__': array([ 0.80203512,  0.69006191,  0.78059895, -0.52865383,  0.8978877 ,
       -0.1333528 ])}
Initial evaluation results:
{'mu_x': -1.43, 'mu_y': -1.47, 'mu_z': -1.39, 'chol': -inf, 'z_obs': -1162.73}
```

You can see that chol = -inf, and indeed logp_fn(ip) = -inf for any initial point ip whose chol array is not all zeros (a sketch of the change is below). Any idea why sd_dist = pm.Uniform.dist(lower=0.01, upper=1.0) is a bad choice for LKJCholeskyCov, even though it still guarantees positive sigmas of reasonable magnitude? What WOULD be a well-behaved, relatively uninformative prior for LKJ's sd_dist?
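For context, here is a minimal sketch of the change I'm describing (eta, the mu prior, and the mock data are placeholders, not my real values):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 3))  # mock data standing in for my real z_obs

with pm.Model() as model:
    mu = pm.Uniform("mu", lower=0.01, upper=1.0, shape=3)
    # sd_dist = pm.Exponential.dist(1.0, shape=3)              # this version samples fine
    sd_dist = pm.Uniform.dist(lower=0.01, upper=1.0, shape=3)  # this version fails at init for me
    chol, corr, stds = pm.LKJCholeskyCov("chol", n=3, eta=2.0, sd_dist=sd_dist)
    pm.MvNormal("z_obs", mu=mu, chol=chol, observed=data)

# reproduce the check described above: nudge the packed chol away from all zeros
logp_fn = model.compile_logp()
ip = model.initial_point()
ip["chol_cholesky-cov-packed__"] = ip["chol_cholesky-cov-packed__"] + 0.5
print(logp_fn(ip))  # -inf for me
```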

  4. Finally, I noticed that with my small tune=500 and draws=500 over 4 chains, NUTS finished and the results of a mock-truth test look pretty good, but I got 4 divergences (out of 4000 steps) and pm.sample complained at the end that my "effective sample size for some chains is less than 100". Would this be fixed simply by increasing tune (e.g. the re-run sketched below), or could something deeper be going on?
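In case it's useful, this is the kind of re-run I'd try. The toy model, the larger tune, and target_accept=0.95 below are all my own guesses, not something from your earlier answer:

```python
import arviz as az
import numpy as np
import pymc as pm

# tiny toy model standing in for my real one, just to show the re-run and the checks
with pm.Model() as model:
    mu = pm.Normal("mu", 0.0, 1.0, shape=3)
    pm.Normal("obs", mu=mu, sigma=1.0, observed=np.zeros((100, 3)))

with model:
    idata = pm.sample(draws=500, tune=2000, chains=4, target_accept=0.95)

print(int(idata.sample_stats["diverging"].sum()))  # count of divergent transitions
print(az.summary(idata, var_names=["mu"]))         # ess_bulk / ess_tail / r_hat
```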