Adapting Hawkes process Stan code to PyMC

Just an update: I have re-run the Stan model in R a couple of times and I cannot get it to converge again. The latest fit summary:

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean    sd 2.5%  25%   50%   75% 97.5% n_eff Rhat
lengthscaleS  0.09    0.04  0.06 0.00 0.05  0.11  0.13  0.19     2 2.51
lengthscaleT 17.96    7.41 11.15 0.02 7.02 21.61 25.79 33.40     2 2.80
a             0.16    0.04  0.51 0.07 0.09  0.11  0.13  0.34   198 1.02
mu            0.90    0.00  0.06 0.78 0.86  0.90  0.94  1.02  3837 1.00

I have done some ad hoc prior adjustment to the PyMC model, and with it I get close-ish results:

import pymc as pm
import aesara.tensor as at  # assuming the PyMC v4 / Aesara setup implied by the `at.` calls below

with pm.Model() as mod:
    lengthscaleS = pm.HalfNormal("lengthscaleS", 10) #length scale space
    lengthscaleT = pm.HalfNormal("lengthscaleT", 10) #length scale time
    a = pm.HalfNormal("a", 0.025) #excitation parameter
    gs = gk(spaceD, lengthscaleS) #gaussian kernel
    mu0 = pm.HalfNormal("mu0", 1) 
    mu = pm.HalfNormal("mu", 1)
    mu_xyt = mu0 + mu*muhat
    lam0 =  (a * lengthscaleT * at.exp(-timeD * lengthscaleT)) * gs
    lam = (lam0 * at.tril(at.ones(timeD.shape), k=-1)).sum(axis=-1)
    lam = pm.Deterministic("lam", mu_xyt + lam)
    muSTintegral =  len(xyt) 
    
    def lp(name, lam, a, lengthscaleT, mu, mu0, time, integral):
        # log-likelihood: sum of log-intensities minus the (approximate) integrated intensity
        return (at.sum(at.log(lam))
                - integral * mu
                - mu0 * space_window * time_window
                + a * at.sum(at.exp(-time * lengthscaleT))
                - integral)
    
    y = pm.Potential("y", lp("ylp", lam, a, lengthscaleT, mu, mu0, timeD2, muSTintegral)) 

with mod:
    idata = pm.sample(1000, tune=1000, chains=4, cores=12, random_seed=33)  # target_accept=0.95
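
For reference, the Potential above is meant to approximate the usual self-exciting (Hawkes) point-process log-likelihood, i.e. (with beta = lengthscaleT as the temporal decay rate, a the excitation weight, k_sigma the Gaussian spatial kernel gk, and mu(s,t) = mu0 + mu*muhat the background rate):

$$
\lambda(s,t) = \mu(s,t) + a\,\beta \sum_{j:\, t_j < t} e^{-\beta (t - t_j)}\, k_\sigma(s - s_j),
\qquad
\log L = \sum_i \log \lambda(s_i, t_i) - \int_S \int_0^T \lambda(s,t)\, \mathrm{d}t\, \mathrm{d}s .
$$

Aside from the sum of log-intensities, the terms in lp are my rough approximations to the pieces of that double integral.

Sampling output: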
 
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 12 jobs)
NUTS: [lengthscaleS, lengthscaleT, a, mu0, mu]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 292 seconds.

There were 2191 divergences after tuning. Increase `target_accept` or reparameterize.
                mean     sd  hdi_3%  ...  ess_bulk  ess_tail  r_hat
lengthscaleS   0.159  0.032   0.101  ...    1401.0    2026.0    1.0
lengthscaleT  20.995  3.177  15.567  ...    1011.0    1432.0    1.0
a              0.138  0.019   0.102  ...    1746.0    1749.0    1.0
mu             0.867  0.061   0.752  ...    1085.0    1396.0    1.0

Note that the prior adjustment for a (the excitation parameter) is quite drastic, and I still get over 2,000 divergences (even though the R-hat and ESS values look okay).
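
The warning's suggestion translates into something like the following; a minimal sketch only, where tune=2000 and target_accept=0.95 are arbitrary values on my side rather than anything I have verified:

with mod:
    # re-run with longer tuning and a higher target acceptance rate,
    # as the divergence warning suggests (values are guesses)
    idata2 = pm.sample(1000, tune=2000, chains=4, cores=12,
                       random_seed=33, target_accept=0.95)

If that alone is not enough, the other route the warning points at is reparameterization (e.g. sampling the lengthscales on the log scale).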