Adapting Hawkes process Stan code to PyMC

Hi all. I’ve been struggling for a while trying to adapt this model (GitHub - flaxter/gunshot-contagion: Replication files for Loeffler and Flaxman, "Is gun violence contagious?") to PyMC. The Stan code in that repo works very well:

functions {
  real gaussian(real x, real lengthscale) {
    return(1/(2*pi())*exp(-.5 * x^2/lengthscale^2)/lengthscale^2);
  }
}
data {
  int<lower=1> n;
  int hour[n];
  matrix[n,2] Space;
  vector[n] Time; // NB: these must be sorted from smallest to largest!
  real time_window;
  real space_window;

  vector[n] muST;
  real muSTintegral;
}
transformed data {
  matrix[n,n] timeD;
  matrix[n,n] spaceD;
  vector[n] timeD2;
  for(i in 1:n) {
    for(j in 1:n) {
      timeD[i,j] = -(Time[i] - Time[j]);
      spaceD[i,j] = distance(Space[i], Space[j]);
    }
    timeD2[i] = -(time_window - Time[i]); 
  }
}
parameters {
  real<lower=0> lengthscaleS;
  real<lower=0> lengthscaleT;
  real<lower=0> a;
  real<lower=0> mu;
  real<lower=0> mu0;
}
transformed parameters {
  vector[n] ll;
  real lp;
  ll = mu0 + muST * mu;
  for(i in 2:n) { // todo: vectorize these calculations
    for(j in 1:(i-1)) {
      ll[i] = ll[i] + (a * lengthscaleT * exp(timeD[i,j] * lengthscaleT) * gaussian(spaceD[i,j],lengthscaleS));
    }
  }
  lp = sum(log(ll)) - muSTintegral * mu - mu0 * space_window * time_window +
          a * (sum(exp(timeD2*lengthscaleT)) - n);
}

model {
  increment_log_prob(lp);
    
  lengthscaleS ~ normal(0,10); #sigma
  lengthscaleT ~ normal(0,10); #omega
  a ~  normal(0,10); #theta
  mu0 ~ normal(0,1); #m0
  mu ~ normal(0,1);
}

generated quantities {
  vector[n] background;
  real lengthscale_minutes;
  lengthscale_minutes = 24*60/lengthscaleT;
  background = (mu0 + muST *mu ) ./ ll;
}

Running it in RStan as in the repo gives this output:

Inference for Stan model: anon_model.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

              mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
lengthscaleS  0.12     0.0 0.03  0.08  0.10  0.12  0.14  0.19  1750    1
lengthscaleT 24.00     0.1 5.05 14.96 20.38 23.83 27.17 34.92  2791    1
a             0.10     0.0 0.02  0.06  0.09  0.10  0.12  0.15  2165    1
mu            0.90     0.0 0.06  0.78  0.86  0.90  0.94  1.02  2068    1

However, with my clumsy attempt using PyMC I get this non-converging output:

T = xyt['T'].values - min(xyt['T'].values) #I don't quite understand the reason for this

timeD = np.array([T - T[i] for i in range(T.shape[0])]) 

timeD2 = time_window - T

#Gaussian kernel function
def gk(x, ls):
    return (0.5*np.pi)*at.exp((-0.5 * x**2)/ls**2)/ls**2

with pm.Model() as mod:
    lengthscaleS = pm.TruncatedNormal("sigma", 0, 10, lower=0) #length scale space
    lengthscaleT = pm.TruncatedNormal("omega", 0, 10, lower=0) #length scale time
    a = pm.TruncatedNormal("theta", 0, 10, lower=0) #excitation parameter
    gs = gk(spaceD, lengthscaleS) #gaussian kernel
    mu0 = pm.TruncatedNormal("mu0", 0, 1, lower=0) 
    mu = pm.TruncatedNormal("mu", 0, 1, lower=0)
    lam0 = mu0 + mu*muhat
    lam = lam0[1:] + (a * lengthscaleT * at.exp(-timeD[1:,:250] * lengthscaleT)) * gs[1:,:250]

    muSTintegral =  len(xyt) 

    def lp(name, lam, a, lengthscaleT, mu, mu0, time, integral):
        l = [at.sum(at.log(lam)) - integral * mu - mu0 * space_window * time_window + 
             a * at.sum(at.exp(-time*lengthscaleT)) - integral]
        return l[0]

    y = pm.Potential("y", lp("ylp", lam, a, lengthscaleT, mu, mu0, timeD2, muSTintegral)) 

with mod:
    idata = pm.sample(1000, tune=1000, chains=4, cores=12)#, target_accept=0.95)    


pos = idata.stack(sample = ['chain', 'draw']).posterior

summ = az.summary(idata, var_names=["theta", "mu", "omega", "sigma"])

print(summ)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 12 jobs)
NUTS: [lengthscaleS, lengthscaleT, a, mu0, mu]
 |██████████| 100.00% [8000/8000 05:01<00:00 Sampling 4 chains, 113 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 326 seconds.

               mean     sd  hdi_3%  hdi_97%  ...  mcse_sd  ess_bulk  ess_tail  r_hat
lengthscaleS  1.190  0.521   0.557    1.905  ...    0.198       5.0      15.0   3.19
lengthscaleT  6.691  1.820   3.847   10.146  ...    0.649       5.0      21.0   2.50
a             0.765  0.307   0.401    1.241  ...    0.117       5.0      24.0   2.84
mu            1.512  0.810   0.431    2.466  ...    0.309       5.0      16.0   3.08

[4 rows x 9 columns]
 

I have also attempted a version of the model looping over the events, as in the Stan code, like this:

with pm.Model() as mod:
    lengthscaleS = pm.TruncatedNormal("lengthscaleS", 0, 10, lower=0) #length scale space
    lengthscaleT = pm.TruncatedNormal("lengthscaleT", 0, 10, lower=0) #length scale time
    a = pm.TruncatedNormal("a", 0, 10, lower=0) #excitation parameter
    gs = gk(spaceD, lengthscaleS) #gaussian kernel
    mu0 = pm.TruncatedNormal("mu0", 0, 1, lower=0) 
    mu = pm.TruncatedNormal("mu", 0, 1, lower=0)
    lam0 = mu0 + mu*muhat
    
    lam = at.zeros(n-1)
    for i in tqdm(range(1,n)):
        ll = at.zeros(n-1)
        for j in range(i-1):
            l = lam0[i] + (a * lengthscaleT * at.exp(-timeD[i,j] * lengthscaleT )) * gk(spaceD[i,j], lengthscaleS)
            ll = at.inc_subtensor(ll[i], l) 
        lam = at.inc_subtensor(lam[i], at.sum(ll))
    
    muSTintegral =  len(xyt) 
    
    def lp(name, lam, a, lengthscaleT, mu, mu0, time, integral):
        l = [at.sum(at.log(lam)) - integral * mu - mu0 * space_window * time_window + 
             a * at.sum(at.exp(-time*lengthscaleT)) - integral]
        return l[0]
    
    y = pm.Potential("y", lp("ylp", lam, a, lengthscaleT, mu, mu0, timeD2, muSTintegral)) 

However, the model doesn’t start sampling; pytensor seems to get stuck evaluating the lam parameter, for a reason I don’t understand. When I try lam.eval() it also gets stuck.

All in all, I must be doing something very wrong, but I cannot figure out what.

I will be very thankful for any help or advice on how to properly translate the Stan model to PyMC. Many thanks in advance.

If I understand the problem correctly (a shaky assumption at times), then:

timeD can be quickly computed with broadcast subtraction: timeD = T[:, None] - T[None].
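
For example, just to make the sign convention explicit (toy numbers only): the Stan code stores the negated differences and uses exp(timeD * lengthscaleT), so with this broadcast you want exp(-timeD * lengthscaleT) downstream.

import numpy as np

T = np.array([0.0, 1.5, 4.0])   # toy times, already sorted ascending
timeD = T[:, None] - T[None]    # timeD[i, j] = T[i] - T[j] >= 0 for j < i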

For spaceD, you can use scipy.spatial.distance.pdist (plus squareform to turn the condensed output back into a full matrix) to compute the euclidean distance between each pair of points, then normalize according to the formula in the appendix:

from scipy.spatial.distance import pdist, squareform

xy_coords = xyt[['x', 'y']].values
dists = squareform(pdist(xy_coords, 'euclidean'))
spaceD = np.exp(-dists ** 2 / 2 / sigma ** 2) / 2 / np.pi / sigma ** 2

So that will give you two N \times N matrices with all the preliminaries you need. As an aside, they subtract T.min() from T as a normalization, so that the first time in the dataset is 0. It keeps you from having underflows when you exponentiate. Personally I’d even go one step further and normalize it so the dates are all on the 0-1 interval (e.g. with sklearn.preprocessing.MinMaxScaler), but that’s a personal preference.
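
If you wanted to go that route, a minimal sketch (with T being the 1-D array of event times) would be:

from sklearn.preprocessing import MinMaxScaler

# squash the raw event times onto [0, 1]; keep track of the scale factor
# when interpreting lengthscaleT afterwards
T_scaled = MinMaxScaler().fit_transform(T.reshape(-1, 1)).ravel()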

Next you need to do the ll computation. For this I like to use masks. I’d start by just naively computing everything, as in:

ll = a * lengthscaleT * exp(-timeD * lengthscaleT) * spaceD

This ll holds the element-wise computation for every possible (i,j) combination. The trick now is only to add up certain elements. It turns out to be the lower triangle of this matrix. You can convince yourself of this by running the following code (which mirrors the loop in the Stan code, remembering that Stan starts indices from 1):

n = 5
mask = np.zeros((n , n))
for i in range(1,n):
    for j in range(0, i):
        mask[i,j] = 1
mask
>> Out: array([[0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0.],
       [1., 1., 1., 1., 0.]])

Anyway there’s an easier way to make this matrix than doing that loop:

addition_mask = np.zeros_like(timeD)
addition_mask[np.tril_indices_from(addition_mask, k=-1)] = 1

So we multiply ll by this mask then sum away the 2nd dimension: ll = (ll * addition_mask).sum(axis=-1) and we’re done.
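
If you want to reassure yourself that the masked sum matches the naive double loop, a quick pure-numpy check (with made-up values) looks like this:

import numpy as np

rng = np.random.default_rng(0)
n = 6
vals = rng.random((n, n))                     # stand-in for the element-wise ll terms
mask = np.zeros((n, n))
mask[np.tril_indices_from(mask, k=-1)] = 1

loop_sum = np.zeros(n)
for i in range(1, n):
    for j in range(i):
        loop_sum[i] += vals[i, j]

assert np.allclose(loop_sum, (vals * mask).sum(axis=-1))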

Note that you could also use this masking method to do the summation even if the rows are not sorted by time, you’d just have to be a bit more clever about how to construct the mask.

For actually making the PyMC model, I’d make the addition mask and the distance matrix in numpy then call at.as_tensor_variable on the results, otherwise things should map over 1:1. You could even use a Gaussian covariance kernel from pm.gp.cov instead of pdist if you wanted to be fancy.
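
For the fancy version, a rough sketch (assuming xy_coords is the (n, 2) coordinate array from above; note that ExpQuad is unnormalized, so you still need the 1/(2*pi*ls^2) factor from the Stan gaussian()):

import numpy as np
import pymc as pm

# inside the model block, with lengthscaleS already defined:
cov_func = pm.gp.cov.ExpQuad(input_dim=2, ls=lengthscaleS)
gs = cov_func(xy_coords) / (2 * np.pi * lengthscaleS**2)  # symbolic n x n kernel matrix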

Thank you very much for the reply. Regarding the spatial matrix, what I’m actually doing (sorry I didn’t add it in my previous post) is using the same Gaussian kernel you mentioned as the normalisation:

#Gaussian kernel
def gk(x, ls):
    return (0.5*np.pi)*at.exp((-0.5 * x**2)/ls**2)/ls**2

That is the parameter gs I added in my model. I’ll edit my initial question to make that clearer.

I should’ve realised that the upper triangle with negative times was a problem :sweat_smile: , but it never occurred to me to use a mask to turn those entries to zero. Sadly, the mask method doesn’t work as expected. The Stan code runs in about 2 minutes, but when I implement the mask method I get divergences at about 40% of sampling and then the sampler becomes extremely slow:

with pm.Model() as mod:
    lengthscaleS = pm.TruncatedNormal("lengthscaleS", 0, 10, lower=0) #length scale space
    lengthscaleT = pm.TruncatedNormal("lengthscaleT", 0, 10, lower=0) #length scale time
    a = pm.TruncatedNormal("a", 0, 10, lower=0) #excitation parameter
    gs = gk(spaceD, lengthscaleS) #gaussian kernel
    mu0 = pm.TruncatedNormal("mu0", 0, 1, lower=0) 
    mu = pm.TruncatedNormal("mu", 0, 1, lower=0)
    mu_xyt = mu0 + mu*muhat
    lam0 =  (a * lengthscaleT * at.exp(-timeD * lengthscaleT)) * gs
    lam = mu_xyt + (lam0 * addition_mask).sum(axis=-1)
    muSTintegral =  len(xyt) 
    
    def lp(name, lam, a, lengthscaleT, mu, mu0, time, integral):
        l = [at.sum(at.log(lam)) - integral * mu - mu0 * space_window * time_window + 
              a * at.sum(at.exp(-time*lengthscaleT)) - integral]
        return l[0]
    
    y = pm.Potential("y", lp("ylp", lam, a, lengthscaleT, mu, mu0, timeD2, muSTintegral)) 

with mod:
    idata = pm.sample(1000, tune=1000, chains=4, cores=12)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 12 jobs)
NUTS: [lengthscaleS, lengthscaleT, a, mu0, mu]
 |██████-------| 48.14% [3851/8000 10:24<11:13 Sampling 4 chains, 5 divergences]

Over ten minutes seems to indicate that there’s still a problem. Note that I added a variable called mu_xyt, which is the background part of the Hawkes process and is calculated from time and space distances using an Epanechnikov kernel. This mu_xyt should be part of the parameter ll (I named it lam0), but I add mu_xyt to lam0 after applying the mask and doing the summation, because otherwise the model breaks.

Quite possibly I misunderstood something from your answer or maybe I applied something wrongly.

Yes, you should add mu_xyt after the summation. Referencing equation 4 from the paper’s Appendix 1, we are computing:

$$\lambda(s, t) = \mu(x, y, t) + \theta \sum_{i:\, t_i < t} \omega\, e^{-\omega (t - t_i)}\, k\big((x, y), (x_i, y_i)\big)$$

You can see that the term \mu(x,y,t) is not inside the summation, so we do the two parts separately.
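
For completeness, the way I read the Stan lp, it is the usual point-process log-likelihood (writing $T_w$ for the time window and $|S|$ for the spatial window)

$$\log L = \sum_{i=1}^{n} \log \lambda(s_i, t_i) \;-\; \int_{S}\int_{0}^{T_w} \lambda(s, t)\, dt\, ds,$$

where the integral splits into the background part, $\mu \cdot \text{muSTintegral} + \mu_0 |S| T_w$, and the excitation part, $\theta \sum_i \big(1 - e^{-\omega (T_w - t_i)}\big)$ (the spatial kernel integrates to roughly 1), which is exactly the a * (sum(exp(timeD2*lengthscaleT)) - n) term in the Stan code.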

As for sampling, there are a bunch of steps we need to check before we even look at the model. We should make sure that all the data preprocessing steps match, and then check that the masking method is computing the same thing as the naive loop method. I checked their replication code and did the following preprocessing steps:

import pandas as pd
import os
os.environ['USE_PYGEOS'] = '0'
import geopandas as gpd

raw_df = pd.read_csv('https://raw.githubusercontent.com/flaxter/gunshot-contagion/master/WashingtonDC_Incidents_2006-2013_Raw_Data_2.csv')
df = raw_df.iloc[:, 2:6].copy()
df.columns = ['datetime', 'Type', 'long', 'lat']
df = df[df.lat > 37]
df.datetime = pd.to_datetime(df.datetime, format='%m/%d/%y %H:%M')

holidays = ["07-01","07-02", "07-03", "07-04", "07-05", "07-06", "12-29", "12-30", "12-31", "01-01", "01-02"]

def zero_pad(s, n):
    n_zeros = n - len(s)
    return '0' * n_zeros + s

def check_for_holiday(x):
    day = zero_pad(str(x.day), 2)
    month = zero_pad(str(x.month), 2)
    
    date_str = f'{month}-{day}'  # month-day, to match the format of the holidays list
    
    return date_str in holidays

# Drop holidays (Appendix 2)
df = df[~df.datetime.apply(check_for_holiday)]

# Sort by dates (forces the mask to be lower-triangular)
df = df.sort_values(by='datetime')

Time = df.datetime.astype('int64')
assert Time.is_monotonic_increasing

# Hours since first observation
Time = (Time - min(Time)) / 1e9 / 60 / 60

# No idea what the CRS for this data is, but EPSG:4326 is used by census shapefiles so it's my guess
df = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df.long, df.lat), crs="EPSG:4326")

# EPSG:26985 is projected metres for the DC area, convert to km and scale
xy = df.to_crs('EPSG:26985').geometry.get_coordinates().apply(lambda x: (x - x.min()) * 1e-4)

One potential problem I noticed doing this exercise was that the scale of time (hours since the first observation) still covers 7 years, so the values are quite large (median 32907) and it’s overflow city inside the exp function. I’d make this something smaller, like days or weeks, and work out the algebra to convert back to their paper. I don’t know if Stan does anything under the hood to handle large values in exp. These overflows are an easy place to look for divergences.
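
Concretely, something as simple as this before building the time-difference matrices keeps the exp arguments in a sane range (assuming Time is currently in hours):

# hours -> days; lengthscaleT is then a rate per day instead of per hour
Time = Time / 24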

I missed that the authors make their timeD negative by definition, so I was getting overflows. If everything is correctly negative, you can ignore all these comments.

Anyway, using this data I confirmed that the naive loop and the masking method are computationally equivalent, so a computation error isn’t the problem. In so doing I found that these computations take a long time with numpy. You should try using nutpie to see if you can get any speedup by jit compiling everything to numba.

I will say the masking method is quite computationally wasteful, especially at the scale of this data (we’re computing values for 625,500,765 elements that will be masked away). There might be a better way to do this, but I’d try JIT first to see if compiler magic can solve the problem for you. Using a scan with JAX is my next idea, but it might also be possible to ravel the lower triangles of timeD and spaceD and just work with those vectors by using some clever indexing.
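
For the record, here’s roughly what I had in mind for the ravel idea. This is an untested sketch: it assumes timeD[i, j] = T[i] - T[j], spaceD as before, and the gk kernel from earlier, and I believe at.inc_subtensor accumulates repeated indices like np.add.at, but that’s worth double-checking:

import numpy as np
import pytensor.tensor as at

rows, cols = np.tril_indices(n, k=-1)   # indices of the strictly lower triangle
dt = timeD[rows, cols]                  # positive time gaps as a flat vector
ds = spaceD[rows, cols]                 # matching spatial distances

# inside the model block, with a, lengthscaleT, lengthscaleS and mu_xyt defined:
contrib = a * lengthscaleT * at.exp(-dt * lengthscaleT) * gk(ds, lengthscaleS)
excitation = at.inc_subtensor(at.zeros(n)[rows], contrib)  # scatter-add into each row i
lam = mu_xyt + excitation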

I also would reduce the number of cores you assign to pm.sample to just equal the number of chains; check the most recent post in the pinned FAQ about how multiprocessing works with the sampler. I’m not sure what routines are used to do large elementwise matrix operations, but if they parallelize and dispatch themselves it might be causing problems.

My tl;dr is: re-scale the time variable to help with divergences and try to use nutpie to speed up sampling.
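
A minimal way to try nutpie (assuming it is installed; the exact API may differ a bit between versions, and on recent PyMC you may also be able to just pass nuts_sampler="nutpie" to pm.sample):

import nutpie

compiled = nutpie.compile_pymc_model(mod)
idata = nutpie.sample(compiled, draws=1000, tune=1000, chains=4)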

Hi. Many thanks again for taking the time to answer this. I’m not sure whether it is necessary to re-scale when using this file: “preprocessed_xyt.csv”. One problem with putting the time in hours rather than in days is that the time magnitude becomes too big with respect to the space in km. This breaks the model immediately, generating a nan initial value for y.

There are a couple of things I don’t really understand. Why would the masking method using numpy be different from doing something like timeD[timeD<0] = 0, or something like at.tril(at.ones(timeD.shape))? I’ve tried both and they perform faster, but divergences are even worse than with the numpy method. And the results do not match either:

# -*- coding: utf-8 -*-
import pandas as pd
import pymc as pm
import pytensor.tensor as at
import arviz as az
import numpy as np
from shapely import Polygon
from datetime import datetime
from scipy.spatial import distance_matrix


df = pd.read_csv("preprocessed_xyt.csv")

xyt = df

#bandwidth constants for Epanechnikov kernel:
bw_time = 14 
bw_space = 1.609  #1.609 km (1 mile)

start_date = datetime(2010,1,1)
end_date = datetime(2010,1,31)
time_window = end_date-start_date
time_window = time_window.days+1 


xyt = xyt.sort_values(by="T")

xyt['T'] = xyt['T']

xyt["hour"] = np.round(xyt["T"].values) % 24 + 1

xyt["hour_idx"] = xyt.hour.values.astype("int").astype("str")

phat_hour = [len(xyt[xyt.hour==xyt.hour.values[h]])/len(xyt) for h in range(len(xyt))]


xy = np.array([xyt['X'], xyt['Y']])
spaceD = distance_matrix(xy.T, xy.T)  #space distance matrix in km


space_window = Polygon(xy.T).area  #np.pi*(np.max(spaceD)*0.5)**2 #

T = xyt['T'].values #- min(xyt['T'].values)

TT = np.array([T,T])
timeD = distance_matrix(TT.T, TT.T)


## Define Epanechnikov kernel
def epa(x,ls):
    d = (0.75*(1 - (x/ls)**2))/ls
    d[d < 0] = 0
    return d

d_space = sum(epa(spaceD, bw_space)) #np.array([epa(s, bw_space) for s in spaceD]) #Epanechnikov kernel for space
d_time = sum(epa(timeD, bw_time))/len(xyt) #np.array([epa(t, bw_time) for t in timeD]) #Epanechnikov kernel for time

muhat = d_space*d_time #intensity estimate from Epanechnikov kernels

#Gaussian kernel
def gk(x, ls):
    return (0.5*np.pi)*at.exp((-0.5 * x**2)/ls**2)/ls**2
    

mu_xy = d_space
mu_t = d_time

T = xyt['T'].values - min(xyt['T'].values)  #I don't quite understand the reason for this

timeD = T[:, None] - T[None]

timeD2 = time_window - T

n = len(xyt)


# addition_mask = np.zeros_like(timeD)
# addition_mask[np.tril_indices_from(addition_mask, k=-1)] = 1
# addition_mask = at.as_tensor_variable(addition_mask)



with pm.Model() as mod:
    lengthscaleS = pm.TruncatedNormal("lengthscaleS", 0, 10, lower=0) #length scale space
    lengthscaleT = pm.TruncatedNormal("lengthscaleT", 0, 10, lower=0) #length scale time
    a = pm.TruncatedNormal("a", 0, 10, lower=0) #excitation parameter
    gs = gk(spaceD, lengthscaleS) #gaussian kernel
    mu0 = pm.TruncatedNormal("mu0", 0, 1, lower=0) 
    mu = pm.TruncatedNormal("mu", 0, 1, lower=0)
    mu_xyt = mu0 + mu*muhat
    lam0 =  (a * lengthscaleT * at.exp(-timeD * lengthscaleT)) * gs
    lam = mu_xyt + (lam0 * at.tril(at.ones(timeD.shape))).sum(axis=-1)
    muSTintegral =  len(xyt) 
    
    def lp(name, lam, a, lengthscaleT, mu, mu0, time, integral):
        l = [at.sum(at.log(lam)) - integral * mu - mu0 * space_window * time_window + 
              a * at.sum(at.exp(-time*lengthscaleT)) - integral]
        return l[0]
    
    y = pm.Potential("y", lp("ylp", lam, a, lengthscaleT, mu, mu0, timeD2, muSTintegral)) 

with mod:
    idata = pm.sample(1000, tune=1000, chains=4, cores=12)#, target_accept=0.95)    
    

pos = idata.stack(sample = ['chain', 'draw']).posterior

summ = az.summary(idata, var_names=["lengthscaleS","lengthscaleT","a", "mu"])

print(summ) ...
There were 3099 divergences after tuning. Increase `target_accept` or reparameterize.
                 mean     sd   hdi_3%  ...  ess_bulk  ess_tail  r_hat
lengthscaleS    0.000  0.000    0.000  ...     293.0     639.0   1.01
lengthscaleT  158.484  7.299  144.881  ...     607.0     718.0   1.01
a             158.590  6.585  147.135  ...     431.0     474.0   1.01
mu              0.004  0.004    0.000  ...     802.0     750.0   1.01

I’ve compared the muhat values I obtain with the Epanechnikov kernel and they match those from the R code reasonably well (same for the time scale and space scale), so I don’t see why that would generate a problem here. My impression is that what needs to be fixed is not really sampling speed; the speed problems seem to be a consequence of something wrong with the model’s structure. (It’s quite telling that divergences start happening during tuning.) I’ve reproduced the model directly from the repo, but maybe I missed something.

My apologies, I realised that I should’ve used at.tril(at.ones(timeD.shape), k=-1), with k=-1. In this case the model gets stuck at the same place as with the numpy method, at slightly over 30-40%. I’ve tried running with 1 core only, but the problem persists; it gets stuck at around 30%. I’ve changed the init method to “adapt_diag” but the same problem occurs, and with init=“advi” the sampling gets stuck at around 60%. I’m clueless :person_shrugging:t2: . Could it be something internal to pytensor or PyMC? I’m not sure whether JIT would help much with this, or maybe it could?

I don’t see preprocessed_xyt.csv in the repo. Not sure where it’s coming from, but I guess you ran their R code and exported it as a csv? If you upload it somewhere I’ll use your exact code as a starting point and see if I can be more helpful (I’m having trouble installing the required R packages to run their data script). I already see a typo in your gk function (should have 0.5 / np.pi).
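
For reference, a version of gk that matches the Stan gaussian() function, i.e. 1/(2*pi) * exp(-0.5 * x^2 / ls^2) / ls^2, would be:

import numpy as np
import pytensor.tensor as at

def gk(x, ls):
    return (0.5 / np.pi) * at.exp(-0.5 * x**2 / ls**2) / ls**2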

I’m sorry, you’re totally right, I forgot that the file is not in the repo but I saved it after running these lines from the R code:

library(spatstat)
library(rstan)
rstan_options(auto_write = T)
options(mc.cores = parallel::detectCores())
source("utils.r")

load("WashingtonDC-agls.rdata")

library(docopt)

'Usage:
model-mismatch.r [--start start] [--end end] [--bwSpace bwSpace] [--bwTime bwTime] [--nonseparable] [--duplicates] [--model model]

Options:
--start start [default: 2010-01-01]
--end end [default: 2010-01-31]
--bwSpace bw-space [default: 1.609]
--nonseparable
--duplicates
--bwTime bw-time [default: 14]
--model model [default: hawkes-model.stan]
]' -> doc

opts <- docopt(doc)

start_date = as.Date(opts$start)
end_date = as.Date(opts$end) 
bw_space = as.numeric(opts$bwSpace)
bw_time = as.numeric(opts$bwTime)
separable = !opts$nonseparable
remove.duplicates = !opts$duplicates

## Preprocess the data based on the command line arguments
keep = which(data$date >= start_date & data$date < end_date & data$holiday == F)
xyt = xyt[keep,]
xy = subset(xyt, select = c("X","Y")) #= xy[keep,]
nrow(xyt)

if(remove.duplicates) {
  time_dist = dist(xyt[,3])
  space_dist = dist(xyt[,1:2])
  library(igraph)
  ig = graph.adjacency(as.matrix(space_dist) < .1 & as.matrix(time_dist) <= .5  & lower.tri(space_dist))
  n.orig = nrow(xyt)
  clust = clusters(ig)$membership
  keep = !duplicated(clust)
  xyt = xyt[keep,]
  xy = xy[keep,]
}

xyt$T = xyt$T / 24

### Then I saved:
write.csv(xyt, "preprocessed_xyt.csv")

Here’s the output file:
preprocessed_xyt.csv (9.0 KB)

Thank you for spotting the typo! I always have great difficulty with typos.

Just in case, I re-ran with the typo fixed, but the problem persists (as expected).


Just an update. I have re-run the R code a couple of times and I cannot make it converge again.

Inference for Stan model: anon_model.
4 chains, each with iter=2000; warmup=1000; thin=1; 
post-warmup draws per chain=1000, total post-warmup draws=4000.

              mean se_mean    sd 2.5%  25%   50%   75% 97.5% n_eff Rhat
lengthscaleS  0.09    0.04  0.06 0.00 0.05  0.11  0.13  0.19     2 2.51
lengthscaleT 17.96    7.41 11.15 0.02 7.02 21.61 25.79 33.40     2 2.80
a             0.16    0.04  0.51 0.07 0.09  0.11  0.13  0.34   198 1.02
mu            0.90    0.00  0.06 0.78 0.86  0.90  0.94  1.02  3837 1.00

I have done some ad hoc prior adjustment to the PyMC model and I get close-ish results this way:

with pm.Model() as mod:
    lengthscaleS = pm.HalfNormal("lengthscaleS", 10) #length scale space
    lengthscaleT = pm.HalfNormal("lengthscaleT", 10) #length scale time
    a = pm.HalfNormal("a", 0.025) #excitation parameter
    gs = gk(spaceD, lengthscaleS) #gaussian kernel
    mu0 = pm.HalfNormal("mu0", 1) 
    mu = pm.HalfNormal("mu", 1)
    mu_xyt = mu0 + mu*muhat
    lam0 =  (a * lengthscaleT * at.exp(-timeD * lengthscaleT)) * gs
    lam = (lam0 * at.tril(at.ones(timeD.shape), k=-1)).sum(axis=-1)
    lam = pm.Deterministic("lam", mu_xyt + lam)
    muSTintegral =  len(xyt) 
    
    def lp(name, lam, a, lengthscaleT, mu, mu0, time, integral):
        l = [at.sum(at.log(lam)) - integral * mu - mu0 * space_window * time_window + 
              a * at.sum(at.exp(-time*lengthscaleT)) - integral]
        return l[0]
    
    y = pm.Potential("y", lp("ylp", lam, a, lengthscaleT, mu, mu0, timeD2, muSTintegral)) 

with mod:
    idata = pm.sample(1000, tune=1000, chains=4, cores=12, random_seed=33) #target_accept=0.95)    
 
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 12 jobs)
NUTS: [lengthscaleS, lengthscaleT, a, mu0, mu]
 |████████| 100.00% [8000/8000 04:26<00:00 Sampling 4 chains, 2,191 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 292 seconds.

There were 2191 divergences after tuning. Increase `target_accept` or reparameterize.
                mean     sd  hdi_3%  ...  ess_bulk  ess_tail  r_hat
lengthscaleS   0.159  0.032   0.101  ...    1401.0    2026.0    1.0
lengthscaleT  20.995  3.177  15.567  ...    1011.0    1432.0    1.0
a              0.138  0.019   0.102  ...    1746.0    1749.0    1.0
mu             0.867  0.061   0.752  ...    1085.0    1396.0    1.0

Note that the adjustment for a (excitation parameter) is quite drastic. And I still get over 2000 divergences (even though Rhats and ESS seem okay).

What do the pair plots look like? If there are loads of divergences but good r_hats, it can mean there’s some kind of demented geometry in the posterior that the sampler can’t explore (so the samples are biased, because a long tail is missing, for example). Sometimes you’ll see it in the pair plots.

That’s true. I’ve checked, and these are the pairplots:

Not too bad I guess?

I leave rankplots here just in case:

Pass divergences=True to az.plot_pair so you can see if they are all clustered in one place.
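
For example, with the same var_names as in your summary call:

az.plot_pair(idata, var_names=["lengthscaleS", "lengthscaleT", "a", "mu"], divergences=True)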

No prob. Here it is:

Dang, I was really hoping for an obvious pattern. Nothing’s ever easy.

So true, it’s quite puzzling :sweat_smile: