I use spatial climate data to model tick counts in each county.
I have used a CAR model to model the spatial error.
However, there is a small problem.
I used 2023 data for training and 2024 for testing, but in 2023 around 10 regions have no tick counts, and in 2024 a different, larger set of regions is missing counts. In my code I simply remove the regions without tick counts before building the adjacency matrix, otherwise the dimensions are mismatched. Is this correct? At test time the dimension of phi differs from 2023, so the dimensions mismatch again. I know that in the R CAR model it's possible to use the full adjacency matrix, and the regions with NA values are simply not considered in training. I don't know whether it is possible in this Python package to include the NA-value regions but exclude them during training. Could anyone help me? Thank you so much.
Hey Christ,
Definitely use the full adjacency matrix for both years – as long as there hasn’t been some major region boundary change between the years (these cases are annoying but not uncommon).
Zeros are data. So if these are zeros (and not missing data) you should put them in the model as zeros and use the full adjacency matrix.
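To illustrate the distinction with a toy example (hypothetical column names, not from your data):

import numpy as np
import pandas as pd

# Toy data with hypothetical column names.
df = pd.DataFrame({"kom_name": ["A", "B", "C"], "tick_2023": [12, 0, np.nan]})
# Region B reported zero ticks: a real observation, keep it in the likelihood.
# Region C has no report (NaN): a missing value to impute or mask later,
# but it still belongs in the adjacency matrix.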
If these are NA (e.g. you expect there were ticks but there was no counting done in that region), why don’t you sketch out your initial PyMC model code and we can see what we can do.
Here's the code:
#code start here
import networkx as nx
import numpy as np
import pymc as pm
import pytensor.tensor as pt

# Sort the municipalities by name and drop the regions with NA tick counts (2023);
# otherwise the dimensions of the data and the adjacency matrix mismatch.
gdf_2023 = gdf.sort_values(by=["kom_name"])
gdf_2023 = gdf_2023.drop(index=[52, 170, 27, 62, 119, 92])  # remove the NA instances
gdf_2023 = gdf_2023.reset_index(drop=True)
kommun_names = gdf_2023["kom_name"].tolist()  # store municipality names

# Build the neighbourhood graph: two municipalities are neighbours
# if their geometries touch.
G = nx.Graph()
G.add_nodes_from(kommun_names)
for idx1, row1 in gdf_2023.iterrows():
    for idx2, row2 in gdf_2023.iterrows():
        if idx1 != idx2 and row1.geometry.touches(row2.geometry):
            G.add_edge(row1["kom_name"], row2["kom_name"])

adj_matrix = nx.to_numpy_array(G, nodelist=kommun_names)
adj_matrix += np.eye(adj_matrix.shape[0])  # add self-loops on the diagonal
logE = 0
with pm.Model(coords={"area_idx": np.arange(len(tick))}) as fixed_spatial_model:
    beta0 = pm.Normal("beta0", mu=0.0, tau=1.0e-5)
    # df_X holds the input variables with the NA instances removed (2023)
    beta1 = pm.Normal("beta1", mu=0.0, tau=1.0e-5, shape=(df_X.shape[1],))
    # precision parameter of the independent random effect
    tau_ind = pm.Gamma("tau_ind", alpha=3.2761, beta=1.81)
    # precision parameter of the spatially dependent random effects
    tau_spat = pm.Gamma("tau_spat", alpha=1.0, beta=1.0)
    alpha = pm.Beta("alpha", alpha=1, beta=1)
    # independent (unstructured) random effect
    theta = pm.Normal("theta", mu=0, tau=tau_ind, dims="area_idx")
    # spatially dependent random effect, with alpha given a Beta prior
    phi = pm.CAR("phi", mu=np.zeros(len(tick)), tau=tau_spat, alpha=alpha, W=adj_matrix, dims="area_idx")
    # softplus of the linear predictor -> the mean of the likelihood
    mu = pm.Deterministic("mu", pt.log(1 + pt.exp(logE + beta0 + pt.dot(df_X, beta1) + theta + phi)), dims="area_idx")
    # overdisp = pm.Gamma("overdisp", alpha=2, beta=0.1)
    inv_sqrt_overdisp = pm.HalfNormal("inv_sqrt_overdisp", sigma=1)
    # transform to obtain overdisp
    overdisp = pm.Deterministic("overdisp", 1 / inv_sqrt_overdisp**2)
    # likelihood of the observed data
    y_i = pm.NegativeBinomial("y_i", alpha=overdisp, mu=mu, observed=tick, dims="area_idx")
    res = pm.Deterministic("res", tick - mu, dims="area_idx")

    initial_beta0 = np.log(np.mean(tick) + 1e-3)
    start = {
        "beta0": initial_beta0,
        "beta1": np.zeros(df_X.shape[1]),  # vector of zeros, shape (16,)
        "tau_ind": 1.0,   # reasonable starting value for tau_ind
        "tau_spat": 1.0,  # reasonable starting value for tau_spat
        "theta": np.zeros(len(tick)),  # zero random effect for each area
        "phi": np.zeros(len(tick)),    # zero spatial effect for each area
    }
    step = pm.NUTS(max_treedepth=15, target_accept=0.95)
    # sampling the model (PyMC takes the starting values via `initvals`)
    fixed_spatial_idata = pm.sample(2000, tune=3000, initvals=start, step=step)
I come from R as well. Maybe post the R model too and I can see what you're trying out. Seeing that you give initial values makes me think you're coming from an R package with some Gibbs sampling method (nimble, maybe?).
And could you give a little bit more context: are the zero tick counts true zeros, or are they NaN?
Hi theo, I don't use R; otherwise I would use an R package directly instead of Python. I only saw someone else use R code, and since I already have Python code, I want to figure it out here.
The tick reports in some areas are missing, not zero, so they're NA. If they were 0, I think it would be much easier.
Okay, then if it’s NA, I would set it up as a missing value problem.
The CAR model is a special kind of multivariate normal, so I would keep the same dimensions throughout the study, i.e. use the same adjacency matrix for both years.
And then you need some way of dealing with NaN in the model. PyMC has methods for this and you can follow the detailed docs and the blog posts.
docs: Bayesian Missing Data Imputation — PyMC example gallery
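To make that concrete, here's a minimal sketch of the setup (tick_full and W_full are my placeholder names, not from your code): keep all N regions in one adjacency matrix, and pass the counts as a masked array so PyMC imputes the masked entries during sampling.

import numpy as np
import pymc as pm

# Placeholders: tick_full is the length-N count vector over ALL regions,
# with np.nan where a region has no report; W_full is the full N x N
# adjacency matrix, built without dropping any region.
tick_masked = np.ma.masked_invalid(tick_full)

with pm.Model() as missing_model:
    tau_spat = pm.Gamma("tau_spat", alpha=1.0, beta=1.0)
    alpha = pm.Beta("alpha", alpha=1, beta=1)
    # Same W_full for every year, so phi always has the full dimension N.
    phi = pm.CAR("phi", mu=np.zeros(W_full.shape[0]), tau=tau_spat, alpha=alpha, W=W_full)
    mu = pm.math.exp(phi)  # simplified linear predictor, just for the sketch
    overdisp = pm.HalfNormal("overdisp", sigma=1)
    # A masked `observed` array triggers PyMC's automatic imputation:
    # the masked entries become extra latent variables in the trace.
    y_i = pm.NegativeBinomial("y_i", mu=mu, alpha=overdisp, observed=tick_masked)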
And some general tips on the model:
- Not sure how you came up with pm.Gamma("tau_ind", alpha=3.2761, beta=1.81), but that's very specific. If you want to set the prior based on prior knowledge of the problem, we have a helper function find_constrained_prior (see the sketch after this list).
- PyMC has broadcasting, so you can just write pm.CAR("phi", mu=0.0, tau=tau_spat, alpha=alpha, W=adj_matrix, dims="area_idx") and the mean zero will effectively be filled in as a vector.
- With the sampling, I'd probably just leave the NUTS defaults. You can play around with tuning if you want, but I'd wait for the model to produce divergences first. Same with the starting values: this is common practice for Gibbs sampling packages like JAGS, but here I would just let PyMC decide.
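For example, here's a quick sketch of the helper (the 0.5-5 interval is made-up prior knowledge, purely for illustration):

import pymc as pm

# Made-up prior belief: tau_ind lies between 0.5 and 5 with ~95% probability.
params = pm.find_constrained_prior(
    pm.Gamma,
    lower=0.5,
    upper=5.0,
    init_guess={"alpha": 2.0, "beta": 1.0},
    mass=0.95,
)
# params is a dict like {"alpha": ..., "beta": ...} that you can plug
# straight into the model: pm.Gamma("tau_ind", **params)
print(params)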