Hierarchical Model with a Correlated Prior - Pipe Broken Error

Hello everyone!

I am having a bit of the problem. I am running a Hierarchical Bayesian task (Windows 10, pymc 5.10.3, Python 3.11), where the goal is to inveresely estimate a set of parameters that are predictors for arterial displacements, around 101 datapoints per subject in time. The displcements are modeled on the basis of a Gaussian Process defined in the custom op. Generally there are a lot of parameters possible, but for now I am working with subject and group-level Young’s modulus (E, independent) and mean shift of pressure and flow (shift_mvn), which are correlated (the correlation is based on the LKJ prior from: Tomi Capretto - Hierarchical modeling with the LKJ prior in PyMC). The model worked perfectly fine with just the Young’s modulus - it managed to estimate subject level and group level Young’s moduli very well with high ESS (more than 1k), Rhat ~ 1.01 for all params etc. However when I introduced the correlated pressure and flow something started to break down: when running on multiple (20) cores I got the ‘Pipe Broken Error’ at around 50% mark (running 1k tune, 3k samples, 4 chains):

Multiprocess sampling (4 chains in 6 jobs)

NUTS: [mu_E, sigma_E, sig_S, Corr_triu, mu_S, sig]

Metropolis: [E]
Metropolis: [shift_mvn]

BrokenPipeError: [WinError 109] … EOFError…

Then out of desperation (and reading that error is possibly due to Windows’ way of handling multiprocessing) I got this after 2 chains were completed:

Sequential sampling (4 chains in 1 job)

NUTS: [mu_E, sigma_E, sig_S, Corr_triu, mu_S, sig]

Metropolis: [E]
Metropolis: [shift_mvn]
|-----------| 4.34% [152/3500 3:48:28<83:52:36 Sampling chain 2, 0 divergences]

Fatal Python error: Aborted

Thread 0x00004ecc (most recent call first):
File “C:\Users\asinek\anaconda3\envs\pymc_env\Lib\site-packages\zmq\utils\garbage.py”, line 47 in run
File “C:\Users\asinek\anaconda3\envs\pymc_env\Lib\threading.py”, line 1045 in _bootstrap_inner
File “C:\Users\asinek\anaconda3\envs\pymc_env\Lib\threading.py”, line 1002 in _bootstrap

Main thread:
Thread 0x00002a68 (most recent call first):

The graph for the model as it is looks like this:

Sorry for the long setup. My question is: by looking at the problem description and the graph, can someone tell me that there is an error in my model (most likely related to the way I handle the correlation matrix prior of parameter shfit_mvn; it’s my primary suspect as it’s my first time doing that and the model worked perfectly fine before, when it was only E). Are there other possibilities for the failure? That’s what I want to know before I install Linux :sweat_smile:

Thanks for anyone’s input in advance!

####### CODE BELOW #############

The general code for the model and custom op is provided below, although I can set it up on github if someone really wants to dig in:

if __name__ == '__main__':      #since I'm on windows

  with pm.Model(coords=coords) as model:   #coords is a dict that assigns indices to subjects
    # Subjects
    exp_index = pm.Data("exp_idx",exp_idx ,dims = "obs_id", mutable=False)

    #nomalized priors
    ##Youngs Modulus
    mu_E = pm.Normal("mu_E", mu = 0, sigma = 1, initval=0)
    sigma_E = pm.HalfNormal('sigma_E', sigma = 1)
    ##Joint distribution parameters
    S_sigma = pm.HalfNormal('sig_S',sigma = np.array([1,1]))
    Corr_triu = pm.LKJCorr('Corr_triu',eta=10,n=2)
    Corr = pm.Deterministic('Corr',pytensor.tensor.extra_ops.fill_diagonal(Corr_triu[np.zeros((2,2),dtype=np.int64)],1)) 
    S_diagonal = pm.Deterministic("sig_diag", pytensor.tensor.eye(2) * S_sigma)
    Cov = pm.Deterministic('Cov',pytensor.tensor.nlinalg.matrix_dot(S_diagonal, Corr, S_diagonal))
    L = pm.Deterministic("L",pytensor.tensor.slinalg.cholesky(Cov))
    mu_S = pm.Normal('mu_S',mu = np.array([0,0]), sigma = np.array([1,1]), initval = np.array([0,0]) )

    E = pm.Normal("E", mu=mu_E,sigma=sigma_E,dims='exp', initval = theta_scaled[:,0])
     #theta scaled are just randomly generated samples from similar distributions 
                                             #to start somewhere reasonable

    shift_Vp = pm.MvNormal('shift_mvn', mu = mu_S, chol = L,dims=('exp','corrs'), initval=theta_scaled[:,1:])   
    sigma = pm.HalfNormal('sig',1)

    new_inputs =pt.math.stack([E[exp_index], shift_Vp[exp_index,0],shift_Vp[exp_index,1]],axis=1)#.reshape([10,2])
    #Gaussian Process predictions
    test = pm.Deterministic('gp_sol',pytensor_forward_model_matrix(new_inputs ))
    # Likelihood
    Y_obs = pm.Normal("Y_obs", mu=test, sigma=sigma, observed=simulated_data)
    trace_M = pm.sample(tune=1000, draws=3000, chains = 4, cores=20)

Custom op is defined in a separate script file (otherwise it doesn’t work):

  #these are just some local variables to scale & rescale the data for the Gaussian Process
  COV_MX = np.array([[0.0004,-0.04],[-0.04,100]])
  mean = np.array([0.2,90])
  L_std = np.linalg.cholesky(COV_MX)
  L_inv = np.linalg.inv(L_std)

@as_op(itypes=[pt.dmatrix], otypes=[pt.dmatrix])
def pytensor_forward_model_matrix(inputs):

  # given that parameter values repeat with time for a given case, get a unique number of subjects
  unique = np.unique(inputs,axis=0)

  #this is a helper function, everything's ok here
  f_temp_mcmc, p_temp_mcmc= gsmc.BC_generator_MCMC(flow_p,pres_p, unique ,plot = False)
  #same as here
  train = gsmc.design_IP_multiple(f_temp_mcmc, unique)

  #rescale the sampled inputs from stand'd distributions
  train[:,1] = train[:,1]*200_000 + 700_000   #E
  train[:,2:] = gsmc.inv_std_mvn(train[:,2:], mean, L_std)  #mvn pressure and flow

  #copy the generated inputs for a GP
  X_input_mcmc_scaled = copy.deepcopy(train)
  # column 0 is for time
  # scale the inputs using the scaler on which the GP was trained
  X_input_mcmc_scaled[:,1:]= gs.transform_data(X_input_mcmc_scaled[:,1:], scaler_for_mcmc)
  #return the predictions
  gp_solution, _ = m_sgpr_mcmc.predict_y(X_input_mcmc_scaled,full_cov=False)
  return gp_solution.numpy()

Pipe error is common on Windows with multiprocessing. Does it go away with cores=1?

Thanks for the response. It does go away but I unfortunately get the “Fatal Python error: Aborted” (as described in the post)

UPDATE: I managed to run the sampling on a Windows Server 2019. I don’t really know why it worked, maybe it uses a different multiprocessing approach to regular Windows 10. However I now run into the problem of very small ESS and very large Rhat for most of the parameters. Could anybody please let me know whether my model specification is conceptually ok?