Including a multivariate constrained improper prior in a hierarchical model

I am wondering about the most effective way of incorporating the above improper prior for eta into a hierarchical model as a spatial random effect. The Q matrix is singular with rank (n-1), where n is the number of data points. The prior on tau is Gamma(0.5, 0.005). The constraint is added onto the improper joint prior distribution so that it becomes identifiable. This is what I've tried to do.
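For concreteness, the improper prior I mean is the ICAR density (written here assuming Q is the graph Laplacian of the neighbourhood structure, as in the code further down):

```latex
p(\eta \mid \tau) \;\propto\; \tau^{(n-1)/2}
  \exp\!\left(-\frac{\tau}{2}\,\eta^{\top} Q\,\eta\right),
\qquad \operatorname{rank}(Q) = n - 1,
```

together with the constraint \(\sum_i \eta_i = 0\) to make eta identifiable.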

```python
with pm.Model() as m:
    ....  # other parameter definitions before this line
    tau = pm.Gamma('tau', alpha=0.5, beta=0.005)
    #eta = pm.MvNormal('eta', mu=0, tau=tau*Q, shape=(1, n))
    eta = pm.DensityDist('eta', lambda value: 0.5*(n-1)*tt.log(tau)
                         - 0.5*tau*,, value)))
    #eta_c = pm.Deterministic('eta_c', eta - tt.mean(eta))
    eta = eta - tt.mean(eta)
    ....  # line using the generated eta vector as an input to the expression for the distribution of another parameter
```

At first it seems to run, then shortly after it gives this output on the console:

    Auto-assigning NUTS sampler...
    Initializing NUTS using jitter+adapt_diag...
    Multiprocess sampling (2 chains in 2 jobs)
    NUTS: [eta, tau__log__, alpha, beta]
    Could not pickle model, sampling singlethreaded.
    Sequential sampling (2 chains in 1 job)
    NUTS: [eta, tau__log__, alpha, beta]

And then shortly after it throws this error:
ValueError: Mass matrix contains zeros on the diagonal. Some derivatives might always be zero.

From the output it seems to be sampling log(tau) instead of tau; plus, the eta posterior samples are not needed.

My question now is, what am I doing wrong in the implementation? I can't use the available PyMC3 distributions since the prior is improper.

BTW, this formulation could be used as an extension to a model like the one in this link (Problem with hierachical occupancy model), where eta is added as a spatial random effect on the x*beta expression modelling psi.

This error usually occurs when you have priors that are too uninformative, which results in small gradients. You can have a look at some of the usual suggestions: Search results for 'Mass matrix contains zeros on the diagonal' - PyMC Discourse

I tried different values, but it doesn't seem to help at all. It just jams and throws the error at around 7% every time.

Do you think I implemented the improper prior and constraint correctly, given my code above? Maybe I could change some things to make it work better?

It is possible the logp is not implemented correctly (I did not check). Did you verify the implementation in numpy/scipy?

I removed the tau^((n-1)/2) coefficient since it doesn't contain the variable eta. I ran the model without the line eta = eta - tt.mean(eta), which I believe enforces the constraint (sum(eta) = 0). It ran without error, but the output was wrong for the beta parameter that includes the spatial random effect introduced by the improper prior: the traces were very wide or very tall.

If I include the line eta = eta - tt.mean(eta) to enforce the constraint (which I believe should fix the bad output), I get an error at around 30% which reads: ValueError: Mass matrix contains non-finite values.

Do you have any suggestion regarding an effective way of including the [ sum(eta) = 0 ] constraint without encountering errors?

Usually you don't include a normalization within the model block, as it makes the parameters unidentifiable. Besides, in your case eta comes from a zero-mean distribution and is multiplied by a matrix, which should still give a mean of 0.

It would be easier if you could share your code and data in a notebook.

The data and code are the same as the ones here, except for the following changes:

class ICAR(distribution.Continuous):

    def __init__(self, tau, Q, *args, **kwargs):
        super(ICAR, self).__init__(*args, **kwargs)
        self.tau = tau
        self.Q = Q
        self.mode = 0.

    def logp(self, x):
        tau = self.tau
        Q = self.Q
        return -0.5 * tau *,, x))  # I removed the tau^((n-1)/2) coefficient from the likelihood

with pm.Model() as m:
    beta = pm.Normal('beta', mu=0, sd=10, shape=(2, 1)) 
    alpha = pm.Normal('alpha', mu=0, sd=10, shape=(2, 1))
    tau = pm.Gamma('tau', alpha=0.5, beta=0.005)
    eta = ICAR('eta', tau=tau, Q=Q, shape=n)
    #eta = pm.DensityDist('eta', lambda value: 0.5*(n-1)*tt.log(tau)
    #                     - 0.5*tau*,, value)))
    x_b =, beta)
    #constr = pm.Potential('constr', eta - tt.mean(eta))
    #eta = eta - tt.mean(eta)
    psi = pm.math.invlogit(x_b + eta).flatten() #note the eta spatial random effect added to x_b
    w_a =, alpha)
    d = pm.math.invlogit(w_a)
    # z = pm.Bernoulli('z', psi, shape=n)
    # prob = z[index] * d
    prob = psi[index] * d
    obs = pm.Bernoulli('y', p=prob, observed=y_r)
    trace = pm.sample(10000, tune=1000, cores=1, chains=1)  # it throws a processor error if I don't set cores to 1

So I ran the model without the constraint, and the alpha0, alpha1, and beta1 parameters return reasonable values, but the beta0 samples are completely off and the standard error is huge, as can be seen from the above plots. Also, the tail of tau is unusually long; the mean should be very close to zero instead of being that large. I am sure that correctly including the constraint in the model, without errors being thrown, would improve the samples of beta0 (as confirmed by the direct Gibbs model I built a while ago, using mathematical derivations to obtain the full conditional distributions to sample from).

So what do you think can be done to improve this sampling while making sure to incorporate the constraint?

Looking at beta_0, it seems it is just sampled from the prior. I don't have time right now, but for the CAR model you can have a look at this detailed doc: Maybe you can get some inspiration from it.

Unfortunately for me, I've already read the doc you linked and it hasn't been of much help. CAR is fairly easy to implement since the prior for eta is a proper Normal distribution. The ICAR model I'm dealing with is much more of a challenge since the prior is improper, and implementing that sum-to-zero constraint is my biggest concern right now.

Thanks for your time; feel free to come back and offer input whenever you can.

I tried to run your code but I don't have the information for Q.
There are some good discussions on this on the Stan discourse:

Maybe you can try a similar strategy here to put a soft constraint, using the Potential class (this adds a term to the model logp):

import theano
import pymc3 as pm

s = theano.shared(0.001)
with pm.Model() as m:
    eta = ...  # the definition of your eta, unconstrained
    pm.Potential('soft_sum', pm.Normal.dist(0, s * n).logp(eta.sum()))
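To see what this Potential contributes, here is the same penalty term computed in plain scipy (n and s are placeholder values here; note that the second argument of pm.Normal is a standard deviation, so the constraint gets softer as s*n grows):

```python
import numpy as np
from scipy import stats

n = 100          # placeholder number of sites
s = 0.001        # same scale as the theano.shared value above

rng = np.random.RandomState(1)
eta = rng.randn(n)
eta = eta - eta.mean()    # a draw that satisfies sum(eta) = 0 exactly

# the Potential adds Normal(0, sd=s*n).logp(eta.sum()) to the model logp:
# a vector whose sum is near zero is barely penalized, while one whose
# sum drifts away from zero is penalized heavily
logp_constrained = stats.norm(0, s * n).logpdf(eta.sum())
logp_violating = stats.norm(0, s * n).logpdf(1.0)
assert logp_constrained > logp_violating
```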

Sorry, I forgot to include the code I used to generate the Q matrix. Here it is:

    A = np.random.choice([0,1], size=n*n, p=[0.995,0.005]).reshape(n, n)
    A = np.tril(A).T + np.tril(A)
    A[np.diag_indices_from(A)] = 0
    D = np.diag(A.sum(axis=1))
    Q = D - A
    Q = Q.astype(np.float64, copy=False)
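One caveat with a Q simulated this way: the rank of a graph Laplacian is n minus the number of connected components, and with edge probability 0.005 the random graph will usually be disconnected, so Q may have rank well below n - 1 (and the improper prior correspondingly more flat directions). A quick sanity check, reusing the generation code above with a made-up n and seed:

```python
import numpy as np

rng = np.random.RandomState(42)
n = 50
A = rng.choice([0, 1], size=n * n, p=[0.995, 0.005]).reshape(n, n)
A = np.tril(A).T + np.tril(A)
A[np.diag_indices_from(A)] = 0
D = np.diag(A.sum(axis=1))
Q = (D - A).astype(np.float64)

# a Laplacian's rows sum to zero, so rank(Q) <= n - 1; equality holds
# only when the graph has a single connected component
assert np.allclose(Q.sum(axis=1), 0)
rank = np.linalg.matrix_rank(Q)
assert rank <= n - 1
print('rank(Q) =', rank, 'vs n - 1 =', n - 1)
```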

I looked into the link you posted and the discussion was very interesting. I incorporated the soft-sum constraint as you suggested and it seems to have fixed my issue, at least for the simulated dataset I used.

When I tried it on a real dataset, it provided very different estimates compared to the ICAR model I built and the Restricted Spatial Regression model output (both of which shared very similar posterior estimates). I will have to experiment with different soft-sums on the real data and see if estimation can be improved.

By the way, is there a way to run the sampler in parallel without running into this error?
BrokenProcessPool: A process in the executor was terminated abruptly while the future was running or pending.

Thank you so much for your help!


Just want to jump in and ask a question about the unidentifiability issue regarding including a normalization within the model block. In the book "Data Analysis Using Regression and Multilevel/Hierarchical Models", Chapters 19.4 and 19.5, they suggest adding "redundant parameters" to intentionally make hierarchical models non-identifiable in order to speed up the convergence of the Gibbs sampler. And in this blog post ( ), a similar reparameterization trick was applied to solve the "funnel" problem in MCMC convergence. I'm a little fuzzy about when these reparameterization tricks will cause a problem and how to implement them properly. Thank you!

In Thomas's blog post the reparameterization is not a normalization. And I have not read the book, but speeding up a Gibbs sampler is quite different from speeding up HMC; it likely depends on different heuristics. Maybe you should open a new post to discuss this more in depth.
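For reference, the trick in that blog post is the non-centered parameterization: a change of variables, not a renormalization inside the model. A numpy illustration of the equivalence (not PyMC3 code; the values of mu and sigma are made up):

```python
import numpy as np

rng = np.random.RandomState(7)
mu, sigma = 2.0, 3.0

# centered draw: x ~ Normal(mu, sigma)
x_centered = rng.normal(mu, sigma, size=100000)

# non-centered draw: x_raw ~ Normal(0, 1), then x = mu + sigma * x_raw
x_raw = rng.normal(0.0, 1.0, size=100000)
x_noncentered = mu + sigma * x_raw

# both parameterizations give the same distribution, but the
# non-centered one decouples x_raw from mu and sigma, which is what
# helps HMC in funnel-shaped posteriors
assert abs(x_centered.mean() - x_noncentered.mean()) < 0.05
assert abs(x_centered.std() - x_noncentered.std()) < 0.05
```

Note that subtracting tt.mean(eta) inside the model block is a different operation: it maps many parameter values to the same likelihood, which is exactly what creates the identifiability problem.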

I am trying to implement an ICAR model as well at the moment. Since it seems you have fixed the issue, would you mind pasting your working code as an example? I am convinced many other users of PyMC3 would benefit greatly from it.
Thank you very much!