Hierarchical linear model with two parent distributions

ysagiv · September 4, 2018, 8:07pm

Hi,

I would like to specify a hierarchical linear model where each coefficient is associated with one of two prior distributions. In particular, if you represent the N^2 coefficients as an NxN matrix, then the coefficients on the main diagonal should have as their prior some distribution D1 and the coefficients off the diagonal should have a different distribution D2.

I’ve attempted to implement this myself, but running it even on data produced by this process results in a sampling failure (i.e. the sampler does not converge). This makes me think that I’ve misspecified the model in my implementation — so my question is, what is the principled way to implement what I described above?

To be clear, the model should look something like:

y = b_{0, 0} x_{0, 0} + b_{0, 1} x_{0, 1} + … + b_{i, j} x_{i, j} + … b_{N, N} x_{N, N}

where b_{i, i} ~ D_1(\phi) and b_{i, j} ~ D_2(\theta)

and \phi and \theta are themselves inferred but with priors corresponding to known distributions.

Thanks.

My attempted implementation:

import numpy as np
import theano as tt
import pymc3 as pm
import matplotlib.pyplot as plt
import sys

n = 2
num_regressors = n ** 2
num_data_samples = 2000

# hyperprior
d0_mean_mu = 0
d0_mean_sd = 10

d1_mean_mu = -10
d1_mean_sd = 10

# sample true regressor vector

true_bi_mean = np.random.normal(d0_mean_mu, 0.1)
true_bij_mean = np.random.normal(d1_mean_mu, 0.1)
true_bi = np.random.normal(true_bi_mean, 0.1, size=n)
true_bij = np.random.normal(true_bij_mean, 0.1, size=num_regressors - n)

# generate data

X = np.random.randint(2, size=[num_data_samples, num_regressors])
regress = np.hstack((true_bi[0], true_bij[0 : n - 1]))

for k in range(1, n):
    regress = np.hstack((regress, np.hstack((true_bi[k], true_bij[k*(n - 1) : (k*(n-1)) + n - 1]))))

y = np.zeros((num_data_samples, 1))

for k in range(0, num_data_samples):
    y[k, 0] = np.dot(regress, X[k, :])

# update weights

with pm.Model() as hierarchical_model:

    # hyperpriors for group nodes

    mu_theta_s = pm.Normal('mu_theta_s', mu=d0_mean_mu, sd=d0_mean_sd)
    mu_phi_m = pm.Normal('mu_phi_m', mu=d1_mean_mu, sd=d1_mean_sd)

    # distributions for diagonal (b_i) and off-diagonal weights (b_ij)
    b_i = pm.Normal('b_i', mu=mu_theta_s, sd=1, shape=(n, 1))
    b_ij = pm.Normal('b_ij', mu=mu_phi_m, sd=1, shape=(num_regressors - n, 1))

    # model error
    eps = pm.HalfNormal('eps', 1)

    # reward estimate
    # the following 3 lines basically convert the two vectors of coefficients b_i and b_ij into a
    # matrix form with the b_i on the diagonal and the b_ij filling in around
    reg = pm.math.concatenate( [b_i[0:1], b_ij[0:n - 1]], axis=0)
    for k in range(1, n):
        reg = pm.math.concatenate((reg, pm.math.concatenate((b_i[k:k+1], b_ij[k*(n - 1) : (k*(n - 1)) + n - 1]), axis=0)), axis=0)

    reward_est = pm.math.dot(X, reg)

    # data likelihood
    reward_like = pm.Normal('reward_like', mu=reward_est, sd=eps, observed=y)

with hierarchical_model:
     trace = pm.sample(5000)

pm.traceplot(trace)

falk · September 4, 2018, 11:45pm

Dear @ysagiv,

I’ve tried to run your code, but there are some dimensionality mismatches.

num_tasks is not defined
the “data” dimension (X.shape[1]) does not match n.

It was hard to follow your stacking of the “reg” matrix, and (having experienced that theano is better with matrix operations than with loops) maybe that is also the pitfall here.

Maybe it helps to reparametrize:
instead of giving a separate prior to the diagonal elements, initialize a diagonal matrix which models the difference of the priors? Note that only works if all of these are Normal, it might be less straight forward with different distribution families.

b_i = pm.Normal('b_i', mu=mu_theta_s, sd=1, shape=n)
diagonal = tt.tensor.eye(n).dot(b_i) # couldn't test if the shape is right
b_ij = pm.Normal('b_ij', mu=mu_phi_m, sd=1, shape=(n, n))
reg = b_ij + diagonal

Just an idea.

Cheers,

Falk

ysagiv · September 5, 2018, 12:09am

Ah, oops. num_tasks should just be n (edited and fixed). I’m not sure what you mean by the “data” dimension not matching n, though. The data consists of a num_data_samples x num_regressors matrix, where each row is a data sample, and each data sample is binary vector of length num_regressors == n ** 2.

Your suggestion seems really interesting and like it would be much simpler. I think you would need to set the diagonal of b_ij to 0 before you do reg = b_ij + diagonal (unless I’m misunderstanding your intention). I will try this and see if it works.

falk · September 5, 2018, 6:59am

Actually, I do not think you need to zero the diagonal.
If, according to your original intention, the prior for b_i was \alpha_{i} \sim \mathcal{N} , and for b_ij it was \beta_{ij} \sim \mathcal{N},
then with parametrization B = b + \delta you could as well initialize \delta_{ii} as \alpha_{ii} - \beta_{i} (with \delta_{ij} = 0\ \forall i\neq j).

Of course, to interpret the results, you would also have to add them up.
In my previous post, I think the tensor multiplication was erroneous, but could not test. You would need something like numpy.diag, but for Theano arrays.

Yet maybe I got it all wrong and this just shifts the problem, seeing \delta is a matrix with a special prior in the diagonal. Certainly someone else can comment.

It would be great if you could post your result, once it works!
Thanks,

Falk

junpenglao · September 5, 2018, 8:36am

For this purpose, you should probably use theano.tensor.inc_subtensor. You can have a look at a similar code in LKJCholeskyCov

github.com

pymc-devs/pymc/blob/17b9981123fba3b868ec98ad755ddbbb5b37e31e/pymc3/distributions/multivariate.py#L976-L980


      
          variance = tt.zeros(n)
          variance = tt.inc_subtensor(variance[0], x[0] ** 2)
          variance = tt.inc_subtensor(
              variance[1:],
              cumsum[diag_idxs[1:]] - cumsum[diag_idxs[:-1]])

ysagiv · September 13, 2018, 6:54pm

I ended up following this advice, and it does make things simpler. However, I still have convergence issues. The sampler error output was:

There were 2967 divergences after tuning. Increase target_accept or reparameterize.

The acceptance probability does not match the target. It is 0.007385212125874771, but should be close to 0.8. Try to increase the number of tuning steps.

There were 2921 divergences after tuning. Increase target_accept or reparameterize.

The acceptance probability does not match the target. It is 0.0161309766124837, but should be close to 0.8. Try to increase the number of tuning steps.

There were 2382 divergences after tuning. Increase target_accept or reparameterize.

The acceptance probability does not match the target. It is 0.006341937098760246, but should be close to 0.8. Try to increase the number of tuning steps.

There were 2524 divergences after tuning. Increase target_accept or reparameterize.

The acceptance probability does not match the target. It is 0.005570805800177296, but should be close to 0.8. Try to increase the number of tuning steps.

The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.

The estimated number of effective samples is smaller than 200 for some parameters.

I find it pretty strange that the sampler does not converge given that the true values are very close to the mean of the priors. For completeness, here is my updated code:

import numpy as np
import theano as tt
import pymc3 as pm
import matplotlib.pyplot as plt
import sys

n = 2
num_regressors = n ** 2
num_data_samples = 2000

# hyperprior
d0_mean_mu = 0
d0_mean_sd = 10

d1_mean_mu = -10
d1_mean_sd = 10

# sample true regressor vector
true_bi_mean = np.random.normal(d0_mean_mu, 0.1)

true_bij_mean = np.random.normal(d1_mean_mu, 0.1)

true_bi = np.random.normal(true_bi_mean, 0.1, size=n)
true_bij = np.random.normal(true_bij_mean, 0.1, size=num_regressors - n)

# generate data
X = np.random.randint(2, size=[num_data_samples, num_regressors])

diag_idxs = np.diag_indices(n)
off_diag_idxs = np.where(~np.eye(n,dtype=bool))

regress = np.zeros((n, n))
regress[diag_idxs] = true_bi
regress[off_diag_idxs] = true_bij

y = np.zeros((num_data_samples, 1))

for k in range(0, num_data_samples):
	y[k, 0] = np.dot(regress.reshape(-1,), X[k, :])

# update weights
with pm.Model() as hierarchical_model:
	# hyperpriors for group nodes
	mu_theta_s = pm.Normal('mu_theta_s', mu=d0_mean_mu, sd=d0_mean_sd)
	mu_phi_m = pm.Normal('mu_phi_m', mu=d1_mean_mu, sd=d1_mean_sd)

	# distributions for diagonal weights (b_i) and off-diagonal weights (b_ij)
	b_i = pm.Normal('b_i', mu=mu_theta_s, sd=1, shape=(n, 1))
	b_ij = pm.Normal('b_ij', mu=mu_phi_m, sd=1, shape=(num_regressors - n, 1))

	# model error
	eps = pm.HalfNormal('eps', 1)

	# reward estimate
	reg = tt.tensor.zeros((n, n))

	reg = tt.tensor.inc_subtensor(reg[diag_idxs], b_i[:, 0])
	reg = tt.tensor.inc_subtensor(reg[off_diag_idxs], b_ij[:, 0])
	
	reward_est = pm.math.dot(X, tt.tensor.reshape(reg, (num_regressors, 1)))

	# data likelihood
	reward_like = pm.Normal('reward_like', mu=reward_est, sd=eps, observed=y)

with hierarchical_model:
	trace = pm.sample(5000)

pm.traceplot(trace)

Is there an obvious reason why this is the case?

junpenglao · September 13, 2018, 7:13pm

Nothing very obvious to improve to me - did you try to increase the number of tuning?

ysagiv · September 13, 2018, 7:23pm

What do you mean by number of tuning?

junpenglao · September 13, 2018, 8:08pm

Something like pm.sample(5000, tune=2000)

ysagiv · September 13, 2018, 8:26pm

I just tried this right now, both with tune=2000 and tune=1000 and I get the following error:

pymc3.parallel_sampling.RemoteTraceback:
“”"
Traceback (most recent call last):
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 73, in run
self._start_loop()
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 113, in _start_loop
point, stats = self._compute_point()
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 139, in _compute_point
point, stats = self._step_method.step(self._point)
File “/usr/local/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py”, line 247, in step
apoint, stats = self.astep(array)
File “/usr/local/lib/python3.7/site-packages/pymc3/step_methods/hmc/base_hmc.py”, line 115, in astep
self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
File “/usr/local/lib/python3.7/site-packages/pymc3/step_methods/hmc/quadpotential.py”, line 201, in raise_ok
raise ValueError(‘\n’.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV b_ij.ravel()[0] is zero.
The derivative of RV b_ij.ravel()[1] is zero.
“”"

The above exception was the direct cause of the following exception:

ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV b_ij.ravel()[0] is zero.
The derivative of RV b_ij.ravel()[1] is zero.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “pymc3test.py”, line 66, in
trace = pm.sample(5000, tune=1000)
File “/usr/local/lib/python3.7/site-packages/pymc3/sampling.py”, line 449, in sample
trace = _mp_sample(**sample_args)
File “/usr/local/lib/python3.7/site-packages/pymc3/sampling.py”, line 999, in _mp_sample
for draw in sampler:
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 305, in iter
draw = ProcessAdapter.recv_draw(self._active)
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 223, in recv_draw
six.raise_from(RuntimeError(‘Chain %s failed.’ % proc.chain), old)
File “”, line 3, in raise_from
RuntimeError: Chain 0 failed.

My guess was that there is something wrong with the specification because this seems like it should be a simple model, but maybe the shared prior is causing some difficulty?

junpenglao · September 13, 2018, 9:19pm

Usually, this error is an indication of that there is not enough information flowing towards these parameters - for example, prior is too vague and the sampler is stuck at the tail area. In this case, I have the feeling that your model is over specific and there is not enough information from the data to estimate the covariance.
Maybe try increasing the standard deviation of the generation of true_bi and true_bij.

ysagiv · September 14, 2018, 2:56am

I increased the standard deviation of the generation of true_bi and true_bij to 10 as opposed to 0.1 and the same error persisted:

pymc3.parallel_sampling.RemoteTraceback:
“”"
Traceback (most recent call last):
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 73, in run
self._start_loop()
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 113, in _start_loop
point, stats = self._compute_point()
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 139, in _compute_point
point, stats = self._step_method.step(self._point)
File “/usr/local/lib/python3.7/site-packages/pymc3/step_methods/arraystep.py”, line 247, in step
apoint, stats = self.astep(array)
File “/usr/local/lib/python3.7/site-packages/pymc3/step_methods/hmc/base_hmc.py”, line 115, in astep
self.potential.raise_ok(self._logp_dlogp_func._ordering.vmap)
File “/usr/local/lib/python3.7/site-packages/pymc3/step_methods/hmc/quadpotential.py”, line 201, in raise_ok
raise ValueError(‘\n’.join(errmsg))
ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV b_ij.ravel()[0] is zero.
The derivative of RV b_ij.ravel()[1] is zero.
The derivative of RV b_i.ravel()[0] is zero.
The derivative of RV b_i.ravel()[1] is zero.
The derivative of RV mu_phi_m.ravel()[0] is zero.
“”"

The above exception was the direct cause of the following exception:

ValueError: Mass matrix contains zeros on the diagonal.
The derivative of RV b_ij.ravel()[0] is zero.
The derivative of RV b_ij.ravel()[1] is zero.
The derivative of RV b_i.ravel()[0] is zero.
The derivative of RV b_i.ravel()[1] is zero.
The derivative of RV mu_phi_m.ravel()[0] is zero.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File “pymc3test.py”, line 66, in
trace = pm.sample(5000, tune=2000)
File “/usr/local/lib/python3.7/site-packages/pymc3/sampling.py”, line 449, in sample
trace = _mp_sample(**sample_args)
File “/usr/local/lib/python3.7/site-packages/pymc3/sampling.py”, line 999, in _mp_sample
for draw in sampler:
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 305, in iter
draw = ProcessAdapter.recv_draw(self._active)
File “/usr/local/lib/python3.7/site-packages/pymc3/parallel_sampling.py”, line 223, in recv_draw
six.raise_from(RuntimeError(‘Chain %s failed.’ % proc.chain), old)
File “”, line 3, in raise_from
RuntimeError: Chain 0 failed.

In fact, it seems that in some sense “more” has gone wrong because the derivative of more RVs is zero. As a more complete description of the behaviour: the sampler in this case only gets to about 10% of the way through the total draws before dying out. When I set tune to default, the sampler samples all the way through but just doesn’t converge.

ysagiv · September 14, 2018, 2:57am

As a general follow-up, I tried just increasing the number of data points to 3000, but this for some reason just straight up crashes Python. I’m not sure why.

junpenglao · September 14, 2018, 5:02am

Hmmm, I thought it would provide more information if the true_bij is more distributed, but seems it is not.
In that case, try more informative prior (smaller sd in the priors)

Because there is problem in the posterior parameter space - more tuning let you discover the problem

ysagiv · September 14, 2018, 6:15am

I tried that as well, and it was again unsuccessful (similar error).

falk · September 14, 2018, 9:37am

I tried to run this as well and played around with the parameters.
Some remarks:

you carried around some extra dimensions that might be unnecessary (see code below)
it might be that your data generation (reg\cdot X) did not match the model (X\cdot reg), but could be that that is just transposed.
also, there were some hard coded standard deviations, I tried to simplify to match simulation and model.
in addition to @junpenglao’s comment on standard deviations, maybe the chain failure comes from the fact that your simulated (observed) ys are not really normally distributed (this gives the mentioned tail areas where sampling might be stuck).

Hope this helps!

Falk

import numpy as np
import theano as tt
import pymc3 as pm
import matplotlib.pyplot as plt
import sys

n = 2
num_regressors = n ** 2
num_data_samples = 1000

# hyperprior
d0_mean_mu = 0.2
d0_mean_sd = 1.

d1_mean_mu = -0.3
d1_mean_sd = 1.

# sample true regressor vector
true_bii_mean = d0_mean_mu#np.random.normal(d0_mean_mu, 0.1)
true_bij_mean = d1_mean_mu#np.random.normal(d1_mean_mu, 0.1)

true_bii = np.random.normal(true_bii_mean, d0_mean_sd, size=n)
true_bij = np.random.normal(true_bij_mean, d1_mean_sd, size=num_regressors - n)

# generate data
X = np.random.randint(2, size=[num_data_samples, num_regressors])

diag_idxs = np.diag_indices(n)
off_diag_idxs = np.where(~np.eye(n,dtype=bool))


regress = np.zeros((n, n))
regress[diag_idxs] = true_bii
regress[off_diag_idxs] = true_bij

# y = np.zeros((num_data_samples,))

y = X.dot(regress.reshape(-1,))
# for k in range(num_data_samples):
#     y[k] = np.dot(regress.reshape(-1,), X[k, :])
plt.hist(y)
plt.show()

# update weights
with pm.Model() as hierarchical_model:
    # hyperpriors for group nodes
    mu_theta_s = pm.Normal('mu_theta_s', mu=d0_mean_mu, sd=1)
    mu_phi_m = pm.Normal('mu_phi_m', mu=d1_mean_mu, sd=1)

    # distributions for diagonal weights (b_ii) and off-diagonal weights (b_ij)
    b_ii = pm.Normal('b_ii', mu=mu_theta_s, sd=d0_mean_sd, shape=(n))
    b_ij = pm.Normal('b_ij', mu=mu_phi_m, sd=d1_mean_sd, shape=(num_regressors - n))

    # model error
    eps = pm.HalfNormal('eps', 1)

    # reward estimate
    reg = tt.tensor.zeros((n, n))
    reg = tt.tensor.inc_subtensor(reg[diag_idxs], b_ii)
    reg = tt.tensor.inc_subtensor(reg[off_diag_idxs], b_ij)
    
    reward_est = pm.math.dot(X, tt.tensor.reshape(reg, (num_regressors, 1)))

    # data likelihood
    reward_like = pm.Normal('reward_like', mu=reward_est, sd=eps, observed=y)

with hierarchical_model:
    trace = pm.sample()

pm.traceplot(trace)
plt.show()

ysagiv · September 15, 2018, 4:50am

@falk

This was very useful! I will have to think more about the points that you have made (e.g. about y not being normally distributed).

At the moment, this is sort of a toy model wrt to what I would actually like to do (in reality, X should not be arbitrary binary data), so I’m curious how well it will translate. Nevertheless, thanks for your and @junpenglao help over these posts!

Topic		Replies	Views
Bad initial energy: inf. The model might be misspecified Questions	5	3578	June 29, 2018
Multinomial hierarchical regression with multiple observations per group ("Bad energy issue") Questions	22	3381	March 14, 2019
Hierarchical logistic regression giving non-sensible results? Do I formulate it correctly? Questions	15	854	August 4, 2020
Mixed multivariate Gauss distribution Questions	39	4879	April 12, 2018
Including a multivariate constrained improper prior in a hierachical model Questions	13	2469	January 12, 2020

Hierarchical linear model with two parent distributions

Related topics