Mixture of hierarchical model

peppe · May 30, 2018, 3:02am

Hello,

the data I am trying to model is composed of behavioural scores from 20 subjects. For each subject, we collected approximately 15 trials before treatment and 15 trials after treatment. In the following image, each dot represents the average score of each subject:

Unfortunately, the data are proprietary and can’t be shared, but I will try my best to characterise the problem.

As you can see, subjects with a pre-treatment score above 0.011 have a negative slope (let’s call this “decreasing group”), while subjects below that threshold have a positive slope (let’s call this “increasing group”).

I have successfully modelled the data using a hierarchical model with a varying intercept and slope model:
https://docs.pymc.io/notebooks/multilevel_modeling.html

However, I would like to have a model that automatically clusters subjects in the groups mentioned above (i.e. decreasing/increasing).

For this purpose I am trying to fit a “mixture of hierarchical models” to this data:

k=2
# observed: vector of observations
# x_time: for each observation 0 for pre-treatment, 1 for post-treatment
# subjects: for each observation contains the subject ID
with Model() as model:
    # cluster sizes
    p = pm.Dirichlet('p', np.ones(k))
    # ensure all clusters have some points
    p_min_potential = pm.Potential('p_min_potential', tt.switch(tt.min(p) < .1, -np.inf, 0))

    # latent cluster of each subject
    category = pm.Categorical('category',
                              p=p,
                              shape=n_sbj)

    # Priors
    mu_a = Normal('mu_a', mu=0., sd=1e5,shape=k)
    sigma_a = HalfCauchy('sigma_a', 5)
    mu_b = Normal('mu_b', mu=0., sd=1e5,shape=k)
    sigma_b = HalfCauchy('sigma_b', 5)

    # Intercepts ntercepts
    a = Normal('a', mu=mu_a[category], sd=sigma_a, shape=n_sbj)

    # Slopes 
    b = pm.Normal('b',mu=mu_b[category],sd=sigma_b,shape=n_sbj)

    # Model error
    sd_y = HalfCauchy('sd_y', 5)

    # Expected value
    y_hat = a[subjects] + b[subjects] * x_time

    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sd_y, observed=observed)

The sampling returns several warnings:

Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
NUTS: [sd_y_log__, b, a, sigma_b_log__, mu_b, sigma_a_log__, mu_a, p_stickbreaking__]
BinaryGibbsMetropolis: [category]
100%|██████████| 6000/6000 [00:40<00:00, 149.02it/s]
There were 5 divergences after tuning. Increase target_accept or reparameterize.
There were 1 divergences after tuning. Increase target_accept or reparameterize.
There were 3 divergences after tuning. Increase target_accept or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.

And the traces look like this:

And the forestplot like this:

Do you have any idea how to fix the model?
Or if you think that there is a better way of modelling this data, please let me know.

Thank you in advance.

junpenglao · May 30, 2018, 5:15am

There are quite a few discussion here on how to improve sampling of mixture models, eg:
Properly sampling mixture models. In your case, you should try to:

rewrite the model using pm.Mixture
A latent mixture is usually better as you can use NUTS to sample all parameters - otherwise mixing Gibbs Sampler and NUTS might not always works.
restrict the order of the slopes
In this way you limit the mode switching problem, see How to improve fit of Poisson Mixture Model in PyMC3?
better prior
More informative prior for mu_a and mu_b would also help.

peppe · May 30, 2018, 9:04am

Hi thanks for the reply.
I am trying to rewrite the model using pm.Mixtures, by following other discussions:

However, I find it difficult to handle the fact that observations are “nested” in subjects. I tried several models (attached only two, i.e. Model 1 and Model2), but none works. Could you be a bit more specific about the way of rewriting the model?

Thanks.

Model 1

k=2
#x_t = time
#x_time = shared(x_t)
x_t = time[:,np.newaxis]
x_time = shared(x_t,broadcastable=(False,True))
with Model() as varying_intercept:
    # Priors
    mu_a = Normal('mu_a', mu=0., sd=0.1,shape=k)
    sigma_a = HalfCauchy('sigma_a', 5)
    mu_b = Normal('mu_b', mu=0., sd=0.1,shape=k)
    sigma_b = HalfCauchy('sigma_b', 5)

    # Random intercepts
    a = Normal('a', mu=mu_a, sd=sigma_a, shape=(n_sbj,k))
    b = pm.Normal('b',mu=mu_b,sd=sigma_b,shape=(n_sbj,k))

    # Model error
    sd_y = HalfCauchy('sd_y', 5)

    # Expected value
    y_hat = a[subjects,:] + b[subjects,:] * x_time

    # Data likelihood   
    p = pm.Dirichlet('p', np.ones(k))
    likelihood = pm.NormalMixture('likelihood', p, y_hat, sd=sd_y, observed=observed)

Auto-assigning NUTS sampler…
Initializing NUTS using jitter+adapt_diag…
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [p_stickbreaking__, sd_y_log__, b, a, sigma_b_log__, mu_b, sigma_a_log__, mu_a]
100%|██████████| 6000/6000 [01:26<00:00, 69.28it/s]
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

Model 2

k=2
x_t = time
x_time = shared(x_t)
with Model() as varying_intercept:

    # cluster sizes
    p = pm.Dirichlet('p', np.ones(k))

    mu_a = pm.Normal('mu_a', mu=0, sd=0.1, shape=k) # intercept
    mu_b = pm.Normal('mu_b', mu=0, sd=0.1, shape=k) # slope

    sd_a = HalfCauchy('sd_a', 5,shape=k)
    sd_b = HalfCauchy('sd_b', 5,shape=k)

    a = pm.NormalMixture('mix_mu_a', p, mu_a, sd=sd_a,shape=n_sbj)
    b = pm.NormalMixture('mix_mu_b', p, mu_b, sd=sd_b,shape=n_sbj)

    # Expected value
    y_hat = a[subjects] + b[subjects] * x_time

    #Model error
    sd_y = HalfCauchy('sd_y', 5)

    # Data likelihood
    y_like = Normal('y_like', mu=y_hat, sd=sd_y, observed=observed)

Auto-assigning NUTS sampler…
Initializing NUTS using jitter+adapt_diag…
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sd_y_log__, mix_mu_b, mix_mu_a, sd_b_log__, sd_a_log__, mu_b, mu_a, p_stickbreaking__]
100%|██████████| 6000/6000 [00:37<00:00, 160.68it/s]
There were 1 divergences after tuning. Increase target_accept or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

junpenglao · May 30, 2018, 11:01am

model 1 is definitely in the right direction. However, you should still try to restrict the coefficient to make sure there is no mode switching. As a quick fix try adding this constraint into your model1:

    # break symmetry
    order_means_potential = pm.Potential('order_means_potential',
                                         tt.switch(mu_b[1]-mu_b[0] < 0, -np.inf, 0))

junpenglao · May 30, 2018, 11:07am

Also, you might find some inspiration from this question:

I had two notebook that explore different parameterization and inference:
https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/mixture/Mixture_discourse.ipynb
https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/mixture/Mixture_discourse_order.ipynb

peppe · May 31, 2018, 1:54am

Hi thanks for the quick replies.

I tried to add this:

but I got

ValueError: Bad initial energy: inf. The model might be misspecified.

With respect to your second suggestion

I am not sure where I should get inspiration from. Are you suggesting to “switch label by hand”?

junpenglao · May 31, 2018, 6:08am

Try pm.sample(init='adapt_diag'), and also providing a test_value to mu_b: mu_b = Normal('mu_b', mu=0., sd=0.1, shape=k, testval=np.asarray([-.1,.1]))

Those two notebooks are mixture regressions, sorry there is not many comments but you can see different ways of modelling it.

peppe · May 31, 2018, 6:44am

Hi,

I tried as you suggested without any luck (results attached at the end). Do you have other ideas?

In alternative, do you think that I could just model the data using a simple varying intercept and slope model (https://docs.pymc.io/notebooks/multilevel_modeling.html) and then say something about the two clusters a posteriori? In this case, can you suggest a method?

Results of the suggested modifications:

k=2
x_t = time[:,np.newaxis]
x_time = shared(x_t,broadcastable=(False,True))
with pm.Model() as varying_intercept:

    # Priors
    mu_a = pm.Normal('mu_a', mu=0., sd=0.1,shape=k)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=0.1,shape=k,testval=np.asarray([-.1,.1]))
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    order_means_potential = pm.Potential('order_means_potential',
                                          tt.switch(mu_b[1]-mu_b[0] < 0, -np.inf, 0))

    # Random intercepts
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=(n_sbj,k))
    b = pm.Normal('b',mu=mu_b,sd=sigma_b,shape=(n_sbj,k))

    # Model error
    sd_y = pm.HalfCauchy('sd_y', 5)

    # Expected value
    y_hat = a[subjects,:] + b[subjects,:] * x_time

    # Data likelihood   
    p = pm.Dirichlet('p', np.ones(k))
    likelihood = pm.NormalMixture('likelihood', p, y_hat, sd=sd_y, observed=observed)

    varying_intercept_trace = pm.sample(5000, n_init=50000, tune=1000,init='adapt_diag')[1000:]

Auto-assigning NUTS sampler…
Initializing NUTS using adapt_diag…
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [p_stickbreaking__, sd_y_log__, b, a, sigma_b_log__, mu_b, sigma_a_log__, mu_a]
100%|██████████| 6000/6000 [00:58<00:00, 102.81it/s]
There were 2997 divergences after tuning. Increase target_accept or reparameterize.
There were 4073 divergences after tuning. Increase target_accept or reparameterize.
There were 2945 divergences after tuning. Increase target_accept or reparameterize.
There were 2646 divergences after tuning. Increase target_accept or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

junpenglao · May 31, 2018, 6:59am

Mixture models are difficult to inference - you need to carefully construct the prior and put constraint to limit mode switching during sampling. Unfortunately without the data it is difficult to give you more insight.

If you already have the mixed-effect model set up, you can do clustering on the random effect slope - after all it is valid to compute expectation of some function using the MCMC samples. I would try Kmean clustering on each iteration and then compute the mean.

Topic		Replies	Views
Hierarchical Gaussian Mixture Regression Model Questions	5	1800	March 16, 2019
Hierarchical Mixture of Multivariate Normal Questions	0	590	September 29, 2021
Question on Regression Mixture Model	0	325	July 11, 2023
Multinomial hierarchical regression with multiple observations per group ("Bad energy issue") Questions	22	3352	March 14, 2019
Hierarchical model with mixture prior Questions	1	1785	January 12, 2019

Mixture of hierarchical model

Related topics