Mixture of hierarchical models

Hello,

The data I am trying to model are behavioural scores from 20 subjects. For each subject, we collected approximately 15 trials before treatment and 15 trials after treatment. In the following image, each dot represents a subject's average score:
[image: per-subject average scores, pre- vs post-treatment]
Unfortunately, the data are proprietary and can’t be shared, but I will try my best to characterise the problem.

As you can see, subjects with a pre-treatment score above 0.011 have a negative slope (let’s call this “decreasing group”), while subjects below that threshold have a positive slope (let’s call this “increasing group”).

I have successfully modelled the data using a hierarchical model with varying intercepts and slopes:
https://docs.pymc.io/notebooks/multilevel_modeling.html

However, I would like to have a model that automatically clusters subjects in the groups mentioned above (i.e. decreasing/increasing).

For this purpose I am trying to fit a “mixture of hierarchical models” to this data:

import numpy as np
import pymc3 as pm
import theano.tensor as tt

k = 2
# observed: vector of observations
# x_time: for each observation, 0 for pre-treatment, 1 for post-treatment
# subjects: for each observation, the subject ID
# n_sbj: number of subjects
with pm.Model() as model:
    # cluster sizes
    p = pm.Dirichlet('p', np.ones(k))
    # ensure all clusters have some points
    p_min_potential = pm.Potential('p_min_potential', tt.switch(tt.min(p) < .1, -np.inf, 0))

    # latent cluster of each subject
    category = pm.Categorical('category', p=p, shape=n_sbj)

    # Priors
    mu_a = pm.Normal('mu_a', mu=0., sd=1e5, shape=k)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=1e5, shape=k)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Intercepts
    a = pm.Normal('a', mu=mu_a[category], sd=sigma_a, shape=n_sbj)

    # Slopes
    b = pm.Normal('b', mu=mu_b[category], sd=sigma_b, shape=n_sbj)

    # Model error
    sd_y = pm.HalfCauchy('sd_y', 5)

    # Expected value
    y_hat = a[subjects] + b[subjects] * x_time

    # Data likelihood
    y_like = pm.Normal('y_like', mu=y_hat, sd=sd_y, observed=observed)
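
As a side note, the indexing trick in y_hat is plain fancy indexing: the per-subject parameters are expanded to per-observation values. A minimal numpy sketch with made-up toy numbers (not the real data):

```python
import numpy as np

subjects = np.array([0, 0, 1, 1, 2, 2])   # subject ID per observation
x_time = np.array([0, 1, 0, 1, 0, 1])     # pre/post indicator per observation

a = np.array([0.010, 0.012, 0.008])       # toy per-subject intercepts
b = np.array([-0.002, -0.003, 0.004])     # toy per-subject slopes

# fancy indexing expands per-subject parameters to per-observation values
y_hat = a[subjects] + b[subjects] * x_time
# y_hat == [0.01, 0.008, 0.012, 0.009, 0.008, 0.012]
```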

The sampling returns several warnings:

Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
NUTS: [sd_y_log__, b, a, sigma_b_log__, mu_b, sigma_a_log__, mu_a, p_stickbreaking__]
BinaryGibbsMetropolis: [category]
100%|██████████| 6000/6000 [00:40<00:00, 149.02it/s]
There were 5 divergences after tuning. Increase target_accept or reparameterize.
There were 1 divergences after tuning. Increase target_accept or reparameterize.
There were 3 divergences after tuning. Increase target_accept or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.

And the traces look like this:
[trace plots image]

And the forest plot looks like this:
[forest plot image]

Do you have any idea how to fix the model?
Or if you think that there is a better way of modelling this data, please let me know.

Thank you in advance.

There are quite a few discussions here on how to improve sampling of mixture models, e.g.: Properly sampling mixture models. In your case, you should try to:

  1. rewrite the model using pm.Mixture
    Marginalizing out the latent categories is usually better, as you can then use NUTS to sample all parameters; mixing a Gibbs sampler with NUTS does not always work well.

  2. restrict the order of the slopes
    This limits the mode-switching (label-switching) problem; see How to improve fit of Poisson Mixture Model in PyMC3?

  3. use better priors
    More informative priors for mu_a and mu_b would also help.


Hi, thanks for the reply.
I am trying to rewrite the model using pm.Mixture, following other discussions:

However, I find it difficult to handle the fact that observations are “nested” in subjects. I tried several models (only two attached, i.e. Model 1 and Model 2), but none of them works. Could you be a bit more specific about how to rewrite the model?

Thanks.

Model 1

k = 2
x_t = time[:, np.newaxis]
x_time = shared(x_t, broadcastable=(False, True))
with pm.Model() as varying_intercept:
    # Priors
    mu_a = pm.Normal('mu_a', mu=0., sd=0.1, shape=k)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=0.1, shape=k)
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # Random intercepts and slopes, one column per mixture component
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=(n_sbj, k))
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=(n_sbj, k))

    # Model error
    sd_y = pm.HalfCauchy('sd_y', 5)

    # Expected value
    y_hat = a[subjects, :] + b[subjects, :] * x_time

    # Data likelihood
    p = pm.Dirichlet('p', np.ones(k))
    likelihood = pm.NormalMixture('likelihood', p, y_hat, sd=sd_y, observed=observed)
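
As a plain-numpy sanity check of the shapes involved here (toy values, not the real data): indexing a (n_sbj, k) parameter array with subjects gives an (n_obs, k) array, and the (n_obs, 1) x_time column broadcasts against it, so y_hat ends up with one column of expected values per mixture component, which is the shape NormalMixture needs:

```python
import numpy as np

n_sbj, k = 3, 2
subjects = np.array([0, 0, 1, 1, 2, 2])               # subject ID per observation
x_time = np.array([0, 1, 0, 1, 0, 1])[:, np.newaxis]  # shape (n_obs, 1)

a = np.arange(n_sbj * k).reshape(n_sbj, k) * 0.001    # toy (n_sbj, k) intercepts
b = np.ones((n_sbj, k)) * 0.002                       # toy (n_sbj, k) slopes

# (n_obs, k) + (n_obs, k) * (n_obs, 1) broadcasts to (n_obs, k)
y_hat = a[subjects, :] + b[subjects, :] * x_time
# y_hat.shape == (6, 2): one column of expected values per component
```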

Auto-assigning NUTS sampler…
Initializing NUTS using jitter+adapt_diag…
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [p_stickbreaking__, sd_y_log__, b, a, sigma_b_log__, mu_b, sigma_a_log__, mu_a]
100%|██████████| 6000/6000 [01:26<00:00, 69.28it/s]
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

Model 2

k = 2
x_t = time
x_time = shared(x_t)
with pm.Model() as varying_intercept:

    # cluster sizes
    p = pm.Dirichlet('p', np.ones(k))

    mu_a = pm.Normal('mu_a', mu=0, sd=0.1, shape=k)  # intercepts
    mu_b = pm.Normal('mu_b', mu=0, sd=0.1, shape=k)  # slopes

    sd_a = pm.HalfCauchy('sd_a', 5, shape=k)
    sd_b = pm.HalfCauchy('sd_b', 5, shape=k)

    # per-subject intercepts and slopes, each drawn from a two-component mixture
    a = pm.NormalMixture('mix_mu_a', p, mu_a, sd=sd_a, shape=n_sbj)
    b = pm.NormalMixture('mix_mu_b', p, mu_b, sd=sd_b, shape=n_sbj)

    # Expected value
    y_hat = a[subjects] + b[subjects] * x_time

    # Model error
    sd_y = pm.HalfCauchy('sd_y', 5)

    # Data likelihood
    y_like = pm.Normal('y_like', mu=y_hat, sd=sd_y, observed=observed)

Auto-assigning NUTS sampler…
Initializing NUTS using jitter+adapt_diag…
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sd_y_log__, mix_mu_b, mix_mu_a, sd_b_log__, sd_a_log__, mu_b, mu_a, p_stickbreaking__]
100%|██████████| 6000/6000 [00:37<00:00, 160.68it/s]
There were 1 divergences after tuning. Increase target_accept or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

Model 1 is definitely in the right direction. However, you should still restrict the coefficients to make sure there is no mode switching. As a quick fix, try adding this constraint to your Model 1:

    # break symmetry
    order_means_potential = pm.Potential('order_means_potential',
                                         tt.switch(mu_b[1]-mu_b[0] < 0, -np.inf, 0))

Also, you might find some inspiration from this question:

I have two notebooks that explore different parameterizations and inference:
https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/mixture/Mixture_discourse.ipynb
https://github.com/junpenglao/Planet_Sakaar_Data_Science/blob/master/PyMC3QnA/mixture/Mixture_discourse_order.ipynb

Hi, thanks for the quick replies.

I tried to add this:

but I got

ValueError: Bad initial energy: inf. The model might be misspecified.

With respect to your second suggestion

I am not sure what I should take inspiration from. Are you suggesting switching the labels by hand?

Try pm.sample(init='adapt_diag'), and also provide a test value to mu_b: mu_b = Normal('mu_b', mu=0., sd=0.1, shape=k, testval=np.asarray([-.1, .1]))

Those two notebooks are mixture regressions; sorry there are not many comments, but you can see different ways of modelling it.
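
Regarding “switching the labels by hand”: one way to see what post-hoc relabeling means is to sort the components within each posterior draw. A toy numpy sketch with fake posterior draws (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# fake posterior draws of mu_b for k=2 components, with label switching:
# roughly half the draws have the components in the opposite order
n_draws = 1000
mu = rng.normal([-0.1, 0.1], 0.01, size=(n_draws, 2))
flip = rng.random(n_draws) < 0.5
mu[flip] = mu[flip, ::-1]

# naive posterior means are useless because the labels are mixed up
print(mu.mean(axis=0))          # both columns close to 0

# relabel each draw by sorting the component means within the draw
relabeled = np.sort(mu, axis=1)
print(relabeled.mean(axis=0))   # close to [-0.1, 0.1]
```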

Hi,

I tried as you suggested without any luck (results attached at the end). Do you have other ideas?

Alternatively, do you think I could just model the data using a simple varying-intercepts-and-slopes model (https://docs.pymc.io/notebooks/multilevel_modeling.html) and then say something about the two clusters a posteriori? In this case, can you suggest a method?

Results of the suggested modifications:

k = 2
x_t = time[:, np.newaxis]
x_time = shared(x_t, broadcastable=(False, True))
with pm.Model() as varying_intercept:

    # Priors
    mu_a = pm.Normal('mu_a', mu=0., sd=0.1, shape=k)
    sigma_a = pm.HalfCauchy('sigma_a', 5)
    mu_b = pm.Normal('mu_b', mu=0., sd=0.1, shape=k, testval=np.asarray([-.1, .1]))
    sigma_b = pm.HalfCauchy('sigma_b', 5)

    # break symmetry
    order_means_potential = pm.Potential('order_means_potential',
                                         tt.switch(mu_b[1] - mu_b[0] < 0, -np.inf, 0))

    # Random intercepts and slopes
    a = pm.Normal('a', mu=mu_a, sd=sigma_a, shape=(n_sbj, k))
    b = pm.Normal('b', mu=mu_b, sd=sigma_b, shape=(n_sbj, k))

    # Model error
    sd_y = pm.HalfCauchy('sd_y', 5)

    # Expected value
    y_hat = a[subjects, :] + b[subjects, :] * x_time

    # Data likelihood
    p = pm.Dirichlet('p', np.ones(k))
    likelihood = pm.NormalMixture('likelihood', p, y_hat, sd=sd_y, observed=observed)

    varying_intercept_trace = pm.sample(5000, n_init=50000, tune=1000, init='adapt_diag')[1000:]

Auto-assigning NUTS sampler…
Initializing NUTS using adapt_diag…
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [p_stickbreaking__, sd_y_log__, b, a, sigma_b_log__, mu_b, sigma_a_log__, mu_a]
100%|██████████| 6000/6000 [00:58<00:00, 102.81it/s]
There were 2997 divergences after tuning. Increase target_accept or reparameterize.
There were 4073 divergences after tuning. Increase target_accept or reparameterize.
There were 2945 divergences after tuning. Increase target_accept or reparameterize.
There were 2646 divergences after tuning. Increase target_accept or reparameterize.
The gelman-rubin statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.

Mixture models are difficult to do inference on: you need to construct the priors carefully and add constraints to limit mode switching during sampling. Unfortunately, without the data it is difficult to give you more insight.

If you already have the mixed-effects model set up, you can do clustering on the random-effect slopes: after all, it is valid to compute the expectation of any function of the parameters using the MCMC samples. I would try K-means clustering on each iteration and then compute the mean assignment.
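
A toy sketch of that a-posteriori clustering idea, in plain numpy, with fake posterior draws standing in for the slope samples (e.g. trace['b'] from the plain hierarchical model; every number below is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# fake posterior draws of the per-subject slopes b, shape (n_draws, n_sbj):
# 10 "decreasing" subjects around -0.003, 10 "increasing" around 0.004
n_draws, n_sbj = 500, 20
true_group = np.array([0] * 10 + [1] * 10)
centers = np.where(true_group == 0, -0.003, 0.004)
b_draws = rng.normal(centers, 0.001, size=(n_draws, n_sbj))

def kmeans_1d(x, iters=20):
    """Tiny 2-means for one draw of the slopes; returns 0/1 labels,
    with cluster 0 = the lower-mean (decreasing) cluster."""
    c = np.array([x.min(), x.max()])   # init centers at the extremes
    for _ in range(iters):
        labels = (np.abs(x - c[0]) > np.abs(x - c[1])).astype(int)
        for j in (0, 1):
            if np.any(labels == j):
                c[j] = x[labels == j].mean()
    return labels

# cluster every posterior draw, then average the assignments:
# p_increasing[i] estimates the posterior probability that
# subject i belongs to the increasing group
labels = np.array([kmeans_1d(b) for b in b_draws])
p_increasing = labels.mean(axis=0)
print(np.round(p_increasing, 2))
```

With well-separated groups like these, p_increasing ends up near 0 for the first ten subjects and near 1 for the last ten; with the real data it would quantify how certain the clustering is per subject.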