How to do model comparison with a dummy variable

chfin · September 14, 2023, 2:40pm

Hi!

I’m trying to do a model comparison in a hierarchical meta model using a dummy variable that encodes model choice. Sampling from the full model doesn’t work well because the chains get stuck in local optima: every chain “picks” one model and sticks to it, updating only the corresponding parameters correctly.

To work around this, I have tried to sample the posteriors for all parameters first (training each submodel independently) and then to infer the posterior of the model variable based on these posteriors. However, I’m not really sure how to do that correctly:

When using sample_posterior_predictive, the model variable is sampled from its prior, not its posterior. (I was able to get this to work in numpyro using its Predictive class).
I’ve instead tried to use an alternative version of the meta model, in which the parameter priors are the previously estimated posteriors. That works but I don’t think it’s correct since the observations are the same as before, so I essentially observe the same data twice. I can see that the sampled parameter values are more narrowly distributed than before.
Using pyro, I was able to get Variational Inference to work (numpyro gives me weird NaNs with the same model). This gives me the same results as the predictive approach in numpyro.
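
The narrowing described in the second point can be reproduced in closed form for the global model, because a Beta prior is conjugate to the geometric likelihood. The sketch below is only an illustration under assumptions: the step sizes are made up, and `beta_update` and `beta_variance` are hypothetical helper names, not part of PyMC.

```python
def beta_update(a, b, steps):
    """Conjugate update of a Beta(a, b) prior on the success probability of a
    geometric likelihood, where each step counts failures (0, 1, 2, ...)."""
    return a + len(steps), b + sum(steps)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


# Made-up step sizes, standing in for the real observations.
steps = [0, 2, 1, 5, 0, 3, 7, 1]

# Correct posterior: the Beta(0.5, 0.5) prior updated once with the data.
posterior = beta_update(0.5, 0.5, steps)

# Posterior reused as prior, then the same data observed again:
# every observation is counted twice.
double_counted = beta_update(*posterior, steps)

print("posterior:     ", posterior, beta_variance(*posterior))
print("double counted:", double_counted, beta_variance(*double_counted))
```

The second distribution has the same form as a posterior from a dataset twice as large, which is consistent with the narrower parameter samples reported above.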

Any ideas on how to do this? (By the way, I’m aware of other forms of model comparison, but this is intended as a demonstration of the principles of Bayesian modeling, not an actual application, so I think showing the meta-model / Bayes-factor approach would be cool.)

Here is the code that I’m using. I’m trying to model the sizes of intervals of consecutive notes in polyphonic music. The interval size is always estimated as a geometric distribution (non-negative integers), but the three competing models use different predictors:

model 1 assumes a globally constant parameter
model 2 assumes different parameters for each voice (4 voices, the pieces are string quartets)
model 3 assumes the parameter to depend on the pitch of the preceding note, or the “register” in musical terms (logistic)

You can see this reflected in the meta model.
(The non-flat prior on the model is just there to show that sample_posterior_predictive indeed samples from the prior.)

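import arviz as az
import numpy as np
import pymc as pm
import pytensor as ptn  # assumption: the post uses ptn without showing its import

# given data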
observations = [...] # observed step sizes
staff = [...] # the staff/voice of each datapoint
p0 = [...] # the pitch of the first note corresponding to each datapoint

with pm.Model() as model_meta:
    # model choice
    model_choice = pm.Categorical("model_choice", [0.5, 0.3, 0.2])
    
    # global model
    theta_global = pm.Beta("theta_global", 0.5, 0.5)

    # voice model
    theta_voice = pm.Beta("theta_voice", 0.5, 0.5, shape=4)

    # register model
    a = pm.Normal("a_register", 0, 10)
    b = pm.Normal("b_register", 0, 10)
    theta_register = pm.math.sigmoid(p0*a + b)

    # observation
    theta = ptn.tensor.stack((
        ptn.tensor.fill(p0, theta_global),
        theta_voice[staff],
        theta_register,
    ))
    pm.Geometric("obs", p=theta[model_choice], observed=observations+1)

I use the following auxiliary model to obtain posterior samples for the parameters:

with pm.Model() as model_joint:
    # global model
    theta_global = pm.Beta("theta_global", 0.5, 0.5)
    pm.Geometric("obs_global", p=theta_global, observed=observations+1)

    # voice model
    theta_voice = pm.Beta("theta_voice", 0.5, 0.5, shape=4)
    pm.Geometric("obs_voice", p=theta_voice[staff], observed=observations+1)

    # register model
    a = pm.Normal("a_register", 0, 10)
    b = pm.Normal("b_register", 0, 10)
    theta_register = pm.math.sigmoid(p0*a + b)
    pm.Geometric("obs_register", p=theta_register, observed=observations+1)

    # draw samples
    idata_joint = pm.sample(5_000, chains=4)

I then try to infer model_choice like this:

with model_meta:
    idata_model_choice_meta = pm.sample_posterior_predictive(idata_joint, var_names=["model_choice"])

az.plot_posterior(idata_model_choice_meta.posterior_predictive);

Which gives me samples from the prior (0.5, 0.3, 0.2)

Any advice is welcome.
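
For reference, the quantity the step-wise approach is after is p(m | x) ∝ p(m) · p(x | m), where p(x | m) is the marginal likelihood of each submodel, integrated over that submodel's prior. For the global and voice models this integral has a closed form, because the Beta prior is conjugate to the geometric likelihood. The sketch below assumes made-up data and hypothetical helper names; the register model has no such closed form and would need a numerical estimate, so it is left out here.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_marginal_geometric(steps, a=0.5, b=0.5):
    """log p(x | model) for i.i.d. geometric data with a Beta(a, b) prior on
    the success probability; each step counts failures (0, 1, 2, ...)."""
    return log_beta(a + len(steps), b + sum(steps)) - log_beta(a, b)


def posterior_model_probs(log_marginals, prior):
    """Normalize prior(m) * p(x | m) over the models, working in log space."""
    log_post = [log(p) + lm for p, lm in zip(prior, log_marginals)]
    top = max(log_post)
    weights = [exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up data: step sizes and the voice (0 to 3) of each datapoint.
steps = [0, 2, 1, 5, 0, 3, 7, 1]
staff = [0, 0, 1, 1, 2, 2, 3, 3]

# Global model: one parameter shared by all datapoints.
log_ml_global = log_marginal_geometric(steps)

# Voice model: independent parameters, so the marginal likelihood factorizes.
log_ml_voice = sum(
    log_marginal_geometric([s for s, v in zip(steps, staff) if v == voice])
    for voice in range(4)
)

# Prior weights of the first two models from the post (renormalized inside).
probs = posterior_model_probs([log_ml_global, log_ml_voice], prior=[0.5, 0.3])
print(probs)
```

Note that the integral runs over the prior of each submodel, not over its posterior, so posterior samples of the parameters alone are not enough to compute it this way.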

drbenvincent · September 23, 2023, 1:05pm

This approach is taken in Kruschke’s DBDA book 2nd ed. Chapter 10. There’s a PyMC port by @cluhmann here https://nbviewer.org/github/cluhmann/DBDA-python/blob/master/Notebooks/Chapter%2010.ipynb. Maybe this will be useful?

chfin · September 24, 2023, 3:45pm

Awesome, thanks for the hint. I’ve tried the pseudo prior approach and it works really well for avoiding the sampling problem.

I still wonder whether there is a way to make the step-wise inference work, or whether it is valid in the first place. I feel like I’m making a simplification there that doesn’t work.

In any case, I’ve written up a little notebook that includes both the pseudo-prior and the VI solution for my models, in case that’s interesting for anyone.

cluhmann · September 25, 2023, 11:02am

I think the Kruschke approach/example is illustrative, but should be avoided in practice (e.g., he never uses HMC/NUTS). Instead, a typical approach would be to marginalize out the indicator variable. In the case of just 2 models, you can just use a single, continuous “mixing” parameter bound to [0, 1] (e.g., using a Beta prior) and make your likelihood a weighted-mixture of the 2 models. With more than 2 models, you’re looking at a Dirichlet-distributed set of mixing parameters.
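
A minimal sketch of what the two-model case might look like as a plain log-likelihood, assuming geometric components as in the original post. The function names and numbers are hypothetical, and the weight `w` must lie strictly between 0 and 1 here. In a sampler, `w` would get a Beta prior and this function would play the role of the likelihood.

```python
from math import exp, log


def geometric_logpmf(step, p):
    """log P(step) for a geometric distribution counting failures (0, 1, 2, ...)."""
    return step * log(1.0 - p) + log(p)


def log_add(x, y):
    """Numerically stable log(exp(x) + exp(y))."""
    top = max(x, y)
    return top + log(exp(x - top) + exp(y - top))


def mixture_loglik(steps, w, p1, p2):
    """Log-likelihood of a two-component mixture with weight w on component 1.
    The discrete indicator is summed out separately for every datapoint."""
    return sum(
        log_add(
            log(w) + geometric_logpmf(s, p1),
            log(1.0 - w) + geometric_logpmf(s, p2),
        )
        for s in steps
    )


steps = [0, 2, 1, 5, 0, 3, 7, 1]  # made-up step sizes
print(mixture_loglik(steps, w=0.7, p1=0.3, p2=0.6))
```

Because the indicator no longer appears as a discrete variable, every remaining parameter is continuous, which is what allows NUTS to be used on the whole model.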

chfin · September 25, 2023, 2:46pm

Mh, I see. So if I got this right, I would

add yet another level of hierarchy to the model and
introduce the model probability as another variable (let’s say \mu), so the model becomes
p(\mu, m, \theta, x) = p(\mu) \cdot p(m | \mu) \cdot p(\theta) \cdot p(x | \theta_m, m)
then marginalize out the model choice variable m analytically:
p(\mu, \theta, x) = p(\mu) \cdot p(\theta) \cdot p(x | \theta, \mu)
(where p(x | \theta, \mu) is the weighted mixture likelihood that you mentioned)
and finally look at the posterior of \mu instead of the posterior of m.

Is that right?
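
Spelled out, the marginalization step in the list above sums the joint over the values of m. The block below assumes that m given \mu is categorical with p(m = k | \mu) = \mu_k; the post does not fix the form of p(m | \mu), so this is an assumption made for illustration.

```latex
\begin{aligned}
p(\mu, \theta, x)
  &= \sum_{k} p(\mu) \cdot p(m = k \mid \mu) \cdot p(\theta) \cdot p(x \mid \theta_k, m = k) \\
  &= p(\mu) \cdot p(\theta) \cdot \sum_{k} \mu_k \, p(x \mid \theta_k, m = k),
\end{aligned}
\qquad \text{so} \qquad
p(x \mid \theta, \mu) = \sum_{k} \mu_k \, p(x \mid \theta_k, m = k).
```

Here x stands for the whole dataset, matching the joint written in the list, so each term of the sum is the likelihood of all observations under one submodel.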

cluhmann · September 25, 2023, 3:32pm

You would just replace your indicator parameter m with the mixing parameter \mu so that \mu = p(m) [edit: or, to be a bit more precise, \mu = p(m=1)].

edit:
In the case of 3 sub-models, you would need a Dirichlet-distributed parameter something like this:

# model definitions
# replace with something useful
model_components = [
        # model 1
        pm.Poisson.dist(mu=mu1),
        # model 2
        pm.StudentT.dist(mu=mu2),
        # model 3
        pm.Normal.dist(mu=mu3),
    ]

# mixture weights
w = pm.Dirichlet("w", a=np.ones(num_models))

# likelihood
like = pm.Mixture(
        "like",
        w=w,
        comp_dists=model_components,
        observed=data,
    )
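
A point that may matter when carrying this over to the original meta model: `pm.Mixture` applies the weights to each observation separately, whereas a single `model_choice` variable selects one submodel for the whole dataset. The sketch below compares the two likelihoods for fixed, made-up parameter values and hypothetical function names. It is meant only to show that the two quantities need not agree, not to say which one suits a given analysis.

```python
from math import exp, log


def geometric_logpmf(step, p):
    """log P(step) for a geometric distribution counting failures (0, 1, 2, ...)."""
    return step * log(1.0 - p) + log(p)


def logsumexp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + log(sum(exp(v - top) for v in values))


def per_observation_loglik(steps, weights, ps):
    """sum_i log sum_k w_k p(x_i | theta_k): each datapoint has its own indicator."""
    return sum(
        logsumexp([log(w) + geometric_logpmf(s, p) for w, p in zip(weights, ps)])
        for s in steps
    )


def whole_dataset_loglik(steps, weights, ps):
    """log sum_k w_k prod_i p(x_i | theta_k): one indicator for all the data."""
    return logsumexp([
        log(w) + sum(geometric_logpmf(s, p) for s in steps)
        for w, p in zip(weights, ps)
    ])


steps = [0, 2, 1, 5, 0, 3, 7, 1]  # made-up step sizes
weights = [0.5, 0.3, 0.2]         # made-up mixture weights
ps = [0.2, 0.4, 0.6]              # made-up component parameters

print("per observation:", per_observation_loglik(steps, weights, ps))
print("whole dataset:  ", whole_dataset_loglik(steps, weights, ps))
```

With a single datapoint the two functions return the same value; with more data they generally differ, so the posterior of the mixture weights is not automatically the posterior of the model indicator.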

chfin · February 5, 2024, 10:39am

Hi, sorry for getting back to this so much later.

I finally tried to apply the pm.Mixture() approach to my model, and I’m running into similar problems as with a naive mixture (where the dummy variable is not marginalized). After increasing target_accept to 0.9, I don’t get divergences, but sampling is extremely slow and the chains still get stuck in different regions. Is there something wrong with the model or is marginalizing out the dummy variable not sufficient in a case like this?

Here is the full model specification:

with pm.Model() as model_mixture:
    # model weights
    model_weights = pm.Dirichlet("model_weights", np.ones(3))

    # model 1: global parameter
    theta_global = pm.Beta("theta_global", 0.5, 0.5)

    # model 2: parameters per voice
    theta_voice = pm.Beta("theta_voice", np.full(4, 0.5), np.full(4, 0.5))

    # model 3: parameter depends on preceding pitch
    a = pm.Normal("a", 0, 10)
    b = pm.Normal("b", 0, 10)
    theta_register = pm.math.sigmoid(p0*a + b)

    # mixture components
    components = [
        pm.Geometric.dist(theta_global),
        pm.Geometric.dist(theta_voice[staff]),
        pm.Geometric.dist(theta_register),
    ]

    # observation
    pm.Mixture("obs", w=model_weights, comp_dists=components, observed=observations+1)

    idata_mixture = pm.sample(1000, chains=4, target_accept=0.9)

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 4882 seconds.
The rhat statistic is larger than 1.01 for some parameters. This indicates problems during sampling. See https://arxiv.org/abs/1903.08008 for details
The effective sample size per chain is smaller than 100 for some parameters.  A higher number is needed for reliable rhat and ess computation. See https://arxiv.org/abs/1903.08008 for details

And here is the trace:

Would be curious to hear what you think about this.

cluhmann · February 5, 2024, 8:42pm

That all looks reasonably good to me. There’s only a single chain that is failing to mix. I would expect that tweaking the tuning routine might help to take care of that. As for the fact that the model ultimately ends up preferring a mixture in which a single component is dominant, that’s likely to be a function of your components and your data.
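
One way to check which chain is the odd one out is to recompute a convergence statistic with each chain left out in turn. The sketch below uses the classic Gelman-Rubin R-hat on made-up draws for a single scalar parameter. It is a simplified stand-in, not the rank-normalized split R-hat from the paper linked in the sampler warning, so its values will not match the reported ones exactly. `classic_rhat` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, variance


def classic_rhat(chains):
    """Gelman-Rubin potential scale reduction for one scalar parameter.
    `chains` is a list of equally long lists of draws, one list per chain."""
    n = len(chains[0])
    within = mean([variance(c) for c in chains])        # mean within-chain variance
    between = n * variance([mean(c) for c in chains])   # n * variance of chain means
    var_hat = (n - 1) / n * within + between / n
    return sqrt(var_hat / within)


# Made-up draws: three chains agree, the fourth sits in a different region.
chains = [
    [0.1, 0.2, 0.3, 0.2],
    [0.2, 0.1, 0.3, 0.2],
    [0.3, 0.2, 0.1, 0.2],
    [0.9, 0.8, 0.9, 0.8],
]

print("all chains:", classic_rhat(chains))

# Leave each chain out in turn; the statistic drops sharply only when
# the stuck chain is the one removed.
drop_one = [classic_rhat(chains[:i] + chains[i + 1:]) for i in range(len(chains))]
for i, r in enumerate(drop_one):
    print(f"without chain {i}: {r:.3f}")
```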

chfin · February 6, 2024, 12:06pm

I see. Thanks for the feedback!
