Marginal likelihood for distributions with discrete variables

Hi,

I’ve been reading the excellent post about approximating the marginal likelihood for model selection from @junpenglao [Marginal_likelihood_in_PyMC3] (Motif of the Mind | Junpeng Lao, PhD) and learnt a lot. It will be highly appreciated if I can have a chance to discuss some follow-up questions in this forum.

The parameters in the given examples are all continuous. For me,I want to apply the strategies mentioned in the post like SMC and Bridge sampling to a mixture model. The mixture model includes a discrete variable called assignment indicating which group each subject belongs to, and a corresponding weights variable called p drawn from a Dirichlet distribution so that assignment ~ Categorical(p). I’m not sure how to modify the implementation so that these methods can also apply to the case where the discrete variable assignment and the constraint variable p (between 0 and 1) exist.

  • Regarding the Bridge sampling, in the post, a multivariate normal distribution is used as the proposal distribution, and the corresponding code is:
# get mean & covariance matrix and generate samples from proposal
m = np.mean(samples_4_fit, axis=1)
V = np.cov(samples_4_fit)
L = chol(V, lower=True)

# Draw N2 samples from the proposal distribution
gen_samples = m[:, None] + dot(L, st.norm.rvs(0, 1, size=samples_4_iter.shape))

# Evaluate proposal distribution for posterior & generated samples
q12 = st.multivariate_normal.logpdf(samples_4_iter.T, m, V)
q22 = st.multivariate_normal.logpdf(gen_samples.T, m, V)

# Evaluate unnormalized posterior for posterior & generated samples
q11 = np.asarray([logp(point) for point in samples_4_iter.T])
q21 = np.asarray([logp(point) for point in gen_samples.T])

For my mixture model case, I’m wondering how to modify the proposal becuase it may not appropriate to draw discrete variables from the multivariate normal distribution. But my model do have other continuous variables as well. If I directly implement the code, I will get:
LinAlgError: 31-th leading minor of the array is not positive definite for the L = chol(V,lower= True)
command, because for each subject, the samples of the discrete variable called assignment converges to the same number, then some rows of the matrix V have identical elements.

  • Regarding the SMC method, it seems that the pm.sample_smc function has some problem sampling the discrete variable since I got
11 frames
<__array_function__ internals> in broadcast_arrays(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/lib/stride_tricks.py in _broadcast_shape(*args)
    418     # use the old-iterator because np.nditer does not handle size 0 arrays
    419     # consistently
--> 420     b = np.broadcast(*args[:32])
    421     # unfortunately, it cannot handle 32 or more arguments directly
    422     for pos in range(32, len(args), 31):

ValueError: shape mismatch: objects cannot be broadcast to a single shape

I think my model statement is correct because it can be sucessfully implemeted using pm.sample with CategoricalGibbsMetropolis for discrete variables and Metropolis for continuous variables.

Therefore, I’m wondering for my case with both discrete and continuous variables, as well as a weight variable which needs to be between [0,1], is it possible to use Bridge sampling or SMC to obtain the approximation of mariginal likelihood for model comparison? If not, is there any alternative menthod to estimate the marginal likelihood?

Thank you so much for your help!

1 Like

Glad you enjoy the post :grin:!

Sounds like you have a latent discrete parameter in your model, in general, you should first try to marginalized it (see e.g., Frequently Asked Questions - #15 by junpenglao for how to do that). With that, all the latent parameter in your model is continuous, you can then use Bridge Sampling

As for the error in SMC, are you on the pymc v3 (current release) or pymc v4 (beta)? It might work better in pymc v4 as the forward sampling is much better.

3 Likes

Thank you so much for your reply! It’s my great honor to receive your help!

May I ask a follow up question on this?

You suggested that I can try to marginalize out the discrete variables. I noticed that in one example in your provided link, pm.Mixture function has been used to realize the mixture likelihood without discrete latent variable. Is pm.Mixture the method that you meant for marginalizing the discrete variables out?

I took a closer look at the Mixture — PyMC3 3.11.5 documentation. It seems that the pm.Mixture mixes several distributions eg. dist1: Poisson(mean = mu1), dist2: Poisson (mean = mu2). For subjects belong to the same cluster, they have the same fixed mean (either mu1 or mu2). Here is the code taken from the documentation:

# 2-Mixture Poisson using iterable of distributions.

with pm.Model() as model:
    lam1 = pm.Exponential('lam1', lam=1)
    lam2 = pm.Exponential('lam2', lam=1)

    pois1 = pm.Poisson.dist(mu=lam1)
    pois2 = pm.Poisson.dist(mu=lam2)

    w = pm.Dirichlet('w', a=np.array([1, 1]))
    like = pm.Mixture('like', w=w, comp_dists = [pois1, pois2], observed=data)

However, my case is a bit more complex, as is shown in the below figure. Suppose there are three trajectories, f1(t), f2(t), f3(t). I want to figure out which trajectory is each subject (red dot) most likely to belong? That means I want to mix “processes” instead of “distributions”, or, in other words, at a specific time point ti where the subject i locates, the distrbution of subject i is a mixture of three distributions with mean [f1(ti),f2(ti),f3(ti)]. Thus center of the mixture distributions rely on t, rather than fixed.

The idea can be successfully implemented when the discrete latent variable (indicating which distribution each subject belongs to) exists. If I want to marginalize out this latent discrete variable, how can I achieve that? Is pm.Mixture able to handle this? I’m trying to do something like this

for i in range(nsubjects):
    t = ts[i]
    mus = [f1(t),f2(t),f3(t)]
    comp_dists  = pm.Normal.dist(mu= mus, sigma = 1, shape=(3,))
    Y = pm.Mixture('Y'+str(i), w = w, comp_dists = comp_dists, observed= yobs[i]) 

But the estimated running time is 18 hours (and it’s still running), compared with several minutes when the discrete variables exists in my previous case. So I think I’m not doing something correct.

If not, can you sugguest any other solutions to marginlizing out discrete variables?
Thank you so much for your time and help!

Could you share the initial model with discrete variable?

Thank you so much for your time and help!

I’ve uploaded a ipynb file via Github. It contains a simplified version of my case. (My case has more complex trajectory function f, higher dimesion and more subjects. But this simplified version is easier to read and enough to describe the questions.)

There are four parts in this note book.

Part1: Model with latent discrete variable.
Part2: Marginalize the latent variable out using for loop + pm.Mixture.
Part3: Bridge Sampling using the marginalized model.
Part4: SMC

I have some questions and they are written in detail in the notebook. Brifly speaking:
Question1: Is my implementation of marginalizing out discrete variable using for loop + pm.Mixture correct? Any other better strategies? (I made a mixtake yesterday now the running time is not that long.)

Question2: In the bridge sampling part, when using model.free_RVs, I found that some veriables have transformed distribution and their var names have been changed. How can I deal with such situations?

Question3: Some function in your bridge sampling code is out of date. I modified them. Could you please help me to check if my modification is correct?

Question4: I also tried SMC to estimate the marginal likelihood and want to make comparison to bridge sampling. But got error: ValueError: cannot convert float NaN to integer. I wonder how to solve this.

(You suggested using pymc4 for better SMC. But I’m currently using pymc3 becuase I gotAttributeError: Module 'numpy.core' has no attribute 'numerictypes' when loading pymc4. Will try to solve it later.)

I really appreciate your help and suggestions. Learnt so much from you!

1 Like

For loop is definitely something you want to avoid (vectorizing the operation usually resulting in much faster sampling). Another critic is you should avoid using metropolis (very easy to get wrong or bias result because it is just very inefficient in high dimension).

You shouldnt make these changes. free_RVs are the transformed random variable, and you want to do bridge sampling in the transformed space because all the variables need to be unbounded.

I did some modification of your model code: Google Colab, hope you find helpful.

4 Likes

Thank you so much for the code and detailed explanation! The implementaion is very brilliant! Very glad to learn this effective strategy. :blush: May I have some follow up discussions with you? (I’m sorry I’ve asked so much!)

Discussion 1

Following your excellent example, I tried a higher dimension case: now assume that therer are 5 subjects, each subject is a 4-element vector instead of a scalar. There are two possible cluster centers given. The aim is to find out which center each subject belongs to. The code and result is below.

It seems that the result makes sense. But I’m not sure if I’m allowed to do that. My concern is: the data in pm.Mixture() should be independent (if I undertsand correctly), but here the 4 elements in one subject’s vector is treated as a entirety (they share the same weight vector and should always be assigned to the same cluster), thus each of the four are not independent? In your opinion, do your think my implementation violates the independence assumption? If so, any modification can be done?

#number of types 
nsubtype = 2 
#number of subjects
nsubs = 5
#number of elements in one subject 
nelem = 4

#data shape is (nelem,nsubs)=(4,5), representing 5 subjects, each subject is a 4-element vector
data = np.array([[1.1,1.9,3.2,4.1],[1.2,2.0,3.2,3.9],[1.1,2.0,3.1,3.8],[10.1,20.0,30.2,39.],[11.1,20.2,32.2,40.4]]).T

#two possible cluster centers, [1,2,3,4] and [10,20,30,40]
mu = np.zeros([nelem,nsubs,nsubtype])
mu[:,:,0] = np.array([1.,2.,3.,4.]).reshape(nelem,1)
mu[:,:,1] = np.array([10.,20.,30.,40.]).reshape(nelem,1)

# The aim is to assign the 5 subjects to the two centers
# Ideally the fist 3 subjects will belong to the fisrt center, the last two subjects belong to the second

with pm.Model() as model:
    mu = tt.as_tensor_variable(mu)
    # each subject represented by a vector of weight instead of an int
    mix_cat = pm.Dirichlet("mix_cat", a=np.ones(2), shape= (nsubs,nsubtype))
    components = pm.Normal.dist(mu=mu, sigma=1, shape= (nelem,nsubs,nsubtype))  
    like = pm.Mixture('like', w=mix_cat, comp_dists = components, observed=data, shape = (nelem,nsubs)) # The resulting mixture is nd-shaped
    trace_margin = pm.sample(1000,tune=2000,cores=4,return_inferencedata=True)

Discussion 2

Regarding my previous question 2 about the modification in your Marginal_llk() function, thank you for your explanation. Now I understand that we’re supposed to do bridge sampling in the transformed space. However, I found that the two functions called pm.stats.dict2pd and pm.effective_n that you used are no longer supported in pymc. A similar function called pm.ess only takes original variables (eg mix_cat) rather than the transformed ones (eg. mix_cat_stickbreaking__ ~ TransformedDistribution) as inputs of var_names.

Thus, I’m wondering if the number of effective samples in the transformed space and original space is the same. If the two numbers are the same, perhaps I can actually use original variable names as input of pm.ess and get the effective number of sample for transformed variables (like what I did in my previous notebook)? If not, could you please suggest any other way to get the effective number in the transformed space?

Discussion 3

Finally, for variables like “mix_cat” (which is a nsubjects * ntypes matrix), the effective number of different element of this matrix is different. Should I take a median of all the elements as the effective number of this variable?

I’m sorry that I’ve asked so many questions. I really enjoyed all these discussions and I’m very thankful for these chances to learn from you.

In this case the components are still independent, you need to rewrite them using MvNormal for them to be dependent.

Use the effective sample size function in arviz: arviz.ess — ArviZ dev documentation

Not sure I understand - you should keep all the dimension of mix_cat in the effective sample size computation? Bridge sampling is not expecting you to post process the posterior samples i believe.

Thank you very much for giving so many suggestions! I’m so sorry for bother you continuously.

For your first suggestion of using MvNormal to deal with dependence:

I’m a bit unsure how to define the covariance matrix if I use MvNormal Instead. The case is there are 5 subjects to cluster, each subject is a 4-element vector instead of a scalar. So the only dependence for the 4 elements within one subject is that they should be assigned to the same cluster. (And I’ve already achived it by enforcing the four elements to share the same weight vector in pm.mixture). So it’s hard to quantify the covariance between the four elements using covariance…

For your second suggestion of using az.ess():

It seems that arviz.ess is similar to pm.ess I used before, which only allows the original variable as input, rather than the transformed variable. Using the Google colab
example you worte as an example. Variables alpha, mix_cat and sigma have all been transformed.

However, using az.ess(trace_margin,var_names = ['mix_cat_stickbreaking__']) I got error KeyError: 'var names: "[\'mix_cat_stickbreaking__\'] are not present" in dataset'. I can only use az.ess(trace_margin,var_names = ['mix_cat']) instead, where “mix_cat” is the original variable, not the tranformed one. Have I done something wrong?

For your third suggestion
I meant to calculate median effective sample size (scalar) in your Marginal_llk() code. It seems that the final effective size is a scalar. But for variables like mix_cat with over one dimension, the effective sample size is a matrix

That’s why I’m asking whether I should take a median of this matrix to get the effective sample size which should be a scalar.

Apologies again for keeping asking questions. Thank you a lot

Yeah I think what you are doing assigning a cluster per subject is sufficient - using MvNormal is more to introduce correlation among the 4 elements but that might not be necessary.

As for the error in az.ess, there should be a way to get the trace and ess for the transform RVs, @OriolAbril might know more.

For the getting a scalar n_eff - I think my logic of doing that is to take into account the MCMC sample quality (instead of just using N1 the number of MCMC samples), so it could make sense to take the mean then compute the median, or flatten then concatenate everything and take the median at the last step. I have not run experiment to compare it, and this is different from the implementation from the paper so if you want to simplify things you can also replace the computation of effective sample size with just the MCMC sample size.

1 Like

Thank you so much for all of your help and suggestions!I feel so lucky and hornored that you’ve read all of my long questions patiently and answered them in detail, which helped me a lot! I really appreciate your contributions and enjoy all these discussions! I’ll explore more on pymc!

1 Like

You are welcome! Happy to help and glad you find my answer useful :slight_smile:

1 Like