Likelihood of latent variables in hierarchical model for model comparison

Discussions of model comparison in hierarchical models can easily be found, but I have both a theoretical and a couple of practical questions.

My situation is that I have multiple subjects, s, each being measured multiple times in paired measures x and y. I want to regress on the relationship of x and y per subject, but I’m estimating each of these from the multiple measures. That is, model is essentially this:

ys ~ N(beta_0 + beta_1*xs, ys_sigma)
y[i] ~ N(ys[s[i]], y_sigma[s[i]])
x[i] ~ N(xs[s[i]], x_sigma[s[i]])

I want to compare this model with a version without the beta_0 parameter (one that assumes y=0 for a subject for whom x=0). Thus, I need to be able to calculate WAIC or LOO at the subject level.

To me, it seems that the log likelihood I want for this model comparison is evaluated for each subject and the likelihood of should include both how likely y_s is given x_s and how likely the x and y are given y_s and x_s.

First question: I haven’t seen this anywhere. Is it wrong?

My solution to getting this was to calculate the log likelihood I wanted after the sampling. I will give code snippets below for a simpler version of the problem with only one variable.

Second question: This code is really slow. Can it be speeded up?

Third question: This code causes loo and waic to return warnings about influential data points. Is there a way to get a better answer?

Fourth question: Is it possible to calculate the log likelihood that I want inside the trace sampling so it doesn’t need to be calculated twice?

Code below

I made a 1 dimensional version of this problem where multiple subjects each produce multiple x values and my two models either have the grand mean of the subjects given as 0.0 or estimated as a parameter.

Here is random data generation (so that data structure is clear):

xs = stats.norm.rvs(loc=0, scale=1, size=num_subjects)
ns = stats.randint.rvs(low=4, high=8, size=num_subjects)
x_sigma = obs_sigma_sigma*stats.truncnorm.rvs(a=0, b=np.infty, size=num_subjects)
data = pd.DataFrame()
for this_s, (this_xs, this_n, this_x_sigma) in enumerate(zip(xs,ns,x_sigma)):
    x = stats.norm.rvs(loc=this_xs, scale=this_x_sigma, size=this_n)
    data = pd.concat([data, pd.DataFrame({"s": [this_s for xi in x], "x": x})])

I fit a hierarchical model which has either a mean at 0.0 or one which is estimated:

def simple_model(x,s,use_mu0):
  s_idx, s_vals = pd.factorize(s, sort=True)
  coords = {"subject": s_vals, "points":np.arange(len(x))}
  with pm.Model(coords=coords) as model:
    # Data
    s_idx = pm.Data("s_idx", s_idx, dims="points", mutable=False)
    # Priors
    if use_mu0:
      mu0 = pm.Normal('mu0', mu=0, sigma=1)
      mu0 = pm.Data('mu0', 0.0, mutable=False)
    xs_sigma = pm.HalfNormal("xs_sigma", sigma=1.0)
    x_sigma = pm.HalfNormal("x_sigma", sigma=0.5, dims="subject")
    # Latent variables
    xs = pm.Normal('xs', mu=mu0, sigma=xs_sigma, dims="subject")
    # Likelihood 
    obs_x = pm.Normal('obs_x', mu=xs[s_idx], sigma=x_sigma[s_idx], observed=x, dims="points")
  return model

and sample it:

idata = dict()
mdl = dict()
for use_m0 in [False, True]:
    mdl[use_m0] = simple_model(data.x, data.s, use_m0)
    with mdl[use_m0]:
        idata[use_m0] = pm.sample(2000, tune=3000, cores=3, target_accept=0.95, return_inferencedata=True)

Until ultimately this is my very slow calculation of the log likelihood I think I want:

def calc_ll(idata, mu0):
    x = idata.observed_data["obs_x"]
    s_idx = idata.constant_data["s_idx"]
    xs = idata.posterior["xs"]
    x_sigma = idata.posterior["x_sigma"]
    xs_sigma = idata.posterior["xs_sigma"]

    mu0 = mu0.broadcast_like(xs)
    xs_sigma = xs_sigma.broadcast_like(xs)
    ll_xs = stats.norm(mu0, xs_sigma).logpdf(xs)

    ll_x = np.zeros_like(ll_xs)

    x_subj = x.groupby(s_idx)
    coords = idata.posterior.coords
    for c in coords["chain"]:
        print(f'Chain: {c.item()}')
        for d in coords["draw"]:
            for s in range(len(x_subj)):
                ll_x[c,d,s] = stats.norm(xs[c, d, s], x_sigma[c, d, s]).logpdf(x_subj[s]).sum()
    ll = ll_xs + ll_x
    return xr.DataArray(ll, coords=xs.coords, dims=xs.dims)

I can use compare to compare this with either WAIC or LOO, var_name="xs", ic="loo")

But I get complaints about quality of the fit of the pareto distribution.

Thanks for your help!

Can you find what you need in idata.log_likelihood?

The code I made only has obs_x in idata.log_likelihood. Is there a way to get specific probability sums added to log_likelihood?

I don’t think so, because the log_likelihood group is added as sampling occurs. What do you need that isn’t captured in that group? I can’t quite parse your post-processing snippet.

Thanks for getting back to me on this!

I want the log-likelihood of each x_s to be calculated as \textrm{logp}(x_s | \mu_0)+\sum{\textrm{logp}(x_i|x_s)} where the sum is taken over the data points belonging to group s.

You can see me trying to calculate the likelihood of x_s given mu_0 in the line:
ll_xs = stats.norm(mu0, xs_sigma).logpdf(xs)

and the likelihood of the sum in the line:
ll_x[c,d,s] = stats.norm(xs[c, d, s], x_sigma[c, d, s]).logpdf(x_subj[s]).sum()

I then add them together using:
ll = ll_xs + ll_x

Is that clearer?

So is that expression just the log likelihood of the data from group s? If so, can’t you extract the subset of the log_likelihood group associated with the data observed from group s?

I apologize for my confusion. Your original question referenced x and y and y_s and x_s, but then the model includes x_s, xs, and obs_x. All the different x’s hurt my head.

It’s certainly my fault for not being clearer.

As far as I know, what you’re suggetsing would be just the likelihood of the data from group s given our estimate of its mean x_s. I wrote this as \sum{\textrm{logp}(x_i | x_s)} above. I’ve seen that used in “leave one group out” likelihoods. I am interested in the best or most canonical way to calculate it.

However, in my view, that probability is insufficient in this case. The parameter associated with the subject (x_s) should also be considered. That is, the likelihood associated with a subject should be the probability of having a subject with the estimated parameter (p(x_s|\mu_0)) times the probability of getting their specific data given that parameter (p(x_i|x_s)).

That’s what I’m trying to calculate.

Is that clearer?

Yes! You are correct about the various ingredients be included (or not). So I guess having the likelihood of the observed data will allow some computational savings because you don’t have to (re) calculate them. The other way to get some (lots) of computational savings would be to vectorize the loops you are currently using. You can collapse over the chains by flattening the samples into giant 1D vectors (e.g. using arviz.extract()). Then you should be able to do this line (not exactly this line, but something very similar) once:

ll_x[c,d,s] = stats.norm(xs[c, d, s], x_sigma[c, d, s]).logpdf(x_subj[s]).sum()

Or if you don’t want to deal with the indexing over subjects, you can loop over subjects and perform a version of the above line once per subject. That may be fast enough for you.

Ok. That was super helpful. The whole thing runs without loops now and is maybe 4 orders of magnitude faster (seriously).

For the record, here is the code for calculating the log likelihood, based on your suggestion. I still wonder whether it would be possible to get the likelihood of x_s into the idata object during sampling.

def calc_ll(idata, mu0):
    xs = idata.posterior["xs"]
    xs_sigma = idata.posterior["xs_sigma"]

    mu0 = mu0.broadcast_like(xs)
    xs_sigma = xs_sigma.broadcast_like(xs)
    ll_xs = stats.norm(mu0, xs_sigma).logpdf(xs)

    logp_x = idata.log_likelihood["obs_x"]
    s_idx = idata.constant_data["s_idx"]
    ll_x = logp_x.groupby(s_idx).sum()
    ll = ll_xs + ll_x
    return xr.DataArray(ll, coords=xs.coords, dims=xs.dims)

In any case, thanks so much for the help!


1 Like