Out of sample/model predictions in hierarchical model with new observed data

I have a simple partial pooled model that fits a normal distribution to observed data. In my actual data, I have hundreds of groups with anywhere between 1 and hundreds of observations per group. I tried fitting to the entire dataset and found that I was getting major divergences and sampling issues. Instead, I tried fitting the groups with only some minimum number of observations and then tried again. This time I had really fast sampling with very good results. I am wondering two things:

  1. Is this valid, or does it just bypass one of the strong points of Bayesian modeling in that I can still make estimates on groups with only few observations? I sort of want the model to be biased towards those with many observations so this doesn’t negatively impact my purpose. I have several projects that have been sidelined by sampling issues on groups with only few observations. If this approach is valid then it would help me get those back on track and use the out of sample predictions for those unseen small-sample groups.
  2. How do I do out of model/sample predictions with new groups in the hierarchical structure while also taking into account the observations I do have for those groups. This last part is key, because I do want to consider the information that I actually have for those new groups.

I have read this post, but I am still not clear. For one, the example with the schools only has one observation per group so the new group’s priors are just new means centered at that value. I guess I could just take the mean of the few points I have for my new groups, but I feel like I am throwing away some useful information.

I tried using data containers for the observed data and index variable for the hierarchical structure, but I am not sure if I then need to modify the coords or if I just build an entirely new model to include the new groups.

Here is a working example of what I want to do. The data frame trim has only the groups with a minimum number of observations. The dataframe remainder has the–you guessed it–…remainder… of the groups.

When modeling the new groups, I am not sure if the index variable needs to start over at 0 or continues the counting as to not overlap with the groups in the original model…

import pymc as pm
import pymc.sampling.jax as pmjax
import pandas as pd
import numpy as np
import arviz as az

RANDOM_SEED = 98
rng = np.random.default_rng(RANDOM_SEED)

# Number of observations per group
n_obs = pm.draw(pm.DiscreteUniform.dist(lower=5, upper=100),100, random_seed=rng)

# Random means
individual_means = pm.draw(pm.Normal.dist(mu=2, sigma=3), len(n_obs))

# Data
data = pd.DataFrame(
                    {"data"  : np.concatenate([pm.draw(pm.Normal.dist(mu=mu, sigma=3), n) for mu, n in zip(individual_means, n_obs)]),
                     "idx" : np.concatenate([[ii]*n for ii, n in enumerate(n_obs)]),
                     "n_obs" : np.concatenate([[n]*n for n in n_obs])}
                    )

# Trim data to remove groups with fewer then N observations
trim = data.query("n_obs > 20").copy(deep=True)                

# Factorize idx after trimming
idx_factorized, idx_list = pd.factorize(trim.idx)
trim = trim.assign(idx_factorized = idx_factorized)

with pm.Model(coords = {"idx": idx_list}) as model:

    # Multilevel mu
    mu_prior_mu    = pm.Normal("mu_prior_mu", 0, 5)
    mu_prior_sigma = pm.HalfNormal("mu_prior_sigma",  3)
    mu  = pm.Normal("mu", mu_prior_mu, mu_prior_sigma, dims="idx")

    # Multilevel sigma
    sigma_prior = pm.HalfNormal("sigma_prior", 3)
    sigma = pm.HalfNormal("sigma", sigma_prior, dims="idx")

    # Observed data
    y = pm.Normal('y', mu=mu[idx_factorized], sigma=sigma[idx_factorized], observed=trim.data.values)

    # Sample
    idata = pmjax.sample_numpyro_nuts(2000,chains=4, idata_kwargs=dict(log_likelihood = True), random_seed=RANDOM_SEED)

# Just crude check to see if the point estimate result matches the original parameters
MAE = np.mean(trim.groupby("idx").data.mean().values - az.summary(idata, var_names='mu')["mean"].values).round(5)
print(f"Mean absolute error: {MAE}")

# Remainder of data not used in the model
remainder = data.drop(trim.index)

# Factorize idx after trimming
remainder_idx_factorized, remainder_idx_list = pd.factorize(remainder.idx)
# Sloppy but just continues counting
remainder_idx_factorized_continue_count = np.max(idx_factorized) + 1 + remainder_idx_factorized
# assigining both factorized indices. Not sure if we continue the groups for the new model or start from 0 again
remainder = remainder.assign(idx_factorized                = remainder_idx_factorized, 
                             idx_factorized_continue_count = remainder_idx_factorized_continue_count)

Thanks!

@JAB Please see below for a “version” of your problem, where I do not observe divergences when sampling - while keeping the ability to “use small groups”.

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm

def generate_data(lower: int = 5, upper: int = 100, num_groups: int = 100):
    rng = np.random.default_rng(98)

    # Number of observations per group
    n_obs = pm.draw(pm.DiscreteUniform.dist(lower=lower, upper=upper), num_groups, random_seed=rng)

    # Random means
    individual_means = pm.draw(pm.Normal.dist(mu=2, sigma=3), len(n_obs))

    # Data
    return pd.DataFrame({
        "data": np.concatenate([pm.draw(pm.Normal.dist(mu=mu, sigma=3), n) for mu, n in zip(individual_means, n_obs)]),
        "idx": np.concatenate([[ii] * n for ii, n in enumerate(n_obs)]),
        # "n_obs": np.concatenate([[n] * n for n in n_obs])
    })


def make_model(frame: pd.DataFrame) -> pm.Model:
    with pm.Model(coords={"idx": frame["idx"].unique()}) as model:
        # Data
        g = pm.Data("g", frame["idx"].to_numpy())
        y = pm.Data("y", frame["data"].to_numpy())

        # Population mean
        mu = pm.Normal("mu", 0., 5.)
        # Across group variability of the mean
        tau = pm.HalfNormal("tau", 3.)
        # Population standard deviation
        sigma = pm.HalfNormal("sigma", 3.)

        # Group specific mean
        mu_g = pm.Normal("mu_g", mu=mu, sigma=tau, dims="idx")
        # Group specific standard deviation
        sigma_g = pm.HalfNormal("sigma_g", sigma=sigma, dims="idx")

        # Likelihood
        pm.Normal("y_obs", mu=mu_g[g], sigma=sigma_g[g], observed=y)
        return model


def make_model_predict(frame: pd.DataFrame) -> pm.Model:
    with pm.Model(coords={"idx": frame["idx"].unique()}) as model:
        # Data
        g = pm.Data("g", frame["idx"].to_numpy())

        # Population mean
        mu = pm.Normal("mu", 0., 5.)
        # Across group variability of the mean
        tau = pm.HalfNormal("tau", 3.)
        # Population standard deviation
        sigma = pm.HalfNormal("sigma", 3.)

        # Group specific mean
        mu_g_new = pm.Normal("mu_g_new", mu=mu, sigma=tau, dims="idx")
        # Group specific standard deviation
        sigma_g_new = pm.HalfNormal("sigma_g_new", sigma=sigma, dims="idx")

        # Likelihood
        pm.Normal("y_obs", mu=mu_g_new[g], sigma=sigma_g_new[g])
        return model



if __name__ == "__main__":
    frame = generate_data()
    model = make_model(frame)
    with model:
        trace = pm.sample()
    summary = az.summary(trace, var_names=["mu", "tau", "sigma"], round_to=4).loc[:, ["mean", "hdi_3%", "hdi_97%", "ess_bulk", "ess_tail", "r_hat"]]
    print(summary)

    frame = generate_data(num_groups=10)
    model = make_model_predict(frame[["idx"]])
    with model:
        ppc = pm.sample_posterior_predictive(trace, var_names=["y_obs"])
        y_pred = pd.Series(
            data=ppc.posterior_predictive["y_obs"].mean(("chain", "draw")).to_numpy(),
            index=frame["idx"]
        )
    print(y_pred)


I also used a solution for out of sample predictions with a hierarchical model that is essentially suggested here.

The key point is that when you predict on previously unseen groups you can utilise the posterior of the population parameters (mu, tau, sigma), but as there is no information on the specific latent parameters of each group (mu_g and sigma_g) they will simply be sampled from their priors, i.e. pm.Normal("mu_g_new", mu=mu, sigma=tau) and pm.HalfNormal("sigma_g_new", sigma=sigma). The nice thing about those priors is: they are now informed by the posterior of the population parameters.

Also note: We’re using the trace of the “inference model” to “feed” a posterior sample of the population paramerers. There are also samples of group specific parameters (mu_g, sigma_g) in that trace object. So we need to make sure that our “prediction model” does not contain any variables with the same name. If that were the case then sample_posterior_predictive would try to use these samples for the unseen groups, which would be conceptually wrong - we don’t have any observations to inform mu_g and sigma_g for new groups. Technically it would probably result in some shape related error unless the number of groups (and group labels) would stay identical.

Finally ppc.posterior_predictive["y_obs"].mean(("chain", "draw")) averages the posterior predictive draws of “y_obs” for each sample in the input - at least that is how I understand the syntax, but I am not 100% certain, maybe @ricardoV94 or @jessegrabowski could chime in here to clarify.

Disclaimer: I am still learning about hierarchical modelling and how to best use pymc to work with hierarchical models, so any comments and corrections from more seasoned “veterans” are very welcome.

2 Likes

Hi @Benjamin thanks for the detailed answer! BTW, regarding this,

Please see below for a “version” of your problem, where I do not observe divergences when sampling

I am not having trouble with divergences for this simulated data set, only for my actual dataset which has ~ 700 groups. In addition, a large portion of the groups has very few observations.
groupsize_histogram

The solution you provided is makes total sense. I had read the blog post from pymc labs, but it had not occurred to me that it could be as simple as a new model that used sample_posterior_predictive to sample the original trace with the same priors and a new observed variable. I guess it is the same as the way I normally make predictions, except that we introduce new groups. In this case the entire posterior is being reused as opposed to just point estimates from posterior as priors in a second model, correct? I guess that means that for groups with few observations they would just get regressed to the group mean with some potentially minor offset depending on the number of observations, right?

Just one technical clarification… Totally makes sense that mu_g and mu_g_new can’t be the same name or there would like be some shape errors, but does it matter that mu, tau, and sigma are the same name? Would it be better to use pm.flat to make certain that those variables are not resampled but rather grabbed directly from the posterior? I guess either is valid according to this, but just wanted to double check.

pm.Flat is currently a good way to ensure the draws are being used. There was an idea to add a specific FromTrace object for this behavior, as Flat is a bit hacky.

2 Likes

Issue here, PR welcome. It’s a good first issue :slight_smile:

2 Likes

@JAB great idea with using pm.flat in the prediction model so that only the posterior draws are ever used for the population parameters mu, tau, and sigma (even better if there is a dedicated FromTrace). This way the predictive model gets a much clearer syntax.

Thanks @ricardoV94 and @jessegrabowski for clarifying this.

Regarding the divergences, I slightly modified the data generating code and the model and I am now able to sample for “more realistic” distribution of number of observations (per each group) without observing divergences/low effective samples sizes, cf. code below

def generate_data(num_obs: int = 10, num_groups: int = 10):
    rng = np.random.default_rng(98)

    # Number of observations per group
    n_obs = pm.draw(pm.Poisson.dist(mu=num_obs - 1), num_groups, random_seed=rng) + 1

    # Random means
    individual_means = pm.draw(pm.Normal.dist(mu=2, sigma=3), num_groups)
    # Data
    return pd.DataFrame({
        "data": np.concatenate([pm.draw(pm.Normal.dist(mu=mu, sigma=3), n).reshape(-1) for mu, n in zip(individual_means, n_obs)]),
        "idx": np.concatenate([[ii] * n for ii, n in enumerate(n_obs)]),        
    })


def make_model(frame: pd.DataFrame) -> pm.Model:
    with pm.Model(coords={"idx": frame["idx"].unique()}) as model:
        # Data
        g = pm.Data("g", frame["idx"].to_numpy())
        y = pm.Data("y", frame["data"].to_numpy())

        # Population mean
        mu = pm.Normal("mu", 0., 5.)
        # Across group variability of the mean
        tau = pm.HalfNormal("tau", 3.)
        # Population standard deviation
        sigma = pm.HalfNormal("sigma", 3.)

        # Group specific mean
        eps_mu_g = pm.ZeroSumNormal("eps_mu_g", dims="idx")
        mu_g = mu + tau * eps_mu_g

        # Likelihood
        pm.Normal("y_obs", mu=mu_g[g], sigma=sigma, observed=y)
        return model


if __name__ == "__main__":
    frame = generate_data(num_obs=2, num_groups=1000)
    model = make_model(frame)
    with model:
        trace = pm.sample()
    summary = az.summary(trace, var_names=["mu", "tau", "sigma"], round_to=4).loc[:, ["mean", "hdi_3%", "hdi_97%", "ess_bulk", "ess_tail", "r_hat"]]
    print(summary)

There are some important changes though:
i) I am using a Poission distribution to mimic the fact that you have a large number of groups with a small number of samples, I can go down to num_observations=2 without observing divergences. num_observations is the expected number of observations per group in the generated data set. In the way n_obs is set up, I am ensuring that all groups have at least one sample and that num_observations is indeed the expectation.

I am not entirely sure why I cannot go lower than num_observations=2 but I assume that there is just a minimum number of samples needed to “identify” tau. But this would need a lot more thought, I am just speculating here.

ii) The model uses a “non-centered parametrisation”, i.e. I am using standard normal increments eps_mu_g and I am computing the group-specific slope as mu_g=mu + tau * eps_mu_g. This is a recommended approach whenever your group-specific likelihoods are only weakly informed (if I recall correctly). You can find a very informed :wink: exposition on those points here. In the problem above there are only very few data points for each group, so this indeed qualifies as weakly informed likelihood I think.

iii) I am dropping the group specific standard deviations sigma_g from the model and I am just using the population specific standard deviation sigma in the likelihood. Here I do not have a very informed argument. Just my gut feeling that with such a low number of data points its getting impossible to “identify” group specific standard deviations. Whether this is an admissible approach for your situation or not, I of course cannot tell. For the data generating process the across group variability of sigma_g is in fact zero, so not sure whether that caused some of the divergences. Also it can be important to use “zero suppressing” priors for population parameters tau and sigma so this is another point to try out (after re-parametrising the model). Also the log of the group specific standard deviations could be modelled with a hierarchical normal model - and then use exp transform to get from log-space back to standard deviations. This is also discussed here.

The predictive model would then have to be adapted accordingly, but I assume (hope) that this is rather straightforward.

2 Likes

Thanks so much @ricardoV94 and @jessegrabowski. I’ll take a look and see if I can contribute!

@Benjamin Thanks for sending the updated code. I did end up implementing a non-centered version and it did help substantially. Still working out a few minor things to see if that improves. If I still run into issues I might follow up with a working example with my real dataset.

Thanks!

1 Like