Out of sample/model predictions in hierarchical model with new observed data

I have a simple partially pooled model that fits a normal distribution to observed data. In my actual data, I have hundreds of groups with anywhere between one and hundreds of observations per group. I tried fitting the entire dataset and ran into major divergences and sampling issues. Instead, I fit only the groups with at least some minimum number of observations and tried again. This time sampling was very fast with very good results. I am wondering two things:

  1. Is this valid, or does it just bypass one of the strong points of Bayesian modeling, namely that I can still make estimates for groups with only a few observations? I sort of want the model to be weighted towards the groups with many observations, so this doesn't negatively impact my purpose. I have several projects that have been sidelined by sampling issues on groups with few observations. If this approach is valid, it would help me get those back on track and use the out-of-sample predictions for those unseen small-sample groups.
  2. How do I make out-of-model/sample predictions for new groups in the hierarchical structure while also taking into account the observations I do have for those groups? This last part is key, because I do want to use the information I actually have for those new groups.

I have read this post, but I am still not clear. For one, the schools example has only one observation per group, so each new group's prior is just a new mean centered at that value. I guess I could take the mean of the few points I have for each new group, but I feel like I would be throwing away useful information.

I tried using data containers for the observed data and an index variable for the hierarchical structure, but I am not sure whether I then need to modify the coords or just build an entirely new model to include the new groups.
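For reference, this is the data-container pattern I mean; a toy sketch (made-up model and numbers) showing why it only covers groups that already exist in the model:

import numpy as np
import pymc as pm

# Toy sketch of the data-container pattern: pm.set_data only
# works for group indices the model already knows about.
with pm.Model(coords={"group": [0, 1, 2]}) as m:
    group_idx = pm.Data("group_idx", np.array([0, 0, 1, 2]))
    mu_g = pm.Normal("mu_g", 0, 5, dims="group")
    # shape follows the data container so it can be resized later
    pm.Normal("y", mu=mu_g[group_idx], sigma=1,
              observed=np.array([0.1, -0.2, 1.3, 2.1]),
              shape=group_idx.shape)
    idata = pm.sample()

with m:
    # New observations for existing groups work fine; an unseen
    # group index (e.g. 3) has no entry in mu_g, hence my question.
    pm.set_data({"group_idx": np.array([1, 2, 2])})
    ppc = pm.sample_posterior_predictive(idata)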

Here is a working example of what I want to do. The dataframe trim has only the groups with a minimum number of observations. The dataframe remainder has (you guessed it) the remainder of the groups.

When modeling the new groups, I am not sure whether the index variable needs to start over at 0 or continue counting so as not to overlap with the groups in the original model…

import pymc as pm
import pymc.sampling.jax as pmjax
import pandas as pd
import numpy as np
import arviz as az

RANDOM_SEED = 98
rng = np.random.default_rng(RANDOM_SEED)

# Number of observations per group
n_obs = pm.draw(pm.DiscreteUniform.dist(lower=5, upper=100), 100, random_seed=rng)

# Random means
individual_means = pm.draw(pm.Normal.dist(mu=2, sigma=3), len(n_obs))

# Data
data = pd.DataFrame({
    "data": np.concatenate([pm.draw(pm.Normal.dist(mu=mu, sigma=3), n) for mu, n in zip(individual_means, n_obs)]),
    "idx": np.concatenate([[ii] * n for ii, n in enumerate(n_obs)]),
    "n_obs": np.concatenate([[n] * n for n in n_obs]),
})

# Trim data to remove groups with fewer than N observations
trim = data.query("n_obs > 20").copy(deep=True)

# Factorize idx after trimming
idx_factorized, idx_list = pd.factorize(trim.idx)
trim = trim.assign(idx_factorized = idx_factorized)

with pm.Model(coords = {"idx": idx_list}) as model:

    # Multilevel mu
    mu_prior_mu    = pm.Normal("mu_prior_mu", 0, 5)
    mu_prior_sigma = pm.HalfNormal("mu_prior_sigma",  3)
    mu  = pm.Normal("mu", mu_prior_mu, mu_prior_sigma, dims="idx")

    # Multilevel sigma
    sigma_prior = pm.HalfNormal("sigma_prior", 3)
    sigma = pm.HalfNormal("sigma", sigma_prior, dims="idx")

    # Observed data
    y = pm.Normal('y', mu=mu[idx_factorized], sigma=sigma[idx_factorized], observed=trim.data.values)

    # Sample
    idata = pmjax.sample_numpyro_nuts(2000, chains=4, idata_kwargs=dict(log_likelihood=True), random_seed=RANDOM_SEED)

# Crude check that the point estimates match the original parameters
MAE = np.mean(np.abs(trim.groupby("idx").data.mean().values - az.summary(idata, var_names="mu")["mean"].values)).round(5)
print(f"Mean absolute error: {MAE}")

# Remainder of data not used in the model
remainder = data.drop(trim.index)

# Factorize idx for the remainder
remainder_idx_factorized, remainder_idx_list = pd.factorize(remainder.idx)
# Sloppy, but this just continues the counting
remainder_idx_factorized_continue_count = np.max(idx_factorized) + 1 + remainder_idx_factorized
# Assigning both factorized indices. Not sure if we continue the groups for the new model or start from 0 again
remainder = remainder.assign(idx_factorized                = remainder_idx_factorized, 
                             idx_factorized_continue_count = remainder_idx_factorized_continue_count)

Thanks!

@JAB Please see below for a “version” of your problem, where I do not observe divergences when sampling, while keeping the ability to “use small groups”.

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm

def generate_data(lower: int = 5, upper: int = 100, num_groups: int = 100):
    rng = np.random.default_rng(98)

    # Number of observations per group
    n_obs = pm.draw(pm.DiscreteUniform.dist(lower=lower, upper=upper), num_groups, random_seed=rng)

    # Random means
    individual_means = pm.draw(pm.Normal.dist(mu=2, sigma=3), len(n_obs))

    # Data
    return pd.DataFrame({
        "data": np.concatenate([pm.draw(pm.Normal.dist(mu=mu, sigma=3), n) for mu, n in zip(individual_means, n_obs)]),
        "idx": np.concatenate([[ii] * n for ii, n in enumerate(n_obs)]),
        # "n_obs": np.concatenate([[n] * n for n in n_obs])
    })


def make_model(frame: pd.DataFrame) -> pm.Model:
    with pm.Model(coords={"idx": frame["idx"].unique()}) as model:
        # Data
        g = pm.Data("g", frame["idx"].to_numpy())
        y = pm.Data("y", frame["data"].to_numpy())

        # Population mean
        mu = pm.Normal("mu", 0., 5.)
        # Across group variability of the mean
        tau = pm.HalfNormal("tau", 3.)
        # Population standard deviation
        sigma = pm.HalfNormal("sigma", 3.)

        # Group specific mean
        mu_g = pm.Normal("mu_g", mu=mu, sigma=tau, dims="idx")
        # Group specific standard deviation
        sigma_g = pm.HalfNormal("sigma_g", sigma=sigma, dims="idx")

        # Likelihood
        pm.Normal("y_obs", mu=mu_g[g], sigma=sigma_g[g], observed=y)
        return model


def make_model_predict(frame: pd.DataFrame) -> pm.Model:
    with pm.Model(coords={"idx": frame["idx"].unique()}) as model:
        # Data
        g = pm.Data("g", frame["idx"].to_numpy())

        # Population mean
        mu = pm.Normal("mu", 0., 5.)
        # Across group variability of the mean
        tau = pm.HalfNormal("tau", 3.)
        # Population standard deviation
        sigma = pm.HalfNormal("sigma", 3.)

        # Group specific mean
        mu_g_new = pm.Normal("mu_g_new", mu=mu, sigma=tau, dims="idx")
        # Group specific standard deviation
        sigma_g_new = pm.HalfNormal("sigma_g_new", sigma=sigma, dims="idx")

        # Likelihood
        pm.Normal("y_obs", mu=mu_g_new[g], sigma=sigma_g_new[g])
        return model



if __name__ == "__main__":
    frame = generate_data()
    model = make_model(frame)
    with model:
        trace = pm.sample()
    summary = az.summary(trace, var_names=["mu", "tau", "sigma"], round_to=4).loc[:, ["mean", "hdi_3%", "hdi_97%", "ess_bulk", "ess_tail", "r_hat"]]
    print(summary)

    frame = generate_data(num_groups=10)
    model = make_model_predict(frame[["idx"]])
    with model:
        ppc = pm.sample_posterior_predictive(trace, var_names=["y_obs"])
        y_pred = pd.Series(
            data=ppc.posterior_predictive["y_obs"].mean(("chain", "draw")).to_numpy(),
            index=frame["idx"]
        )
    print(y_pred)


I also used a solution for out-of-sample predictions with a hierarchical model that is essentially the one suggested here.

The key point is that when you predict on previously unseen groups you can utilise the posterior of the population parameters (mu, tau, sigma), but as there is no information about the specific latent parameters of each group (mu_g and sigma_g), they will simply be sampled from their priors, i.e. pm.Normal("mu_g_new", mu=mu, sigma=tau) and pm.HalfNormal("sigma_g_new", sigma=sigma). The nice thing about those priors is that they are now informed by the posterior of the population parameters.

Also note: we’re using the trace of the “inference model” to “feed” in posterior samples of the population parameters. There are also samples of the group-specific parameters (mu_g, sigma_g) in that trace object, so we need to make sure that our “prediction model” does not contain any variables with the same names. If it did, sample_posterior_predictive would try to use those samples for the unseen groups, which would be conceptually wrong: we don’t have any observations to inform mu_g and sigma_g for new groups. Technically it would probably result in a shape-related error unless the number of groups (and group labels) stayed identical.

Finally, ppc.posterior_predictive["y_obs"].mean(("chain", "draw")) averages the posterior predictive draws of “y_obs” for each observation in the input. At least that is how I understand the syntax, but I am not 100% certain; maybe @ricardoV94 or @jessegrabowski could chime in here to clarify.
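To illustrate my reading of that reduction, a toy example (names made up, independent of the model above):

import numpy as np
import xarray as xr

# "y_obs" in the posterior predictive has dims (chain, draw, obs);
# averaging over ("chain", "draw") leaves one mean per observation.
y_obs = xr.DataArray(
    np.random.default_rng(0).normal(size=(4, 1000, 10)),
    dims=("chain", "draw", "obs"),
)
print(y_obs.mean(("chain", "draw")).dims)  # ('obs',)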

Disclaimer: I am still learning about hierarchical modelling and how to best use pymc to work with hierarchical models, so any comments and corrections from more seasoned “veterans” are very welcome.


Hi @Benjamin, thanks for the detailed answer! BTW, regarding this:

Please see below for a “version” of your problem, where I do not observe divergences when sampling

I am not having trouble with divergences for this simulated dataset, only for my actual dataset, which has ~700 groups. In addition, a large portion of the groups has very few observations.
[groupsize_histogram: histogram of the number of observations per group]

The solution you provided makes total sense. I had read the blog post from PyMC Labs, but it had not occurred to me that it could be as simple as a new model that uses sample_posterior_predictive to sample the original trace with the same priors and a new observed variable. I guess it is the same as the way I normally make predictions, except that we introduce new groups. In this case the entire posterior is being reused, as opposed to just point estimates from the posterior as priors in a second model, correct? I guess that means that groups with few observations would just be regressed to the pooled mean, with some potentially minor offset depending on the number of observations, right?

Just one technical clarification… It totally makes sense that mu_g and mu_g_new can’t have the same name or there would likely be some shape errors, but does it matter that mu, tau, and sigma keep the same names? Would it be better to use pm.Flat to make certain that those variables are not resampled but rather grabbed directly from the posterior? I guess either is valid according to this, but I just wanted to double check.

pm.Flat is currently a good way to ensure the draws are being used. There was an idea to add a specific FromTrace object for this behavior, as Flat is a bit hacky.
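For concreteness, a sketch of how make_model_predict from above could look with pm.Flat placeholders (an illustration adapting the earlier code, not an official recipe):

import pandas as pd
import pymc as pm

def make_model_predict_flat(frame: pd.DataFrame) -> pm.Model:
    with pm.Model(coords={"idx": frame["idx"].unique()}) as model:
        g = pm.Data("g", frame["idx"].to_numpy())

        # Flat placeholders: sample_posterior_predictive fills these
        # from the trace instead of resampling them
        mu = pm.Flat("mu")
        tau = pm.Flat("tau")
        sigma = pm.Flat("sigma")

        # Fresh names for the new-group latents, so they are drawn
        # from their (posterior-informed) priors
        mu_g_new = pm.Normal("mu_g_new", mu=mu, sigma=tau, dims="idx")
        sigma_g_new = pm.HalfNormal("sigma_g_new", sigma=sigma, dims="idx")

        pm.Normal("y_obs", mu=mu_g_new[g], sigma=sigma_g_new[g])
        return model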


Issue here, PR welcome. It’s a good first issue :slight_smile:


@JAB great idea with using pm.Flat in the prediction model so that only the posterior draws are ever used for the population parameters mu, tau, and sigma (even better if there is a dedicated FromTrace). This way the predictive model gets much clearer syntax.

Thanks @ricardoV94 and @jessegrabowski for clarifying this.

Regarding the divergences: I slightly modified the data-generating code and the model, and I am now able to sample a “more realistic” distribution of the number of observations per group without observing divergences or low effective sample sizes, cf. the code below.

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm


def generate_data(num_obs: int = 10, num_groups: int = 10):
    rng = np.random.default_rng(98)

    # Number of observations per group
    n_obs = pm.draw(pm.Poisson.dist(mu=num_obs - 1), num_groups, random_seed=rng) + 1

    # Random means
    individual_means = pm.draw(pm.Normal.dist(mu=2, sigma=3), num_groups)
    # Data
    return pd.DataFrame({
        "data": np.concatenate([pm.draw(pm.Normal.dist(mu=mu, sigma=3), n).reshape(-1) for mu, n in zip(individual_means, n_obs)]),
        "idx": np.concatenate([[ii] * n for ii, n in enumerate(n_obs)]),        
    })


def make_model(frame: pd.DataFrame) -> pm.Model:
    with pm.Model(coords={"idx": frame["idx"].unique()}) as model:
        # Data
        g = pm.Data("g", frame["idx"].to_numpy())
        y = pm.Data("y", frame["data"].to_numpy())

        # Population mean
        mu = pm.Normal("mu", 0., 5.)
        # Across group variability of the mean
        tau = pm.HalfNormal("tau", 3.)
        # Population standard deviation
        sigma = pm.HalfNormal("sigma", 3.)

        # Group specific mean
        eps_mu_g = pm.ZeroSumNormal("eps_mu_g", dims="idx")
        mu_g = mu + tau * eps_mu_g

        # Likelihood
        pm.Normal("y_obs", mu=mu_g[g], sigma=sigma, observed=y)
        return model


if __name__ == "__main__":
    frame = generate_data(num_obs=2, num_groups=1000)
    model = make_model(frame)
    with model:
        trace = pm.sample()
    summary = az.summary(trace, var_names=["mu", "tau", "sigma"], round_to=4).loc[:, ["mean", "hdi_3%", "hdi_97%", "ess_bulk", "ess_tail", "r_hat"]]
    print(summary)

There are some important changes though:
i) I am using a Poisson distribution to mimic the fact that you have a large number of groups with a small number of samples; I can go down to num_obs=2 without observing divergences. num_obs is the expected number of observations per group in the generated dataset. The way n_obs is set up ensures that all groups have at least one sample and that num_obs is indeed the expectation.

I am not entirely sure why I cannot go lower than num_obs=2, but I assume there is just a minimum number of samples needed to “identify” tau. This would need a lot more thought; I am just speculating here.
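A quick numerical check of the expectation claim in i) (a throwaway snippet, not part of the model):

import pymc as pm

# n_obs = Poisson(num_obs - 1) + 1: the minimum is one observation,
# and E[n_obs] = (num_obs - 1) + 1 = num_obs
num_obs = 2
n_obs = pm.draw(pm.Poisson.dist(mu=num_obs - 1), 100_000) + 1
print(n_obs.min(), n_obs.mean())  # 1 and approximately 2.0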

ii) The model uses a “non-centered parametrisation”, i.e. I am using standard normal increments eps_mu_g and computing the group-specific mean as mu_g = mu + tau * eps_mu_g. This is a recommended approach whenever your group-specific likelihoods are only weakly informed (if I recall correctly). You can find a very informed :wink: exposition on those points here. In the problem above there are only very few data points for each group, so I think this indeed qualifies as a weakly informed likelihood.

iii) I am dropping the group-specific standard deviations sigma_g from the model and just using the population-level standard deviation sigma in the likelihood. Here I do not have a very informed argument, just my gut feeling that with such a low number of data points it becomes impossible to “identify” group-specific standard deviations. Whether this is an admissible approach for your situation, I of course cannot tell. For the data-generating process the across-group variability of sigma_g is in fact zero, so I am not sure whether that caused some of the divergences. It can also be important to use “zero-avoiding” priors for the population parameters tau and sigma, so this is another point to try out (after re-parametrising the model). Finally, the log of the group-specific standard deviations could be modelled with a hierarchical normal model, using an exp transform to get from log-space back to standard deviations; see the sketch below. This is also discussed here.
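To make that last idea concrete, here is a sketch of modelling the log of the group-specific standard deviations hierarchically, again non-centered (log_sigma and tau_sigma are names I made up for this sketch):

import pandas as pd
import pymc as pm

def make_model_log_sigma(frame: pd.DataFrame) -> pm.Model:
    with pm.Model(coords={"idx": frame["idx"].unique()}) as model:
        g = pm.Data("g", frame["idx"].to_numpy())
        y = pm.Data("y", frame["data"].to_numpy())

        # Population mean and across-group variability of the mean
        mu = pm.Normal("mu", 0., 5.)
        tau = pm.HalfNormal("tau", 3.)

        # Population location and scale of log(sigma_g)
        log_sigma = pm.Normal("log_sigma", 0., 1.)
        tau_sigma = pm.HalfNormal("tau_sigma", 1.)

        # Non-centered group-specific means
        eps_mu_g = pm.ZeroSumNormal("eps_mu_g", dims="idx")
        mu_g = mu + tau * eps_mu_g

        # Non-centered log standard deviations, mapped back via exp
        eps_log_sigma_g = pm.Normal("eps_log_sigma_g", dims="idx")
        sigma_g = pm.math.exp(log_sigma + tau_sigma * eps_log_sigma_g)

        pm.Normal("y_obs", mu=mu_g[g], sigma=sigma_g[g], observed=y)
        return model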

The predictive model would then have to be adapted accordingly, but I assume (hope) that this is rather straightforward.


Thanks so much @ricardoV94 and @jessegrabowski. I’ll take a look and see if I can contribute!

@Benjamin Thanks for sending the updated code. I did end up implementing a non-centered version and it helped substantially. I am still working out a few minor things to see if that improves things further. If I still run into issues, I might follow up with a working example using my real dataset.

Thanks!


Hi all,

I know it's been a few weeks, but I wanted to follow up on this thread. I was able to spend some time on a few variations of the problem. I reused some of the code from @Benjamin to write an example with a few variations of out-of-model prediction.

Just to recap: I have done out-of-sample prediction in a hierarchical model before by using pm.Data in the model and then pm.set_data to swap the data and make predictions. However, (I believe) this only works when the group index already exists within the hierarchical structure (i.e., if I originally have 10 groups, I can't index a 14th group that never appeared in the original data). Here are some things I tried and some observations; I would love to get some thoughts on whether this is the correct interpretation of what I observed. The basic model looks like this:

with pm.Model(coords={"idx": frame["idx"].unique()}) as model:
    # Data
    g = pm.Data("g", frame["idx"].to_numpy())
    y = pm.Data("y", frame["data"].to_numpy())

    # Population mean
    mu = pm.Normal("mu", 0., 5.)
    # Across group variability of the mean
    tau = pm.HalfNormal("tau", 3.)
    # Population standard deviation
    sigma = pm.HalfNormal("sigma", 3.)

    # Group specific mean
    eps_mu_g = pm.Normal("eps_mu_g", dims="idx")
    mu_g = pm.Deterministic("mu_g", mu + tau * eps_mu_g, dims="idx")

    # Observed data
    pm.Normal("y_obs", mu=mu_g[g], sigma=sigma, observed=y)
  1. mu_g and eps_mu_g have the shape of the number of groups. So in my first attempt, I only used pm.Flat for the params that are not group specific: mu, tau, and sigma. When sampling the posterior predictive, I found that the distribution was the same for all groups, since no group-specific information is being used. It confused me at first but makes perfect sense now. I thought that maybe some information would leak in from the backend, but I realized I never told PyMC where to look: my non-centered param eps_mu_g_new did not have the same name, so the variable did not conflict with anything in the original inference data.

  2. In my second attempt I still used pm.Flat for mu, sigma, and tau, but this time I explicitly defined eps_mu_g_new with mu/sigma taken directly from the inference data summary. New groups were assigned mu=0 and sigma=3. Now the model was able to use that previously learned information, but after the fact I realized that this still does not mean the model sees the group-specific params in the original inference data.

  3. In this attempt, I used an entirely new model and did not use pm.Flat. All of the variables were assigned values of mu and sigma taken from the inference data summary, and new groups got the same mu=0 and sigma=3 as above. In this case I was able to fully sample the new model, since pm.Flat was not used to define any of the variables.

  4. In the last attempt I followed the example from PyMC Labs, where I use the old variable names and pm.Flat to tell the model to look at the inference data for existing groups, and then define a new variable for new groups. Those are then concatenated and indexed by group number as usual (a simplified sketch follows below).
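Here is roughly what attempt (4) looked like; a simplified sketch where g_idx, n_old, and n_new are placeholder names and new groups continue the original numbering:

import numpy as np
import pymc as pm

def make_model_predict_mixed(g_idx: np.ndarray, n_old: int, n_new: int) -> pm.Model:
    with pm.Model() as model:
        # Population params: pm.Flat so draws come straight from the trace
        mu = pm.Flat("mu")
        tau = pm.Flat("tau")
        sigma = pm.Flat("sigma")

        # Existing groups keep the trace variable name, so their
        # posterior draws are reused by sample_posterior_predictive
        eps_mu_g = pm.Flat("eps_mu_g", shape=n_old)
        # New groups get a fresh name and are drawn from the prior
        eps_mu_g_new = pm.Normal("eps_mu_g_new", 0., 1., shape=n_new)

        # Concatenate and index by group number as usual
        eps_all = pm.math.concatenate([eps_mu_g, eps_mu_g_new])
        mu_g = mu + tau * eps_all
        pm.Normal("y_obs", mu=mu_g[g_idx], sigma=sigma)
        return model

Sampling is then just pm.sample_posterior_predictive(trace, var_names=["y_obs"]) inside this model's context.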

Obviously (1) does not yield meaningful predictions on new groups, since no group-specific information carries over. Interestingly (or perhaps not to others), (2)-(4) gave me nearly identical predictions within each group.

One thing that I am concerned about is how models (2) and (4) can see any of the new data other than the group each observation belongs to. What I mean is that since we only sample the posterior predictive, the only value being used is the one that indexes the mu_g param; the raw data for the new groups is completely unobserved. In example (3), we sample a completely new model, and all data are passed in as observed on the observed variable y_obs.

I can see how (2) and (4) are advantageous when I only have some predicted group means for unseen groups that I can pass in as priors, such as in the example from PyMC Labs. However, when I have actual observed data that arrives at a later time, is the only way for the model to see the values of those observed data to train a new model and use the summary of the previous inference as priors in the new model? It feels like I am throwing away important information by just computing the mean/std of the results and then using those as priors.
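To be explicit, this is the kind of "summary as priors" refit I mean (a sketch; new_frame stands for the late-arriving data, and collapsing the trace to az.summary point estimates is exactly the information loss I am worried about):

import arviz as az
import pymc as pm

post = az.summary(trace)  # point summaries of the original posterior
with pm.Model(coords={"idx": new_frame["idx"].unique()}) as refit:
    g = pm.Data("g", new_frame["idx"].to_numpy())
    y = pm.Data("y", new_frame["data"].to_numpy())

    # Priors centred on the old posterior's point estimates
    mu = pm.Normal("mu", post.loc["mu", "mean"], post.loc["mu", "sd"])
    tau = pm.HalfNormal("tau", post.loc["tau", "mean"])
    sigma = pm.HalfNormal("sigma", post.loc["sigma", "mean"])

    eps_mu_g = pm.Normal("eps_mu_g", 0., 1., dims="idx")
    pm.Normal("y_obs", mu=mu + tau * eps_mu_g[g], sigma=sigma, observed=y)
    trace_refit = pm.sample()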

I think in the case where I have few observations on new groups, the raw data will not matter much anyway, since individual group means with insufficient data will be regressed to the pooled mean. However, I am more concerned about cases where I have new data sufficiently different from the group mean that the group mean would be pulled away from the pooled mean if it were included in the original sampling. I still want this new group to abide by the constraints imposed by the hierarchical model.

Thanks!

It sounds like you want to do inference about new groups, not just predict potential new groups from the structure of your model. So you should probably rerun MCMC from scratch with the full data.


Sounds good. I guess it makes sense that if we have new data on new groups, we should just use it directly. Thanks!