Posterior predictive sampling for new groups in hierarchical model using ModelBuilder

Hi all,

I am trying to use ModelBuilder from PyMC extras to wrap a model and use it in a production context. It is a fairly simple linear hierarchical model with a one dimensional outcome and two predictors. Observations are grouped by “area”. All coefficients are global except for coefficients on an interaction term between the two predictors, which vary by area according to a normal distribution.

I have “fit” the model, ie sampled from the posterior given a set of observations (covering a few hundred areas). In production I want to be able to run predict_posterior (a method of the model builder class). This method essentially sets the data and runs sample_posterior_predictive using the existing inference data from the “fitting” step. However, I am not sure how best to setup the ModelBuilder subclass I am writing to seamlessly handle this case. I would like to be able to pass a mix of previously seen and previously unseen areas when I run predict, but cannot say in advance which areas will be fed in for predictions. I want to set the code up in such a way that an appropriate predictive model is set up under the hood, for whatever mix of seen and unseen areas are fed in during prediction. The model should sample from the appropriate distribution in any case (so for an unseen area drawing a coefficient for the interaction term for that area from the posterior distribution for those coefficients, but for previously seen areas drawing from the posterior distribution over that specific area’s coefficient, if that make sense).

I have read this blog post and I think I understand it: Out of model predictions with PyMC

But I am scratching my head a bit when it comes to this slightly more complex situation I am facing and how it might be done neatly with ModelBuilder.

I thought this might be a common enough use case that someone else would have implemented something similar in the past and might be able to give me some pointers. Any help much appreciated!

I can share code if that is helpful / necessary (edit: here is the build_model method for my subclass of ModelBuilder, for reference):

def build_model(self, X: pd.DataFrame, y: pd.Series, **kwargs):

        """
        build_model creates the PyMC model for making predictions on new     areas
        Parameters:
        model_config: dictionary, 
                      it is a dictionary with all the parameters that we need in our model example:  a_loc, a_scale, b_loc

        X : pd.DataFrame
            The input data that is going to be used in the model. This should be a DataFrame containing the features (predictors) for the model. For efficiency reasons, it should only contain the necessary data columns, not the entire available dataset, as this will be encoded into the data used to recreate the model.

        y : pd.Series
            The target data for the model. This should be a Series representing the output or dependent variable for the model.

        kwargs : dict
            Additional keyword arguments that may be used for model configuration.
        """

        area_idx_int, self.area_map = pd.factorize(X['area_id'], sort=True)
        self.n_areas = len(self.area_map)

        coords = { "area": self.area_map}

        # Check the type of X and y and adjust access accordingly
        self._generate_and_preprocess_model_data(X, y)

        y = y.values if isinstance(y, pd.Series) else y

        with pm.Model(coords=coords) as self.model:
           # Create data containers
            area_idx = pm.Data("area_idx", area_idx_int)

            x_data = pm.Data(
                    "x_data", 
                    X['total_counts'].values,
                )
            m_data = pm.Data("m_data", X['area_size'].values)
            y_obs = pm.Data("y_obs", y)

            β0_mu_prior         = self.model_config.get("β0_mu_prior", 0.0)
            β0_sigma_prior      = self.model_config.get("β0_sigma_prior", 100.0)
            β1_mu_prior         = self.model_config.get("β1_mu_prior", 0.0)
            β1_sigma_prior      = self.model_config.get("β1_sigma_prior", 10.0)
            β2_mean_mu_prior    = self.model_config.get("β2_mean_mu_prior", 0.0)
            β2_mean_sigma_prior = self.model_config.get("β2_mean_sigma_prior", 10.0)
            β2_std_lambda_prior = self.model_config.get("β2_std_lambda_prior", 1.0)

            σ_prior = self.model_config.get("σ_prior", 1000.0)

            # global coefficient priors
            β0 = pm.Normal("β0", mu=β0_mu_prior, sigma=β0_sigma_prior)
            β1 = pm.Normal("β1", mu=β1_mu_prior, sigma=β1_sigma_prior)

            # area-specific area-count interaction term priors
            mu_b2 = pm.Normal("mu_b", mu=β2_mean_mu_prior, sigma=β2_mean_sigma_prior)
            sigma_b2 = pm.Exponential("sigma_b", β2_std_lambda_prior)
            β2 = pm.Normal("β2", mu=mu_b2, sigma=sigma_b2, dims="area")

            # model error
            σ = pm.HalfCauchy("σ", σ_prior)

            # linear mean 
            mu = β0 + β1*x_data + β2[area_idx]*x_data*m_data

            # likelihood
            obs = pm.Normal( "y", mu=mu, sigma=σ, observed=y_obs)