Posterior Predictive not similar to Posterior

I’m not sure if the bug is in my model, my understanding, or pm.sample_posterior_predictive, but the posterior predictive does not appear to be at all similar to the posterior:

My model:

import pymc as pm
from IPython.display import display


def build_model(velocity_scaled,
                load_scaled,
                session_exercise_id,
                session_id,
                coords,
                render_model = True):
    
    with pm.Model() as model:

        # Add coordinates
        model.add_coord('observation', coords['observation'], mutable = True)
        model.add_coord('exercise', coords['exercise'], mutable = True)
        model.add_coord('session', coords['session'], mutable = True)

        # Add inputs
        velocity_shared = pm.MutableData('velocity_scaled', velocity_scaled, dims = 'observation')
        session_exercise_id = pm.MutableData('session_exercise_id', session_exercise_id, dims = 'session')
        session_id = pm.MutableData('session_id', session_id, dims = 'observation')

        # Global Parameters
        intercept_global = pm.Normal(name = 'intercept_global',
                                     mu = 0.0,
                                     sigma = 1.0)
        
        intercept_sigma_global = pm.HalfNormal(name = 'intercept_sigma_global',
                                               sigma = 1.0)
        
        slope_global = pm.HalfNormal(name = 'slope_global',
                                     sigma = 1.0)
        
        curve_global = pm.HalfNormal(name = 'curve_global',
                                     sigma = 3.0)
        
        error_global = pm.HalfNormal(name = 'error_global',
                                     sigma = 3.0)
        
        # Exercise Parameters
        intercept_offset_exercise = pm.Normal(name = 'intercept_offset_exercise',
                                              mu = 0.0,
                                              sigma = 1.0,
                                              dims = 'exercise')
        
        intercept_exercise = pm.Deterministic(name = 'intercept_exercise',
                                              var = intercept_global + intercept_sigma_global*intercept_offset_exercise,
                                              dims = 'exercise')
        
        intercept_sigma_exercise = pm.HalfNormal(name = 'intercept_sigma_exercise',
                                                 sigma = 1.0,
                                                 dims = 'exercise')
        
        slope_exercise = pm.HalfNormal(name = 'slope_exercise',
                                       sigma = slope_global,
                                       dims = 'exercise')
        
        curve_exercise = pm.HalfNormal(name = 'curve_exercise',
                                       sigma = curve_global,
                                       dims = 'exercise')

        # Session Parameters
        intercept_offset_session = pm.Normal(name = 'intercept_offset_session',
                                             mu = 0.0,
                                             sigma = 1.0,
                                             dims = 'session')

        intercept_session = pm.Deterministic(name = 'intercept_session',
                                                var = (intercept_exercise[session_exercise_id]
                                                       + intercept_sigma_exercise[session_exercise_id]
                                                       * intercept_offset_session),
                                                dims = 'session')
        
        slope_session = pm.HalfNormal(name = 'slope_session',
                                      sigma = slope_exercise[session_exercise_id],
                                      dims = 'session')

        curve_session = pm.HalfNormal(name = 'curve_session',
                                      sigma = curve_exercise[session_exercise_id],
                                      dims = 'session')

        # Final Parameters
        intercept = intercept_session[session_id]
        slope = slope_session[session_id]
        curve = curve_session[session_id]
        error = error_global

        # Estimated Value
        load_mu = pm.Deterministic(name = 'load_mu',
                                   var = intercept - slope*velocity_shared - curve*velocity_shared**2,
                                   dims = 'observation'
                                   )

        # Likelihood
        load_likelihood = pm.Normal(name = 'load_estimate',
                                    mu = load_mu,
                                    sigma = error,
                                    observed = load_scaled,
                                    dims = 'observation'
                                    )
    
    if render_model:
        display(pm.model_to_graphviz(model))

    return model
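
For context, I fit the model and draw the posterior predictive roughly like this (a sketch of the sampling code):

import arviz as az

# Sketch: build the model, sample, then draw posterior predictive samples
model = build_model(velocity_scaled, load_scaled,
                    session_exercise_id, session_id,
                    coords, render_model=False)

with model:
    idata = pm.sample()
    idata.extend(pm.sample_posterior_predictive(idata))

# Compare observed data against the posterior predictive draws
az.plot_ppc(idata)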

Could it be that pm.sample_posterior_predictive doesn’t use the final-level parameters (session), but rather uses the global ones?

I believe this is related to this issue. When a dimension is allowed to be variable (as in your observation, exercise, and session dimensions), sample_posterior_predictive uses draws from the priors, rather than the posterior, for any variables that depend on the variable dimensions.

See the issue for more details. The current workaround is to not use mutable=True on dimension variables.
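
In your model, that would mean something like this (a sketch):

# Sketch of the workaround: pass coords at model creation instead of
# calling model.add_coord(..., mutable=True), so the dims are immutable
with pm.Model(coords=coords) as model:
    velocity_shared = pm.MutableData('velocity_scaled', velocity_scaled,
                                     dims='observation')
    # ... rest of the model unchanged ...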

Aah, got it.

Made the corrections and now it looks much better:

@lucianopaz in the issue you mention that one would have to create a new model instance to predict on new data, but that the implementation would depend on the model design. Could you help me with how to do it for my model? :relaxed:

I’m not Luciano, but my suggestion would be to make a clearer distinction between what is truly variable and what is not in your model, and to encode everything that is variable into a design matrix rather than using a variable index. All the code used to make this post is available here.

Actually, the coordinates on exercise and session are likely not variable. Similarly, the mapping between the levels of your hierarchy is also not variable (a geographic example: the hierarchical relationship USA → Michigan → Detroit is not altered by looking at data from 1960–1990 or from 1990–2020. If your data had the Soviet Union, on the other hand, you would have to get creative…) You would not expect to rename, reorder, or add to these groups between inference and prediction time. That is, the number of parameters in the model should be fixed.

What will change are the actual values of session, exercise, and observation in the data, encoded in session_id and observation. That is, an observation in your data might be in session 1 and exercise 3, but you might want a prediction for session 1 and exercise 4.

My suggestion would be the following: keep coords on the things in your model that are fixed, like exercise, session, and session_exercise_id. For the others, observation and session_id, you should forego coords and instead encode these relationships directly into the data that you show your model.

(Note: this is only possible because you have a linear model; this advice does not apply if you can’t write things down as a design matrix.)

Instead of using indexing to map draws to rows, you can use a design matrix. The advantage of a design matrix is that it makes it much easier to do out-of-sample prediction: you just need a design matrix for the out-of-sample data, do a single matrix multiplication, and you’re done.

As an example, I tried to re-create some data in the spirit of your plots. For simplicity, I only consider a single level hierarchy, with 5 sessions and 10 observations per session. My data looks like this:

print(df.sample(5).to_string())
    session  velocity      load
3       0.0  1.530482  1.552861
49      4.0 -0.590331  1.060461
10      1.0 -0.361010 -0.084466
26      2.0 -0.788294  1.526947
33      3.0 -1.255325  1.592196

The design matrix needs to have one column for each of the “top level” parameters of the model. These are called intercept_session, slope_session, curve_session in your model code. Since I have 5 sessions, there will be 15 parameters, so the design matrix will be N x 15, in three blocks of 5. Here’s how I made it:

import pandas as pd
from functools import reduce

# One dummy column per session, then interactions with velocity and velocity**2
X = pd.get_dummies(df_train['session'].apply(int), prefix='session')
X_trend = (X * df_train.velocity.values[:, None]).rename(columns=lambda x: x.replace('session', 'trend_session'))
X_curve = (X * df_train.velocity.values[:, None] ** 2).rename(columns=lambda x: x.replace('session', 'curve_session'))

X = reduce(lambda left, right: left.join(right), [X, X_trend, X_curve])

print(X.head(10).to_string())
    session_0  session_1  session_2  session_3  session_4  trend_session_0  trend_session_1  trend_session_2  trend_session_3  trend_session_4  curve_session_0  curve_session_1  curve_session_2  curve_session_3  curve_session_4
1           1          0          0          0          0         0.310387         0.000000              0.0              0.0              0.0         0.096340         0.000000              0.0              0.0              0.0
2           1          0          0          0          0         0.867316         0.000000              0.0              0.0              0.0         0.752237         0.000000              0.0              0.0              0.0
3           1          0          0          0          0         1.530482         0.000000              0.0              0.0              0.0         2.342377         0.000000              0.0              0.0              0.0
4           1          0          0          0          0         2.084761         0.000000              0.0              0.0              0.0         4.346229         0.000000              0.0              0.0              0.0
5           1          0          0          0          0         0.099846         0.000000              0.0              0.0              0.0         0.009969         0.000000              0.0              0.0              0.0
6           1          0          0          0          0        -0.126080        -0.000000             -0.0             -0.0             -0.0         0.015896         0.000000              0.0              0.0              0.0
7           1          0          0          0          0        -0.008299        -0.000000             -0.0             -0.0             -0.0         0.000069         0.000000              0.0              0.0              0.0
10          0          1          0          0          0        -0.000000        -0.361010             -0.0             -0.0             -0.0         0.000000         0.130328              0.0              0.0              0.0
11          0          1          0          0          0        -0.000000        -0.623473             -0.0             -0.0             -0.0         0.000000         0.388719              0.0              0.0              0.0
12          0          1          0          0          0         0.000000         0.494264              0.0              0.0              0.0         0.000000         0.244297              0.0              0.0              0.0

Now the model. The hierarchical prior structure is totally unchanged; the only thing that is different is how I compute mu. The parameters intercept_session, slope_session, and curve_session are stacked into a single (15,) vector beta, then mu = X @ beta gives the predicted mean. Here is the relevant code:

coords = {'session': pd.factorize(session_idx)[-1]}

with pm.Model(coords=coords) as model2:
    X_data = pm.Data('X_data', X.values, mutable=True)
    load_data = pm.Data('load', df_train.load.values, mutable=True)

    # ... non-centered priors ...

    beta = pm.Deterministic('beta', pm.math.concatenate([intercept_session, slope_session, curve_session]))
    
    load_mu = pm.Deterministic(name = 'load_mu',
                               var = X_data @ beta)
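
The elided prior block mirrors the structure of the original model. Roughly, it might look like this (a sketch; these lines live inside the model context, and the gist linked below is authoritative):

# Hypothetical expansion of the "# ... non-centered priors ..." comment,
# following the same non-centered pattern as the original model
intercept_global = pm.Normal('intercept_global', 0.0, 1.0)
intercept_sigma = pm.HalfNormal('intercept_sigma', 1.0)
intercept_offset = pm.Normal('intercept_offset', 0.0, 1.0, dims='session')
intercept_session = pm.Deterministic('intercept_session',
                                     intercept_global + intercept_sigma * intercept_offset,
                                     dims='session')

slope_global = pm.HalfNormal('slope_global', 1.0)
slope_session = pm.HalfNormal('slope_session', sigma=slope_global, dims='session')

curve_global = pm.HalfNormal('curve_global', 3.0)
curve_session = pm.HalfNormal('curve_session', sigma=curve_global, dims='session')

error = pm.HalfNormal('error', 3.0)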

For the full model, see the gist. Now, to do out-of-sample predictions, you can use the model.set_data API on X_data and load, since these now carry all the indexing information our model needs:

# Create an out-of-sample design matrix, exactly as for the training data
from functools import reduce

X_test = pd.get_dummies(df_test['session'].apply(int), prefix='session')
X_test_trend = (X_test * df_test.velocity.values[:, None]).rename(columns=lambda x: x.replace('session', 'trend_session'))
X_test_curve = (X_test * df_test.velocity.values[:, None] ** 2).rename(columns=lambda x: x.replace('session', 'curve_session'))

X_test = reduce(lambda left, right: left.join(right), [X_test, X_test_trend, X_test_curve])

# Out of sample PPC
with model2:
    model2.set_data('X_data', X_test)
    model2.set_data('load', df_test.load.values)
    
    ppc_test = pm.sample_posterior_predictive(idata2)
    post_test = az.extract_dataset(ppc_test, 'posterior_predictive')

Or, just to emphasize how easy organizing things into a design matrix makes everything, you can do the PPC yourself by hand. To do this, you only need to extract beta and error from the posterior trace, compute mu = X_test @ beta, then sample np.random.normal(loc=mu, scale=error), as follows:

import numpy as np

rng = np.random.default_rng()

trace = az.extract_dataset(idata2, 'posterior')
beta_posterior = trace.beta.values      # shape (15, n_draws)
error_posterior = trace.error.values    # shape (n_draws,)
load_mu_hat = X_test @ beta_posterior   # one column of predicted means per draw
load_hat_test = pd.DataFrame(rng.normal(loc=load_mu_hat, scale=error_posterior),
                             index=load_mu_hat.index,
                             columns=load_mu_hat.columns)

In the gist I show that these two methods produce an identical PPC.

PS: Bambi can do all this for you in like 4 lines of code.

Wow, thanks for writing that answer. I need to go through it a few times more to wrap my head around it.

Do I understand correctly that a design matrix is, in my case, just an array of [x, x^2], and the parameters are an array with an intercept and a coefficient for each variable in [x, x^2]?

I did have a look at Bambi, but I got stuck when trying to add the third layer in my hierarchy.

About exercise and session not being variable: what if I want to do a prediction for a new, unseen session? And the same for exercise.

Basically, in the first example, I’d like to get the predictions from the exercise level, not the session level.

In this case, you set all the values of “session” in the design matrix to 0. This corresponds to using the hypermean of the parameter distribution (e.g., intercept_global) as the parameter value. That’s the best you can do, since you’ve never seen this session before. If you get new data for a new session, you have to totally refit the model.

Note that I made an error in the gist by not adding a global mean column to the design matrix. Instead, I directly added the global mean to the session means. It gives the correct answer, but it makes estimating an unknown session awkward, because setting all the sessions to 0 doesn’t give the global mean. See the examples below for more on this.
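
To make that concrete, here is a hypothetical sketch of predicting for an unseen session, assuming the corrected layout with an explicit global column in each block. beta_ext_posterior stands for the posterior draws of that extended coefficient vector; both names are illustrative:

import numpy as np

# Hypothetical column layout, one global column per block:
# [global, session_0..4, trend_global, trend_session_0..4, curve_global, curve_session_0..4]
n_sessions = 5
x_new = 1.2                               # velocity of the new observation
row = np.zeros(3 * (1 + n_sessions))
row[0] = 1.0                              # global intercept column
row[1 + n_sessions] = x_new               # global trend column
row[2 * (1 + n_sessions)] = x_new ** 2    # global curve column
# All session dummy columns stay 0, so only the hypermeans contribute

# beta_ext_posterior: (18, n_draws) posterior draws of the extended beta
mu_new = row @ beta_ext_posterior         # one predictive mean per draw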

There are three cases for hierarchical models. For all the examples, lets say your data has N rows, k features, and a group with p elements:

  1. The estimates are fully pooled. In this case, we estimate only a single set of parameters, shared by every group. For example, just toss your data into OLS and don’t tell it anything about the hierarchical structure. You get an intercept, a slope, and a curve. This corresponds to a design matrix with columns [1, x, x ** 2], of shape (N, k).

  2. You don’t pool your estimates at all. You end up with k * p estimates, because you are simultaneously estimating p separate models, with no communication between them. In this case, your design matrix will have shape (N, k * p), and each of the columns [1, x, x ** 2] will be repeated p times. Look again at the design matrix I showed. In the first 7 rows, session_0 = 1, but all the other sessions are 0. Likewise, the values for trend_session_0 are non-zero, but all the others are zero. If you do out the matrix multiplication X @ beta, you will find the first row reduces to session_0 + trend_session_0 * x[0] + curve_session_0 * x[0] ** 2, and likewise for each session. This matrix, then, is p models stapled together (both layouts are sketched in code after this list).

  3. Partial pooling, which is the same as (2) but with a hierarchical prior structure. For the design matrix, there’s no difference between cases (2) and (3).
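
Here is a sketch of what the case (1) and case (2)/(3) design matrices look like for data like the toy example above (column names are illustrative):

import pandas as pd

# Case 1: fully pooled, shape (N, k) with k = 3
X_pooled = pd.DataFrame({
    'intercept': 1.0,
    'trend': df['velocity'],
    'curve': df['velocity'] ** 2,
})

# Cases 2 and 3: one block of [1, x, x**2] per session, shape (N, k * p)
dummies = pd.get_dummies(df['session'].astype(int), prefix='session')
X_unpooled = pd.concat(
    [dummies,
     dummies.mul(df['velocity'], axis=0).add_prefix('trend_'),
     dummies.mul(df['velocity'] ** 2, axis=0).add_prefix('curve_')],
    axis=1)

print(X_pooled.shape, X_unpooled.shape)  # (N, 3) and (N, 3 * p)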

This is hard to answer without knowing the structure of your data. To say something concrete, here’s an example multi-level hierarchy:

I guess this is like countries (A, B, C, D, E), states (A1, B1, C1, D1, E1, E2, E3), and regions (A11, etc.). Let’s say there are p countries, q states, and r regions. Note that this hierarchy is not variable, so we can encode the prior structure into the model with coords. We want the distribution of intercepts for countries centered on the world intercept, each state intercept centered on the intercept of its own country, and each region intercept centered on that of its state. If I were using coords, I would write something like:

world_intercept = pm.Normal('world_intercept', mu=0, sigma=1)
country_sigma = pm.HalfNormal('country_sigma', sigma=1)
country_offset = pm.Normal('country_offset', mu=0, sigma=1, dims='country')
country_intercept = pm.Deterministic('country_intercept',
                                     world_intercept + country_sigma * country_offset,
                                     dims='country')

state_sigma = pm.HalfNormal('state_sigma', sigma=1)
state_offset = pm.Normal('state_offset', mu=0, sigma=1, dims='state')
state_intercept = pm.Deterministic('state_intercept',
                                   country_intercept[country_to_state_map] + state_sigma * state_offset,
                                   dims='state')

region_sigma = pm.HalfNormal('region_sigma', sigma=1)
region_offset = pm.Normal('region_offset', mu=0, sigma=1, dims='region')
region_intercept = pm.Deterministic('region_intercept',
                                    state_intercept[state_to_region_map] + region_sigma * region_offset,
                                    dims='region')

Notice that at every step we just take the previous result and add something. What does it look like if we expand it all out (dropping the index maps for readability)?

region_intercept = state_intercept + region_sigma * region_offset
region_intercept = country_intercept + state_sigma * state_offset + region_sigma * region_offset
region_intercept = world_intercept + country_sigma * country_offset + state_sigma * state_offset + region_sigma * region_offset

This suggests a way to proceed. Our beta vector is now going to be of length 1 + p + q + r, and it will be formed by stacking up [world_intercept, country_sigma * country_offset, state_sigma * state_offset, region_sigma * region_offset]. Notice that we don’t need to include the summation in our priors anymore, nor do we need to worry about the indexing between levels! All of that will be handled in the matrix multiplication. The design matrix will look like this:

   world  country_A  country_B  country_C  country_D  country_E  state_A1  state_B1  state_C1  state_D1  state_E1  state_E2  state_E3  region_A11  region_B11  region_B12  region_B13  region_B14  region_C11  region_C12  region_C13  region_C14  region_C15  region_D11  region_D12  region_D13  region_D14  region_D15  region_D16  region_D17  region_D18  region_D19  region_E11  region_E21  region_E22  region_E23  region_E31  region_E32  region_E33  region_E34  region_E35  region_E36
0      1          1          0          0          0          0         1         0         0         0         0         0         0           1           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
1      1          0          1          0          0          0         0         1         0         0         0         0         0           0           1           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
2      1          0          1          0          0          0         0         1         0         0         0         0         0           0           0           1           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
3      1          0          1          0          0          0         0         1         0         0         0         0         0           0           0           0           1           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0
4      1          0          1          0          0          0         0         1         0         0         0         0         0           0           0           0           0           1           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0           0

You can see that if we have a beta structure like I suggested above, X @ beta will produce exactly what we want for each row. Everyone gets world_intercept, then the design matrix plucks out the correct country, state, and region for a given row, then adds everything up. This is exactly what you did with indexing, but leveraging the properties of matrix multiplication to achieve the same result.
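Concretely, a sketch of building that stacked beta, reusing the offset variables from the snippet above (X_data stands for a hypothetical pm.Data holding the (N, 1 + p + q + r) design matrix; the per-level Deterministic sums are no longer needed):

# Stack the offsets into a single beta matching the design matrix column order
beta = pm.Deterministic('beta', pm.math.concatenate([
    world_intercept[None],           # length 1: the world mean
    country_sigma * country_offset,  # length p: country deviations
    state_sigma * state_offset,      # length q: state deviations
    region_sigma * region_offset,    # length r: region deviations
]))

mu = X_data @ beta  # each row adds world + its country + state + region terms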

Hopefully you can see this isn’t so complex. If you have a dataframe with four columns, session, exercise, velocity, and load, then you call pd.get_dummies on the session and exercise columns, and add a column of all 1’s to get the design matrix. Then you can multiply blocks of this design matrix by features whose effects you want to partially pool, or just append single-column features that you want to leave unpooled.
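
A sketch of that construction, assuming df has those four columns:

import pandas as pd

# Global column of 1's, plus one dummy block per grouping variable
X = pd.concat(
    [pd.Series(1.0, index=df.index, name='global'),
     pd.get_dummies(df['session'], prefix='session'),
     pd.get_dummies(df['exercise'], prefix='exercise')],
    axis=1)

# Multiply the group dummies by a feature to partially pool its effect...
X_trend = X.drop(columns='global').mul(df['velocity'], axis=0).add_prefix('trend_')

# ...or append a feature as a single column to leave it unpooled
X['velocity_unpooled'] = df['velocity']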

In Bambi, I believe you would use something like load ~ 1 + (1 | session) + (1 | exercise), and follow the same pattern for your trend and curve terms. But don’t quote me on that, I am just getting started with this package.
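
For what it’s worth, an untested sketch of what that might look like (the exact formula and predict arguments may need adjusting):

import bambi as bmb

# Untested sketch: random intercept, trend, and curve per session
model = bmb.Model(
    'load ~ 1 + velocity + I(velocity**2)'
    ' + (1 + velocity + I(velocity**2) | session)',
    df)
idata = model.fit()
model.predict(idata, data=df_test, kind='pps')  # out-of-sample PPC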
