Hierarchical multiple regression with multiple grouping variables

I apologize in advance if there is already a clear answer on this, but I am really struggling with dimensions/indexing in a hierarchical model when it has multiple levels and multidimensional inputs.

Let's say we have some data, x, of size (m x n). However, x has multiple groups, say (g1, g2, g3, g4), and within each group there are two colors (red, blue). This is a totally contrived example, but my real data is too complicated to explain concisely.

x      = (m x n)
y      = (m x 1)
groups = (m x 1) e.g., [g1, g1, g3, g4, g2, g2, g3, g1 .....] 
colors = (m x 1) e.g., [red, red, blue, blue, blue, red, blue, ...]

So for a regression model where x = (m x 1), I would do something like:

# Assume the data used as indices are factorized and the _names variables are lists
group_names = [0, 1, 2, 3]
color_names = [0, 1]
coords = {"groups" : group_names, "colors" : color_names}

with pm.Model(coords=coords) as model:

    # Data
    xdata      = pm.MutableData('xdata'    , x      , dims="obs_id")
    ydata      = pm.MutableData('ydata'    , y      , dims="obs_id")
    group_idx  = pm.MutableData('group_idx' , groups_factorized, dims="obs_id")
    color_idx  = pm.MutableData('color_idx' , colors_factorized , dims="obs_id") 

    # Prior on slope
    slope_mu    = pm.Normal("slope_mu", mu=0, sigma=10)
    slope_sigma = pm.HalfNormal("slope_sigma", 1)

    # Prior on intercept
    int_mu    = pm.Normal("int_mu", mu=0, sigma=10)
    int_sigma = pm.HalfNormal("int_sigma", 1)

    # Intercept and slope
    intercept = pm.Normal("intercept", mu=int_mu   , sigma=int_sigma   , dims=("groups","colors"))
    slope     = pm.Normal("slope"    , mu=slope_mu , sigma=slope_sigma, dims=("groups","colors"))

    # Model error 
    err_sigma = pm.Lognormal('err_sigma',mu=0, sigma=0.5) 
    err       = pm.Exponential("err", lam=err_sigma ,dims=("groups", "colors")) 

    # Mean of distribution
    mu = intercept[group_idx, color_idx] + slope[group_idx, color_idx]*xdata

    # Likelihood
    y_likelihood = pm.Normal("y_likelihood", mu=mu, sigma=err[group_idx, color_idx], observed=ydata, dims="obs_id")

Here I assume a different noise parameter for each (group, color) pair. When I run the model like this I do not have any issues with dimensions (and I have been able to sample without divergences). Here is where I get confused…

Now if we return to x being (m x n) instead of (m x 1), I am not sure how to properly account for the shapes. I know that shapes broadcast like NumPy, but I am still having a hard time with which values to index and when. I know that I could explicitly write out something where I have a different variable for each column of x and then a prior (indexed by groups and colors) for each one. Something like

mu = intercept[group_idx, color_idx] + slope1[group_idx, color_idx]*xdata_column1 + slope2[group_idx, color_idx]*xdata_column2 + ... + slope_n[group_idx, color_idx]*xdata_column_n

But obviously this gets messy and tedious. I want the model to adapt dynamically to the number of input features.

So now, the model might look like this…

# Assume the data used as indices are factorized and the _names variables are lists
group_names   = [0, 1, 2, 3]
color_names   = [0, 1]
feature_names = [0, 1, 2, 3, 4, ...] # ["feature_1", "feature_2", "feature_3", .... , "feature_n"]
coords = {"groups"   : group_names, 
          "colors"   : color_names, 
          "features" : feature_names}

with pm.Model(coords=coords) as model:

    # Data
    xdata      = pm.MutableData('xdata'     , x                 , dims=("obs_id", "features"))
    ydata      = pm.MutableData('ydata'     , y                 , dims="obs_id")
    group_idx  = pm.MutableData('group_idx' , groups_factorized , dims="obs_id")
    color_idx  = pm.MutableData('color_idx' , colors_factorized , dims="obs_id") 

    # Prior on slope
    slope_mu    = pm.Normal("slope_mu", mu=0, sigma=10)
    slope_sigma = pm.HalfNormal("slope_sigma", 1)

    # Prior on intercept
    int_mu    = pm.Normal("int_mu", mu=0, sigma=10)
    int_sigma = pm.HalfNormal("int_sigma", 1)

    # Intercept and slope
    intercept = pm.Normal("intercept", mu=int_mu   , sigma=int_sigma   , dims=("groups","colors"))
    slope     = pm.Normal("slope"    , mu=slope_mu, sigma=slope_sigma, dims=("groups","colors", "features"))

    # Model error 
    err_sigma = pm.Lognormal('err_sigma',mu=0, sigma=0.5) 
    err       = pm.Exponential("err", lam=err_sigma ,dims=("groups", "colors")) 

So this is where I get stuck. Here I assume that there is a separate intercept for each group and color (drawn from the hyperprior), but I don’t need an extra dimension for each feature. On the other hand, I DO assume that the coefficient is different for each (group, color, feature). How do I account for this third dimension when computing the mean of my likelihood? I also assume that the likelihood noise is different for each group/color as well, but I don’t think I need the extra dimension since the ydata is (m x 1) with (group, color) being indices, not dimensions.

    # Make an index for each feature?
    feature_idx = pm.MutableData('feature_idx', np.tile(feature_names, (len(y), 1)))

In that case feature_idx would be something like:

array([[0, 1, 2, 3,...],
       [0, 1, 2, 3,...],
       [0, 1, 2, 3,...],
       ...,
       [0, 1, 2, 3,...],
       [0, 1, 2, 3,...],
       [0, 1, 2, 3,...]])

From there we index accordingly?

    # Mean of distribution
    mu = intercept[group_idx, color_idx] + slope[group_idx, color_idx, feature_idx ? ]*xdata

    # Likelihood
    y_likelihood = pm.Normal("y_likelihood", mu=mu, sigma=err[group_idx, color_idx], observed=ydata, dims="obs_id")

I’ve tried this and it does not work. I can’t tell if I am approaching it all wrong … I really appreciate the help with this.

Thanks!

First just a general note on broadcasting. Suppose you have a feature matrix X of shape (n, k), and you draw a vector beta with shape k, along with an intercept term alpha.

In the base case, we can just do alpha + X @ beta to get the linear combination of the X’s.

Now suppose we have a single grouping index, group_idx, of shape n, representing g groups. We adjust beta to reflect this by drawing one beta for each (group, feature) pair: beta = pm.Normal('beta', 0, 1, dims=('groups', 'features')).

Now beta has shape (g, k). If I index it as beta[group_idx], it becomes (n, k). Now we can make the linear combination alpha[group_idx] + (X * beta[group_idx]).sum(axis=-1). The -1 axis is the feature axis, so this computation multiplies each row of X by the betas corresponding to that row's group, then adds them up. This is exactly what X @ beta did back when we only had one group.
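
For intuition, here is a small NumPy sketch (made-up shapes and values) showing that the fancy-index-then-multiply-and-sum pattern gives the same result as taking a per-row dot product with that row's group-specific betas:

import numpy as np

rng = np.random.default_rng(0)
n, k, g = 6, 3, 2                  # observations, features, groups

X = rng.normal(size=(n, k))        # feature matrix, shape (n, k)
beta = rng.normal(size=(g, k))     # one coefficient vector per group, shape (g, k)
group_idx = rng.integers(0, g, n)  # group label of each row, shape (n,)

expanded = beta[group_idx]         # fancy indexing: one row of betas per observation, (n, k)
mu = (X * expanded).sum(axis=-1)   # multiply-and-sum over the feature axis, (n,)

# Same result computed row by row with a dot product
mu_loop = np.array([X[i] @ beta[group_idx[i]] for i in range(n)])
assert np.allclose(mu, mu_loop)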

Finally, suppose we have multiple grouping variables. Consider pooling only the intercept, so we're back to the X @ beta case, but we want the intercept of each row to be determined by two factors: the group and the color. To accomplish this, we simply make two intercepts: intercept_group = pm.Normal('intercept_group', 0, 1, dims=['groups']) and intercept_color = pm.Normal('intercept_color', 0, 1, dims=['colors']). The intercept of each row will be the sum of the two, so we have:

mu = intercept_group[group_idx] + intercept_color[color_idx] + X @ beta

If you now also want slopes that vary with each grouping variable, the logic is exactly the same. We will have beta_group and beta_color, of shapes (g, k) and (c, k) (where c is the number of colors), then we use the multiply-and-sum method to make the linear combination:

mu = intercept_group[group_idx] + intercept_color[color_idx] + (X * (beta_group[group_idx] + beta_color[color_idx])).sum(axis=-1)
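
Putting that together, a minimal PyMC sketch of this additive version could look like the following (reusing the coords and factorized index arrays from your setup; hyperpriors and the noise prior are kept deliberately simple):

coords = {
    "groups": group_names,
    "colors": color_names,
    "features": feature_names,
}

with pm.Model(coords=coords) as additive_model:
    X = pm.MutableData("X", x, dims=("obs_id", "features"))
    group_idx = pm.MutableData("group_idx", groups_factorized, dims="obs_id")
    color_idx = pm.MutableData("color_idx", colors_factorized, dims="obs_id")

    intercept_group = pm.Normal("intercept_group", 0, 1, dims="groups")
    intercept_color = pm.Normal("intercept_color", 0, 1, dims="colors")
    beta_group = pm.Normal("beta_group", 0, 1, dims=("groups", "features"))
    beta_color = pm.Normal("beta_color", 0, 1, dims=("colors", "features"))

    # Row-level intercepts and slopes are sums of the group and color effects
    mu = (
        intercept_group[group_idx]
        + intercept_color[color_idx]
        + (X * (beta_group[group_idx] + beta_color[color_idx])).sum(axis=-1)
    )

    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y, dims="obs_id")

The same pattern extends to any number of grouping variables: one effect vector per variable, added together before the multiply-and-sum.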

This approach assumes an additive structure between the dimensions, which I guess you object to. Following your approach more closely implies a separate parameter for every group-color pair, which we can then fancy-index. For example, the intercept could be:

intercept = pm.Normal('intercept', 0, 1, dims=['groups', 'colors'])

So it has shape (g, c). Fancy indexing with both index variables – intercept[group_idx, color_idx] – results in a vector of length n.

Note, however, that these are not the same models! My model groups the rows by color and estimates an intercept, then by group and estimates an intercept, then combines these to get a row intercept. This second model splits the rows into group-color groups, and estimates the intercept for each one separately.

For beta in this second case, you should see the pattern by now: fancy index from (g, c, k) to (n, k), then multiply-and-sum:

betas = pm.Normal('betas', 0, 1, dims=['groups', 'colors', 'features'])
mu = intercept[group_idx, color_idx] + (X * betas[group_idx, color_idx]).sum(axis=-1)
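
Putting the pieces together, a minimal sketch of this second, fully crossed model might look like this (using the same coords as above, with a per-(group, color) noise term as in your original setup, and simplified priors):

with pm.Model(coords=coords) as crossed_model:
    X = pm.MutableData("X", x, dims=("obs_id", "features"))
    group_idx = pm.MutableData("group_idx", groups_factorized, dims="obs_id")
    color_idx = pm.MutableData("color_idx", colors_factorized, dims="obs_id")

    intercept = pm.Normal("intercept", 0, 1, dims=("groups", "colors"))
    betas = pm.Normal("betas", 0, 1, dims=("groups", "colors", "features"))

    # (g, c, k) fancy-indexed with two length-n index vectors -> (n, k), then multiply-and-sum
    mu = intercept[group_idx, color_idx] + (X * betas[group_idx, color_idx]).sum(axis=-1)

    # One noise parameter per (group, color) pair, indexed down to length n
    err = pm.Exponential("err", 1.0, dims=("groups", "colors"))
    pm.Normal("y_obs", mu=mu, sigma=err[group_idx, color_idx], observed=y, dims="obs_id")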

Awesome, thanks for such a thorough answer! This is immensely helpful. I just want to follow up with a few things:

If you now also want slopes that vary with each grouping variable, the logic is exactly the same. We will have beta_group and beta_color, of shapes (g, k) and (c, k) (where c is the number of colors), then we use the multiply-and-sum method to make the linear combination:

mu = intercept_group[group_idx] + intercept_color[color_idx] + (X * (beta_group[group_idx] + beta_color[color_idx])).sum(axis=-1)

This approach assumes an additive structure between the dimensions, which I guess you object to.

It is not that I specifically object to this in general; I was just trying to imagine whether there might be a way to generalize the code so that the number of features with coefficients can vary. However, this does help me understand the difference between the structure of your model 1 and model 2.

When I mentioned not writing out each prior for the coefficient separately, I was under the impression that

(1)

(X * (beta_group[group_idx] + beta_color[color_idx])).sum(axis=-1)

is equal to

(2)

(X * betas[group_idx, color_idx]).sum(axis=-1)

Now I know that they are not equivalent. In my current problem, I do have reason to believe that one group varies within another, e.g., that color has some variation within each group. This is a terrible example for illustrating that, but nonetheless it does help me understand the difference.

So just to be clear, the difference between the two methods of dims and indexing is that (2) assumes there is a separate coefficient and intercept for every (group, color) tuple, correct?

Would this be equivalent to doing something like the following?

x      = (m x n)
y      = (m x 1)
groups = (m x 1) e.g., [g1, g1, g3, g4, g2, g2, g3, g1 .....] 
colors = (m x 1) e.g., [red, red, blue, blue, blue, red, blue, ...]

Then we could define a new variable where we combine group and color and then factorize that:

# Wrap the (group, color) tuples in a Series so pd.factorize sees a single 1-d column
groups_colors = pd.Series(list(zip(groups, colors)))
group_color_idx, group_color_list = pd.factorize(groups_colors)
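
For concreteness, a toy run (hypothetical values) shows what the factorization produces:

import pandas as pd

groups = ["g1", "g1", "g3", "g4", "g2"]
colors = ["red", "red", "blue", "blue", "blue"]

group_color_idx, group_color_list = pd.factorize(pd.Series(list(zip(groups, colors))))

print(group_color_idx)         # [0 0 1 2 3]
print(list(group_color_list))  # [('g1', 'red'), ('g3', 'blue'), ('g4', 'blue'), ('g2', 'blue')]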

So then model1 and model2 below are equivalent?

############################ Model 1 ############################

coords1 = {"group_color": group_color_list,
           "features"   : feature_names}

with pm.Model(coords=coords1) as model1:
    X         = pm.MutableData('X'   , x     , dims=("obs_id", "features"))
    intercept = pm.Normal('intercept', 0 , 1 , dims=['group_color'])
    betas     = pm.Normal('betas'    , 0 , 1 , dims=['group_color', 'features'])

    mu        = intercept[group_color_idx] + (X * betas[group_color_idx]).sum(axis=-1)

############################ Model 2 ############################

coords = {"groups"   : group_names, 
          "colors"   : color_names, 
          "features" : feature_names}

with pm.Model(coords=coords2) as model2:
    X         = pm.MutableData('X'   , x     , dims=("obs_id", "features"))
    intercept = pm.Normal('intercept', 0 , 1 , dims=['groups', 'colors'])
    betas     = pm.Normal('betas'    , 0 , 1 , dims=['groups', 'colors', 'features'])

    mu        = intercept[group_idx, color_idx] + (X * betas[group_idx, color_idx]).sum(axis=-1)

If this is correct, then I was previously under the incorrect assumption that model1 is how you force a coefficient for each (group, color) combo, while the second is doing what the additive model does.

Please let me know if I understood correctly.

Thanks!

There might be a general equivalence I'm overlooking, but as far as I understand, yes, that's exactly right. To test it, you might generate some data with both models and see whether the wrong model can recover the right answer.
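
A minimal sketch of that check (hypothetical sizes and parameter values): simulate data from the fully crossed model, then fit both model1 and model2 above to the same data and compare how well each recovers the true coefficients.

import numpy as np

rng = np.random.default_rng(42)
m, k, g, c = 500, 3, 4, 2        # observations, features, groups, colors

X_sim = rng.normal(size=(m, k))
group_idx = rng.integers(0, g, m)
color_idx = rng.integers(0, c, m)

# "True" fully crossed parameters: one intercept and one coefficient vector per (group, color)
true_intercept = rng.normal(size=(g, c))
true_betas = rng.normal(size=(g, c, k))

mu_true = true_intercept[group_idx, color_idx] + (X_sim * true_betas[group_idx, color_idx]).sum(axis=-1)
y_sim = mu_true + rng.normal(scale=0.5, size=m)

# Fit both models to (X_sim, y_sim) with pm.sample and compare the posteriors of the
# betas (or out-of-sample predictions) against true_betas.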

You might also want to check Bambi and either use it or reverse engineer the kind of models they produce: BAyesian Model-Building Interface (Bambi) in Python — Bambi 0.12.0.dev documentation

Awesome, thanks! I’ve played with Bambi but I don’t necessarily want to be completely hands off. I just want to be able to develop my own project-specific classes that are somewhat generalizable, but still have complete control of the model based on the specific data in hand. I might look at what they are doing so I can consider reverse engineering some of their implementations like you mentioned.

@jessegrabowski I am going to try comparing those two methods and see if they result in essentially the same model.

Thanks again, the detailed explanation on broadcasting and fancy indexing was very useful.