I apologize in advance if there is already a clear answer on this, but I am really struggling with dimensions/indexing in a hierarchical model when there are multiple levels and multidimensional inputs.
Let’s say we have some data, x, of size (m x n). However, x has multiple groups, say (g1, g2, g3, g4), and within each group there are two colors (red, blue). This is a totally contrived example, but my real data is complicated to explain concisely:
x = (m x n)
y = (m x 1)
groups = (m x 1) e.g., [g1, g1, g3, g4, g2, g2, g3, g1 .....]
colors = (m x 1) e.g., [red, red, blue, blue, blue, red, blue, ...]
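For what it’s worth, I build the factorized index arrays roughly like this (a sketch with made-up labels; the variable names are just placeholders):

import numpy as np
import pandas as pd

groups = np.array(["g1", "g1", "g3", "g4", "g2", "g2", "g3", "g1"])
colors = np.array(["red", "red", "blue", "blue", "blue", "red", "blue", "red"])

# pd.factorize maps each label to an integer code and returns the unique
# labels in order of first appearance
groups_factorized, group_labels = pd.factorize(groups)
colors_factorized, color_labels = pd.factorize(colors)
print(groups_factorized)  # [0 0 1 2 3 3 1 0]
print(colors_factorized)  # [0 0 1 1 1 0 1 0]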
So for a regression model where x is (m x 1), I would do something like:
import numpy as np
import pymc as pm

# Assume the data used as indices are factorized and the _names variables are lists
group_names = [0, 1, 2, 3]
color_names = [0, 1]
coords = {"groups": group_names, "colors": color_names}

with pm.Model(coords=coords) as model:
    # Data
    xdata = pm.MutableData("xdata", x, dims="obs_id")
    ydata = pm.MutableData("ydata", y, dims="obs_id")
    group_idx = pm.MutableData("group_idx", groups_factorized, dims="obs_id")
    color_idx = pm.MutableData("color_idx", colors_factorized, dims="obs_id")

    # Hyperprior on slope
    slope_mu = pm.Normal("slope_mu", mu=0, sigma=10)
    slope_sigma = pm.HalfNormal("slope_sigma", 1)

    # Hyperprior on intercept
    int_mu = pm.Normal("int_mu", mu=0, sigma=10)
    int_sigma = pm.HalfNormal("int_sigma", 1)

    # Intercept and slope, one per (group, color) pair
    intercept = pm.Normal("intercept", mu=int_mu, sigma=int_sigma, dims=("groups", "colors"))
    slope = pm.Normal("slope", mu=slope_mu, sigma=slope_sigma, dims=("groups", "colors"))

    # Model error, one per (group, color) pair
    err_sigma = pm.LogNormal("err_sigma", mu=0, sigma=0.5)
    err = pm.Exponential("err", lam=err_sigma, dims=("groups", "colors"))

    # Mean of distribution
    mu = intercept[group_idx, color_idx] + slope[group_idx, color_idx] * xdata

    # Likelihood (err is indexed per observation so sigma is (m,), like mu)
    y_likelihood = pm.Normal("y_likelihood", mu=mu, sigma=err[group_idx, color_idx],
                             observed=ydata, dims="obs_id")
Here I assume different noise parameters for each (group, color) pair. When I run my model like this I do not have any issues with dimensions (and I have been able to sample without divergences). Here is where I get confused…
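As a sanity check on why the indexing works in the 1-D case (plain numpy, made-up sizes): both index arrays have length m, so the fancy indexing pulls out exactly one (group, color) entry per observation:

import numpy as np

intercept = np.arange(8).reshape(4, 2)   # (groups, colors)
group_idx = np.array([0, 0, 2, 3, 1])    # (m,)
color_idx = np.array([0, 1, 1, 0, 0])    # (m,)

# one scalar per row of the data -> shape (m,)
print(intercept[group_idx, color_idx])   # [0 1 5 6 2]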
Now if we return to x being (m x n) instead of (m x 1), I am not sure how to properly account for the shapes. I know that shapes broadcast like numpy’s, but I am still having a hard time deciding which values to index and when. I know that I could explicitly write out a separate variable for each column of x, each with its own prior (indexed by groups and colors). Something like:
mu = intercept[group_idx, color_idx] + slope1[group_idx, color_idx]*xdata_column1 + slope2[group_idx, color_idx]*xdata_column2 + ... + slope_n[group_idx, color_idx]*xdata_column_n
But obviously this gets messy and tedious. I want the model to adapt dynamically to the number of columns in x.
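In plain numpy terms, I think what I want is an elementwise multiply over the feature axis followed by a sum, something like the sketch below (made-up sizes); I just don’t know how to express the indexing for it in PyMC:

import numpy as np

m, n = 8, 3
X = np.random.randn(m, n)                # (m, n) design matrix
slopes_per_obs = np.random.randn(m, n)   # one coefficient vector per observation
intercepts_per_obs = np.random.randn(m)  # one intercept per observation

# the long hand-written sum above, vectorized over the n features
mu = intercepts_per_obs + (slopes_per_obs * X).sum(axis=1)
print(mu.shape)  # (m,)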
So now, the model might look like this…
# Assume the data used as indices are factorized and the _names variables are lists
group_names = [0, 1, 2, 3]
color_names = [0, 1]
feature_names = [0, 1, 2, 3, 4, ...]  # ["feature_1", "feature_2", ..., "feature_n"]
coords = {
    "groups": group_names,
    "colors": color_names,
    "features": feature_names,
}

with pm.Model(coords=coords) as model:
    # Data: x is now (m x n)
    xdata = pm.MutableData("xdata", x, dims=("obs_id", "features"))
    ydata = pm.MutableData("ydata", y, dims="obs_id")
    group_idx = pm.MutableData("group_idx", groups_factorized, dims="obs_id")
    color_idx = pm.MutableData("color_idx", colors_factorized, dims="obs_id")

    # Hyperprior on slope
    slope_mu = pm.Normal("slope_mu", mu=0, sigma=10)
    slope_sigma = pm.HalfNormal("slope_sigma", 1)

    # Hyperprior on intercept
    int_mu = pm.Normal("int_mu", mu=0, sigma=10)
    int_sigma = pm.HalfNormal("int_sigma", 1)

    # Intercept per (group, color); slope per (group, color, feature)
    intercept = pm.Normal("intercept", mu=int_mu, sigma=int_sigma, dims=("groups", "colors"))
    slope = pm.Normal("slope", mu=slope_mu, sigma=slope_sigma, dims=("groups", "colors", "features"))

    # Model error
    err_sigma = pm.LogNormal("err_sigma", mu=0, sigma=0.5)
    err = pm.Exponential("err", lam=err_sigma, dims=("groups", "colors"))
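To see what shapes I am actually getting, I have been drawing values from intermediate expressions with pm.draw (my assumption being that indexing slope with the two (m,) arrays should leave the trailing features axis intact):

# indexing a (groups, colors, features) variable with two (m,) index arrays
# should give one coefficient vector per observation, i.e. shape (m, n)
print(pm.draw(slope[group_idx, color_idx]).shape)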
So this is where I get stuck. Here I assume that there is a separate intercept for each (group, color) pair (drawn from the hyperprior), but I don’t need an extra dimension for each feature. On the other hand, I DO assume that the coefficient is different for each (group, color, feature) combination. How do I account for this third dimension when computing the mean of my likelihood? I also assume that the likelihood noise differs for each (group, color) pair, but I don’t think I need the extra feature dimension there, since ydata is (m x 1) with (group, color) being indices, not dimensions.
# Make an index for each feature?
feature_idx = pm.MutableData("feature_idx", np.tile(feature_names, (len(y), 1)))
In that case, feature_idx would be something like:
array([[0, 1, 2, 3,...],
[0, 1, 2, 3,...],
[0, 1, 2, 3,...],
...,
[0, 1, 2, 3,...],
[0, 1, 2, 3,...],
[0, 1, 2, 3,...]])
From there we index accordingly?
# Mean of distribution
mu = intercept[group_idx, color_idx] + slope[group_idx, color_idx, feature_idx ?] * xdata

# Likelihood
y_likelihood = pm.Normal("y_likelihood", mu=mu, sigma=err[group_idx, color_idx],
                         observed=ydata, dims="obs_id")
I’ve tried this and it does not work; a minimal numpy reproduction of the error I think I am hitting is below. I can’t tell if I am approaching this all wrong, so I really appreciate the help with this.
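(Made-up sizes; I am assuming PyMC’s fancy indexing follows numpy’s broadcasting rules here.)

import numpy as np

slope = np.zeros((4, 2, 3))                  # (groups, colors, features)
group_idx = np.array([0, 0, 2, 3, 1])        # (m,)
color_idx = np.array([0, 1, 1, 0, 0])        # (m,)
feature_idx = np.tile(np.arange(3), (5, 1))  # (m, n)

try:
    slope[group_idx, color_idx, feature_idx]
except IndexError as e:
    # the index arrays must broadcast together, and (m,) vs (m, n) fails
    print(e)

# adding a trailing length-1 axis to the (m,) indices makes them broadcastable,
# giving an (m, n) result, though I don't know if this is the idiomatic fix
picked = slope[group_idx[:, None], color_idx[:, None], feature_idx]
print(picked.shape)  # (5, 3)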
Thanks!