I’d add a value to coords for the feature, then make a single beta with everything bundled together. Example code:
import pymc as pm
import numpy as np
import pandas as pd
from string import ascii_letters
n_obs = 100
n_features = 20
n_groups = 5
X = np.random.normal(size=(n_obs, n_features))
groups = np.random.randint(0, n_groups, size=n_obs)
df = pd.DataFrame(X, columns = list(ascii_letters[:n_features]))
coords = {
'features':df.columns,
'groups': np.arange(n_groups)
}
with pm.Model(coords=coords) as mod:
X = pm.MutableData('X', df)
alpha = pm.Normal('alpha', dims=['groups'])
betas = pm.Normal('beta', dims=['groups', 'features'])
mu = alpha[groups] + (betas[groups] * X).sum(axis=-1)
betas is shape (groups, features) and gets blown up to (obs, features) when you index it with groups, so (betas[groups] * X).sum(axis=-1) is the same as you have, just compact.
For indexing X you have two choices:
-
(my recommended option) Make a function that creates and returns your model, and takes the dataframe and a list of columns as inputs. Use this list of columns to build the
coordsand slice your data inside thepm.MutableDataconstructor, so you don’t have to sliceXanywhere else down-stream in the model. -
If you really want to pass the entire dataframe (although you only use a subset of columns), you can slice by index number rather than index name. Make a list of features you want included in the regression, convert them to numerical indices for your dataframe, then slice
betasandXaccordingly. Example:
feature_list_1 = list('afj')
feature_list_2 = list('bcdehijkl')
feature_idx_1 = [i for i, x in enumerate(df) if x in feature_list_1]
feature_idx_2 = [i for i, x in enumerate(df) if x in feature_list_2]
mu1 = alpha[groups] + (betas[groups, :][:, feature_idx_1] * X[:, feature_idx_1]).sum(axis=-1)
mu2 = alpha[groups] + (betas[groups, :][:, feature_idx_2] * X[:, feature_idx_2]).sum(axis=-1)
If anyone knows how to avoid the awkward double-indexing on betas let me know. If you try to put them both in the same slice it will trigger fancy indexing, which isn’t what you want in this case.