Hello,
I’m revisiting this problem for a hierarchical Dirichlet regression and I was hoping to get some insight into setting up the GLM for this problem.
As outlined above, there are cases where we need to predict the composition of a mixture of more than two compounds. In this more general case, the reported fractions for a sample cover compounds A, B, and C, with measurements coming from multiple studies, and we'd like to account for study-to-study differences through a hierarchical model.
My attempt to solve this goes as follows. Assume we have `X_norm` and `X_features`. Here, `X_norm` is an N×3 (N measurements) matrix containing the fractional composition (A, B, and C) of each measurement; each row of `X_norm` sums to 1. `X_features` contains the study designation and the features for each measurement.
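For concreteness, a toy version of these two tables might look like the following (the column and variable names follow the post; the values themselves are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 6

# Toy compositions for compounds A, B, C; each row normalized to sum to 1
raw = rng.random((N, 3))
X_norm = pd.DataFrame(raw / raw.sum(axis=1, keepdims=True),
                      columns=["A", "B", "C"])

# Toy features plus a study label per measurement
X_features = pd.DataFrame({
    "STUDY_ID": ["s1", "s1", "s2", "s2", "s3", "s3"],
    "feature1": rng.normal(size=N),
    "feature2": rng.normal(size=N),
})

assert X_norm.shape == (N, 3)
assert np.allclose(X_norm.sum(axis=1), 1.0)
```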
Using these two matrices, I create a design matrix with `patsy` and factorize the studies.

```python
import pandas as pd
import patsy

study_idx, study_codes = pd.factorize(X_features['STUDY_ID'])
feature_design = patsy.dmatrix('1 + feature1 + feature2', X_features, return_type='dataframe')
```
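As a quick sanity check on the factorization step, `pd.factorize` returns one integer code per row plus the unique study labels in order of first appearance:

```python
import pandas as pd

study = pd.Series(["s1", "s1", "s2", "s3", "s2"])
study_idx, study_codes = pd.factorize(study)

# One integer per row, indexing into study_codes
print(study_idx.tolist())    # [0, 0, 1, 2, 1]
# Unique labels in order of first appearance
print(list(study_codes))     # ['s1', 's2', 's3']
```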
Using these variables, I now build the hierarchical Dirichlet GLM.
```python
import pymc as pm
import pytensor.tensor as pt

coords = {
    'chem': X_norm.columns,
    'sample': X_norm.index,
    'indep': feature_design.columns,
    'study': study_codes,
}

with pm.Model(coords=coords) as model:
    descriptors = pm.Data('descriptors', feature_design.values, mutable=True, dims=("sample", "indep"))
    y = pm.Data('y', X_norm.values, mutable=True, dims=("sample", "chem"))

    # Global concentration scale
    theta = pm.Normal('theta', 0, 1)
    exptheta = pm.Deterministic('exptheta', pm.math.exp(theta))

    # Hierarchical (non-centered) descriptor coefficients
    mu_D = pm.Normal('mu_D', 0, 1, dims=("indep", "chem"))
    sigma_D = pm.HalfNormal('sigma_D', 0.1, dims=("indep", "chem"))
    D_offset = pm.Normal('D_offset', mu=0, sigma=1, dims=("study", "indep", "chem"))
    D = pm.Deterministic('D', mu_D + sigma_D * D_offset, dims=("study", "indep", "chem"))

    # Index the study-specific coefficients for each sample
    Ds = pm.Deterministic('Ds', D[study_idx], dims=("sample", "indep", "chem"))

    # descriptors has shape (sample, indep); Ds has shape (sample, indep, chem).
    # For the linear model we need, per sample, the dot product over the
    # 'indep' dimension: Dk[s, c] = sum_i descriptors[s, i] * Ds[s, i, c].
    # pt.batched_dot treats the first dimension as the batch ('sample') and
    # contracts the shared 'indep' dimension.
    Dk = pm.Deterministic('Dk', pt.batched_dot(descriptors, Ds), dims=("sample", "chem"))

    eta = pm.Deterministic("eta", pm.math.invlogit(Dk) * exptheta, dims=("sample", "chem"))
    pm.Dirichlet('measure', a=eta, observed=y, dims=("sample", "chem"))
```
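As a shape sanity check, here is a NumPy stand-in for what `pt.batched_dot(descriptors, Ds)` computes in the model above (toy dimensions, not the real data):

```python
import numpy as np

rng = np.random.default_rng(1)
n_sample, n_indep, n_chem = 5, 3, 4

descriptors = rng.normal(size=(n_sample, n_indep))   # (sample, indep)
Ds = rng.normal(size=(n_sample, n_indep, n_chem))    # (sample, indep, chem)

# Batched dot: for each sample s, Dk[s] = descriptors[s] @ Ds[s]
Dk = np.einsum("si,sic->sc", descriptors, Ds)        # (sample, chem)

# Loop-based equivalent, for verification
Dk_loop = np.stack([descriptors[s] @ Ds[s] for s in range(n_sample)])
assert Dk.shape == (n_sample, n_chem)
assert np.allclose(Dk, Dk_loop)
```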
This appears to do a nice job of handling all the mixture cases where `n_chem >= 2`. However, the model is very sensitive to the choice of prior on `sigma_D`, and `sigma_D ~ HalfNormal(1)` can lead to convergence issues.
My questions are:
- Are there any glaring issues with this approach for modeling a hierarchical Dirichlet regression GLM?
- Is there a more `bambi`-esque way to include the studies as a random effect in the design matrix, as opposed to manually indexing each study-specific coefficient?
Thanks as always to the community for the help with these technical questions.