I’ve been learning Bayesian inference with PyMC3. The resources and community are amazing, thank you! That said, I think my learning curve may be caught in a local minimum between the introductory material and the more advanced examples.
I’m trying to work through a model based on a real-world dataset I’ve encountered in the past. It’s an interesting dataset, with both continuous and categorical predictors, a nested categorical hierarchy, and lots of missing data.
My goal is to build a hierarchical logistic regression model, but for the moment I’m trying to understand a simplified version in which a single high-cardinality categorical predictor, Source, predicts the binary outcome. Many observations of Source are missing. I started with PyMC3’s automatic handling of missing data (very cool) via a masked array, and put together the following “long-form” logistic model:
import numpy as np
import pandas as pd
import pymc3 as pm

# pd.factorize encodes the categories as 0..n_sources-1 and marks NaN as -1
sources_encoded, sources = pd.factorize(df["Source"])
n_sources = sources.size
# mask the -1 entries so PyMC3 treats them as missing and imputes them
sources_encoded = np.ma.masked_equal(sources_encoded, -1)

with pm.Model() as model:
    mu_alpha = pm.Normal("mu_alpha", mu=0, sd=100)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sd=100)
    # uniform prior over categories; the masked entries become latent variables
    source = pm.Categorical(
        "source",
        p=np.repeat(1 / n_sources, n_sources),
        shape=len(sources_encoded),
        observed=sources_encoded,
    )
    a = pm.Normal("a", mu=mu_alpha, sd=sigma_alpha, shape=(n_sources,))
    p = pm.Deterministic("p", pm.math.invlogit(a[source]))
    outcome = pm.Binomial("outcome", 1, p, observed=df["outcome"].values)
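For reference, I have just been sampling with something close to the defaults (the draw/tune/chain numbers below are only the placeholders I happened to use):

with model:
    # PyMC3 assigns NUTS to the continuous parameters and a discrete
    # (CategoricalGibbsMetropolis-style) step to the imputed source values.
    trace = pm.sample(draws=1000, tune=1000, chains=4)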
This samples, but slowly, and the chains have a propensity for pathologies. I would like to improve the sampling by marginalizing out the missing observations. I’ve found a number of discussions on the PyMC Discourse that got me started, but I’m still a bit stuck. My thought is to compute the likelihood of an observation with a missing Source as a latent mixture of the per-category Bernoulli distributions. This is my attempt; it’s clearly wrong, but I’m not sure how to fix it:
sources_encoded, sources = pd.factorize(df["Source"])
missing_idx = np.flatnonzero(sources_encoded == -1)
observed_idx = np.flatnonzero(sources_encoded != -1)
n_sources_observed = observed_idx.size
n_sources_missing = missing_idx.size
n_sources = sources.size

with pm.Model() as model2:
    mu_alpha = pm.Normal("mu_alpha", mu=0, sd=100)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sd=100)
    # only the rows with an observed source enter as data here
    source = pm.Categorical(
        "source",
        p=np.repeat(1 / n_sources, n_sources),
        shape=(n_sources_observed,),
        observed=sources_encoded[observed_idx],
    )
    a = pm.Normal("a", mu=mu_alpha, sd=sigma_alpha, shape=(n_sources,))
    p = pm.Deterministic("p", pm.math.invlogit(a[source]))
    outcomeObs = pm.Binomial(
        "outcomeObs",
        1,
        p,
        observed=df["outcome"].values[observed_idx],
    )
    # latent category probabilities for a row whose source is missing,
    # i.e. p(source = c | source is missing)
    pMiss = pm.Dirichlet("pMiss", a=np.ones(n_sources))
    # integrate over the missing source, using the pointwise likelihood
    # for a single observation: $\sum_c p(source_c) p(y_i | source_c)$
    # not sure how to set the following up.
    comps = pm.Binomial.dist(n=1, p=pm.math.invlogit(a), shape=(n_sources,))
    outcomeMissing = pm.Mixture(
        "outcomeMissing",
        w=pMiss,
        comp_dists=comps,
        observed=df["outcome"].values[missing_idx],
    )
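To spell out what I’m aiming for: for a row $i$ whose Source is missing, I believe the marginal likelihood should be $p(y_i \mid a, \pi) = \sum_{c=1}^{C} \pi_c \, \mathrm{Bernoulli}\left(y_i \mid \mathrm{logit}^{-1}(a_c)\right)$, where $\pi$ is the vector of category probabilities for a missing row (pMiss above) and $C$ is n_sources, though I may well have this wrong.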
I’ve also thought about computing the likelihood for the rows with missing categories using logsumexp instead of pm.Mixture, roughly as in the sketch below, but it seems that using a mixture is the preferred method. Thanks again for the wonderful project, any pointers would be much appreciated!
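In case it clarifies what I mean, here is a rough sketch of the logsumexp version (names like p_all and log_terms are just for this sketch, and I’m not at all sure the broadcasting or the pm.Potential usage is right):

y_obs = df["outcome"].values[observed_idx]
y_miss = df["outcome"].values[missing_idx]

with pm.Model() as model3:
    mu_alpha = pm.Normal("mu_alpha", mu=0, sd=100)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sd=100)
    a = pm.Normal("a", mu=mu_alpha, sd=sigma_alpha, shape=(n_sources,))
    # rows with an observed source: ordinary logistic regression
    p_obs = pm.math.invlogit(a[sources_encoded[observed_idx]])
    pm.Bernoulli("outcomeObs", p=p_obs, observed=y_obs)
    # rows with a missing source: marginalize over the category
    pMiss = pm.Dirichlet("pMiss", a=np.ones(n_sources))
    p_all = pm.math.invlogit(a)   # success probability per category, shape (n_sources,)
    y = y_miss[:, None]           # shape (n_missing, 1) for broadcasting
    # log[ pMiss_c * Bernoulli(y_i | p_all_c) ] for every row/category pair
    log_terms = (pm.math.log(pMiss)
                 + y * pm.math.log(p_all)
                 + (1 - y) * pm.math.log(1 - p_all))   # shape (n_missing, n_sources)
    # sum over categories in log space, then add the total to the model logp
    pm.Potential("outcomeMissing", pm.math.logsumexp(log_terms, axis=1).sum())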
I’ve looked through the following discussions in working through this problem, but I think I am still missing something (pun intended):