Marginalizing over missing categories

I’ve been working on learning Bayesian inference using PyMC3. The resources and community are amazing, thank you! Though I think my learning curve may have gotten caught in a local minimum between the introductory material and the more advanced examples.

I’m trying to work through a model based on a real-world dataset I’ve encountered in the past. It’s an interesting dataset with both continuous and categorical predictors, a nested categorical hierarchy, and lots of missing data.

My goal is to build a hierarchical logistic regression model, but at the moment I’m trying to understand a rather simplified version where I use a single high-cardinality categorical predictor, source, to predict the binary outcome. There are many missing observations of source. I started by using PyMC3’s automatic tools for handling missing data (cool) with a masked array and put together the following “long-form” logistic model.

import numpy as np
import pandas as pd
import pymc3 as pm

# pd.factorize encodes missing values as -1; mask those so PyMC3 treats them as missing
sources_encoded, sources = pd.factorize(df["Source"])
n_sources = sources.size
sources_encoded = np.ma.masked_equal(sources_encoded, -1)

with pm.Model() as model:
    mu_alpha = pm.Normal("mu_alpha", mu=0, sd=100)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sd=100)
    source = pm.Categorical(
        "source",
        np.repeat(1 / n_sources, n_sources),
        shape=len(sources_encoded),
        observed=sources_encoded,
    )
    a = pm.Normal("a", mu=mu_alpha, sd=sigma_alpha, shape=(n_sources,))
    p = pm.Deterministic("p", pm.math.invlogit(a[source]))
    outcome = pm.Binomial("outcome", 1, p, observed=df["outcome"])
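
For completeness, here is roughly how I sample it (arguments just illustrative); as far as I understand, PyMC3 imputes the masked entries as a separate source_missing variable and assigns a discrete step method to it alongside NUTS:

with model:
    # NUTS handles the continuous parameters; a discrete step
    # (CategoricalGibbsMetropolis, I believe) is assigned automatically
    # for the imputed source_missing values
    trace = pm.sample(2000, tune=2000, target_accept=0.9)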

This samples, but slowly, and has a propensity for chain pathologies. I would like to try to improve the sampling by marginalizing out the missing observations. I’ve found a number of discussions on the PyMC Discourse that have gotten me started, but I’m still a bit stuck. My thought is to compute the likelihood of an observation with a missing source from a latent mixture of the Bernoulli distributions over the categories. This is my attempt; it’s clearly wrong, but I’m not sure how to fix it:

sources_encoded, sources = pd.factorize(df["Source"])
missing_idx = np.argwhere(sources_encoded == -1)
observed_idx = np.argwhere(sources_encoded != -1)
n_sources_observed = np.sum(sources_encoded != -1)
n_sources_missing = np.sum(sources_encoded == -1)
n_sources = df["Source"].unique().size

with pm.Model() as model2:
    mu_alpha = pm.Normal("mu_alpha", mu=0, sd=100)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sd=100)
    source = pm.Categorical(
        "source",
        p=np.repeat(1 / n_sources, n_sources),
        shape=(n_sources_observed,),
        observed=sources_encoded[observed_idx],
    )

    a = pm.Normal("a", mu=mu_alpha, sd=sigma_alpha, shape=(n_sources,))
    p = pm.Deterministic("p", pm.math.invlogit(a[source]))
    outcomeObs = pm.Binomial(
        "outcomeObs",
        1,
        p,
        observed=df["outcome"].values[observed_idx],
    )
    # latent p(source = c) for observations whose source is missing
    pMiss = pm.Dirichlet("pMiss", a=np.ones(n_sources))
    # integrate over the missing categories using the pointwise
    # likelihood for a single observation:
    # $\sum_c p(source = c)\, p(y_i \mid source = c)$
    # not sure how to set the following up.
    comps = pm.Binomial.dist(n=1, p=p, shape=(n_sources,))
    outcomeMissing = pm.Mixture(
        "outcomeMissing",
        comp_dists=comps,
        w=pMiss,
        observed=df["outcome"].values[missing_idx],
    )

I’ve also thought about computing the likelihood for rows with missing categories using logsumexp, but it seems that a mixture is the preferred approach. Thanks again for the wonderful project; any pointers would be much appreciated!
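
In case it helps clarify what I mean, this is roughly what I think the logsumexp version would look like (just a sketch, reusing a, pMiss and missing_idx from above; it would replace the Mixture inside the model block, and I haven’t verified it):

    # marginalize the missing-source rows by hand instead of using pm.Mixture
    y_miss = df["outcome"].values[missing_idx.flatten()]  # shape (n_missing,)
    # log p(y_i | source = c) for every missing row i and every category c
    logp_y_given_c = pm.Bernoulli.dist(p=pm.math.invlogit(a)).logp(y_miss[:, None])
    # log sum_c p(source = c) * p(y_i | source = c), per row
    row_logp = pm.math.logsumexp(pm.math.log(pMiss) + logp_y_given_c, axis=1)
    pm.Potential("outcomeMissing_logp", row_logp.sum())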

I’ve looked through the following discussions while working through this problem, but I think I am still missing something (pun intended):


Update:
I think I found the issue with the above model. It appears I was a bit confused about the role of the ‘shape’ parameter when specifying observed RVs. This blog post from @lucianopaz really helped clarify things!

This is my updated model, which seems to sample better (~1 hr). Moreover, I can now fit it with ‘advi’. There are still 1-2 divergent transitions and I haven’t given it a full run-through with posterior predictive checks, so there may still be a better way to specify things.

with pm.Model() as model2:
    mu_alpha = pm.StudentT("mu_alpha", nu=1, mu=0, sigma=100)
    sigma_alpha = pm.HalfNormal("sigma_alpha", sd=100)
    source = pm.intX(pm.Data("source", sources_encoded[observed_idx.flatten()]))
    
    a = pm.Normal("a", mu=mu_alpha, sd=sigma_alpha, shape=(n_sources,))
    p = pm.Deterministic("p", pm.math.invlogit(a[source]))
    outcomeObs = pm.Binomial(
        "outcomeObs", 1, p, observed=df["outcome"].values[observed_idx.flatten()],
    )
    # latent p(source = c) for observations whose source is missing
    pMiss = pm.Dirichlet("pMiss", a=np.ones(n_sources))
    # components of the mixture generating the observations with
    # unknown categories. This assumes that the missing observations
    # come from one of the categories known to be part of source,
    # and that missingness might depend on what that category is.
    comps = pm.Binomial.dist(n=1, p=pm.math.invlogit(a), shape=(n_sources,))
    outcomeMissing = pm.Mixture(
        "outcomeMissing",
        comp_dists=comps,
        w=pMiss,
        observed=df["outcome"].values[missing_idx.flatten()],
    )
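
For reference, a rough sketch of how the model can be fit; the specific arguments below are illustrative rather than the exact settings I used:

with model2:
    # NUTS run (the slow, ~1 hr route)
    trace = pm.sample(1000, tune=1000, target_accept=0.95)

    # or the much faster variational approximation
    approx = pm.fit(50000, method="advi")
    advi_trace = approx.sample(1000)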

As a side note: generating the plate-notation diagram with graphviz was really helpful in working through this problem, so I thought I would include it here:
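
For anyone who wants to reproduce it, PyMC3 has a built-in helper for this (it needs the graphviz package installed):

# draws the plate-notation diagram for the model
pm.model_to_graphviz(model2)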
