Understanding modeling when missing categorical values are present

matsuo_basho · April 15, 2023, 6:32pm

I’m trying to understand the role of a particular line of code in a model that includes data with missing categorical values. This is taken from the Code 15.30 section of the Statistical Rethinking 2 repo.

import warnings

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

from numpy.random import default_rng
from scipy.special import expit as invlogit

N_houses = 100  # Number of houses to simulate
alpha = 5  # Average number of notes when cat is absent
beta = -3  # Difference in number of notes when cat is present
k_true = 0.5  # Probability of cat present
r = 0.2  # Probability of not knowing whether cat present/absent

cat_true = rng.binomial(n=1, p=k_true, size=N_houses)
notes = rng.poisson(lam=alpha + beta * cat_true, size=N_houses)
R_C = rng.binomial(n=1, p=r, size=N_houses)
cat = cat_true.copy()
cat[R_C == 1] = -9

with pm.Model() as m15_8:
    # priors
    a = pm.Normal("a", 0, 1)
    b = pm.Normal("b", 0, 0.5)
    k = pm.Beta("k", 2, 2)

    custom_logp = pm.math.logsumexp(
        pm.math.log(k)
        + pm.logp(pm.Poisson.dist(pm.math.exp(a + b)), notes[cat == -9])
        + pm.math.log(1 - k)
        + pm.logp(pm.Poisson.dist(pm.math.exp(a)), notes[cat == -9])
    )
    # Using pm.Potential to add custom term to model logp
    notes_RC_1 = pm.Potential("notes|RC==1", custom_logp)

    cat_RC_0 = pm.Bernoulli("cat|RC==0", k, observed=cat[cat != -9])  # Note how k will be updated
    lam = pm.math.exp(a + b * cat_RC_0)
    notes_RC_0 = pm.Poisson("notes|RC==0", lam, observed=notes[cat != -9])

    idata_m15_8 = pm.sample(return_inferencedata=True, random_seed=RANDOM_SEED)
az.summary(idata_m15_8, var_names=["a", "b", "k"], round_to=2)

 	mean 	sd 	hdi_5.5% 	hdi_94.5% 	mcse_mean 	mcse_sd 	ess_bulk 	ess_tail 	r_hat
a 	1.53 	0.07 	1.42 	1.64 	0.0 	0.0 	3549.82 	3093.47 	1.0
b 	-0.93 	0.14 	-1.16 	-0.72 	0.0 	0.0 	3590.55 	2995.64 	1.0
k 	0.45 	0.05 	0.37 	0.53 	0.0 	0.0 	4435.70 	2852.56 	1.0

My confusion is from notes_RC_1 not being used further in the code. I understand that this is how we model the categorical missing values, but we do we need it if we don’t use it further on?
Also, when we comment out the custom_logp and notes_RC_1 definition, we get different values for b and k (not a). Why is that?

ricardoV94 · April 16, 2023, 5:18am

Maybe the documentation of Potential clarifies your question? pymc.Potential — PyMC 5.3.0 documentation

Topic		Replies	Views
Trouble specificying X \| a, b, c, d ~ Categorical( . ) Questions	5	492	March 2, 2019
How to deal with missing values version agnostic	10	281	July 3, 2024
Dealing with random missing values in a GLM model v5 modeling	0	286	July 18, 2023
Marginalizing over missing categories Questions	1	702	June 17, 2020
Missing values in a model? Questions	12	4632	November 7, 2018

Understanding modeling when missing categorical values are present

Related topics