Is this a censored-data problem or not?

Hello all,

I have successfully been confusing myself regarding the following question:

I have observation data from N sites indexed by i = 1...N. At each site I observe a number p_i of persons. Among these persons I count special events, with the number of observed events for site i being e_i. (As a concrete example, you could think of N restaurants with p_i customers at restaurant i, and e_i being the number of customers who complain about the food at restaurant i.)

I want to model this as Poisson with a relative event rate \tau:

e_i \sim Poisson(\tau \cdot p_i),\quad i = 1...N

Now the interesting bit: for privacy reasons, measurements for which p_i < 10 or 0 < e_i < 10 are missing. They are not censored (clipped to 0 or 10); they are just not part of the dataset. Observations with p_i \ge 10 and e_i = 0 are present, though.

So I know if there were 0 complaints among 100 customers, but a restaurant with 7 complaints among 100 customers, or one with 0 complaints among 3 customers, would not appear in the dataset. I don’t know how many sites go unreported for privacy reasons.

Here’s my question: Does this present a censored-data problem that requires special modeling in PyMC, or can this be modeled as a plain and simple Poisson distribution? Asked differently: If I model this with a plain Poisson distribution, am I introducing a bias because of the pre-selection on the measured data?

Thanks a lot!
Eberhard

Sounds like a truncation process: Truncation (statistics) - Wikipedia

We have an example here that shows how not handling truncation can cause bias in the estimates: Bayesian regression with truncated or censored data — PyMC example gallery

Thanks, Ricardo.

I have done a small analysis with synthetic data in the meantime, which also shows a bias when discarding measurements as described.
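For illustration, here’s a stripped-down numpy version of such a check (the thresholds and parameters are just examples); the naive rate estimate sum(e)/sum(p) on the filtered rows comes out too high:

import numpy as np

rng = np.random.default_rng(0)
true_rate = 0.25
p = rng.uniform(low=0, high=25, size=100_000)  # "customers" per site
e = rng.poisson(lam=p * true_rate)             # "complaints" per site

# drop privacy-suppressed rows: keep sites with p >= 4 and (e == 0 or e >= 4)
keep = (p >= 4) & ((e == 0) | (e >= 4))

print("naive estimate, all data: ", e.sum() / p.sum())             # close to 0.25
print("naive estimate, filtered:", e[keep].sum() / p[keep].sum())  # biased upward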

Do you have any pointers for how to model the truncation ‘in the middle’ for e, i.e. how to model measurements for e = 0 and e \ge 4 but not for e = 1, 2, 3?

pm.Truncated can handle truncation from below and/or above. For instance, if you can’t observe values below 1 or above 4, you can set lower=1, upper=4.
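A minimal sketch (current PyMC API; e_obs is a placeholder for your observed counts, all inside the window):

import numpy as np
import pymc as pm

e_obs = np.array([1, 1, 2, 4, 3])  # placeholder data, all within [1, 4]

with pm.Model():
    mu = pm.Exponential("mu", 1)
    # Poisson restricted to the observable window [1, 4]
    pm.Truncated("events", pm.Poisson.dist(mu=mu), lower=1, upper=4, observed=e_obs)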

Truncation in the middle (e.g., values between 1 and 4 can’t be observed) is not supported. You can define this density manually with pm.Potential: you basically have to rescale the untruncated density by 1 - (dist.cdf(4) - dist.cdf(1)). I can help you through that if that’s indeed your case.
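A rough sketch of that approach for your case (gap at {1, 2, 3}; note that for a discrete distribution the gap mass is cdf(3) - cdf(0); e_obs is again placeholder data):

import numpy as np
import pymc as pm

e_obs = np.array([0, 4, 5, 0, 7])  # placeholder counts, none in the gap {1, 2, 3}

with pm.Model():
    mu = pm.Exponential("mu", 1)
    dist = pm.Poisson.dist(mu=mu)

    # probability mass of the unobservable gap: P(1 <= X <= 3) = cdf(3) - cdf(0)
    gap = pm.math.exp(pm.logcdf(dist, 3)) - pm.math.exp(pm.logcdf(dist, 0))

    # renormalized log-likelihood, added to the model's joint density
    pm.Potential("truncated_ll", pm.math.sum(pm.logp(dist, e_obs) - pm.math.log1p(-gap)))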

Thanks for your latest reply - I tried something else in the meantime:

Here’s my attempt at a custom likelihood for the censored data, in sample_censored_correct(). The rest is my toy problem for checking that naive modeling is biased.

"""Toy model for censored data"""

import numpy as np
import pymc3 as pm

true_rate = 0.25
N = 5_000

np.random.seed(45678)

customers = np.random.uniform(low=0, high=100, size=N)
complaints = np.random.poisson(lam=customers * true_rate)


def censor_data():
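    # keep only rows that survive the privacy filter: p >= 10 and (e == 0 or e >= 10)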
    measured = (customers >= 10) & ((complaints == 0) | (complaints >= 10))
    n = customers[measured]
    e = complaints[measured]
    return n, e


# noinspection PyTypeChecker, PyUnresolvedReferences
def do_sample_naive(n, e):
    assert len(n) == len(e)
    print(f"{len(n)} observations used")
    print(n[:20])
    print(e[:20])
    with pm.Model():
        r = pm.Exponential("r", 10)
        pm.Poisson("events", mu=n * r, observed=e)
        trace = pm.sample(2000, tune=1000)
        print(pm.summary(trace))


def sample_censored_wrong():
    n, e = censor_data()
    do_sample_naive(n, e)


def sample_uncensored():
    do_sample_naive(customers, complaints)


# noinspection PyTypeChecker,PyUnresolvedReferences
def sample_censored_correct():
    ns, es = censor_data()

    print(f"{len(ns)} observations used")
    print(ns[:20])
    print(es[:20])

    with pm.Model():
        r = pm.Exponential("r", 10)

        def truncated_poisson_like(e_, n_):
            lbda = n_ * r
            log_like = pm.Poisson.dist(mu=lbda).logp(e_)

            if e_ in [1, 2, 3]:
                return -np.inf  # log(0)
            else:
                # Normalization for the truncated values: subtract probabilities for 1, 2 and 3
                prob_sum = pm.math.exp(-lbda) * (lbda + lbda**2 / 2 + lbda**3 / 6)
                norm_constant = 1 - prob_sum
                return log_like - pm.math.log(norm_constant)

        pm.DensityDist(
            "observed", logp=truncated_poisson_like, observed={"e_": es, "n_": ns}
        )

        trace = pm.sample(2000, tune=1000)
        print(pm.summary(trace))


if __name__ == "__main__":
    print("\n\n🟢 Uncensored:\n")
    sample_uncensored()

    print("\n\n🔴 Censored, wrong:\n")
    sample_censored_wrong()

    print("\n\n🟣 Censored, correct:\n")
    sample_censored_correct()

The issue is that I get an error:

Traceback (most recent call last):
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 603, in save
    self.save_reduce(obj=obj, *rv)
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 717, in save_reduce
    save(state)
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
[... repeating ...]  
File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/site-packages/dill/_dill.py", line 990, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 970, in save_dict
    self.memoize(obj)
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 510, in memoize
    self.write(self.put(idx))
  File "/Users/ehansis/anaconda/envs/pymc3/lib/python3.9/pickle.py", line 515, in put
    if self.proto >= 4:

RecursionError: maximum recursion depth exceeded in comparison

Any ideas on the error (or suggestions on my custom likelihood)? If not, I’ll look into your suggestion.

EDIT: I should mention that I’m on a quite old version of pymc3 (3.11.4), so maybe upgrading that would be a first step…

I would suggest that. The API for DensityDist changed a bit (it’s now called CustomDist as well): pymc.CustomDist — PyMC dev documentation

You have to provide the parameters as unnamed arguments, and can only pass the actual observation to observed. And instead of rv.logp(value) you do pm.logp(rv, value). Your code may look something like:

def sample_censored_correct():
    ns, es = censor_data()

    print(f"{len(ns)} observations used")
    print(ns[:20])
    print(es[:20])

    assert not np.any((es == 1) | (es == 2) | (es == 3)), "Invalid data"

    with pm.Model():
        r = pm.Exponential("r", 10)

        def truncated_poisson_like(e_, n_, r):
            lbda = n_ * r
            log_like = pm.logp(pm.Poisson.dist(mu=lbda), e_)

            # Normalization for the truncated values: subtract probabilities for 1, 2 and 3
            prob_sum = pm.math.exp(-lbda) * (lbda + lbda**2 / 2 + lbda**3 / 6)
            norm_constant = 1 - prob_sum
            return log_like - pm.math.log(norm_constant)

        pm.CustomDist(
            "observed", 
            ns, r,
            logp=truncated_poisson_like, 
            observed=es,
        )

I removed the if/else branch in the log-likelihood function because that won’t work with PyMC variables.
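For reference, the logp above corresponds to the renormalized pmf

P(e) = Poisson(e; \lambda) / (1 - \sum_{k=1}^{3} Poisson(k; \lambda)),\quad e \notin \{1, 2, 3\}

with \lambda = n \cdot r: the probability mass of the unobservable values 1, 2, 3 is redistributed over the values that can actually be observed.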

That seems to work, thanks so much! I’m providing my final toy model here for future reference. (Note that I’m actually dropping e = 1, 2, 3 and not 1...9 as described in the initial problem description.)

"""Toy model for censored data"""

import numpy as np
import pymc as pm

true_rate = 0.25
N = 10_000

np.random.seed(45678)

customers = np.random.uniform(low=0, high=25, size=N)
complaints = np.random.poisson(lam=customers * true_rate)


def censor_data():
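    # keep only rows that survive the (scaled-down) privacy filter: p >= 4 and (e == 0 or e >= 4)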
    measured = (customers >= 4) & ((complaints == 0) | (complaints >= 4))
    n = customers[measured]
    e = complaints[measured]
    return n, e


# noinspection PyTypeChecker, PyUnresolvedReferences
def do_sample_naive(n, e):
    assert len(n) == len(e)
    print(f"{len(n)} observations used")
    print(n[:20])
    print(e[:20])
    with pm.Model():
        r = pm.Exponential("r", 10)
        pm.Poisson("events", mu=n * r, observed=e)
        trace = pm.sample(2000, tune=1000)
        print(pm.summary(trace))


def sample_censored_wrong():
    n, e = censor_data()
    do_sample_naive(n, e)


def sample_uncensored():
    do_sample_naive(customers, complaints)


# noinspection PyTypeChecker,PyUnresolvedReferences
def sample_censored_correct():
    ns, es = censor_data()

    print(f"{len(ns)} observations used, {(es == 0).sum()} times e=0, {(es > 0).sum()} times e>0")
    print(ns[:20])
    print(es[:20])

    assert not np.any((es == 1) | (es == 2) | (es == 3)), "Invalid data"

    with pm.Model():

        def truncated_poisson_like(e_, n_, r_):
            lbda = n_ * r_
            log_like = pm.logp(pm.Poisson.dist(mu=lbda), e_)

            # Normalization for the truncated values: subtract probabilities for 1, 2 and 3
            prob_sum = pm.math.exp(-lbda) * (lbda + lbda**2 / 2 + lbda**3 / 6)
            norm_constant = 1 - prob_sum
            return log_like - pm.math.log(norm_constant)

        r = pm.Exponential("r", 10)
        pm.CustomDist(
            "observed",
            ns,
            r,
            logp=truncated_poisson_like,
            observed=es,
        )

        trace = pm.sample(2000, tune=1000)
        print(pm.summary(trace))


if __name__ == "__main__":
    print("\n\n🟢 Uncensored:\n")
    sample_uncensored()

    print("\n\n🔴 Censored, wrong:\n")
    sample_censored_wrong()

    print("\n\n🟣 Censored, correct:\n")
    sample_censored_correct()

Output:

🟢 Uncensored:

10000 observations used
[19.03993623  6.73949587  4.74468214 13.56126838 23.20609131  9.04828661
 14.07107776 13.43000145 15.95306611 15.43764355 18.42642233  3.9369657
 21.62208019 15.67332437 10.78355162  6.67373619  3.29083908  4.82998659
  6.7086113  11.81542204]
[8 1 2 3 5 1 0 0 6 5 3 0 6 1 1 1 1 3 4 3]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [r]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
[12000/12000 00:01<00:00 Sampling 4 chains, 0 divergences]
   mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
r  0.25  0.001   0.247    0.252        0.0      0.0    3390.0    5606.0    1.0


🔴 Censored, wrong:

4465 observations used
[19.03993623 23.20609131 14.07107776 13.43000145 15.95306611 15.43764355
 21.62208019  6.7086113   8.04924026 19.74827761 19.2941839   8.66510745
 17.02242486 18.85849719  9.80728482 17.10877876 16.47295421 10.88572265
  7.87036873 18.72706509]
[ 8  5  0  0  6  5  6  4  0  7  6  0  8  5  4  6  5  4  4 16]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [r]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 1 seconds.
[12000/12000 00:00<00:00 Sampling 4 chains, 0 divergences]
   mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
r   0.3  0.002   0.296    0.303        0.0      0.0    3033.0    5039.0    1.0


🟣 Censored, correct:

4465 observations used, 572 times e=0, 3893 times e>0
[19.03993623 23.20609131 14.07107776 13.43000145 15.95306611 15.43764355
 21.62208019  6.7086113   8.04924026 19.74827761 19.2941839   8.66510745
 17.02242486 18.85849719  9.80728482 17.10877876 16.47295421 10.88572265
  7.87036873 18.72706509]
[ 8  5  0  0  6  5  6  4  0  7  6  0  8  5  4  6  5  4  4 16]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [r]
Sampling 4 chains for 1_000 tune and 2_000 draw iterations (4_000 + 8_000 draws total) took 2 seconds.
[12000/12000 00:01<00:00 Sampling 4 chains, 0 divergences]
    mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
r  0.252  0.002   0.249    0.256        0.0      0.0    3587.0    5185.0    1.0

This is with pymc 5.9.1
