Posterior inference on a continuous variable with boolean censoring

Hi,
the title for this topic is probably not the best choice, but I couldn’t come up with a better short description. So, here is what I am trying to do: I would like to model purchase decisions with a simple generative model of how a customer decides to buy a product:
Assume that a customer visits a store (or a website) with a target product and a highest acceptable price in mind (say p_limit). The product is offered at a fixed price (price). The customer then decides to buy the product iff p_limit > price.

Now, I would like to infer the price limit of the customer given only the information whether they bought the product or not. Obviously, this is a very hard (perhaps impossible) task, and I’m not expecting accurate inferences, but the purchase decisions do give us at least some information, which I’d like to use to update my prior and get a posterior estimate.

Here is some code to simulate data:

import pymc as pm 
import numpy as np
import altair as alt
import pandas as pd
import arviz as az

N_data = 1000
data = np.random.normal(loc=12.0, scale=2.0, size=N_data)

price = 8.0
df = pd.DataFrame(
    {"price_limit": data, "purchased": data >= price, "price": price}
)

The resulting data looks like this (the vertical line being the price):

[histogram of the simulated price limits, with a vertical line at the price]

Now if I had access to the actual price limits, I could build a simple model like this:

with pm.Model() as model_uncensored:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)

    price_limit = pm.Normal(
        "price_limit", mu=mu, sigma=sigma, observed=df["price_limit"]
    )
    trace = pm.sample()

and infer the distribution of price limits. However, now I only have access to the purchased column and I would still like to draw some information from this. Intuitively, it’s clear that there is some value in this, for instance, in the simulated data from above, about 97% of the customers actually purchase the product at a price of 8.0, thus only 3% of them have price limits lower than 8.
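
As a quick sanity check of that intuition (using the true simulation parameters, which of course would not be available at inference time), the empirical purchase rate should be close to 1 - CDF(price; mu, sigma):

import scipy.stats

empirical_rate = df["purchased"].mean()  # fraction of simulated customers who buy
theoretical_rate = 1.0 - scipy.stats.norm.cdf(price, loc=12.0, scale=2.0)  # P(p_limit >= price)
print(empirical_rate, theoretical_rate)  # both roughly 0.97-0.98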

But I don’t know how to build a model that can do such inference. I came up with two possible approaches:

  • Model the boolean decision price_limit > price explicitly as a pm.Deterministic variable. However, as discussed in several places on this forum, it is not possible to do inference on (i.e. observe) a pm.Deterministic.
  • Use the pm.Censored distribution in some way. This notebook explains how to use it. But in our case, we don’t really have censored data as in that notebook, where all values below/above some threshold are mapped to that threshold. Or rather, we are dealing with the limit case where lower == upper in the censored distribution, and building a model like this leads to an error:
with pm.Model() as model_censored:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)

    price_limit = pm.Normal.dist(mu=mu, sigma=sigma)
    purchased = pm.Censored(
        "purchased",
        price_limit,
        lower=price,
        upper=price,
        observed=df["purchased"],
    )
    trace = pm.sample()

yields

SamplingError                             Traceback (most recent call last)
Cell In[48], line 13
      5 price_limit = pm.Normal.dist(mu=mu, sigma=sigma)
      6 purchased = pm.Censored(
      7     "purchased",
      8     price_limit,
   (...)
     11     observed=df["purchased"],
     12 )
---> 13 trace = pm.sample()

File /path/to/pymc/sampling/mcmc.py:481, in sample(draws, step, init, n_init, initvals, trace, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, keep_warning_stat, idata_kwargs, mp_ctx, **kwargs)
    479 # One final check that shapes and logps at the starting points are okay.
    480 for ip in initial_points:
--> 481     model.check_start_vals(ip)
    482     _check_start_shape(model, ip)
    484 sample_args = {
    485     "draws": draws,
    486     "step": step,
   (...)
    495     "discard_tuned_samples": discard_tuned_samples,
    496 }

File /path/to/pymc/model.py:1735, in Model.check_start_vals(self, start)
...
Starting values:
{'mu': array(10.30551444), 'sigma_log__': array(0.08593486)}

Initial evaluation results:
{'mu': -2.31, 'sigma': -1.0, 'purchased': -inf}

I suspect the -inf comes from the observed values being booleans (0/1), which lie outside the support of a distribution censored to the single point lower == upper == price. Any ideas how I could solve this problem?

If you didn’t observe when users don’t purchase, I would try something like this:

with pm.Model() as m:
  mu = pm.Normal("mu", mu=10.0, sigma=4.0)
  sigma = pm.Exponential("sigma", 1.0)

  price_limit = pm.Normal("price_limit", 8, 2)

  obs = pm.Truncated(
    "obs", 
    pm.Normal.dist(mu=mu, sigma=sigma), 
    lower=price_limit, 
    upper=None,
    observed=data[data >= price],
)

Now if you know whether users purchase or not, can’t you just do a logistic regression?


Thank you for your reply. The problem in my case is that we don’t have access to the data of price limits (the data variable), we can only use the purchased column in df.

However, I thought about it a bit more and came up with a solution (I believe). I realized that the purchased data is essentially Bernoulli-distributed, with a parameter p that depends on the parameters of the Gaussian: p = 1 - CDF(price; mu, sigma). (The likelihood of a purchase event depends on whether the Gaussian-distributed price limit falls to the left or to the right of the price.)

I implemented a custom distribution as follows:

from pymc.distributions.dist_math import check_parameters
from pymc.distributions.shape_utils import rv_size_is_none

import scipy.stats
import aesara.tensor as at  # used by the logp/logcdf helpers below
from aesara.tensor.var import TensorVariable
from aesara.tensor.random.op import RandomVariable
from aesara.tensor.random.basic import ScipyRandomVariable
from typing import List, Tuple

class MaskedNormalRV(ScipyRandomVariable):
    name: str = "masked_normal_rv"

    ndim_supp: int = 0
    ndims_params: List[int] = [0, 0, 0]

    dtype = "int64"

    # A pretty text and LaTeX representation for the RV
    _print_name: Tuple[str, str] = ("masked_normal_rv", "\\operatorname{mn}")

    # If you want to add a custom signature and default values for the
    # parameters, do it like this. Otherwise this can be left out.
    def __call__(
        self, split=0.0, loc=0.0, scale=1.0, **kwargs
    ) -> TensorVariable:
        return super().__call__(split, loc, scale, **kwargs)

    @classmethod
    def rng_fn_scipy(
        cls,
        rng: np.random.RandomState,
        split: np.ndarray,
        loc: np.ndarray,
        scale: np.ndarray,
        size: Tuple[int, ...],
    ) -> np.ndarray:
        p = 1.0 - scipy.stats.norm.cdf(loc=loc, scale=scale, x=split)
        return scipy.stats.bernoulli.rvs(p, size=size, random_state=rng)


maskednormal = MaskedNormalRV()

def normal_cdf(mu, sigma, x):
    """Compute the cumulative density function of the normal."""
    z = (x - mu) / sigma
    return 1 / 2.0 * (1 + at.erf(z / at.sqrt(2.0)))

class MaskedNormal(pm.Discrete):
    """Bernoulli-like variable that is 1 iff a latent Normal(mu, sigma) exceeds `split`."""

    rv_op = maskednormal

    @classmethod
    def dist(cls, split=0.0, mu=0.0, sigma=1.0, *args, **kwargs):
        p = 1.0 - normal_cdf(mu=mu, sigma=sigma, x=split)
        return super().dist([split, mu, sigma], **kwargs)

    def moment(rv, size, split, mu, sigma):
        p = 1.0 - normal_cdf(mu=mu, sigma=sigma, x=split)
        if not rv_size_is_none(size):
            p = at.full(size, p)
        return at.switch(p < 0.5, 0, 1)

    def logp(value, split, mu, sigma):
        p = 1.0 - normal_cdf(mu=mu, sigma=sigma, x=split)
        res = at.switch(
            at.or_(at.lt(value, 0), at.gt(value, 1)),
            -np.inf,
            at.switch(value, at.log(p), at.log1p(-p)),
        )

        return check_parameters(res, p >= 0, p <= 1, msg="0 <= p <= 1")

    def logcdf(value, split, mu, sigma):
        p = 1 - normal_cdf(mu=mu, sigma=sigma, x=split)
        res = at.switch(
            at.lt(value, 0),
            -np.inf,
            at.switch(
                at.lt(value, 1),
                at.log1p(-p),
                0,
            ),
        )
        return check_parameters(res, 0 <= p, p <= 1, msg="0 <= p <= 1")
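
As a quick sanity check of the forward sampling (assuming a PyMC version that provides pm.draw), one can compare the simulated purchase rate against the analytical probability:

rv_check = MaskedNormal.dist(split=8.0, mu=12.0, sigma=2.0, size=1000)
print(pm.draw(rv_check).mean())  # should be close to 1 - CDF(8; 12, 2) ≈ 0.977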

and then use this in the model as follows:

with pm.Model() as model_censored:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)
    price = pm.Normal("price", mu=8.0, sigma=0.1, observed=df["price"])
    purchased = MaskedNormal(
        "purchased",
        split=price,
        mu=mu,
        sigma=sigma,
        observed=df["purchased"],
    )
    prior = pm.sample_prior_predictive()
    trace = pm.sample(tune=4000, target_accept=0.999)
    trace.extend(prior)
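
To inspect the posterior, the usual arviz calls can be used, for example:

az.plot_posterior(trace, var_names=["mu", "sigma"])
az.summary(trace, var_names=["mu", "sigma"])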

There are probably more concise and elegant ways to implement this. In any case, the result looks like this:

[posterior plot of mu and sigma]

This looks quite good, I believe. The mu parameter is accurately estimated, but the inference can’t tell much about sigma, which is to be expected.

Yes, that’s what I was thinking with the logistic regression. You don’t need to create a custom RV for that though (one rarely does; PyMC tries to give you all the building blocks you would need):

from pymc.distributions.dist_math import normal_lccdf

with pm.Model() as m:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)
    price = pm.Normal("price", mu=8.0, sigma=0.1, observed=df["price"])
    p_purchase = normal_lccdf(mu, sigma, price)
    purchased = pm.Bernoulli(
        "purchased",
        p=p_purchase,
        observed=df["purchased"],
    )

Hi, thanks a lot for the proposed solution, this is obviously much more concise and better.
Just one remark: I think the p_purchase needs to be exponentiated before being fed into the Bernoulli distribution, like so:

from pymc.distributions.dist_math import normal_lccdf
import aesara.tensor as at

with pm.Model() as model_censored2:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)
    price = pm.Normal("price", mu=8.0, sigma=0.1, observed=df["price"])
    log_p_purchase = normal_lccdf(mu, sigma, price)
    purchased = pm.Bernoulli(
        "purchased",
        p=at.exp(log_p_purchase),
        observed=df["purchased"],
    )
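
For completeness, sampling from this corrected model then works as usual (idata is just my name for the result here):

with model_censored2:
    idata = pm.sample()

az.summary(idata, var_names=["mu", "sigma"])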

When sampling from this model for an example where mu != price (in my plot above, I had set both to 8.0), I noticed that the posterior samples for mu and sigma are highly correlated. But that is to be expected, since they cannot be disentangled in this process (the further mu is from the price, the larger sigma has to be to yield the same p_purchase).
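
A pair plot (using the idata from the snippet above) makes that correlation easy to see:

az.plot_pair(idata, var_names=["mu", "sigma"])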

Thanks a lot for your help.
