# Posterior inference on a continuous variable with boolean censoring

Hi,
the title for this topic is probably not the best choice, but I couldn’t come up with a better short description. Here is what I am trying to do: I would like to model purchase decisions with a simple generative model of how a customer decides to buy a product:
Assume that a customer visits a store (or a website) with a target product and a highest acceptable price in mind (say `p_limit`). The product is offered at a fixed price (`price`). The customer then decides to buy the product iff `p_limit > price`.

Now, I would like to infer the price limit of the customer given only the information whether they bought the product or not. Obviously, this is a very hard (perhaps impossible) task, and I’m not expecting accurate inferences, but the purchase information gives us at least some information that I’d like to use to update my prior and obtain a posterior estimate.

Here is some code to simulate data:

``````
import pymc as pm
import numpy as np
import altair as alt
import pandas as pd
import arviz as az

N_data = 1000
data = np.random.normal(loc=12.0, scale=2.0, size=N_data)

price = 8.0
df = pd.DataFrame(
    {"price_limit": data, "purchased": data >= price, "price": price}
)
``````

The resulting data looks like (the vertical line being the price):

Now if I had access to the actual price limits, I could build a simple model like this:

``````
with pm.Model() as model_uncensored:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)

    price_limit = pm.Normal(
        "price_limit", mu=mu, sigma=sigma, observed=df["price_limit"]
    )
    trace = pm.sample()
``````

and infer the distribution of price limits. However, now I only have access to the `purchased` column and I would still like to draw some information from this. Intuitively, it’s clear that there is some value in this, for instance, in the simulated data from above, about 97% of the customers actually purchase the product at a price of 8.0, thus only 3% of them have price limits lower than 8.
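
As a quick sanity check of that figure, the expected purchase rate under the simulation settings can be computed directly from the normal CDF. This is a minimal sketch using only the Python standard library; the variable names are illustrative and not part of the model code.

```python
from statistics import NormalDist

# Simulation settings from the snippet above:
# price limits ~ Normal(12, 2), fixed price = 8.
price_limit_dist = NormalDist(mu=12.0, sigma=2.0)
price = 8.0

# A customer buys when their limit is at least the price,
# so P(purchase) = 1 - CDF(price).
p_purchase = 1.0 - price_limit_dist.cdf(price)
print(f"expected purchase rate: {p_purchase:.4f}")
```

The price sits two standard deviations below the mean, so the expected rate is roughly 0.977, in line with the observed share of about 97%.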

But I don’t know how to build a model that can do such inference. I came up with two possible approaches:

• Model the boolean decision `price_limit > price` explicitly as a `pm.Deterministic` variable. However, as has been discussed in several places in this forum, it is not possible to do inference on an observed `pm.Deterministic`.
• Use the `pm.Censored` distribution in some way. This notebook explains how to use it. But in our case we don’t really have censored data as in that notebook, where all values below or above some threshold are mapped to that threshold. Rather, we are dealing with the limiting case where `lower == upper` in the censored distribution, but building a model like this leads to an error:
``````
with pm.Model() as model_censored:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)

    price_limit = pm.Normal.dist(mu=mu, sigma=sigma)
    purchased = pm.Censored(
        "purchased",
        price_limit,
        lower=price,
        upper=price,
        observed=df["purchased"],
    )
    trace = pm.sample()
``````

yields

``````SamplingError                             Traceback (most recent call last)
Cell In[48], line 13
5 price_limit = pm.Normal.dist(mu=mu, sigma=sigma)
6 purchased = pm.Censored(
7     "purchased",
8     price_limit,
(...)
11     observed=df["purchased"],
12 )
---> 13 trace = pm.sample()

File /path/to/pymc/sampling/mcmc.py:481, in sample(draws, step, init, n_init, initvals, trace, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, keep_warning_stat, idata_kwargs, mp_ctx, **kwargs)
479 # One final check that shapes and logps at the starting points are okay.
480 for ip in initial_points:
--> 481     model.check_start_vals(ip)
482     _check_start_shape(model, ip)
484 sample_args = {
485     "draws": draws,
486     "step": step,
(...)
496 }

File /path/to/pymc/model.py:1735, in Model.check_start_vals(self, start)
...
Starting values:
{'mu': array(10.30551444), 'sigma_log__': array(0.08593486)}

Initial evaluation results:
{'mu': -2.31, 'sigma': -1.0, 'purchased': -inf}
``````

Any ideas how I could solve this problem?

If you didn’t observe when users don’t purchase, I would try something like this:

``````
with pm.Model() as m:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)

    price_limit = pm.Normal("price_limit", 8, 2)

    obs = pm.Truncated(
        "obs",
        pm.Normal.dist(mu=mu, sigma=sigma),
        lower=price_limit,
        upper=None,
        observed=data[data >= price],
    )
``````

Now, if you know whether users purchased or not, can’t you just do a logistic regression?


Thank you for your reply. The problem in my case is that we don’t have access to the data of price limits (the `data` variable), we can only use the `purchased` column in `df`.

However, I thought about it a bit more and came up with a solution (I believe). I realized that the `purchased` data essentially follows a Bernoulli distribution whose parameter `p` depends on the parameters of the Gaussian: `p = 1 - CDF(price; mu, sigma)`. (The probability of a purchase event depends on whether the Gaussian-distributed price limit falls to the left or to the right of the price.)
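
To make that likelihood concrete, here is a minimal standard-library sketch of the Bernoulli log-likelihood described above. The function name `purchase_log_likelihood` and the toy outcome list are assumptions made for illustration; they are not part of the PyMC model.

```python
import math
from statistics import NormalDist


def purchase_log_likelihood(purchased, price, mu, sigma):
    """Log-likelihood of boolean purchases given Normal(mu, sigma) price limits."""
    # P(purchase) = P(price_limit >= price) = 1 - CDF(price; mu, sigma)
    p = 1.0 - NormalDist(mu=mu, sigma=sigma).cdf(price)
    n_bought = sum(bool(x) for x in purchased)
    n_not_bought = len(purchased) - n_bought
    # Bernoulli log-likelihood summed over independent customers
    return n_bought * math.log(p) + n_not_bought * math.log1p(-p)


# Hypothetical data: 97 purchases and 3 non-purchases at a price of 8.0
outcomes = [True] * 97 + [False] * 3
ll_good = purchase_log_likelihood(outcomes, price=8.0, mu=12.0, sigma=2.0)
ll_bad = purchase_log_likelihood(outcomes, price=8.0, mu=8.0, sigma=2.0)
print(f"mu=12: {ll_good:.2f}   mu=8: {ll_bad:.2f}")
```

Parameters that imply a purchase probability close to the observed share get a much higher log-likelihood, which is all the information the boolean data can contribute.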

I implemented a custom distribution as follows:

``````
from typing import List, Tuple

import aesara.tensor as at
import numpy as np
import scipy.stats
from aesara.tensor.random.basic import ScipyRandomVariable
from aesara.tensor.var import TensorVariable
from pymc.distributions.dist_math import check_parameters
from pymc.distributions.shape_utils import rv_size_is_none


# The names `MaskedNormalRV`, `MaskedNormal` and the `name` attribute are
# assumptions; the thread only shows the class bodies.
class MaskedNormalRV(ScipyRandomVariable):
    name: str = "masked_normal_rv"
    ndim_supp: int = 0
    ndims_params: List[int] = [0, 0, 0]

    dtype = "int64"

    # A pretty text and LaTeX representation for the RV
    _print_name: Tuple[str, str] = ("masked_normal_rv", "\\operatorname{mn}")

    # If you want to add a custom signature and default values for the
    # parameters, do it like this. Otherwise this can be left out.
    def __call__(
        self, split=0.0, loc=0.0, scale=1.0, **kwargs
    ) -> TensorVariable:
        return super().__call__(split, loc, scale, **kwargs)

    @classmethod
    def rng_fn_scipy(
        cls,
        rng: np.random.RandomState,
        split: np.ndarray,
        loc: np.ndarray,
        scale: np.ndarray,
        size: Tuple[int, ...],
    ) -> np.ndarray:
        # Draw purchases with p = P(price_limit > split)
        p = 1.0 - scipy.stats.norm.cdf(loc=loc, scale=scale, x=split)
        return scipy.stats.bernoulli.rvs(p, size=size, random_state=rng)


masked_normal_rv = MaskedNormalRV()


def normal_cdf(mu, sigma, x):
    """Compute the cumulative distribution function of the normal."""
    z = (x - mu) / sigma
    return 1 / 2.0 * (1 + at.erf(z / at.sqrt(2.0)))


class MaskedNormal(pm.Discrete):
    rv_op = masked_normal_rv

    @classmethod
    def dist(cls, split=0.0, mu=0.0, sigma=1.0, *args, **kwargs):
        return super().dist([split, mu, sigma], **kwargs)

    def moment(rv, size, split, mu, sigma):
        p = 1.0 - normal_cdf(mu=mu, sigma=sigma, x=split)
        if not rv_size_is_none(size):
            p = at.full(size, p)
        return at.switch(p < 0.5, 0, 1)

    def logp(value, split, mu, sigma):
        # Bernoulli log-probability with p = 1 - CDF(split; mu, sigma)
        p = 1.0 - normal_cdf(mu=mu, sigma=sigma, x=split)
        res = at.switch(
            at.or_(at.lt(value, 0), at.gt(value, 1)),
            -np.inf,
            at.switch(value, at.log(p), at.log1p(-p)),
        )

        return check_parameters(res, p >= 0, p <= 1, msg="0 <= p <= 1")

    def logcdf(value, split, mu, sigma):
        p = 1 - normal_cdf(mu=mu, sigma=sigma, x=split)
        res = at.switch(
            at.lt(value, 0),
            -np.inf,
            at.switch(
                at.lt(value, 1),
                at.log1p(-p),
                0,
            ),
        )
        return check_parameters(res, 0 <= p, p <= 1, msg="0 <= p <= 1")
``````

and then use this in the model as follows:

``````
with pm.Model() as model_censored:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)
    price = pm.Normal("price", mu=8.0, sigma=0.1, observed=df["price"])
    # `MaskedNormal` is an assumed name for the custom distribution class above.
    purchased = MaskedNormal(
        "purchased",
        split=price,
        mu=mu,
        sigma=sigma,
        observed=df["purchased"],
    )
    prior = pm.sample_prior_predictive()
    trace = pm.sample(tune=4000, target_accept=0.999)
    trace.extend(prior)
``````

There are probably more concise and elegant ways to implement this. In any case, the result looks like this:

This looks quite good, I believe. The `mu` parameter is accurately estimated, but the inference can’t tell much about `sigma`, which is to be expected.

Yes, that’s what I was thinking of with the logistic regression. You don’t need to create an RV for that, though (one rarely does; PyMC tries to give you all the building blocks you would need):

``````
from pymc.distributions.dist_math import normal_lccdf

with pm.Model() as m:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)
    price = pm.Normal("price", mu=8.0, sigma=0.1, observed=df["price"])
    p_purchase = normal_lccdf(mu, sigma, price)
    purchased = pm.Bernoulli(
        "purchased",
        p=p_purchase,
        observed=df["purchased"],
    )
``````

Hi, thanks a lot for the proposed solution, this is obviously much more concise and better.
Just one remark: I think the `p_purchase` needs to be exponentiated before being fed into the Bernoulli distribution, like so:

``````
import aesara.tensor as at
from pymc.distributions.dist_math import normal_lccdf

with pm.Model() as model_censored2:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)
    price = pm.Normal("price", mu=8.0, sigma=0.1, observed=df["price"])
    # normal_lccdf returns log P(price_limit > price), so exponentiate it
    log_p_purchase = normal_lccdf(mu, sigma, price)
    purchased = pm.Bernoulli(
        "purchased",
        p=at.exp(log_p_purchase),
        observed=df["purchased"],
    )
``````

When sampling from this model for an example where `mu != price` (in my plot above, I had set both to 8.0), I noticed that the posterior samples for `mu` and `sigma` are highly correlated. That is to be expected, since the two cannot be disentangled from purchase data at a single price: the further `mu` deviates from the price, the larger `sigma` has to be to yield the same `p_purchase`.
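
A small standard-library sketch of why the two parameters trade off against each other: the purchase probability depends on them only through `(mu - price) / sigma`, so any pair with the same ratio gives the same `p_purchase`. The helper `p_purchase` and the parameter pairs below are illustrative assumptions, not output from the sampler.

```python
from statistics import NormalDist


def p_purchase(mu, sigma, price):
    """P(price_limit >= price) for Normal(mu, sigma) price limits."""
    return 1.0 - NormalDist(mu=mu, sigma=sigma).cdf(price)


price = 8.0
# Each pair keeps (mu - price) / sigma fixed at 2.
pairs = [(10.0, 1.0), (12.0, 2.0), (16.0, 4.0)]
probs = [p_purchase(mu, sigma, price) for mu, sigma in pairs]
for (mu, sigma), p in zip(pairs, probs):
    print(f"mu={mu:5.1f}  sigma={sigma:4.1f}  p_purchase={p:.4f}")
```

All three pairs produce the same purchase probability, so data observed at one price can only constrain that ratio; any separation of `mu` and `sigma` in the posterior comes from the priors.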

Thanks a lot for your help.
