Hi,
the title for this topic is probably not the best choice, but I couldn’t come up with a better short description. So, here is what I am trying to do: I would like to model purchase decisions with a simple generative model of how a customer decides to buy a product:
Assume that a customer visits a store (or a website) with a target product and a highest acceptable price in mind (say p_limit). The product is offered at a fixed price (price). The customer then decides to buy the product iff p_limit > price.
Now, I would like to infer the price limit of the customer given only the information whether they bought the product or not. Obviously, this is a very hard (perhaps impossible) task, and I’m not expecting accurate inferences, but the purchase information gives us at least some information, which I’d like to use to update my prior and get a posterior estimate.
Here is some code to simulate data:
import pymc as pm
import numpy as np
import altair as alt
import pandas as pd
import arviz as az
N_data = 1000
data = np.random.normal(loc=12.0, scale=2.0, size=N_data)
price = 8.0
df = pd.DataFrame(
    {"price_limit": data, "purchased": data >= price, "price": price}
)
The resulting data looks like (the vertical line being the price):
Now if I had access to the actual price limits, I could build a simple model like this:
with pm.Model() as model_uncensored:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)
    price_limit = pm.Normal(
        "price_limit", mu=mu, sigma=sigma, observed=df["price_limit"]
    )
    trace = pm.sample()
and infer the distribution of price limits. However, now I only have access to the purchased column and I would still like to draw some information from this. Intuitively, it’s clear that there is some value in this, for instance, in the simulated data from above, about 97% of the customers actually purchase the product at a price of 8.0, thus only 3% of them have price limits lower than 8.
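To make that intuition concrete, here is a quick check of the purchase probability implied by the simulation parameters (mean 12.0, standard deviation 2.0, price 8.0). It is only a sketch using the standard library; normal_sf is a hypothetical helper name, not part of PyMC.

```python
import math


def normal_sf(x, mu, sigma):
    """Survival function P(X > x) for X ~ Normal(mu, sigma)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))


# Parameters taken from the simulation code above.
p_purchase = normal_sf(8.0, mu=12.0, sigma=2.0)
print(f"P(price_limit > price) = {p_purchase:.4f}")
```

The price sits two standard deviations below the mean, so the result is close to 0.977, in line with the observed purchase share.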
But I don’t know how to build a model that can do such inference. I came up with two possible approaches:
- Model the boolean decision price_limit > price explicitly as a pm.Deterministic variable. However, as has been discussed in several places in this forum, it is not possible to do inference on pm.Deterministic.
- Use the pm.Censored distribution in some way. This notebook explains how to use it. But in our case, we don’t really have censored data as in that notebook, where all values below and above some threshold are mapped to that threshold. Or rather, we are dealing with the limit case where lower == upper in the censored distribution, but building a model like this leads to an error:
with pm.Model() as model_censored:
    mu = pm.Normal("mu", mu=10.0, sigma=4.0)
    sigma = pm.Exponential("sigma", 1.0)
    price_limit = pm.Normal.dist(mu=mu, sigma=sigma)
    purchased = pm.Censored(
        "purchased",
        price_limit,
        lower=price,
        upper=price,
        observed=df["purchased"],
    )
    trace = pm.sample()
yields
SamplingError Traceback (most recent call last)
Cell In[48], line 13
5 price_limit = pm.Normal.dist(mu=mu, sigma=sigma)
6 purchased = pm.Censored(
7 "purchased",
8 price_limit,
(...)
11 observed=df["purchased"],
12 )
---> 13 trace = pm.sample()
File /path/to/pymc/sampling/mcmc.py:481, in sample(draws, step, init, n_init, initvals, trace, chains, cores, tune, progressbar, model, random_seed, discard_tuned_samples, compute_convergence_checks, callback, jitter_max_retries, return_inferencedata, keep_warning_stat, idata_kwargs, mp_ctx, **kwargs)
479 # One final check that shapes and logps at the starting points are okay.
480 for ip in initial_points:
--> 481 model.check_start_vals(ip)
482 _check_start_shape(model, ip)
484 sample_args = {
485 "draws": draws,
486 "step": step,
(...)
495 "discard_tuned_samples": discard_tuned_samples,
496 }
File /path/to/pymc/model.py:1735, in Model.check_start_vals(self, start)
...
Starting values:
{'mu': array(10.30551444), 'sigma_log__': array(0.08593486)}
Initial evaluation results:
{'mu': -2.31, 'sigma': -1.0, 'purchased': -inf}
Any ideas how I could solve this problem?
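A plausible reading of the -inf is that the observed 0/1 values lie outside the interval [lower, upper] = [8, 8], so the censored likelihood gives them zero probability. One possible direction instead is to treat each purchase as a Bernoulli outcome with success probability P(p_limit > price), which for a normal price limit depends only on (mu - price) / sigma. The sketch below evaluates that log-likelihood in plain Python, without PyMC. The function names and the purchase counts are hypothetical and chosen only for illustration; this is not a tested solution.

```python
import math


def purchase_prob(mu, sigma, price):
    """P(price_limit > price) for price_limit ~ Normal(mu, sigma)."""
    return 0.5 * math.erfc((price - mu) / (sigma * math.sqrt(2.0)))


def purchase_loglik(mu, sigma, price, n_bought, n_total):
    """Bernoulli log-likelihood of n_bought purchases out of n_total visits."""
    p = purchase_prob(mu, sigma, price)
    return n_bought * math.log(p) + (n_total - n_bought) * math.log1p(-p)


# Hypothetical counts, roughly matching the simulated purchase share.
price, n_bought, n_total = 8.0, 977, 1000

# Parameters used in the simulation: (mu - price) / sigma = 2.
ll_true = purchase_loglik(12.0, 2.0, price, n_bought, n_total)
# Different mu and sigma, but the same ratio (mu - price) / sigma = 2.
ll_same_ratio = purchase_loglik(10.0, 1.0, price, n_bought, n_total)
# A different ratio, (mu - price) / sigma = 0.5.
ll_other = purchase_loglik(9.0, 2.0, price, n_bought, n_total)

print(ll_true, ll_same_ratio, ll_other)
```

The first two log-likelihoods coincide, which illustrates a caveat: with a single fixed price, the data constrain only the ratio (mu - price) / sigma, so mu and sigma are separated by the priors alone. In PyMC the same likelihood could presumably be written as an observed pm.Bernoulli variable with this probability, though that is an assumption and has not been verified here.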