How to model without access to individual observations?

I have an interesting situation where I don’t have access to individual observations - only summary statistics about those observations. Most of the examples and tutorials I’ve seen build models off a significant sample of observations, so I’m curious, how would you go about handling this scenario with PyMC?

To give you a more concrete scenario, I’d love to model the lifetime value of some customers in an e-commerce-like setting. I can compute and utilize any statistics about those customers, but I can’t export data about individual customers. So for instance, I can compute their average lifetime value, the standard deviation, etc. I’m interested in modeling how different behaviors (and aspects of the platform) contribute to or lead to lifetime values.

The first thing I tried was essentially using the summary statistics as observations - this felt immediately wrong and didn’t work:

import pymc as pm

# These numbers are computed in a separate system based on real observations
observed_mean = ....
observed_std = ....
with pm.Model() as model:
    component_one_ltv = ....  # Some modeling here
    component_two_ltv = ....  # Some modeling here

    ltv = pm.Deterministic('lv', component_one_ltv + component_two_ltv)

    # Treats the single summary mean as if it were one raw observation
    observed_ltv = pm.Normal('obs', mu=ltv, sigma=observed_std, observed=[observed_mean])

    trace = pm.sample(1000)

The other thing I’m beginning to experiment with is generating synthetic observations based on the summary statistics.

Any other ideas, tips, or guidance on how to handle this situation?

Thank you so much!


Generating synthetic observations from the summary statistics would be pretty wasteful, and wouldn’t give the model any information beyond what the statistics already contain.

Modelling based on summary statistics is not that uncommon. You just need to find the right likelihood, which is trickier. A very clear example of this is the eight schools example, where the observations are the estimated effect for each school and its standard error: Model comparison (PyMC 5.10.3 documentation). In that case the likelihood is easy, just a Normal :)
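To make the "right likelihood" point concrete, here is a minimal sketch in plain Python, assuming (purely for illustration) that individual LTVs are i.i.d. Normal. Real LTV data is usually skewed, so this is not a recommendation for the model itself. The function names and the numbers are made up. It shows that the log-likelihood computed from only the count, sample mean, and sample standard deviation is identical to the one computed from the raw observations, which is why synthetic observations add nothing.

```python
import math
import random
import statistics


def normal_loglik_raw(xs, mu, sigma):
    """Log-likelihood of i.i.d. Normal(mu, sigma) data from individual observations."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)
        for x in xs
    )


def normal_loglik_summary(n, mean, std, mu, sigma):
    """Same log-likelihood from n, the sample mean, and the sample std (n - 1 denominator)."""
    # sum((x - mu)^2) = (n - 1) * std^2 + n * (mean - mu)^2
    sum_sq = (n - 1) * std**2 + n * (mean - mu) ** 2
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - sum_sq / (2 * sigma**2)


# Hypothetical "individual" data that, in the real setting, could not be exported
random.seed(0)
xs = [random.gauss(120.0, 35.0) for _ in range(500)]

# The only things that would actually leave the source system
n, mean, std = len(xs), statistics.fmean(xs), statistics.stdev(xs)

raw = normal_loglik_raw(xs, mu=110.0, sigma=40.0)
summary = normal_loglik_summary(n, mean, std, mu=110.0, sigma=40.0)
print(raw, summary)  # the two values agree up to floating-point error
```

Under the same Normal assumption, the part of this likelihood that involves the mean is a Normal centred on mu with standard error sigma / sqrt(n), which is the form the eight schools model uses. That also suggests why passing the raw standard deviation as sigma for a single observed mean does not behave as intended.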

Many CLV models actually use summary statistics for efficiency reasons. The classical papers usually explain how to go from individual data and assumptions to a likelihood that relies solely on summary statistics. Example:

I mainly wanted to address the big-picture question, but you may also want to explore some of the pre-packaged models in the CLV submodule of pymc-marketing: pymc_marketing.clv (pymc-marketing 0.3.2 documentation)

In my professional domain (macroeconomics), this situation is quite common. The resolution is to use a structural model – write down (parameterized) assumptions about the behavior of the agents in the model, then use those parameters to make the moments of the distributions output by the models match those observed in the data.

So for example, if I was working on this problem I’d write down a utility function and budget constraint (maybe just a random endowment?) for the users, then I’d think about a “search function” that transforms “inputs” (time from the user, structural elements of the platform on your side) into successful sales. You could solve that system for market clearing conditions (conditional on the actual price or supply data you have), then use those first-order conditions as the basis of a model.

Since your data are moments, you could use GMM (the generalized method of moments) to calibrate the model instead of doing full Bayes. Of course, you could do full Bayes too :)
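As a rough illustration of moment matching, here is the simplest possible case: two parameters, two moments, solved in closed form. This is plain method of moments rather than full GMM (there is no weighting matrix and no over-identification), and the lognormal distribution stands in for a real structural model purely as an assumption. The function names and the observed values are hypothetical.

```python
import math


def lognormal_from_moments(mean, std):
    """Return (mu, sigma) of a lognormal whose mean and std match the given moments."""
    sigma2 = math.log(1.0 + (std / mean) ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)


def lognormal_moments(mu, sigma):
    """Return the (mean, std) implied by lognormal parameters (mu, sigma)."""
    mean = math.exp(mu + 0.5 * sigma**2)
    var = (math.exp(sigma**2) - 1.0) * math.exp(2.0 * mu + sigma**2)
    return mean, math.sqrt(var)


# Made-up summary statistics, standing in for the ones computed in the source system
observed_mean, observed_std = 120.0, 90.0

mu, sigma = lognormal_from_moments(observed_mean, observed_std)
fitted_mean, fitted_std = lognormal_moments(mu, sigma)
print(mu, sigma, fitted_mean, fitted_std)
```

With a richer model there are typically more moments than parameters and no closed form, so the parameters would instead be chosen to minimize a weighted distance between model-implied and observed moments.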