Hi there,
I’m keen to get some input on a problem I’m trying to solve.
I’m trying to model the conversion rate of searches to bookings when there is a finite resource, e.g., hotel rooms. A simplified way to do this is to group all searches on a given date and model the bookings as a binomial distribution, where n is the number of searches and k is the number of bookings. The problem with this is that you can’t have more bookings than rooms (r), so when sampling from the posterior predictive it’s likely that some predictions will break this constraint.
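For concreteness, here’s a minimal sketch of the baseline binomial version I mean (PyMC; all the data values are made-up placeholders):

```python
import numpy as np
import pymc as pm

# Placeholder data: one entry per date (illustrative values only)
searches = np.array([120, 95, 140, 80])   # n: searches per date
bookings = np.array([10, 7, 12, 9])       # k: bookings per date

with pm.Model() as baseline:
    p = pm.Beta("p", alpha=1, beta=1)  # conversion rate
    pm.Binomial("k", n=searches, p=p, observed=bookings)
    idata = pm.sample()
```

Nothing here knows about the room count r, which is exactly the problem.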
I’m sure this type of problem has been tackled before, so I’m keen to hear if anyone has experience with it. My thinking on a simple solution was to truncate the output of the binomial distribution at a max value of r, but it doesn’t feel very elegant. The more complex alternative would be to model searches individually and sequentially, but this feels like overkill, and the size of the dataset would rapidly expand.
I’ll most likely try a few versions and see which gives the best fit for the use case, but if anyone else has any thoughts I’d happily hear them!
Thanks all!
Sounds like the general topic of counting processes? See Counting process - Wikipedia (https://en.wikipedia.org/wiki/Counting_process).
For your specific case you may consider Bernoulli trials or a truncated likelihood (instead of artificially truncating the data).
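In PyMC the truncated likelihood could look something like the sketch below (untested, with placeholder data; I believe pm.Truncated works with discrete distributions that have a logcdf, like Binomial, but worth double checking in your version):

```python
import numpy as np
import pymc as pm

searches = np.array([120, 95, 140, 80])  # n: searches per date
bookings = np.array([10, 7, 12, 9])      # k: bookings per date
rooms = np.array([15, 15, 20, 10])       # r: room capacity per date

with pm.Model() as truncated_model:
    p = pm.Beta("p", alpha=1, beta=1)
    # Truncate the likelihood itself at r, instead of clipping
    # posterior predictive draws after the fact
    pm.Truncated(
        "k",
        pm.Binomial.dist(n=searches, p=p),
        upper=rooms,
        observed=bookings,
    )
    idata = pm.sample()
```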
If you are just starting, you may also think about how you would simulate the system; that may suggest an initial model.
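For example, a bare-bones simulation of one day could look like this (pure NumPy, all numbers are placeholders):

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_day(n_searches, n_rooms, p_convert):
    """Searches arrive in sequence; each converts with probability
    p_convert as long as rooms remain."""
    rooms_left = n_rooms
    bookings = 0
    for _ in range(n_searches):
        if rooms_left == 0:
            break  # sold out: later searches cannot convert
        if rng.random() < p_convert:
            bookings += 1
            rooms_left -= 1
    return bookings

print(simulate_day(n_searches=200, n_rooms=15, p_convert=0.1))
```

Note that with a constant p this simple version is distributionally just min(Binomial(n, p), r), i.e., the counts are capped (censored) at r rather than truncated, which might itself point you toward a censored likelihood.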
This kind of problem comes up when there’s a mismatch between the generative process you’re assuming in the model and reality. In this case, the problem is that conversions can’t be independent given a fixed number of hotel rooms. Trying to model them as independent with a post-hoc fix like truncating a binomial is probably not what you want in the longer run because the model won’t be generative and won’t be easy to modify. How much do you need to get the time series right? These searches can’t all be happening at once. And won’t prices change as supply diminishes, like with airline reservations?
Predictively, it might not get out of hand to model searches individually and sequentially if it makes sense that the search and decision can be treated as instantaneous (e.g., buying lunch from GrubHub, not buying an apartment from StreetEasy, to use a couple of New York examples). But if you have to estimate the model with the same data, that’s going to be much harder because then you have to marginalize out all the discrete decisions or build some kind of nested Monte Carlo inside of another MCMC process.
I’m afraid I don’t know any of the specifics of any of these counting processes, but it looks like a fun and deep area.
Thanks @ricardoV94 and @bob-carpenter!
This gives me lots of food for thought. I like the idea of including the time/sequence of searches from a modelling standpoint, but implementation with this approach will be far more complex - it needs to scale to 1000s of locations with many 1000s of searches per day. My inkling is to hold off from doing this unless absolutely necessary for the use case.
I had another thought that would be great to get your perspective on. Given I have so many trials for each date, I could approximate the binomial with a normal distribution and truncate its upper bound at r - I think this may be what @ricardoV94 was suggesting?
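Something like this (sketch with the same placeholder data as above; mu and sigma are the usual binomial moments np and sqrt(np(1-p))):

```python
import numpy as np
import pymc as pm

searches = np.array([120, 95, 140, 80])  # n: searches per date
bookings = np.array([10, 7, 12, 9])      # k: bookings per date
rooms = np.array([15, 15, 20, 10])       # r: room capacity per date

with pm.Model() as normal_approx:
    p = pm.Beta("p", alpha=1, beta=1)
    mu = searches * p                             # binomial mean np
    sigma = pm.math.sqrt(searches * p * (1 - p))  # binomial sd
    pm.TruncatedNormal(
        "k", mu=mu, sigma=sigma,
        lower=0, upper=rooms,
        observed=bookings,
    )
    idata = pm.sample()
```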
All thoughts welcome, and thanks again for the initial comments.
Sorry, I got confused about the truncation: I thought you sometimes observed counts > n, i.e., that something else was going on.
Regarding scaling, it may still be useful to play with what you think is the real model, even if it only works for small data sizes. It may give you (and others here) a better idea of how to approximate it differently.