I’m working on a project using a Bayesian multilevel logistic regression model, and I have some confusion about the concepts of “parameter estimation” and “prediction”, and about how to get “out-of-sample predictions” in the Bayesian logistic regression mindset.
My particular problem is as follows. I have some data on the purchase behavior of customers for some products. The products have a hierarchical structure (for example, they can be divided into “tops/bottoms”, and the tops can be further divided into “short sleeve/long sleeve” and different sizes). Now I want to look at the historical data and predict the probability of each type of product being purchased in the future. My idea is to build a hierarchical logistic regression in pymc3 over the individual transactions, so that I can get a posterior distribution for the parameters at every level of the grouping.
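To make this concrete, here is a rough sketch of the kind of model I have in mind. All names and the toy data below are made up purely to illustrate the two-level structure (products nested in groups); the real model would of course use the actual transaction data.

```python
import numpy as np
import pymc3 as pm

# toy stand-in data: which product each transaction involves, which group
# (tops/bottoms) that product belongs to, and whether the transaction converted
n_groups, n_products, n_transactions = 2, 10, 500
group_of_product = np.random.randint(0, n_groups, size=n_products)
product_idx = np.random.randint(0, n_products, size=n_transactions)
group_idx = group_of_product[product_idx]
y = np.random.binomial(1, 0.3, size=n_transactions)

with pm.Model() as model:
    # second level: group (tops/bottoms) intercepts on the logit scale
    mu_group = pm.Normal('mu_group', mu=0., sd=1., shape=n_groups)
    sigma_product = pm.HalfNormal('sigma_product', sd=1.)
    # first level: product intercepts, partially pooled towards their group mean
    a_product = pm.Normal('a_product', mu=mu_group[group_of_product],
                          sd=sigma_product, shape=n_products)
    # per-transaction conversion probability and Bernoulli likelihood
    p = pm.math.sigmoid(a_product[product_idx])
    pm.Bernoulli('y', p=p, observed=y)
    trace = pm.sample(1000, tune=1000)
```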
Now the question is: if I want to estimate the conversion probability for each group of products, should I use the group-level parameter, or is it also okay to use “sample_ppc” to generate predictions for each transaction and take the average? The reason I want to choose the second approach is that sometimes we want to group different types of products together and estimate the overall conversion probability for the merged group, and it does not seem trivial to me to combine two group-level parameters directly.
Another question: if I’m going to use the individual predictions to generate the predicted conversion probability for the whole group, I would like to take the average of the predicted probabilities instead of the Bernoulli (0/1) predictions. It seems that from the sample_ppc function I can only get the Bernoulli results; what should I do if I want the probabilities instead? And if I understand correctly, this probability will be different for each individual transaction since it includes the error term. Am I right about this?
Sounds like an interesting problem, and a hierarchical model is very suitable.
As for your questions, here are my comments:
Both ways are valid, actually. Essentially, the inference (NUTS, Metropolis, ADVI, etc.) only cares about the free parameters, and you can compute whatever contrast you like using the trace. As long as you are careful about the marginal - i.e. making sure that the quantity you are computing does not depend on anything that is not in the trace - you can compute the contrast after inference.
Take Bayesian Estimation Supersedes the T-Test as an example: the quantity of interest (the effect size of the difference between the two groups) is coded in the model block (see cell 7) as a Deterministic. However, you can remove that part, do the inference, and then use the trace to compute the effect size - the two are identical. Similarly, if you are interested in some difference between group parameters, just extract the values from the trace, do your computation, and then take the mean. Formally, this is valid because you are expressing the expectation of some function as an MCMC integral.
In practice, extracting the values from the trace and doing the computation directly on the MCMC samples would be easier.
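To illustrate the two equivalent routes with a toy example (all names here are hypothetical, loosely following the BEST notebook): you can register the contrast as a Deterministic inside the model, or leave it out and compute exactly the same quantity from the trace afterwards.

```python
import numpy as np
import pymc3 as pm

data_a = np.random.normal(1.0, 1.0, size=50)
data_b = np.random.normal(0.5, 1.0, size=50)

with pm.Model() as m:
    mu_a = pm.Normal('mu_a', mu=0., sd=10.)
    mu_b = pm.Normal('mu_b', mu=0., sd=10.)
    # route 1: code the quantity of interest in the model block
    pm.Deterministic('diff', mu_a - mu_b)
    pm.Normal('obs_a', mu=mu_a, sd=1., observed=data_a)
    pm.Normal('obs_b', mu=mu_b, sd=1., observed=data_b)
    best_trace = pm.sample(1000, tune=1000)

# route 2: compute exactly the same contrast from the trace after inference
diff_post = best_trace['mu_a'] - best_trace['mu_b']
# best_trace['diff'] and diff_post agree sample for sample, and their mean is
# the MCMC estimate of the expectation E[mu_a - mu_b]
print(best_trace['diff'].mean(), diff_post.mean())
```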
I just want to get some clarification and make sure I understand your answer correctly.
Both ways are valid, actually. Essentially, the inference (NUTS, Metropolis, ADVI, etc.) only cares about the free parameters, and you can compute whatever contrast you like using the trace. As long as you are careful about the marginal - i.e. making sure that the quantity you are computing does not depend on anything that is not in the trace - you can compute the contrast after inference.
In the example you showed, the quantity of interest is the difference between two group parameters, and I understand that you can either code it directly as a Deterministic in the model or use the trace to calculate it; the two approaches are identical. But I feel what I’m trying to answer is slightly different (I might be wrong about this). For example, I have a two-level model: the first level is the particular product index for each item we sent out, and the second level is the group the product belongs to (let’s say tops/bottoms). Now I want to get the estimated conversion probability for all the tops. I can either use the second-level parameter for tops as an estimate, or I can take the trace of all the model parameters, use them to predict the conversion probability for every item belonging to tops, and then average those probabilities together. I wonder whether these two methods would give similar results, and which approach is more appropriate?
If you have a Bernoulli response, the two approaches would be almost identical. But in general they might not be, if there is other noise associated with the response.
Using the second-level parameters, you are basically using the marginal posterior distribution of the latent variable, whereas using sample_ppc based on the full model, you will be looking at the predictive density.
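As a rough sketch of the two quantities side by side, reusing the hypothetical names from the model sketch in the original question (mu_group, group_idx, model, trace):

```python
import pymc3 as pm
from scipy.special import expit  # inverse logit

# marginal posterior of the latent group-level conversion probability for "tops"
# (one value per posterior draw; no observation noise involved)
p_tops_latent = expit(trace['mu_group'][:, 0])
print('latent posterior mean:', p_tops_latent.mean())

# posterior predictive: simulate new 0/1 purchases and average them
ppc = pm.sample_ppc(trace, samples=1000, model=model)
is_tops = (group_idx == 0)
p_tops_predictive = ppc['y'][:, is_tops].mean()
print('predictive mean:', p_tops_predictive)
```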
The model indeed has a Bernoulli response. But I’m not sure whether I should use the average of the predicted Bernoulli responses or the average of the predicted probabilities (the sigmoid of the logit). And does the noise you refer to mean the hidden factors not captured by the model, or something else?
I wonder if you could also elaborate on the difference between the marginal posterior distribution and the predictive density? I want to fully understand which one has more physical meaning in my use case (which is to predict the average conversion rate for a second-level group).
What I meant is that if the observations depend on other parameters (beyond the latent variable you are interested in), then the values returned from sample_ppc necessarily also contain the information (noise) from those parameters. If you take the mean of the predictive values to infer the latent mean, it might not be identical to the mean you get by computing it directly from the samples of the latent variable.
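For example (again with the hypothetical names from the sketch above), if you want per-transaction probabilities rather than the simulated 0/1 outcomes, you can push the trace through the linear predictor and the sigmoid yourself:

```python
from scipy.special import expit

# per-transaction purchase probabilities, one row per posterior draw:
# shape (n_draws, n_transactions); no Bernoulli noise is added here
p_transactions = expit(trace['a_product'][:, product_idx])

# group-level estimate from the probabilities vs. from the simulated 0/1 outcomes
is_tops = (group_idx == 0)
print(p_transactions[:, is_tops].mean())   # average of predicted probabilities
print(ppc['y'][:, is_tops].mean())         # average of Bernoulli draws from sample_ppc
```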
You can think of the predictive density (or the samples from it, if you are working with the result of sample_ppc) as the output of a stochastic function that takes the joint posterior as input. In that regard, which one to do your computation on depends on what you are trying to quantify. If you are, say, making a decision that minimizes some future loss, then I would use the predictive density. If you want to estimate, say, the effectiveness of some past policy, I would use the marginal posterior directly. Again, in your case both would probably be similar given the binary response.
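And for the merged-group case you mentioned at the start, working from the posterior predictive draws (or the per-transaction probabilities) makes the combination straightforward, since you just pool the transactions of the groups you want to merge. A sketch, again with the hypothetical names from above:

```python
import numpy as np

# pool the transactions of the groups to be merged (e.g. tops and bottoms)
merged = np.isin(group_idx, [0, 1])
p_merged = ppc['y'][:, merged].mean(axis=1)   # one conversion rate per posterior draw
print(p_merged.mean(), np.percentile(p_merged, [2.5, 97.5]))
```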