I would go with something like 2, under the assumption that the missing choices are by design (or known a priori). This means you don't need to infer the probability of the categories being missing, as in option 1. I am still thinking about the best way to implement it. I am considering two approaches:
- Build the p as usual, use a mask to set the unavailable categories to 0, and normalize each row. This is straightforward, but I think you will run into problems during sampling because of the normalization.
- Treat each unique missing pattern as independent, and use a for-loop to build a multinomial for each case. You can index into your weight matrix (betas) to exclude the missing categories. This should work in general, but if you have a lot of unique combinations of missing categories, there will be a problem building the model (the theano graph becomes too big).
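To make the first approach concrete, here is a minimal NumPy sketch of the mask-and-renormalize step (the function name `masked_softmax` and the toy shapes are my own for illustration, not from the original model):

```python
import numpy as np

def masked_softmax(logits, mask):
    """Softmax restricted to the available categories.

    logits : (n_obs, n_cat) raw scores, e.g. X @ betas
    mask   : (n_obs, n_cat) 1 where the category is available, 0 where missing
    """
    # subtract row max for numerical stability, exponentiate
    expl = np.exp(logits - logits.max(axis=1, keepdims=True))
    # zero out the unavailable categories, then renormalize each row
    expl = expl * mask
    return expl / expl.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
mask = np.ones((4, 5))
mask[0, 2] = 0.0  # category 2 is unavailable for observation 0

p = masked_softmax(logits, mask)
# masked category gets probability 0, each row still sums to 1
```

The same masking would be written with `theano.tensor` ops inside the model; the sampling concern I mention above is that the renormalization couples the masked and unmasked entries, which can make the geometry awkward for the sampler.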