Help with Hierarchical regression for discrete data

Hi everyone, I am trying to build a hierarchical regression model, but the observed data are discrete and I am struggling to figure out what distribution I should use for the observed data.
I am trying with a Multinomial, but I am not sure I fully understand how this should work.

In the image, I used a normal distribution for the outputs, but I guess it is a really weak fit.

This is the current model I am trying, but it gives the error below.

with pm.Model():
    a_0 = pm.LogNormal('a_0', mu=pm.math.log(init_nu), sigma=0.1, shape=(X.shape[1],))
    w_0 = pm.Dirichlet('w_0', a=a_0, shape=(X.shape[1],))

    w_k_ = []
    for i in range(n_categories):
        # one Dirichlet per category; the error output below shows
        # these as the w_1, w_2, w_3 simplex variables
        w_k_.append(pm.Dirichlet(f'w_{i + 1}', a=w_0, shape=(X.shape[1],)))
    w_k = tt.stack(w_k_)

    # c is the categories' index array
    act_out = tt.batched_tensordot(
        pm.math.log(w_k[c].reshape((c.shape[0], X.shape[1]))),
        X, axes=[[1], [1]])

    out_det = pm.Deterministic("oup_det", act_out)

    out = pm.Multinomial("out",

This is the error I am getting at the moment, but I think there is more that I am not understanding:

SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'a_0_log__': array([-1.28775636, -0.93606029, -1.81451977, -2.47462444, -3.41155315,
       -2.81283299]), 'w_0_simplex__': array([ 0.81813326,  0.15554985,  0.66529345,  0.88937281, -0.90418284]), 'w_1_simplex__': array([-4.5531904 ,  2.05080364, -1.98760618, -5.38623783,  2.24851364]), 'w_2_simplex__': array([ 2.95381451, -1.25284196,  1.98640321,  1.65540121,  1.71594464]), 'w_3_simplex__': array([1.20801965, 4.39257921, 0.51940015, 3.1436691 , 0.89849361])}

Initial evaluation results:
{'a_0': 5.91, 'w_0': -12.37, 'w_1': -20.89, 'w_2': -12.73, 'w_3': -13.53, 'out': -inf}

Thanks a lot in advance, any type of help would be very useful!

Could you also provide the data or simulated data? It will help folks taking a look at your model.


Thanks, @junpenglao, good shout.

I am trying to build a regression model for \nu from X \cdot \nu_c = y, where c is the index for the hierarchical model.
X has a 1 and a -1 in each row, placed more or less randomly; the rest are zeros.

X=\left[ \begin{matrix} 0 & 1 & -1 & 0 & 0 & 0 \\ 1 & -1 & 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 & 0 & 1 \\ & & \vdots & & & \end{matrix} \right]

The y values can only assume 4 values, given by y=\log(2d+1) with d\in\{1,2,3,4\},

and d=1 is more frequent than d=2, and so on. These two things give the “weird” curve on the observed line in the previous picture, with the 4 peaks corresponding to the 4 values of d.

Finally, the Dirichlet distribution on \nu is there as a normalisation constraint, \sum e^{\nu}=1, but I don’t think it causes problems.

[images: observed data and posterior predictive plots]

The model that gave me the posterior predictive result is the one below, but it gives a high error for the long tail on the negative side. I have tried other distributions without success, getting NaN errors, and now I am trying the Multinomial to get a better fit…

Also, I don’t know why the output has shape (85, 85) (85 is the length of y). Shouldn’t it be just (85,)?

An output of 85x85 is an indication that there is a shape problem in your model. I think that’s why you are getting an error when you change the observed to a Multinomial.


Yes, I think the same, and I don’t know where that 85x85 dimension comes from.
I thought that the observed variable took its shape from the observed data, i.e. y, but that is a column array, not a square matrix… So I really don’t know where the 85x85 comes from.

I am new here, so my insight might not be the most sensible, but there are some things that I don’t understand about your model, and maybe you explaining them to me could help you understand where the errors are:

  • Reading your code or description, it is really not obvious to me what you mean. Maybe you should try expressing it in detailed mathematical terms first? For example, if y could only take two possible values: 0 and 1, and you had reasons to think that you can separate X linearly, then you could describe a logistic regression in the following way. Doing so might help you figure out the indices situation. Plus, I found that it is usually very straightforward to translate a model expressed in these mathematical terms into PyMC.
\begin{align*} \mathrm{Intercept} &\sim \mathrm{Normal}(0, 1)\\ \nu_j &\sim \mathrm{Normal}(0,1)\\ p_i &= \mathrm{logit}^{-1}\left(\mathrm{Intercept}+\sum_j{\nu_j X_{ij}}\right)\\ y_i &\sim \mathrm{Bernoulli}(p_i) \end{align*}
  • Is a regression actually adapted for your problem? From the image, it looks like there is only a small number of different possible rows in X. If this is indeed the case, and you are not planning to try your model on any other possible vector, then I am not sure that a regression is the most adapted. You might want to consider X as categorical. For each category (each possible row of X), you could then fit an independent multinomial distribution to the values of y that this category can lead to.
  • Talking about the values of y, because I don’t know what they represent, I am not sure if you should treat them as categorical or continuous. But if they really can only take a finite number of discrete values, then it might make more sense to consider them as categorical too. If they are categorical, it seems weird to store them as \log(2d+1), a formula which might be irrelevant to the modeling of the problem. However, if they are actually continuous but just discretized for whatever reason, I don’t think that a multinomial distribution makes sense.
  • Is there no intercept in your regression model?
  • To avoid the shape problems, perhaps you should create independent random variables for each category. It might take longer to sample, and not be as elegant as multidimensional tensors, but it can be a start.
  • You said that y can take 4 different values, but from the model graph, it seems that n_categories was equal to 3?
  • I am not sure how you would like your model to be hierarchical, but perhaps you could start with a non-hierarchical model first, and only add the hierarchical dimension when the non-hierarchical version works? This advice is given often by experienced modelers.

Here is what I can say from looking at your post. I apologize if some of my comments don’t make any sense, but I hope that some do and that they can help you. I might be able to help you further if I knew more about the process that you are trying to model.


+1 to @Armavica. The regression model here confuses me a bit, especially given that y is essentially a Categorical random variable (it is a transformation of one): the parameter p should have shape (len(y), len(d)), which is (85, 4). But your regression output is (85,), which cannot be reshaped into the right p for a Categorical. I usually find that expressing your problem as a generative model helps, for example, writing a function to simulate fake data with numpy.random.
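Such a simulator could look like the following sketch, which only assumes the facts stated in the thread (rows of X with one +1 and one -1, y = log(2d+1) with four possible d values, and a (85, 4) probability array):

```python
import numpy as np

rng = np.random.default_rng(42)
n_obs, n_features, D = 85, 6, 4          # sizes taken from the thread

# each row of X has one +1 and one -1 at random positions, rest zeros
X = np.zeros((n_obs, n_features))
for row in X:
    i, j = rng.choice(n_features, size=2, replace=False)
    row[i], row[j] = 1.0, -1.0

# the D possible y values: y = log(2d + 1), d in {1, ..., 4}
mu = np.log(2 * np.arange(1, D + 1) + 1)

# a valid probability row per observation (softmax of a random score)
scores = rng.normal(size=(n_obs, D))
p = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # (85, 4)

# draw a category per observation, then its (slightly noisy) y value
d = np.array([rng.choice(D, p=p_i) for p_i in p])
y = mu[d] + rng.normal(0, 0.05, size=n_obs)

print(p.shape, y.shape)  # (85, 4) (85,)
```

The point of the exercise is that the shapes the model must reproduce, here (85, 4) for p and (85,) for y, fall out of the simulator automatically.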


Thanks hugely to both of you for those comments! These are super helpful, even just for trying to explain and discuss the problem.

About the regression model, it has two reasons to be there. First, it might not work, but I am testing exactly the validity of the regression in this case, so even if it doesn’t work, that is still a result. Second, it is true that I showed 4 peaks depending on the distance d, but there might be many more, making the band almost continuous, and they are all related to the same variable (it is a distance). So it is not conceptually right to treat them as unrelated, different categories, nor as a fully continuous regression; I am trying to find a spot in the middle.

The model that I tested should be this one:

\begin{align*} \alpha_{0i} &\sim \mathrm{LogNormal}(0,1) \ , \ i\in \{1,\dots,N\} \\ \bar{\omega}_0 &= (\omega_{01},\dots,\omega_{0N})\sim \mathrm{Dir}(N,\alpha_0)\\ \bar{\omega}_c &= (\omega_{c1},\dots,\omega_{cN}) \sim \mathrm{Dir}(N,\bar{\omega}_0) \ , \ \forall c\in \{1,\dots,C\} \\ \bar{\nu}_c &= \log(\bar{\omega}_c) \\ y\mid c &\sim \mathrm{Normal}(X_c \cdot \bar{\nu}_c,\ \sigma_y) \end{align*}

It should be hierarchical from \omega_0 down to the multiple \omega_c, all vectors normalised via the Dirichlet distribution. I also don’t know if this is correct; I’d be really glad to get feedback on this as well.

+1 to @Armavica. The regression model here confuses me a bit, especially given that y is essentially a Categorical random variable (it is a transformation of one): the parameter p should have shape (len(y), len(d)), which is (85, 4). But your regression output is (85,), which cannot be reshaped into the right p for a Categorical. I usually find that expressing your problem as a generative model helps, for example, writing a function to simulate fake data with numpy.random.

This is useful, @junpenglao, also because at this point I don’t think the value to interpolate is categorical, but rather just discrete. However, I don’t really know how these types of situations are addressed with probabilistic models.

I might actually try something like
\begin{align*} p &= \mathrm{logit}^{-1}(X_c \cdot \bar{\nu}_c) \\ w &\sim \mathrm{Multinomial}(D, p) \\ y &\sim \sum_{d=1}^{D} w_d \cdot \mathrm{Normal}(\mu_d, \sigma_{yd}) \end{align*}

where d\in\{1...D\} is the index for the possible values of y.
Do you think this makes sense in some way? or not completely?

On top of that, the main problem I have at the moment is dimensionality, as you both said. I need to understand where the 85x85 comes from first and address that.
Thanks for the huge help, and let me know if you have other thoughts.
EDIT: I found that the dimensionality issue was coming from a missing .reshape((-1,1)). This part should be solved.
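For anyone hitting the same thing: the (85, 85) shape is the usual NumPy broadcasting trap, where combining a (85,) vector with a (85, 1) column silently produces a square matrix. A minimal reproduction:

```python
import numpy as np

mu = np.zeros(85)        # e.g. the regression output, shape (85,)
y = np.zeros((85, 1))    # observed data stored as a column vector

# elementwise combination broadcasts (85, 1) against (85,) to (85, 85)
print((y - mu).shape)    # (85, 85)

# making the shapes agree restores the intended elementwise result
print((y.ravel() - mu).shape)           # (85,)
print((y - mu.reshape((-1, 1))).shape)  # (85, 1)
```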

Why do you parameterize the second Dirichlet (\bar{\omega}_c) with a Dirichlet? The alpha parameter isn’t required to be constrained between 0 and 1. I’d just put a normal hierarchical prior as usual, then apply a function (exp or softplus) to the output to ensure it’s always larger than 0. Something similar came up once in another discussion of hierarchical estimation of the Dirichlet alpha parameter.

The log transformation at the end also strikes me as odd. It seems like you lose all the properties that make one reach for a Dirichlet in the first place (components between 0 and 1 that sum to 1). Do you need to constrain \bar{\nu}_c to be negative? Couldn’t you just directly model what you want in the end rather than going through all this hassle?


I agree with @jessegrabowski: the way that you express your models still looks odd to me, there might still be some issues here.

Besides, @junpenglao’s advice is excellent: if you think about the physical process that generates data in the first place, and manage to write a probabilistic function that can generate simulated data according to this process, then your model will be a literal translation of this function.


Thanks for sharing the link; the two cases are actually very similar. Both are about survey preferences, and I guess this is the reason why both of us came up with nested Dirichlet distributions. I will have a deep look at that case.
All the different \omega have a meaning that I am trying to model, in terms of the priorities of choices from survey respondents, and in this sense they should all be normalised. This is the reason for the hierarchy of Dirichlet distributions. But I might be wrong, and this might actually not be a good way of modelling it.

The constraint is either \sum \omega = 1 or \sum e^{\nu} = 1 for all the \omega / \nu (they are actually more or less the same variable).

At the moment, with a TruncatedNormal and the dimensionality error fixed, it gets a better fit, but still, I agree everything seems a bit odd and there is something I am missing.


Thanks everyone for the help, I found the solution to what I was looking for. As you suggested, the regression was the “wrong” approach: it was just a way of describing the data, and the end goal wasn’t a perfect fit of the process generating the data.

I was trying to describe how coherent (or not) the respondents of a particular survey were, and to cluster them. The goal is therefore not to interpolate the results perfectly, because the results are inherently incoherent, coming from people with different opinions, but rather to find a few parameters that can describe the groups.

I was in a way looking at the problem from the wrong perspective. Thanks all for the feedback and for the suggestion of rethinking the entire process!
