MAP not working with shape>1 and VI on a pooled model

Hey guys,

I’m trying to use VI for inference on a rather complex hierarchical model. I found two problems that I can’t get around, and I’m able to demonstrate them on a simple model.

  1. MAP does not work well for me with shape>1. For example:

     import numpy as np
     import pymc3 as pm

     N=11
     vec=np.arange(0,10,0.1).round(0)
     val=50+10*vec+np.random.randn(100)
     cat=vec.astype(int)
    
     with pm.Model() as model1:
    
         e_mu = pm.Flat("e_mu")
         e_sd = pm.HalfFlat("e_sd")
         e_rel= pm.Normal("e_rel", mu=0,sd=1, shape=N)    
         estim= pm.Deterministic("estim", e_mu + e_rel[cat] * e_sd)
    
         rsd = pm.HalfFlat("rsd")
         res = pm.Normal("res", mu=estim, sd=rsd, observed=val)
    
         map1 = pm.find_MAP()
    
     with pm.Model() as model2:
    
         e_mu = pm.Flat("e_mu")
         e_sd = pm.HalfFlat("e_sd")
         rsd = pm.HalfFlat("rsd")
    
         e_rel=[None]*N
         estim=[None]*N
         res=[None]*N    
    
         for i in range(N):
             e_rel[i]= pm.Normal("e_rel"+str(i), mu=0, sd=1)
             estim[i]= pm.Deterministic("estim"+str(i), e_mu + e_rel[i] * e_sd)    
             res[i] = pm.Normal("res"+str(i), mu=estim[i], sd=rsd, observed=[val[j] for j in range(100) if cat[j]==i])
    
         map2 = pm.find_MAP()
    
     print ("\nmodel1 MAP:")    
     for i in range(N):
         print(map1["estim"][i],"=",map1["e_mu"],"+",map1["e_sd"],"*",map1["e_rel"][i])    
    
    
     print("\nmodel2 MAP:")   
     for i in range(N):
         print(map2["estim"+str(i)],"=",map2["e_mu"],"+",map2["e_sd"],"*",map2["e_rel"+str(i)])
    

The output I get is:

model1 MAP:
50.4339520554 = 66.9733403143941 + 3263.931880542973 * -0.00506732029477
50.4339520554 = 66.9733403143941 + 3263.931880542973 * -0.00214404233975
50.4339520554 = 66.9733403143941 + 3263.931880542973 * 0.00090311929684
50.4339520554 = 66.9733403143941 + 3263.931880542973 * 0.00393404637666
50.4339520554 = 66.9733403143941 + 3263.931880542973 * 0.00716117123596
50.4339520554 = 66.9733403143941 + 3263.931880542973 * 0.00987876353612
59.9753321685 = 66.9733403143941 + 3263.931880542973 * 0.0133180094822
59.9753321685 = 66.9733403143941 + 3263.931880542973 * 0.0161422212201
59.9753321685 = 66.9733403143941 + 3263.931880542973 * 0.0192213932958
59.9753321685 = 66.9733403143941 + 3263.931880542973 * 0.0225604603235
59.9753321685 = 66.9733403143941 + 3263.931880542973 * 0.0256524217578

model2 MAP:
50.43392202694377 = 67.0767613033466 + 1476.1655176669472 * -0.011274372065475788
59.97532270019239 = 67.0767613033466 + 1476.1655176669472 * -0.004810733293904538
69.9209560098227 = 67.0767613033466 + 1476.1655176669472 * 0.0019267451193219207
79.81381426206056 = 67.0767613033466 + 1476.1655176669472 * 0.008628472082754407
90.34696398158516 = 67.0767613033466 + 1476.1655176669472 * 0.015763952212497618
99.21698961999387 = 67.0767613033466 + 1476.1655176669472 * 0.02177278085146191
110.4425660067233 = 67.0767613033466 + 1476.1655176669472 * 0.029377332138143676
119.66051359746791 = 67.0767613033466 + 1476.1655176669472 * 0.03562185382654716
129.71082312983438 = 67.0767613033466 + 1476.1655176669472 * 0.04243024313796448
140.60923419625794 = 67.0767613033466 + 1476.1655176669472 * 0.04981316255722331
150.70128222065802 = 67.0767613033466 + 1476.1655176669472 * 0.056649826809041334

Clearly, it got model1 wrong. For convenience I really prefer to work with something similar to model1, so what can I do to get the correct results?

  2. I’m trying to use VI to infer latent variables in the model, but I’m getting nonsensical results:

     with model1:
         sample1 = pm.fit(5000,method=pm.SVGD(jitter=0.1), obj_optimizer=pm.adamax(), start=map1).sample(draws=1000)
    
     with model2:
         sample2 = pm.fit(5000,method=pm.SVGD(jitter=0.1), obj_optimizer=pm.adamax(), start=map2).sample(draws=1000)
    
     print ("model1 SVGD mean e_mu:",sample1["e_mu"].mean(),"mean e_sd:",sample1["e_sd"].mean())        
     print ("model2 SVGD mean e_mu:",sample2["e_mu"].mean(),"mean e_sd:",sample2["e_sd"].mean())        
    

output:

    model1 SVGD mean e_mu: 1.45233787185 mean e_sd: 38.4966457737
    model2 SVGD mean e_mu: 1.55453451283 mean e_sd: 38.3855233369

Note that when looking at

pm.traceplot(sample1);

I get decent estimates for “estim” (likewise, in my original model I can get decent predictions). I also get good results when I run NUTS on this simple model (but NUTS is far too slow on my original model, so it’s not tractable there).
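
For reference, this is the kind of NUTS run I mean, a plain pm.sample on the toy model (the draw and tuning counts here are arbitrary choices):

    with model1:
        # NUTS is assigned automatically to the continuous variables
        trace_nuts = pm.sample(1000, tune=1000)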

Any help on these issues will be greatly appreciated!

Thanks,
Yoni.

Well, your data generation is not the same as your model. If you write down the generating process exactly like the model, or the model exactly like the generating process, the estimation is actually quite accurate:

with pm.Model():
    intercept = pm.Normal("intercept", 0, 100)
    slope = pm.Normal("beta", 0, 100)
    estim = intercept + slope*vec
    rsd = pm.HalfNormal("rsd", 10)
    res = pm.Normal("res", mu=estim, sd=rsd, observed=val)
    map1 = pm.find_MAP()

which returns:

{'beta': array(10.021884801908273),
 'intercept': array(49.810495227431794),
 'rsd': array(0.9685585514082438),
 'rsd_log__': array(-0.031946342197626494)}

I am not sure how you are modelling your hierarchical model, but in general you should avoid using Flat and HalfFlat priors, and use NUTS to do your inference.
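
For example, here is a minimal sketch of model1 with the Flat/HalfFlat priors swapped for weakly-informative ones (the prior scales are illustrative choices, not tuned values):

with pm.Model() as model1b:
    e_mu = pm.Normal("e_mu", mu=0, sd=100)    # weakly-informative, instead of Flat
    e_sd = pm.HalfNormal("e_sd", sd=100)      # instead of HalfFlat
    e_rel = pm.Normal("e_rel", mu=0, sd=1, shape=N)
    estim = pm.Deterministic("estim", e_mu + e_rel[cat] * e_sd)
    rsd = pm.HalfNormal("rsd", sd=10)
    res = pm.Normal("res", mu=estim, sd=rsd, observed=val)
    trace = pm.sample(1000, tune=1000)        # NUTS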

Also, in model1 you are not checking it correctly:

map1["estim"] is the same size as the observed val, if you want to check the predicted value for each of the N class you should do:

for i in range(N):
    predictvalue = map1["e_mu"]+map1["e_sd"]*map1["e_rel"][i]
    print(predictvalue,"=",map1["e_mu"],"+",map1["e_sd"],"*",map1["e_rel"][i])

which gives

50.09492897 = 63.75973814064936 + 1500.6472344407782 * -0.0091059436602
60.0019191299 = 63.75973814064936 + 1500.6472344407782 * -0.00250413216676
70.0334065721 = 63.75973814064936 + 1500.6472344407782 * 0.00418064171743
79.7380402786 = 63.75973814064936 + 1500.6472344407782 * 0.0106476070933
90.0370780669 = 63.75973814064936 + 1500.6472344407782 * 0.0175106709446
99.5677158566 = 63.75973814064936 + 1500.6472344407782 * 0.0238616890727
109.958264942 = 63.75973814064936 + 1500.6472344407782 * 0.0307857341423
120.435885617 = 63.75973814064936 + 1500.6472344407782 * 0.0377678019028
129.80089702 = 63.75973814064936 + 1500.6472344407782 * 0.0440084500629
139.815990556 = 63.75973814064936 + 1500.6472344407782 * 0.0506822993902
150.6627953 = 63.75973814064936 + 1500.6472344407782 * 0.0579103837095

Hey junpenglao, thank you very much for your reply!

I understand your point regarding the size of estim in the first model, thanks for pointing it out!

In any case, I’m not doing a simple linear regression as you suggest; I was just trying to model pooling (that’s why I chose cat=vec.astype(int): I want discrete categories). To make this clearer, I’ll modify the example to:

    N=11
    dfunc=[10,50,30,45,15,10,95,70,60,80,20] 
    cat=np.arange(0,10,0.1).round(0).astype(int)
    val=[dfunc[cat[i]] for i in range(100)]+np.random.randn(100)

(everything else remains the same)

As far as SVGD goes, I still get bad inferred e_mu/e_sd values but good estim.

model1 SVGD mean e_mu: 1.40907550701 mean e_sd: 29.8533752333

The flat prior doesn’t change these results; when I use other priors, the result for e_mu under SVGD is usually just the mode of the prior (that is, as if there were no observations).

As for your suggestion to use NUTS: I agree that it works very well on this toy model. Unfortunately, as I mentioned, this toy model is just meant to reproduce the error; my actual model is a lot more complicated, and NUTS takes an extremely long time to run on it, which is why I’m trying VI.

Yoni.

I see; you can model pooling and discrete categories without using indexing. Your models’ poor performance is likely due to the parameterization, as indexing with mu[cat[i]] is generally not as good as a non-centered parameterization. If you are familiar with the linear mixed model literature, it is the same idea as turning the linear equation into y = X*beta + Z*b, where Z codes for the categorical grouping (see the sketch below).
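
To illustrate, here is a minimal sketch of that y = X*beta + Z*b coding on the toy model above, with Z as a one-hot matrix built in numpy (the names Z and model_z are mine, for illustration):

import numpy as np
import pymc3 as pm

Z = np.eye(N)[cat]   # one-hot design matrix: row i codes the category of observation i

with pm.Model() as model_z:
    e_mu = pm.Normal("e_mu", mu=0, sd=100)
    e_sd = pm.HalfNormal("e_sd", sd=100)
    e_rel = pm.Normal("e_rel", mu=0, sd=1, shape=N)
    # dot(Z, e_rel) is mathematically the same as the indexing e_rel[cat],
    # just written in the X*beta + Z*b form of the mixed-model literature
    estim = e_mu + pm.math.dot(Z, e_rel) * e_sd
    rsd = pm.HalfNormal("rsd", sd=10)
    res = pm.Normal("res", mu=estim, sd=rsd, observed=val)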

As for SVGD, you can try increasing the number of particles; however, I doubt it would perform well, as it likely suffers from the same problem as NUTS (and it is quite slow as well). I would suggest reparameterizing your model first: start with a simpler model and structure, make sure NUTS samples fairly OK, then add more predictors and variables to build up to the more complex model.

If you have any other questions, you can post the code of the actual model you are working on, with some simulated data, so we can help better with this.

What do you mean by the non-centered parameterization? Is it a vector with length equal to the number of categories, all zeros except for a one at the observed category (a “one-hot” vector)?

From the bit of experience I’ve had with SVGD, increasing the number of particles does not improve the estimated mean, it just smooths it out; a low particle count, on the other hand, can really speed things up. So I keep the number of particles low to get quick approximations.
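
For reference, I set the particle count through SVGD’s n_particles argument, e.g. (the numbers here are just examples):

    with model1:
        # fewer particles run faster; more particles mainly smooth the approximation
        apx = pm.fit(5000, method=pm.SVGD(n_particles=20, jitter=0.1))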

My model is basically a hierarchical model where each category has its own betas. Each beta also has a temporal multiplicative variable, which is a random (log-normal) walk. For this model I already have difficulty running NUTS. Also, as mentioned before, MAP and SVGD can’t recover the latent beta_mu’s / intercept_mu’s (which is something I’d like to recover), but they can give a good prediction for the time series.

Anyway, this is what I mean:

(data generation)

    np.random.seed(1234)

    hidden_i =[505,113,102, 10,427,190,248, 97,311, 77]
    hidden_b1=[ 10, 15, 12, 20,  7, 14, 18, 11, 12, 22]
    hidden_b2=[  2,  9, 15, 20,  5, 10,  3, 12,  5, 30]
    hidden_sd=[102, 53, 83, 40, 97,103, 76,116, 67,146]
    hidden_t =[1,0.92,1.02,0.99,1,0.98,0.99,0.97,1,1.02,1.02,1.03,1.02,1.02,1.08,1.17,1.27,1.3,1.36,1.28]

    id_obs   = np.random.randint(0,10,size=10000)
    time_obs = np.random.randint(0,20,size=10000)
    x1_obs   = np.random.randint(100,200,size=10000)
    x2_obs   = np.random.randint(2,15,size=10000)
    noise    = np.random.normal(size=10000)
    # built as an array so that things like y_obs.mean() work later in the thread
    y_obs    = np.array([(hidden_i[id_obs[i]]+
                          hidden_b1[id_obs[i]]*x1_obs[i]+
                          hidden_b2[id_obs[i]]*x2_obs[i]+
                          hidden_sd[id_obs[i]]*noise[i])*hidden_t[time_obs[i]] for i in range(10000)])

    max_id=10
    max_time=20

model:

    import theano.tensor as tt  # for tt.exp below

    pooling_variance=1
    intercept_prior=500
    b1_prior=20
    b2_prior=20
    rsd_prior=20

    with pm.Model() as model:

        rw_mu = pm.Normal("rw_mu", mu=0, sd=0.1, shape=max_time-1)
        rw_sd = pm.HalfNormal("rw_sd", sd=0.1)

        intercept_mu = pm.Normal("intercept_mu", mu=0, sd=intercept_prior)
        intercept_sd = pm.HalfNormal("intercept_sd", sd=intercept_prior*pooling_variance)
        rel_intercept = pm.Normal("rel_intercept", mu=0, sd=1, shape=max_id)
        intercept_t = pm.GaussianRandomWalk("intercept_t", mu = rw_mu, sd = rw_sd, shape=max_time)
        intercept = pm.Deterministic("intercept", intercept_mu + intercept_sd * rel_intercept) 

        b1_mu = pm.Normal("b1_mu", mu=0, sd=b1_prior)   
        b1_sd = pm.HalfNormal("b1_sd",sd=b1_prior*pooling_variance)       
        rel_b1 = pm.Normal("rel_b1", mu=0, sd=1, shape=max_id)   
        b1_t = pm.GaussianRandomWalk("b1_t", mu = rw_mu, sd = rw_sd, shape=max_time)
        b1 = pm.Deterministic("b1", b1_mu + rel_b1 * b1_sd)

        b2_mu = pm.Normal("b2_mu", mu=0, sd=b2_prior)
        b2_sd = pm.HalfNormal("b2_sd", sd=b2_prior*pooling_variance)
        rel_b2 = pm.Normal("rel_b2", mu=0, sd=1, shape=max_id)
        b2_t = pm.GaussianRandomWalk("b2_t", mu = rw_mu, sd = rw_sd, shape=max_time)
        b2 = pm.Deterministic("b2", b2_mu + b2_sd * rel_b2)

        estimate = pm.Deterministic("estimate", 
                                    tt.exp(intercept_t[time_obs]) * intercept[id_obs] + 
                                    tt.exp(b1_t[time_obs]) * b1[id_obs] * x1_obs  + 
                                    tt.exp(b2_t[time_obs]) * b2[id_obs] * x2_obs)                                

        rsd_mu = pm.HalfNormal("rsd_mu", sd=rsd_prior)
        rsd_sd = pm.HalfNormal("rsd_sd", sd=rsd_prior*pooling_variance)
        rsd_t = pm.GaussianRandomWalk("rsd_t", mu = rw_mu, sd = rw_sd, shape=max_time)
        rsd = pm.Deterministic("rsd", tt.exp(rsd_t[time_obs]) * pm.Gamma("rel_rsd", mu=rsd_mu, sd=rsd_sd, shape=max_id)[id_obs])      

        nu = pm.Uniform("nu", lower = 1, upper = 100)    
        res = pm.StudentT("res", mu = estimate, sd = rsd, nu=nu, observed = y_obs)    

MAP doesn’t get the right mu’s:

    with model:
        MAP = pm.find_MAP()

    MAP["intercept_mu"],MAP["b1_mu"],MAP["b2_mu"]

output:

    (array(0.01140210514034317),
     array(0.1994615290798753),
     array(0.2646010136479789))

SVGD fails as well:

    with model:
        apx = pm.fit(20000,method=pm.SVGD(jitter=0.1, n_particles=10), obj_optimizer=pm.adamax(), start=MAP)
        sample = apx.sample(draws=1000)  
    sample["intercept_mu"].mean(),sample["b1_mu"].mean(),sample["b2_mu"].mean()

the output is: (0.042609815697268585, 0.4473498188034295, 0.54902316720495059)

For more information about centered vs. non-centered parameterization, you can have a look at these resources:

http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/

http://docs.pymc.io/notebooks/Diagnosing_biased_Inference_with_Divergences.html

Also see an example of centering the design matrix here: https://github.com/junpenglao/GLMM-in-Python/blob/master/pymc3_different_sampling.py#L31-L36

Right now, your model is quite complicated, and it is difficult to see which part is not working (one could argue that the model is much more complex than the data generating process). I am not surprised at all that MAP does not give sensible results.

In my experience, the doubly stochastic random walk on both the mu and the sd is difficult to work with (it needs care and informative priors). As a benchmark, have you tried getting rid of the rsd part and just modelling the observations as coming from a Normal with mu=estimate and sd=1?
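
Concretely, the benchmark I have in mind keeps the model up to estimate and swaps the whole rsd block plus the StudentT likelihood for a fixed-scale Normal (a sketch):

res = pm.Normal("res", mu=estimate, sd=1., observed=y_obs)  # fixed residual scale, no rsd random walk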

Ah yes, I see what you mean by the “centered/non-centered” representation. Actually, you can see that starting from the first example I have always used the non-centered representation (with the “rel” parameters). From what I understand, this mainly helps MCMC converge, as it prevents those almost-sink-like states in the Markov chain (“funnels”). I’m not really sure why or how it would help with MAP or SVGD. Maybe there is some other reparameterization/transformation that can help with MAP or SVGD?

I’m not sure what you mean by a “double stochastic” random walk. The rw_sd is a one-dimensional parameter (how divergent all the walks are), and I have different mu’s for each time slice; this is part of the model (there are certain points in time where the betas may change together). The generated data is not as complicated as the model: you can see there are no separate random walks for each beta. Of course, I’m not running the model only on this generated data; I generated it so I can debug the model.

Also missing from the generated data are possible correlations between the predictors (which do occur in the real data), outliers resulting from dirty observation data (this is why I use a Student-t rather than a Normal for the error), and a very non-uniform distribution over the ids. I want to see what breaks my model with simple data I can understand easily, and then advance. In any case, I see that the random walk is actually one part where both MAP and SVGD get good estimates, so I don’t think the problem lies there.

As per your suggestion, I tried to simplify the model by changing rsd to 1 (works very poorly) or to a single one-dimensional variable (it still doesn’t find the mu’s and also gives less accurate estimates for other things). Also, my data does have residuals of different magnitudes in the different categories, which is a parameter I want to estimate, so I’d rather not leave it out of the model.

Again, many thanks for your help!

Maybe it is a terminology issue, but in a non-centered model you usually parameterize so that one of the ids serves as the reference group, or so that the group average is the intercept. For example, in your last model:

estimate = pm.Deterministic("estimate", 
              tt.exp(intercept_t[time_obs]) * intercept[id_obs] + 
              tt.exp(b1_t[time_obs]) * b1[id_obs] * x1_obs  + 
              tt.exp(b2_t[time_obs]) * b2[id_obs] * x2_obs)

Reparameterizing it to include an overall intercept (thus non-centered) would usually help.

I understand that you are simulating simple data to try to debug a more complex model, but in my experience it is much more informative to simulate using the same generative process. The reason is that when you test your model with simpler data, you don’t know whether the wrong estimates come from the model being unidentifiable or from the fitting algorithm failing. With the same generative process behind the simulated data, you can perform better model criticism:

1. Is the model fit good? What is the prediction error or residual? Is the posterior prediction reasonable (using sample_ppc, is the predicted value around the observed value)?

2. Which parameters are estimated differently from the real parameters? If the residual is small, there might be an identification problem (e.g., mu1 and mu2 are unidentifiable if the observed value depends only on mu = mu1 + mu2).

3. If you are happy with the above two, you can then try to improve things by changing the model parameterization, trying different fitting methods, etc.

A few side notes:

  • by “double stochastic” I meant having a random walk in both the mu and the sd of the StudentT. That’s why I was suggesting setting rsd to just 1.
  • MAP does not work well in high dimensions, and I am not sure how good SVGD is in this kind of model.
  • did you try to fit the model for just a single id? How is the performance there?

So I guess I still don’t really understand what you mean by non-centered. You sent me the same line as in my model; can you rewrite that line the way you describe, so I can see the difference?

I ran the model with my own data, of course; that is how I found the problem in the first place. Then, trying to pinpoint what caused it, I tried simpler problems and checked whether the problem still existed (see the first one in this thread, where the model is much simpler but SVGD still gets the mu’s wrong).

There are two places where my model disagrees with the simulated data: the random walk and the Student-t error. Neither seems related to the problem, as these are things that MAP and SVGD estimate correctly (they find the right time series and estimate a rather large value for “nu”, indicating the data is not considered fat-tailed). Also, as mentioned, I tested the model on my real data and got problems in exactly the same places as in this generated example. As for your points:

  1. Yes, sample_ppc works very well (see the sketch after this list). I also ran the model with holdout data (using numpy masked arrays), and it gives accurate predictions for the missing values.

  2. In our case it doesn’t seem plausible that the mu’s are unidentifiable: if that were true, the actual betas and intercept would not be identifiable either, and they are estimated quite well. The problem is that the optimizer finds a mu of 0 (or the mode of the prior) and a very large sd to be more likely than the actual values, even though this implies a very unlikely distribution for the “rel” variables (not at all like a standard Normal). This is what puzzles me.

  3. Yes, I’ll be happy to do that. I couldn’t find a reparameterization that solves this problem, nor another fitting method, so I’ll be happy to get more ideas on what to try. This is why I posted the question in the first place. :slight_smile:
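
Here is a minimal sketch of the posterior predictive check from point 1, assuming the model and the SVGD sample from earlier in the thread (pm.sample_ppc is PyMC3’s posterior predictive sampler; the sample count is an arbitrary choice):

    with model:
        ppc = pm.sample_ppc(sample, samples=500)
    # compare the posterior predictive mean against the observations
    print(np.abs(ppc["res"].mean(axis=0) - y_obs).mean())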

I see what you mean. I tried using an sd that is stationary in time; still the same problem.

Now this is something interesting: why does MAP not work well in high dimensions? I couldn’t find much documentation on how it works in PyMC3 in the first place (I did try different scipy optimizers, and got either similar results or vectors full of 0/NaN). Understanding this issue could perhaps help me a lot.
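
For reference, this is the kind of optimizer switching I tried: pm.find_MAP passes its method argument through to scipy.optimize.minimize, so any of its method names can be used (the two below are just standard scipy optimizers):

    with model:
        map_bfgs   = pm.find_MAP(method="L-BFGS-B")  # PyMC3’s default, gradient-based
        map_powell = pm.find_MAP(method="Powell")    # gradient-free alternative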

The model works fine for each single id (provided, of course, there are enough observations for that id). You can see that in the results from this model as well: it gets the betas quite well, just not the underlying distribution of the betas.

Something along the lines of

estimate = pm.Deterministic("estimate", 
              overall_intercept + 
              tt.exp(intercept_t[time_obs]) * intercept[id_obs] + 
              tt.exp(b1_t[time_obs]) * b1[id_obs] * x1_obs  + 
              tt.exp(b2_t[time_obs]) * b2[id_obs] * x2_obs)

I see; sorry I missed this at the very beginning when you mentioned it. In this case, the problem is a parameter non-identifiability issue. Here is what I meant: take the top model above; if you substitute the estimated values from both MAP and SVGD into e_mu + e_rel[cat] * e_sd, the results are very similar (that’s why estim is almost the same), but the estimates of the individual parameters are very different, because the two methods have different cost functions, which makes them converge to different estimates.
Here is a simple example:

with pm.Model() as model2:
    x = pm.Flat("x", shape=2)
    y = pm.Normal('y', x.sum(), 1, observed=1.)
    map1=pm.find_MAP()

In this case, there is no information to identify the individual elements of x: under a Flat prior, x=[.5, .5] or x=[-1, 2] makes no difference, as only the sum contributes to the likelihood. What happens when you try to fit it is that MAP converges to a value close to the starting point.
So back to your case: it is likely that the parameter intercept_mu is already accounted for by intercept_sd * rel_intercept in the linear function intercept = pm.Deterministic("intercept", intercept_mu + intercept_sd * rel_intercept), and similarly for b1_mu and b2_mu.

One solution you can try is to set more informative priors, for example setting the prior for each *_mu to a Normal distribution with mean equal to the empirical mean, i.e.: intercept_mu = pm.Normal("intercept_mu", mu=y_obs.mean(), sd=intercept_prior)

I tried to non-center my model by moving the mu’s and sd’s directly into the linear predictor:

        with pm.Model() as model:

            rw_mu = pm.Normal("rw_mu", mu=0, sd=0.1, shape=max_time-1)
            rw_sd = pm.HalfNormal("rw_sd", sd=0.1)

            intercept_mu = pm.Normal("intercept_mu", mu=0, sd=intercept_prior)
            intercept_sd = pm.HalfNormal("intercept_sd", sd=intercept_prior*pooling_variance)
            rel_intercept = pm.Normal("rel_intercept", mu=0, sd=1, shape=max_id)
            intercept_t = pm.GaussianRandomWalk("intercept_t", mu = rw_mu, sd = rw_sd, shape=max_time)

            b1_mu = pm.Normal("b1_mu", mu=0, sd=b1_prior)   
            b1_sd = pm.HalfNormal("b1_sd",sd=b1_prior*pooling_variance)       
            rel_b1 = pm.Normal("rel_b1", mu=0, sd=1, shape=max_id)   
            b1_t = pm.GaussianRandomWalk("b1_t", mu = rw_mu, sd = rw_sd, shape=max_time)

            b2_mu = pm.Normal("b2_mu", mu=0, sd=b2_prior)
            b2_sd = pm.HalfNormal("b2_sd", sd=b2_prior*pooling_variance)
            rel_b2 = pm.Normal("rel_b2", mu=0, sd=1, shape=max_id)
            b2_t = pm.GaussianRandomWalk("b2_t", mu = rw_mu, sd = rw_sd, shape=max_time)

            estimate = pm.Deterministic("estimate", 
                                        tt.exp(intercept_t[time_obs]) * (intercept_mu + rel_intercept[id_obs]*intercept_sd ) + 
                                        tt.exp(b1_t[time_obs]) * (b1_mu + rel_b1[id_obs]*b1_sd) * x1_obs  + 
                                        tt.exp(b2_t[time_obs]) * (b2_mu + rel_b2[id_obs]*b2_sd) * x2_obs)                                

            rsd_mu = pm.HalfNormal("rsd_mu", sd=rsd_prior)
            rsd_sd = pm.HalfNormal("rsd_sd", sd=rsd_prior*pooling_variance)
            rsd_t = pm.GaussianRandomWalk("rsd_t", mu = rw_mu, sd = rw_sd, shape=max_time)
            rsd = pm.Deterministic("rsd", tt.exp(rsd_t[time_obs]) * pm.Gamma("rel_rsd", mu=rsd_mu, sd=rsd_sd, shape=max_id)[id_obs])      

            nu = pm.Uniform("nu", lower = 1, upper = 100)    
            res = pm.StudentT("res", mu = estimate, sd = rsd, nu=nu, observed = y_obs)    

This has the exact same problem as before.

As for what you mentioned about the informative prior: what you show in the example is a model with a singularity / too many degrees of freedom, which is not what I have. The informative prior in my case is actually not on the original mu’s but on the “rel” variables, which should be Normal(0,1). When I set the mu’s to any other distribution, I always get back the mode of that distribution (and obviously the mean of the y’s is not going to be the mean of the intercept, since there are also the betas), and the sd/rel values get distorted.

If you look at the distribution of all the “rel” values, you notice it is very unlikely under the specified prior, and yet this is the result I get from both MAP and SVGD. For example, when I run

    MAP["rel_intercept"],MAP["rel_b1"],MAP["rel_b2"]

I get as a result:

    (array([ 2.78046801,  0.47374556, -0.02032467,  0.03302413,  2.70729312,
             1.31475486,  1.71472227,  1.0161798 ,  1.87376116,  0.03497224]),
     array([ 0.35119225,  0.87183933,  0.56884895,  1.41487114, -0.01392106,
             0.72018126,  1.16750542,  0.3488562 ,  0.52261201,  1.68577888]),
     array([-0.57709052,  0.4109046 ,  2.1256187 ,  2.30866203, -0.58237427,
             1.1298501 , -0.57186311,  1.2900333 ,  0.02431912,  4.3071117 ]))

Also, consider the fact that NUTS is actually able to find the proper mu’s; this suggests there is no non-identifiability issue, because if there were, NUTS would not be able to converge to the right numbers either.