Difference in results between PyMC and Stan

I'm not sure if I can ask this on PyMC's Discourse, but I'm helping someone who built a model in PyMC, and I reproduced the results in Stan. The Stan results look much more like what we would expect given the domain knowledge. Can anyone help identify what might cause the difference?

Note: I replaced the variable names with temporary ones to obfuscate sensitive information.

data {
    int<lower = 0> sample_size;
    int<lower = 0> B;               // number of levels of varB
    int<lower = 0> P;               // number of levels of varP
    int<lower = 0> T;               // number of levels of varT
    int<lower = 0> E;               // number of levels of varE
    int<lower = 0> R;               // number of levels of varR
    real y[sample_size];            // response
    int t[sample_size];             // 1-based group index into varT
    int b[sample_size];             // 1-based group index into varB
    int p[sample_size];             // 1-based group index into varP
    int e[sample_size];             // 1-based group index into varE
    int r[sample_size];             // 1-based group index into varR
}
parameters {
    vector[T] varT;
    vector[B] varB;
    vector[P] varP;
    vector[E] varE;
    vector[R] varR;
    real<lower = 0> c;
    real<lower = 0> sigma;
}
model {
    real mu[sample_size];

    varT ~ normal(0, 1);
    varB ~ normal(0, 1);
    varP ~ normal(0, 1);
    varE ~ normal(0, 1);
    varR ~ normal(0, 1);
    c ~ normal(30, 7.2);
    sigma ~ exponential(2);

    for (i in 1:sample_size) {
        mu[i] = c + varP[p[i]] + varT[t[i]] + varE[e[i]] + varR[r[i]] + varB[b[i]];
        y[i] ~ normal(mu[i], sigma);
    }
}
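
For reference, a minimal sketch of one way to fit this model from Python. The thread does not say which interface was used, so cmdstanpy, the file name, and the stan_data dict are assumptions on my side:

from cmdstanpy import CmdStanModel

# Compile and sample; stan_data is a dict whose keys match the data block
# above (sample_size, B, P, T, E, R, y, t, b, p, e, r).
model = CmdStanModel(stan_file='model.stan')
fit = model.sample(data=stan_data)
print(fit.summary())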

This gives the output:

[Stan posterior summary table not preserved in this copy]

In PyMC the code is:

import pymc3 as pm

with pm.Model() as basic_model:

    c = pm.Normal('c', mu=30, sd=7)
    varT = pm.Normal('varT', 0, sd=1, shape=(data['varT'].nunique(), 1))
    varP = pm.Normal('varP', 0, sd=1, shape=(data['varP'].nunique(), 1))
    varE = pm.Normal('varE', 0, sd=1, shape=(data['varE'].nunique(), 1))
    varB = pm.Normal('varB', 0, sd=1, shape=(data['varB'].nunique(), 1))
    varR = pm.Normal('varR', 0, sd=1, shape=(data['varR'].nunique(), 1))

    sigma = pm.Exponential('sigma', 2)
    growth_mu = pm.Deterministic('Growth_mu', c + varT[data['varT_cat'].values] +
                                 varP[data['varP_cat'].values] +
                                 varE[data['varE_cat'].values] +
                                 varB[data['varB_cat'].values] +
                                 varR[data['varR_cat'].values])
    out = pm.Normal('Y', mu=growth_mu, sd=sigma, observed=data[['Y', 'varT_cat', 'varP_cat', 'varE_cat', 'varB_cat', 'varR_cat']])
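
Sampling and summarising would look roughly like this (PyMC3-era API; the draw and tune counts are placeholders, not the originals):

with basic_model:
    # Placeholder sampler settings; the thread does not state the originals.
    trace = pm.sample(2000, tune=1000)
print(pm.summary(trace))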

This gives the output, with the variables in the same order:

[PyMC posterior summary table not preserved in this copy]

The intercept at the bottom is definitely not what one would expect (given the domain knowledge, at least). There is also far more variation than expected, and many of the effects have means and medians below 0, which is also not expected.

The likelihood definition in the PyMC model should include only the actual observed variable Y. Try changing:

out = pm.Normal('Y', mu=growth_mu, sd=sigma, observed=data[['Y', 'varT_cat', 'varP_cat', 'varE_cat', 'varB_cat', 'varR_cat']])

to

out = pm.Normal('Y', mu=growth_mu, sd=sigma, observed=data['Y'])

Fantastic, that made all the difference! Thanks so much @junpenglao.

Can you maybe explain why this is happening?

Also, shouldn't this be something that PyMC guards against? I'm sure many other people might make the same mistake. I know this was a rushed exercise on my side, and I probably failed to read the documentation properly.

It could be a valid use case to indicate repeated measures (e.g., instead of a single column [Y] there are multiple columns [Y, Y', Y'', ...]), so we don't put a restriction on the shape of the observed. In this case it is a mistake because the columns are not repeated measures.

Thus, in general, it is difficult to know whether we should warn here.
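
Concretely, the original call runs without error because of NumPy-style broadcasting: a mean of shape (n, 1) against an observed array of shape (n, 6) evaluates the likelihood at every column, including the integer index columns, as if each were a replicate of Y. A minimal NumPy sketch (shapes hypothetical):

import numpy as np

n = 1000
mu = np.zeros((n, 1))        # shape of growth_mu above
observed = np.zeros((n, 6))  # shape of data[['Y', 'varT_cat', ..., 'varR_cat']]

# mu is stretched across all six observed columns:
print(np.broadcast(mu, observed).shape)  # (1000, 6)

Since the index columns are small integers near zero, scoring them as observations of Y would drag the intercept down and inflate sigma, which matches the symptoms described above.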

Ah, I can see the possible use case for repeated measures; that makes sense. I guess a warning for at least some very specific scenarios might be useful, e.g., when multiple columns are provided and they have completely different scales, since those are most likely different variables. A rough sketch of such a check follows below.
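
This is an illustration only, not an existing PyMC feature, and the threshold is arbitrary:

import warnings
import numpy as np

def check_observed_scales(obs, max_ratio=100.0):
    """Warn if a multi-column observed array has columns on very different
    scales, a hint that they are distinct variables, not repeated measures."""
    obs = np.asarray(obs, dtype=float)
    if obs.ndim == 2 and obs.shape[1] > 1:
        spreads = obs.std(axis=0)
        spreads = spreads[spreads > 0]
        if spreads.size and spreads.max() / spreads.min() > max_ratio:
            warnings.warn("Observed columns have very different scales; "
                          "are these really repeated measures?")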

Thanks again for the quick response.
