Posterior distribution of estimated parameter has a lot of variation

Hello! I’m trying to solve the 3rd problem from the Statistical Rethinking course.
It uses the cherry_blossom dataset, which I’m sure is very familiar; you can find it here.

Below is my replication of a model that estimates the mean day of the year as a linear function of temperature. I standardized both temperature and day of the year, but only after dropping the NaN rows from the cherry blossom dataframe.

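import numpy as np
import pymc as pm

def standardize(x):
    # Center on the mean and scale by the standard deviation
    return (x - np.mean(x)) / np.std(x)

with pm.Model() as m2:
    # Priors for the intercept, slope, and residual scale
    a = pm.Normal("a", 0, 10)
    b = pm.Normal("b", 0, 10)
    sigma = pm.Exponential("sigma", 1)

    # Standardized temperature, one value per observation
    pred = pm.MutableData('pred', df_cherry['temp_std'], dims="obs_id")

    # Tracked linear predictor: one mu per observation
    mu = pm.Deterministic("mu", a + b * pred, dims="obs_id")
    D = pm.Normal('D', mu, sigma, observed=df_cherry['doy_std'], dims="obs_id")

    m2_trace = pm.sample(return_inferencedata=True)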
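
A note on why I standardize only after dropping the NaN values: a single NaN propagates through the mean and the standard deviation, so every standardized value would come out as NaN. Here is a minimal plain-Python sketch of that, assuming a hypothetical `standardize_list` helper that mirrors `standardize` (population standard deviation, like the default of `np.std`) and made-up temperature values rather than the real data:

```python
import math

def standardize_list(xs):
    """Plain-Python stand-in for standardize(): (x - mean) / population std."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]

# Made-up values for illustration only (not from the cherry_blossom data).
temps = [6.1, math.nan, 5.8, 6.4, 7.0]

# Standardizing with the NaN still present: every value turns into NaN.
with_nan = standardize_list(temps)

# Dropping NaN first, then standardizing: mean ~0 and std ~1 as intended.
clean = [t for t in temps if not math.isnan(t)]
z = standardize_list(clean)
z_mean = sum(z) / len(z)
z_std = math.sqrt(sum((v - z_mean) ** 2 for v in z) / len(z))

print("all NaN when NaN is kept:", all(math.isnan(v) for v in with_nan))
print("mean and std after dropping NaN:", z_mean, z_std)
```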

When I inspect the trace, the posterior distribution for the mean looks really off. From my understanding it looks like the posterior has a lot of variation between the samples.

Does anyone have any idea why this happens and how to fix it?

Welcome!

When you plot the posterior for mu (the bottom panel), you are looking at a set of posteriors, one per observation (each a different color). So the variation you see in the bottom panel reflects the fact that a and b are pretty certain, but that df_cherry['temp_std'] (via pred) likely varies quite a bit across observations. Is that clearer?
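
To make that concrete, here is a minimal simulation sketch. It does not use your model or data: the draws for a and b and the pred values are made up (assumptions for illustration only), and `mu_draws` is a hypothetical name. It shows that each observation's mu can have a narrow posterior while the centers of those posteriors spread out with pred.

```python
import random
import statistics

random.seed(0)

n_draws = 2000
# Made-up, tightly concentrated "posterior" draws for a and b (assumed values).
a_draws = [random.gauss(0.0, 0.03) for _ in range(n_draws)]
b_draws = [random.gauss(-0.3, 0.03) for _ in range(n_draws)]

# Made-up standardized predictor values, one per observation.
pred = [-2.0, -1.0, 0.0, 1.0, 2.0]

# mu has one posterior per observation: mu_draws[i] holds the draws for pred[i].
mu_draws = [[a + b * x for a, b in zip(a_draws, b_draws)] for x in pred]

# Spread within each observation's posterior (narrow) ...
within_sd = [statistics.pstdev(draws) for draws in mu_draws]
# ... versus spread of the posterior means across observations (wide).
mu_means = [statistics.fmean(draws) for draws in mu_draws]
across_sd = statistics.pstdev(mu_means)

for x, m, s in zip(pred, mu_means, within_sd):
    print(f"pred={x:+.1f}  mean(mu)={m:+.3f}  sd(mu)={s:.3f}")
print(f"sd of mean(mu) across observations: {across_sd:.3f}")
```

In a trace plot, these narrow per-observation curves are all drawn in the same panel, which is what makes mu look so spread out. If you only want the panels for a, b, and sigma, ArviZ's `plot_trace` accepts a `var_names` argument, e.g. `az.plot_trace(m2_trace, var_names=["a", "b", "sigma"])`.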

Yes, that makes more sense, thanks for explaining. I just realized that the posterior for mu was not plotted by default before (I upgraded my pymc version), so I wasn’t used to seeing this plot when inspecting the trace. I was expecting to see only the distributions for the parameters with priors, and I had certain expectations of how those should look.
Thanks!
