I’m having a hard time understanding what goes on under the hood during missing value imputation. I understand that we set priors (usually using a normal distribution) and then use the non-missing data as the likelihood. However, what’s really confusing me is why each missing value in the output gets its own mean and standard deviation. Since these values are missing, there’s nothing really differentiating them, so why would the mean and standard deviation not be the same for all the missing values?

It depends on what the other inputs to the likelihood/sampling distribution are. In the very simple case that the likelihood just has a scalar mean, then all of the missing values would have the same mean and standard devation:

```
fake_data = np.random.normal(2, 5, size=100)
fake_data[:3] = np.nan
with pm.Model() as m:
mu = pm.Normal('mu', 0, 10)
sigma = pm.HalfNormal('sigma', 20)
x = pm.Normal('x', mu, sigma, observed=fake_data)
trace = pm.sample(2000, cores=2)
```

So if we look at the posterior of the three missing values in this dataset:

If I ran the sampler longer they would converge to the same values.

However, in many models the mean might be a linear model or some other hierarchical structure with covariates, where the expected value would generally be different for every observation, in which case we would not expect the imputed values to be the same.

Thanks for the example, I understand the explanation partly. Regarding different values when we have a hierarchical structure, why would the expected values be different for every observation? We don’t have the observation, so don’t know which group in the hierarchy the value belongs to. So we would assign the same expected value based on the weighted average of the effect of each group with the probability of the missing value being in the group.

You can have predictors at the hierarchical levels as well, but if your truly don’t have any information that distinguishes them, then you’re correct that they will all be the same.