Understanding the effect of small data size on the posterior distribution

While performing some diagnostics on a model, I noticed a gap in my conceptual understanding.

import numpy as np
import pymc as pm

data_array = np.array([1.726184, 1.567740, 1.637396, 1.584314])

def build_model_and_return_stats(obs_data):
    with pm.Model() as model_mix:
        # two-component Gaussian mixture with informative priors on the component means
        mu1 = pm.Normal('mu1', mu=1.73, sigma=0.035)
        sigma1 = pm.HalfNormal('sigma1', 0.03)
        mu2 = pm.Normal('mu2', mu=1.85, sigma=0.025)
        sigma2 = pm.HalfNormal('sigma2', 0.02)

        norm1 = pm.Normal.dist(mu=mu1, sigma=sigma1)
        norm2 = pm.Normal.dist(mu=mu2, sigma=sigma2)

        w = pm.Dirichlet('w', a=np.array([1, 1]))

        like = pm.Mixture('like', w=w, comp_dists=[norm1, norm2], observed=obs_data)

        trace_mix = pm.sample()

    with model_mix:
        trace_mix.extend(pm.sample_prior_predictive())
        trace_mix.extend(pm.sample_posterior_predictive(trace_mix))

    prior_pred = trace_mix.prior_predictive['like'].to_numpy().reshape(-1)
    post_pred = trace_mix.posterior_predictive['like'].to_numpy().reshape(-1)

    return np.median(prior_pred), np.median(post_pred)

prior_pred_frag, post_pred_frag = build_model_and_return_stats(data_array)
print(prior_pred_frag)
print(post_pred_frag)
print(f'% change: {1 - post_pred_frag / prior_pred_frag:.1%}')

We see that just 4 data points shift the median of the predictive distribution down by 6.2% (prior predictive vs. posterior predictive median). This seems like a pretty drastic influence from just 4 points.

That got me thinking about what the actual likelihood distribution would look like. Since there are only 4 data points, does that mean that all values that don’t appear among these 4 have a 0% probability? Is there a way to visualize what the likelihood distribution looks like? If we had 100 points, how would the distribution and the conclusions change?

You can visualize your likelihood with prior and posterior predictive sampling.

The likelihood does not change with data, so it does not mean that all other values have 0% probability. It only specifies how the observed data should influence your beliefs about other model parameters.

The likelihood does not change with the data? Isn’t the likelihood the probability of seeing the data given our beliefs (most often about the values of the distribution parameters)? So, by this definition, any change in the data will change the likelihood.

The likelihood is a function that is fixed per model. What changes when you do inference are the parameter values that parametrize your likelihood, not the likelihood function itself.
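As a rough illustration of that distinction (plain NumPy/SciPy, separate from the PyMC model above; the parameter values below are arbitrary): the likelihood is one fixed function of parameters and data, and inference only changes which parameter values get plugged into it.

import numpy as np
from scipy import stats

data = np.array([1.726184, 1.567740, 1.637396, 1.584314])

# One fixed function: log L(mu, sigma; data) for a single normal component
def log_likelihood(mu, sigma, x=data):
    return stats.norm(mu, sigma).logpdf(x).sum()

# Inference moves these parameter values around; the function itself never changes
print(log_likelihood(1.73, 0.035))
print(log_likelihood(1.60, 0.035))
print(log_likelihood(1.63, 0.060))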

Yes, the function is fixed, but the outcome of the likelihood line in PyMC is a distribution that shows the probability of seeing the data given the beliefs.

We’re getting into the weeds here; my main question still stands: how can just 4 points have such a drastic effect on the posterior distribution?

The likelihood actually gives the probability of the parameters, given the data. The data are fixed in the likelihood term of Bayes’ rule, so you can’t talk about the “probability of observing the data” in this context.
Edit: Nope Nope Nope. I totally pepega’d this. Ignore me. The remaining comments are valid, though.

Junpeng Lao gave an excellent talk about the ins and outs of likelihood functions and how they are updated given new data. Perhaps this might be useful? It’s quite hard to answer what is going on in your model specifically because it is quite complex. The fact that it is a mixture model might have something to do with it; there is a discontinuity built into the model right up front, which makes it difficult to reason about marginal changes.

It might be useful to play around with some simple conjugate models, where you can actually work out the updated posterior one observation at a time, to gain a deeper understanding of what is going on.
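For instance (a minimal sketch with made-up hyperparameters, not one of the linked notebooks): a normal prior on the mean with a known observation sigma has a closed-form update you can apply one observation at a time.

import numpy as np

data = np.array([1.726184, 1.567740, 1.637396, 1.584314])

mu_n, tau_n = 1.73, 0.035   # prior mean and sd for mu (placeholder values)
sigma = 0.03                # observation sd, assumed known here

for x in data:
    # standard normal-normal update with known observation variance
    precision = 1 / tau_n**2 + 1 / sigma**2
    mu_n = (mu_n / tau_n**2 + x / sigma**2) / precision
    tau_n = np.sqrt(1 / precision)
    print(f'after x={x:.3f}: posterior mean={mu_n:.4f}, sd={tau_n:.4f}')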

I was about to say :). So we can in fact talk about the probability of observing the data… And I don’t see why a mixture model would make it impossible to examine that distribution. It’s complex, but nothing out of the ordinary when modeling a real-world phenomenon (which my example is based on).

Thanks for the links.

Yeah, my bad on getting it totally backwards. And I don’t mean to say it’s impossible, just that it’s more difficult than in the case where you have a closed-form posterior and can look at how things change point by point.

As a case in point, I graphed the prior and posterior predictive distributions. Several interesting things happen. In the prior, the “right” class was slightly more peaked, while in the posterior the left class utterly dominates, since all the data points evidently came from that one. The new mode is roughly centered on the observed data, but the likelihood is still fully bimodal (it’s not ruling out that future data could come from the other part of the mixture). It’s also worth noting that three of the four data points come from a region that the prior likelihood function placed almost zero weight on, so it really had to re-evaluate its life choices after that.

To me this illustrates my point about it being “more difficult” to think about what “should” happen. On one hand, everything in this picture looks very reasonable to me, but on the other hand, I couldn’t write down a mathematical argument for why it should be so.

Thanks for plotting that. So your last point here is what confused me the most and prompted the question to begin with. The posterior distribution is the product of the prior and likelihood distributions. If the prior distribution is near 0 in the 1.5-1.6 range, then irrespective of the data, the posterior should be 0 in that range as well.

This one I won’t get wrong – it’s proportional to that product, but not equal to it. If we had access to the full closed-form posterior, the evidence term would re-normalize everything. Recall that the likelihood isn’t a valid probability distribution, so the product of the likelihood and the prior isn’t either.
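A toy grid calculation makes that concrete (a generic single-normal model with an arbitrary fixed sigma, not the mixture above): multiply prior and likelihood pointwise, then divide by the integral; that division is what turns the product back into a proper distribution.

import numpy as np
from scipy import stats

data = np.array([1.726184, 1.567740, 1.637396, 1.584314])

mu_grid = np.linspace(1.4, 2.0, 601)
prior = stats.norm(1.73, 0.035).pdf(mu_grid)                               # toy prior over mu
lik = np.array([stats.norm(m, 0.05).pdf(data).prod() for m in mu_grid])    # joint likelihood, sigma fixed at 0.05

unnormalized = prior * lik
posterior = unnormalized / np.trapz(unnormalized, mu_grid)  # the evidence is this normalizing integral

print(np.trapz(posterior, mu_grid))   # ~1.0 after normalization
print(mu_grid[np.argmax(posterior)])  # posterior mode sits where prior * likelihood peaks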

Also, the prior probability isn’t truly zero below 1.6; it’s just really small. If you run the model again with a TruncatedNormal prior on mu1 and mu2 instead of a Normal prior, and enforce a hard cut-off at 1.6, you will get the results you expected (actually, I think the model would diverge/throw nans, because you are telling it the data are impossible).
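If you want to try that, the swap would look roughly like this (a sketch only; it reuses data_array from the top of the thread, keeps the original hyperparameters, and I haven’t run it):

import numpy as np
import pymc as pm

with pm.Model() as model_trunc:
    # hard lower bound at 1.6 on the component means
    mu1 = pm.TruncatedNormal('mu1', mu=1.73, sigma=0.035, lower=1.6)
    sigma1 = pm.HalfNormal('sigma1', 0.03)
    mu2 = pm.TruncatedNormal('mu2', mu=1.85, sigma=0.025, lower=1.6)
    sigma2 = pm.HalfNormal('sigma2', 0.02)

    w = pm.Dirichlet('w', a=np.array([1, 1]))
    comps = [pm.Normal.dist(mu=mu1, sigma=sigma1), pm.Normal.dist(mu=mu2, sigma=sigma2)]
    pm.Mixture('like', w=w, comp_dists=comps, observed=data_array)

    trace_trunc = pm.sample()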

What I meant is that the likelihood function does not change depending on the data. It evaluates to different values depending on the data you have, but the function itself is not changed or updated, whereas the prior does “change” with the data: it gets updated into a posterior whose functional form is generally different from the prior’s (except in special conjugate cases, where only the parameters change).

I assumed you were talking about that kind of functional change when you said the likelihood “changes” with the data.

About the question of influence: it just means your prior was very diffuse and/or very different from your likelihood, leading to a strong influence from the latter.

Another case where a small number of data points can have a large effect on your predictions is outliers + a Gaussian likelihood. Even a single data point can move your expected mean (and/or variance) drastically away from the prior.
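A quick toy check of that (unrelated to the mixture model above; made-up data and priors): fit the same normal model to a handful of well-behaved points, with and without one wild observation, and compare the posterior mean of mu.

import numpy as np
import pymc as pm

clean = np.array([1.70, 1.72, 1.71, 1.73, 1.69])
with_outlier = np.append(clean, 2.5)   # one extreme observation

def posterior_mean(y):
    with pm.Model():
        mu = pm.Normal('mu', 1.7, 0.5)
        sigma = pm.HalfNormal('sigma', 0.5)
        pm.Normal('obs', mu=mu, sigma=sigma, observed=y)
        idata = pm.sample(progressbar=False)
    return idata.posterior['mu'].mean().item()

# the single outlier drags both the posterior mean and sigma noticeably
print(posterior_mean(clean), posterior_mean(with_outlier))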

This is really useful info. How did you plot these? The code would be really helpful.

Sure, here is a notebook. It’s mostly just seaborn’s kdeplot, pulling the likelihood out of the prior/posterior predictive samples.
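In rough outline it is something like this (not the notebook verbatim; it assumes trace_mix is the InferenceData from the model at the top, with the prior and posterior predictive groups attached):

import seaborn as sns
import matplotlib.pyplot as plt

prior_samples = trace_mix.prior_predictive['like'].to_numpy().reshape(-1)
post_samples = trace_mix.posterior_predictive['like'].to_numpy().reshape(-1)

sns.kdeplot(prior_samples, label='prior predictive')
sns.kdeplot(post_samples, label='posterior predictive')
plt.legend()
plt.show()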


Really useful. So I definitely agree with your statement that the posterior is proportional to the product of the prior and likelihood distributions. But even with that caveat about proportionality, for us to see a peak in the posterior distribution where the prior probability is really low (around 1.7), the likelihood function must place relatively high probability there.

What I still find confusing stems from my understanding of likelihood. McElreath, in Statistical Rethinking (2nd edition), defines likelihood as the relative number of ways our beliefs about the data (i.e., the distribution parameters) could produce the data. So we’re asking: given the prior distribution, what’s the probability of seeing the data? It seems to me that in the range 1.55 to 1.65 that probability should be relatively low, which then contradicts the posterior results we’re seeing.

Perhaps it would help to visualize the likelihood function too. Is that possible? I watched parts of the video you referenced, but because it’s all discrete examples, I’m not sure how to plot the likelihood from the PyMC model.

I don’t think we want to model THE data; we want to model the process that generated the data. Perhaps this is a point of confusion? You supplied the model with informative priors, which said the data-generating process should produce points between 1.6 and 1.9. Some of the observed data fall outside that range, but then again we only observe four data points. Should we abandon our original belief about the range 1.6-1.9? After all, four points isn’t so many. But the probability of the range 1.55 to 1.65 clearly shouldn’t stay low either, because we have strong evidence that the process in question does produce values in this range. So we have to compromise between this evidence and the priors.

A worked example with a normal likelihood begins at 24:54 in the video. You could also check out @junpenglao’s notebooks associated with the lecture here; perhaps you would be particularly interested in the notebook on mixture models, although they are written for PyMC3. You will have to do some conversion if you want to use them in v4.

Perhaps I don’t understand the distinction between the “likelihood”, which I plotted in the notebook, and a “plot of the likelihood function”? The object I plotted is (samples from) the joint likelihood function. If you want an analytical function, you will have to consider a simpler model, because closed-form likelihood functions are quite hard to come by.

Here is a notebook looking at iterative updates to a normal distribution with unknown mean and variance using the conjugate normal-inverse-gamma setup. In particular, we can look at how the joint likelihood over the range (1, 2) changes as we show the model each of your four data points, one at a time. Since the model is conjugate, each posterior becomes the prior for the next update. But it’s also very likely I made a mistake in my code. I invite you to check everything.
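The core of the update is just the standard normal-inverse-gamma formulas applied one observation at a time, roughly like this (placeholder hyperparameters, not the notebook’s actual values):

import numpy as np

data = np.array([1.726184, 1.567740, 1.637396, 1.584314])

# normal-inverse-gamma hyperparameters (placeholders for illustration)
mu_0, kappa_0, alpha_0, beta_0 = 1.73, 1.0, 2.0, 0.002

for x in data:
    # single-observation conjugate update
    mu_n = (kappa_0 * mu_0 + x) / (kappa_0 + 1)
    beta_n = beta_0 + kappa_0 * (x - mu_0) ** 2 / (2 * (kappa_0 + 1))
    kappa_n, alpha_n = kappa_0 + 1, alpha_0 + 0.5
    print(f'x={x:.3f}: mu={mu_n:.4f}, kappa={kappa_n}, alpha={alpha_n}, beta={beta_n:.5f}')
    mu_0, kappa_0, alpha_0, beta_0 = mu_n, kappa_n, alpha_n, beta_n   # posterior becomes the next prior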

As you can see, when a point falls far outside the prior likelihood, the posterior is pulled towards the point, but perhaps not as far as we might expect. The final likelihood distribution is quite a bit narrower than the one we see from the mixture model; after four iterations, the range 1.55-1.6 is not inside the body of the likelihood curve. Was this more what you were expecting? I suppose in the mixture model, the left distribution (which is the one I am working with here) has more freedom to shift towards zero, since it doesn’t have to explain the 1st data point at ~1.73 (the other mixture component can handle that).

We can also look at the logp as a function of parameter values given the observed data at each update step. You can see that there are many admissible combinations of mu and sigma2 in the “ellipse” at the bottom that could reasonably have generated the 4th data point, x=1.584, given the previous 3 data points and the priors, and that the range 1.55-1.65 is well within that ellipse for values of sigma2 around 0.1. If you check the notebook, those values are not in the most likely ellipse after the first iteration, since the first data point is almost exactly the prior mean (orange star, first figure).
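If it helps to reproduce that kind of picture, a bare-bones version of the grid evaluation looks something like this (my own sketch rather than the notebook’s code, and parameterized by sigma rather than sigma2):

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

data = np.array([1.726184, 1.567740, 1.637396, 1.584314])

mus = np.linspace(1.4, 1.9, 200)
sigmas = np.linspace(0.01, 0.2, 200)
MU, SIGMA = np.meshgrid(mus, sigmas)

# joint log-likelihood of all four points at each (mu, sigma) pair
logp = stats.norm(MU[..., None], SIGMA[..., None]).logpdf(data).sum(axis=-1)

plt.contourf(MU, SIGMA, logp, levels=30)
plt.xlabel('mu')
plt.ylabel('sigma')
plt.colorbar(label='joint log-likelihood')
plt.show()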

@jessegrabowski thanks again for the detailed response.

Perhaps I don’t understand the distinction between the “likelihood”, which I plotted in the notebook, and a “plot of the likelihood function”?

I will respond to this particular point now. So you plotted the prior and the posterior, but not the likelihood: you sampled from the prior predictive and the posterior predictive. There’s no readily available way to plot the likelihood from PyMC, which is what I’m asking about, but it would be useful for understanding how we got from the prior distribution to the posterior distribution.

You can use pm.logp to get the log probability for a random variable in your model graph, evaluated at a particular value. Given your original model:

# Compile an aesara function for re-use
y = at.vector('y')
f_logp = aesara.function([y], pm.logp(like, y))

# Evaluate the logp at some arbitrary points
f_logp([1.73, 2.0, 1.66])
>> array([  -6.79354953, -101.17086587,    0.93656281])

Is this what you’re looking for?

Probably, but which library is ‘at’ from? I don’t have that loaded in my env.

Sorry, that’s aesara.tensor (import aesara.tensor as at).
