Apparent Posterior Bias With Increased Number of Data Points

I am new to PyMC and have run into an odd issue: the more data points I use, the worse my answer gets.

I have been following this example of modeling a Gaussian distribution with an unknown standard deviation (see especially equations 1 and 6). I create my sample by drawing N = 50 data points from a normal distribution with standard deviation 2.5 using numpy.random.normal().

To estimate the standard deviation, I use a Poisson likelihood whose mean (rate) at each data point is the value of a normal density with the standard deviation as a free parameter.

import numpy as np
from pymc import Model, Uniform, Poisson, sample

true_sigma = 2.5
num = 50  # Length of dataset
vs = np.random.normal(0, true_sigma, num)  # Draw from a normal with standard deviation 2.5
counts = np.ones_like(vs)  # Every observation gets a "count" of 1

with Model() as model:
    # Define prior
    sigma = Uniform('sigma', lower=0, upper=20)

    # Define likelihood: Poisson rate set to an (unnormalized) Gaussian density
    likelihood = Poisson('y', mu=1 / sigma * np.exp(-vs**2 / (2 * sigma**2)), observed=counts)
    idata = sample(10000, tune=5000)

When I run PyMC on this, the posterior for the standard deviation is very close to the true value (2.5) used to create the data.

When I increase the number of data points (N=500, 5000), the posteriors become increasingly offset from the true value until eventually the correct number is completely ruled out.
These are the resulting posteriors for different dataset sizes:

What is happening here?
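One way to probe this outside of PyMC: with every observed count equal to 1, the Poisson pmf reduces to P(1 | λ) = λe^{-λ}, so each data point contributes log λ − λ to the log-likelihood rather than log λ alone. A plain-numpy grid search over σ (the seed, grid, and N here are illustrative, not from the post) compares where that expression peaks against the plain Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
true_sigma = 2.5
vs = rng.normal(0, true_sigma, 5000)  # large N, where the offset is pronounced

sigmas = np.linspace(1.0, 5.0, 401)  # candidate sigma values

# Poisson rate used in the model: an unnormalized Gaussian density
lam = (1 / sigmas[:, None]) * np.exp(-vs[None, :] ** 2 / (2 * sigmas[:, None] ** 2))

# Log-likelihood implied by Poisson observations of all-ones counts:
# log P(1 | lam) = log(lam) - lam
poisson_loglik = (np.log(lam) - lam).sum(axis=1)

# Log-likelihood of the same data under a plain Gaussian model
gauss_loglik = (-np.log(sigmas[:, None])
                - vs[None, :] ** 2 / (2 * sigmas[:, None] ** 2)).sum(axis=1)

print("Poisson-surrogate peak:", sigmas[poisson_loglik.argmax()])
print("Gaussian peak:         ", sigmas[gauss_loglik.argmax()])
```

The two curves peak in different places: the Gaussian log-likelihood peaks near 2.5, while the extra −λ term shifts the Poisson-based curve's maximum upward, and its influence grows with N.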


I didn’t read through the source that closely, but why Poisson? If you just want to sample from an arbitrary likelihood, you can use pm.Potential(), but it really depends on what you are trying to do.
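For context, pm.Potential('name', logp) simply adds an arbitrary log-probability term to the model's joint log-density. The term you would supply for a zero-mean Gaussian likelihood is the summed log-density; here is a plain-numpy sketch of it (inside a model block you would build the same expression from the sigma random variable with pytensor ops instead of numpy):

```python
import numpy as np

def gaussian_logp(vs, sigma):
    """Summed log-density of observations under Normal(0, sigma).

    This is the kind of expression you would hand to pm.Potential.
    """
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - vs**2 / (2 * sigma**2))

# Sanity check: a single point at 0 with sigma = 1 gives log(1 / sqrt(2 * pi))
print(gaussian_logp(np.array([0.0]), 1.0))
```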

Could you explain a bit more about why 1/sigma * np.exp((-(vs)**2)/(2 * sigma**2)) is supplied as the rate (mean) parameter for the Poisson here? I see that it’s a Gaussian density, but I don’t understand how it’s being used.
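One concrete observation: that expression is the Normal(0, sigma) density without its 1/sqrt(2*pi) normalizing constant. A quick numpy check (with illustrative values) confirms the constant ratio:

```python
import numpy as np

sigma = 2.5
v = np.linspace(-5, 5, 11)

expr = 1 / sigma * np.exp(-v**2 / (2 * sigma**2))  # the rate used in the model
pdf = np.exp(-v**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))  # full Normal(0, sigma) pdf

print(expr / pdf)  # the same value, sqrt(2 * pi), at every v
```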