How to interpret a very basic PyMC3 model?

Hello everyone,
I am a beginner with PyMC3 and I can't quite figure out how to use it. In this example I am trying to find
the distribution of just one variable, but I can't get it to work. Could you please look at my model and my interpretation of it and tell me where I am going wrong?

import numpy as np
import pymc3 as pm
import matplotlib.pyplot as plt

data = (np.random.randn(20) + 15) * 2
with pm.Model() as model:
    mean = pm.Uniform('mean', lower=0, upper=35)
    std = pm.Uniform('std', lower=0, upper=5)
    X = pm.Normal('X', mu=mean, sd=std, observed=data)
    n_samples = 10000
    approx = pm.fit(n=n_samples, method=pm.ADVI())
ndraws = 1000
trace = approx.sample(draws=ndraws)
ndraws2 = 100
samples = pm.sample_posterior_predictive(trace, var_names=['X', 'std', 'mean'], model=model, size=ndraws2)
samples = samples['X']
avg = np.mean(samples[200:], axis=0)
plt.hist(avg)
plt.hist(data)

Result: (attached plot: overlaid histograms of avg and data)

Here is what I think I am telling PyMC3:

  1. I have some data. I don’t know the distribution of the data but I suspect it might be Gaussian (Hence X = pm.Normal(…)).
  2. I have no idea what the mean and standard deviation of the data are, but they are probably between 0 and 35, and between 0 and 5, respectively. (Hence mean = pm.Uniform('mean', lower=0, upper=35) and std = pm.Uniform('std', lower=0, upper=5))
  3. Find the distribution of my data (Hence approx = pm.fit(n = n_samples, method = pm.ADVI()))
  4. Give me 1000 samples for what you think the values of 'mean' and 'std' might be (ndraws = 1000, trace = approx.sample(draws = ndraws))
  5. Using each of these 1000 values for 'mean' and 'std', produce 100 values for 'X' (ndraws2 = 100,
    samples = pm.sample_posterior_predictive(trace, var_names=['X'], model=model, size=ndraws2))

I expected the two histograms to be almost the same, so why aren't they? The mean seems to match, but the variance is very off.

Hi @vtac!
I think the problem lies in understanding what sample_posterior_predictive gives you.
As far as I can see, you only get the "mean" from it, and cannot predict what goes into the sd keyword of your X. Considering this, the results don't look bad at all. Think of the narrower histogram as the "standard error of the mean", though the statistical rigor police might arrest me for drawing that analogy.
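The "standard error of the mean" effect is easy to see with plain numpy. The array below is a synthetic stand-in for your samples['X'] (not actual PyMC3 output): averaging over many draws per data point, which is what np.mean(samples[200:], axis=0) does, shrinks the spread by roughly the square root of the number of draws, and that is why the avg histogram is so much narrower than the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "posterior predictive" draws: 1000 sets of 20 observations,
# drawn from a normal similar to the toy data (mean 30, sd 2).
draws = rng.normal(loc=30, scale=2, size=(1000, 20))

# Averaging over the 1000 draws (the axis=0 mean in the original code)
# leaves 20 values whose spread is roughly 2 / sqrt(1000), not 2.
avg = draws.mean(axis=0)

print(draws.std())  # close to 2, the actual predictive spread
print(avg.std())    # much smaller, roughly 2 / sqrt(1000)
```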

Besides that, the samples['X'] you get shouldn't be averaged. They are a numpy array with a certain shape: one dimension equals the number of observations, another one the number of posterior draws… This can be quite confusing, and I think a rework is under way for pymc 4; I tend to just flatten out the predictions (.ravel()), but that might be too short-sighted.
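As a sketch of what I mean, here is the flattening approach with a synthetic array standing in for samples['X'] (the exact shape of the real thing depends on your pymc3 version and the size argument you passed): ravel everything into one long 1-D array and histogram that, instead of averaging.

```python
import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend so this sketch runs anywhere
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Synthetic stand-in for samples['X'], e.g. (size, draws, observations);
# the real shape depends on your pymc3 version and arguments.
ppc = rng.normal(loc=30, scale=2, size=(100, 50, 20))

flat = ppc.ravel()  # one long 1-D array of all predictive values
print(flat.shape)   # (100000,)

# Histogram the flattened predictions directly, with no averaging.
plt.hist(flat, bins=50, density=True, alpha=0.5, label='posterior predictive')
plt.legend()
plt.savefig('ppc_hist.png')
```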

Hope this helps!

Unless you know what you're doing, I recommend sticking with MCMC sampling instead of ADVI.

I’m not sure if I understood what you’re trying to achieve, but how about this?

import numpy as np
import pymc3 as pm
import arviz as az

data = (np.random.randn(20)+15)*2
with pm.Model() as model:
    data_mean = pm.Uniform("data_mean", lower=0, upper=35)
    data_std  = pm.Uniform("data_std", lower=0, upper=5)
    X = pm.Normal("X", mu=data_mean, sd=data_std, observed=data)

    # draw posterior samples by MCMC
    idata = pm.sample(return_inferencedata=True)
az.plot_forest(idata);

The forest plot shows each chain separately. (I had 3 chains.)

az.summary(idata)

The ArviZ summary gives you a highest density interval (HDI) which is the narrowest credible interval.
So with this, the mean lies in the interval [28.810, 30.917] with a 94 % probability.
Likewise, the std lies in the interval [1.773, 3.323] with a 94 % probability.
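To unpack what "narrowest credible interval" means, here is a minimal numpy sketch of an HDI computation. The hdi function is a hypothetical helper for illustration (ArviZ's actual implementation may differ in its details), and the draws below are a toy stand-in for a posterior, not real sampler output.

```python
import numpy as np

def hdi(samples, mass=0.94):
    """Narrowest interval containing `mass` of the samples
    (a sketch of what ArviZ reports as hdi_3% / hdi_97%)."""
    s = np.sort(samples)
    n = len(s)
    k = int(np.ceil(mass * n))          # number of points the interval must cover
    widths = s[k - 1:] - s[:n - k + 1]  # width of every candidate window of k points
    i = np.argmin(widths)               # pick the narrowest window
    return s[i], s[i + k - 1]

rng = np.random.default_rng(2)
post = rng.normal(loc=29.9, scale=0.55, size=10_000)  # toy posterior draws

lo, hi = hdi(post)
print(lo, hi)  # for normal draws, close to mean +/- 1.88 * sd
```

By construction the returned interval contains 94% of the draws and is the shortest such interval, which is why the HDI can be narrower than an equal-tailed interval for skewed posteriors.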
