Sample_ppc shape

Hi.
So I am using an observed data set that is N units long. I define my model, define my ObservedRV and give it the training set, and then sample it. Everything appears to work up to that point: the fitted parameters of my model look reasonable for the data set I gave it (it is a set we understand well). Then I try to draw a posterior predictive sample using it, and this is where things start to go wrong.

The input set is N units long (shape (N,)).
If I want to get 1 sample from my model, I would expect to use

post_pred = pm.sample_ppc(model=model, samples=1, size=1, trace=trace)

but when I do this, I get a strange shape out.

post_pred['output'].shape
(1,N)

Why is it retaining the shape of the training data?

Here is a more detailed code example:

train.shape
(N, 1)
train = train.values.ravel()
train.shape
(N,)
model = pm.Model()
with model:
  sigma = pm.HalfNormal('sigma', 0.1)
  nu = pm.HalfNormal('nu', 1)
  mu = pm.Normal('mu', 0.0)
  r = pm.StudentT('output', mu=mu, nu=nu, sd=sigma, observed=train)
  step = pm.NUTS(vars=[mu, nu, sigma], gamma=0.25)
  trace = pm.sample(draws=2000, step=step, tune=500, chains=20, cores=9)

post_pred = pm.sample_ppc(model=model, samples=1, size=1, trace=trace)
post_pred['output'].shape
(1, N)

My expected result is

post_pred['output'].shape
(1,)

Actually, the old behavior outputs (1,), but I recently changed it.

The main reason for changing it is to have consistent behavior between posterior and prior sampling. Before my change, if you wanted to draw prior samples (i.e., sample from the prior distribution and the prior predictive distribution), you sometimes needed to write down the shape explicitly:

r = pm.StudentT('output', mu=mu, nu=nu, sd=sigma, observed=train, shape=train.shape)

as the observed variable does not inherit its shape from observed. This is quite inconvenient, since users would need to change the way the model is written depending on whether they are sampling from the prior or the posterior.
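
For example (a minimal sketch, assuming a PyMC3 version that provides pm.sample_prior_predictive), prior predictive draws now pick up the shape of the observed data without an explicit shape= argument:

with model:
    prior_pred = pm.sample_prior_predictive(samples=500)

prior_pred['output'].shape
(500, N)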

Now, this comes to the question of why we would want the observed random variable to have the same shape as the observed input. It is a bit of a subtle point; the key is the likelihood function of the parameters. By that I mean that the likelihood function changes as you have more or less data, so the number of observations (i.e., the shape) changes the geometry of the likelihood function (you can have a look at my recent talk at PyData Berlin: https://github.com/junpenglao/All-that-likelihood-with-PyMC3/tree/pydata_Berlin). On the computation side, you want the actual observed data and the generated (from the prior or posterior) fake observed data to have the same shape, so that when you pass the (actual or fake) observed array around and evaluate it with some function (e.g., the likelihood function), the result is consistent.
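
Concretely (a minimal sketch reusing the model, trace and 'output' names from the question above), each posterior predictive draw is a replicated data set of the same length as the training data, so it can be evaluated or compared exactly like the real observations:

post_pred = pm.sample_ppc(trace, samples=500, model=model)
post_pred['output'].shape    # (500, N): each row is one fake data set
fake_data = post_pred['output'][0]    # shape (N,), same as train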

I hope this clears things up a bit.


This is a little difficult for me to wrap my head around immediately, but I think I get the gist of why this is the default behaviour.

I would like to ask my question in more detail, to see if you can help me understand what a correct model, and a correct prediction, look like.

My intent is to take waterflow data (a time series with cyclic patterns), train a model on it, and produce predictions/probabilities for the next month. The training set is years of USGS waterflow data: an input of 5,000 measurements over 10 years, with outputs being the probabilities for the waterflow over the next month.

This data set is one I chose to teach myself how to use PyMC3 and to introduce myself to Bayesian modelling. I do have a strong background in Monte Carlo simulations in physics with Geant4, but I am trying to branch out into a more general framework.

I have set several milestones for myself in this project.

  1. Model the flow (time independent).
  2. Produce a probability measurement for the flows for the next 30 days using the time-independent model (10,000 simulations).
  3. Produce a time-dependent model with annual cycles.
  4. Produce a probability measurement for the flows for the next 30 days using the time-dependent model with annual cycles.
  5. Produce a time-dependent model with 4 seasonal cycles per annual cycle.
  6. Produce a probability measurement for the flows for the next 30 days using the time-dependent annual model with seasonal cycles.

The current status of my project is a very simple model (time independent, with a StudentT distribution for the waterflow levels); I just wanted to train the model on these values and then predict the next 30 points. That is why I found it very concerning that I could only get back 5,000 points (the size of the training set).

What would be the correct way for me to approach milestone 2 of my project plan? Being able to produce specific prediction sets will be a requirement for working on the later stages of this project.

Edit: I do want to apologize if this question is too naive. My introduction to Bayesian statistics has been very shallow thus far.

There are many ways to do so, but the general idea is that you are trying to predict future data (e.g., the flows of the next 30 days). sample_ppc gives you the “predictive” distribution of the observed data, but not a prediction of the future (as an aside, there are many interesting discussions regarding the double use of data in the Bayesian sense). What you want is p(y_{new} | x, x_{new}, \theta), with \theta being the posterior of the parameters; in other words, the new predictor (time, in this case) needs to be incorporated into the prediction.

The easiest way (in the PyMC3 sense) is to use a theano.shared variable for the predictor and the observed data; there is an example here: http://docs.pymc.io/notebooks/posterior_predictive.html#Prediction and a minimal sketch below.
Otherwise, you can build a new model, or add a new node to the original model, for prediction; there is an example here: https://junpenglao.xyz/Blogs/posts/2017-10-23-OOS_missing.html
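
Here is a minimal sketch of the shared-variable approach (a hypothetical toy regression, not the waterflow model; the names x_train, y_train, x_new are illustrative):

import numpy as np
import pymc3 as pm
from theano import shared

# toy training data: a predictor (e.g. time) and an observed outcome
x_train = np.linspace(0., 10., 100)
y_train = 2.0 * x_train + np.random.normal(0., 1., size=100)

x_shared = shared(x_train)
y_shared = shared(y_train)

with pm.Model() as model:
    a = pm.Normal('a', 0., 10.)
    b = pm.Normal('b', 0., 10.)
    sigma = pm.HalfNormal('sigma', 1.)
    mu = a + b * x_shared
    pm.Normal('y', mu=mu, sd=sigma, observed=y_shared)
    trace = pm.sample(1000, tune=1000)

# swap in the new predictor values (the next 30 points) together with
# dummy observations of matching length, then draw predictive samples
x_new = np.linspace(10., 13., 30)
x_shared.set_value(x_new)
y_shared.set_value(np.zeros(30))

post_pred = pm.sample_ppc(trace, samples=1000, model=model)
post_pred['y'].shape    # (1000, 30): predictions for the 30 new points

A waterflow model would replace the linear mean with whatever time-dependent structure you end up using; the mechanics of swapping the shared values stay the same.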

What you might also find useful is the Gaussian Process Mauna Loa example: http://docs.pymc.io/notebooks/GP-MaunaLoa.html which is a time series model that also includes predictions of the future.


Thank you. The Mauna Loa example looks exceptionally relevant to my current project.

If you would let me ask one more question (to check my understanding):

pm.sample does not predict. It uses the data to determine the best-fit parameters of the model.
pm.sample_ppc does not predict the future, but rather tries to reproduce the data that was used to build the model. This tells us whether our model looks like our data (is it a good model?).

To predict future data, I need to train a model (for a time dependent series with cyclic behaviours, this will be a multivariate distribution) that is built correctly with sample (I can check this by comparing the observed data with results from sample_ppc), and then I need to separately sample this distribution to create predictions/simulations of possible future behaviour?

Am I understanding the intended workflow of PyMC3 a bit better?

Edit: Also, thank you for posting the videos to your Machine Learning Workshop in May. They are proving most informative.

Yes - but fitting might not be the best description here. Rather, think of the samples as a representation of your posterior (i.e., they give you information about what the underlying space, or the space that is most relevant when you are trying to take expectations, looks like).

Exactly, and measures like LOO or WAIC formalize these intuitions.
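
For instance (a minimal sketch, reusing the model and trace from earlier in the thread), PyMC3 exposes these as pm.waic and pm.loo:

with model:
    waic = pm.waic(trace)    # widely applicable information criterion
    loo = pm.loo(trace)      # Pareto-smoothed importance sampling LOO

print(waic)
print(loo)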

Yes, that’s the conceptual workflow of a Bayesian analysis :slight_smile:
