I want to implement a basic capture-recapture model in PyMC3 (you capture 100 animals and mark them, then you release them and, once they have mixed, recapture another 100 and record how many are marked). This is my code:
import numpy as np
import pymc3 as pm
import arviz as az

# Data:
K = 100    # marked animals in first round
n = 100    # captured animals in second round
obs = 10   # observed marked in second round

with pm.Model() as my_model:
    N = pm.DiscreteUniform("N", lower=K, upper=10000)
    likelihood = pm.HyperGeometric('likelihood', N=N, k=K, n=n, observed=obs)
    trace = pm.sample(10000, tune=2000)
    print(pm.summary(trace))
    print(trace['N'])
    ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N", "likelihood"])
    data_ppc = az.from_pymc3(trace=trace, posterior_predictive=ppc)  # create inference data
    az.plot_ppc(data_ppc, figsize=(12, 6))
But I get some warnings, and plot_ppc fails with the error ValueError: x and y must have same first dimension, but have shapes (1,) and (0,). If I add some prints, I can see that data_ppc contains values for both N and likelihood.
What is happening and what can I do to obtain a posterior predictive check plot?
I do not get the error you describe when I run your code. I do get the warning about requesting fewer posterior predictive samples than there are samples in the trace (thrown by the call to sample_posterior_predictive()), but that's separate. What version of PyMC are you running?
(I’m running the same versions of PyMC and Arviz as you)
The shape problem is happening because your observed data, obs = 10, is a scalar. In addition, since N is not observed, there is no corresponding observed value for it, and ArviZ complains. You have several options:
Set observed = False in az.plot_ppc, and stick to only plotting “likelihood”. This will give you a posterior predictive plot, but the results are still a bit screwy. Here is what I got doing this:
Use az.plot_posterior instead of az.plot_ppc. This gives you basically the same information, minus the blue blobs and the load time.
The screwy results are happening because of the “100” in ppc = pm.sample_posterior_predictive(trace, 100, var_names=["N", "likelihood"]). Just remove that and let it draw a number of samples equal to n_chains x n_draws, which is the recommended default.
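In code, those two options could look something like this (a minimal sketch reusing my_model, trace and data_ppc from your snippet; I haven't run it against your exact versions, so treat it as a starting point):

with my_model:
    # no explicit "100": draw n_chains x n_draws posterior predictive samples
    ppc = pm.sample_posterior_predictive(trace, var_names=["N", "likelihood"])
    data_ppc = az.from_pymc3(trace=trace, posterior_predictive=ppc)

# option 1: hide the scalar observed data and only plot "likelihood"
az.plot_ppc(data_ppc, var_names=["likelihood"], observed=False, figsize=(12, 6))

# option 2: plot the posterior of N directly
az.plot_posterior(data_ppc, var_names=["N"])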
Thanks. I think the plot you generated helps me a lot. Now the only warnings I get are a FutureWarning and “The number of effective samples is smaller than 25% for some parameters.” I’m not sure of the relative importance of this last warning. Could it be related to the fact you mentioned that obs = 10 is a scalar? I tried obs = [10, 12, 19, 9, 15, 7], re-ran it, and the warning remained. Perhaps with a bigger vector?
A low effective sample size means that the samples the MCMC procedure generated are highly correlated, so they’re not “truly random”. In a way we’re spoiled by having the NUTS sampler; otherwise we’d pretty much always expect to have correlated samples. It could be worse: I’ve sampled models where I ran the thing for days and got fewer than 100 effective samples. It’s not a pathology like divergences are; it just means that you have to be aware that you have less information than you think you do. The rule of thumb I’ve heard is that 1,000 effective samples is usually enough to do inference, and even at the 25% threshold your 10k draws give you 10k x 0.25 = 2,500, so you should be fine.
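If you want to look at the numbers behind that warning, the ESS estimates are easy to pull up (a quick check, assuming a reasonably recent ArviZ where the summary columns are called ess_bulk and ess_tail):

# effective sample size estimates for each variable
print(az.summary(data_ppc, var_names=["N"]))  # look at the ess_bulk / ess_tail columns
print(az.ess(data_ppc))                       # or compute the ESS on its own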
As for the FutureWarning, just do what it says, and you can get rid of that az.from_pymc3 call as well. I recommend always passing return_inferencedata=True when you sample, because it makes interfacing with ArviZ much nicer.
If you have only a single observation, you can’t do a KDE or histogram plot, but you should still be able to use kind="scatter" and check that the observation is in a high-probability region of the posterior predictive.
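For example (a minimal sketch, reusing data_ppc from your code):

# a KDE/histogram of a single observed value is not meaningful;
# the scatter kind still shows where the observation falls
az.plot_ppc(data_ppc, kind="scatter")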
The difference between the “mean” and “not mean” lines is only that the histogram or KDE is generated either from the samples of a single draw and chain or by combining all available draws and chains. Differences between the two lines are a clear signal that something is going wrong. In this case there is only one observation, so the non-mean lines (the blue ones) are histograms of a single value, which doesn’t really make sense. ArviZ has a hard rule for histograms of discrete data that the bin width can never be smaller than 1. It looks like the lines above are single-bin histograms that don’t even have contiguous bins, so no vertical bin edges are shown. As I said above, the scatter kind is better suited in this case.
This is not exactly a posterior predictive check. First, it is plotting a latent variable, so it can’t be compared to any observation. Second, it is comparing the “mean” type line, a histogram of the samples from all draws and chains, to random subsets of samples from multiple chains and draws (not all of them). The basic idea of a posterior predictive check is comparing the observed data to in-sample posterior predictive samples: if the model is correct, one should not be able to distinguish the observed data from the posterior predictive samples. By in-sample posterior predictive samples I mean posterior predictive samples that correspond to predictions on the exact same data that was used to fit the model (i.e., in a regression, use the same x instead of changing it to generate actual predictions/out-of-sample posterior predictive samples).
Just as a note, N is a latent/posterior variable that is sampled in pm.sample. Adding it to var_names in sample_posterior_predictive only copies the samples already stored in trace (which go to data_ppc.posterior) into the ppc dictionary (which goes to data_ppc.posterior_predictive). At best it will be confusing to ArviZ, because there will be variables in the posterior_predictive group without a corresponding variable in the observed_data group, so some of the automation won’t work. At worst it will be confusing for you and for people reading the code at a conceptual level. Only deterministics sometimes need to be explicitly added there, even though they are not posterior predictive variables, when generating predictions.
I’d recommend taking care of the FutureWarning, as it is probably there to prepare you for an easy transition to v4. The effective sample size warning doesn’t necessarily mean anything. In this case, as the sampler is Metropolis and not NUTS, low efficiency is expected. As @jessegrabowski said, you need to check that the effective sample size (ESS) is high enough, and for that you’ll need many more actual samples than you would with NUTS and a model with continuous variables. In fact, this warning has been removed in v4 as it is more confusing than helpful, especially when not using NUTS as the sampler. It is not related to having 1, 10 or 200 observations.
One of the main pain points of PyMC 3.x is shape handling. It is often inconsistent between scalars and length-1 vectors, which I suspect is the root of this issue. There is probably a decoy dimension likelihood_dim_0 in the posterior_predictive group that is not in the observed data, or vice versa. Assuming it is in the posterior_predictive, and taking into account comments from other answers, I’d recommend something like the following (note that I haven’t run any of the code, so some tweaks might be needed):
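Something like this, again untested (I’m assuming PyMC3 3.11 / ArviZ 0.11, reusing the names from the question, and using np.squeeze as one way to drop the suspected decoy dimension):

with pm.Model() as my_model:
    N = pm.DiscreteUniform("N", lower=K, upper=10000)
    likelihood = pm.HyperGeometric('likelihood', N=N, k=K, n=n, observed=obs)
    # return an InferenceData directly (this also takes care of the FutureWarning)
    idata = pm.sample(10000, tune=2000, return_inferencedata=True)
    # no "100" and no "N": only the observed variable, default number of samples
    ppc = pm.sample_posterior_predictive(idata, var_names=["likelihood"])
    # squeeze out a possible length-1 decoy dimension so the posterior predictive
    # shapes match the scalar observed data
    ppc = {name: np.squeeze(values) for name, values in ppc.items()}
    idata.extend(az.from_pymc3(posterior_predictive=ppc))

az.plot_ppc(idata, kind="scatter")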
Thank you for this clarification! I wrongly assumed that the PPC was doing a type of simple bootstrapping from the posterior over the observed variable, and extended that logic to the latent variables. I didn’t notice that pm.sample_posterior_predictive draws samples for each observation, so you get back n_obs x n_chains x n_draws.
Thanks to you both. I have relaunched with Oriol’s code and this is the image I obtain: I can see the observations, nice, but only 5 points for the posterior predictive. Do you know why this is so? Can I choose to have more points? Thanks.
OK, I’ll just answer this myself. Looking closely at the ArviZ docs for plot_ppc, it seems that the default for a scatter plot is 5 points, but you can tune it with num_pp_samples. I am still not sure what the relative height of the blue points means; perhaps it is just a plotting convenience, i.e., choosing positions that do not obscure the plot?
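For example (assuming idata is the InferenceData from the snippet above):

# draw more posterior predictive points into the scatter plot (the default is 5)
az.plot_ppc(idata, kind="scatter", num_pp_samples=100)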