It is possible to extract parameter values from the MultiTrace trace = pymc3.sample(...)
via trace.get_values('x', chains=[0])
. I would like to similarly extract posteriors (and likelihoods, priors, if possible) for the sampled parameter points, in a similar manner. Are those values stored in the trace and if so, how can one get them e.g. for a specified set of chains?
For priors you can use pm.sample_prior_predictive()
In general to extract specific data I tend to convert my trace into an Arviz InferenceaData object and use that, please see example here
Thanks @nkaimcaudle, via
pm_data = az.from_pymc3(trace)
pm_data.log_likelihood.to_array()
I have now been able to extract the log likelihood in the right format. From that I can in principle retrieve the other values, as priors are comparably cheap to compute post-hoc. I wonder whether from the trace alone it is possible to retrieve not only the llh but also either the log posterior value, or the prior value of the samples in pm_data.posterior
? This would avoid computing the prior twice.
In ArviZ InferenceData objects, there are several groups, each of which contains different information relevant to the inference process. A more detailed description can be found here (note it is still in progress https://github.com/arviz-devs/arviz/pull/1063). Thus, the first step is converting the trace
object to an ArviZ InferenceData one. The best doc on conversion (that I know of) is the link provided above by @nkaimcaudle. I’ll focus on handling the InferenceData
object once created.
log posterior
The log posterior value is stored in the sample_stats
group, it can be accessed as pm_data.sample_stats.lp
as a xarray DataArray (add a .values
at the end to convert to numpy array).
log likelihood
The log likelihood value is not stored, but it can be calculated from the pointwise log likelihood values which are stored. The pointwise log likelihood is stored in its own group, this is done to support models with multiple observed values, thus, the observed_data
and log_likelihood
groups will have the same variable names (with different dimensions and values obviously) in order to map each pointwise log likelihood value to the corresponding observed value.
I would recommend avoiding xarray.Dataset.to_array()
unless you are sure that either the dataset has only one variable or all its variables have exactly the same shape; otherwise the data cannot be represented as a single array and the conversion behaves in unexpected ways by broadcasting and duplicating values to make shapes match.
Ideally, pointwise log likelihood values should be accessed using the correct variable name. For example:
pm_data.log_likelihood.y
pm_data.log_likelihood.obs
again, use .values
to get numpy arrays instead of xarray objects.
A general way of retrieving the log likelihood with multiple variables with different shapes would be:
pm_data.log_likelihood.sum().to_array().sum() # add a .item() to get a scalar
The first sum will reduce all dimensions in the dataset, and leave us with a dataset whose variables are all scalars, we can then use to_array
as all variables have the same shape and then sum the contributions of each variable together. In the case of having a single variable, this can be reduced to
pm_data.log_likelihood.<var_name>.sum() # use .item() if needed
log prior
As for log prior values, they are not stored, only their samples are when using pm.sample_prior_predictive
as explained above.
Thanks @OriolAbril, that clarifies it completely. I had not found the sample_stats.lp
.