Compute reduced summary statistics

CorbinFoucart · April 29, 2020, 10:48am

Normally I compute summaries of traces with pm.summary(trace) but this can be quite slow for large problems. Is there an argument to specify that I only want mean and standard deviation?

https://docs.pymc.io/api/stats.html?highlight=pymc3%20summary#pymc3.stats.summary specifies that I can pass a func_dict of functions, but I don’t see a pm.stats.mean function, for example. Does PyMC3 use np.mean under the hood?

nkaimcaudle · April 29, 2020, 11:56pm

You can reduce the number of variables that pm.summary() calculates all the stats for by using the var_names parameter. As for only computing a few of the stats I don’t think that is possible. To check model convergence you should calculate and check rhat and ess at least once after sampling.

Alternatively, you can write a simple function that calculates just the mean and std using numpy and returns a DataFrame.
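A minimal sketch of that idea, assuming the draws for each scalar variable are available as 1-D numpy arrays. The function name mean_std_summary and the samples dict are hypothetical and not part of PyMC3 or ArviZ; the draws here are synthetic stand-ins for a real trace.

```python
import numpy as np
import pandas as pd


def mean_std_summary(samples):
    """Return a DataFrame with the mean and std of each variable.

    `samples` is a hypothetical dict mapping variable names to 1-D arrays
    of posterior draws (all chains concatenated, scalar variables only).
    """
    rows = {
        name: {"mean": np.mean(draws), "sd": np.std(draws)}
        for name, draws in samples.items()
    }
    # One row per variable, one column per statistic
    return pd.DataFrame.from_dict(rows, orient="index")


# Synthetic draws standing in for a real trace
rng = np.random.default_rng(0)
samples = {
    "mu": rng.normal(1.0, 0.5, size=4000),
    "sigma": rng.normal(2.0, 0.1, size=4000),
}
summary = mean_std_summary(samples)
print(summary)
```

With a PyMC3 trace, something like {name: trace[name] for name in trace.varnames} could probably supply the dict for scalar variables. Note that this skips rhat and ess entirely, so convergence still has to be checked separately.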

OriolAbril · April 30, 2020, 8:36am

It looks like you want something like the second example in the documentation, that is, passing a custom stats_funcs dictionary and using extend=False. Example code below:

import arviz as az
import numpy as np

stats_funcs = {"mean": np.mean, "std": np.std}
az.summary(idata, stats_funcs=stats_funcs, extend=False, kind="stats")

I’d also like to point out a couple of extra details to make the answer more complete.

The first comment is that PyMC3 now uses ArviZ for statistics and plotting, which is why I have used idata instead of trace. ArviZ uses its own data structure (aiming to be backend agnostic, so that it can be used not only by PyMC3 users but also by PyStan, Pyro…). If performance is critical, you should convert your trace to InferenceData once using az.from_pymc3; otherwise, the conversion is done internally every time you call an ArviZ function like az.summary(trace,...).

The second comment is that ArviZ uses xarray.apply_ufunc with a wrapper around the functions in stats_funcs. This wrapper allows users to use a wider range of functions (otherwise it would be limited to functions following numpy ufuncs conventions) but for functions which are already pure ufuncs, it may not be as efficient as doing something like idata.posterior.mean(dim=("chain", "draw")). By default, az.summary uses this second approach, but to customize the output, the wrapper must be used. This being said, unless the model has a relatively large number of variables, this should not affect performance significantly.
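To illustrate the direct reduction mentioned above, here is a small sketch that builds a synthetic xarray Dataset with chain and draw dimensions, standing in for idata.posterior. The variable names mu and beta and their shapes are made up for the example.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for idata.posterior: 4 chains, 500 draws each
rng = np.random.default_rng(1)
posterior = xr.Dataset(
    {
        "mu": (("chain", "draw"), rng.normal(0.0, 1.0, size=(4, 500))),
        "beta": (
            ("chain", "draw", "beta_dim"),
            rng.normal(2.0, 0.3, size=(4, 500, 3)),
        ),
    }
)

# Reduce over the sampling dimensions only; other dimensions are kept
sample_dims = ("chain", "draw")
post_mean = posterior.mean(dim=sample_dims)
post_std = posterior.std(dim=sample_dims)

print(post_mean["beta"].values)
print(post_std["beta"].values)
```

The result is again a Dataset with one entry per variable, so a vector-valued variable like beta keeps its own dimension instead of being flattened into rows as in the summary table.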

And finally, I have used the kind argument, present only in the ArviZ development version, to avoid computing the diagnostics (ess, rhat, mcse). The behaviour without it is somewhat buggy: even though the diagnostic values are not returned when extend=False, they are still computed. Unlike the diagnostics, the default stats functions are not executed if extend=False.

CorbinFoucart · April 30, 2020, 12:56pm

Thank you – that’s incredibly helpful. Marking as solution.
