How to display a single deterministic output of a Bayesian regression model using ArviZ

Let’s say I have a dataset of X and Y with 100 observations from each. I developed a Bayesian simple linear regression model from X to Y, using PyMC, like this:

Y = Alpha + Beta * X + Epsilon

As you appreciate, there will be some stochastic prior distributions for the model parameters and a deterministic variable like mu=Deterministic(‘mu’, alpha + beta*X) to be used as the mean value of the likelihood normal distribution.

After running the model, when I use the ‘summary’ function of ArviZ, it will return the posterior statistics of alpha, beta, epsilon, and from mu[0] to mu[100].

My question is how I can return the result of a deterministic variable (mu) for only a single observation (i.e., for mu[75])? Specifying the variable name using the var_names method of the summary function does not limit the outputs of mu, but still expresses mu for every single observation. Is there any other method for the summary function of ArviZ to pick the result of a deterministic variable for a single observation?

Besides that, the same question applies to the visual outputs. When I use the plot_posterior function of ArviZ, it returns HDI graphs for every single observation (like from m[0] to mu[100]). How can I set a limit for the number of those outputs or pick only a single one?

Could you try to select the variable before plotting/summarizing?

az.summary(idata.posterior.isel(mu_dim_0=75)) or whatever the name of the dim is

1 Like

Yeah, I tried it, but it still plots/summarizes for all dimensions.

Most (all?) arviz functions take a coords argument that allows you to filter what you want computed or displayed. Here’s an example:

import pymc as pm
import arviz as az

coords = {
    'obs_idx': range(100),
    'feature': [f'feature_{i}' for i in range(0, 10)]
}

with pm.Model(coords=coords) as model:
    
    # X is only here to generate sample data for the model, it will be replaced
    # with pm.do later
    X = pm.Normal('X', 0, 10, dims=['obs_idx', 'feature'])
    
    alpha = pm.Normal('alpha', 0, 10)
    beta = pm.Normal('beta', 0, 3, dims=['feature'])
    
    mu = pm.Deterministic('mu', alpha + X @ beta, dims=['obs_idx'])
    sigma = pm.Exponential('sigma', 1)
    
    y_hat = pm.Normal('y_hat', mu=mu, sigma=sigma, dims=['obs_idx'])
    
    prior = pm.sample_prior_predictive(draws=1)
    
# Grab generated data to use as model inputs
X_observed = prior.prior.X.sel(chain=0, draw=0).values
y_observed = prior.prior.y_hat.sel(chain=0, draw=0).values


# Sample parameters conditional on the generated data
with pm.observe(pm.do(model, {'X': X_observed}), {'y_hat': y_observed}):
    idata = pm.sample()

With idata in hand, we can look at the summary for only certain obs_idx of interest:

az.summary(idata, var_names=['mu'], coords={'obs_idx': [3, 10, 25]})

Output:

           mean     sd   hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
mu[3]   -37.468  0.284  -37.982  -36.920      0.004    0.005    5705.0    3299.0    1.0
mu[10]  -52.711  0.182  -53.075  -52.392      0.002    0.003    6302.0    3000.0    1.0
mu[25]  149.517  0.247  149.062  149.963      0.003    0.004    6097.0    3073.0    1.0
2 Likes

Today I learned

1 Like

It wouldn’t be clearer than that. Thank you!

I have one more question now. What if I need to add a new dimension to the model? A dimension that consists of two-categorical strings, and as long as the first dimension like that: {‘obs_idx2’: [‘male, female, female, male, …, male’]}

I need to do that because when I press the posterior plots, I want to show the selected variables with their specific category names, not with coordinate numbers like 0 and 1. Because that would be more chic. So, is it possible to make this change?

Yes, but you have to hack it a bit because multi-indices aren’t supported by pymc by default (because they aren’t supported by arviz by default, because nobody likes indices anymore for reasons beyond my ken).

You will need to fit your model with a dummy numerical index, then swap it out using xr.Coordinate.from_pandas_multiindex.

The following code starts from where the previous code left off, so you have an idata with a dummy numeric index called obs_idx. Make a new Coordinate object:

import pandas as pd
import xarray as xr

# This would actually come from your data of course, but I have to set it up
user_id = np.arange(10)
date_idx = pd.date_range('2010-01-01', periods=10, freq='YS')

# You could get a multi-index by doing `df.set_index(['user', 'gender'])` or whatever, I have to make one the hard way
# Note that the indices have names, that's important.
index = pd.MultiIndex.from_product((user_id, date_idx), names=('user', 'date'))

# Once you have the index (however you get it), make a new coordinate
# The name has to be the same as the name of your dummy numeric index in the idata
multi_index_coord = xr.Coordinates.from_pandas_multiindex(index, dim='obs_idx')

# Inject the new multi-index into the idata
idata = idata.assign_coords(multi_index_coord)

Now you can select using the same coords argument in arviz. For example, show the estimated mean for user 3 on all dates:

az.summary(idata, var_names=['mu'], coords={'user': [3]})

                                      mean     sd   hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
mu[2010-01-01T00:00:00.000000000]  126.750  0.031  126.689  126.806        0.0    0.000    6156.0    3263.0    1.0
mu[2011-01-01T00:00:00.000000000]   24.966  0.030   24.910   25.020        0.0    0.000    5650.0    3201.0    1.0
mu[2012-01-01T00:00:00.000000000]  -17.990  0.025  -18.037  -17.944        0.0    0.000    6386.0    3047.0    1.0
mu[2013-01-01T00:00:00.000000000]  -30.327  0.022  -30.367  -30.284        0.0    0.000    6269.0    3124.0    1.0
mu[2014-01-01T00:00:00.000000000]   18.623  0.020   18.587   18.661        0.0    0.000    6691.0    3252.0    1.0
mu[2015-01-01T00:00:00.000000000]   -4.291  0.021   -4.331   -4.254        0.0    0.000    6097.0    3333.0    1.0
mu[2016-01-01T00:00:00.000000000]   73.754  0.021   73.711   73.791        0.0    0.000    6189.0    3104.0    1.0
mu[2017-01-01T00:00:00.000000000]  154.283  0.027  154.233  154.332        0.0    0.000    5828.0    2984.0    1.0
mu[2018-01-01T00:00:00.000000000]  190.142  0.034  190.079  190.203        0.0    0.001    6159.0    2867.0    1.0
mu[2019-01-01T00:00:00.000000000]  -54.211  0.022  -54.250  -54.171        0.0    0.000    6316.0    3028.0    1.0
1 Like

The solution works. I created the MultiIndex using hierarchical indexing, which was more convenient in my case, and it also worked.