Shape question on sample_prior_predictive()

Hi, I think I have a shape bug in my code but I can’t figure out what’s wrong. Basically, when I run sample_prior_predictive() I am getting n times the number of expected number of samples where n is the number of groups.

So for example in this MWE, I expected to get 4 (number of groups) * 500 (number of samples requested) = 2000 samples. But I am getting 4 copies for each sample => 8000 total samples.

import pymc3 as pm
import arviz as az

import numpy as np
import pandas as pd

coords = {
    'region': ['north', 'east', 'south', 'west'],
    'obs_id': np.arange(3 * 4)
}

np.random.seed(0)
df = pd.DataFrame({
    'region': np.repeat(coords['region'], 3),
    'value': np.array([np.random.normal(loc=x, scale=[1], size=(3)) for x in [10, 11, 12, 15]]).reshape(-1)
})
region_labels, region_levels = pd.factorize(df['region'])

with pm.Model(coords=coords) as unpooled_model:
    region_idx = pm.Data('region_idx', region_labels, dims='obs_id')
    
    region_mu = pm.Exponential('region_mu', 0.001, dims='region')
    pooled_sigma = pm.Exponential('sigma', 1 / 100)
    item_mu = region_mu[region_idx]
    
    y = pm.Normal('y', item_mu, sigma=pooled_sigma, observed=df['value'], dims='obs_id')
    prior_checks = pm.sample_prior_predictive(samples=500, random_seed=0)
    idata_prior = az.from_pymc3(prior=prior_checks)

df_prior_samples = idata_prior.prior.to_dataframe()
print(df_prior_samples.shape)
df_prior_samples

However, if I change region_mu = pm.Exponential('region_mu', 0.001, dims='region') to pm.Normal then I’d get 2000 samples as expected. I played around different distributions and found that only Normal works the way I expected. For example, Gamma, ChiSquared, Uniform, etc. all give me 8000 samples (so this is not an issue of number of parameters of the distributions). Can someone help me understand the reason how this works? Thank you!

Can you share the output of idata_prior.prior without converting to dataframe in both exponential and normal case?

I have not idea for now what is going on, but I prefer xarray view given how the data is 3d and doing this we’ll also discard any strange conversion issue

Thanks!

For exponential:

Dimensions:
chain: 1draw: 500region: 4region_mu_log___dim_0: 4
Coordinates:
chain
(chain)
int64
0
draw
(draw)
int64
0 1 2 3 4 5 ... 495 496 497 498 499
region
(region)
<U5
'north' 'east' 'south' 'west'
region_mu_log___dim_0
(region_mu_log___dim_0)
int64
0 1 2 3
Data variables:
sigma
(chain, draw)
float64
79.59 125.6 92.32 ... 28.56 70.49
sigma_log__
(chain, draw)
float64
4.377 4.833 4.525 ... 3.352 4.256
region_mu
(chain, draw, region)
float64
371.6 466.9 ... 212.7 1.085e+03
region_mu_log__
(chain, draw, region_mu_log___dim_0)
float64
5.918 6.146 6.613 ... 5.36 6.99
Attributes:
created_at :
2022-02-24T00:47:35.577995
arviz_version :
0.11.2
inference_library :
pymc3
inference_library_version :
3.11.4

For Normal:

Dimensions:
chain: 1draw: 500region: 4
Coordinates:
chain
(chain)
int64
0
draw
(draw)
int64
0 1 2 3 4 5 ... 495 496 497 498 499
region
(region)
<U5
'north' 'east' 'south' 'west'
Data variables:
sigma
(chain, draw)
float64
51.17 52.92 10.9 ... 5.695 114.2
sigma_log__
(chain, draw)
float64
3.935 3.969 2.389 ... 1.74 4.738
region_mu
(chain, draw, region)
float64
18.82 12.0 14.89 ... 4.29 3.445
Attributes:
created_at :
2022-02-24T00:47:46.295879
arviz_version :
0.11.2
inference_library :
pymc3
inference_library_version :
3.11.4

Looks like there is an extra dimension in the exponential case.

In both cases region_mu is a variable with dimensions chain, draw, region which corresponds to a shape of 1, 500, 4. The difference is the presence of the extra variable region_mu_log__ for the exponential, which makes the dataframe conversion to generate a different shape.

You’ll see how idata_prior.prior["region_mu"] generates xarray DataArrays with the same shape.

Thanks! So if I want to convert it to Pandas, should I do something like this?

df = idata_prior.prior["region_mu"].to_series().reset_index()

This works but feels kind of hacky.

I was a bit hesitant to start working with posteriors in the native az.InferenceData object (because I’m lazy). But if you are already familiar with pandas, it’s really not that different.

That makes sense, thanks. I will try working with az.InferenceData then.

(different question) But I still don’t understand how sampling works. Specifically, is there a way to sample both the parameters and the observed variables from prior?

If I run this:

with pm.Model(coords=coords) as m:
    region_idx = pm.Data('region_idx', region_labels, dims='obs_id')
    region_mu = pm.Exponential('region_mu', 0.001, dims='region')

    pooled_sigma = pm.Normal('sigma', 1 / 100)
    item_mu = region_mu[region_idx]
    
    y = pm.Normal('y', item_mu, sigma=pooled_sigma, dims='obs_id')
    prior_checks = pm.sample_prior_predictive(samples=500, random_seed=0)
    idata_prior = az.from_pymc3(prior=prior_checks)

idata_prior.prior would give me region_mu samples indexed by chain, draw, region (which is what I expected). But the y samples are only indexed by chain, draw, obs_id but not region. How do I get y samples indexed by region? I wanted to draw samples of the priors and their downstream variables to get some intuition of the resulted distributions. Thanks!