Shape question on sample_prior_predictive()

saiwing · February 23, 2022, 6:36pm

Hi, I think I have a shape bug in my code but I can’t figure out what’s wrong. Basically, when I run sample_prior_predictive() I am getting n times the number of expected number of samples where n is the number of groups.

So for example in this MWE, I expected to get 4 (number of groups) * 500 (number of samples requested) = 2000 samples. But I am getting 4 copies for each sample => 8000 total samples.

import pymc3 as pm
import arviz as az

import numpy as np
import pandas as pd

coords = {
    'region': ['north', 'east', 'south', 'west'],
    'obs_id': np.arange(3 * 4)
}

np.random.seed(0)
df = pd.DataFrame({
    'region': np.repeat(coords['region'], 3),
    'value': np.array([np.random.normal(loc=x, scale=[1], size=(3)) for x in [10, 11, 12, 15]]).reshape(-1)
})
region_labels, region_levels = pd.factorize(df['region'])

with pm.Model(coords=coords) as unpooled_model:
    region_idx = pm.Data('region_idx', region_labels, dims='obs_id')
    
    region_mu = pm.Exponential('region_mu', 0.001, dims='region')
    pooled_sigma = pm.Exponential('sigma', 1 / 100)
    item_mu = region_mu[region_idx]
    
    y = pm.Normal('y', item_mu, sigma=pooled_sigma, observed=df['value'], dims='obs_id')
    prior_checks = pm.sample_prior_predictive(samples=500, random_seed=0)
    idata_prior = az.from_pymc3(prior=prior_checks)

df_prior_samples = idata_prior.prior.to_dataframe()
print(df_prior_samples.shape)
df_prior_samples

However, if I change region_mu = pm.Exponential('region_mu', 0.001, dims='region') to pm.Normal then I’d get 2000 samples as expected. I played around different distributions and found that only Normal works the way I expected. For example, Gamma, ChiSquared, Uniform, etc. all give me 8000 samples (so this is not an issue of number of parameters of the distributions). Can someone help me understand the reason how this works? Thank you!

OriolAbril · February 24, 2022, 12:17am

Can you share the output of idata_prior.prior without converting to dataframe in both exponential and normal case?

I have not idea for now what is going on, but I prefer xarray view given how the data is 3d and doing this we’ll also discard any strange conversion issue

saiwing · February 24, 2022, 12:49am

Thanks!

For exponential:

Dimensions:
chain: 1draw: 500region: 4region_mu_log___dim_0: 4
Coordinates:
chain
(chain)
int64
0
draw
(draw)
int64
0 1 2 3 4 5 ... 495 496 497 498 499
region
(region)
<U5
'north' 'east' 'south' 'west'
region_mu_log___dim_0
(region_mu_log___dim_0)
int64
0 1 2 3
Data variables:
sigma
(chain, draw)
float64
79.59 125.6 92.32 ... 28.56 70.49
sigma_log__
(chain, draw)
float64
4.377 4.833 4.525 ... 3.352 4.256
region_mu
(chain, draw, region)
float64
371.6 466.9 ... 212.7 1.085e+03
region_mu_log__
(chain, draw, region_mu_log___dim_0)
float64
5.918 6.146 6.613 ... 5.36 6.99
Attributes:
created_at :
2022-02-24T00:47:35.577995
arviz_version :
0.11.2
inference_library :
pymc3
inference_library_version :
3.11.4

For Normal:

Dimensions:
chain: 1draw: 500region: 4
Coordinates:
chain
(chain)
int64
0
draw
(draw)
int64
0 1 2 3 4 5 ... 495 496 497 498 499
region
(region)
<U5
'north' 'east' 'south' 'west'
Data variables:
sigma
(chain, draw)
float64
51.17 52.92 10.9 ... 5.695 114.2
sigma_log__
(chain, draw)
float64
3.935 3.969 2.389 ... 1.74 4.738
region_mu
(chain, draw, region)
float64
18.82 12.0 14.89 ... 4.29 3.445
Attributes:
created_at :
2022-02-24T00:47:46.295879
arviz_version :
0.11.2
inference_library :
pymc3
inference_library_version :
3.11.4

Looks like there is an extra dimension in the exponential case.

OriolAbril · February 24, 2022, 4:02am

In both cases region_mu is a variable with dimensions chain, draw, region which corresponds to a shape of 1, 500, 4. The difference is the presence of the extra variable region_mu_log__ for the exponential, which makes the dataframe conversion to generate a different shape.

You’ll see how idata_prior.prior["region_mu"] generates xarray DataArrays with the same shape.

saiwing · February 25, 2022, 7:41pm

Thanks! So if I want to convert it to Pandas, should I do something like this?

df = idata_prior.prior["region_mu"].to_series().reset_index()

This works but feels kind of hacky.

cluhmann · February 25, 2022, 8:26pm

I was a bit hesitant to start working with posteriors in the native az.InferenceData object (because I’m lazy). But if you are already familiar with pandas, it’s really not that different.

saiwing · February 25, 2022, 10:42pm

That makes sense, thanks. I will try working with az.InferenceData then.

(different question) But I still don’t understand how sampling works. Specifically, is there a way to sample both the parameters and the observed variables from prior?

If I run this:

with pm.Model(coords=coords) as m:
    region_idx = pm.Data('region_idx', region_labels, dims='obs_id')
    region_mu = pm.Exponential('region_mu', 0.001, dims='region')

    pooled_sigma = pm.Normal('sigma', 1 / 100)
    item_mu = region_mu[region_idx]
    
    y = pm.Normal('y', item_mu, sigma=pooled_sigma, dims='obs_id')
    prior_checks = pm.sample_prior_predictive(samples=500, random_seed=0)
    idata_prior = az.from_pymc3(prior=prior_checks)

idata_prior.prior would give me region_mu samples indexed by chain, draw, region (which is what I expected). But the y samples are only indexed by chain, draw, obs_id but not region. How do I get y samples indexed by region? I wanted to draw samples of the priors and their downstream variables to get some intuition of the resulted distributions. Thanks!

Topic		Replies	Views
Sample_prior_predictive bug, gives wrong shape for dependent variable? Questions	2	442	July 29, 2019
Shape issues with sample_posterior_predictive in PyMc5 Questions	3	738	July 1, 2023
Sample_prior_predictive failing with shape argument Questions	7	475	May 10, 2021
Reshape error inside sample_prior_predictive Questions	2	579	August 13, 2019
Dirichlet and Categorical Questions	2	596	October 26, 2020

Shape question on sample_prior_predictive()

Related topics