Dear pymc developers and community,
I’ve run into an issue with the “sample_ppc” function when using more than one variable of a multivariate data set as outcome.
Maybe my intention of giving “shape” to the data likelihood is wrong (see below). I would appreciate your opinion on that.
However, the error also occurs when using only one observable. I attach example code to help you track the behavior.
Thank you for your help!
Falk
short error description:
When I use the ‘shape’ argument on the data likelihood, sampling fails with the following error:
ValueError: cannot reshape array of size XXX into shape (1,1)
Below a code example.
The shape is no issue for summary functions.
I would expect that “sample_ppc” can handle the shape of the posterior, as do “summary” and “traceplot”.
The error is gone when not using the shape argument in the likelihood; however, that limits the capability of the code to sample multiple observables at once.
Why would I want to use a shape on the data likelihood?
I have acquired a multivariate, hierarchical data set and want to model the effect of some covariates on many (N) independent observables. Thus, I thought it would be comfortable to be able to hit the “inference button” only once, instead of N times (saving a huge lot of computation time).
My attempt uses the number of observables N consistently as first shape dimension for all the model parameters. That apparently worked for all purposes, except “sample_ppc”.
My question: is that a valid approach, i.e. does the sampler really estimate the traces independently for all the observables (even in a hierarchical setting)?
code:
Here a realistic model example with fake data. The hierarchy is not necessary to produce the error, but I wanted to stay close to the actual model.
import numpy as np
import pandas as pd
import pymc3 as pm
import scipy.stats as st
import matplotlib.pyplot as plt
### generate test data set
observations_per_subject = 100
# subject: [mean, sd, age, group]
group_parameters = { 's1': [0.7, 0.10, 1.0, 0] \
, 's2': [0.8, 0.15, 1.2, 1] \
, 's3': [0.9, 0.20, 1.4, 1] \
, 's4': [1.0, 0.25, 1.6, 0] \
, 's5': [1.1, 0.20, 1.8, 0] \
, 's6': [1.2, 0.25, 2.0, 1] \
}
data = [pd.DataFrame({'subject': [grp]*observations_per_subject \
, 'age': [vals[2]]*observations_per_subject \
, 'measurement1': np.random.normal(vals[0], vals[1], observations_per_subject) \
, 'measurement2': np.random.normal(0.2+vals[0], 0.9*vals[1], observations_per_subject) \
, 'group': [vals[3]]*observations_per_subject \
})
for grp, vals in group_parameters.items()]
data = pd.concat(data).reset_index(inplace = False, drop = True)
data['group'] = pd.Categorical(data['group'], ordered = False)
print (data)
### prepare hierarchy
groups = list(data['group'].cat.categories)
# the number of groups
n_groups = len(groups)
# a list of the group indices
idx_group = data['group'].cat.codes.values
### heuristics
nu0, mu0, sd0 = st.t.fit(data.loc[:,['measurement1', 'measurement2']].values.ravel())
### model construction
def BuildModel(observables, shape_restricted = False):
n_observables = len(observables)
with pm.Model() as model:
intercept_population = pm.Normal( 'intercept_population' \
, mu = mu0 \
, sd = sd0 \
, shape = (n_observables, 1) \
)
intercept_group = pm.Normal( 'intercept_group' \
, mu = intercept_population \
, sd = sd0 \
, shape = (n_observables, n_groups) \
)
slope_age = pm.Normal( 'slope_age' \
, mu = 0 \
, sd = sd0 \
, shape = (n_observables, 1) \
)
estimator = intercept_group[:,idx_group] + slope_age * data['age'].values
residual = pm.HalfCauchy('residual', 5, shape = (n_observables, 1) )
dof = pm.Gamma('dof', 2, 0.1, shape = (n_observables, 1) )
likelihood_args = dict( nu = dof \
, mu = estimator \
, sd = residual \
, observed = data.loc[:, observables].values.T \
)
if shape_restricted:
likelihood = pm.StudentT( 'likelihood' \
, **likelihood_args \
)
else:
### HERE the ERROR gets introduced.
likelihood = pm.StudentT( 'likelihood' \
, shape = (n_observables, 1) \
, **likelihood_args \
)
return model
### two models
model_limited = BuildModel(observables = ['measurement1'], shape_restricted = True)
model_error = BuildModel(observables = ['measurement1'], shape_restricted = False)
# model_desired = BuildModel(observables = ['measurement1', 'measurement2'], shape_restricted = False)
### sample
with model_limited:
trace_limited = pm.sample(progressbar = False)
with model_error:
trace_error = pm.sample(progressbar = False)
# with model_desired:
# trace_desired = pm.sample(progressbar = False)
### inspect
# pm.traceplot(trace)
# plt.show()
### ppc
ppc_samples = pm.sample_ppc( trace_limited \
, model = model_limited \
, samples = 500 \
, progressbar = False \
)
ppc_samples = [ppc.mean() for ppc in ppc_samples['likelihood']]
print ('actual mean:', mu0, 'ppc mean:', np.mean(ppc_samples))
### produce the error:
ppc_samples = pm.sample_ppc( trace_error \
, model = model_error \
, samples = 500 \
, progressbar = False \
)
Here the error trace:
Traceback (most recent call last):
File "", line 125, in <module>
, progressbar = False \
File "/usr/lib/python3.6/site-packages/pymc3/sampling.py", line 1068, in sample_ppc
size=size))
File "/usr/lib/python3.6/site-packages/pymc3/distributions/continuous.py", line 1029, in random
size=size)
File "/usr/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 494, in generate_samples
return reshape_sampled(samples, size, dist_shape)
File "/usr/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 410, in reshape_sampled
return np.reshape(sampled, repeat_shape + dist_shape)
File "/usr/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 257, in reshape
return _wrapfunc(a, 'reshape', newshape, order=order)
File "/usr/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 52, in _wrapfunc
return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 600 into shape (1,1)