Fitting multiple measurements, shape issues in sample_ppc


#1

Dear pymc developers and community,

I’ve run into an issue with the “sample_ppc” function when using more than one variable of a multivariate data set as outcome.
Maybe my intention of giving “shape” to the data likelihood is wrong (see below). I would appreciate your opinion on that.
However, the error also occurs when using only one observable. I attach example code to help you track the behavior.

Thank you for your help!

Falk

short error description:
When I use the ‘shape’ argument on the data likelihood, sampling fails with the following error:

ValueError: cannot reshape array of size XXX into shape (1,1)

Below a code example.
The shape is no issue for summary functions.
I would expect that “sample_ppc” can handle the shape of the posterior, as do “summary” and “traceplot”.
The error is gone when not using the shape argument in the likelihood; however, that limits the capability of the code to sample multiple observables at once.

Why would I want to use a shape on the data likelihood?
I have acquired a multivariate, hierarchical data set and want to model the effect of some covariates on many (N) independent observables. Thus, I thought it would be comfortable to be able to hit the “inference button” only once, instead of N times (saving a huge lot of computation time).
My attempt uses the number of observables N consistently as first shape dimension for all the model parameters. That apparently worked for all purposes, except “sample_ppc”.

My question: is that a valid approach, i.e. does the sampler really estimate the traces independently for all the observables (even in a hierarchical setting)?

code:
Here a realistic model example with fake data. The hierarchy is not necessary to produce the error, but I wanted to stay close to the actual model.

import numpy as np
import pandas as pd
import pymc3 as pm
import scipy.stats as st
import matplotlib.pyplot as plt


### generate test data set
observations_per_subject = 100
# subject: [mean, sd, age, group]
group_parameters = {  's1': [0.7, 0.10, 1.0, 0] \
                    , 's2': [0.8, 0.15, 1.2, 1] \
                    , 's3': [0.9, 0.20, 1.4, 1] \
                    , 's4': [1.0, 0.25, 1.6, 0] \
                    , 's5': [1.1, 0.20, 1.8, 0] \
                    , 's6': [1.2, 0.25, 2.0, 1] \
                    }
                    
data = [pd.DataFrame({'subject': [grp]*observations_per_subject \
                , 'age': [vals[2]]*observations_per_subject \
                , 'measurement1': np.random.normal(vals[0], vals[1], observations_per_subject) \
                , 'measurement2': np.random.normal(0.2+vals[0], 0.9*vals[1], observations_per_subject) \
                , 'group': [vals[3]]*observations_per_subject \
                })
        for grp, vals in group_parameters.items()]

data = pd.concat(data).reset_index(inplace = False, drop = True)
data['group'] = pd.Categorical(data['group'], ordered = False)
print (data)


### prepare hierarchy
groups = list(data['group'].cat.categories)
# the number of groups
n_groups = len(groups)
# a list of the group indices
idx_group = data['group'].cat.codes.values




### heuristics
nu0, mu0, sd0 = st.t.fit(data.loc[:,['measurement1', 'measurement2']].values.ravel()) 


### model construction
def BuildModel(observables, shape_restricted = False):
    n_observables = len(observables)

    with pm.Model() as model:
        intercept_population = pm.Normal( 'intercept_population' \
                            , mu = mu0 \
                            , sd = sd0 \
                            , shape = (n_observables, 1) \
                            )

        intercept_group = pm.Normal( 'intercept_group' \
                            , mu = intercept_population \
                            , sd = sd0 \
                            , shape = (n_observables, n_groups) \
                            )

        slope_age = pm.Normal( 'slope_age' \
                            , mu = 0 \
                            , sd = sd0 \
                            , shape = (n_observables, 1) \
                            )

        estimator = intercept_group[:,idx_group] + slope_age * data['age'].values

        residual = pm.HalfCauchy('residual', 5, shape = (n_observables, 1) )
        dof = pm.Gamma('dof', 2, 0.1, shape = (n_observables, 1) )

        likelihood_args = dict(   nu = dof \
                                , mu = estimator \
                                , sd = residual \
                                , observed = data.loc[:, observables].values.T \
                                )
        if shape_restricted:
            likelihood = pm.StudentT( 'likelihood' \
                            , **likelihood_args \
                            )
        else:
            ### HERE the ERROR gets introduced.
            likelihood = pm.StudentT( 'likelihood' \
                            , shape = (n_observables, 1) \
                            , **likelihood_args \
                            )


    return model


### two models
model_limited = BuildModel(observables = ['measurement1'], shape_restricted = True)
model_error = BuildModel(observables = ['measurement1'], shape_restricted = False)
# model_desired = BuildModel(observables = ['measurement1', 'measurement2'], shape_restricted = False)

### sample
with model_limited:
    trace_limited = pm.sample(progressbar = False)
with model_error:
    trace_error = pm.sample(progressbar = False)
# with model_desired:
#     trace_desired = pm.sample(progressbar = False)


### inspect
# pm.traceplot(trace)
# plt.show()

### ppc
ppc_samples = pm.sample_ppc(  trace_limited \
                            , model = model_limited \
                            , samples = 500 \
                            , progressbar = False \
                            )

ppc_samples = [ppc.mean() for ppc in ppc_samples['likelihood']] 
print ('actual mean:', mu0, 'ppc mean:', np.mean(ppc_samples))


### produce the error:
ppc_samples = pm.sample_ppc(  trace_error \
                            , model = model_error \
                            , samples = 500 \
                            , progressbar = False \
                            )

Here the error trace:

Traceback (most recent call last):
  File "", line 125, in <module>
    , progressbar = False \
  File "/usr/lib/python3.6/site-packages/pymc3/sampling.py", line 1068, in sample_ppc
    size=size))
  File "/usr/lib/python3.6/site-packages/pymc3/distributions/continuous.py", line 1029, in random
    size=size)
  File "/usr/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 494, in generate_samples
    return reshape_sampled(samples, size, dist_shape)
  File "/usr/lib/python3.6/site-packages/pymc3/distributions/distribution.py", line 410, in reshape_sampled
    return np.reshape(sampled, repeat_shape + dist_shape)
  File "/usr/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 257, in reshape
    return _wrapfunc(a, 'reshape', newshape, order=order)
  File "/usr/lib/python3.6/site-packages/numpy/core/fromnumeric.py", line 52, in _wrapfunc
    return getattr(obj, method)(*args, **kwds)
ValueError: cannot reshape array of size 600 into shape (1,1)

#2

I am fairly sure this is now fixed on master - I did a few PRs for exactly this kind of issue (eg https://github.com/pymc-devs/pymc3/pull/3053).


#3

Thanks Junpeng, this disappeared when updating.

Using pip, I reinstalled from git, which solved the issue. Sorry, I had briefly checked that I was up to date. Having installed a while ago, that was apparently with the pip repositories and not with git.

Also, regarding my other question, I discovered your presentation and the part on multivariate statistics and will try to work through it. There are lots of tricks in it that will certainly apply to my data.

Best,

Falk


#4

Great! Hope you find the material helpful :wink: