More warnings needed for extra dimensions in observed

When the observed variable has an extra singleton dimension, incorrect results can be produced, leading to subtle bugs that go unnoticed. There should be some form of warning or error when the observed variable does not have the same number of dimensions as data sampled from the model would have.

This minimal example illustrates the problem, observing a single draw from a Gaussian likelihood with a Gaussian prior:

import numpy as np
import pymc3 as pm

n_iter = 1000  # number of posterior draws
N = 10         # number of observations

np.random.seed(1234)

X = np.random.randn(N, 1)  # note the extra singleton dimension: shape (N, 1)

with pm.Model() as model_bugged:
    mu = pm.Normal('mu', 0.0, sd=3.0, shape=(N,))
    # observed has shape (N, 1), but a draw from this model has shape (N,)
    pm.Normal('X', mu=mu, tau=2.0, observed=X)

with model_bugged:
    tr_bugged = pm.sampling.sample(draws=n_iter)

with pm.Model() as model_ok:
    mu = pm.Normal('mu', 0.0, sd=3.0, shape=(N,))
    # squeezing out the singleton dimension gives the expected shape (N,)
    pm.Normal('X', mu=mu, tau=2.0, observed=X.squeeze(1))

with model_ok:
    tr_ok = pm.sampling.sample(draws=n_iter)

# conjugate posterior for each mu_i: precision = prior tau (1/3^2) + likelihood tau,
# mean = likelihood tau * x_i / posterior precision
exact_tau = (1.0 / 9.0) + 2.0
exact_sd = np.sqrt(1.0 / exact_tau)
exact_mean = X.squeeze(1) * (2.0 / exact_tau)

print('exact answer')
print(exact_mean)
print(exact_sd * np.ones(N))

print('ok')
print(np.mean(tr_ok['mu'], axis=0))
print(np.std(tr_ok['mu'], axis=0))

print('bugged')
print(np.mean(tr_bugged['mu'], axis=0))
print(np.std(tr_bugged['mu'], axis=0))

which for me produces:

exact answer
[ 0.44662279 -1.12829276  1.35730134 -0.29619653 -0.68266301  0.84047015
  0.81434692 -0.60302227  0.01487025 -2.1246489 ]
[ 0.6882472  0.6882472  0.6882472  0.6882472  0.6882472  0.6882472
  0.6882472  0.6882472  0.6882472  0.6882472]
ok
[ 0.48265565 -1.11080441  1.37796336 -0.28009225 -0.70884742  0.8216728
  0.82825643 -0.60005764  0.00556801 -2.11422877]
[ 0.66225099  0.66790124  0.65845662  0.69003781  0.6511985   0.6877904
  0.68203232  0.69729027  0.69290222  0.67772646]
bugged
[-0.13922189 -0.14672933 -0.13457985 -0.15271475 -0.14627176 -0.13330782
 -0.14730323 -0.13669795 -0.14992282 -0.14559449]
[ 0.22075163  0.22182957  0.2211531   0.2246049   0.22193467  0.21384687
  0.23333297  0.2288796   0.23979859  0.21362418]

I cannot think of an example where this type of behavior would be desirable.

One could even argue that the correct likelihood for the (N, 1) observation is 0, since generated data always has shape (N,).

That is not quite true: you can have k realizations of the RV, giving you an (N, k) observation. But I agree that some warning is needed; as it stands, this invites some undesired corner cases. For example:

N = 10
X = np.random.randn(N, 1)

with pm.Model() as m1:
    mu1 = pm.Normal('mu1', 0.0, sd=3.0, shape=2)
    # mu1 has shape (2,) while X has shape (N, 1); no error or warning is raised
    pm.Normal('X1', mu=mu1, tau=2.0, observed=X)

there is no warning from the above, and the result is unintuitive (it runs as if the two elements of mu1 both have the identical observation X)…
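To see why this runs silently, here is the shape arithmetic in plain numpy (this illustrates the broadcasting only, not PyMC3 internals):

import numpy as np

N = 10
X = np.random.randn(N, 1)  # observed, shape (N, 1)
mu1 = np.zeros(2)          # parameter, shape (2,)

# (N, 1) broadcasts against (2,) to (N, 2): the single column of X is
# silently reused as the observation for both elements of mu1.
print((X - mu1).shape)  # (10, 2)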

you can have k realization of the RV giving you (N, k) observation

True, but I think the more common convention in Python would be to have a (k, N) matrix for k observations of the same variable.

The most robust solution would be to have an optional size= argument. So,

with pm.Model() as model:
    mu = pm.Normal('mu', 0.0, sd=3.0, shape=(N,))
    pm.Normal('X', mu=mu, tau=2.0, observed=X, size=(k,))

would require the shape of observed to be size + shape of a single observation = (k, N), and raise an error if it doesn’t match. This could also allow singleton observations with size=(), or a matrix of iid observations with size=(m, k) that expects observations of shape (m, k, N). It seems this logic is already used in random() via distribution.reshape_sampled(). This would also ensure that the shape of the observed data matches the shape of generated data; a sketch of the check follows.
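Here is a minimal sketch of the shape check such a size= argument might perform; check_observed_shape is a hypothetical helper, not part of the PyMC3 API:

import numpy as np

def check_observed_shape(observed, dist_shape, size=()):
    # hypothetical check: observed.shape must equal size + dist_shape
    expected = tuple(size) + tuple(dist_shape)
    actual = np.shape(observed)
    if actual != expected:
        raise ValueError(
            'observed has shape %s, expected size + distribution shape = %s'
            % (actual, expected)
        )

k, N = 3, 10
check_observed_shape(np.zeros((k, N)), dist_shape=(N,), size=(k,))  # passes

try:
    check_observed_shape(np.zeros((N, 1)), dist_shape=(N,))  # (N, 1) vs (N,)
except ValueError as e:
    print(e)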

A smaller, but less robust, fix would be to just raise a warning when the extra dimensions are singletons.
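The trigger for such a warning could look roughly like this (a sketch only; warn_on_extra_singletons is an illustrative name, not an existing function):

import warnings
import numpy as np

def warn_on_extra_singletons(observed, dist_shape):
    # warn if observed matches dist_shape only after dropping singleton axes
    actual = np.shape(observed)
    squeezed_actual = tuple(s for s in actual if s != 1)
    squeezed_expected = tuple(s for s in dist_shape if s != 1)
    if actual != tuple(dist_shape) and squeezed_actual == squeezed_expected:
        warnings.warn(
            'observed has shape %s but the model generates shape %s; '
            'the extra singleton dimensions will be broadcast'
            % (actual, tuple(dist_shape))
        )

warn_on_extra_singletons(np.zeros((10, 1)), dist_shape=(10,))  # warns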

I think it would now be quite difficult to introduce a new kwarg without breaking people’s code. The shape issue is one of the recurrent themes, and there are some confusing corner cases here and there. I think in the next version we should definitely be more rigorous about this.

I don’t think it would break people’s code if the kwarg is optional, but it is a big feature, so it might be better suited to PyMC4.

Maybe we should consider warnings on extra singleton dimensions in PyMC3?

Maybe something like:

Warning: There is a mismatch between the parameter shapes and the observed array. The parameter shape is broadcast against the observed array, but you should double-check the input.

in the observed RV?

cc @fonnesbeck @aseyboldt @twiecki

Yeah, that is a good message, but we need to determine the exact condition to trigger it.