Unexpected Results when using WAIC to Compute Log Predictive Accuracy

Related to : Using WAIC to compute Log Predictive Accuracy

Can someone please explain this behavior.

I used NUTS to train linear regression model with a generated dataset that have 100 data instances and 100 dimensions.

If we compute WAIC for the train data it reduces when increasing the number of samples (samples drawn from the target distribution), as expected. However, when I calculate the WAIC for the test data, at first it gives a very low value and then it increased as shown in the figure below (x axis is time taken and y axis is WAIC).


It should be noted that, eventhough the WAIC is really high when number of samples are increased, the model converges to the true parameters. Even if we compute the MSE for test set, that also converge to the 0 with time as expected. I also checked the traceplot and observe that the variance of the model also reduces when drawing more samples. However, when I ran the same experiment reducing the number of dimensions to 10 then MCMC WAIC converges as expected.

Moreover, when I compute LOO for test data, the process fails with and error during the _psislw computation (this won’t happen when computing LOO with train data).

It is a great help if someone can help me to understand this behavior. Is that means we can’t use current implementation of WAIC and LOO to calculate the predictive accuracy?

I have attached the relevant script for your reference.

import pymc3 as pm
import numpy as np
from sklearn.model_selection import train_test_split
from theano import shared, tensor as tt
from sklearn.metrics import mean_squared_error

seed = 7
def operator(x, w, c, noise):
    Operator is used to generate artificial datasets
    :param w:
    :param c:
    :param noise:
    return x * w + c + noise

n = 120
d = 100

_n = n
n = int(n + n*0.2)
X = np.matrix([np.random.uniform(0, 1, d) for i in range(n)])
_w = np.matrix(np.random.uniform(-10, 10, d)).T
_c = float(np.random.uniform(-10, 10, 1))

y = operator(x=X, w=_w, c=_c, noise=np.matrix(np.random.normal(scale=0.5, size=n)).T)
y = np.array(y).flatten()

train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=int(_n*0.2), shuffle=True)

def adding_intecept(X):
    Adding a column for intercept
    :param X: features
    return np.c_[X, np.ones(X.shape[0])]

train_X, test_X = adding_intecept(train_X), adding_intecept(test_X)

shared_X = shared(train_X)
shared_y = shared(train_y)

with pm.Model() as model:
    # Define hyper-prior
    alpha = pm.Gamma("alpha", alpha=1e-2, beta=1e-4)

    # Define priors'
    w = pm.Normal("w", mu=0, sd=alpha, shape=train_X.shape[1])
    sigma = pm.HalfCauchy("sigma", beta=10)
    mu = tt.dot(w, shared_X.T)

    # Define likelihood
    likelihood = pm.Normal("y", mu=mu, sd=sigma, observed=shared_y)

    trace = pm.sample(500, discard_tuned_samples=False, random_seed=seed, njobs=1, chains=1)

trace1 = trace[:50] # 50 samples from tuned samples
trace2 = trace[500:1000] # after discarding 500 tune samples

print("True coefficients")
print("Coefficients inferred ")

print(mean_squared_error(test_y, pm.sample_ppc(trace1, 10000, model)['y'].mean(axis=0)))
print(mean_squared_error(test_y, pm.sample_ppc(trace2, 10000, model)['y'].mean(axis=0)))

print("WAIC for train data")
print(pm.waic(trace1, model).WAIC)
print(pm.waic(trace2, model).WAIC)


print("WAIC for test data")
print(pm.waic(trace1, model).WAIC)
print(pm.waic(trace2, model).WAIC)

I can run loo on the test data set without problem - are you on master?
Two quick comments:

  1. When I run the code above I see a lot of warning for both WAIC and LOO, this is an indication of problem in the computation and you cannot trust the result.
  2. I would suggest you to set the number of predictors to a lower dimension - it is likely that your model is overfitting with few data points.
1 Like

Thanks for the suggestions @junpenglao. I will re-check the LOO.

I did a little bit of investigation. It seems, you are correct. When calculating the WAIC; computed lppd_i are getting smaller as we increase the number of samples drawn from the target distribution.

However, when we compute the var(logp), this result in some large variance values and that cause the large WAIC.

So what is the alternative, if we can’t trust WAIC or LOO for this case. Can we just use lppd values?

You need to find out first what is causing the large var(logp) in WAIC or large khat in LOO. And depending on your model set up, there are different possibility that cause such problem. See for example: