How to understand `obs_dim_2` in out-of-sample prediction?

Galen_Seilis · June 29, 2024, 2:54pm

Question

This questions is related to this one

but I felt like the original example is more complicated than needed. I am trying to understand how to make out-of-sample predictions with PyMC, so I figure I might as well go back to something simpler first.

I have been following this example:

Here is a selection of the Python code along with some file writes at the end:

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

from scipy.special import expit as inverse_logit

RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
az.style.use("arviz-darkgrid")

# Number of data points
n = 250
# Create features
x1 = rng.normal(loc=0.0, scale=2.0, size=n)
x2 = rng.normal(loc=0.0, scale=2.0, size=n)
# Define target variable
intercept = -0.5
beta_x1 = 1
beta_x2 = -1
beta_interaction = 2
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
p = inverse_logit(z)
# note binomial with n=1 is equal to a Bernoulli
y = rng.binomial(n=1, p=p, size=n)
df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))
df.head()

labels = ["Intercept", "x1", "x2", "x1:x2"]
df["Intercept"] = np.ones(len(df))
df["x1:x2"] = df["x1"] * df["x2"]
# reorder columns to be in the same order as labels
df = df[labels]
x = df.to_numpy()

indices = rng.permutation(x.shape[0])
train_prop = 0.7
train_size = int(train_prop * x.shape[0])
training_idx, test_idx = indices[:train_size], indices[train_size:]
x_train, x_test = x[training_idx, :], x[test_idx, :]
y_train, y_test = y[training_idx], y[test_idx]


coords = {"coeffs": labels}

with pm.Model(coords=coords) as model:
    # data containers
    X = pm.MutableData("X", x_train)
    y = pm.MutableData("y", y_train)
    # priors
    b = pm.Normal("b", mu=0, sigma=1, dims="coeffs")
    # linear model
    mu = pm.math.dot(X, b)
    # link function
    p = pm.Deterministic("p", pm.math.invlogit(mu))
    # likelihood
    pm.Bernoulli("obs", p=p, observed=y)


with model:
    idata = pm.sample()

with model:
    pm.set_data({"X": x_test, "y": y_test})
    predictions = pm.sample_posterior_predictive(trace=idata, predictions=True).predictions

predictions.to_netcdf(__file__.replace('.py', '.nc'))
predictions.to_dataframe().to_csv(__file__.replace('.py', '.csv'))

I have written the results to a CSV for inspection.

$ head experiment.csv 
chain,draw,obs_dim_2,obs
0,0,0,0
0,0,1,0
0,0,2,0
0,0,3,0
0,0,4,1
0,0,5,0
0,0,6,1
0,0,7,1
0,0,8,0

I understand the concept of the chain and draw with respect to the MCMC. I assume obs is the same observation specified in the model itself, so I am looking at outcomes. I am unsure what obs_dim_2 is about.

Update 1

When I look in the NETCDF (loaded as an xarray) I see that it is a collection of 75 integers:

>>> import xarray
>>> data = xarray.load_dataset("experiment.nc")
>>> data.obs_dim_2
<xarray.DataArray 'obs_dim_2' (obs_dim_2: 75)> Size: 300B
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
       36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
       54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
       72, 73, 74], dtype=int32)
Coordinates:
  * obs_dim_2  (obs_dim_2) int32 300B 0 1 2 3 4 5 6 7 ... 68 69 70 71 72 73 74

I’m not sure where the 75 comes from. It isn’t a multiple of the sample size (250) or the number of levels labels (4).

Galen_Seilis · June 29, 2024, 3:46pm

I noticed that when I removed the prediction=True and accessing the attribute prediction, like this,

with model:
    pm.set_data({"X": x_test, "y": y_test})
    predictions = pm.sample_posterior_predictive(trace=idata)

that the CSV has 75 columns of the form ('obs[k]', k) for k=0…74.

$ head experiment.csv 
,chain,draw,"('obs[0]', 0)","('obs[10]', 10)","('obs[11]', 11)","('obs[12]', 12)","('obs[13]', 13)","('obs[14]', 14)","('obs[15]', 15)","('obs[16]', 16)","('obs[17]', 17)","('obs[18]', 18)","('obs[19]', 19)","('obs[1]', 1)","('obs[20]', 20)","('obs[21]', 21)","('obs[22]', 22)","('obs[23]', 23)","('obs[24]', 24)","('obs[25]', 25)","('obs[26]', 26)","('obs[27]', 27)","('obs[28]', 28)","('obs[29]', 29)","('obs[2]', 2)","('obs[30]', 30)","('obs[31]', 31)","('obs[32]', 32)","('obs[33]', 33)","('obs[34]', 34)","('obs[35]', 35)","('obs[36]', 36)","('obs[37]', 37)","('obs[38]', 38)","('obs[39]', 39)","('obs[3]', 3)","('obs[40]', 40)","('obs[41]', 41)","('obs[42]', 42)","('obs[43]', 43)","('obs[44]', 44)","('obs[45]', 45)","('obs[46]', 46)","('obs[47]', 47)","('obs[48]', 48)","('obs[49]', 49)","('obs[4]', 4)","('obs[50]', 50)","('obs[51]', 51)","('obs[52]', 52)","('obs[53]', 53)","('obs[54]', 54)","('obs[55]', 55)","('obs[56]', 56)","('obs[57]', 57)","('obs[58]', 58)","('obs[59]', 59)","('obs[5]', 5)","('obs[60]', 60)","('obs[61]', 61)","('obs[62]', 62)","('obs[63]', 63)","('obs[64]', 64)","('obs[65]', 65)","('obs[66]', 66)","('obs[67]', 67)","('obs[68]', 68)","('obs[69]', 69)","('obs[6]', 6)","('obs[70]', 70)","('obs[71]', 71)","('obs[72]', 72)","('obs[73]', 73)","('obs[74]', 74)","('obs[7]', 7)","('obs[8]', 8)","('obs[9]', 9)"
0,0,0,0,1,1,0,0,1,0,0,1,0,1,1,0,1,1,1,1,1,0,1,1,0,0,0,1,0,1,0,1,1,0,0,1,0,0,1,0,0,0,1,0,1,0,1,0,0,0,1,0,0,0,0,1,0,1,1,0,0,0,0,1,1,0,1,0,0,1,0,1,1,0,0,1,0,1
1,0,1,0,1,1,0,1,1,0,0,1,0,1,1,0,0,0,1,1,0,0,1,0,0,0,1,1,0,1,0,1,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,1
2,0,2,0,1,1,0,0,1,0,0,1,1,1,1,0,1,1,1,1,1,0,1,0,0,0,0,1,0,0,0,1,1,0,0,1,0,1,1,0,1,1,0,0,1,0,0,1,0,0,1,0,0,0,0,1,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,1,1,0,0,1,0,1
3,0,3,0,1,1,0,0,1,1,0,1,0,1,1,0,1,0,1,1,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,0,0,0,1,0,1,1,0,0,1,0,1
4,0,4,0,1,1,0,1,1,1,0,1,0,1,0,0,1,1,1,1,0,0,1,0,0,1,1,1,1,1,0,0,1,1,0,1,0,1,1,0,0,1,0,0,1,0,1,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,1,1,0,0,1,0,0
5,0,5,1,1,1,0,0,1,0,0,1,0,1,1,0,1,1,1,1,0,0,0,0,0,1,1,1,0,1,0,1,1,0,0,1,0,1,1,0,0,1,1,0,1,0,0,1,0,0,1,0,1,0,0,0,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,0,1,0,0,1,0,1
6,0,6,0,1,1,0,1,1,0,0,1,1,1,1,0,1,1,1,1,1,0,1,0,1,1,1,1,0,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0,1,1,0,0,1,0,1,0,0,1,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,1,1,0,0,1,0,1
7,0,7,1,1,1,0,1,1,0,0,1,0,1,1,0,1,1,1,1,0,0,1,0,0,0,1,1,0,1,0,0,1,1,0,1,0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,1,0,1,1,0,0,0,0,1,1,0,0,0,0,1,0,1,1,0,0,1,0,1
8,0,8,0,1,1,0,1,0,0,0,1,0,1,1,0,0,0,1,1,1,0,1,0,0,0,1,1,1,1,0,0,1,0,0,1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,1,0,1

It just clicked for me that these correspond to the testing set size! So the obs_dim_2 is indexing the test set observation.

Topic		Replies	Views
How to include coords in out-of-sample prediction? v5 modeling	1	95	July 2, 2024
PyMC v5.10.3 prediction stuff v5 modeling	11	458	January 15, 2024
Pymc5 out of sample v5 modeling	9	1222	August 20, 2023
Issues when trying to do out of sample prediction v5	2	143	April 9, 2024
Making Out of Sample Predictions with GaussianRandomWalk v5 time_series	9	1013	October 5, 2022

How to understand `obs_dim_2` in out-of-sample prediction?

Question

Update 1

Related topics