Question
This questions is related to this one
but I felt like the original example is more complicated than needed. I am trying to understand how to make out-of-sample predictions with PyMC, so I figure I might as well go back to something simpler first.
I have been following this example:
Here is a selection of the Python code along with some file writes at the end:
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm
from scipy.special import expit as inverse_logit
RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)
az.style.use("arviz-darkgrid")
# Number of data points
n = 250
# Create features
x1 = rng.normal(loc=0.0, scale=2.0, size=n)
x2 = rng.normal(loc=0.0, scale=2.0, size=n)
# Define target variable
intercept = -0.5
beta_x1 = 1
beta_x2 = -1
beta_interaction = 2
z = intercept + beta_x1 * x1 + beta_x2 * x2 + beta_interaction * x1 * x2
p = inverse_logit(z)
# note binomial with n=1 is equal to a Bernoulli
y = rng.binomial(n=1, p=p, size=n)
df = pd.DataFrame(dict(x1=x1, x2=x2, y=y))
df.head()
labels = ["Intercept", "x1", "x2", "x1:x2"]
df["Intercept"] = np.ones(len(df))
df["x1:x2"] = df["x1"] * df["x2"]
# reorder columns to be in the same order as labels
df = df[labels]
x = df.to_numpy()
indices = rng.permutation(x.shape[0])
train_prop = 0.7
train_size = int(train_prop * x.shape[0])
training_idx, test_idx = indices[:train_size], indices[train_size:]
x_train, x_test = x[training_idx, :], x[test_idx, :]
y_train, y_test = y[training_idx], y[test_idx]
coords = {"coeffs": labels}
with pm.Model(coords=coords) as model:
# data containers
X = pm.MutableData("X", x_train)
y = pm.MutableData("y", y_train)
# priors
b = pm.Normal("b", mu=0, sigma=1, dims="coeffs")
# linear model
mu = pm.math.dot(X, b)
# link function
p = pm.Deterministic("p", pm.math.invlogit(mu))
# likelihood
pm.Bernoulli("obs", p=p, observed=y)
with model:
idata = pm.sample()
with model:
pm.set_data({"X": x_test, "y": y_test})
predictions = pm.sample_posterior_predictive(trace=idata, predictions=True).predictions
predictions.to_netcdf(__file__.replace('.py', '.nc'))
predictions.to_dataframe().to_csv(__file__.replace('.py', '.csv'))
I have written the results to a CSV for inspection.
$ head experiment.csv
chain,draw,obs_dim_2,obs
0,0,0,0
0,0,1,0
0,0,2,0
0,0,3,0
0,0,4,1
0,0,5,0
0,0,6,1
0,0,7,1
0,0,8,0
I understand the concept of the chain and draw with respect to the MCMC. I assume obs
is the same observation specified in the model itself, so I am looking at outcomes. I am unsure what obs_dim_2
is about.
Update 1
When I look in the NETCDF (loaded as an xarray) I see that it is a collection of 75 integers:
>>> import xarray
>>> data = xarray.load_dataset("experiment.nc")
>>> data.obs_dim_2
<xarray.DataArray 'obs_dim_2' (obs_dim_2: 75)> Size: 300B
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71,
72, 73, 74], dtype=int32)
Coordinates:
* obs_dim_2 (obs_dim_2) int32 300B 0 1 2 3 4 5 6 7 ... 68 69 70 71 72 73 74
I’m not sure where the 75 comes from. It isn’t a multiple of the sample size (250) or the number of levels labels (4).