I’ll be honest here: I haven’t used PyMC much since PyMC2, so I’m a bit out of the loop on many of the latest and greatest features and am trying to catch up.
I’m trying to build a model with some ragged panel data indexed by date and another id, and I’d like to have some of the model parameters use dimensions that are derived from the date dimension, e.g. day of week.
For a simplified 1-D example without taking advantage of dimensions, I could build a linear regression with a slope on time and a day of the week effect (y \sim \beta_0 t + \beta_{weekday}) like this:
# imports
import pymc as pm
import arviz as az
import xarray as xr
import pandas as pd
import numpy as np
# dates
dates = pd.date_range("2022-01-01", "2022-12-31")
weekdays = dates.weekday
days = dates.day_of_year
# simulated parameters
slope = .1
dow_effect = np.array([5, 10, 15, 20, 12.5, 7.5, 2.5])
# simulated response
y = (slope * days) + dow_effect[dates.weekday] + np.random.normal(size=dates.shape)
# combine into dataframe and xarray
df = pd.DataFrame(
    data={
        "weekday": weekdays,
        "day": days,
        "y": y,
    },
    index=pd.Index(dates, name="date"),
)
ds1 = df.to_xarray()
# simple model
with pm.Model() as model1:
    # priors
    sigma = pm.HalfStudentT("sigma", 4, 1)
    dow_effect = pm.Normal("dow_effect", mu=np.repeat(0, 7), sigma=np.repeat(7.14, 7))
    slope = pm.Normal("slope", mu=0, sigma=.0109)
    # likelihood
    likelihood = pm.Normal(
        "y",
        mu=(
            slope * ds1["day"].values
            + dow_effect[ds1["weekday"].values]
        ),
        sigma=sigma,
        observed=ds1["y"],
    )
    # inference
    idata1 = pm.sample(1000)
That fits great, but then I miss out on the fancy indexing:
az.summary(idata1)
>>                  mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
dow_effect[0]      4.972  0.165   4.654    5.267      0.003    0.002    3388.0    3219.0    1.0
dow_effect[1]      9.865  0.166   9.574   10.200      0.003    0.002    3365.0    2999.0    1.0
dow_effect[2]     15.123  0.169  14.775   15.413      0.003    0.002    3364.0    2942.0    1.0
dow_effect[3]     20.046  0.169  19.707   20.347      0.003    0.002    3251.0    3113.0    1.0
dow_effect[4]     12.395  0.167  12.085   12.718      0.003    0.002    3359.0    2851.0    1.0
dow_effect[5]      7.317  0.169   7.003    7.631      0.003    0.002    3332.0    2900.0    1.0
dow_effect[6]      2.414  0.168   2.110    2.738      0.003    0.002    3605.0    3036.0    1.0
slope              0.100  0.001   0.099    0.101      0.000    0.000    2133.0    2780.0    1.0
sigma              1.009  0.038   0.936    1.081      0.000    0.000    5753.0    2704.0    1.0
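For context, this is the sort of labeled indexing I'm hoping to get. If dow_effect had a named day_name dimension with string coordinates, I believe I could select by weekday name instead of by position (hypothetical sketch; a day_name coordinate does not exist in idata1 as built above):
# hypothetical: assumes dow_effect had been declared with dims="day_name"
# and coords={"day_name": ["Monday", ..., "Sunday"]} when the model was created
monday = idata1.posterior["dow_effect"].sel(day_name="Monday")
az.summary(idata1, var_names=["dow_effect"])  # rows would be labeled Monday ... Sunday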
My initial thought was to create another dimension on the xarray Dataset for the day of the week, but this leads to a lot of missingness in the Dataset because there’s only one weekday per date:
df["day_name"] = df.index.day_name()
ds2 = df.reset_index().set_index(["date", "day_name"]).to_xarray()[["day", "y"]]
ds2
>> <xarray.Dataset>
Dimensions:   (date: 365, day_name: 7)
Coordinates:
  * date      (date) datetime64[ns] 2022-01-01 ... 2022-12-31
  * day_name  (day_name) object 'Friday' 'Monday' ... 'Wednesday'
Data variables:
    day       (date, day_name) float64 nan nan 1.0 nan ... nan nan nan nan
    y         (date, day_name) float64 nan nan 8.254 nan ... nan nan nan
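To make the missingness concrete, only one of the seven day_name cells per date is observed, so roughly six out of seven entries in the widened Dataset are NaN (quick check on the ds2 built above):
# fraction of (date, day_name) cells that actually hold an observation: ~1/7
print(float(ds2["y"].notnull().mean()))  # ~0.143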
If I then use that new mostly missing Dataset in a model with dimensions, it fails:
with pm.Model(coords={"day_name": ds2.coords["day_name"]}) as model2:
    # priors
    sigma = pm.HalfStudentT("sigma", 4, 1)
    dow_effect = pm.Normal("dow_effect", mu=0, sigma=7.14, dims="day_name")
    slope = pm.Normal("slope", mu=0, sigma=.0109)
    # likelihood
    likelihood = pm.Normal(
        "y",
        mu=(
            slope * ds2["day"].values
            + dow_effect
        ),
        sigma=sigma,
        observed=ds2["y"],
    )
    # inference
    idata2 = pm.sample(1000)
>> SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'sigma_log__': array(0.38758682), 'dow_effect': array([-0.05657639, 0.44262808, -0.48668421, -0.56598902, -0.2920948 ,
0.89548499, -0.65632519]), 'slope': array(0.10509197), 'y_missing': array([nan, nan, nan, ..., nan, nan, nan])}
Logp initial evaluation results:
{'sigma': -0.98, 'dow_effect': -20.21, 'slope': -42.88, 'y_missing': nan, 'y_observed': -10912.17}
You can call `model.debug()` for more details.
I think this is likely due to all the missingness induced by the new dimension. Is there a better way, that I'm not imagining, to incorporate dimensions into the model and take advantage of all of those niceties without messing up the data structure? My rough guess at an alternative is below; thanks for any suggestions!
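My guess is that I should keep the data 1-D (indexed only by date) and instead attach a day_name coordinate to the parameter, indexing the 7-vector with the integer weekday array. Something like the untested sketch below, where the day_names list and the model3/idata3 names are mine (pandas uses 0 = Monday for .weekday, so the list is Monday-first):
day_names = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

with pm.Model(coords={"day_name": day_names, "date": dates}) as model3:
    # priors
    sigma = pm.HalfStudentT("sigma", 4, 1)
    dow_effect = pm.Normal("dow_effect", mu=0, sigma=7.14, dims="day_name")
    slope = pm.Normal("slope", mu=0, sigma=.0109)
    # likelihood: pick out each date's weekday effect by integer index
    likelihood = pm.Normal(
        "y",
        mu=slope * ds1["day"].values + dow_effect[ds1["weekday"].values],
        sigma=sigma,
        observed=ds1["y"].values,
        dims="date",
    )
    # inference
    idata3 = pm.sample(1000)
If that's the right pattern, I'd expect the summary rows and .sel() lookups to be labeled by weekday name, but I'm not sure whether this is the idiomatic way to handle a dimension that's derived from another dimension.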