Parallel dimensions...? Utilizing multiple datetime dimensions for different parameters

I’ll be honest here, I haven’t used PyMC much since PyMC2 so I am a bit out of the loop on so many of the latest and greatest features and am trying to catch up.

I’m trying to build a model with some ragged panel data indexed by date and another id, and I’d like to have some of the model parameters use dimensions that are derived from the date dimension, e.g. day of week.

For a simplified 1-D example without taking advantage of dimensions, I could build a linear regression with a slope on time and a day of the week effect (y \sim \beta_0 t + \beta_{weekday}) like this:

# imports
import pymc as pm
import arviz as az
import xarray as xr
import pandas as pd
import numpy as np

# dates
dates = pd.date_range("2022-01-01", "2022-12-31")
weekdays = dates.weekday
days = dates.day_of_year

# simulated parameters
slope = .1
dow_effect = np.array([5, 10, 15, 20, 12.5, 7.5, 2.5])

# simulated response
y = (slope * days) + dow_effect[dates.weekday] + np.random.normal(size=dates.shape)

# combine into dataframe and xarray
df = pd.DataFrame(
    data={
        "weekday": weekdays,
        "day": days,
        "y": y,
    }, 
    index=pd.Index(dates, name="date")
)
ds1 = df.to_xarray()

# simple model
with pm.Model() as model1:
    # priors
    sigma = pm.HalfStudentT("sigma", 4, 1)
    dow_effect = pm.Normal("dow_effect", mu=np.repeat(0, 7), sigma=np.repeat(7.14, 7))
    slope = pm.Normal("slope", mu=0, sigma=.0109)

    # likelihood
    likelihood = pm.Normal(
        "y", 
        mu=(
            slope * ds1["day"].values
            + dow_effect[ds1["weekday"].values]
        ), 
        sigma=sigma, 
        observed=ds1["y"]
    )

    # inference
    idata1 = pm.sample(1000)

That fits great, but then I miss out on the fancy indexing:

az.summary(idata1)
>> 				mean	sd		hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
dow_effect[0]	4.972	0.165	4.654	5.267	0.003		0.002	3388.0		3219.0		1.0
dow_effect[1]	9.865	0.166	9.574	10.200	0.003		0.002	3365.0		2999.0		1.0
dow_effect[2]	15.123	0.169	14.775	15.413	0.003		0.002	3364.0		2942.0		1.0
dow_effect[3]	20.046	0.169	19.707	20.347	0.003		0.002	3251.0		3113.0		1.0
dow_effect[4]	12.395	0.167	12.085	12.718	0.003		0.002	3359.0		2851.0		1.0
dow_effect[5]	7.317	0.169	7.003	7.631	0.003		0.002	3332.0		2900.0		1.0
dow_effect[6]	2.414	0.168	2.110	2.738	0.003		0.002	3605.0		3036.0		1.0
slope			0.100	0.001	0.099	0.101	0.000		0.000	2133.0		2780.0		1.0
sigma			1.009	0.038	0.936	1.081	0.000		0.000	5753.0		2704.0		1.0

My initial thought was to create another dimension on the xarray Dataset for the day of the week, but this leads to a lot of missingness in the Dataset because there’s only one weekday per date:

df["day_name"] = df.index.day_name()
ds2 = df.reset_index().set_index(["date", "day_name"]).to_xarray()[["day", "y"]]
ds2

>> 
Dimensions: 
	(date: 365, day_name: 7)
Coordinates: 
	date		(date)				datetime64[ns]	
		2022-01-01 ... 2022-12-31
	day_name	(day_name)			object
		'Friday' 'Monday' ... 'Wednesday'
Data variables:
	day			(date, day_name)	float64
		nan nan 1.0 nan ... nan nan nan nan
	y			(date, day_name)	float64
		nan nan 8.254 nan ... nan nan nan

If I then use that new mostly missing Dataset in a model with dimensions, it fails:

with pm.Model(coords={"day_name": ds2.coords["day_name"]}) as model2:
    # priors
    sigma = pm.HalfStudentT("sigma", 4, 1)
    dow_effect = pm.Normal("dow_effect", mu=0, sigma=7.14, dims="day_name")
    slope = pm.Normal("slope", mu=0, sigma=.0109)

    # likelihood
    likelihood = pm.Normal(
        "y", 
        mu=(
            slope * ds2["day"].values
            + dow_effect
        ), 
        sigma=sigma, 
        observed=ds2["y"]
    )

    # inference
    idata2 = pm.sample(1000)

>> SamplingError: Initial evaluation of model at starting point failed!
Starting values:
{'sigma_log__': array(0.38758682), 'dow_effect': array([-0.05657639,  0.44262808, -0.48668421, -0.56598902, -0.2920948 ,
        0.89548499, -0.65632519]), 'slope': array(0.10509197), 'y_missing': array([nan, nan, nan, ..., nan, nan, nan])}

Logp initial evaluation results:
{'sigma': -0.98, 'dow_effect': -20.21, 'slope': -42.88, 'y_missing': nan, 'y_observed': -10912.17}
You can call `model.debug()` for more details.

I think this is likely due to all the missingness induced by the new dimension. Is there a better way of incorporating dimensions into the model and take advantage of all of those niceties without messing up the data structure that I’m not imagining? Thanks for any suggestions!

In your initial model, you only have 7 different dow_effect values (I assume one per weekday), so what I would do with that is update the model to:

coords = {
    "weekday": ["Monday", "Thursday", "Wednesday", # continue with rest of days so it is a list of length 7
}
with pm.Model(coords=coords) as model1:
    # priors
    sigma = pm.HalfStudentT("sigma", 4, 1)
    dow_effect = pm.Normal("dow_effect", mu=0, sigma=7.14, dims="weekday")
    # the dims="weekday" will repeat/batch the normal to get a variable with 7 dims,
    # so you can skip the repeat part now
    slope = pm.Normal("slope", mu=0, sigma=.0109)

    # likelihood
    likelihood = pm.Normal(
        "y", 
        mu=(
            slope * ds1["day"].values
            + dow_effect[ds1["weekday"].values]
        ), 
        sigma=sigma, 
        observed=ds1["y"]
    )

It might also be possible to add some dimensions to the likelihood, but I don’t know what these would be.

Thanks!

Maybe I’m just making assumptions about what model coords can do given my past experience with xarray. I was hoping to make use of automatic alignment/broadcasting/etc between model parameters and xarray datasets, but maybe that’s not possible?

For instance, in your example the dow_effect parameter has a weekday dim with named coordinates Monday, Tuesday, etc, but to actually use it in the likelihood it’s using integers to index into it. Is there any way to use the model coordinates to index into a parameter object? I can do dow_effect[0] but not dow_effect["Sunday"].

Dims in PyMC models are just labels for when we convert draws to InferenceData. They don’t have any effect on the model (broadcasting and indexing still follow numpy logic and are positional based). This is because PyTensor doesn’t work like Xarray

Thanks, now I understand when/how the dims are used - will stick to using positional indexing instead inside of the model.

Ah, dims can be used to determine the size of variables (instead of shape). Thats the only effect they have in the model itself.

I think I must have just been remembering the PyTensor announcement and mistakenly thought some of these features were already present:

One example is Xarray-like dimension support: Xarray adds many useful features to arrays. Most notably, named dimensions and coordinates. This is what allows coords and dims in PyMC to give names to your arrays. Currently, Aesara does not support this, nor are there plans to do so. But there are many benefits to having labeled tensors as well.

Any issues or discussions to follow on that? I tried searching both the PyTensor and PyMC github repos but came up short.

1 Like

Nothing yet, we’ll make sure to announce loudly when there are updates :slight_smile: