Modelling long data with multiple groups

Hello! I’ve read this excellent reply by @OriolAbril on recommending passing “long data” rather than “wide data” (see a figure here for an example). I completely agree with this, and I’m trying to write my model code in this manner.

I’m doing time-series forecasting, and I’d like to add seasonal fourier features. I have observed values on two groups, so for a given date, I have an observed value for group A and another for group B. I am struggling with getting the shapes to work together, and would very much appreciate some pointers.

The data might look like this:

┌────────────┬───────┬───────────┐
│ date       ┆ group ┆ actual    │
│ ---        ┆ ---   ┆ ---       │
│ date       ┆ str   ┆ f64       │
╞════════════╪═══════╪═══════════╡
│ 2023-01-01 ┆ A     ┆ 2.091415  │
│ 2023-01-01 ┆ B     ┆ -2.171598 │
│ 2023-01-02 ┆ A     ┆ 1.892557  │
│ 2023-01-02 ┆ B     ┆ -2.400566 │
└────────────┴───────┴───────────┘

I have shared the above dummy data and code in this gist.

It seems sensible for me that since we have 1 years worth of data, to instead of creating a 732 row long fourier feature, we create a 366 long one. This code does that:

def get_fourier_features(
    dates: pd.Series, n_order: int, period: float
) -> tuple[pd.DataFrame, np.ndarray]:
    day = dates.dt.day_of_year / period
    labels = np.arange(2 * n_order)

    features = pd.DataFrame(
        {
            f"{func.__name__}_order_{order}": func(2 * np.pi * day * order)
            for order in range(1, n_order + 1)
            for func in (np.sin, np.cos)
        },
    )
    features.set_index(dates, inplace=True)
    return features, labels

# monthly fourier series, we create a shape 365 x 12 dataframe here
fourier, fourier_labels = get_fourier_features(pd.Series(dates), 6, 30.5)

The following pymc model is my current attempt at making this work. The goal is making the model sample nicely with a b_fourier_monthly of shape (2, 12).

coords = {
    "index": index,  # 732 rows
    "date": date_labels,  # 366 days
    "group": group_labels,  # 2 groups
    "fourier": fourier_labels,  # 12 fourier features
}


with pm.Model(coords=coords):
    intercept = pm.Normal("intercept", mu=0, sigma=1, dims="group")

    # HOW CAN I MAKE ONE B_FOURIER_MONTHLY PER GROUP (A, B)?
    b_fourier_monthly = pm.Normal(
        name="b_fourier_monthly",
        mu=0,
        sigma=1,
        dims="fourier",  # <--- Should I make this ("group", "fourier") instead?
    )
    seasonality_monthly = pm.Deterministic(
        name="seasonality_monthly",
        var=pymc.math.dot(b_fourier_monthly, fourier.T),
        dims="date",
        # this will currently have shape 366, or 2, 366 if I change the dims above
    )

    pm.Normal(
        name="likelihood",
        mu=(
            intercept[group_idx]
            + seasonality_monthly  # <-- this has shape (366,), while the intercept has shape (732,), so we get an error here
        ),
        sigma=1,
        observed=df_groups["actual"],  # shape = (732, )
        dims="index",
    )

    trace = pm.sample(1000)

Does anyone have any suggestions? As mentioned, the full code is here in this gist.

You basically have the right idea. You need to change the dimensions of b_fourier_monthly to have group. Then you will switch from using dot to doing the multiplication and sum manually, because after you index into b_fourier_monthly[group_idx], it will have the same shape as seasonality_monthly. Here’s the model:

def create_fourier_features(t, n, p=365.2):
    """
    Generates Fourier basis vectors to model seasonal patterns using pytensor.
    
    Parameters
    ----------
    t: TensorVariable
        Time counter .
    n: int
        Number of basis vectors to create. When n = p/2, the model is saturated and any function can be
        represented. n < p/2 learns simpler patterns. n > p/2 is not identified.
    p: float
        Length of one seasonal cycle, **in units of t**. Must be adjusted for spacing if Δt != 1!
    
    Returns
    -------
    X: TensorVariable
        Matrix of Fourier basis vectors. Shape is (T, 2 * n)
    """
    x = 2 * np.pi * (pt.arange(n)+1) * t[:, None] / p
    return pt.concatenate((pt.cos(x), pt.sin(x)), axis = 1)

coords = {
    "date": date_labels,  # 366 days
    "group": group_labels,  # 2 groups
    "fourier": fourier_labels,  # 12 fourier features
}

coords_mutable = {"index": index}

with pm.Model(coords=coords, coords_mutable=coords_mutable):
    y_data = pm.MutableData('y_data', df_groups.actual, dims=['index'])
    group_idx_pt = pm.MutableData('group_idx', group_idx, dims=['index'])
    date_idx_pt = pm.MutableData('date_idx_pt', date_idx, dims=['index'])
    
    # Compute the fourier features on your model's compute graph. There's not reason not to, and it will
    # make forcasting easier.
    # Also, the fourier function doesn't care if the dates repeat, so you can just compute all the features
    # for all the rows in one shot, no need to muck around with indexing later. The only gotcha is that the 
    # periodicity is based on the maximum value of the times we provide. So period of 12 for T_max = 366 is
    # 366 / 12
    X_fourier = create_fourier_features(date_idx_pt, n=6, p=366/12)

    intercept = pm.Normal("intercept", mu=0, sigma=1, dims=["group"])
        
    b_fourier_monthly = pm.Normal(
        name="b_fourier_monthly",
        mu=0,
        sigma=1,
        dims=["group", "fourier"],
    )
    
    # Dot product "by hand" -- expand b_fourier_monthly to (732, 12), then do
    # (732, 12) * (732, 12), then sum over the (12, ) dimension
    seasonality_monthly = pm.Deterministic("seasonality_monthly",
                                           (b_fourier_monthly[group_idx] * X_fourier).sum(axis=-1),
                                           dims=["index"])
    
    sigma = pm.HalfNormal('sigma')
    y_hat = pm.Normal(
        name="y_hat",
        mu=intercept[group_idx] + seasonality_monthly,
        sigma=sigma,
        observed=y_data, 
        shape = X_fourier.shape[0], # This line is for forcasting again
        dims=["index"],
    )
    
    idata = pm.sample()
    idata = pm.sample_posterior_predictive(idata, extend_inferencedata=True)

And the outputs:

fig, ax = plt.subplots(1, 2, figsize=(14, 4))
for i, (group, axis) in enumerate(zip(group_labels, fig.axes)):
    group_mask = group_idx == i
    x_grid_idx = date_idx[group_mask]
    x_grid = date_labels[x_grid_idx]
    mu = idata.posterior_predictive.y_hat.isel(obs_idx=group_mask).mean(dim=['chain', 'draw'])
    hdi = az.hdi(idata.posterior_predictive.y_hat).y_hat.sel(obs_idx=group_mask)
    axis.plot(x_grid, mu, c='tab:blue')
    axis.fill_between(x_grid, *hdi.values.T, color='tab:blue', alpha=0.25)
    axis.plot(x_grid, df_groups[df_groups.group == group].actual, color='k', lw=0.5, ls='--')
    axis.set_title(f'Group {group}')

I did a few other changes to help your quality of life, especially when it come to forecasting. With the model written this way (computing the fourier basis vectors on the graph and setting y_hat to have shape = X_fourier.shape[0]), you can just do pm.set_data(dict(date_idx_pt = new_dates, group_idx_pt = new_groups), coords={'index":new_range_index"}) and you’re done.

Fantastic!

I really appreciate your quality of life tips! I’m really excited to try this now!

Last night I was thinking about how if I dotted b_fourier_monthly @ fourier.T (without indexing into b_fourier_monthly first), I would have shape [2, 732]. I would then want to index it with something in order to match the 732 long observed. I saw this comment of yours about multiindex. I wonder if that could have helped me here. If the multiindex had been ordered groups, date, then I think I might have been able to index with that directly. But I’m on thin ice here.

You can’t do multi-indexing in a PyMC model, sadly. Adding a multi-index after sampling like in that link would make the post-estimation plotting nicer, though. Instead of all the group_mask stuff, you could just do .sel(group = group).

This works, but I think you would have shape (n_groups, T_timesteps). 732 is n * T right? If you made fourier with the “long” time, then you’d have groups twice, once on the first dimension and then again on the 2nd.

But in either case, you would have to ravel it after the dot. It would also not be robust to unbalanced panels (situations where you don’t observe the same number of time steps for every group).

If you have a balanced panel anyway, I’d just keep the data in wide form and do it your way. It skips all the reasoning about indices.