Multi-dimensional Dims Do not Seem to Work

Hello!

As I’m trying to turn a Prophet model into a hierarchical model, I’m having trouble with seasonality. The shape of the seasonality term should be the number of Fourier components times 2 (in the yearly case, 10 components, so 20 columns).

Using PyMC v4, I’m trying to use only dims, but I keep getting the following error:

---------------------------------------------------------------------------
ShapeError                                Traceback (most recent call last)
/tmp/ipykernel_7820/3842427766.py in <module>
     77                predictions,
     78                observed=eaches,
---> 79                dims = 'obs_id')
     80 
     81     # trace = pm.sample(tune=2000, draws=2000)

/opt/conda/lib/python3.7/site-packages/pymc/distributions/distribution.py in __new__(cls, name, *args, **kwargs)
    592             raise ValueError("Transformations for discrete distributions")
    593 
--> 594         return super().__new__(cls, name, *args, **kwargs)
    595 
    596 

/opt/conda/lib/python3.7/site-packages/pymc/distributions/distribution.py in __new__(cls, name, rng, dims, initval, observed, total_size, transform, *args, **kwargs)
    276             dims=dims,
    277             transform=transform,
--> 278             initval=initval,
    279         )
    280 

/opt/conda/lib/python3.7/site-packages/pymc/model.py in register_rv(self, rv_var, name, data, total_size, dims, transform, initval)
   1311                         actual=len(new_coords),
   1312                         expected=new_length,
-> 1313                     )
   1314                 # store it as tuple for immutability as in add_coord
   1315                 self._coords[dname] = tuple(new_coords)

/opt/conda/lib/python3.7/site-packages/pymc/model.py in make_obs_var(self, rv_var, data, dims, transform)
   1338         initval
   1339             The initial value of the random variable.
-> 1340 
   1341         Returns
   1342         -------

ShapeError: Dimensionality of data and RV don't match. (actual 1 != expected 2)

I’ve run this model fine when only changing the trend factor into a hierarchical framework. The error started when I tried to change the yearly seasonality into the hierarchical framework. Model below:

import numpy as np
import pandas as pd
import pymc as pm
import pymc.sampling_jax
import aesara.tensor as at  # PyMC v4 uses Aesara

# df_train and n_changepoints are defined elsewhere in my notebook

t = np.arange(len(df_train)) / len(df_train)

s = np.linspace(0, np.max(t), n_changepoints+2)[1:-1]


a = (t[:, None] > s)*1

n_components = 10

def yearly_X(t, p=365.25, n=n_components):
    x = 2 * np.pi * (np.arange(n)+1) * t[:, None] / p
    return np.concatenate((np.cos(x), np.sin(x)), axis = 1)

monthly_n_components = 5
def monthly_X(t, p=30.5, n=monthly_n_components):
    x = 2 * np.pi * (np.arange(n)+1) * t[:, None] / p
    return np.concatenate((np.cos(x), np.sin(x)), axis = 1) 

month = pd.get_dummies(df_train['month'])
sept = np.array(month[9])
march = np.array(month[3])
month = np.array(df_train['month'])
yearly_fourier = yearly_X(t, 365.25/len(t))
monthly_fourier = monthly_X(t, 30.5/len(t))

location_idxs, locations = pd.factorize(df_train['location'])
coords_={
    "location":df_train['location'].unique(),
    "n_components": np.arange(n_components*2),
    "obs_id":np.arange(len(location_idxs))
}

with pm.Model(coords=coords_) as partial_pooled_model:
    
    location_idxs = pm.ConstantData('location_idx', location_idxs, dims = 'obs_id')
    t_ = pm.MutableData('t', t)
    a_ = pm.MutableData('a', a)
    s_ = pm.MutableData('s', s)
    yearly_f_ = pm.MutableData('yearly_f_', yearly_fourier)
    monthly_m_ = pm.MutableData('monthly_m_',monthly_fourier)
    sept_month = pm.MutableData('sept_month',sept)
    march_month = pm.MutableData('march_month', march)
    eaches = pm.MutableData('eaches', df_train['eaches'])

    mu_k = pm.Normal('mu_k', 0, 1)
    sigma_k = pm.HalfCauchy('sigma_k', 3)
    k = pm.Normal('k', mu_k, sigma_k, dims = 'location')
    
    mu_m = pm.Normal('mu_m', 0, 1)
    sigma_m = pm.HalfCauchy('sigma_m', 3)
    m = pm.Normal('m', mu_m, sigma_m, dims = 'location')
    
    delta = pm.Laplace('delta', 0, 0.1, shape=n_changepoints)

    growth = pm.Deterministic('growth', k[location_idxs] + at.dot(a_, delta))
    offset = pm.Deterministic('offset', (m[location_idxs] + at.dot(a_, -s_ * delta)))
    trend = pm.Deterministic('trend', growth[location_idxs] * t_ + offset[location_idxs])

    yearly_mu = pm.Normal('yearly_mu',0,1)
    yearly_sigma = pm.HalfCauchy('yearly_sigma', 3)
    yearly_beta = pm.Normal('yearly_beta', yearly_mu, yearly_sigma, dims=('n_components', 'location'))
    yearly_seasonality = pm.Deterministic('yearly_seasonality',at.dot(yearly_f_, yearly_beta[location_idxs]))

    monthly_beta = pm.Normal('monthly_beta', 0, 10, shape = monthly_n_components*2)
    monthly_seasonality = pm.Deterministic('monthly_seasonality',at.dot(monthly_m_, monthly_beta))
    
    sept_beta = pm.Normal('sept_beta', 0,10)
    march_beta = pm.Normal('march_beta', 0,10)

    predictions =  pm.Deterministic('predictions', np.exp(trend[location_idxs] + yearly_seasonality[location_idxs]  + monthly_seasonality + (sept_beta*sept_month) + (march_beta*march_month)))


    pm.Poisson('obs',
               predictions,
               observed=eaches,
               dims = 'obs_id')

    trace = pymc.sampling_jax.sample_numpyro_nuts(tune=2000, draws = 2000)

That model won’t even print out a graphviz object.

When I change seasonality to

    yearly_mu = pm.Normal('yearly_mu', 0, 1)
    yearly_sigma = pm.HalfCauchy('yearly_sigma', 3)
    yearly_beta = pm.Normal('yearly_beta', yearly_mu, yearly_sigma, shape=n_components*2, dims=('location'))
    yearly_seasonality = pm.Deterministic('yearly_seasonality', at.dot(yearly_f_, yearly_beta[location_idxs]))

the graph prints, but it does not sample, throwing another dimensionality error. How should this variable be coded?

Dims are like force: if they aren’t working, you should probably use more. You also shouldn’t mix shape and dims; just make more dims. I generated some fake data that I guess might match yours, based on the shapes in your model graph:

import numpy as np
import pandas as pd
from scipy import stats

data = stats.poisson(mu=[5,2,1,7,2]).rvs([143, 5]).T.ravel()
dates = pd.date_range('2020-01-01', freq='W', periods=143)
locations = [f'location_{i}' for i in range(5)]
df_train = pd.DataFrame(data, index=pd.MultiIndex.from_product([dates, locations]), columns=['eaches'])
df_train.index.names = ['date', 'location']
df_train.head()
>>>out:
                       eaches
date       location          
2020-01-05 location_0       7
           location_1       2
           location_2       5
           location_3       6
           location_4       8

Then I made a bunch of coords, one for each feature – yearly components, monthly components, and changepoints – plus locations and all observations. This will cover all your bases when you write the model. I don’t like numbered coords because I never remember which number is which, so I made a bunch of fancy junk to give them all names. Just my preference; the important thing is that they are all there.

month = pd.get_dummies(df_train.index.get_level_values(0).month)
sept = np.array(month[9])
march = np.array(month[3])
month = np.array(df_train.index.get_level_values(0).month)

def create_fourier_features(t, n, p=365.25):
    x = 2 * np.pi * (np.arange(n)+1) * t[:, None] / p
    return np.concatenate((np.cos(x), np.sin(x)), axis = 1)

n_yearly = 10
n_monthly = 5

# Weekly data
yearly_fourier = create_fourier_features(t, n_yearly, 52)
monthly_fourier = create_fourier_features(t, n_monthly, 4)

location_idxs, locations = pd.factorize(df_train.index.get_level_values(1))
time_idxs, times = pd.factorize(df_train.index.get_level_values(0))

coords_={
    "location":locations,
    "yearly_components": [f'yearly_{f}_{i+1}' for f in ['cos', 'sin'] for i in range(n_yearly)],
    "monthly_components":[f'monthly_{f}_{i+1}' for f in ['cos', 'sin'] for i in range(n_monthly)],
    'changepoints':df_train.index.get_level_values(0)[np.argwhere(np.diff(a, axis=0) != 0)[:, 0]],
    "obs_id":[f'{loc}_{time.year}_week_{time.isocalendar().week}' for time, loc in df_train.index.values]
}

I don’t have access to your exact data so I don’t know if this is right, but I inferred a couple of potential “gotchas” from your graph.

  1. The periodicity of the Fourier features should be the length of a full cycle. I made synthetic weekly data, so the yearly component has a period of 52 and the monthly a period of 4. It should only be 365.25 if you’re working with daily data. Maybe you are, but since there are only 715 data points spread over several locations, I thought it was possible you’re not, so I wanted to highlight this.

(You also had some minor code duplication with the Fourier features function; you can just define it once and pass different values of n and p each time.)

  2. Make sure your time counter t is correct. It looks like you are working with long data, i.e. locations and times stacked together, but your time is defined as simply running the length of the dataset. If your data is structured like mine above, your trends will be wrong, because the first 5 observations all correspond to the first time period. To fix this, I made the time_idxs index variable, and I will use it to repeat time entries so that they match how the data is structured (see the sketch just below).
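Here’s a minimal check of that repeating structure, using the fake data above (the printed output assumes 5 locations per date, as in my synthetic setup):

time_idxs, times = pd.factorize(df_train.index.get_level_values('date'))
print(time_idxs[:10])
>>>out:
[0 0 0 0 0 1 1 1 1 1]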

Using the new coords, here’s how the trend terms ended up looking. Note the use of time_idxs to make the A matrix and t_ vector have the correct repeating structure:

    delta = pm.Laplace('delta', 0, 0.1, dims=['location', 'changepoints'])

    # a_[time_idxs] repeats the changepoint matrix rows so they line up with the stacked obs_id axis;
    # the multiply-then-sum is the "dot product by hand"
    growth = pm.Deterministic('growth', k[location_idxs] + (a_[time_idxs] * delta[location_idxs, :]).sum(axis=1), dims=['obs_id'])
    offset = pm.Deterministic('offset', m[location_idxs] + (a_[time_idxs] * -s_[None, :] * delta[location_idxs, :]).sum(axis=1), dims=['obs_id'])
    # growth and offset already carry the obs_id dimension, so they are not indexed again here
    trend = pm.Deterministic('trend', growth * t_[time_idxs] + offset, dims=['obs_id'])

Note that you can put dims on your deterministic variables. This is useful because otherwise the idata object will be polluted with lots of auto-generated coords like growth_dim_0, offset_dim_0, and trend_dim_0, which are actually all the same thing. If you want to do fancier broadcasting operations, it’s good to make sure xarray knows these are all the same. The same goes for your pm.Data containers: you can and should put dims on those as well.
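For example, the containers could look like this (a sketch assuming you add a "time": times coord to coords_, and that t, a, and the Fourier matrices are built with one row per unique date, per gotcha 2):

    t_ = pm.MutableData('t', t, dims='time')
    a_ = pm.MutableData('a', a, dims=('time', 'changepoints'))
    yearly_f_ = pm.MutableData('yearly_f_', yearly_fourier, dims=('time', 'yearly_components'))
    monthly_m_ = pm.MutableData('monthly_m_', monthly_fourier, dims=('time', 'monthly_components'))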

Getting locations into the seasonal component will look the same as the trend. Just like above, you have to switch to “dot product by hand” to make the shapes work out:

    yearly_mu = pm.Normal('yearly_mu', 0, 0.1)
    yearly_sigma = pm.HalfNormal('yearly_sigma', sigma=0.1)
    yearly_beta = pm.Normal('yearly_beta', yearly_mu, yearly_sigma, dims=['location', 'yearly_components'])
    yearly_seasonality = pm.Deterministic('yearly_seasonality', (yearly_f_[time_idxs] * yearly_beta[location_idxs, :]).sum(axis=1), dims=['obs_id'])

How to structure the hyper-priors here is an open question to me. The way I’ve written it, information is pooled across seasonal components and across locations. It might be preferable to divide that up somehow by introducing another level of hierarchy, but I’m really not sure. I hope someone else can chime in on this point.
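For concreteness, here is one hedged sketch of what that extra level could look like (untested; a separate hyper-prior per Fourier component, so the pooling happens only across locations):

    yearly_mu = pm.Normal('yearly_mu', 0, 0.1, dims='yearly_components')
    yearly_sigma = pm.HalfNormal('yearly_sigma', sigma=0.1, dims='yearly_components')
    yearly_beta = pm.Normal('yearly_beta', yearly_mu, yearly_sigma, dims=['location', 'yearly_components'])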

You might also want to check the values of your priors; I was getting overflow errors because the HalfCauchy with beta=3 can produce crazy huge values. That’s why I changed things to HalfNormal with a tiny sigma here. You can run np.exp(stats.halfcauchy(scale=3).rvs(1000)).max() to see that it overflows to inf almost every single time. Ignore this if you’re not having problems, just be careful with mixing Cauchy priors and the exponential function.
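To make the overflow point concrete, here is a quick comparison you could run (using the scipy equivalents of the two priors; the numbers are approximate):

from scipy import stats
import numpy as np

# float64 overflows past exp(~709); the max of 1000 HalfCauchy(beta=3) draws
# exceeds 709 on most runs, so exp() of it usually comes out as inf
print(np.exp(stats.halfcauchy(scale=3).rvs(1000)).max())
# HalfNormal(sigma=0.1) stays small, so exp() of it is always tame (~1.4 at most)
print(np.exp(stats.halfnorm(scale=0.1).rvs(1000)).max())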

Thanks for all the great catches. I am running monthly data right now, but weekly/daily will have to be done at a later date.

What are you using for t in the yearly_fourier feature? I get an error that t is not defined when running that code.

It’s just an np.arange of length 143. You should probably use time_idxs (the one created by pd.factorize) instead. That way, if there are missing data between groups, they will be accounted for.

The best would be something like t = time_idxs / max(time_idxs) so it runs from 0 to 1 (your Poisson mean will be less likely to overflow) and automatically accounts for missing data/heterogeneous lengths.
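If it helps, here is a hedged sketch of that full time setup, keeping one row per unique date so it lines up with the t_[time_idxs] indexing in the model above. Note that once t is normalized to 0-1, the Fourier periods have to be rescaled into units of t, just like the 365.25/len(t) trick in the original post:

time_idxs, times = pd.factorize(df_train.index.get_level_values('date'))
t = np.arange(len(times)) / len(times)  # 0 to 1, one entry per unique date

# one yearly cycle = 52 weekly steps, which is 52/len(times) in units of t
yearly_fourier = create_fourier_features(t, n_yearly, p=52/len(times))
monthly_fourier = create_fourier_features(t, n_monthly, p=4/len(times))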