Correlated parameters, uneven group sizes, deep hierarchies

I’m working with hierarchical geographic data, something like this:

Call N the number of regions at the bottom of the hierarchy, and T the number of time steps over which I observe them. Each higher level is just the sum of the lower levels. I am interested in fitting a simple slope-intercept model that allows for information sharing between the different levels of the hierarchy, something like:

y_{i,t} = \alpha_0 + \alpha_{i,1} + \alpha_{i,2} + \alpha_{i,3} + (\gamma_0 + \gamma_{i,1} + \gamma_{i,2} + \gamma_{i,3})\cdot t + \epsilon_{i,t}

In the simple case where all the parameters are independent, this is easy enough to do with the usual indexing tricks. I am also curious about the more complex case where the slopes and intercepts are correlated. It seems one cannot simply draw all the values from a single centered MVN, because I require a different number of draws from each level (e.g. 5 parameters for the level-1 groups, 7 for the level-2 groups, etc.).
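For context, the independent case with "the usual indexing tricks" looks something like this as a generative process. A plain-numpy sketch (all sizes, indices, and scales here are hypothetical placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2 countries, 5 level-2 regions, 12 bottom-level units
N, T = 12, 30
country_idx = rng.integers(0, 2, size=N)   # maps each bottom unit to its country
region2_idx = rng.integers(0, 5, size=N)   # maps each bottom unit to its level-2 region

# Independent offsets per level of the hierarchy
a0, g0 = 1.0, 0.5
a_country = rng.normal(0, 1.0, size=2)
a_region2 = rng.normal(0, 1.0, size=5)
g_country = rng.normal(0, 0.1, size=2)
g_region2 = rng.normal(0, 0.1, size=5)

# Index up to the bottom level and sum the offsets
intercept = a0 + a_country[country_idx] + a_region2[region2_idx]   # shape (N,)
slope     = g0 + g_country[country_idx] + g_region2[region2_idx]   # shape (N,)

t = np.arange(T)
mu = intercept[None, :] + slope[None, :] * t[:, None]              # shape (T, N)
y = mu + rng.normal(0, 0.1, size=(T, N))
```

The same fancy-indexing pattern carries over directly to PyMC random variables.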

Question 1: short of working out a structured covariance matrix with lots of duplicated elements, is this true? Could I re-purpose a GP covariance function, for example the Matérn function with a suitably chosen \nu (less than 1?), to induce discontinuity between each hierarchical index?
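To make the "structured covariance matrix with lots of duplicated elements" concrete: if u stacks all the group-level offsets and Z is the one-hot design matrix that maps (and sums) them down to the bottom level, then the induced covariance over the summed offsets is Z Σ_u Zᵀ, which is exactly a matrix full of repeated entries. A small numpy sketch (group memberships and variances are hypothetical):

```python
import numpy as np

# Hypothetical membership: 2 countries, 3 level-2 regions, N = 6 bottom units
country_idx = np.array([0, 0, 0, 1, 1, 1])
region2_idx = np.array([0, 0, 1, 1, 2, 2])
N = 6

# One-hot design matrix: columns 0-1 are countries, columns 2-4 are regions,
# so the summed offset for each bottom unit is Z @ u
Z = np.zeros((N, 2 + 3))
Z[np.arange(N), country_idx] = 1.0
Z[np.arange(N), 2 + region2_idx] = 1.0

# Covariance over the stacked offsets u; diagonal here for independent levels
# (correlated slopes/intercepts would make this block-diagonal instead)
Sigma_u = np.diag([1.0, 1.0, 0.5, 0.5, 0.5])

# Induced covariance over the N summed offsets: full of duplicated elements,
# e.g. units sharing both groups get identical rows
Sigma_bottom = Z @ Sigma_u @ Z.T
```

Whether this is preferable to a GP-style covariance function is part of the question, but the Z Σ Zᵀ form at least avoids writing the duplicated entries out by hand.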

So I do something like this:

    # compute_corr=True (the default in recent PyMC) returns (chol, corr, stds)
    param_chol, corr, stds = pm.LKJCholeskyCov('param_chol', n=6, eta=1.0, sd_dist=sd_dist)
    # n=6 implies 6 correlated columns: 3 intercept offsets + 3 slope offsets,
    # so all three hierarchy levels must be stacked (level-3 names assumed here)
    intercepts = at.stack([offset_intercept_country[country_idx],
                           offset_intercept_region_2[region2_idx],
                           offset_intercept_region_3[region3_idx]], axis=1)
    slopes = at.stack([offset_trend_country[country_idx],
                       offset_trend_region_2[region2_idx],
                       offset_trend_region_3[region3_idx]], axis=1)
    params = at.concatenate([intercepts, slopes], axis=1).dot(param_chol)

All the offsets are independent normals of the appropriate sizes, and the idx arrays are indexing arrays that broadcast everything to the size of the bottom of the hierarchy.

In principle this should give me what I want: an N \times 6 matrix of offsets with the desired correlations. The final mean would be something like

    mu = mu_intercept + params[:, :3].sum(axis=1)[None, :] \
         + (mu_slope + params[:, 3:].sum(axis=1)[None, :]) * time[:, None]

giving a T \times N matrix of observations.
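For what it's worth, the shape logic does check out in a plain-numpy mock-up (the sizes and the covariance below are placeholders, and a standard-normal matrix stands in for the stacked offsets):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes; param_chol stands in for the 6x6 LKJ Cholesky factor
N, T = 10, 20
z = rng.normal(size=(N, 6))                       # the stacked/concatenated offsets
cov = 0.5 * np.eye(6) + 0.5 * np.ones((6, 6))     # placeholder positive-definite cov
param_chol = np.linalg.cholesky(cov)              # lower-triangular factor
params = z.dot(param_chol)                        # shape (N, 6)

mu_intercept, mu_slope = 1.0, 0.5
time = np.arange(T)
mu = (mu_intercept + params[:, :3].sum(axis=1)[None, :]
      + (mu_slope + params[:, 3:].sum(axis=1)[None, :]) * time[:, None])
# mu has shape (T, N): one row per time step, one column per bottom-level unit
```

One subtlety that might be worth double-checking: if the stacked offsets are standard normals z and L is lower-triangular, `z.dot(L)` gives rows with covariance Lᵀ L, whereas the usual non-centered parameterization wants L Lᵀ, i.e. `z.dot(L.T)` (equivalently `at.dot(param_chol, z.T).T`). A mismatch there would scramble the correlations without breaking any shapes, which could contribute to poor sampling on data generated by the same code.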

I’m here because this approach samples extremely poorly, even on synthetic data generated using exactly this procedure, and it gets worse as I hold N constant and increase T, which should be the asymptotically “good” case. This makes me think I’m missing something.

Curious if anyone has experience with data like this.