Understanding indexing and dimensions in hierarchical models

I have data that is grouped into 5 countries. For each country, I am trying to model the following time-varying process:

\theta_{t} = \mathrm{invlogit}(\vec{\alpha}_{t} \cdot x_{t} + \beta + Ha)

with values:

\alpha_0 \sim N(0,2), \alpha_t \sim N(\alpha_{t-1},2)

\beta \sim N(0,2), Ha \sim N(0,2)

where the \alpha_t are time-varying coefficients, and \beta and Ha are constants.

Each x_{t} is a vector with 19 entries (i.e. I have 19 coefficients). All of these are saved in a pandas dataframe coefficients_df of shape (780k, 19), where my countries and t’s are in long format.

i.e. countries × T ≈ 780k rows in total, with t in {0, 1, 2, 3, …, T}. However, not every country has the same number of time-series observations.
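To make the process concrete, here is a minimal numpy sketch of the generative process for a single country (T and the predictor values are made up; the sd of sqrt(2) corresponds to the variance of 2 above):

import numpy as np
from scipy.special import expit  # inverse logit

rng = np.random.default_rng(0)
T, K = 50, 19                           # time points and number of coefficients
beta, ha = rng.normal(0, np.sqrt(2), size=2)

# alpha_0 ~ N(0, 2), alpha_t ~ N(alpha_{t-1}, 2): a random walk with sd sqrt(2)
alpha = np.cumsum(rng.normal(0, np.sqrt(2), size=(T, K)), axis=0)

x = rng.uniform(0, 1, size=(T, K))      # one 19-entry predictor vector per t
theta = expit((alpha * x).sum(axis=1) + beta + ha)  # theta_t for a single country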

with pm.Model(coords={"countries": countries, "timedelta": timedelta}) as model:

    # instantiate stochastic RVs for the global parameters
    beta = pm.Normal('beta', mu=0, sigma=np.sqrt(2))
    ha = pm.Normal('Ha', mu=0, sigma=np.sqrt(2))

    # vector of time-varying parameters as a Gaussian random walk
    alpha = pm.GaussianRandomWalk("alpha", init_dist=pm.Normal.dist(mu=0, sigma=np.sqrt(2)),
                                  sigma=np.sqrt(2), dims=("countries", "timedelta"))

    theta_home = pm.Deterministic("theta_home",
                                  pm.invlogit(pm.math.dot(alpha[time_idx], df[pred_cols][time_idx]) + beta + ha),
                                  dims=("games", "timedelta"))

After some attempts I am able to fit this model on a single country, see my other thread:

After some experimentation, I think I have a reasonable handle on how PyMC handles shapes and dimensions.

What I am slightly confused about is the format my coefficients data needs to be in to run this model for each country, and how to use indexing correctly to code the linear transformation needed for my thetas.

This discussion suggests I simply need a long-format pandas dataframe. But I am confused as to how I should code the linear transformation I need for my thetas.

theta = pm.Deterministic("theta", pm.invlogit(pm.math.dot(alpha[country_idx], preds[country_idx]) + beta + ha), dims=("countries", "timedelta"))

This does not result in an array of shape (countries, timedelta), but rather one of shape (780k, countries). It seems I am not specifying the dot product and/or the indexing correctly here.

What is the likely cause of this?

Hi. I’m not sure whether I understood all the details of your problem correctly. But if I’m not completely wrong, what you may want is a dataframe in fully long format (i.e. each of the 19 scores should be expanded and assigned its own index). I have approached a similar problem before; you can find the code here.

From the way you describe your present data, I assume you would like to use the pandas function melt on the coefficient columns, which should give you 19 categories that you’d be able to index in your model. Something like this:

import pymc as pm
import numpy as np
import pandas as pd
import arviz as az

##country name
c1 = np.repeat('c1',20)
c2 = np.repeat('c2',10)
c3 = np.repeat('c3',15)
c4 = np.repeat('c4',21)
c5 = np.repeat('c5',12)

## dates per country
d1 = np.arange(20)#.astype(str)
d2 = np.arange(10)#.astype(str)
d3 = np.arange(15)#.astype(str)
d4 = np.arange(21)#.astype(str)
d5 = np.arange(12)#.astype(str)

cs = np.concatenate([c1,c2,c3,c4,c5])
ds = np.concatenate([d1,d2,d3,d4,d5])

data = pd.DataFrame({'country':cs, 'date':ds})

## simulate the 19 scores (i.e. coefficients) for the 78 country-date rows
scores = np.array([np.random.uniform(0,1, 78) for c in range(19)])

snames = ['s'+str(s+1) for s in range(19)]

data[snames] = scores.T

## melt into fully long format: one row per (country, date, score category)
df = pd.melt(data, id_vars=['country','date'], value_vars=snames, var_name='sname',
             value_name='score', ignore_index=True)

##indices per country, date and score category
c = pd.factorize(df['country'])[0].astype('int32')
d = pd.factorize(df['date'])[0].astype('int32')
s = pd.factorize(df['sname'])[0].astype('int32')
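Just to be explicit about what the melt produces: the melted frame has one row per (country, date, score category) combination, so each index vector is aligned row-by-row with df:

print(data.shape)              # (78, 21): 78 country-date rows, 2 id columns + 19 score columns
print(df.shape)                # (1482, 4): 78 * 19 rows after melting
print(len(c), len(d), len(s))  # 1482 each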

I assumed the timepoints are dates (e.g. data obtained from each country at a series of dates) and simply simulated the scores (i.e. coefficients) from a uniform distribution. Looking at the previous PyMC v3 post you refer to above, I thought one of the models you posted there might be useful for this example:

with pm.Model() as model:

    # global (constant) parameters
    beta = pm.Normal('beta', mu=0, sigma=np.sqrt(2))
    ha = pm.Normal('Ha', mu=0, sigma=np.sqrt(2))

    # initial distribution for the random walk (21 = maximum number of dates)
    alpha_zero = pm.Normal.dist(0, 1, shape=21)

    # random walk with shape (dates, score categories, countries) = (21, 19, 5)
    alpha = pm.GaussianRandomWalk("alpha", init=alpha_zero, sigma=np.sqrt(2), shape=(21,19,5))

    # index alpha by date/score/country so it aligns with the rows of the long dataframe
    theta = pm.Deterministic("theta", pm.invlogit(pm.math.dot(df.score, alpha[d,s,c]) + beta + ha))

    Y_obs = pm.Gamma("Y_obs", alpha=theta, beta=1, observed=df.score)

with model:
    trace = pm.sample(1000)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, Ha, alpha]
 |████████████| 100.00% [8000/8000 16:25<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1010 seconds.
The acceptance probability does not match the target. It is 0.8869, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.8875, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.

The point here is that your data is “sort of flattened” over time, so you get a time dimension that represents the maximum number of time samples, and countries with more samples simply get more observations (this can be effective for a GP, for example; same repo here). I’m not sure whether this is what you want, but it may be a useful approach, especially if you consider averaging over dates: the model takes about 15 minutes to run in my simulation, where I reduced the number of samples by orders of magnitude (i.e. from your 780k to 78), so for your data this may be a problem. Also note that the Gaussian random walk (GRW) prior now has 3 dimensions, namely date (i.e. time samples), score name (i.e. coefficient category) and country. Again, maybe I misunderstood your approach and you don’t want to model your data in this way.
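If you do consider averaging over dates to cut the data down (just one possible reduction, not necessarily what you want), something like this on the melted frame would give a single score per country and score category:

## hypothetical reduction: average the score over dates within each country/score category
df_avg = df.groupby(['country', 'sname'], as_index=False)['score'].mean()
print(df_avg.shape)  # (95, 3): 5 countries * 19 score categories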

This particular model doesn’t sample really well and doesn’t do a great job with the predictions, but I guess adjusting priors may help, also via prior predictive checks (e.g. I’m not sure about the Gamma likelihood). In any case, I hope this at least helps a bit.
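As a concrete starting point for those prior predictive checks, something like this should work (just a sketch):

with model:
    prior = pm.sample_prior_predictive(500)

## compare the prior predictive draws of Y_obs against the observed scores
az.plot_ppc(prior, group="prior")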

Thank you Simon, I’ve come back to this after the summer break and will have a look through your suggested approach. My time variable is a timedelta (time elapsed), but this should not change your suggested approach.

One potential problem I can see now (after stepping away from this problem for a while) is that I do not have all 19 scores for each time point. I’m not sure how PyMC deals with NAs; do I need to impute this data somehow?

There may be several approaches. As you mentioned, imputation can be useful, as can other techniques for addressing missing data. McElreath’s Statistical Rethinking book has some useful approaches to missing data in chapter 15, which has been ported to PyMC here: pymc-resources/Rethinking_2 at main · pymc-devs/pymc-resources · GitHub. Possibly the simplest (though least appropriate) approach would be to drop the missing data and use only the timepoints that actually have observations. Not the best idea, but it may be a good start to test whether your modelling approach works (in principle at least).
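As a minimal, self-contained sketch of PyMC’s automatic imputation (note this applies when the missing values are in the observed variable; missing predictor values would need their own prior, as in the Rethinking examples): if you pass a masked array as observed, PyMC treats the masked entries as unobserved parameters and imputes them during sampling.

import numpy as np
import pymc as pm

## hypothetical score vector with NaNs where observations are missing
scores = np.random.uniform(0, 1, 100)
scores[::10] = np.nan

with pm.Model() as impute_model:
    mu = pm.Normal("mu", 0, 1)
    ## masked entries become free parameters (named "y_missing") and are imputed automatically
    y = pm.Normal("y", mu=mu, sigma=1, observed=np.ma.masked_invalid(scores))
    idata = pm.sample(1000)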