Understanding indexing and dimensions in hierarchical models

I have data that is grouped into 5 countries. For each country, I am trying to model the following time-varying process:

\theta_{t} = \mathrm{invlogit}(\vec{\alpha}_{t} \cdot x_{t} + \beta + Ha)

with values:

\alpha_0 \sim N(0,2), \alpha_t \sim N(\alpha_{t-1},2)

\beta \sim N(0,2), Ha \sim N(0,2)

where \alpha_t are time-varying coefficients, \beta and Ha are constants.

Each x_{t} is a vector with 19 entries (i.e. I have 19 coefficients). All of these are saved in a pandas dataframe coefficients_df of shape (780k, 19), where my countries and t's are in long format.

i.e. 5 countries × T ≈ 780k rows, with t in {0, 1, 2, 3, …, T}. However, not every country has the same number of time-series observations.

with pm.Model(coords={"countries": countries, "timedelta": timedelta}) as model:

    # instantiate stochastic RVs for global parameters
    beta = pm.Normal('beta', mu=0, sigma=np.sqrt(2))
    ha = pm.Normal('Ha', mu=0, sigma=np.sqrt(2))

    # vector of time-varying parameters as a random walk
    alpha = pm.GaussianRandomWalk("alpha", init_dist=pm.Normal.dist(mu=0, sigma=np.sqrt(2)), sigma=np.sqrt(2), dims=("countries", "timedelta"))

    theta_home = pm.Deterministic("theta_home", pm.invlogit(pm.math.dot(alpha[time_idx], df[pred_cols][time_idx]) + beta + ha), dims=("games", "timedelta"))

After some attempts I am able to fit this model on a single country, see my other thread:

I think I have a reasonable handle on how PyMC handles shapes and dimensions, after some experimentation.

What I am slightly confused about is the format of my coefficients data needed to run this model for each country, and how I use indexing correctly to code the linear transformation needed for my thetas.

This discussion suggests I simply need a long-format pandas dataframe. But I am confused as to how I should code the linear transformation I need for my thetas.

theta = pm.Deterministic("theta",pm.invlogit(pm.math.dot(alpha[country_idx],preds[country_idx])+beta+ha), dims=("countries","timedelta"))

This does not result in an array of shape (countries, timedelta) but rather one of shape (780k, countries). I seem not to be specifying my dot product and/or my indexing correctly here.

What is the likely cause of this?

Hi. I'm not sure whether I understood all the details of your problem correctly. But if I'm not completely wrong, what you may want is a dataframe in completely long format (i.e. each of the 19 scores should be expanded and assigned its own index). I have approached a similar problem before; here you can find the code.

From the way you describe your present data, I assume you would like to use the pandas function melt on the coefficient columns, which should give you 19 categories that you'd be able to index in your model. Something like this:

import pymc as pm
import numpy as np
import pandas as pd
import arviz as az

##country name
c1 = np.repeat('c1',20)
c2 = np.repeat('c2',10)
c3 = np.repeat('c3',15)
c4 = np.repeat('c4',21)
c5 = np.repeat('c5',12)

## dates per country
d1 = np.arange(20)#.astype(str)
d2 = np.arange(10)#.astype(str)
d3 = np.arange(15)#.astype(str)
d4 = np.arange(21)#.astype(str)
d5 = np.arange(12)#.astype(str)

cs = np.concatenate([c1,c2,c3,c4,c5])
ds = np.concatenate([d1,d2,d3,d4,d5])

data = pd.DataFrame({'country':cs, 'date':ds})

scores = np.array([np.random.uniform(0,1, 78) for c in range(19)])

snames = ['s'+str(s+1) for s in range(19)]

data[snames] = scores.T

df = pd.melt(data, id_vars=['country','date'], value_vars=snames, var_name='sname', 
             value_name='score', ignore_index=True)

##indices per country, date and score category
c = pd.factorize(df['country'])[0].astype('int32')
d = pd.factorize(df['date'])[0].astype('int32')
s = pd.factorize(df['sname'])[0].astype('int32')

I assumed the timepoints are dates (e.g. data obtained from each country) and simply simulated scores (i.e. coefficients) from a uniform distribution. Looking at the previous PyMC v3 post you referred to above, I thought one of the models you posted there may be useful for this example:

with pm.Model() as model:
                                 
    beta = pm.Normal('beta',mu=0,sigma=np.sqrt(2))
    ha = pm.Normal('Ha',mu=0,sigma=np.sqrt(2))
                                   
    alpha_zero = pm.Normal.dist(0,1, shape=21) 

    alpha = pm.GaussianRandomWalk("alpha", init=alpha_zero, sigma=np.sqrt(2),shape=(21,19,5))   

    theta = pm.Deterministic("theta", pm.invlogit(pm.math.dot(df.score, alpha[d,s,c])+beta+ha))

    Y_obs = pm.Gamma("Y_obs", alpha=theta, beta=1, observed=df.score)

with model:
    trace = pm.sample(1000)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, Ha, alpha]
 |████████████| 100.00% [8000/8000 16:25<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1010 seconds.
The acceptance probability does not match the target. It is 0.8869, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.8875, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.

The point here is that your data is "sort of flattened" over time, so you get a time dimension that represents the maximum number of time samples, and countries with more samples simply get more observations (this can be effective for a GP, for example; same repo here). I'm not sure whether this is what you want, but it may be a useful approach, especially if you consider averaging over dates: the model takes about 15 minutes to run in my simulation, where I reduced the number of samples by orders of magnitude (i.e. from your 780k to 78), so for your data this may be a problem. Also note that the Gaussian random walk (GRW) prior has 3 dimensions now, namely date (i.e. time sample), score name (i.e. coefficient category) and country. Again, maybe I misunderstood your approach and you don't want to model your data in this way.
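
To make the "expansion by indexing" concrete, here is a small standalone numpy sketch (the array and the three toy index values are illustrative stand-ins, not taken from the example above):

import numpy as np

# stand-in for a single draw of the (date, score, country) random walk
alpha_draw = np.random.normal(size=(21, 19, 5))

# toy versions of the d, s, c integer indices built with pd.factorize above:
# three long-format rows, all for country 0 and score 0, at dates 0, 1 and 2
d_toy = np.array([0, 1, 2])
s_toy = np.array([0, 0, 0])
c_toy = np.array([0, 0, 0])

# fancy indexing picks one entry per long-format row
print(alpha_draw[d_toy, s_toy, c_toy].shape)  # (3,) -- in the model this becomes (len(df),)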

This particular model as sampled here doesn't sample really well and doesn't do a great job with the predictions, but I guess adjusting priors may help, also via prior predictive checks (e.g. I'm not sure about the Gamma likelihood). In any case, I hope this at least helps a bit.
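
For the prior predictive checks mentioned above, a minimal sketch could be the following (it re-uses the model defined earlier in this reply; the variables inspected are just an illustrative choice):

with model:
    prior = pm.sample_prior_predictive(samples=500)

# inspect the implied prior for the main parameters and the deterministic theta
az.summary(prior, group="prior", var_names=["beta", "Ha", "theta"], kind="stats")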

Thank you Simon, I've come back to this after the summer break and will have a look through your suggested approach. My time variable is in timedelta/time elapsed, but this should not change your suggested approach.

One potential problem I can see now (after stepping away from this problem for a while) is that I do not have all 19 scores for each time point. I'm not sure how PyMC deals with NAs; do I need to impute this data somehow?

There may be several approaches. As you mentioned, imputation can be useful, as can other techniques for addressing missing data. McElreath's Statistical Rethinking book has some useful approaches to missing data in chapter 15, which has been ported to PyMC here: pymc-resources/Rethinking_2 at main · pymc-devs/pymc-resources · GitHub. Possibly the least principled approach would be to simply omit the missing data and use only the timepoints that actually have observations. Not the best idea, but it may be a good start to test whether your modelling approach works (in principle at least).
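
A minimal sketch of that last option, assuming the long-format frame and column names from the earlier example (with NAs in the score column where a score is missing):

df_complete = df.dropna(subset=['score']).reset_index(drop=True)  # keep only rows that have a score

# rebuild the integer indices on the reduced frame so they stay aligned with it
c = pd.factorize(df_complete['country'])[0].astype('int32')
d = pd.factorize(df_complete['date'])[0].astype('int32')
s = pd.factorize(df_complete['sname'])[0].astype('int32')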

Hi Simon, just looking at your example and inspecting the results: this gives me 1000 samples for Y_obs, but I am trying to estimate theta (and Y_obs) as a time-varying process, so I would want a sample for five different thetas, e.g. one for each date. I have tried simply adding

shape=5

to the definition of theta, but this doesn't seem to work. Do you have any advice?

Note that you cannot add shape to a pm.Deterministic. What the theta estimate gives you in this model is the estimate per date, score and country, because the GaussianRandomWalk (alpha) has shape=(21, 19, 5), which is later expanded by indexing as alpha[d, s, c]. So this is the "expanded" or "flattened" estimate; you simply need to reshape your estimate according to your data, for example by doing data['theta_mean'] = trace['theta'].mean(), and then re-arranging your data as you require (you can add posterior SDs and/or HDIs in a similar way). Hope I understood your question correctly and this makes sense.

I understand - this

data['posterior_mean'] = trace

does not work for me, but I can access Y_obs through

trace.log_likelihood['Y_obs']

which is an array of shape (chains, samplesize, len(df) )

I assume I can simply average over chains and the samples drawn, or is there anything I need to be mindful of when doing so?

Sorry, in my rush I answered in PyMC3 style. If your trace is a PyMC idata (InferenceData) object (i.e. idata = trace), then to get a "trace" object of posterior distributions with grouped chains you can do something like this: trace = idata.stack(sample=['chain', 'draw']).posterior. And now you can do all operations with your posterior distributions, e.g. mean_thetas = trace['theta'].mean(), sd_thetas = trace['theta'].std(), take HDIs using ArviZ, etc. And then you can do things like data['theta_mean'] = trace['theta'].mean(). (Sorry, my previous answer was also incomplete; it is edited now.)

I'm not sure whether I got something wrong from your question, but I don't quite understand why you want to work with the log_likelihood of Y_obs. If your interest is in doing inference via the posterior, then you need to work with the posterior distributions of theta, alpha, Ha and beta, where theta gives you the main estimate. As long as you do all operations on the posterior, you only need to be mindful of reporting which operations you have done (e.g. averaging). I hope this makes things a bit clearer.
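
Putting that together, a sketch of the workflow (using the idata/trace naming from the previous reply, with df as the long-format frame from the example):

trace = idata.stack(sample=["chain", "draw"]).posterior   # group chains and draws

theta_mean = trace["theta"].mean(dim="sample")            # one posterior mean per row of df
theta_sd = trace["theta"].std(dim="sample")
theta_hdi = az.hdi(idata, var_names=["theta"])            # HDIs via ArviZ

df["theta_mean"] = theta_mean.values                      # re-attach to the long-format frame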

Hi Simon, yes this makes sense. I'm sorry, but I get 4,000 samples of theta and can't quite see how I can obtain e.g. the mean thetas for each of my 5 dates from this, as the dimensions of the resulting \theta_{t} don't seem to match.

I'm sure I'm making some silly mistake.

I've tried to simplify the problem, so I am now only looking to model time-varying thetas, and want to model this across all countries.

This is my data (I've cut this down to a sample to make my trial runs more workable).

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2450 entries, 0 to 2449
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        2450 non-null   float64
 2   Y_obs       2450 non-null   float64
 3   pred_name   2450 non-null   object 
 4   pred_score  2450 non-null   float64
dtypes: float64(3), int64(1), object(1)
memory usage: 95.8+ KB
t = pd.factorize(df_pilot['date'])[0].astype('int32')
p = pd.factorize(df_pilot['pred_name'])[0].astype('int32')


len(np.unique(t))
len(np.unique(p))

gives 5 and 10 respectively (i.e. I am modelling over 5 dates and 10 predictors). The data covers 50 countries (not 5 as before).

I define the model with a Poisson this time (I want to get a Poisson or Gamma to work at some point)

with pm.Model() as model:
                                 
    beta = pm.Normal('beta',mu=0,sigma=np.sqrt(2))
    ha = pm.Normal('Ha',mu=0,sigma=np.sqrt(2))
    
    alpha_zero = pm.Normal.dist(0,np.sqrt(2), shape=10) 

    alpha = pm.GaussianRandomWalk("alpha", init_dist=alpha_zero, sigma=np.sqrt(2),shape=(10,5))   
 
    theta = pm.Deterministic("theta", pm.invlogit(pm.math.dot(df_pilot.pred_score, alpha[p,t])+beta+ha))
    
    #Y_obs = pm.Gamma("Y_obs", alpha=theta, beta=1, observed=df_pilot.Y_obs)

    Y_obs =pm.Poisson("Y_obs", mu=theta, observed= df_pilot.Y_obs)
with model:
    # --- prior samples ---
    prior_predictive_base = pm.sample_prior_predictive(samples=1000, random_seed=rng)

gives

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
Ha -0.0050 1.3600 -2.3110 2.7690 0.0420 0.0330 1057.0000 935.0000 NaN
alpha[0, 0] -0.0240 1.4270 -2.4390 2.7330 0.0440 0.0320 1039.0000 981.0000 NaN
alpha[0, 1] -0.0530 1.9620 -3.8300 3.5570 0.0640 0.0450 949.0000 941.0000 NaN
alpha[0, 2] -0.0250 2.4480 -4.7380 4.5120 0.0780 0.0550 983.0000 1063.0000 NaN
alpha[0, 3] -0.0280 2.8610 -4.9330 5.7370 0.0900 0.0660 1012.0000 944.0000 NaN
alpha[0, 4] -0.0420 3.2180 -6.0860 5.6160 0.1010 0.0740 1010.0000 930.0000 NaN
alpha[1, 0] 0.0540 1.3510 -2.3640 2.5980 0.0420 0.0310 1051.0000 917.0000 NaN
alpha[1, 1] 0.0190 1.9440 -3.7180 3.5370 0.0590 0.0460 1106.0000 752.0000 NaN
alpha[1, 2] 0.0320 2.4420 -4.0300 4.7290 0.0740 0.0560 1090.0000 838.0000 NaN
alpha[1, 3] 0.0080 2.8380 -5.0480 5.1670 0.0860 0.0640 1085.0000 866.0000 NaN
alpha[1, 4] -0.0740 3.1630 -6.5090 5.3450 0.1000 0.0710 992.0000 836.0000 NaN
alpha[2, 0] -0.0160 1.4120 -2.8110 2.4590 0.0450 0.0320 1007.0000 852.0000 NaN
alpha[2, 1] -0.0630 2.0050 -4.0010 3.4390 0.0630 0.0450 1000.0000 940.0000 NaN
alpha[2, 2] -0.0880 2.3860 -4.6190 4.4930 0.0760 0.0540 974.0000 981.0000 NaN
alpha[2, 3] -0.1400 2.8360 -5.6430 5.2060 0.0920 0.0650 947.0000 952.0000 NaN
alpha[2, 4] -0.0460 3.1110 -5.8480 5.7890 0.0990 0.0730 993.0000 881.0000 NaN
alpha[3, 0] 0.0280 1.3770 -2.4160 2.6850 0.0440 0.0310 988.0000 875.0000 NaN
alpha[3, 1] 0.0090 1.9380 -3.5190 3.6420 0.0640 0.0450 923.0000 919.0000 NaN
alpha[3, 2] -0.0110 2.4350 -4.6560 4.3090 0.0830 0.0590 853.0000 868.0000 NaN
alpha[3, 3] 0.0460 2.8430 -4.8060 5.8240 0.0840 0.0650 1146.0000 944.0000 NaN
alpha[3, 4] 0.0670 3.0940 -5.7100 5.7120 0.0900 0.0680 1186.0000 1013.0000 NaN
alpha[4, 0] -0.0350 1.4460 -2.7790 2.7630 0.0460 0.0330 973.0000 942.0000 NaN
alpha[4, 1] -0.0220 2.0440 -3.8090 3.8010 0.0730 0.0520 790.0000 771.0000 NaN
alpha[4, 2] 0.0080 2.4900 -4.8670 4.3230 0.0890 0.0630 785.0000 907.0000 NaN
alpha[4, 3] -0.0070 2.8750 -5.0930 5.2770 0.1030 0.0730 786.0000 949.0000 NaN
alpha[4, 4] -0.0300 3.1330 -5.8190 5.5530 0.1030 0.0730 932.0000 944.0000 NaN
alpha[5, 0] 0.0140 1.4290 -2.7180 2.6750 0.0460 0.0340 950.0000 908.0000 NaN
alpha[5, 1] -0.0640 2.0200 -4.2460 3.3060 0.0680 0.0480 874.0000 906.0000 NaN
alpha[5, 2] -0.0810 2.4640 -4.7670 4.2950 0.0770 0.0550 1013.0000 944.0000 NaN
alpha[5, 3] -0.1160 2.8590 -5.3760 5.1690 0.0900 0.0640 1001.0000 1023.0000 NaN
alpha[5, 4] -0.0510 3.2080 -6.3340 5.7090 0.1020 0.0720 984.0000 930.0000 NaN
alpha[6, 0] 0.0230 1.4060 -2.3570 2.7300 0.0450 0.0340 961.0000 821.0000 NaN
alpha[6, 1] 0.0340 1.9570 -3.5630 3.8030 0.0640 0.0460 944.0000 905.0000 NaN
alpha[6, 2] 0.0620 2.3820 -4.1180 4.9970 0.0750 0.0560 1007.0000 845.0000 NaN
alpha[6, 3] 0.0540 2.7840 -5.1580 5.1270 0.0870 0.0650 1043.0000 980.0000 NaN
alpha[6, 4] 0.0500 3.1260 -6.0930 5.5400 0.1050 0.0740 887.0000 728.0000 NaN
alpha[7, 0] -0.0410 1.4740 -2.6560 2.8300 0.0470 0.0340 986.0000 981.0000 NaN
alpha[7, 1] -0.0550 2.0430 -3.6830 3.6570 0.0620 0.0470 1092.0000 1026.0000 NaN
alpha[7, 2] 0.0120 2.4590 -4.0990 4.9820 0.0760 0.0570 1066.0000 869.0000 NaN
alpha[7, 3] 0.0130 2.8190 -5.7510 4.5710 0.0870 0.0620 1050.0000 1067.0000 NaN
alpha[7, 4] 0.0270 3.0820 -5.0400 6.4490 0.0900 0.0710 1166.0000 1018.0000 NaN
alpha[8, 0] 0.0080 1.4020 -2.4670 2.8280 0.0430 0.0320 1056.0000 968.0000 NaN
alpha[8, 1] 0.0000 1.9860 -3.7180 3.6200 0.0600 0.0470 1084.0000 821.0000 NaN
alpha[8, 2] 0.0120 2.4140 -4.3130 4.5500 0.0730 0.0540 1111.0000 885.0000 NaN
alpha[8, 3] -0.0610 2.8100 -5.1290 5.1550 0.0800 0.0680 1221.0000 1071.0000 NaN
alpha[8, 4] -0.0430 3.1630 -5.8310 5.6390 0.0920 0.0760 1186.0000 944.0000 NaN
alpha[9, 0] 0.0460 1.4280 -2.8500 2.5740 0.0460 0.0360 950.0000 875.0000 NaN
alpha[9, 1] 0.0450 2.0160 -3.5300 4.0780 0.0690 0.0490 849.0000 941.0000 NaN
alpha[9, 2] 0.0330 2.4530 -5.1710 4.1810 0.0810 0.0570 912.0000 823.0000 NaN
alpha[9, 3] -0.0030 2.8530 -5.2890 5.2420 0.0950 0.0670 904.0000 910.0000 NaN
alpha[9, 4] 0.0210 3.1540 -5.8810 5.9820 0.1070 0.0760 868.0000 983.0000 NaN
beta 0.0720 1.4300 -2.7050 2.6480 0.0460 0.0360 962.0000 917.0000 NaN
theta 0.4800 0.5000 0.0000 1.0000 0.0160 0.0110 1037.0000 1000.0000 NaN

and posterior sampling gives

with model:
    trace = pm.sample(1000, random_seed=rng, target_accept=.95)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, Ha, alpha]

Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 2563 seconds.
There were 200 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 754 divergences after tuning. Increase `target_accept` or reparameterize.
There were 22 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
There were 106 divergences after tuning. Increase `target_accept` or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.

trace.posterior['theta'] now has shape (4, 1000), and it seems like .stack simply stacks this data into a 1-dimensional object of shape (4000,).

I can of course take the mean of theta but what I want here is a distribution or mean for each of my 5 \theta_{t}

I looked at this more calmly and re-ran the code; effectively theta gives only 4000 samples. The problem is that theta is reduced to a single parameter by the matrix multiplication. Another problem seems to be that the sampler is reaching maximum tree depth, which may indicate that the model is not well defined (probably the chosen likelihood distribution is not the most appropriate). So, just as an attempt (as I'm still missing some things about your model, I think), I replaced the matrix multiplication with a simple element-wise multiplication (this is no problem, as alpha is expanded), and I replaced the Gamma likelihood with a Lognormal. Here's what I've got:

# -*- coding: utf-8 -*-
import pymc as pm
import numpy as np
import pandas as pd
import arviz as az

##country name
c1 = np.repeat('c1',10)
c2 = np.repeat('c2',5)
c3 = np.repeat('c3',7)
c4 = np.repeat('c4',11)
c5 = np.repeat('c5',6)

## dates per country
d1 = np.arange(10)#.astype(str)
d2 = np.arange(5)#.astype(str)
d3 = np.arange(7)#.astype(str)
d4 = np.arange(11)#.astype(str)
d5 = np.arange(6)#.astype(str)

cs = np.concatenate([c1,c2,c3,c4,c5])
ds = np.concatenate([d1,d2,d3,d4,d5])

data = pd.DataFrame({'country':cs, 'date':ds})

scores = np.array([np.random.uniform(0,1, 39) for c in range(19)])

snames = ['s'+str(s+1) for s in range(19)]

data[snames] = scores.T

df = pd.melt(data, id_vars=['country','date'], value_vars=snames, var_name='sname', 
             value_name='score', ignore_index=True)

pred = np.random.uniform(0,1,len(df))
df['pred'] = pred

##indices per country, date and score category
c = pd.factorize(df['country'])[0].astype('int32')
d = pd.factorize(df['date'])[0].astype('int32')
s = pd.factorize(df['sname'])[0].astype('int32')

with pm.Model() as model:

    beta = pm.Normal('beta',mu=0,sigma=np.sqrt(2))
    ha = pm.Normal('Ha',mu=0,sigma=np.sqrt(2))

    alpha_zero = pm.Normal.dist(0,1, shape=21) 

    alpha = pm.GaussianRandomWalk("alpha", init=alpha_zero, sigma=np.sqrt(2),shape=(21,19,5))   

    x = pm.Deterministic("x", pm.invlogit(df.pred.values*alpha[d,s,c]))

    theta = pm.Deterministic("theta", pm.invlogit(df.pred.values*alpha[d,s,c]+beta+ha))

    y = pm.Lognormal("y", mu=theta, sigma=1, observed=df.score.values)

with model:
    idata = pm.sample(100)

 |████████████| 100.00% [4400/4400 01:39<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 100 draw iterations (4_000 + 400 draws total) took 134 seconds.

trace = idata.stack(sample = ["chain", "draw"]).posterior

trace["theta"].shape
Out[4]: (741, 400)

Note that I reduced the total number of time samples to 39 and the number of posterior draws to 100 (this is just for simplicity, so I can run this a bit faster). Also note that the sampler outputs no max tree depth/divergence warnings this time. Finally, note that the model's estimates are terrible; this is probably because I added variables I don't really understand, like df['pred'], which I just drew randomly from a Uniform distribution, as I don't really know what those predictions are in your data and why you need them multiplied with a Gaussian random walk. Hope this, at least in principle, helps a bit more.
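
Following on from that, a sketch (using the df and stacked trace from this last example) of how the flattened per-row theta estimates can be summarised per date, which is essentially the per-date theta asked about above:

df["theta_mean"] = trace["theta"].mean(dim="sample").values   # one posterior mean per long-format row

# collapse the long format back to one summary per date (add 'country' to the groupby if needed)
per_date = df.groupby("date")["theta_mean"].agg(["mean", "std"])
print(per_date)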