Understanding indexing and dimensions in hierarchical models

I have data that is grouped into 5 countries. For each country, I am trying to model the following time-varying process:

\theta_{t} = \mathrm{invlogit}(\vec{\alpha}_{t} \cdot x_{t} + \beta + Ha)

with values:

\alpha_0 \sim N(0,2), \alpha_t \sim N(\alpha_{t-1},2)

\beta \sim N(0,2), Ha \sim N(0,2)

where the \alpha_t are time-varying coefficients, and \beta and Ha are constants.

Each x_{t} is a vector with 19 entries (i.e. I have 19 coefficients). All of these are saved in a pandas dataframe coefficients_df of shape (780k, 19), where my countries and t’s are in long format.

i.e. countries × T ≈ 780k rows in total, with t in {0, 1, 2, 3, …, T}. However, not every country has the same number of time-series observations.
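To make the process concrete, here is a minimal numpy sketch of the generative process for a single country (T and the predictor values are made up; the sd of sqrt(2) corresponds to the variance of 2 above):

import numpy as np
from scipy.special import expit  # inverse logit

rng = np.random.default_rng(0)
T, K = 50, 19                           # time points and number of coefficients
beta, ha = rng.normal(0, np.sqrt(2), size=2)

# alpha_0 ~ N(0, 2), alpha_t ~ N(alpha_{t-1}, 2): a random walk with sd sqrt(2)
alpha = np.cumsum(rng.normal(0, np.sqrt(2), size=(T, K)), axis=0)

x = rng.uniform(0, 1, size=(T, K))      # one 19-entry predictor vector per t
theta = expit((alpha * x).sum(axis=1) + beta + ha)  # theta_t for a single country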

with pm.Model(coords={"countries": countries, "timedelta": timedelta}) as model:

    # instantiate stochastic RVs for the global parameters
    beta = pm.Normal('beta', mu=0, sigma=np.sqrt(2))
    ha = pm.Normal('Ha', mu=0, sigma=np.sqrt(2))

    # vector of time-varying parameters as a Gaussian random walk
    alpha = pm.GaussianRandomWalk("alpha", init_dist=pm.Normal.dist(mu=0, sigma=np.sqrt(2)),
                                  sigma=np.sqrt(2), dims=("countries", "timedelta"))

    theta_home = pm.Deterministic("theta_home",
                                  pm.invlogit(pm.math.dot(alpha[time_idx], df[pred_cols][time_idx]) + beta + ha),
                                  dims=("games", "timedelta"))

After some attempts I am able to fit this model on a single country, see my other thread:

After some experimentation, I think I have a reasonable handle on how PyMC handles shapes and dimensions.

What I am slightly confused about is the format my coefficients data needs to be in to run this model for each country, and how to use indexing correctly to code the linear transformation needed for my thetas.

This discussion suggests I simply need a long-format pandas dataframe. But I am confused as to how I should code the linear transformation I need for my thetas.

theta = pm.Deterministic("theta", pm.invlogit(pm.math.dot(alpha[country_idx], preds[country_idx]) + beta + ha), dims=("countries", "timedelta"))

This does not result in an array of shape (countries, timedelta), but rather one of shape (780k, countries). It seems I am not specifying the dot product and/or the indexing correctly here.

What is the likely cause of this?

Hi. I’m not sure whether I understood all the details of your problem correctly. But if I’m not completely wrong, what you may want is a dataframe in fully long format (i.e. each of the 19 scores should be expanded and assigned its own index). I have approached a similar problem before; you can find the code here.

From the way you describe your present data, I assume you would like to use the pandas function melt on the coefficient columns, which should give you 19 categories that you’d be able to index in your model. Something like this:

import pymc as pm
import numpy as np
import pandas as pd
import arviz as az

##country name
c1 = np.repeat('c1',20)
c2 = np.repeat('c2',10)
c3 = np.repeat('c3',15)
c4 = np.repeat('c4',21)
c5 = np.repeat('c5',12)

## dates per country
d1 = np.arange(20)#.astype(str)
d2 = np.arange(10)#.astype(str)
d3 = np.arange(15)#.astype(str)
d4 = np.arange(21)#.astype(str)
d5 = np.arange(12)#.astype(str)

cs = np.concatenate([c1,c2,c3,c4,c5])
ds = np.concatenate([d1,d2,d3,d4,d5])

data = pd.DataFrame({'country':cs, 'date':ds})

## simulate the 19 scores (i.e. coefficients) for the 78 country-date rows
scores = np.array([np.random.uniform(0,1, 78) for c in range(19)])

snames = ['s'+str(s+1) for s in range(19)]

data[snames] = scores.T

## melt into fully long format: one row per (country, date, score category)
df = pd.melt(data, id_vars=['country','date'], value_vars=snames, var_name='sname',
             value_name='score', ignore_index=True)

##indices per country, date and score category
c = pd.factorize(df['country'])[0].astype('int32')
d = pd.factorize(df['date'])[0].astype('int32')
s = pd.factorize(df['sname'])[0].astype('int32')
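Just to be explicit about what the melt produces: the melted frame has one row per (country, date, score category) combination, so each index vector is aligned row-by-row with df:

print(data.shape)              # (78, 21): 78 country-date rows, 2 id columns + 19 score columns
print(df.shape)                # (1482, 4): 78 * 19 rows after melting
print(len(c), len(d), len(s))  # 1482 each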

I assumed the timepoints are dates (e.g. data obtained from each country at a series of dates) and simply simulated the scores (i.e. coefficients) from a uniform distribution. Looking at the previous PyMC v3 post you refer to above, I thought one of the models you posted there might be useful for this example:

with pm.Model() as model:

    # global (constant) parameters
    beta = pm.Normal('beta', mu=0, sigma=np.sqrt(2))
    ha = pm.Normal('Ha', mu=0, sigma=np.sqrt(2))

    # initial distribution for the random walk (21 = maximum number of dates)
    alpha_zero = pm.Normal.dist(0, 1, shape=21)

    # random walk with shape (dates, score categories, countries) = (21, 19, 5)
    alpha = pm.GaussianRandomWalk("alpha", init=alpha_zero, sigma=np.sqrt(2), shape=(21,19,5))

    # index alpha by date/score/country so it aligns with the rows of the long dataframe
    theta = pm.Deterministic("theta", pm.invlogit(pm.math.dot(df.score, alpha[d,s,c]) + beta + ha))

    Y_obs = pm.Gamma("Y_obs", alpha=theta, beta=1, observed=df.score)

with model:
    trace = pm.sample(1000)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [beta, Ha, alpha]
 |████████████| 100.00% [8000/8000 16:25<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 1010 seconds.
The acceptance probability does not match the target. It is 0.8869, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.8875, but should be close to 0.8. Try to increase the number of tuning steps.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.

The point here is that your data is “sort of flattened” over time, so you get a time dimension that represents the maximum number of time samples, and countries with more samples simply get more observations (this can be effective for a GP, for example; same repo here). I’m not sure whether this is what you want, but it may be a useful approach, especially if you consider averaging over dates: the model takes about 15 minutes to run in my simulation, where I reduced the number of samples by orders of magnitude (i.e. from your 780k to 78), so for your data this may be a problem. Also note that the Gaussian random walk (GRW) prior now has 3 dimensions, namely date (i.e. time samples), score name (i.e. coefficient category) and country. Again, maybe I misunderstood your approach and you don’t want to model your data in this way.
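If you do consider averaging over dates to cut the data down (just one possible reduction, not necessarily what you want), something like this on the melted frame would give a single score per country and score category:

## hypothetical reduction: average the score over dates within each country/score category
df_avg = df.groupby(['country', 'sname'], as_index=False)['score'].mean()
print(df_avg.shape)  # (95, 3): 5 countries * 19 score categories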

This particular model doesn’t sample really well and doesn’t do a great job with the predictions, but I guess adjusting priors may help, also via prior predictive checks (e.g. I’m not sure about the Gamma likelihood). In any case, I hope this at least helps a bit.
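As a concrete starting point for those prior predictive checks, something like this should work (just a sketch):

with model:
    prior = pm.sample_prior_predictive(500)

## compare the prior predictive draws of Y_obs against the observed scores
az.plot_ppc(prior, group="prior")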

Thank you Simon, I’ve come back to this after the summer break and will have a look through your suggested approach. My time variable is a timedelta (time elapsed), but this should not change your suggested approach.

One potential problem I can see now (after stepping away from this problem for a while) is that I do not have all 19 scores for each time point. I’m not sure how PyMC deals with NAs; do I need to impute this data somehow?

There may be several approaches. As you mentioned, imputation can be useful, as can other techniques for addressing missing data. McElreath’s Statistical Rethinking book has some useful approaches to missing data in chapter 15, which has been ported to PyMC here: pymc-resources/Rethinking_2 at main · pymc-devs/pymc-resources · GitHub. Possibly the simplest (though least appropriate) approach would be to drop the missing data and use only the timepoints that actually have observations. Not the best idea, but it may be a good start to test whether your modelling approach works (in principle at least).
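As a minimal, self-contained sketch of PyMC’s automatic imputation (note this applies when the missing values are in the observed variable; missing predictor values would need their own prior, as in the Rethinking examples): if you pass a masked array as observed, PyMC treats the masked entries as unobserved parameters and imputes them during sampling.

import numpy as np
import pymc as pm

## hypothetical score vector with NaNs where observations are missing
scores = np.random.uniform(0, 1, 100)
scores[::10] = np.nan

with pm.Model() as impute_model:
    mu = pm.Normal("mu", 0, 1)
    ## masked entries become free parameters (named "y_missing") and are imputed automatically
    y = pm.Normal("y", mu=mu, sigma=1, observed=np.ma.masked_invalid(scores))
    idata = pm.sample(1000)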