Hierarchical Model with Multiple Predictors

Hi
I’m trying to make the hierarchical-model example from A Primer on Bayesian Methods for Multilevel Modeling (PyMC3 3.11.5 documentation) work, but I’m struggling to introduce additional betas (e.g. b_2 for data_var_2) for further variables in the data efficiently, i.e. without adding everything manually for each one (b_2, sigma_b_2, and adjusting the formula for theta).

coords = {"Level": ["Basement", "Floor"], "obs_id": np.arange(floor.size)}

with pm.Model(coords=coords) as varying_intercept_slope:
    floor_idx = pm.Data("floor_idx", floor, dims="obs_id")
    #data_2_idx = pm.Data("data_2_idx", data_var_2, dims="obs_id")
    county_idx = pm.Data("county_idx", county, dims="obs_id")

    ### Hyperpriors:
    a = pm.Normal("a", mu=0.0, sigma=5.0)
    sigma_a = pm.Exponential("sigma_a", 1.0)
    #a_2 = pm.Normal("a_2", mu=0.0, sigma=5.0)
    #sigma_a_2 = pm.Exponential("sigma_a_2", 1.0)
    b = pm.Normal("b", mu=0.0, sigma=1.0)
    sigma_b = pm.Exponential("sigma_b", 0.5)
    # b_2 = pm.Normal("b_2", mu=0.0, sigma=1.0)
    # sigma_b_2 = pm.Exponential("sigma_b_2", 0.5)

    ### Varying intercepts:
    za_county = pm.Normal("za_county", mu=0.0, sigma=1.0, dims="County")
    # za_2_county = pm.Normal("za_2_county", mu=0.0, sigma=1.0, dims="County")
    # Varying slopes:
    zb_county = pm.Normal("zb_county", mu=0.0, sigma=1.0, dims="County")
    #zb_2_county = pm.Normal("zb_2_county", mu=0.0, sigma=1.0, dims="County")

    ### Expected value per county:
    theta = (a + za_county[county_idx] * sigma_a) + (b + zb_county[county_idx] * sigma_b) * floor
    # + (a_2 + za_2_county[county_idx] * sigma_a_2) + (b_2 + zb_2_county[county_idx] * sigma_b_2) * data_var_2
    ### Model error:
    sigma = pm.Exponential("sigma", 1.0)

    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")

    varying_intercept_slope_idata = pm.sample(
        2000, tune=2000, target_accept=0.99, random_seed=RANDOM_SEED, return_inferencedata=True
    )

Can this be handled via some shape parameter, or do I need to create a special xarray, as described in Coordinates in PyMC & InferenceData Objects (Christian Luhmann)?

Hi. You just need to have your data in long format, add a coordinate with the levels of the variable (i.e. column) you want to add to the model, and build an index array mapping each observation to its level.

# note: pd.Categorical can give the wrong order if the dataframe is not ordered in
# the precise way you need; it may be better to use explicit dictionary indexing
new_var_idx = pd.Categorical(df.new_var).codes
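
For example, an explicit mapping (the level names here are just placeholders) does not depend on the order pandas infers:

level_map = {"low": 0, "high": 1}  # hypothetical levels of new_var
new_var_idx = df.new_var.map(level_map).values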

coords = {"Level": ["Basement", "Floor"], "obs_id": np.arange(floor.size), 
                 "County":county_idx, "new_var": new_var_idx}

with varying_intercept_slope:
          b_2 = pm.Normal("b_2", 0, 1, dims="new_var")

Another option is building up your model entirely without coordinates and assigning the levels to each distribution via the “shape” parameter. Let’s say that the number of levels of one of your variables is n (e.g. n = len(df.variable.unique())).

with pm.Model() as model:
    a = pm.Normal("a", 0, 1, shape=n)

And you could do that for each one of your variables, e.g. as in the small sketch below. There are some advantages to using model coordinates, though. Hope this helps.
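
A minimal sketch with made-up column names, one shape-indexed parameter per variable:

n_var_1 = len(df.var_1.unique())
n_var_2 = len(df.var_2.unique())

with pm.Model() as model:
    b_1 = pm.Normal("b_1", 0, 1, shape=n_var_1)
    b_2 = pm.Normal("b_2", 0, 1, shape=n_var_2)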


Thanks a lot for your reply.

I’m a bit confused by the mention of the long format.
Are you suggesting I should reshape the data as described below?

Current (wide) format:

…   | y  | var_1 | var_2 | county
ob1 | 22 | 55    | 66    | A
ob2 | 33 | 44    | 77    | B

Proposed long format:

…   | y  | new_vars | var_value | county
ob1 | 22 | 1        | 55        | A
ob1 | 22 | 2        | 66        | A
ob2 | 33 | 1        | 44        | B
ob2 | 33 | 2        | 77        | B

Besides the data format question:
You suggest adding “b_2”. This is what I did in my example code (commented out), and I’d like to avoid that because it leads to manual code for each variable. Or does the “dims” parameter with “new_vars” handle this?

The “dims” parameter will handle the (array) dimensionality of your parameter (in this case b_2). If you have, for instance, a discrete variable with 2 levels (e.g. “high”, “low”), then your b_2 distribution will take on those 2 levels. You can check that by calling b_2.eval().shape after you have defined your model. A quick example with toy data is below, where you can see how PyMC manages the variable dimensionality throughout the whole modelling process (from defining the model to sampling).

import pymc as pm
import numpy as np
import pandas as pd

y = np.random.normal(0,1,12)
x = list(np.repeat("high",6)) + list(np.repeat("low",6))
z = list(np.repeat("a",4)) + list(np.repeat("b",4)) + list(np.repeat("c",4))

df = pd.DataFrame({"obs":y, "var":x, "cov":z})

df
Out[1]: 
         obs   var cov
0   1.407061  high   a
1   0.686457  high   a
2  -0.670788  high   a
3   0.576915  high   a
4   1.935287  high   b
5  -2.331960  high   b
6   0.438080   low   b
7  -1.190885   low   b
8   0.144347   low   c
9   0.584575   low   c
10 -0.953830   low   c
11  0.442864   low   c

var_idx = pd.Categorical(df['var']).codes
cov_idx = pd.Categorical(df['cov']).codes

coords = {'loc':df.index.values, 
          'var':df['var'].unique(),
          'cov':df['cov'].unique()}

with pm.Model(coords=coords) as model:
    v_idx = pm.ConstantData("var_idx", var_idx, dims="loc")
    c_idx = pm.ConstantData("cov_idx", cov_idx, dims="loc")

    a = pm.Normal("a", 0, 1)
    b = pm.Normal("b", 0, 1, dims="var")
    c = pm.Normal("c", 0, 1, dims="cov")

    m = a + b[v_idx] + c[c_idx]

    s = pm.HalfNormal("s", 1)

    y = pm.Normal("y", m, s, observed=df['obs'].values)

b.eval().shape
Out[2]: (2,)

with model:
    idata = pm.sample()

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [a, b, c, s]
 |████████████| 100.00% [8000/8000 00:20<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 42 seconds.

idata
Out[4]: 
Inference data with groups:
	> posterior
	> sample_stats
	> observed_data
	> constant_data

idata.posterior['b']
Out[5]: 
<xarray.DataArray 'b' (chain: 4, draw: 1000, var: 2)>
array([[[-0.65472295, -0.31560767],
        [-0.49080176, -0.19496418],
        [-0.43480596,  0.2989079 ],
        ...,
        [ 1.03582084, -0.73988372],
        [-0.92628691,  0.33085689],
        [ 1.67374922,  0.47049486]],

       [[-0.33019963, -0.52679078],
        [ 0.3327297 , -0.23758491],
        [ 0.75950194, -0.78424727],
        ...,
        [-0.65862662, -1.49969723],
        [-0.36586425, -0.01656974],
        [-1.1275673 ,  0.99075037]],

       [[-1.26850014, -1.55347173],
        [-2.00023507, -0.00816708],
        [ 0.79022025, -0.60316954],
        ...,
        [ 1.07199527,  0.45203872],
        [ 0.65674821, -0.23802447],
        [ 0.83036322, -0.48556644]],

       [[ 1.72620799, -2.0443458 ],
        [ 1.88393521, -0.75462455],
        [ 1.28261164, -1.16342085],
        ...,
        [ 0.42779301,  0.64523998],
        [ 0.04914379, -0.33293152],
        [ 0.41417636,  1.01508946]]])
Coordinates:
  * chain    (chain) int32 0 1 2 3
  * draw     (draw) int32 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
  * var      (var) <U4 'high' 'low'
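
The labeled dimensions also carry through to downstream tools, for instance when summarising the posterior (assuming ArviZ is available as az):

import arviz as az
az.summary(idata, var_names=["b"])  # one row per level of "var" ('high', 'low')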

Thanks for your detailed response. Much appreciated!

However, besides categorical variables, I’m mostly interested in non-categorical (continuous) ones. The categorical one is already covered in the primer (county_idx).

~theta = a + b_1[county_idx] * var_1 + b_2[county_idx] * var_2 + ...

The question is how I can extend the regression from two variables (b_1, b_2 with var_1, var_2) to any number of variables (b_1, b_2, b_3, b_4, …) without needing to define each of them manually, while also keeping the nice labeling for further work:

~theta = a + b_array[county_idx] * var_array
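
Conceptually, I’m imagining something like the (untested) sketch below, where the continuous predictors are stacked into one matrix and the slopes get an extra “predictor” dimension (the predictor names are just placeholders):

predictors = ["floor", "var_2"]                  # placeholder predictor names
X_vals = np.column_stack([floor, data_var_2])    # hypothetical stacked data

coords = {
    "County": mn_counties,
    "predictor": predictors,
    "obs_id": np.arange(floor.size),
}

with pm.Model(coords=coords) as vectorized_model:
    X = pm.Data("X", X_vals, dims=("obs_id", "predictor"))
    county_idx = pm.Data("county_idx", county, dims="obs_id")

    a = pm.Normal("a", 0.0, sigma=5.0)
    b = pm.Normal("b", 0.0, sigma=1.0, dims=("County", "predictor"))

    # b[county_idx] has shape (n_obs, n_predictors); multiply elementwise with X and sum over the predictors
    theta = a + (b[county_idx] * X).sum(axis=-1)

    sigma = pm.Exponential("sigma", 1.0)
    y = pm.Normal("y", theta, sigma=sigma, observed=log_radon, dims="obs_id")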


I see, I understand your question now (I hope). I don’t know of any way to add variables automatically. However, it would be easy to do it in a loop.

import pymc as pm
import numpy as np
import pandas as pd
import pytensor.tensor as pt

y = np.random.normal(0,1,12)
x = np.random.normal(0,1,12)
z = np.random.normal(0,1,12)
w = list(np.repeat("high",6)) + list(np.repeat("low",6))


df = pd.DataFrame({"obs":y, "var":x, "cov":z, "cat":w})

cat_idx = pd.Categorical(df['cat']).codes

coords = {'loc':df.index.values, 'cat':df['cat'].unique()}

varias = []
for i in df.columns[1:3]:
    coords[i] = df[i].unique()
    varias.append(df[i])

with pm.Model(coords=coords) as model:
    c_idx = pm.ConstantData("cat_idx", cat_idx, dims="loc")
    
    # one varying slope per continuous predictor, named b_0, b_1, ...
    d = {}
    for i in range(len(varias)):
        name = "b_" + str(i)
        d[name] = pm.Normal(name, 0, 1, dims="cat")

    a = pm.Normal("a", 0, 1)

    # per-observation contribution of each predictor: the slope for that observation's category times the predictor value
    d2 = [list(d.values())[i][c_idx] * varias[i] for i in range(len(varias))]

    # sum the contributions over the predictors (axis 0 runs over the predictors)
    m = a + pt.sum(d2, axis=0)
    
    s = pm.HalfNormal("s", 1)

    y = pm.Normal("y", m, s, observed=df['obs'].values)

with model:
    idata = pm.sample(1000)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [b_0, b_1, a, s]
 |████████████| 100.00% [8000/8000 00:34<00:00 Sampling 4 chains, 2 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 56 seconds.
There were 2 divergences after tuning. Increase `target_accept` or reparameterize.

Personally, however, I think this is a bad idea in general, unless there’s a lot of certainty that each prior should have the same distribution and that the covariates make sense explanatorily or causally. I don’t know of any Bayesian stats package that includes functions for adding variables automatically (as far as my limited knowledge goes). Many of these variables may require different prior distributions, or distributions parametrised in different ways. Also, adding a large number of variables (i.e. covariates) without knowing how they relate to each other in the model (causal dependencies, the back-door criterion, etc.; McElreath summarises this topic neatly) can have very bad consequences for inference. I hope this helps.
