Hierarchical model and pandas dataframe

Hi everyone, I would like to implement a hierarchical model in PyMC3, so I was reading The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3 — While My MCMC Gently Samples. My problem is that I have a pandas DataFrame in which ten columns correspond to ten different groups, plus other regressors in additional columns. So my question is: how can I write something like this

    # Expected value
    radon_est = a[county_idx] + b[county_idx] * data.floor.values

as in the example at the link above?
As far as I understand, the groups must be stacked sequentially in a single column rather than spread across different columns, and “county_idx” is an indicator variable whose length equals the total number of observations in the dataset. Is there a shortcut in my case?

Thank you so much for your time


It depends on the actual structure of the model. Having all groups independent and one “level” above the observations is not the same as having three levels, with some groups constrained by the higher-level groups, or (perish the thought) having all 10 in a hierarchical relation, generating 10 levels.

References that I think will be useful: Multi-Multilevel Modeling and Bambi 0.5.0 is out


It sounds like your data is in ‘wide form’. If you transform the data into ‘long form’ that will be much easier: you want a single column that acts as an index into groups 1-10.

See for example python - Reshape pandas dataframe wide to long, with some variables to stack, other variables to repeat - Stack Overflow
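For instance, a minimal sketch of that reshaping (all column names here are hypothetical, just to illustrate the idea):

import pandas as pd

# hypothetical wide-form data: one column per group, plus shared regressors
wide = pd.DataFrame({
    "group_1": [1.0, 2.0],
    "group_2": [3.0, 4.0],
    # ... up to group_10
    "other_regressor": [0.5, 0.7],
})

# melt the group columns into a single value column,
# keeping the shared regressors as id variables
long = wide.melt(
    id_vars=["other_regressor"],
    value_vars=[c for c in wide.columns if c.startswith("group_")],
    var_name="group",
    value_name="y",
)

# turn the group labels into integer codes, analogous to county_idx
long["indx"], group_names = pd.factorize(long["group"])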


Thank you so much for your answers! Yes, my problem was to transform my data from wide form to long form. I’ve got a question about the variables of the model. Once I write something like:


with pm.Model() as hierarchical_model:

    # Hyperpriors for group nodes
    mu_intercept = pm.Normal("mu_intercept", mu=0.0, sigma=20)
    sigma_intercept = pm.HalfNormal("sigma_intercept", sigma=10)
    mu_x = pm.Normal("mu_x", mu=0.0, sigma=20)
    sigma_x = pm.HalfNormal("sigma_x", sigma=10)

    # Group-level priors: one intercept and one slope per group
    intercept = pm.Normal("Intercept", mu=mu_intercept, sigma=sigma_intercept, shape=10)
    x_coeff = pm.Normal("x", mu=mu_x, sigma=sigma_x, shape=10)

    # Expected value: the index picks out each observation's group parameters
    indx = training_sample["indx"].values
    mu = pm.Deterministic("mu", intercept[indx] + x_coeff[indx] * x)  # apply a link function here if needed

    # Observation noise
    error = pm.HalfNormal("phi", sigma=50)

    # Likelihood
    likelihood = pm.Normal("y", mu=mu, sigma=error, observed=y)

    trace_hierarchical = pm.sample(progressbar=True, random_seed=30, target_accept=0.9)

How can I select the “mu” variable linked to the first group, for example? That is, I want to write something like:

with hierarchical_model:
    posterior_predictive = pm.sample_posterior_predictive(trace_hierarchical, var_names=['mu', 'y'])
fig, ax = plt.subplots(figsize=(12,8))

az.plot_hdi(
    x,
    posterior_predictive['mu'],
    ax=ax,
    fill_kwargs={"alpha": 1.0, "label": "Mean outcome 94% HPD"},
)

az.plot_hdi(
    x,
    posterior_predictive['y'],
    ax=ax,
    fill_kwargs={"alpha": 0.6, "color": "#a1dab4", "label": "Outcome 94% HPD"},
)

ax.scatter(x, y, alpha=0.01)
ax.legend()
ax.set_title("Posterior Predictive Check")

but I want to specify the mu and y linked to only one of the ten groups of my model. Is that possible? Thank you so much

I think cell 10 in the original notebook (The Best Of Both Worlds: Hierarchical Linear Regression in PyMC3 — While My MCMC Gently Samples) is probably the best place to start.

Otherwise, you could try to manually extract the per-group traces, something along the lines of trace_hierarchical["Intercept"][:, 0], trace_hierarchical["Intercept"][:, 1], etc.
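For example, a rough sketch of that idea (assuming the model above, where pm.sample returned a PyMC3 MultiTrace and training_sample["indx"] holds the integer group codes):

# posterior samples of the linear predictor for group 0 only
group = 0
intercept_g = trace_hierarchical["Intercept"][:, group]  # shape (n_samples,)
slope_g = trace_hierarchical["x"][:, group]

# keep only the observations belonging to that group
mask = training_sample["indx"].values == group
mu_g = intercept_g[:, None] + slope_g[:, None] * x[mask]  # (n_samples, n_obs_in_group)

az.plot_hdi(x[mask], mu_g[None, :, :])  # add a chain dimension so ArviZ sees (chain, draw, obs)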

But I might be misunderstanding what you want to do.

In my opinion, this is one of the places where pm.Data, xarray and combinations of both shine.

To begin with, you can use named dimensions to indicate that the shapes of intercept and x_coeff are not both 10 by coincidence but because they represent the same dimension. Moreover, you can wrap the inputs in pm.Data containers so they can be swapped out after sampling:

with pm.Model(
    # note you can use the actual group names (if they exist)
    # and use those to subset the data afterwards
    coords={"group": np.arange(10), "obs_id": np.arange(len(x))}
) as hierarchical_model: 
    indx = pm.Data("indx", training_sample["indx"], dims="obs_id")
    x_data = pm.Data("x_data", x, dims="obs_id")

    # Hyperpriors for group nodes
    mu_intercept = pm.Normal("mu_intercept", mu=0.0, sigma=20)
    sigma_intercept = pm.HalfNormal("sigma_intercept", sigma=10) 
    mu_x = pm.Normal("mu_x", mu=0.0, sigma=20)
    sigma_x = pm.HalfNormal("sigma_x", sigma=10)


    # Define priors


    intercept = pm.Normal('Intercept', mu=mu_intercept, sigma=sigma_intercept, dims="group") 
    
    x_coeff = pm.Normal('x', mu=mu_x, sigma=sigma_x, dims="group") 
   

    mu = pm.Deterministic('mu', intercept[indx] + x_coeff[indx] * x_data, dims="obs_id")

    ...

Note that the obs_id coords are only informative: the shape is set by the input arrays, and the dims only label them.

After sampling, you’ll already have mu in the posterior. mu is a deterministic computation of the posterior variables, so it isn’t really a posterior predictive variable, which is why (as you have probably noticed) you have to specify it explicitly for pm.sample_posterior_predictive to sample it. The problem is that it has the obs_id dimension instead of the group one. It carries basically the same info, only 10 unique values, one per group, but repeated to match the group of each observation.
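For example, a minimal sketch of pulling out one group’s mu (assuming trace_hierarchical is an az.InferenceData, e.g. from pm.sample(..., return_inferencedata=True)):

# positions of the observations belonging to group 0
idx = np.flatnonzero(training_sample["indx"].values == 0)

# mu has dims (chain, draw, obs_id); keep only that group's columns
mu_group0 = trace_hierarchical.posterior["mu"].isel(obs_id=idx)

You can then pass mu_group0 together with the matching x values to az.plot_hdi as in your snippet above.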

If you want the group effects at a given x (say x=2.34), you can use pm.sample_posterior_predictive:

with hierarchical_model:
    pm.set_data({"x_data": [2.34] * 10, "indx": np.arange(10)})
    pp = pm.sample_posterior_predictive(trace_hierarchical, var_names=["mu", "y"])

but you probably want multiple x values per group so you can plot group-level HDIs and so on. One option is to do something like the above but the other way around, in a sense: loop over the groups and use pm.set_data({"x_data": np.linspace..., "indx": [group]*len(linspace)}), as in the sketch below. We can change the length of the obs_id dimension, but we can’t change the number of dimensions of the variables x_data and indx; they are 1d arrays.
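A rough sketch of that loop (untested; x_plot, group_pp and the grid size are illustrative choices):

x_plot = np.linspace(x.min(), x.max(), 50)
group_pp = {}
with hierarchical_model:
    for group in range(10):
        # evaluate every group on the same grid of x values
        pm.set_data({"x_data": x_plot, "indx": [group] * len(x_plot)})
        group_pp[group] = pm.sample_posterior_predictive(
            trace_hierarchical, var_names=["mu", "y"]
        )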

Another option is to use xarray, which will handle most of the broadcasting and aligning of arrays:

import xarray as xr

post = trace_hierarchical.posterior
x_da = xr.DataArray(np.linspace(x.min(), x.max(), 100), dims=["x_plot"])
# note the slope RV was named "x" in the model, so that is its name in the posterior
mu = post["Intercept"] + post["x"] * x_da
# mu is a 4d array with dims chain, draw, group, x_plot
# you can select and plot one group's hdi with
az.plot_hdi(x_da, mu.sel(group=0))

More details and working examples of a related question are at 11.16 Rethinking Code - #8 by OriolAbril (caveat: I haven’t run the code above, so treat it as illustrative).


I’m preparing a new example PyMC3 notebook which covers these issues and will aim to open a pull request soon. It should provide a pretty complete working example of what you want to do.

In-progress notebook here → Simpson’s paradox and mixed models

Note: if you are reading this in the future, then hopefully this notebook will be available on the main PyMC3 examples page, and the link above probably won’t work because the specific branch of my fork of pymc3-examples will be gone after the pull request is approved.

EDIT: FYI pull request #182 here


If those links don’t work, the notebook can be found here: pymc-examples/GLM-simpsons-paradox.ipynb at main · pymc-devs/pymc-examples · GitHub. It will eventually go live on the website.