Understanding coords, indexation, Data, ..., for multilevel models

covertg · April 28, 2022, 4:32am

The use of coords in order to specify dimensionality seems like a potentially really useful way to develop complex models with understandable variable names. I’ve been reading up on as much as I can with respect to how to configure dimensionality in pymc, but I’m still struggling to understand how to develop multilevel models with arbitrary interaction between variables of different levels.

To preface: I have learning about/getting up to speed with pymc, and I am still learning about methods in bayesian modeling more generally. So apologies in advance for this long-winded post, and apologies if there are some easy answers that might not be obvious to me. Hopefully this may be helpful for the sake of providing more explanation on e.g. the use of coords, indexation, and other topics in pymc dimensionality? Anyways, I think that my confusion broadly stems from three related questions:

Is there “broadcasting” of shape/dimension when performing operations on variables? If so, how does it work?
When performing indexation on vector or array variables (variables that have dimensionality more complex than just a scalar), most of the multilevel model examples that I’ve found end up doing operations in the dimension of the total number of observations. That is, they perform fancy indexation based on various other dimensions, but that indexation yields a vector that is as long as the number of observations in the data. Is this the recommended way to proceed? My confusion here is that the important dimensionality is sometimes (often?) not as large as the number of observations, and so I wonder if there is a more efficient and understandable way to proceed.
Is there a recommended way to represent multifactor/multilevel data for convenient use in pymc? I have been experimenting with various implementation details: dataframes that are “long”, versus dataframes that are “wide”, using pandas MultiIndex, using one pm.Data object to store multiple columns’ worth of data (columns are considered in the “long” format), using multiple pm.Data objects to store each column of data, and so on. Different combinations of these options can give rise to different ways to specify models in pymc, and it’s been difficult to know if I’m making my life easier or harder depending on how I represent my data.

I understand that these are some very broad questions. For concreteness, here is a running example that I have been trying to learn from. This example is modified from the “using shared variables” notebook. That notebook works with daily temperature data as measured at three different cities in Europe. I’ve complicated that example by adding in an extra factor: “time” of the temperature recording, which for simplicity is either day or night. Let’s assume that nights are generally a little bit colder than days. Then we can generate the data with:

import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)

dates = pd.date_range(start="2020-05-01", end="2020-05-29")
ground_truth = [("Berlin", 15, 0.8), ("San Marino", 18, 0.5), ("Paris", 16, 1.2)]
date_col, time_col, city_col, temp_col = [], [], [], []
for city, mu, offset in ground_truth:
    temps = rng.normal(loc=mu, scale=1, size=len(dates))
    for time in ["day", "night"]:
        temps_time = temps + (1 if time == "day" else -1) * rng.normal(loc=offset, scale=0.5, size=len(dates))
        date_col.extend(dates)
        time_col.extend([time] * len(dates))
        city_col.extend([city] * len(dates))
        temp_col.extend(temps_time)
assert len(date_col) == len(time_col) == len(city_col) == len(temp_col) == (len(dates) * len(ground_truth) * 2)

df_data = pd.DataFrame(zip(date_col, time_col, city_col, temp_col), columns=["date", "time", "city", "temp"])
df_data = df_data.set_index(["date", "time", "city"], append=True)
print(df_data.sample(10, random_state=RANDOM_SEED))

                                      temp
    date       time  city                 
151 2020-05-07 night Paris       14.369560
158 2020-05-14 night Paris       13.350646
1   2020-05-02 day   Berlin      16.809533
45  2020-05-17 night Berlin      12.081991
39  2020-05-11 night Berlin      14.829763
30  2020-05-02 night Berlin      16.288574
75  2020-05-18 day   San Marino  19.536284
70  2020-05-13 day   San Marino  19.007176
28  2020-05-29 day   Berlin      15.324609
53  2020-05-25 night Berlin      12.587037

dates_idx, dates = pd.factorize(df_data.index.get_level_values("date"))  # 29 dates
times_idx, times = pd.factorize(df_data.index.get_level_values("time"))  # 2 times
cities_idx, cities = pd.factorize(df_data.index.get_level_values("city"))# 3 cities
obs_idx, _ = pd.factorize(df_data.index.get_level_values(0))             # 29*2*3=174 observations

I’ve deviated from the original notebook by expressing the data in “long” format. It makes the most sense to me as it’s very easy to handle arbitrary combinations of factor levels, as well as possibly imbalanced observations, in this format. I hope this is a wise enough first step. (To question 3: is it?)

My ultimate objective here is to estimate (a) the global or European mean temperature, (b) each city’s mean temp, and (c), new to this modified example, the offset from each city’s mean that we see during the different times of day. But first, here’s my approach to replicating the results in the original notebook, ignoring the “time” dimension.

with pm.Model(coords={"city": cities}) as model:
    temperature_pm = pm.MutableData("data_temperature", df_data.temp, dims="obs_idx")
    cities_idx_pm = pm.MutableData("cities_idx", cities_idx, dims="obs_idx")

    europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0)
    city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, dims="city")
    city_temperature = pm.Deterministic(
        "city_temperature", europe_mean + city_offset, dims="city"
    )
    pm.Normal("likelihood", mu=city_temperature[cities_idx_pm], sigma=0.5, observed=temperature_pm, dims="obs_idx")

This samples just as well as the original example. However, it strikes me that the likelihood node only needs to have a size of 3 (instead, it has size of 174). So, to question 2: why does pymc error out if I try to set the dims argument in the likelihood to just city (and remove the cities_idx_pm indexation)? Trying to understand, I also implemented this without using coords or dims. But in that case everything seems to work with the likelihood having size of only 3.

with pm.Model() as model:
    temperature_pm = pm.MutableData("data_temperature", df_data.temp)
    cities_idx_pm = pm.MutableData("cities_idx", cities_idx)
    
    europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0)
    city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, size=len(cities))
    city_temperature = pm.Deterministic(
        "city_temperature", europe_mean + city_offset
    )
    pm.Normal("likelihood", mu=city_temperature[cities_idx_pm], sigma=0.5, observed=temperature_pm, size=len(cities))

Again to question 2: what is the difference between these two models? In this case, it doesn’t make a difference to our results, since sampling still works fine and the stochastic and deterministic variables have the shapes that we want.

But this question does seem relevant to trying to level up and incorporate the “time” dimension in our model. A naive approach that I can think of is to try and estimate a global “day” offset alongside a “night” offset. (The data-generating process had symmetrical average offsets between day and night, but let’s suppose that we didn’t know this.) Here is my best shot at doing this:

with pm.Model(coords={"city": cities, "time": times}) as model:
    temperature_pm = pm.MutableData("data_temperature", df_data.temp, dims="obs_idx")
    cities_idx_pm = pm.MutableData("cities_idx", cities_idx, dims="obs_idx")
    times_idx_pm = pm.MutableData("times_idx", times_idx, dims="obs_idx")

    europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0)
    city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, dims="city")
    city_temperature = pm.Deterministic(
        "city_temperature", europe_mean + city_offset, dims="city"
    )
    time_offset = pm.Normal("time_offset", mu=0.0, sigma=1.0, dims="time")
    city_time_temperature = city_temperature[cities_idx_pm] + time_offset[times_idx_pm]  # is there any way to make this dims (cities, times) ?
    pm.Normal("likelihood", mu=city_time_temperature, sigma=0.5, observed=temperature_pm, dims="obs_idx")

When I start writing interactions between the different levels, the only way that I can find to do this is by working in the “observation index” dimension (i.e. in the city_time_temperature variable). Again to question #2: is this the correct way to proceed? I would love to be able to create a Deterministic city_time_temperature that has total “rank” of 6 (city x time). Ultimately, this is a posterior estimate of interest, and it would be nice to have it in the trace. But if I make city_time_temperature a Deterministic, it must have shape 174, and I end up with 174 variables in the trace. I haven’t been able to find a way to create this model without expressing city_time_temperature as a vector that is as long as the number of observations. Is there a way to do so?

For completeness, I did also implement this model without using coords. Interestingly, similar to the above non-coords model, it works fine with the likelihood being specified as size=len(cities)*len(times). However, since Deterministic does not take shape or size arguments, I could not find a way to express city_time_temperature in terms of (city, time).

To question #1: I wondered if any type of intelligent broadcasting is possible, something like:

city_time_temperature = pm.Deterministic("city_time_temperature", city_temperature + time_offset, dims=("city", "time"))

Interestingly this model compiles fine, but it throws an error immediately upon sampling with ValueError: Input dimension mismatch. (input[3].shape[0] = 174, input[5].shape[0] = 3).

And back to question #3: I’m wondering if I am tying my hands somehow by creating those "idx" variables that all have length equal to the number of observations. It seems to defeat the purpose of coords and dims, a little bit. However, I haven’t been able to find an easier way to work with this multifactor data. In particular, it seems like pm.Data isn’t built to play nicely with pandas MultiIndexed dataframes. Is this the best way to work with multifactor data?

Lastly, ultimately I would wonder about taking this model another hierarchical step further, and try to estimate the “time” offset for each city rather than just globally. I have enough questions about the simpler models, though, that I figured there is plenty to learn before complicating things further.

Seriously, my sincere apologies for somehow turning my questions into an essay. I couldn’t figure out how to clearly explain my curiosities while being any less verbose. On reflection, I suppose that my questions are generally seeking advice on “best practices” and asking for some sanity checks that I haven’t made any fundamental misunderstandings in all of this. Thanks for the wonderful library by the way .

Here’s what I’ve been reading up on, if it helps to have possibly-related content crystallized here:

Possibly-related

New Feature: Add a TidyData Class · Discussion #5335 · pymc-devs/pymc · GitHub (is this still relevant?)

OriolAbril · April 29, 2022, 9:29pm

Yes, random variables are aesara tensors, so the broadcasting rules explained in https://aesara.readthedocs.io/en/latest/tutorial/broadcasting.html apply. TL; DR they are similar to numpy ones but tensors also have a true/false indicator to “turn off” broadcasting on some dimensions if desired.

Yes, I’d recommend factorizing into 2d vectors with length as the number of observations in a style similar to long format dataframes. There are two reasons for that. The first is that it works also with ragged data; keeping every hierarchy level as its own dimension only works if there are the same number of observations per group which is generally not the case. The second is I haven’t had time to do proper benchmarks but I am quite sure it is faster to implement the models this way unless no indexing were required if working with multidimensional arrays of parameters.

Using your example, if you have M observations per combination of city and time, then you can strucutre you observations as a 3d array with dimensions time: 2, city: 3, observation: 5 → n_obs = 30. You can then use broadcasting only to combine the 1d (length 2) time parameters vector with the 1d (length 3) citiy parameters, no indexing would be needed here and I think it would be quite intuitive, but that is generally not the case and you have 5 observations for Berlin and 7 for Paris.

I think it is important to note that the number of parameters and the number of observations are independent and you always need to evaluate the model on all your observations. In a simple linear regression we have 3 parameters a, b, \sigma but the log likelihood is the sum of the linear model evaluated at every observations; PyMC does that automatically (and aesara can optimize the operations done sometimes) when given the observed kwarg with all the observations.

Second note, the pm.Deterministics and the city_temperature[cities_idx] do not add more parameters to the model, they only create transformations and arrays with the original parameters restructured to compute the right quantity when working with arrays.

In my opinion, going back to what I said above you need 1d vectors and all the factors as indexes (you can use pandas.factorize for this like you have done) of if same # of observations per category then multidimensional array (not compatible with pandas anymore unless you have only one level). I can’t really speak about having the levels as “raw” columns or as multiindex though.

I think it is

Here is where the parameters vs observation difference comes into play. The likelihood can’t be of size 3 because the likelihood combines both data and parameters. There are only 3 possible values for the mean yes, but we need to evaluate the normal (whatever one of the 3 means we need to use) at each of the observations. In the exponent of the pdf (x - \mu) you need to take the x into account too, not only the \mu.

Moreover, to do posterior predictive checks you need to preserve that structure.

However, this can be quite inconvenient if you want to analyze the posterior predictive samples/predictions at each city (i.e. to get the probability the temperature will be higher than t in each city).

To get around this I think you should consider two options:

You can update the cities_idx_pm with pm.set_data and replace it with an arange of length the number of cities. This is easy and convenient if you only have one category, but it quickly becomes cumbersome. To add the time component you’d need to use 0, 0, 1, 1, 2, 2 as city_idx and 0, 1, 0, 1, 0, 1 as time_idx…
Repeat the mu computation with xarray and use xarray-einstats to sample from the posterior predictive. Broadcasting and alignment will be automatic and you’ll get a multidimensional array with hierarchies as different dimensions.

The likelihood with size 3 looks very wrong to me. I still don’t grok the new size argument, but you might be generating a 3, 174 array to evaluate the log likelihood on. What is the shape of the data in the log likelihood group of the result? Do you get the same results?

Yes, the example does have the same number of observations per category, so you can store the observations in a time, city, date 3d array. Then this line can become city_time_temperature = city_temperature[None, :] + time_offset[:, None] which might work as input directly if the observed is the 3d size, or you might need an extra length 1 dim for broadcasting to work so use mu=city_time_temerature[:, :, None].

However, big however. That is the mean of the temperatures, it is not what you observed. I’d recommend working with the posterior predictive samples to get values you can interpret as what you’d expect to see. Already with the current model, you can get posterior predictive samples as:

from xarray_einstats.stats import XrContinuousRV
from scipy.stats import norm

post = idata.posterior
temperature = XrContinuousRV(norm, post["city_temperature"] + post["time_offset"], 0.5).rvs()

I recently wrote a Discourse topic to introduce xarray-einstats.

Magic dimension/label based broadcasting is (at least for now) outside the scope of pymc and aesara. IIRC there is an open issue on aesara about this though you might want to search for it. All of this and more however already works after sampling once you get the results as InferenceData.

Topic		Replies	Views
PyMC3+ArviZ: improve your workflow with labeled coords and dims Sharing doc	20	5829	April 5, 2021
Creating hierarchical models using 3 dimensional xarray data Questions	20	2041	April 7, 2021
PyMC3 posterior prediction example; how to incorporate a matrix of betas? Questions	24	2874	April 23, 2021
Understanding indexing and dimensions in hierarchical models v5 time_series	9	1930	January 26, 2023
How to properly use dims on a 3 level hierarchical model v5 modeling	1	593	July 19, 2022

Understanding coords, indexation, Data, ..., for multilevel models

Related topics