The use of `coords` to specify dimensionality seems like a potentially really useful way to develop complex models with understandable variable names. I’ve been reading as much as I can about how to configure dimensionality in pymc, but I’m still struggling to understand how to develop multilevel models with arbitrary interactions between variables of different levels.

To preface: I am still getting up to speed with pymc, and still learning about methods in Bayesian modeling more generally. So apologies in advance for this long-winded post, and apologies if some of the answers are easy ones that just aren’t obvious to me. Hopefully this thread can also serve as extra explanation of, e.g., the use of `coords`, indexing, and other pymc dimensionality topics. Anyways, I think that my confusion broadly stems from three related questions:
- Is there “broadcasting” of shape/dimension when performing operations on variables? If so, how does it work? (A small numpy sketch of what I mean follows this list.)
- When indexing vector or array variables (variables with dimensionality more complex than a scalar), most of the multilevel model examples I’ve found end up doing operations in the dimension of the total number of observations. That is, they perform fancy indexing based on various other dimensions, but that indexing yields a vector as long as the number of observations in the data. Is this the recommended way to proceed? My confusion here is that the dimensionality of interest is sometimes (often?) much smaller than the number of observations, and so I wonder if there is a more efficient and understandable way to proceed.
- Is there a recommended way to represent multifactor/multilevel data for convenient use in pymc? I have been experimenting with various implementation details: dataframes that are “long” versus dataframes that are “wide”, using pandas `MultiIndex`es, using one `pm.Data` object to store multiple columns’ worth of data (where the columns are in the “long” format), using multiple `pm.Data` objects to store each column of data, and so on. Different combinations of these options give rise to different ways to specify models in pymc, and it’s been difficult to know whether I’m making my life easier or harder depending on how I represent my data.
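To make question 1 concrete, here is the kind of numpy-style broadcasting I have in mind (plain numpy, so this is only an analogy to whatever pytensor actually does; the values are made up):

```python
import numpy as np

city_means = np.array([15.0, 18.0, 16.0])  # conceptually dims ("city",), shape (3,)
time_offsets = np.array([1.0, -1.0])       # conceptually dims ("time",), shape (2,)

# A scalar broadcasts against anything:
print((10.0 + city_means).shape)  # (3,)

# But two different "labelled" axes only combine if I broadcast explicitly:
print((city_means[:, None] + time_offsets[None, :]).shape)  # (3, 2)
```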
I understand that these are some very broad questions. For concreteness, here is a running example that I have been trying to learn from. This example is modified from the “using shared variables” notebook, which works with daily temperature data measured in three different cities in Europe. I’ve complicated that example by adding an extra factor: the “time” of the temperature recording, which for simplicity is either day or night. Let’s assume that nights are generally a little bit colder than days. Then we can generate the data with:
```python
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc as pm

RANDOM_SEED = 8927
rng = np.random.default_rng(RANDOM_SEED)

dates = pd.date_range(start="2020-05-01", end="2020-05-29")
ground_truth = [("Berlin", 15, 0.8), ("San Marino", 18, 0.5), ("Paris", 16, 1.2)]

date_col, time_col, city_col, temp_col = [], [], [], []
for city, mu, offset in ground_truth:
    temps = rng.normal(loc=mu, scale=1, size=len(dates))
    for time in ["day", "night"]:
        # days are shifted up by ~offset on average, nights down by ~offset
        temps_time = temps + (1 if time == "day" else -1) * rng.normal(
            loc=offset, scale=0.5, size=len(dates)
        )
        date_col.extend(dates)
        time_col.extend([time] * len(dates))
        city_col.extend([city] * len(dates))
        temp_col.extend(temps_time)

assert (
    len(date_col) == len(time_col) == len(city_col) == len(temp_col)
    == len(dates) * len(ground_truth) * 2  # 29 * 3 * 2 = 174
)

df_data = pd.DataFrame(
    zip(date_col, time_col, city_col, temp_col),
    columns=["date", "time", "city", "temp"],
)
df_data = df_data.set_index(["date", "time", "city"], append=True)
print(df_data.sample(10, random_state=RANDOM_SEED))
```
```
                                      temp
    date       time  city
151 2020-05-07 night Paris       14.369560
158 2020-05-14 night Paris       13.350646
1   2020-05-02 day   Berlin      16.809533
45  2020-05-17 night Berlin      12.081991
39  2020-05-11 night Berlin      14.829763
30  2020-05-02 night Berlin      16.288574
75  2020-05-18 day   San Marino  19.536284
70  2020-05-13 day   San Marino  19.007176
28  2020-05-29 day   Berlin      15.324609
53  2020-05-25 night Berlin      12.587037
```
```python
dates_idx, dates = pd.factorize(df_data.index.get_level_values("date"))    # 29 dates
times_idx, times = pd.factorize(df_data.index.get_level_values("time"))    # 2 times
cities_idx, cities = pd.factorize(df_data.index.get_level_values("city"))  # 3 cities
obs_idx, _ = pd.factorize(df_data.index.get_level_values(0))               # 29 * 2 * 3 = 174 observations
```
I’ve deviated from the original notebook by expressing the data in “long” format. This format makes the most sense to me, since it easily handles arbitrary combinations of factor levels as well as possibly imbalanced observations. I hope this is a wise enough first step. (To question 3: is it?)
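For contrast, this is roughly the “wide” representation I also experimented with (just a pandas pivot; `df_wide` is my own name):

```python
# One row per date, one column per (city, time) combination.
df_wide = df_data.reset_index().pivot_table(
    index="date", columns=["city", "time"], values="temp"
)
print(df_wide.shape)  # (29, 6)
```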
My ultimate objective is to estimate (a) the global or European mean temperature, (b) each city’s mean temperature, and (c), new to this modified example, the offset from each city’s mean during the different times of day. But first, here’s my approach to replicating the results in the original notebook, ignoring the “time” dimension.
```python
with pm.Model(coords={"city": cities}) as model:
    temperature_pm = pm.MutableData("data_temperature", df_data.temp, dims="obs_idx")
    cities_idx_pm = pm.MutableData("cities_idx", cities_idx, dims="obs_idx")

    europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0)
    city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, dims="city")
    city_temperature = pm.Deterministic(
        "city_temperature", europe_mean + city_offset, dims="city"
    )

    pm.Normal(
        "likelihood",
        mu=city_temperature[cities_idx_pm],
        sigma=0.5,
        observed=temperature_pm,
        dims="obs_idx",
    )
```
This samples just as well as the original example. However, it strikes me that the likelihood node only needs to have a size of 3 (instead, it has a size of 174). So, to question 2: why does pymc error out if I try to set the `dims` argument of the likelihood to just `city` (and remove the `cities_idx_pm` indexing)? Trying to understand, I also implemented this without using `coords` or `dims`, and in that case everything seems to work with the likelihood having a size of only 3.
```python
with pm.Model() as model:
    temperature_pm = pm.MutableData("data_temperature", df_data.temp)
    cities_idx_pm = pm.MutableData("cities_idx", cities_idx)

    europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0)
    city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, size=len(cities))
    city_temperature = pm.Deterministic("city_temperature", europe_mean + city_offset)

    pm.Normal(
        "likelihood",
        mu=city_temperature[cities_idx_pm],
        sigma=0.5,
        observed=temperature_pm,
        size=len(cities),
    )
```
Again to question 2: what is the difference between these two models? In this case it doesn’t change our results, since sampling still works fine and the stochastic and deterministic variables have the shapes that we want.
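To spell out the shape logic as I understand it (again plain numpy as an analogy; the values and index pattern are made up):

```python
import numpy as np

city_temperature = np.array([15.0, 18.0, 16.0])  # shape (3,): one mean per city
cities_idx = np.repeat([0, 1, 2], 58)            # shape (174,): city label per observation

# Fancy indexing expands the 3 city means to one value per observation,
# so mu has shape (174,) and lines up with the 174 observed temperatures.
mu = city_temperature[cities_idx]
print(mu.shape)  # (174,)
```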
But this question does seem relevant to trying to level up and incorporate the “time” dimension into the model. A naive approach that I can think of is to estimate a global “day” offset alongside a global “night” offset. (The data-generating process had symmetric average offsets between day and night, but let’s suppose we didn’t know this.) Here is my best shot at doing this:
```python
with pm.Model(coords={"city": cities, "time": times}) as model:
    temperature_pm = pm.MutableData("data_temperature", df_data.temp, dims="obs_idx")
    cities_idx_pm = pm.MutableData("cities_idx", cities_idx, dims="obs_idx")
    times_idx_pm = pm.MutableData("times_idx", times_idx, dims="obs_idx")

    europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0)
    city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, dims="city")
    city_temperature = pm.Deterministic(
        "city_temperature", europe_mean + city_offset, dims="city"
    )

    time_offset = pm.Normal("time_offset", mu=0.0, sigma=1.0, dims="time")
    # Is there any way to make this have dims ("city", "time") instead?
    city_time_temperature = city_temperature[cities_idx_pm] + time_offset[times_idx_pm]

    pm.Normal(
        "likelihood",
        mu=city_time_temperature,
        sigma=0.5,
        observed=temperature_pm,
        dims="obs_idx",
    )
```
When I start writing interactions between the different levels, the only way I can find to do it is by working in the “observation index” dimension (i.e., in the `city_time_temperature` variable). Again to question #2: is this the correct way to proceed? I would love to be able to create a Deterministic `city_time_temperature` that has a total “rank” of 6 (city × time). Ultimately, this is a posterior estimate of interest, and it would be nice to have it in the trace. But if I make `city_time_temperature` a Deterministic here, it must have shape 174, and I end up with 174 variables in the trace. I haven’t been able to find a way to create this model without expressing `city_time_temperature` as a vector that is as long as the number of observations. Is there a way to do so?
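For illustration, here is the kind of thing I was hoping to write: build the (city, time) table first by explicit broadcasting, then expand it per observation with double indexing. This is only a sketch under my current understanding; I haven’t verified that it’s supported or idiomatic:

```python
with pm.Model(coords={"city": cities, "time": times}) as model_sketch:
    cities_idx_pm = pm.MutableData("cities_idx", cities_idx, dims="obs_idx")
    times_idx_pm = pm.MutableData("times_idx", times_idx, dims="obs_idx")

    europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0)
    city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, dims="city")
    time_offset = pm.Normal("time_offset", mu=0.0, sigma=1.0, dims="time")

    # Explicit broadcasting: (3, 1) + (1, 2) -> (3, 2), one entry per (city, time).
    city_time_temperature = pm.Deterministic(
        "city_time_temperature",
        europe_mean + city_offset[:, None] + time_offset[None, :],
        dims=("city", "time"),
    )

    # Advanced indexing with two index arrays picks one (city, time) cell per
    # observation, yielding a length-174 vector for the likelihood mean.
    mu = city_time_temperature[cities_idx_pm, times_idx_pm]
    pm.Normal("likelihood", mu=mu, sigma=0.5, observed=df_data.temp, dims="obs_idx")
```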
For completeness, I also implemented this model without using `coords`. Interestingly, similar to the non-coords model above, it works fine with the likelihood specified as `size=len(cities) * len(times)`. However, since Deterministic does not take shape or size arguments, I could not find a way to express `city_time_temperature` in terms of (city, time).
To question #1: I wondered if some kind of intelligent broadcasting is possible, something like:

```python
city_time_temperature = pm.Deterministic(
    "city_time_temperature", city_temperature + time_offset, dims=("city", "time")
)
```

Interestingly, this model compiles fine, but it throws an error immediately upon sampling: `ValueError: Input dimension mismatch. (input[3].shape[0] = 174, input[5].shape[0] = 3)`.
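For what it’s worth, the failure is consistent with the plain-numpy analogy from my first question: my working hypothesis is that `dims` are labels only and don’t drive broadcasting, since shapes (3,) and (2,) don’t broadcast without explicit new axes:

```python
import numpy as np

try:
    np.zeros(3) + np.zeros(2)  # analogous to city_temperature + time_offset
except ValueError as e:
    print(e)  # operands could not be broadcast together with shapes (3,) (2,)
```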
And back to question #3: I’m wondering if I am tying my hands somehow by creating those `"idx"` variables that all have length equal to the number of observations. It seems to defeat the purpose of `coords` and `dims`, a little bit. However, I haven’t been able to find an easier way to work with this multifactor data. In particular, `pm.Data` doesn’t seem built to play nicely with pandas `MultiIndex`ed dataframes. Is this the best way to work with multifactor data?
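For reference, this is the little helper I’ve settled on in the meantime (my own code, nothing from pymc): it factorizes each named level of a `MultiIndex` into an integer index array plus a coords entry:

```python
def multiindex_to_coords(df):
    """Factorize each named MultiIndex level into (coords entry, index array)."""
    coords, idxs = {}, {}
    for name in df.index.names:
        if name is None:  # skip the unnamed integer level left by append=True
            continue
        idx, labels = pd.factorize(df.index.get_level_values(name))
        coords[name] = labels
        idxs[name] = idx
    return coords, idxs

coords, idxs = multiindex_to_coords(df_data)
# coords["city"] -> the 3 unique city labels
# idxs["city"]   -> an integer array of length 174
```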
Lastly, I would ultimately like to take this model another hierarchical step further and estimate the “time” offset for each city rather than just globally. I have enough questions about the simpler models, though, that I figured there is plenty to learn before complicating things further (though a sketch of what I have in mind follows all the same).
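For the record, I imagine it would look something like this, again only a sketch under my current understanding:

```python
with pm.Model(coords={"city": cities, "time": times}) as model_hier:
    cities_idx_pm = pm.MutableData("cities_idx", cities_idx, dims="obs_idx")
    times_idx_pm = pm.MutableData("times_idx", times_idx, dims="obs_idx")

    europe_mean = pm.Normal("europe_mean_temp", mu=15.0, sigma=3.0)
    city_offset = pm.Normal("city_offset", mu=0.0, sigma=3.0, dims="city")

    # One time-of-day offset per city, partially pooled toward a global offset.
    global_time_offset = pm.Normal("global_time_offset", mu=0.0, sigma=1.0, dims="time")
    city_time_offset = pm.Normal(
        "city_time_offset", mu=global_time_offset, sigma=0.5, dims=("city", "time")
    )

    mu = (
        europe_mean
        + city_offset[cities_idx_pm]
        + city_time_offset[cities_idx_pm, times_idx_pm]
    )
    pm.Normal("likelihood", mu=mu, sigma=0.5, observed=df_data.temp, dims="obs_idx")
```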
Seriously, my sincere apologies for somehow turning my questions into an essay. I couldn’t figure out how to explain them clearly while being any less verbose. On reflection, I suppose I am generally seeking advice on “best practices” and a sanity check that I haven’t fundamentally misunderstood anything in all of this. Thanks for the wonderful library, by the way.
Here’s what I’ve been reading up on, if it helps to have possibly-related content crystallized here:
- Using shared variables (Data container adaptation) — PyMC documentation
- A Primer on Bayesian Methods for Multilevel Modeling — PyMC documentation
- pymc.Data — PyMC dev documentation
- PyMC Dimensionality — PyMC dev documentation
- InferenceData schema specification — ArviZ dev documentation
- PyMC3 with labeled coords and dims | Oriol unraveled
- Multi-Multilevel Modeling - #7 by HectorM14
Possibly-related
- New Feature: Add a TidyData Class · Discussion #5335 · pymc-devs/pymc · GitHub (is this still relevant?)