Spline Regression in PyMC3

I’ve been looking for tutorials, using PyMC3, to walk through the basics of Splines. I’ve stumbled across an RStan resource; I’m somewhat tracking the major concepts, but would rather “see it” in PyMC3 as the language/API isn’t as low level.

Anyone have anything?

The last example in Rethinking 2 chapter 4 is a good start: https://github.com/pymc-devs/resources/blob/master/Rethinking_2/Chp_04.ipynb

2 Likes

Looking at the resource you’ve linked, the author uses:

from patsy import dmatrix

B = dmatrix(
    "bs(year, knots=knots, degree=3, include_intercept=True) - 1",
    {"year": d2.year.values, "knots": knot_list[1:-1]},
)

I have no idea what the -1 is supposed to in the formula-like argument or why it would be used. Any ideas?

These resources were somewhat helpful in general familiarity w/ Patsy, but didn’t answer my specific question.

I forgot what does that do as well - best to generate B with and without -1 to compare.

1 Like

For future readers, I was curious about fitting a spline, as well, so I put together a blog post with additional plots and explanation of the Statistical Rethinking example.

Please let me know if there are any mistakes or ways of improving this post. I made it primarily for my own reference, but figured others would find it useful, too.

6 Likes

Does anyone know how I could fit a spline in a multilevel model? In other words, how can I define a separate spline basis for each group?

Here is what I have tried. The list of B matrices is B_list and the group of each data point is in the "group" column of the data frame d. The problem is that I cannot figure out a way to successfully index a collection of the spline bases. They are different sizes (i.e. different number of data points per group) so I cannot stack them into a 3D array.

group_idx = d["group"].cat.codes.to_list()
w_shape = (NUM_KNOTS + 2, len(np.unique(d["group"])))

with pm.Model() as m:
    # Priors
    a = pm.Normal("a", 0, 2)
    w = pm.Normal("w", 0, 1, shape=w_shape)
    
    # Linear model
    mu = pm.Deterministic("mu", a + pm.math.dot(B_list[group_idx], w[:, group_idx]))
    sgima = pm.Exponential("sigma", 1)
    
    # Likelihood
    D =  pm.Normal("D", mu, sigma, observed=d["observed_D"])

Any help is greatly appreciated!


(Edit: To state what was not working.)

So, and this is pretty monstrous, but could you:

group_idx = d["group"].cat.codes.to_list()
w_shape = (NUM_KNOTS + 2, len(np.unique(d["group"])))

with pm.Model() as m:
    # Priors
    a = pm.Normal("a", 0, 2)
    w = pm.Normal("w", 0, 1, shape=w_shape)
    
    # Linear model
    _mu = a

for group in group_idx:

    with model:
        _mu += pm.math.dot(B_list[group], w[:, group])

with model:
    
    mu = pm.Deterministic("mu", _mu)
    
    sigma = pm.Exponential("sigma", 1)
    
    # Likelihood
    D =  pm.Normal("D", mu, sigma, observed=d["observed_D"])

I’ve done things like this before when I was stuck on how to get a loop working.

Thanks for the help! I implemented your suggestion and got the following error with the traceback pointing the the _mu += ... in the for-loop:

Input dimension mis-match. (input[0].shape[0] = 3988, input[1].shape[0] = 5982)

I think this is saying that the next group cannot be added onto _mu because there are differing numbers of data points between the groups. Any thoughts on how to get around this?

Would it be mathematically “kosher” for me to just pad the smaller B-spline bases with 0’s to match the largest one? Then I could build a 3D matrix and index that.

Yeah that’s what I would do as well - creating a block design matrix might be expensive if the size is large so be aware tho

It definitely is! I wasn’t able to use the pm.math.dot() and tried to use the tt.sum(...) option mentioned here but it still used way too much memory.

Not sure it can be used here but there is also tt.batched_dot which loops over dot-products.

1 Like