Spline Regression in PyMC3

I’ve been looking for tutorials, using PyMC3, to walk through the basics of Splines. I’ve stumbled across an RStan resource; I’m somewhat tracking the major concepts, but would rather “see it” in PyMC3 as the language/API isn’t as low level.

Anyone have anything?

The last example in Rethinking 2 chapter 4 is a good start: https://github.com/pymc-devs/resources/blob/master/Rethinking_2/Chp_04.ipynb


Looking at the resource you’ve linked, the author uses:

from patsy import dmatrix

B = dmatrix(
    "bs(year, knots=knots, degree=3, include_intercept=True) - 1",
    {"year": d2.year.values, "knots": knot_list[1:-1]},
)

I have no idea what the -1 is supposed to do in the formula-like argument or why it would be used. Any ideas?

These resources were somewhat helpful in general familiarity w/ Patsy, but didn’t answer my specific question.

I forget what that does as well - best to generate B with and without the -1 and compare.
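For future readers: in a Patsy formula, the - 1 removes the intercept column that Patsy adds by default (the same convention as R formulas). Since include_intercept=True already makes the spline basis span the intercept, the - 1 avoids a redundant column. A minimal comparison along those lines, assuming patsy and numpy are installed (the toy data here is made up for illustration):

```python
import numpy as np
from patsy import dmatrix

x = np.linspace(0.0, 10.0, 50)

# With Patsy's default intercept column
B_default = dmatrix("bs(x, df=5, degree=3, include_intercept=True)", {"x": x})

# Same spline basis, but "- 1" drops the separate intercept column
B_no_int = dmatrix("bs(x, df=5, degree=3, include_intercept=True) - 1", {"x": x})

print(np.asarray(B_default).shape)  # should have one more column than below
print(np.asarray(B_no_int).shape)
```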


For future readers, I was curious about fitting a spline, as well, so I put together a blog post with additional plots and explanation of the Statistical Rethinking example.

Please let me know if there are any mistakes or ways of improving this post. I made it primarily for my own reference, but figured others would find it useful, too.


Does anyone know how I could fit a spline in a multilevel model? In other words, how can I define a separate spline basis for each group?

Here is what I have tried. The list of B matrices is B_list and the group of each data point is in the "group" column of the data frame d. The problem is that I cannot figure out a way to successfully index a collection of the spline bases. They are different sizes (i.e. different number of data points per group) so I cannot stack them into a 3D array.

group_idx = d["group"].cat.codes.to_list()
w_shape = (NUM_KNOTS + 2, len(np.unique(d["group"])))

with pm.Model() as m:
    # Priors
    a = pm.Normal("a", 0, 2)
    w = pm.Normal("w", 0, 1, shape=w_shape)
    
    # Linear model
    mu = pm.Deterministic("mu", a + pm.math.dot(B_list[group_idx], w[:, group_idx]))
    sigma = pm.Exponential("sigma", 1)

    # Likelihood
    D = pm.Normal("D", mu, sigma, observed=d["observed_D"])

Any help is greatly appreciated!


(Edit: To state what was not working.)

So, and this is pretty monstrous, but could you:

group_idx = d["group"].cat.codes.to_list()
w_shape = (NUM_KNOTS + 2, len(np.unique(d["group"])))

with pm.Model() as m:
    # Priors
    a = pm.Normal("a", 0, 2)
    w = pm.Normal("w", 0, 1, shape=w_shape)
    
    # Linear model
    _mu = a

for group in np.unique(group_idx):

    with m:
        _mu += pm.math.dot(B_list[group], w[:, group])

with m:
    
    mu = pm.Deterministic("mu", _mu)
    
    sigma = pm.Exponential("sigma", 1)
    
    # Likelihood
    D = pm.Normal("D", mu, sigma, observed=d["observed_D"])

I’ve done things like this before when I was stuck on how to get a loop working.

Thanks for the help! I implemented your suggestion and got the following error, with the traceback pointing to the _mu += ... line in the for-loop:

Input dimension mis-match. (input[0].shape[0] = 3988, input[1].shape[0] = 5982)

I think this is saying that the next group cannot be added onto _mu because there are differing numbers of data points between the groups. Any thoughts on how to get around this?

Would it be mathematically “kosher” for me to just pad the smaller B-spline bases with 0’s to match the largest one? Then I could build a 3D matrix and index that.

Yeah, that’s what I would do as well - creating a block design matrix might be expensive if the size is large, though, so be aware.

It definitely is! I wasn’t able to use pm.math.dot() and tried the tt.sum(...) option mentioned here, but it still used way too much memory.

Not sure it can be used here but there is also tt.batched_dot which loops over dot-products.
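To make the zero-padding idea concrete, here is a plain-NumPy sketch (B_list, w, and the group sizes are toy stand-ins for the objects above; the padded rows are all zeros, so they contribute nothing to the dot products):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: three groups with different numbers of data points,
# each with its own (n_points x n_basis) spline basis matrix.
n_basis = 7
B_list = [rng.normal(size=(n, n_basis)) for n in (40, 60, 50)]
w = rng.normal(size=(n_basis, len(B_list)))  # one weight vector per group

# Zero-pad every basis to the largest group so they stack into a 3D array.
max_n = max(B.shape[0] for B in B_list)
B_padded = np.stack(
    [np.pad(B, ((0, max_n - B.shape[0]), (0, 0))) for B in B_list]
)  # shape: (n_groups, max_n, n_basis)

# Per-group dot products in one vectorized step:
# mu_padded[g] == B_padded[g] @ w[:, g]
mu_padded = np.einsum("gnb,bg->gn", B_padded, w)

# Recover the ragged per-group fitted values by trimming the padding.
mu_list = [mu_padded[g, : B.shape[0]] for g, B in enumerate(B_list)]
```

The same einsum-style contraction can be expressed on the Theano side, which is where tt.batched_dot or a sum over a 3D tensor would come in.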


Hi there, apologies for resurrecting this old thread.

I’ve been trying to carry out some spline regression and this thread, and the blog post, have been incredibly useful.

I’m quite new to all this, so I can only apologise if this is a stupid question, but in my case I need to limit the spline to only have positive gradients. Is there a simple way to achieve this? I naively thought that I should be able to use a HalfNormal distribution for w but that does not seem to be the case.

Welcome to the PyMC3 discourse! Hopefully you’re enjoying the library and community.

I’m not experienced with using splines and my knowledge of modeling is moderate at best. My first thought is that if the spline slopes need to be positive, then that should be a function of the data, not the variables. In other words, if the data calls for a negative slope, then a negative slope should be accepted. In my experience, positive-only distributions are used when the values must be limited for mathematical reasons, such as for standard deviations. What output did you get when you used the half-normal for w? Did you get an error or diagnostic warnings? If the latter, it’s likely because restricting w to be positive doesn’t model the data well.


Thanks for your quick response.

I would generally agree with you about the data, and maybe my approach is all wrong. However, although the splines fitted to that data occasionally show a slight decrease, I know that isn’t the reality. I’m not sure if it will mean much to you, but I’m looking at the physical vs genetic distances on chromosomes. So for the next set of analyses I can’t predict a decrease in genetic distance with an increase in physical distance, because those analyses just won’t work.

Existing out-the-box approaches (its known as marey mapping) would have me remove points until this problem goes away but obviously determining exactly which points to remove is highly subjective. My hope is that PyMC3 will have the flexibility to allow me to do something a bit more sophisticated without the need to discard data.

Using the half-normal doesn’t actually seem to alter the model output at all: it runs fine, but the predicted spline appears largely identical to using the normal distribution. Which probably means I don’t understand the model as well as I should.

Your use case makes sense, but I’m not sure how using the half-normal prior doesn’t have an effect; it shouldn’t be able to produce negative values for w. I would recommend checking everything again to make sure it is all wired up correctly. If everything looks good and you still have the problem, then try building a simplified version to get to the root of the problem.

When I run the cherry blossom example with w = pm.HalfNormal("w", 3, shape=B.shape[1]), it seems to work as expected and only give values ≥0 for w. When multiplied against the basis, there can still be drops in the spline, but that is because the next w is smaller (but still positive) than the previous (see plot below where each individual spline curve is positive but the combined spline in blue drops at times). Is that the problem you are having?
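For anyone following along, the behavior described above is easy to reproduce outside of PyMC3. With a toy basis of overlapping Gaussian bumps (made-up stand-ins for B-spline basis columns), strictly positive weights still allow the combined curve to dip wherever a smaller weight takes over from a larger one:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 200)
centers = np.array([2.0, 5.0, 8.0])

# Overlapping bump "basis" (toy stand-in for B-spline basis columns)
basis = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

# All weights strictly positive, but the middle one is much smaller...
w = np.array([3.0, 0.5, 3.0])
y = basis @ w

# ...so the combined curve still decreases between the first two bumps.
assert np.all(w > 0)
assert np.any(np.diff(y) < 0)  # not monotone despite w > 0
```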


You are right, of course, w is always positive but that doesn’t mean the combined spline isn’t going to have negative gradients. I was just being daft.

It looks very much like this type of spline regression isn’t going to fit my needs. I need a model that finds the best fit for the data while constrained to always increase. It would be really cool if this could be coded up in PyMC3, but I’m pretty sure such a solution is way beyond my capabilities.

I’ve found a package for R called Scam (Shape constrained additive models) which seems to do what is required but my lack of understanding as to what is going on under the hood is bothering me at the minute.

Cheers for your help.

Well I’m glad you figured out the problem even if it wasn’t the outcome you had hoped for. It sounds like an interesting model to fit, so I would encourage you to make a new topic to elicit help if you want to try coding it with PyMC3. Whatever you choose, good luck!

You might try I-splines, which can be used as a basis for monotonic functions. This package has functionality for generating the basis functions, which you would use in place of the matrix B.
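To sketch the intuition (using clipped linear ramps as a toy stand-in for real I-splines, which are smooth): each basis function is monotone nondecreasing, so any nonnegative combination of them, e.g. with a HalfNormal prior on w, is monotone as well. The knots and weights below are made up for illustration:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 200)
knots = np.array([2.0, 4.0, 6.0, 8.0])

# Each clipped ramp is monotone nondecreasing, like an I-spline basis
# function (true I-splines are smooth; these are just a toy stand-in).
basis = np.clip(x[:, None] - knots[None, :], 0.0, 2.0)

# Nonnegative weights guarantee the combined curve never decreases.
w = np.array([0.7, 0.1, 1.3, 0.4])
y = 1.5 + basis @ w  # intercept a + B w

assert np.all(np.diff(y) >= 0)  # monotone nondecreasing
```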


This seems to work beautifully, thank you.

I’d almost got there before but hadn’t found a python package to generate I-splines, so this completed the puzzle nicely.

Awesome thanks.
