Spline Regression in PyMC3

I’ve been looking for tutorials, using PyMC3, to walk through the basics of Splines. I’ve stumbled across an RStan resource; I’m somewhat tracking the major concepts, but would rather “see it” in PyMC3 as the language/API isn’t as low level.

Anyone have anything?

The last example in Rethinking 2 chapter 4 is a good start: https://github.com/pymc-devs/resources/blob/master/Rethinking_2/Chp_04.ipynb


Looking at the resource you’ve linked, the author uses:

from patsy import dmatrix

B = dmatrix(
    "bs(year, knots=knots, degree=3, include_intercept=True) - 1",
    {"year": d2.year.values, "knots": knot_list[1:-1]},
)

I have no idea what the -1 is supposed to do in the formula-like argument or why it would be used. Any ideas?

These resources were somewhat helpful in general familiarity w/ Patsy, but didn’t answer my specific question.

I forget what that does as well - best to generate B with and without the -1 and compare.
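For future readers: in a Patsy formula, the - 1 removes the intercept column that Patsy adds by default (the same convention as R formulas). Since include_intercept=True already makes the spline basis span the intercept, the - 1 avoids a redundant column. A minimal comparison along those lines, assuming patsy and numpy are installed (the toy data here is made up for illustration):

```python
import numpy as np
from patsy import dmatrix

x = np.linspace(0.0, 10.0, 50)

# With Patsy's default intercept column
B_default = dmatrix("bs(x, df=5, degree=3, include_intercept=True)", {"x": x})

# Same spline basis, but "- 1" drops the separate intercept column
B_no_int = dmatrix("bs(x, df=5, degree=3, include_intercept=True) - 1", {"x": x})

print(np.asarray(B_default).shape)  # should have one more column than below
print(np.asarray(B_no_int).shape)
```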


For future readers, I was curious about fitting a spline, as well, so I put together a blog post with additional plots and explanation of the Statistical Rethinking example.

Please let me know if there are any mistakes or ways of improving this post. I made it primarily for my own reference, but figured others would find it useful, too.


Does anyone know how I could fit a spline in a multilevel model? In other words, how can I define a separate spline basis for each group?

Here is what I have tried. The list of B matrices is B_list and the group of each data point is in the "group" column of the data frame d. The problem is that I cannot figure out a way to successfully index a collection of the spline bases. They are different sizes (i.e. different number of data points per group) so I cannot stack them into a 3D array.

group_idx = d["group"].cat.codes.to_list()
w_shape = (NUM_KNOTS + 2, len(np.unique(d["group"])))

with pm.Model() as m:
    # Priors
    a = pm.Normal("a", 0, 2)
    w = pm.Normal("w", 0, 1, shape=w_shape)
    
    # Linear model
    mu = pm.Deterministic("mu", a + pm.math.dot(B_list[group_idx], w[:, group_idx]))
    sigma = pm.Exponential("sigma", 1)

    # Likelihood
    D = pm.Normal("D", mu, sigma, observed=d["observed_D"])

Any help is greatly appreciated!


(Edit: To state what was not working.)

So, and this is pretty monstrous, but could you:

group_idx = d["group"].cat.codes.to_list()
w_shape = (NUM_KNOTS + 2, len(np.unique(d["group"])))

with pm.Model() as m:
    # Priors
    a = pm.Normal("a", 0, 2)
    w = pm.Normal("w", 0, 1, shape=w_shape)
    
    # Linear model
    _mu = a

for group in np.unique(group_idx):

    with m:
        _mu += pm.math.dot(B_list[group], w[:, group])

with m:
    
    mu = pm.Deterministic("mu", _mu)
    
    sigma = pm.Exponential("sigma", 1)
    
    # Likelihood
    D = pm.Normal("D", mu, sigma, observed=d["observed_D"])

I’ve done things like this before when I was stuck on how to get a loop working.

Thanks for the help! I implemented your suggestion and got the following error, with the traceback pointing to the _mu += ... line in the for-loop:

Input dimension mis-match. (input[0].shape[0] = 3988, input[1].shape[0] = 5982)

I think this is saying that the next group cannot be added onto _mu because there are differing numbers of data points between the groups. Any thoughts on how to get around this?

Would it be mathematically “kosher” for me to just pad the smaller B-spline bases with 0’s to match the largest one? Then I could build a 3D matrix and index that.

Yeah, that’s what I would do as well - creating a block design matrix might be expensive if the size is large, though, so be aware.

It definitely is! I wasn’t able to use pm.math.dot() and tried the tt.sum(...) option mentioned here, but it still used way too much memory.

Not sure it can be used here but there is also tt.batched_dot which loops over dot-products.
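To make the zero-padding idea concrete, here is a plain-NumPy sketch (B_list, w, and the group sizes are toy stand-ins for the objects above; the padded rows are all zeros, so they contribute nothing to the dot products):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: three groups with different numbers of data points,
# each with its own (n_points x n_basis) spline basis matrix.
n_basis = 7
B_list = [rng.normal(size=(n, n_basis)) for n in (40, 60, 50)]
w = rng.normal(size=(n_basis, len(B_list)))  # one weight vector per group

# Zero-pad every basis to the largest group so they stack into a 3D array.
max_n = max(B.shape[0] for B in B_list)
B_padded = np.stack(
    [np.pad(B, ((0, max_n - B.shape[0]), (0, 0))) for B in B_list]
)  # shape: (n_groups, max_n, n_basis)

# Per-group dot products in one vectorized step:
# mu_padded[g] == B_padded[g] @ w[:, g]
mu_padded = np.einsum("gnb,bg->gn", B_padded, w)

# Recover the ragged per-group fitted values by trimming the padding.
mu_list = [mu_padded[g, : B.shape[0]] for g, B in enumerate(B_list)]
```

The same einsum-style contraction can be expressed on the Theano side, which is where tt.batched_dot or a sum over a 3D tensor would come in.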


Hi there, apologies for resurrecting this old thread.

I’ve been trying to carry out some spline regression and this thread, and the blog post, have been incredibly useful.

I’m quite new to all this, so I can only apologise if this is a stupid question, but in my case I need to limit the spline to only have positive gradients. Is there a simple way to achieve this? I naively thought that I should be able to use a HalfNormal distribution for w but that does not seem to be the case.

Welcome to the PyMC3 discourse! Hopefully you’re enjoying the library and community.

I’m not experienced with using splines and my knowledge of modeling is moderate at best. My first thought is that if the spline slopes need to be positive, then that should be a function of the data, not the variables. In other words, if the data calls for a negative slope, then a negative slope should be accepted. In my experience, positive-only distributions are used when the values must be limited for mathematical reasons, such as for standard deviations. What output did you get when you used the half-normal for w? Did you get an error or diagnostic warnings? If the latter, it’s likely because restricting w to be positive doesn’t model the data well.


Thanks for your quick response.

I would generally agree with you about the data, and maybe my approach is all wrong. However, although the splines fitted to that data occasionally show a slight decrease, I know that isn’t the reality. I’m not sure if it will mean much to you, but I’m looking at the physical vs genetic distances on chromosomes. So for the next set of analyses I can’t predict a decrease in genetic distance with an increase in physical distance, because those analyses just won’t work.

Existing out-the-box approaches (its known as marey mapping) would have me remove points until this problem goes away but obviously determining exactly which points to remove is highly subjective. My hope is that PyMC3 will have the flexibility to allow me to do something a bit more sophisticated without the need to discard data.

Using the half-normal doesn’t actually seem to alter the model output at all: it runs fine, but the predicted spline appears largely identical to using the normal distribution. Which probably means I don’t understand the model as well as I should.

Your use case makes sense, but I’m not sure how using the half-normal prior doesn’t have an effect; it shouldn’t be able to produce negative values for w. I would recommend checking everything again to make sure it is all wired up correctly. If everything looks good and you still have the problem, then try building a simplified version to get to the root of the problem.

When I run the cherry blossom example with w = pm.HalfNormal("w", 3, shape=B.shape[1]), it seems to work as expected and only give values ≥0 for w. When multiplied against the basis, there can still be drops in the spline, but that is because the next w is smaller (but still positive) than the previous (see plot below where each individual spline curve is positive but the combined spline in blue drops at times). Is that the problem you are having?
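For anyone following along, the behavior described above is easy to reproduce outside of PyMC3. With a toy basis of overlapping Gaussian bumps (made-up stand-ins for B-spline basis columns), strictly positive weights still allow the combined curve to dip wherever a smaller weight takes over from a larger one:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 200)
centers = np.array([2.0, 5.0, 8.0])

# Overlapping bump "basis" (toy stand-in for B-spline basis columns)
basis = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

# All weights strictly positive, but the middle one is much smaller...
w = np.array([3.0, 0.5, 3.0])
y = basis @ w

# ...so the combined curve still decreases between the first two bumps.
assert np.all(w > 0)
assert np.any(np.diff(y) < 0)  # not monotone despite w > 0
```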


You are right, of course, w is always positive but that doesn’t mean the combined spline isn’t going to have negative gradients. I was just being daft.

It looks very much like this type of spline regression isn’t going to fit my needs. I need a model that finds the best fit for the data while constrained to always increase. It would be really cool if this could be coded up in PyMC3, but I’m pretty sure such a solution is way beyond my capabilities.

I’ve found a package for R called Scam (Shape constrained additive models) which seems to do what is required but my lack of understanding as to what is going on under the hood is bothering me at the minute.

Cheers for your help.

Well I’m glad you figured out the problem even if it wasn’t the outcome you had hoped for. It sounds like an interesting model to fit, so I would encourage you to make a new topic to elicit help if you want to try coding it with PyMC3. Whatever you choose, good luck!

You might try I-splines, which can be used as a basis for monotonic functions. This package has functionality for generating the basis functions, which you would use in place of the matrix B.
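To sketch the intuition (using clipped linear ramps as a toy stand-in for real I-splines, which are smooth): each basis function is monotone nondecreasing, so any nonnegative combination of them, e.g. with a HalfNormal prior on w, is monotone as well. The knots and weights below are made up for illustration:

```python
import numpy as np

x = np.linspace(0.0, 10.0, 200)
knots = np.array([2.0, 4.0, 6.0, 8.0])

# Each clipped ramp is monotone nondecreasing, like an I-spline basis
# function (true I-splines are smooth; these are just a toy stand-in).
basis = np.clip(x[:, None] - knots[None, :], 0.0, 2.0)

# Nonnegative weights guarantee the combined curve never decreases.
w = np.array([0.7, 0.1, 1.3, 0.4])
y = 1.5 + basis @ w  # intercept a + B w

assert np.all(np.diff(y) >= 0)  # monotone nondecreasing
```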


This seems to work beautifully, thank you.

I’d almost got there before but hadn’t found a python package to generate I-splines, so this completed the puzzle nicely.

Awesome thanks.
