B-splines regression, out-of-sample predictions

Hi all,

I am trying to fit a B-spline model to some longitudinal data. I defined the model like this, using a skew normal instead of a regular normal distribution as the likelihood (see https://www.pymc.io/projects/docs/en/latest/api/distributions/generated/pymc.SkewNormal.html).

import numpy as np
import pymc as pm
from patsy import dmatrix

COORDS = {"splines": np.arange(numKnots + 2)}

with pm.Model(coords=COORDS) as model:
    x = pm.MutableData("x", value=data.x)
    # B-spline design matrix evaluated at the current x values
    B = dmatrix(
        "bs(x, knots=knots, degree=3, include_intercept=True) - 1",
        {"x": x.get_value(), "knots": iKnots},
    )
    a = pm.Normal("a", 0, 3)
    w = pm.Normal("w", mu=0, sigma=2, dims="splines")
    mu = pm.Deterministic("mu", a + pm.math.dot(np.asarray(B, order="F"), w.T))

    sigma = pm.HalfStudentT("sigma", nu=4, sigma=1)
    alpha = pm.Normal("alpha", mu=0, sigma=5)

    # Compute the mean of a skew normal distribution
    mean = pm.Deterministic(
        "mean",
        mu + sigma * np.sqrt(2 / np.pi) * (alpha / pm.math.sqrt(1 + alpha**2)),
    )

    y = pm.SkewNormal(
        "y", mu=mu, sigma=sigma, alpha=alpha, observed=data.y, shape=len(data.x)
    )
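(For reference, the mean line above is the standard expression for the expectation of a skew normal: E[Y] = μ + σδ√(2/π), where δ = α/√(1 + α²).)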

I was wondering if it is possible to make out-of-sample posterior predictions with this model, as I would like to compute the mean (see its definition in the model specification) for values that lie between those in data.x.

I first tried simply calling pm.set_data({'x': newData.x}) and then running pm.sample_posterior_predictive(iData, var_names=['mean'], random_seed=6), but this just produced predictions for the original x data.
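In code, that was roughly (a sketch of what I ran):

with model:
    # swap in the new x values, then redraw using the stored posterior
    pm.set_data({"x": newData.x})
    ppc = pm.sample_posterior_predictive(iData, var_names=["mean"], random_seed=6)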

So I then tried the following:

with model:
    # some new data that lie within the limits of data.x
    x_new = list(range(430))
    B = dmatrix(
        "bs(x, knots=knots, degree=3, include_intercept=True) - 1",
        {"x": x_new, "knots": iKnots},
    )
    mu_hat = pm.Deterministic("mu_hat", a + pm.math.dot(np.asarray(B, order="F"), w.T))
    mean_hat = pm.Deterministic(
        "mean_hat",
        mu_hat + sigma * np.sqrt(2 / np.pi) * (alpha / pm.math.sqrt(1 + alpha**2)),
    )
    ppc = pm.sample_posterior_predictive(iData, var_names=["mu_hat", "mean_hat"], random_seed=6)

because I thought that B and mu needed to be re-evaluated for the new data. However, this results in really wide HDIs for mu_hat and mean_hat, so I'm wondering what I'm doing wrong here and what the proper way to do it is (if possible).
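In other words, I imagine the proper way might look something like this (just a sketch; build_design_matrices comes from patsy and re-evaluates the same basis at new points, and B_mat is a name I made up):

from patsy import build_design_matrices

with pm.Model(coords=COORDS) as model2:
    B = dmatrix(
        "bs(x, knots=knots, degree=3, include_intercept=True) - 1",
        {"x": data.x, "knots": iKnots},
    )
    # store the design matrix itself as mutable data so it can be swapped later
    B_mat = pm.MutableData("B_mat", np.asarray(B, order="F"))
    a = pm.Normal("a", 0, 3)
    w = pm.Normal("w", mu=0, sigma=2, dims="splines")
    mu = pm.Deterministic("mu", a + pm.math.dot(B_mat, w.T))
    # ... same sigma, alpha, mean, and likelihood as in the model above

# re-evaluate the basis at the new x values and swap it in
B_new = build_design_matrices([B.design_info], {"x": x_new})[0]
with model2:
    pm.set_data({"B_mat": np.asarray(B_new, order="F")})
    ppc = pm.sample_posterior_predictive(iData, var_names=["mean"], random_seed=6)

But I'm not sure this is right, hence the question.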

Thanks in advance!


I think this is the type of problem that you could solve easily with Bambi. There's no family with the SkewNormal likelihood though, so we need to write a custom family (I can show you if you want).

If Bambi is an option for you, let me know and I can help :slight_smile:

Try including the values that you want to see in data. It shouldn't matter that data.y will be missing for these x values; PyMC will simply treat the missing values as additional variables (it may slow down sampling a hair).
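A minimal sketch of that idea (x_new stands for the extra x values you want predictions at; PyMC's automatic imputation handles the masked entries):

import numpy as np

# append the x values we want predictions at, with unknown y
x_all = np.concatenate([data.x, x_new])
y_all = np.concatenate([data.y, np.full(len(x_new), np.nan)])

# masked entries are treated as missing; PyMC adds an unobserved
# variable for each masked y and samples it along with the rest
y_masked = np.ma.masked_invalid(y_all)

# ... then build the design matrix from x_all and pass observed=y_masked
# to the SkewNormal likelihood when defining the model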

Thanks for the offer! I actually tried Bambi, but it seemed I would have needed to write a predict method, at which point I stopped. I got this far:

# Define likelihood (skew-normal)
likelihood = bmb.Likelihood(
    "SkewNormal",
    parent="mu",
    sigma=bmb.Prior("HalfStudentT", nu=4, sigma=1),
    alpha=bmb.Prior("Normal", mu=0, sigma=5),
)
link = bmb.Link("identity")
family = bmb.Family("skewnormal", likelihood, link)


# Define model
modelSplines = bmb.Model("y ~ bs(x, knots=iKnots, intercept=True)",
                         dataSplines, family=family, dropna=True)

Thanks for the tip, I’ll try that!

@leomein get ready for a long example :smiley:

First of all, make sure to use the development version of Bambi. We changed many internals recently and it affects how you build families.
Second, I'm going to create the custom family twice. The first version is simpler, but it doesn't give us posterior predictive sampling. The second is a little more complicated and you need to know a little more about the internals, but it's implemented exactly the same way as a built-in family. This means that, if you want, you can use this code to open a PR to add a new skewnormal family.

Here we go

import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats

# simulate data
knots_n = 12
size = 100
rng = np.random.default_rng(1234)

knots = np.linspace(-0.5, 1.5, knots_n)
knots_coeff = rng.normal(size=knots_n)

spline = sp.interpolate.BSpline(knots, knots_coeff, 3, extrapolate=False)

x = rng.uniform(0, 1, size)
x.sort()
y = spline(x) + stats.skewnorm(a=0.9, loc=0, scale=0.5).rvs(size=size, random_state=rng)

data = pd.DataFrame({"x": x, "y": y})

fig, ax = plt.subplots(figsize=(8, 6))

x_plot = np.linspace(0, 1, 200)
ax.plot(x_plot, spline(x_plot), c='k')
ax.scatter(x, y, alpha=0.75, zorder=5)
ax.set_xlim(0, 1);

Now we can create the family and the model. Note that priors are not specified in the family anymore.

# Define the custom family
likelihood = bmb.Likelihood("SkewNormal", params=["mu", "sigma", "alpha"], parent="mu")
link = bmb.Link("identity")
family = bmb.Family("skewnormal", likelihood, link)

# Define the priors for the auxiliary parameters (all the non-parent params)
priors = {
    "sigma": bmb.Prior("HalfStudentT", nu=4, sigma=1),
    "alpha": bmb.Prior("Normal", mu=0, sigma=5),
}
# Use them in the model
model = bmb.Model("y ~ 1 + bs(x, df=9)", data, family=family, priors=priors, dropna=True)
idata = model.fit()

And now… voilà!
(notice that we get mean predictions without having to do anything extra)

new_x = np.linspace(0, 1, num=100)
model.predict(idata, kind="mean", data=pd.DataFrame({"x": new_x}))

y_pred_mean = idata.posterior["y_mean"].mean(("chain", "draw")).to_numpy()
# 95% equal-tailed interval of the posterior mean (quantiles, not a true HDI)
hdi_data = idata.posterior["y_mean"].quantile([0.025, 0.975], ("chain", "draw")).to_numpy()

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x_plot, spline(x_plot), color="black")
ax.scatter(x, y, alpha=0.75, zorder=5)
ax.set_xlim(0, 1)
ax.plot(new_x, y_pred_mean, color="C3")
ax.fill_between(new_x, hdi_data[0], hdi_data[1], alpha=0.4, color="C1");

But notice that if you want draws from the posterior predictive distribution, it will fail, because that's not implemented. Bambi does not use PyMC's capabilities to get draws from the posterior predictive distribution; it implements them manually (that's another debate).

model.predict(idata, kind="pps")
# AttributeError: 'Family' object has no attribute 'posterior_predictive'

So if we really want it, we need to implement it ourselves. See below that it works:

from bambi.utils import get_aliased_name
from bambi.families.univariate import UnivariateFamily
import xarray as xr

class SkewNormalFamily(UnivariateFamily):
    SUPPORTED_LINKS = {"mu": ["identity"], "sigma": ["log"], "alpha": ["log"]}

    def posterior_predictive(self, model, posterior, **kwargs):
        # Pull the posterior draws of the response parameters and push them
        # through scipy's skew normal RNG, broadcasting over chains and draws
        response_name = get_aliased_name(model.response_component.response_term)
        mean = posterior[response_name + "_mean"]
        sigma = posterior[response_name + "_sigma"]
        alpha = posterior[response_name + "_alpha"]
        return xr.apply_ufunc(stats.skewnorm.rvs, alpha, mean, sigma)


likelihood = bmb.Likelihood("SkewNormal", params=["mu", "sigma", "alpha"], parent="mu")
link = bmb.Link("identity")
family = SkewNormalFamily("skewnormal", likelihood, link)

# Define the priors for the auxiliary parameters (all the non-parent params)
priors = {
    "sigma": bmb.Prior("HalfStudentT", nu=4, sigma=1),
    "alpha": bmb.Prior("Normal", mu=0, sigma=5),
}
# Use them in the model
model = bmb.Model("y ~ 1 + bs(x, df=9)", data, family=family, priors=priors, dropna=True)
idata = model.fit()

model.predict(idata, kind="pps")

@leomein I figured out we could implement something in Bambi to provide a default handler for the posterior predictive distribution. It turns out that, for the majority of families, we won't need to implement a posterior_predictive method anymore.

This is the PR: [WIP] Add default handler for posterior predictive distribution by tomicapretto · Pull Request #625 · bambinos/bambi


I took a look at it this morning and it looks great.


Great, thanks a lot! I’ll try that.
