B-splines regression, out-of-sample predictions

Hi all,

I am trying to fit a B-spline model to some longitudinal data. I defined the model like this, using a skew normal instead of a regular normal distribution as the likelihood (see https://www.pymc.io/projects/docs/en/latest/api/distributions/generated/pymc.SkewNormal.html).

import numpy as np
import pymc as pm
from patsy import dmatrix

COORDS = {"splines": np.arange(numKnots + 2)}

with pm.Model(coords=COORDS) as model:
    x = pm.MutableData("x", value=data.x)
    # B-spline design matrix evaluated at the current x values
    B = dmatrix(
        "bs(x, knots=knots, degree=3, include_intercept=True) - 1",
        {"x": x.get_value(), "knots": iKnots},
    )
    a = pm.Normal("a", 0, 3)
    w = pm.Normal("w", mu=0, sigma=2, dims="splines")
    mu = pm.Deterministic("mu", a + pm.math.dot(np.asarray(B, order="F"), w.T))

    sigma = pm.HalfStudentT("sigma", nu=4, sigma=1)
    alpha = pm.Normal("alpha", mu=0, sigma=5)

    # Compute the mean of a skew normal distribution
    mean = pm.Deterministic(
        "mean",
        mu + sigma * np.sqrt(2 / np.pi) * (alpha / pm.math.sqrt(1 + alpha**2)),
    )

    y = pm.SkewNormal(
        "y", mu=mu, sigma=sigma, alpha=alpha, observed=data.y, shape=len(data.x)
    )
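(For reference, the mean line above is the standard expression for the expectation of a skew normal: E[Y] = μ + σδ√(2/π), where δ = α/√(1 + α²).)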

I was wondering if it is possible to make out-of-sample posterior predictions with this model, as I would like to compute the mean (see its definition in the model specification) for values that lie between those in data.x.

I first tried simply calling pm.set_data({'x': newData.x}) and then running pm.sample_posterior_predictive(iData, var_names=['mean'], random_seed=6), but this just produced predictions for the original x data.
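In code, that was roughly (a sketch of what I ran):

with model:
    # swap in the new x values, then redraw using the stored posterior
    pm.set_data({"x": newData.x})
    ppc = pm.sample_posterior_predictive(iData, var_names=["mean"], random_seed=6)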

So I then tried the following:

with model:
    # some new data that lie within the limits of data.x
    x_new = list(range(430))
    B = dmatrix(
        "bs(x, knots=knots, degree=3, include_intercept=True) - 1",
        {"x": x_new, "knots": iKnots},
    )
    mu_hat = pm.Deterministic("mu_hat", a + pm.math.dot(np.asarray(B, order="F"), w.T))
    mean_hat = pm.Deterministic(
        "mean_hat",
        mu_hat + sigma * np.sqrt(2 / np.pi) * (alpha / pm.math.sqrt(1 + alpha**2)),
    )
    ppc = pm.sample_posterior_predictive(iData, var_names=["mu_hat", "mean_hat"], random_seed=6)

because I thought that B and mu needed to be re-evaluated for the new data. However, this results in really wide HDIs for mu_hat and mean_hat, so I'm wondering what I'm doing wrong here and what the proper way to do it is (if possible).
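In other words, I imagine the proper way might look something like this (just a sketch; build_design_matrices comes from patsy and re-evaluates the same basis at new points, and B_mat is a name I made up):

from patsy import build_design_matrices

with pm.Model(coords=COORDS) as model2:
    B = dmatrix(
        "bs(x, knots=knots, degree=3, include_intercept=True) - 1",
        {"x": data.x, "knots": iKnots},
    )
    # store the design matrix itself as mutable data so it can be swapped later
    B_mat = pm.MutableData("B_mat", np.asarray(B, order="F"))
    a = pm.Normal("a", 0, 3)
    w = pm.Normal("w", mu=0, sigma=2, dims="splines")
    mu = pm.Deterministic("mu", a + pm.math.dot(B_mat, w.T))
    # ... same sigma, alpha, mean, and likelihood as in the model above

# re-evaluate the basis at the new x values and swap it in
B_new = build_design_matrices([B.design_info], {"x": x_new})[0]
with model2:
    pm.set_data({"B_mat": np.asarray(B_new, order="F")})
    ppc = pm.sample_posterior_predictive(iData, var_names=["mean"], random_seed=6)

But I'm not sure this is right, hence the question.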

Thanks in advance!


I think this is the type of problem that you could solve easily with Bambi. There's no family with the SkewNormal likelihood though, so we need to write a custom family (I can show you if you want).

If Bambi is an option for you, let me know and I can help :slight_smile:

Try including the values that you want to see in data. It shouldn't matter that data.y will be missing for these x values; PyMC will simply treat the missing values as additional variables (it may slow down sampling a hair).
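A minimal sketch of that idea (x_new stands for the extra x values you want predictions at; PyMC's automatic imputation handles the masked entries):

import numpy as np

# append the x values we want predictions at, with unknown y
x_all = np.concatenate([data.x, x_new])
y_all = np.concatenate([data.y, np.full(len(x_new), np.nan)])

# masked entries are treated as missing; PyMC adds an unobserved
# variable for each masked y and samples it along with the rest
y_masked = np.ma.masked_invalid(y_all)

# ... then build the design matrix from x_all and pass observed=y_masked
# to the SkewNormal likelihood when defining the model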

Thanks for the offer! I actually tried Bambi, but it seemed I would have needed to write a predict method, at which point I stopped. I got this far:

# Define likelihood (skew-normal)
likelihood = bmb.Likelihood(
    "SkewNormal",
    parent="mu",
    sigma=bmb.Prior("HalfStudentT", nu=4, sigma=1),
    alpha=bmb.Prior("Normal", mu=0, sigma=5),
)
link = bmb.Link("identity")
family = bmb.Family("skewnormal", likelihood, link)


# Define model
modelSplines = bmb.Model("y ~ bs(x, knots=iKnots, intercept=True)",
                         dataSplines, family=family, dropna=True)

Thanks for the tip, I’ll try that!

@leomein get ready for a long example :smiley:

First of all, make sure to use the development version of Bambi. We changed many internals recently and it affects how you build families.
Second, I'm going to create the custom family twice. The first version is simpler, but it doesn't give us posterior predictive sampling. The second is a little more complicated and you need to know a little more about the internals, but it's implemented exactly the same way as a built-in family. This means that, if you want, you can use this code to open a PR to add a new skewnormal family.

Here we go

import arviz as az
import bambi as bmb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats

# simulate data
knots_n = 12
size = 100
rng = np.random.default_rng(1234)

knots = np.linspace(-0.5, 1.5, knots_n)
knots_coeff = rng.normal(size=knots_n)

spline = sp.interpolate.BSpline(knots, knots_coeff, 3, extrapolate=False)

x = rng.uniform(0, 1, size)
x.sort()
y = spline(x) + stats.skewnorm(a=0.9, loc=0, scale=0.5).rvs(size=size, random_state=rng)

data = pd.DataFrame({"x": x, "y": y})

fig, ax = plt.subplots(figsize=(8, 6))

x_plot = np.linspace(0, 1, 200)
ax.plot(x_plot, spline(x_plot), c='k')
ax.scatter(x, y, alpha=0.75, zorder=5)
ax.set_xlim(0, 1);

Now we can create the family and the model. Note that priors are not specified in the family anymore.

# Define the custom family
likelihood = bmb.Likelihood("SkewNormal", params=["mu", "sigma", "alpha"], parent="mu")
link = bmb.Link("identity")
family = bmb.Family("skewnormal", likelihood, link)

# Define the priors for the auxiliary parameters (all the non-parent params)
priors = {
    "sigma": bmb.Prior("HalfStudentT", nu=4, sigma=1),
    "alpha": bmb.Prior("Normal", mu=0, sigma=5),
}
# Use them in the model
model = bmb.Model("y ~ 1 + bs(x, df=9)", data, family=family, priors=priors, dropna=True)
idata = model.fit()

And now… voilà!
(notice that we get mean predictions without having to do anything extra)

new_x = np.linspace(0, 1, num=100)
model.predict(idata, kind="mean", data=pd.DataFrame({"x": new_x}))

y_pred_mean = idata.posterior["y_mean"].mean(("chain", "draw")).to_numpy()
# 95% equal-tailed interval of the posterior mean (quantiles, not a true HDI)
hdi_data = idata.posterior["y_mean"].quantile([0.025, 0.975], ("chain", "draw")).to_numpy()

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x_plot, spline(x_plot), color="black")
ax.scatter(x, y, alpha=0.75, zorder=5)
ax.set_xlim(0, 1)
ax.plot(new_x, y_pred_mean, color="C3")
ax.fill_between(new_x, hdi_data[0], hdi_data[1], alpha=0.4, color="C1");

But notice that if you want draws from the posterior predictive distribution, it will fail, because that's not implemented. Bambi does not use PyMC's capabilities to get draws from the posterior predictive distribution; it implements them manually (that's another debate).

model.predict(idata, kind="pps")
# AttributeError: 'Family' object has no attribute 'posterior_predictive'

So if we really want it, we need to implement it ourselves. See below that it works:

from bambi.utils import get_aliased_name
from bambi.families.univariate import UnivariateFamily
import xarray as xr

class SkewNormalFamily(UnivariateFamily):
    SUPPORTED_LINKS = {"mu": ["identity"], "sigma": ["log"], "alpha": ["log"]}

    def posterior_predictive(self, model, posterior, **kwargs):
        # Pull the posterior draws of the response parameters and push them
        # through scipy's skew normal RNG, broadcasting over chains and draws
        response_name = get_aliased_name(model.response_component.response_term)
        mean = posterior[response_name + "_mean"]
        sigma = posterior[response_name + "_sigma"]
        alpha = posterior[response_name + "_alpha"]
        return xr.apply_ufunc(stats.skewnorm.rvs, alpha, mean, sigma)


likelihood = bmb.Likelihood("SkewNormal", params=["mu", "sigma", "alpha"], parent="mu")
link = bmb.Link("identity")
family = SkewNormalFamily("skewnormal", likelihood, link)

# Define the priors for the auxiliary parameters (all the non-parent params)
priors = {
    "sigma": bmb.Prior("HalfStudentT", nu=4, sigma=1),
    "alpha": bmb.Prior("Normal", mu=0, sigma=5),
}
# Use them in the model
model = bmb.Model("y ~ 1 + bs(x, df=9)", data, family=family, priors=priors, dropna=True)
idata = model.fit()

model.predict(idata, kind="pps")

@leomein I figured out we could implement something in Bambi to provide a default handler for the posterior predictive distribution. It turns out that, for the majority of families, we won't need to implement a posterior_predictive method anymore.

This is the PR: [WIP] Add default handler for posterior predictive distribution by tomicapretto · Pull Request #625 · bambinos/bambi


I took a look at it this morning and it looks great.


Great, thanks a lot! I’ll try that.
