Fit interval as a model parameter

Hi,

I have some data whose functional form I know, but only on some interval: outside that interval I don't have good control of the functional form, nor do I know a priori what the interval is. Up until now I've been running a separate fit for each possible window of data and then looking for plateaux/islands of stability, but the results of that process are hard to visualise, reason about, or distil into a single final value. I'd like to be able to define a single model that treats the fit region bounds as parameters on the same footing as the other parameters of the fit form.

I've put together a relatively minimal example of what I'm trying to achieve. First, construct some sample data:

import matplotlib.pyplot as plt
import numpy as np

def weird_function(x, gate1, gate2, A, B):
    # Straight line A * (x - gate1) + B between the two gates, with
    # oscillating (non-linear) behaviour on either side of them
    return (
        ((gate1 < x) & (x < gate2)) * (A * (x - gate1) + B)
        + (x < gate1) * B * (1 - np.sin((gate1 - x) * 3) / 10)
        + (x > gate2) * (B + (gate2 - gate1) * A + A / 5 * np.sin((x - gate2) * 10) - (x - gate2) ** 2 / 4)
    )

A = 2
B = 3
gate1 = 3
gate2 = 7
xdata = np.linspace(0, 10, 50)
ydata = weird_function(xdata, gate1, gate2, A, B)
yerror = np.random.random(len(xdata)) / 3 + 0.2

plt.errorbar(xdata, ydata, yerr=yerror, fmt='.')
plt.show()

I’d like the model to locate and fit parameters for the straight line region, giving back values for A, B, gate1, and gate2. (The real data I want to apply this to aren’t linear; this is a simplification for the sake of readability here.)

The model I’ve tried is:

import pymc3 as pm
import theano.tensor as tt
import theano

# Wrap the data in shared variables so the model can index into them
xtensor = theano.shared(xdata)
ytensor = theano.shared(ydata)
yerrtensor = theano.shared(yerror)

xmin = min(xdata)
xmax = max(xdata)
xspan = xmax - xmin

model = pm.Model()

with model:
    A = pm.Normal('A', mu=5, sigma=2)
    B = pm.Normal('B', mu=5, sigma=2)

    # Offsets in from the edges of the data define the fit window
    x_lower_bound = xmin + pm.Gamma('x_lower_bound', mu=xspan / 10, sigma=xspan / 2)
    x_upper_bound = xmax - pm.Gamma('x_upper_bound', mu=xspan / 10, sigma=xspan / 2)

    # Indices of the points that fall inside the window
    in_window = (tt.lt(x_lower_bound, xtensor) & tt.lt(xtensor, x_upper_bound)).nonzero()

    x_to_fit = pm.Deterministic('x_to_fit', xtensor[in_window])
    y_to_fit = pm.Deterministic('y_to_fit', ytensor[in_window])
    yerr_to_fit = pm.Deterministic('yerr_to_fit', yerrtensor[in_window])

    y_form = pm.Deterministic('y_form', A * x_to_fit + B)
    y_obs = pm.Normal('y_obs', mu=y_form, sigma=yerr_to_fit, observed=y_to_fit)

    trace = pm.sample(draws=1000, tune=1000)

This completes on the sample data, but it generates warnings and doesn't give good fit results; the fit bounds seem to run away.

On my real data it fails with the error “could not broadcast input array from shape (1840) into shape (2720)”, presumably because moving the fit region changes the number of points being fitted, which is assumed to be constant. I suspect this doesn't show up in the example because the bounds never move far enough for the selected data to change shape.

I think I’m doing this wrong, but can’t work out how to do it properly. Can anyone offer any suggestions, please?

Thanks!

I think you need to model it so that the points outside the bounds are drawn from a different distribution; they can't simply not contribute to the likelihood.

This is hacky but hopefully it can point you in the right direction. I modeled points outside the bounds as coming from a super wide normal.
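Roughly what I mean, as a minimal sketch that reuses your shared variables, priors, and data from above (wide_sigma is just a placeholder, and centring the out-of-window points on the same line is an arbitrary choice):

import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    A = pm.Normal('A', mu=5, sigma=2)
    B = pm.Normal('B', mu=5, sigma=2)

    x_lower_bound = xmin + pm.Gamma('x_lower_bound', mu=xspan / 10, sigma=xspan / 2)
    x_upper_bound = xmax - pm.Gamma('x_upper_bound', mu=xspan / 10, sigma=xspan / 2)

    # 1 where a point falls inside the window, 0 outside; the shape of the
    # observed data never changes
    inside = tt.lt(x_lower_bound, xtensor) & tt.lt(xtensor, x_upper_bound)

    # Inside the window each point follows the linear form with its measured
    # error; outside it gets a deliberately huge sigma, so it still has a
    # likelihood but barely constrains anything
    wide_sigma = 100.0  # placeholder; only needs to dwarf the data spread
    mu = A * xtensor + B
    sigma = tt.switch(inside, yerrtensor, wide_sigma)

    y_obs = pm.Normal('y_obs', mu=mu, sigma=sigma, observed=ydata)

    trace = pm.sample(draws=1000, tune=1000)

Because every point always contributes to the likelihood, the observed array keeps a fixed shape, which should also avoid the broadcasting error you saw on your real data.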

You might also check out https://docs.pymc.io/notebooks/dependent_density_regression.html

That’s incredibly helpful, thank you!

Since my data are (very) non-linear (and have a roughly constant relative error), I think I need to adjust the sigma so that it scales with the data, so as not to bias the fit towards matching the larger values more closely. In the definition of the likelihood, multiplying sigma by the same expression used for mu may work for this; I'll see what happens.
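Concretely, I'm imagining replacing the sigma line in the sketch above with something like this (wide_factor is a made-up number I'd still have to tune):

    # wide sigma proportional to the model value, so the out-of-window
    # points are down-weighted by a constant relative (not absolute) amount
    sigma = tt.switch(inside, yerrtensor, wide_factor * tt.abs_(mu))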

Thanks again!